August 31, 2016
A data-cleaning tool for building better prediction models
Researchers develop interactive system for cleaning massive data sets
Columbia University School of Engineering and Applied Science
IMAGE: Tested on a dirty, real-world data set, ActiveClean (in red) was able to clean just 5,000 records to bring the researchers' prediction model to a 90 percent accuracy level.
Credit: Eugene Wu
Big data sets are full of dirty data: outliers, typos and missing values that can distort models and lead to wrong conclusions and bad decisions, whether in healthcare or finance. With so much at stake, data cleaning should be easier.
That's the inspiration for software developed by computer scientists at Columbia University and the University of California, Berkeley, that hands much of the dirty work over to machines. Called ActiveClean, the system analyzes a user's prediction model to decide which mistakes to fix first, and it updates the model as it works. With each pass, users see their model improve.
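The loop described above, cleaning the records that matter most to the model and refitting after each pass, can be sketched in a few lines of Python. This is a toy illustration, not the ActiveClean code: the one-parameter model, the choice to rank records by gradient magnitude, and every name here (`gradient`, `fit`, `clean_in_priority_order`, `clean_fn`) are assumptions made for the example, and the "cleaning" step is a stand-in for a person or script correcting a record.

```python
def gradient(w, x, y):
    # Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x

def fit(w, records, steps=200, lr=0.05):
    # Plain batch gradient descent on the one-parameter model y = w * x.
    for _ in range(steps):
        g = sum(gradient(w, x, y) for x, y in records) / len(records)
        w -= lr * g
    return w

def clean_in_priority_order(dirty, clean_fn, batch=5, rounds=4):
    # Assumption: rank uncleaned records by how strongly they pull on the
    # current model, clean the top few, then refit before the next pass.
    records = list(dirty)
    cleaned = set()
    w = fit(0.0, records)
    history = [w]
    for _ in range(rounds):
        candidates = [i for i in range(len(records)) if i not in cleaned]
        candidates.sort(key=lambda i: abs(gradient(w, *records[i])), reverse=True)
        for i in candidates[:batch]:
            records[i] = clean_fn(i)  # hypothetical stand-in for a manual fix
            cleaned.add(i)
        w = fit(w, records)  # warm start from the previous model
        history.append(w)
    return history

# Synthetic data: the true relation is y = 2x, and every fourth record
# has its value wiped to 0.0 to mimic a missing value.
xs = [(i + 1) / 10 for i in range(40)]
truth = [(x, 2.0 * x) for x in xs]
dirty = [(x, 0.0) if i % 4 == 0 else (x, y) for i, (x, y) in enumerate(truth)]

history = clean_in_priority_order(dirty, clean_fn=lambda i: truth[i])
for step, w in enumerate(history):
    print(f"after pass {step}: w = {w:.4f}")
```

In this sketch the ranking is only a heuristic, so a pass may spend some effort on records that were already correct; the estimate still moves toward the true slope as the most influential errors are fixed first.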
"Dirty data is pervasive and prevents people from doing useful things," said Eugene Wu, a computer science professor at Columbia Engineering and a member of the Data Science Institute. "This is our first step towards automating the data-cleaning process."
The team will present its research on Sept. 7 in New Delhi, at the 2016 conference on Very Large Data Bases. Wu helped develop ActiveClean as a postdoctoral researcher at Berkeley's AMPLab and has continued this work at Columbia.