I recently came across two very useful articles on data cleaning: Hadley Wickham's "Tidy Data" (Journal of Statistical Software 59) and Jean-François Puget's blog post "Tidy Data in Python" (IBM developerWorks).
Wickham extends the concept of normalization to allow for easy analysis in an in-memory system like R:
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data (4).
He then discusses five common problems with real-world data and defines three methods for tidying it: "melting, string splitting, and casting" (5). Several examples of messy and tidy datasets, as well as a case study in R, follow.
Puget's post, as is evident from the title, expands on Wickham's article, giving the Python equivalents of Wickham's R code. Pandas, as it turns out, has a melt() function (corresponding to gather() in Wickham's tidyr or melt() in his reshape2), which forms a concise basis for data tidying within Python.
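To make the correspondence concrete, here is a minimal sketch of the three tidying operations in pandas. The data is invented for illustration (a small treatment table in the style of Wickham's examples); it is not taken from Puget's post.

```python
import pandas as pd

# An illustrative "messy" table (invented data): one column per
# treatment, so each row mixes several observations.
messy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Melting: gather the value columns into variable/value pairs,
# so each row holds exactly one observation.
tidy = messy.melt(id_vars="name", var_name="treatment", value_name="result")

# String splitting: the variable column encodes two pieces of
# information ("treatment" plus a label); keep only the label.
tidy["treatment"] = tidy["treatment"].str.split("_").str[1]

# Casting: pivot back to a wide layout when a tool requires it.
wide = tidy.pivot(index="name", columns="treatment", values="result")
```

The melt() call here plays the role of gather() in tidyr; pivot() is the casting step, analogous to dcast() in reshape2.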
Over at Data Science Central is this interesting article on “data janitor work”: the fact that the biggest hurdle to large-scale data analysis is wrangling the data into a usable form. It is, of course, directly applicable to doing text analysis in the “Million Book Library.”
Data scientist[s] spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the [New York] Times article estimates 50%-80% of a data scientists’ time is spent on data preparation. We agree. . . .
Before you start your project, define what data you need. This seems obvious, but in the world of big data, we hear a lot of people say, “just throw it all in”. If you ingest low quality data that is not salient to your business objectives, it will add noise to your results.
The more noisy the data, the more difficult it will be to see the important trends. You must have a defined strategy for the data sources you need and the particular subset of that data, which is relevant for the questions you want to ask.