I recently came across two very useful articles on data cleaning: Hadley Wickham’s “Tidy Data” (Journal of Statistical Software 59) and Jean-Francois Puget’s blog post “Tidy Data in Python” (IBM developerWorks).
Wickham extends the concept of normalization to allow for easy analysis in an in-memory system like R:
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data (4).
He then discusses five common problems with real-world data, and defines three methods for tidying it: “melting, string splitting, and casting” (5). Several examples of messy and tidy datasets, as well as a case study in R, follow.
Puget’s post, as is evident from the title, expands on Wickham’s article, giving the Python equivalents of Wickham’s R code. Pandas, as it turns out, has a `melt()` function, corresponding to `gather()` in Wickham’s tidyr or `melt()` in his reshape2, which forms a concise basis for data tidying within Python.
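As a minimal sketch of what this looks like in practice, here is pandas’ `melt()` applied to a small hypothetical dataset in the “column headers are values, not variable names” messy form Wickham describes (the data below is invented for illustration, not taken from either article):

```python
import pandas as pd

# Messy: "treatment_a" and "treatment_b" are values of a variable
# (treatment), not variables themselves.
messy = pd.DataFrame({
    "person": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# melt() gathers the treatment columns into a (variable, value) pair,
# so each row becomes a single observation: one person under one treatment.
tidy = pd.melt(
    messy,
    id_vars=["person"],
    var_name="treatment",
    value_name="result",
)

print(tidy)
```

After melting, the frame has one column per variable (`person`, `treatment`, `result`) and one row per observation, satisfying Wickham’s first two rules of tidy data.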