Tidy Data in R and Python

I recently came across two very useful articles on data cleaning: Hadley Wickham’s “Tidy Data” (Journal of Statistical Software 59 [2014]) and Jean Francois Puget’s blog post “Tidy Data in Python” (IBM developerWorks).

Wickham extends the concept of normalization to allow for easy analysis in an in-memory system like R:

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data (4).

He then discusses five common problems with real-world data and defines three methods for tidying it: “melting, string splitting, and casting” (5). Several examples of messy and tidy datasets follow, along with a case study in R.

Puget’s post, as is evident from the title, expands on Wickham’s article by giving the Python equivalents of Wickham’s R code. Pandas, as it turns out, has a melt() function (corresponding to gather() in Wickham’s tidyr or melt() in his reshape2), which forms a concise basis for data tidying within Python.
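To make the three tidying operations concrete, here is a minimal sketch in pandas. It is not taken from either article: the table layouts, column names, and values below are made up for illustration, loosely in the spirit of the kinds of examples Wickham discusses (variables hidden in column headers, and variables stored in rows).

```python
import pandas as pd

# Hypothetical messy table: the column headers encode two variables
# (sex and age group) instead of naming a single variable each.
messy = pd.DataFrame({
    "country": ["AD", "AE"],
    "m014": [0, 2],
    "f014": [1, 3],
})

# Melting: turn the header columns into rows, one row per
# (country, original column) pair.
molten = messy.melt(id_vars="country", var_name="column", value_name="cases")

# String splitting: separate the combined header into its two variables.
molten["sex"] = molten["column"].str[0]
molten["age"] = molten["column"].str[1:]
tidy = molten.drop(columns="column")[["country", "sex", "age", "cases"]]

# Hypothetical long table where variable names are stored in a column:
# each observation (a date) is spread over several rows.
long = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# Casting: pivot the "element" values back out into columns, so that
# each variable forms a column and each date forms a single row.
weather = long.pivot(index="date", columns="element", values="value").reset_index()

print(tidy)
print(weather)
```

In this sketch, melt() handles the first step, the .str accessor handles the string splitting, and pivot() stands in for casting; other pandas functions could serve equally well for the last two steps.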
