Being a ‘Dumb’ Girl in Computer Science


Tidy Data in R and Python

I recently came across two very useful articles on data cleaning: Hadley Wickham’s “Tidy Data” (Journal of Statistical Software 59 [2014]) and Jean-François Puget’s blog post “Tidy Data in Python” (IBM developerWorks).

Wickham extends the concept of normalization to allow for easy analysis in an in-memory system like R:

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. Messy data is any other arrangement of the data (4).

He then discusses five common problems with real-world data and defines three methods for tidying it: “melting, string splitting, and casting” (5). Several examples of messy and tidy datasets, as well as a case study in R, follow.

Puget’s post, as its title suggests, expands on Wickham’s article by giving the Python equivalents of Wickham’s R code. Pandas, as it turns out, has a melt() function (corresponding to gather() in Wickham’s tidyr or melt() in his reshape2), which forms a concise basis for data tidying within Python.
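As a minimal sketch of what Puget describes (the data and column names here are invented for illustration, not taken from either article), pandas melt() reshapes a wide, “messy” table into one row per observation:

```python
import pandas as pd

# A "messy" table in Wickham's sense: one row per person,
# one column per treatment (hypothetical data).
messy = pd.DataFrame({
    "person": ["John", "Jane", "Mary"],
    "treatment_a": [1.0, 4.0, 6.0],
    "treatment_b": [2.0, 5.0, 7.0],
})

# melt() makes it tidy: one row per (person, treatment) observation.
tidy = pd.melt(messy, id_vars=["person"],
               var_name="treatment", value_name="result")
print(tidy)
```

The tidy frame has six rows (three people times two treatments) and three columns, one per variable.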


“Sorry if you cannot see anything in this big hairball”

Yannick Rochat has posted a critique of mandatory network visualization over at his blog. It’s worth a read.

Thanks to the facilitated access to network analysis tools and the growing interest in many disciplines towards studying the relations structuring datasets, networks have become ubiquitous objects in science, in newspapers, on tech book covers, all over the Web, and to illustrate anything big data-related (hand in hand with word clouds). Unfortunately, the resort to networks has reached a point where in a conference I heard a speaker say:

“Since this is mandatory, here is a network visualisation of these data. Sorry if you cannot see anything in this big hairball.” . . .

You would expect in a conference that everything presented has a purpose. Sadly, it seems that there is underlying pressure in scientific communities to create such horrors.

“Visualizing Networks, Part 1: A Critique” | Mathematics and Digital Humanities

(ht: Matthew Lincoln, on the Digital Humanities Slack)


“Maintaining the Duality of Closeness and Betweenness Centrality”

Ulrik Brandes, Stephen Borgatti, and Linton Freeman have an interesting paper in the latest volume of Social Networks: “Maintaining the Duality of Closeness and Betweenness Centrality.” Here’s the abstract:

Betweenness centrality is generally regarded as a measure of others’ dependence on a given node, and therefore as a measure of potential control. Closeness centrality is usually interpreted either as a measure of access efficiency or of independence from potential control by intermediaries. Betweenness and closeness are commonly assumed to be related for two reasons: first, because of their conceptual duality with respect to dependency, and second, because both are defined in terms of shortest paths. We show that the first of these ideas – the duality – is not only true in a general conceptual sense but also in precise mathematical terms. This becomes apparent when the two indices are expressed in terms of a shared dyadic dependency relation. We also show that the second idea – the shortest paths – is false because it is not preserved when the indices are generalized using the standard definition of shortest paths in valued graphs. This unveils that closeness-as-independence is in fact different from closeness-as-efficiency, and we propose a variant notion of distance that maintains the duality of closeness-as-independence with betweenness also on valued relations.

Ulrik Brandes, Stephen P. Borgatti, and Linton C. Freeman, “Maintaining the Duality of Closeness and Betweenness Centrality,” Social Networks 44 (2016): 153-159.
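For readers who want to play with the two indices themselves, here is a minimal sketch using NetworkX (my own toy example, not the authors’ code or data). On a simple path graph, the conceptual duality is easy to see: the middle node both lies on the most shortest paths and is, on average, closest to everyone else.

```python
import networkx as nx

# A path graph 0-1-2-3-4.
G = nx.path_graph(5)

betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# On this graph the two rankings agree: node 2 maximizes both indices.
print(betweenness)
print(closeness)
```

The interesting cases the paper treats are valued graphs, where (as the abstract notes) the shortest-path generalizations of the two measures come apart.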


New: List of English Personal Nouns

I’ve put together a list of English nouns that refer to people (or, less clunkily, “personal nouns”). I plan to use it alongside existing text analysis tools (like David Bamman’s excellent BookNLP) to detect unnamed characters in the Gospels and other ancient biography. It should also, hopefully, make automatic social network extraction easier and more accurate.

The list, along with the code and sources I used to generate it, is available on my GitHub.
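A sketch of the kind of use I have in mind (this is hypothetical illustration; the file name, format, and sample words are assumptions, not the actual code in the repository):

```python
# Hypothetical use of a personal-noun list to flag candidate
# unnamed characters in a tokenized text.
def load_personal_nouns(path="personal_nouns.txt"):
    """Read one personal noun per line into a lowercase set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def find_personal_nouns(tokens, nouns):
    """Return (index, token) pairs whose token is a personal noun."""
    return [(i, t) for i, t in enumerate(tokens) if t.lower() in nouns]

# Stand-in for the real list:
nouns = {"centurion", "widow", "servant"}
tokens = "And a centurion came to him".split()
print(find_personal_nouns(tokens, nouns))  # → [(2, 'centurion')]
```

In practice this would run over BookNLP’s token output rather than a raw split, so that part-of-speech tags can filter out non-nominal uses.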


“Preparing Data is Most of the Work”

Over at Data Science Central is this interesting article on “data janitor work”: the fact that the biggest hurdle to large-scale data analysis is wrangling the data into a usable form. It is, of course, directly applicable to doing text analysis in the “Million Book Library.”

Data scientist[s] spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the [New York] Times article estimates 50%–80% of a data scientist’s time is spent on data preparation. We agree. . . .

Before you start your project, define what data you need. This seems obvious, but in the world of big data, we hear a lot of people say, “just throw it all in”. If you ingest low quality data that is not salient to your business objectives, it will add noise to your results.

The more noisy the data, the more difficult it will be to see the important trends. You must have a defined strategy for the data sources you need and the particular subset of that data, which is relevant for the questions you want to ask.
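In pandas terms, the advice amounts to selecting the salient subset up front rather than “throwing it all in” (a sketch with invented column names, not from the article):

```python
import pandas as pd

# Hypothetical raw ingest: only some columns bear on the question asked.
raw = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "purchase_total": [10.0, None, 35.5, 20.0],
    "free_text_notes": ["ok", "??", "", "n/a"],  # noisy, not salient
})

# Define the relevant subset explicitly, and drop unusable rows.
salient = raw[["user_id", "purchase_total"]].dropna()
print(salient)
```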


Troubleshooting: Installing Gephi on Ubuntu

After spending the better part of two days wrestling with Gephi to get it installed on Ubuntu, I wanted to share the solution I found.

# Open the sources file in an editor:
sudo gedit /etc/apt/sources.list
# and add this line to the end (substitute the repository URL you need):
# deb <repository-url> precise main

# You can't just install gephi, because it depends on
# libgoogle-collections-java, which is no longer packaged with
# Ubuntu (as of Trusty, at least).

# Download the three libgoogle-collections-java source-package files
# (the .dsc, .orig.tar.gz, and .debian.tar.gz) to a temp folder,
# then navigate to that folder:
cd /path/to/folder

# Fetch the signing key for libgoogle-collections-java
# (substitute a keyserver, e.g. keyserver.ubuntu.com):
gpg --keyserver <keyserver> --recv-keys 974B3E96

# Extract the source package:
dpkg-source -x *.dsc

# Now try to build the package:
cd libgoogle-collections-java-1.0/
dpkg-buildpackage -us -uc

# This first attempt fails with a list of missing build dependencies.
# For me, they were:
#   maven-repo-helper maven-ant-helper cdbs default-java libjsr305-java
sudo apt-get install [missing dependencies]

# With the dependencies in place, build the package again:
dpkg-buildpackage -us -uc

# The build produces a .deb in the parent folder; install it:
cd ..
sudo dpkg -i libgoogle-collections-java_1.0-2_all.deb

# With the dependency satisfied, install gephi:
sudo apt-get update; sudo apt-get install gephi

# Done

Modified from this AskUbuntu answer.
