Data Science Central has an interesting article on “data janitor work”: the observation that the biggest hurdle in large-scale data analysis is wrangling the data into a usable form. It applies directly to text analysis in the “Million Book Library.”
Data scientist[s] spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the [New York] Times article estimates 50%-80% of a data scientist[’s] time is spent on data preparation. We agree. . . .
Before you start your project, define what data you need. This seems obvious, but in the world of big data, we hear a lot of people say, “just throw it all in”. If you ingest low quality data that is not salient to your business objectives, it will add noise to your results.
The more noisy the data, the more difficult it will be to see the important trends. You must have a defined strategy for the data sources you need and the particular subset of that data, which is relevant for the questions you want to ask.
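For a text-analysis project, the advice above might translate into something like the following minimal sketch. Everything in it is an assumption made for illustration: the `BookRecord` fields, the `select_relevant` helper, and the filter criteria (language, date range, minimum length) are hypothetical and do not come from the article or from any real catalog format.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class BookRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    title: str
    language: str
    year: Optional[int]
    text: str


def select_relevant(
    records: Iterable[BookRecord],
    language: str,
    first_year: int,
    last_year: int,
    min_words: int = 100,
) -> Iterator[BookRecord]:
    """Yield only the records that match the question being asked."""
    for rec in records:
        # Skip texts in languages the analysis does not cover.
        if rec.language != language:
            continue
        # Skip records with no date or a date outside the period of interest.
        if rec.year is None or not (first_year <= rec.year <= last_year):
            continue
        # Skip texts too short to analyze (e.g. failed scans).
        if len(rec.text.split()) < min_words:
            continue
        yield rec


# Toy catalog standing in for a much larger collection.
catalog = [
    BookRecord("A", "en", 1850, "one two three four"),
    BookRecord("B", "fr", 1850, "un deux trois quatre"),
    BookRecord("C", "en", None, "one two three four"),
    BookRecord("D", "en", 1850, "ocr"),
    BookRecord("E", "en", 1950, "one two three four"),
]

subset = list(select_relevant(catalog, "en", 1800, 1900, min_words=3))
print([rec.title for rec in subset])
```

The point of the sketch is only that the selection criteria are written down before any analysis runs, rather than "throwing it all in" and filtering afterward.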
Here’s an interesting think piece from Frank Pasquale in Aeon on the nature and role of data in society today.
Regulators want to avoid the irrational or subconscious biases of human decision-makers, but of course human decision-makers devised the algorithms, inflected the data, and influenced its analysis. No ‘code layer’ can create a ‘plug and play’ level playing field. Policy, human judgment, and law will always be needed. Algorithms will never offer an escape from society. . . .
An inference . . . may not be worth much on its own. But once people are so identified, it could easily be combined and recombined with other lists – say, of plus-sized shoppers, or frequent buyers of fast food – that solidify the inference. A new algorithm from Facebook instantly classifies individuals in photographs based on body type or posture. The holy grail of algorithmic reputation is the most complete possible database of each individual, unifying credit, telecom, location, retail and dozens of other data streams into a digital doppelganger.
However certain they may be about our height, or weight, or health status, it suits data gatherers to keep the classifications murky. A person could, in principle, launch a defamation lawsuit against a data broker that falsely asserted the individual concerned was diabetic. But if the broker instead chooses a fuzzier classification, such as ‘member of a diabetic-concerned household’, it looks a lot more like an opinion than a fact to courts. Opinions are much harder to prove defamatory – how might you demonstrate beyond a doubt that your household is not in some way ‘diabetic-concerned’? But the softer classification may lead to exactly the same disadvantageous outcomes as the harder, more factual one.