Cross posted at mathbabe.
I’m a bit late to this conversation, but I was reminded by Cathy’s post over the weekend on open data – which most certainly is not a panacea – of my own experience a couple of years ago with a group that is trying hard to do the right thing with open data. The UN funded a new initiative in 2009 called Global Pulse, with a mandate to explore ways of using Big Data for the rapid identification of emerging crises as well as for crafting more effective development policy in general. Their working hypothesis at its most simple is that the digital traces individuals leave in their electronic life – whether through purchases, mobile phone activity, social media or other sources – can reveal emergent patterns that can help target policy responses. The group’s website is worth a visit for anyone interested in non-commercial applications of data science – they are absolutely the good guys here, doing the kind of work that embodies the social welfare promise of Big Data.
With that said, I think some observations about their experience in developing their research projects may shed some light on one of Cathy’s two main points from her post:
1.) How “open” is open data when there are significant differences in both the ability to access the data, and more important, in the ability to analyze it?
2.) How can we build in appropriate safeguards rather than just focusing on the benefits and doing general hand-waving about the risks?
I’ll focus on Cathy’s first question here since the second gets into areas beyond my pay grade.
The Global Pulse approach to both sourcing and data analytics has been to rely heavily on partnerships with academia and the private sector. To Cathy’s point above, this is true of both closed data projects (such as those that rely on mobile phone data) as well as open data projects (those that rely on blog posts, news sites and other sources). To take one example, the group partnered with two firms in Cambridge to build a real-time indicator of bread prices in Latin America in order. The data in this case was open, while the web-scraping analytics (generally using grocery-story website prices) were developed and controlled by the vendors. As someone who is very interested in food prices, I found their work fascinating. But I also found it unsettling that the only way to make sense of this open data – to turn it into information, in other words – was through the good will of a private company.
The same pattern of open data and closed analytics characterized another project, which tracked Twitter in Indonesia for signals of social distress around food, fuel prices, health and other issues. The project used publicly available Twitter data, so it was open to that extent, though the sheer volume of data and the analytical challenges of teasing meaningful patterns out of it called for a powerful engine. As we all know, web-based consumer analytics are far ahead of the rest of the world in terms of this kind of work. And that was precisely where Global Pulse rationally turned – to a company that has generally focused on analyzing social media on behalf of advertisers.
Does this make them evil? Of course not – as I said above, Global Pulse are the good guys here. My point is not about the nature of their work but about its fragility. The group’s Director framed their approach this way in a recent blog post:
We are asking companies to consider a new kind of CSR – call it “data philanthropy.” Join us in our efforts by making anonymized data sets available for analysis, by underwriting technology and research projects, or by funding our ongoing efforts in Pulse Labs. The same technologies, tools and analysis that power companies’ efforts to refine the products they sell, could also help make sure their customers are continuing to improve their social and economic wellbeing. We are asking governments to support our efforts because data analytics can help the United Nations become more agile in understanding the needs of and supporting the most vulnerable populations around the globe, which in terms boosts the global economy, benefiting people everywhere.
What happens when corporate donors are no longer willing to be data philanthropists? And a question for Cathy – how can we ensure that these new Data Science programs like the one at Columbia don’t end up just feeding people into consumer analytics firms, in the same way that math and econ programs ended up feeding people into Wall Street jobs?
I don’t have any answers here, and would be skeptical of anyone who claimed to. But the answers to these questions will likely define a lot of the gap between the promise of open data and whatever it winds up becoming.