Of course, I’m well over halfway into writing my Big Important Thinkpiece about Capital in the 21st Century and the FT decides to throw a grenade. Smarter and more knowledgeable people than I have gone back and forth on the specific issues, and my sense aligns with the emerging consensus: there are genuine problems with some of the data, but the FT’s criticisms were at least somewhat overblown, and there is not nearly enough there to overturn the central empirical conclusions of Piketty’s work.

What strikes me most about this episode is just how unbelievably hard true data and methodological transparency is. The spreadsheet vs. statistical programming platform debate seems to me to be a red herring – at least as the paradigm stands, each has its uses, limitations, and common pitfalls, and for the kind of work Piketty was doing, which relied not on complex statistical methods but mostly on careful data aggregation and cleaning, a spreadsheet is probably as fine a tool as any.

The bigger issue is that current standards for data transparency, while certainly well advanced by the power of the internet to make raw data freely available, are still sorely lacking. The real problem is that published data and code, while useful, are still the tip of a much larger methodological iceberg whose base, like a pyramid (because I mix metaphors like The Avalanches mix phat beats), extends much deeper and wider than the final work. If a published paper is the apex, the final dataset is still just a relatively thin layer beneath it, when what we care about is the base.

To operationalize this a little, let me pick an example that is both a very good one and one I happen to know quite well, since I had to replicate and extend the paper for my Econometrics course. In 2008, Daron Acemoglu, Simon Johnson, James A. Robinson, and Pierre Yared published a paper entitled “Income and Democracy” in the American Economic Review, in which they claimed to have demonstrated empirically that there is no detectable causal relationship between levels of national income and democratic political development.

The paper is linked; the data, which are available at the AER’s website, are also attached to this post. I encourage you to download them and take a look for yourself, even if you’re far from an expert or even afraid of numbers altogether. You’ll notice, first and foremost, that it’s a spreadsheet. An Excel spreadsheet. It’s full of numbers. The sheets also contain some text boxes, and those text boxes contain Stata code. If you copy and paste the numbers into Stata, then copy and paste the corresponding code, then run it, it will produce a bunch of results. Those results match the results published in the corresponding table in the paper. Congratulations! You, like me, have replicated a published work of complex empirical macroeconomics!
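To make concrete just how mechanical that step is, here is a minimal sketch of the same exercise in Python rather than Stata. The file name, sheet name, and column names (fhpolrigaug for the augmented Freedom House democracy score, lrgdpch for log real GDP per capita) are my assumptions about how the posted workbook is laid out, and the specification is only a simplified stand-in for the regressions in the paper’s tables:

```python
# A minimal sketch of the "purely algorithmic" replication step described above.
# File name, sheet name, and column names are assumptions about the posted workbook.
import pandas as pd
import statsmodels.formula.api as smf

# Load the 5-year panel from the published Excel workbook.
df = pd.read_excel("Income_and_Democracy_Data_AER.xls", sheet_name="5 Year Panel")

# Build one-period (5-year) lags within each country.
df = df.sort_values(["country", "year"])
df["dem_lag"] = df.groupby("country")["fhpolrigaug"].shift(1)    # lagged democracy score
df["loginc_lag"] = df.groupby("country")["lrgdpch"].shift(1)     # lagged log GDP per capita

est = df.dropna(subset=["fhpolrigaug", "dem_lag", "loginc_lag"])

# Pooled OLS with year effects: the naive cross-country income-democracy association.
pooled = smf.ols("fhpolrigaug ~ dem_lag + loginc_lag + C(year)", data=est).fit()

# Adding country fixed effects: the kind of specification under which the income
# effect disappears in the paper. Compare the coefficients with the published table.
fe = smf.ols("fhpolrigaug ~ dem_lag + loginc_lag + C(year) + C(country)", data=est).fit()

print(pooled.params["loginc_lag"], fe.params["loginc_lag"])
```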

Except, of course, you haven’t done very much at all. You’ve just replicated a series of purely algorithmic functions – you’re a Chinese room of sorts (as much as I loathe that metaphor). Most importantly, you haven’t replicated the process that led to the production of this spreadsheet full of numbers. In this instance, there are 16 different variables, each drawn from a different source. To truly “replicate” the work done by AJR&Y you would have to go to each of those sources and cross-check each of the datapoints – of which there are many, because the unit of analysis is the country-year; their central panel alone, the 5-Year Panel, has 36,603 datapoints over 2,321 different country-years. Many of these datapoints come from other papers – do you replicate those too? And many of them required some kind of transformation between their source and their final form in the paper – that also has to be replicated. Additionally, two of those variables are wholly novel – the trade-weighted GDP index and its secondary sibling, the trade-weighted democracy index. Producing those datapoints requires not merely transcription but computation. If, in the end, you somehow superhumanly did all of this, what would you do if you found discrepancies? Is it author error? Author manipulation? Your own error? How would you know?
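To see why those two variables are a different beast, here is a rough sketch of the kind of computation involved. The weighting scheme below (time-invariant bilateral trade shares, own country excluded) is my guess at the general shape of such a construction, not the authors’ actual recipe, which is precisely the point: you cannot recover that recipe from the final spreadsheet alone.

```python
# A rough sketch of a trade-weighted index computation. The weighting scheme here
# is an assumption standing in for whatever AJR&Y actually did; the real
# construction lives outside the spreadsheet of final numbers.
import numpy as np
import pandas as pd

def trade_weighted_index(panel: pd.DataFrame, trade: pd.DataFrame) -> pd.DataFrame:
    """panel: countries (rows) x years (columns) of log income or democracy scores.
    trade: countries x countries matrix of bilateral trade volumes,
    with rows and columns labeled by the same country index as `panel`."""
    n = len(trade)
    shares = trade * (1.0 - np.eye(n))               # zero out own-country trade
    shares = shares.div(shares.sum(axis=1), axis=0)  # normalize each row to sum to 1
    return shares @ panel                            # weighted average over trading partners
```

Every cell in the result embeds choices about which trade data to use, over which years to average the weights, and how to handle missing partners, and none of those choices is visible in the final column of numbers.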

And none of this speaks to differences of methodological opinion – in assembling even seemingly simple data, judgment calls must be made about how they will be computed and represented. There are also higher-level judgment calls – what is a country? Which should be included and which excluded? In my own extension of their work, I added a new variable to their dataset, and much the same questions apply – were I simply to hand you my augmented data, you would have no way of knowing precisely how or why I computed that variable. And we haven’t even reached the most meaningful questions – most centrally, are these data or these statistical methods the right tools to answer the questions the authors raise? In this particular case, while there is much to admire about their work, I have my doubts – but even to begin addressing those doubts involves some throwing up of hands in the face of the sheer scale of their dataset. We are essentially forced to say “assume data methodology correct.”

Piketty’s data, in their own way, go well beyond a spreadsheet full of numbers – there were nested workbooks, with the final data actually being formulae that referred back to preceding sources of rawer data, which were transformed into the variables of Piketty’s interest. Piketty also included other raw data sources in his repository even when they were not linked by formula to the spreadsheets. This is extremely transparent, but it still leaves key questions unanswered – some “what” and “how” questions, but above all “why” questions. Why did you do it this way rather than that way? Why did you use this expression to transform this data into that variable? Why did you make this exception to that rule? Why did you prioritize different data points in different years? A dataset as large and complex as Piketty’s is going to have hundreds, even thousands, of individual instances where these questions can be raised, with no automatic system for providing answers other than having the author address them manually as they arise.

This is, of course, woefully inefficient, and it also creates perverse incentives to some degree. If Piketty had provided no transparency at all, well, that would have been what every author of every book did, going back centuries until very, very recently. In today’s context it might have seemed odd, but it is what it is. If he had been less transparent, say by releasing simpler spreadsheets with inert results rather than live formulae calling on a broader set of data, it would have been harder, not easier, for the FT to interrogate his methods and choices – that “why did he add 2 to that variable” question, for example, would have been invisible. The FT had the privilege of being able to do at least some deconstruction of Piketty’s data, as opposed to reconstruction, and the latter can leave the reasons for discrepancies substantially more ambiguous than the former. As it currently stands, a high level of attention on your research has the nasty side effect of drawing attention to transparent data but opaque methods, methods that, while in all likelihood at least as defensible as any other choice, are extremely hard under the status quo to present and defend systematically against aggressive inquisition.

The kicker, of course, is that Piketty’s data are coming under exceptional, extraordinary, above-and-beyond scrutiny – how many works that are merely “important” but not “seminal” never undergo even the most basic attempts at replication? How many papers are published in which nobody even plugs the data and the code back in and cross-checks the tables – never mind checking the methodology undergirding the underlying data? And these are problems that relate, at least somewhat, to publicly available and verifiable datasets, like national accounts and demographics. What about data on more obscure subjects with only a single, difficult-to-verify source? Or data produced directly by the researchers themselves?

In discussing this on Twitter, I advocated for the creation of a unified data platform that would not only let users merge and/or toggle between spreadsheet and statistical-programming GUIs and capabilities, but also keep a running, annotatable log of a user’s choices, not merely static input and output. Such a platform could produce a user-friendly log that could either be read in a common format (html, pdf, doc, epub, mobi) or uploaded in a packaged file with the data and code, so that anyone could replicate, from the very beginning, how a researcher took raw input and created a dataset, as well as how they analyzed that dataset to draw conclusions. I’m afraid that without such a system, or some other way of making not only data but start-to-finish methodologies transparent, accessible, and replicable, increased transparency may end up paradoxically eroding trust in social science (not to mention the hard sciences) rather than buttressing it.
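To gesture at what I mean, here is a toy sketch of such a log. The class, its methods, and the example entry are all invented for illustration; the point is only that every transformation would carry a machine-readable record of what was done alongside a human-readable note on why.

```python
# A toy sketch of an annotatable provenance log. Everything here is invented for
# illustration; it is the shape of the idea, not a real platform.
import json
from datetime import datetime, timezone

class ProvenanceLog:
    """Records each step a researcher takes in building a dataset: what was done,
    the code that did it, and a free-text explanation of why."""

    def __init__(self):
        self.steps = []

    def record(self, what: str, why: str, code: str = "") -> None:
        self.steps.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "what": what,
            "why": why,
            "code": code,
        })

    def export(self, path: str) -> None:
        """Write the start-to-finish log to a file anyone can read alongside the data."""
        with open(path, "w") as f:
            json.dump(self.steps, f, indent=2)

# A hypothetical entry of the kind that would answer a "why did he add 2?" question.
log = ProvenanceLog()
log.record(
    what="Adjusted one source figure upward by 2 before averaging",
    why="Hypothetical rationale would go here: the source series needed reconciling.",
    code="df.loc[row, 'value'] += 2",
)
log.export("provenance_log.json")
```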

Income and Democracy Data AER adjustment_DATA_SET_AER.v98.n3June2008.p808 AER Readme File
