The report defines the Big Data phenomenon. It describes some of the technologies that can be used to derive insights from high-volume, high-velocity, and high-variety government data assets. But it doesn’t provide leadership where leadership is needed.
Crucially, Demystifying Big Data ignores the pressing need for the U.S. government to adopt standardized models, interchange formats, and identifiers throughout its data portfolio. In fact, it contains no discussion of data architecture at all. Instead, it advises federal agencies to “start with a specific and narrowly defined business or mission requirement, versus a plan to deploy a new and universal technical platform to support perceived future requirements” (p. 7). It also recommends a data “notional flow” – “Understand, Cleanse, Transform and Exploit” – that presumes a future without standardized reporting (p. 26).
Moreover, Demystifying Big Data recommends a five-step project process – “Define, Assess, Plan, Execute, Review” (p. 29) – that keeps data’s uses tightly bound to original designs. In reality, the promise of Big Data lies in the fact that vast, fast, varied data assets are becoming accessible for analyses for which they weren’t originally designed.
To be sure, neither the Data Transparency Coalition nor anyone else recommends a universal, utopian architecture for government data. But all data analysis relies on structure – whether imposed ad hoc at the time of the analysis or pre-existing. And government operations, in the U.S. as elsewhere, are rife with concepts, forms, and relationships that could be structured and should be standardized but aren’t. Treasury and OMB use incompatible means of identifying federal agencies. Dozens of regulators use separate, non-interoperable codes for regulated entities, contracts, people, locations, and events. Of the SEC’s six hundred reporting forms, two have been partially converted into XBRL and three into XML; the rest are untagged text. The government’s failure to adopt consistent vocabularies for regulatory text results in needless ambiguity. The list goes on. Even where standard identifiers and formats have been imposed, they frequently are not built on any underlying data model – which means future changes will be unnecessarily traumatic.
The government’s adoption of standardized identifiers and formats, supported by common data models, would allow many Big Data projects to skip right over TechAmerica’s “Understand” and “Cleanse” steps, or, at any rate, dramatically reduce the time and money those steps require. Standardization – not universal, but incremental! – would vastly improve the U.S. government’s Big Data capabilities. And it’s already happening: Treasury’s Office of Financial Research is working to implement one standard identifier for regulated entities, the White House has promised to pursue another for federal awards, and Congress is poised to require a data architecture for federal spending that could incorporate both. Our Coalition supports data standardization, in federal spending and in other areas, in part because it’ll unleash the power of Big Data.
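To make the point concrete, here is a minimal sketch in Python of what incompatible identifiers cost an analyst. Everything in it is hypothetical: the agency codes, the field names (`agency_code`, `agency_id`, `outlays`, `budget`), the figures, and the two helper functions are invented for illustration and do not reflect any real Treasury or OMB coding scheme.

```python
# All codes, field names, and amounts below are invented for illustration.
# They do not correspond to any real Treasury or OMB identifier scheme.

# Two sources that identify the same agencies with incompatible codes.
treasury_rows = [
    {"agency_code": "T-020", "outlays": 100},
    {"agency_code": "T-075", "outlays": 250},
]
omb_rows = [
    {"agency_code": "OMB-15", "budget": 120},
    {"agency_code": "OMB-09", "budget": 240},
]

# Without a shared identifier, someone has to research, build, and maintain
# a crosswalk before any join is possible. This is the "Understand" and
# "Cleanse" work.
crosswalk = {"T-020": "OMB-15", "T-075": "OMB-09"}


def join_with_crosswalk(left, right, mapping):
    """Join two sources whose agency codes differ, via a hand-built mapping."""
    right_by_code = {row["agency_code"]: row for row in right}
    joined = []
    for row in left:
        other = right_by_code.get(mapping.get(row["agency_code"]))
        if other is not None:  # unmapped rows are silently lost
            joined.append({"outlays": row["outlays"], "budget": other["budget"]})
    return joined


def join_on_shared_id(left, right, key="agency_id"):
    """Join two sources that already share a standard identifier."""
    right_by_id = {row[key]: row for row in right}
    return [{**row, **right_by_id[row[key]]} for row in left if row[key] in right_by_id]


# The same data, had both sources adopted one (hypothetical) standard identifier.
std_treasury = [{"agency_id": "A1", "outlays": 100}, {"agency_id": "A2", "outlays": 250}]
std_omb = [{"agency_id": "A1", "budget": 120}, {"agency_id": "A2", "budget": 240}]

print(join_with_crosswalk(treasury_rows, omb_rows, crosswalk))
print(join_on_shared_id(std_treasury, std_omb))
```

In the first case the crosswalk is an extra artifact that must be kept correct as either coding scheme changes; in the second, the join needs no knowledge beyond the shared key.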
The high-tech industry must take the lead in advocating government data standardization, both because it has a civic responsibility and because it has a business opportunity. Unfortunately, Demystifying Big Data doesn’t provide that leadership.