No APIs Please: Just Publish the Data

It may seem counter-intuitive, but Recovery.gov will be more open if it avoids creating APIs for making data available. Please just publish the data, preferably in XML with XSL, XHTML or XHTML + RDFa, directly to the web at easily discovered and/or patterned URLs. The web already has a straightforward protocol: provide a URL and receive a document. The resulting sets of documents can be treated as an object database. Unlike SQL/relational databases, the sets of documents are not private, locked up in a binary system that may only be addressed by the data owners. Sets of similarly structured XML documents at predictable and/or discoverable URLs can be queried by anyone on the web, in real time, in sophisticated ways that the publisher may not have envisioned when creating an API to reveal data locked in an SQL database.
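As a rough illustration of treating patterned URLs as a queryable document set, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the example.gov URL pattern, the award/state/amount element names, and the in-memory PUBLISHED dictionary that stands in for real HTTP fetches.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for documents published at patterned URLs.
# A real client would fetch each URL over HTTP instead of reading this dict.
PUBLISHED = {
    "https://example.gov/awards/2009/0001.xml":
        "<award><state>OH</state><amount>1200000</amount></award>",
    "https://example.gov/awards/2009/0002.xml":
        "<award><state>TX</state><amount>450000</amount></award>",
    "https://example.gov/awards/2009/0003.xml":
        "<award><state>OH</state><amount>300000</amount></award>",
}


def award_urls(year, count):
    """Generate URLs from the assumed pattern /awards/<year>/<NNNN>.xml."""
    return [
        f"https://example.gov/awards/{year}/{n:04d}.xml"
        for n in range(1, count + 1)
    ]


def total_by_state(urls, fetch=PUBLISHED.__getitem__):
    """Sum award amounts per state across a set of documents.

    The publisher never had to anticipate this query; it only had to
    publish similarly structured documents at predictable URLs.
    """
    totals = {}
    for url in urls:
        doc = ET.fromstring(fetch(url))
        state = doc.findtext("state")
        amount = int(doc.findtext("amount"))
        totals[state] = totals.get(state, 0) + amount
    return totals


totals = total_by_state(award_urls(2009, 3))
```

Swapping the fetch argument for a function that retrieves the URL over the network would run the same query against live documents.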

Another aspect that is vital to making the information more accessible is that data published in XML can be made human readable by using XSL and/or CSS. Data formats that are not human readable in a web browser force the data for software and the data for humans to be segregated unnecessarily. This may seem esoteric, but it is important for trust and for making documentation of the data the norm. The data also becomes citable as a URL that people can actually understand. There is currently a web site, XMLDatasets, which allows XML documents that may not be human readable to be viewed in a readable form (despite being a text document, XML is code and I do not consider it to be human readable except when shown in a browser or viewer). XMLDatasets even shows XML Schemas in an approachable manner.

APIs do provide a big advantage: they allow simple coding in a way that might seem less obvious if the data were just posted on a web site. For this reason I have been developing a way to standardize the method of querying and extracting atomic data from large sets of data. One piece of that standardization, which I call a Repository Schema, maps out all the data in a set or repository of documents. The first part of the Repository Schema is for discovering all of the objects/documents in a repository through URL discovery. The second part maps out all of the usable parts of the document. The third part is a listing of transformations, such as between XML languages or to more printable formats. The fourth is a listing of external index files that allow fast access without pulling down and indexing all of the repository documents.
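To make the four parts concrete, here is one way they might be modeled in Python. This is only a sketch and not the actual Repository Schema format. The class, its field names, the example.gov URLs and the project/title/funding element names are all hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class RepositorySchema:
    # Part 1: URL discovery, modeled here as a pattern plus a range of ids.
    url_pattern: str
    id_range: range
    # Part 2: usable parts of each document (name -> ElementTree path).
    parts: dict = field(default_factory=dict)
    # Part 3: available transformations (name -> stylesheet URL).
    transformations: dict = field(default_factory=dict)
    # Part 4: external index files (name -> index URL).
    indexes: dict = field(default_factory=dict)

    def document_urls(self):
        """List every document URL covered by the schema."""
        return [self.url_pattern.format(id=i) for i in self.id_range]

    def extract(self, xml_text):
        """Pull the mapped parts out of one document."""
        root = ET.fromstring(xml_text)
        return {name: root.findtext(path) for name, path in self.parts.items()}


# Hypothetical schema for an imagined set of project documents.
schema = RepositorySchema(
    url_pattern="https://example.gov/projects/{id}.xml",
    id_range=range(1, 4),
    parts={"title": "title", "amount": "funding/amount"},
    transformations={"print": "https://example.gov/xsl/print.xsl"},
    indexes={"by-state": "https://example.gov/index/by-state.xml"},
)

sample = (
    "<project><title>Bridge repair</title>"
    "<funding><amount>5000</amount></funding></project>"
)
record = schema.extract(sample)
```

Because the schema is just data describing documents that are already public, anyone could write one without the publisher's involvement.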

Unlike APIs, anyone could develop their own Repository Schemas for any set of web documents. There could be competing or overlapping Repository Schemas, such as one for all country pages on Wikipedia with population breakouts, or another for Wikipedia pages of all places that show photos. If standard tools built on standards like XQuery were created, there could be full and open auditing of the data from the source to a resulting mashup. The audits could give people confidence in data pulled together into mashups from government documents.

Note: there is also another standard, called Rosetta Stone, that is complementary to Repository Schema and allows related repositories to be linked together.

Why is it important?

If there were no resource limits, there would be no reason not to do both: publish data in human readable formats that are also machine processable, and provide APIs as well. But once information is published at permanent URLs with all of the data included, all of the work is really done (and many government agencies just publish raw data, which is much easier and better than creating an API). Making all of the data human readable will have lots of unforeseen benefits. Also, as certain data structures become standardized, such as through RDFa and Microformats, it will be possible to create lightweight Repository Schemas that pull documents from across multiple publishers. One example might be for web content management tools to standardize now-informal microformats, like finding navigation in @class="nav" or page content in @id="content". Creating APIs for open data (as opposed to commercial sources that purposely want to restrict access) will be a costly and, I think, unnecessary step.
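The @class="nav" and @id="content" conventions mentioned above can already be used by very small tools. The following Python sketch assumes a well-formed XHTML page. The sample page and the extract_page helper are made up for illustration, and the attribute match is exact, so an element with class="nav main" would not be found.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page following the informal nav/content conventions.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul class="nav">
      <li><a href="/states/oh.html">Ohio</a></li>
      <li><a href="/states/tx.html">Texas</a></li>
    </ul>
    <div id="content"><p>Funds awarded by state.</p></div>
  </body>
</html>"""


def extract_page(xhtml_text):
    """Return the navigation links and main text of one XHTML page."""
    root = ET.fromstring(xhtml_text)
    # Any element whose class attribute is exactly "nav".
    nav = root.find(".//*[@class='nav']")
    # Any element whose id attribute is "content".
    content = root.find(".//*[@id='content']")
    links = []
    if nav is not None:
        links = [a.get("href") for a in nav.findall(".//x:a", XHTML_NS)]
    text = ""
    if content is not None:
        text = "".join(content.itertext()).strip()
    return {"nav": links, "content": text}


page = extract_page(PAGE)
```

If many publishers followed the same convention, this one function could walk documents across all of them.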