Let’s Move Beyond Open Data Portals (medium.com/civic-technology)
43 points by anemani10 on Jan 2, 2016 | 9 comments


To me, the article missed the point.

The central claim of the article is: "I actually think it’s time we abandon data portals altogether." I didn't find any compelling argument to support this claim. The author mentions some trends, but those trends do not suggest that source data is not valuable. Rather, they suggest that we need to build on top of data. No surprise here!

Sure, particular applications and services may appear to be more valuable than the raw data, but this does not suggest that the raw data is not valuable. Quite the opposite. I don't think any savvy person would suggest that a data portal is the end goal -- they are just the beginning.

I'd suggest something simple and probably non-controversial: data portals make sense to the extent that they provide value over a long time frame (perhaps 5 to 20 years -- or longer).

Here is a simple way to look at how data can add value.

1. Availability first (it has to exist).

2. Discoverability second (people and services need to find it).

3. Applications third (higher value can be extracted here).

(Note about me: I worked on a now-defunct data catalog a while back. I have no illusions about them, and I think many can be improved in key ways. Also, I know they require more maintenance than some would admit.)


As long as I can get the data, it's fine. What I'd worry about is apps becoming the only way an end user can access information about their environment. That would be a massive step backward, however much those applications, as part of a broader ecosystem, may enhance services.


Agreed. Webmasters and app developers are free to follow the trends and pursue whatever presentation and delivery mechanisms they like. But as a backup, the bulk data also needs to be mirrored on at least one FTP server and made available to the public in an open, accessible file format (e.g., CSV, SQL tables, XML, JSON).
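For what it's worth, producing such a bulk dump is close to trivial. A minimal sketch in Python, assuming a hypothetical SQLite database with an "incidents" table (all names here are made up):

    import csv
    import sqlite3

    # Minimal sketch: dump one table to CSV for a bulk mirror.
    # "incidents.db" and the "incidents" table are hypothetical names.
    conn = sqlite3.connect("incidents.db")
    cur = conn.execute("SELECT * FROM incidents")

    with open("incidents.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)                                 # data rows

    conn.close()

Cron that nightly and you have your backup mirror.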


"...A good example of this is Foursquare. Beforehand you’d do everything in one app. Now there’s Foursquare and Swarm. Facebook has Messenger as a separate app. Google has like 17 different apps. You’re seeing this shift from just one specific application that does everything to many different applications designed for a particular experience...

This is true in the West, but is the same true everywhere else? As far as I remember, China and Japan have apps that do everything in one, with no indication that this is going to change.


Sorry...I just have to disagree with the OP. Several years ago, Socrata stopped by where I worked (a news organization) and told us about their idea to build data portals for city governments everywhere that would host their datasets. They were new at the time, and I just thought they were bonkers.

Now, I can't believe what you can find on the various data portals. There's a lot of shit data but that's because lots of organizations collect shit data. But for the organizations that do have data, Socrata is such a huge step from what existed before.

I'll ignore the many situations in which agencies just didn't put out data at all. Dallas, Texas, is one exception: it has been posting its crime data for years. Except it was on an FTP site with a convoluted structure, and it wasn't all in one file. So you had to write a script that spidered the subdirectories, downloaded the files, unzipped them, and concatenated them (and I don't think they were a straight-up concatenation).
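For the curious, the old workflow looked roughly like this -- a sketch using Python's ftplib, with the host and paths made up (the real structure was more convoluted):

    import ftplib
    import io
    import zipfile

    # Sketch of the old workflow: walk an FTP tree, pull down each zip,
    # extract, and concatenate. Host and root path are hypothetical.
    HOST = "ftp.example.gov"
    ROOT = "/crime"

    def walk(ftp, path, zips):
        ftp.cwd(path)
        for name in ftp.nlst():
            if name in (".", ".."):
                continue
            if name.lower().endswith(".zip"):
                zips.append(path + "/" + name)
            else:
                try:
                    walk(ftp, path + "/" + name, zips)  # maybe a subdirectory
                    ftp.cwd(path)                       # come back up
                except ftplib.error_perm:
                    pass  # a plain file; skip it

    ftp = ftplib.FTP(HOST)
    ftp.login()  # anonymous
    zips = []
    walk(ftp, ROOT, zips)

    # Naive concatenation; the real files needed more massaging
    # (repeated header rows, at the very least).
    with open("combined.csv", "wb") as out:
        for path in zips:
            buf = io.BytesIO()
            ftp.retrbinary("RETR " + path, buf.write)
            with zipfile.ZipFile(buf) as z:
                for member in z.namelist():
                    out.write(z.read(member))

All of that just to get one table.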

Now, it's just all on one page, from which you can export the data as bulk CSV [1]. Because Socrata's REST API is so straightforward, you can just script your data requests to hit the right endpoints. And not only is the incident data there, there's the narrative data [2] (which had also been on the FTP site, but required its own spidering), plus tables I hadn't seen before, such as suspects [3] and real-time active calls [4].
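To make it concrete: paging through a Socrata (SODA) endpoint takes only a few lines. A sketch, with a placeholder dataset ID -- real IDs appear in each dataset's URL:

    import requests

    # Sketch of paging through a Socrata (SODA) endpoint. The dataset ID
    # "abcd-1234" is a placeholder; real IDs appear in each dataset's URL.
    URL = "https://www.dallasopendata.com/resource/abcd-1234.json"

    rows, offset, page = [], 0, 1000
    while True:
        batch = requests.get(URL, params={"$limit": page, "$offset": offset}).json()
        rows.extend(batch)
        if len(batch) < page:
            break  # last page
        offset += page

    print(len(rows), "records downloaded")

Compare that to the FTP spider above.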

On top of that, the police department has even decided to put up the data for its officer-involved shootings. Mind you, they were already ahead of the game nationwide last year, when they created a parseable (via scraper) webpage with HTML tables and PDFs. They certainly didn't have to make their data even easier to get, but they did [5].
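Even a scraper-friendly page of HTML tables is a near one-liner to ingest. A sketch with pandas -- the URL is a placeholder for the kind of page described, and read_html needs lxml or html5lib installed:

    import pandas as pd

    # Sketch: pandas grabs every <table> on a page in one call.
    # The URL is a placeholder for the kind of page described above.
    tables = pd.read_html("https://example.gov/ois-incidents.html")
    shootings = tables[0]  # assume the first table holds the incidents
    shootings.to_csv("ois.csv", index=False)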

Texas has always been generally good about public records because of its broad sunshine law. But it's not that the law turned agencies into free-data-hippies overnight...they just have a tradition of doing it -- it helps a lot, if you're a Texas employee, to have seen everyone else agree to potentially damaging records requests without anyone getting stressed out.

I have to think that Socrata, just by being there as an option, not just in Dallas but everywhere, has made bureaucrats more aware of how data sharing can just be...done. Certainly, there are always officials who will push back, because they're power-control-freaks or because they have something to hide. But plenty of bureaucrats don't really care...they've just been told by their IT people that putting up data in an easy way would cost too much and be too much of a security compromise. Now that a general data portal is an option, there are fewer reasons to say no.

Just to give you an idea of how technically clueless many bureaucrats are (and I don't really blame them, but their agencies for not prioritizing tech training)...it is still not unheard of to be denied access to machine-readable data -- e.g. they print out a spreadsheet and fax it to you, instead of just sending you the XLS -- because they think that if they give you the spreadsheet, you can "alter the data".

Yes, it really is that dumb.

edit: to the author's credit, he's not saying that open data portals should be closed, just that governments should move beyond them. That's a nice sentiment, but in reality it's an idea that diverts resources from improving data portals.

From TFA:

> Now we actually give that directly to Waze, so they can reroute people dynamically. Indeed, this is a good open data story — taking the data to where people are — but there’s something more interesting: it’s a two-way street. Not only does Waze now share pothole and road condition data it collects regularly through its app, they went one step further. They began to proactively collect and share data in the interest of public safety.

But that can already be done with the existing LA data portal and its REST API. Why does the city of Los Angeles have to give Waze anything other than the GET endpoint, from which Waze engineers can download as they like? And not just Waze, but everyone else, in equal measure. So there's nothing wrong with what the author wants; he just doesn't appear to recognize that, with APIs, developers can create far better and far more plentiful resources than the city could build itself.

And no, the city (unless it has a magical source of revenue) can't both build out more "human" data applications and improve its open data pipelines. The latter has much, much further to go before the city can spend IT money on building out new apps.

[1] https://www.dallasopendata.com/Police/Dallas-Police-Public-D...

[2] https://www.dallasopendata.com/Police/Bulk-Police-Narrative/...

[3] https://www.dallasopendata.com/Police/Dallas-Police-Public-D...

[4] https://www.dallasopendata.com/dataset/Dallas-Police-Active-...

[5] https://www.dallasopendata.com/Police/Dallas-Police-Public-D...


The faster people realize this, the better.

What does the technology look like for achieving these kinds of goals?


I would say it's actually fairly simple from a technology perspective: good, well-documented APIs, as many SaaS apps as is reasonable, and CTOs in government who get it.


I think the first two are, indeed, fairly simple from a technology perspective. The third -- CTOs in government (or whatever other role is responsible for making the data in question available) who get it -- isn't simple or technology-related.

From my perspective -- having both worked with groups mandated to make data available and consumed public datasets as a researcher -- those responsible for making the data available DON'T get it in the vast majority of instances. It's tough to tell if it's obtuseness, incompetence, or... call it what you will, but if you are mandated to make information available that can directly assess your performance, or the performance of the organization you lead, you might not have the right incentives.

A recent experience: the federal (US) government releases data on clinical trials conducted by drug companies and universities for download in a format that they basically made up. OK, no problem, I've written lots of parsers. Ingest the data from the source files -- but wait! There's no data dictionary, or even a vague description of the relationships between the contents of the (many) files they publish. You can make pretty good guesses, but it definitely doesn't follow a well-documented API (or schema, whatever). Just a recent gripe that's stuck in my craw, but it's not an isolated case in my experience. I have come across a few sources that are very good and follow the best practices you note, but most I've worked with do not. I would guess that the former have your third characteristic; the latter likely do not.
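When there's no data dictionary, you end up writing code just to guess the joins. A rough sketch of what I mean, in Python -- it scans a directory of delimited files for column names that two files share, since those are likely join keys (the pipe-delimited layout and file paths here are made up, not the actual format):

    import csv
    import glob
    from itertools import combinations

    # Sketch: with no data dictionary, guess how files relate by finding
    # column names that two files share (candidate join keys). Assumes
    # pipe-delimited files, which is a made-up detail here.
    headers = {}
    for path in glob.glob("trials/*.txt"):
        with open(path, newline="") as f:
            headers[path] = set(next(csv.reader(f, delimiter="|")))

    for a, b in combinations(sorted(headers), 2):
        shared = headers[a] & headers[b]
        if shared:
            print(a, "<->", b, ": candidate keys", sorted(shared))

A one-page data dictionary from the publisher would make all of this unnecessary.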


If you have a bunch of tiny apps, how do their data models relate? Who runs them, and who owns the data? There are a lot of assumptions baked into the status quo described in the article, assumptions that technology has been developed around. The sweet spot for building this stuff might be a little different.



