Transit data and relational databases

What makes working with the data I described in the last post manageable is that there are tools – excellent, free tools – for doing exactly that. A lot of the work in understanding the data is about cross-referencing: a line in one file contains an ID that identifies a line in another, which probably also contains IDs that identify more lines in more files.

An excellent tool for working with this kind of data is a relational database, so most of the time I’ve spent so far has gone into loading the data into one. That gives you a way to look through the data much more easily than with the flat files.

What’s even better is that if you load the data right, the database understands the relationships between what’s in the different files and will let you ask questions, really involved questions too. Here, for instance, is the question “which bus routes stop at least once at a stop with the name Dokk1”:
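
As a rough sketch of what a question like that can look like, here is a small Python example using the built-in sqlite3 module. It is not the exact query or database used for this post: the table and column names (routes, trips, stop_times, stops) are an assumption modelled on a GTFS-style layout, the helper routes_serving is hypothetical, and the rows are made up so the example runs on its own.

```python
import sqlite3

# Assumed GTFS-style schema; the table and column names are illustrative,
# not taken from the actual data set.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_short_name TEXT);
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY,
                        route_id TEXT REFERENCES routes(route_id));
    CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_name TEXT);
    CREATE TABLE stop_times (trip_id TEXT REFERENCES trips(trip_id),
                             stop_id TEXT REFERENCES stops(stop_id),
                             stop_sequence INTEGER);
""")

# Made-up sample rows: two physical stops share the name "Dokk1".
conn.executemany("INSERT INTO routes VALUES (?, ?)",
                 [("r13", "13"), ("r2A", "2A"), ("r99", "99")])
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("t1", "r13"), ("t2", "r2A"), ("t3", "r99"), ("t4", "r13")])
conn.executemany("INSERT INTO stops VALUES (?, ?)",
                 [("s1", "Dokk1"), ("s2", "Dokk1"), ("s3", "Somewhere Else")])
conn.executemany("INSERT INTO stop_times VALUES (?, ?, ?)",
                 [("t1", "s1", 1), ("t1", "s3", 2), ("t2", "s2", 1),
                  ("t3", "s3", 1), ("t4", "s1", 1)])


def routes_serving(conn, stop_name):
    """Return the short names of routes with at least one stop named stop_name."""
    rows = conn.execute("""
        SELECT DISTINCT r.route_short_name
        FROM routes r
        JOIN trips t       ON t.route_id = r.route_id
        JOIN stop_times st ON st.trip_id = t.trip_id
        JOIN stops s       ON s.stop_id = st.stop_id
        WHERE s.stop_name = ?
        ORDER BY r.route_short_name
    """, (stop_name,)).fetchall()
    return [name for (name,) in rows]


print(routes_serving(conn, "Dokk1"))
```

Each JOIN follows one of the ID cross-references between files: route to trip, trip to stop time, stop time to stop. DISTINCT collapses the many trips of a route into a single answer per route.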

It turns out that there are 20 and, as expected from the last post, 13 is on the list.

This is really useful for me working with the data on my machine, and you might think you could just put a relational database in the mobile app and it would solve all your problems. That doesn’t work. When you load the data into the database it grows a lot, so it would be too big, and while it can answer really involved questions, that takes time – finding the list of buses took about 6 seconds and made my laptop spin up and sound like a vacuum cleaner. That wouldn’t work on a phone. So it’s a super useful tool, but it doesn’t solve everything, far from it.