A couple of weeks ago I completed splitting stops and timetables into packages such that to find information for a particular place you only have to look at some of the data, just the packages that cover where the place is, not all of the data. That helped a lot to make the app faster – but alas not enough. Building the departures for a given stop still took ages.
Part of the time it took I can fix but part of it I can’t. Before I start using the timetables from a package I need to read it from a file and parse it so I know what is located where in the package. That takes an uncomfortably long time and the problem is, it’s not my code that does that so it’s not something I can just fix. Plus the code as it is is actually quite efficient, it’s limited how much faster I can make it. In other words: the problem is not that the code is inefficient at parsing the data, it does it more or less as efficiently as you could, the problem is that the data has to be parsed in the first place. That’s a result of how I’ve chosen to store the data, using protocol buffers. They make the data small but at the cost of making it more difficult to figure out where everything is.
What I’ve worked on for the last two weeks is changing how I store data completely, from using protocol buffers as my storage framework to cap’n proto. It makes some different tradeoffs where data is structured much more straightforwardly so the parsing step goes away completely, but on the other hand it takes up much more space. I work around that by compressing zipping the data afterwards – so in some sense I’ve traded unzipping for parsing but I actually zipped the data before too, and unzipping is much much faster than parsing. It’s taken a fair bit of work to make the transition but at this point everything is cap’n proto-ified and it works just as I’d hoped, the big expensive parse step has all but vanished.
Beside this I’ve started to benchmark the basic operations and I’ve made some significant improvements to my own part of the code. One improvement I’ve made is, it took a long time for a given stop to find the routes that visit the stop because I had to visit all routes an all packages that cover the stop and for each route see if it visited that stop. To speed that up I’ve given each route a bloom filter that allows you to ask, without looking through all its stops, does the route visit this stop? So you still have to scan through all the routes but the check for each route is lightning fast, where before it took a huge amount of work. The hitch with bloom filters is that they’ll sometimes say “yes this route visits your stop” when it actually doesn’t but this is rare. It just means that you can’t take the bloom filter’s word for it when it says yes, you have to check whether it really does visit the stop, but you can always trust a no and that’s what it says in the vast majority of cases.
While I’ve been changing data formats I’ve also been working on an update mechanism. I need to be able to update the timetables as the transit companies release new data, which they do every couple of weeks. The data stays almost the same between updates, usually it’s small tweaks, so I’d like to only send out the changes, not a completely new set of timetables that are almost identical to what you already have. And I’d like the update mechanism to be simple because if I come up with something elaborate it’s more likely to contain bugs, so it’s more likely there’ll be errors in your timetables, and people will end up missing their busses or waiting for trains that don’t exist. Which is just the absolute worst – it’d be better if I made no app at all than make one that people can’t trust.
Anyway, with the new data format I came up with a really super simple approach to updating where updates are small if not much has changed. The trick is that even though the update itself will be large – even the empty update where there are no changes will be as large as the data being updated – the fewer changes there are to the data the better zip will be able to compress it. That’s the idea. I’ve implemented the basics of it and it works – I can generate updates, apply them, and the result is what it should be – but the part where the updates compress well so far hasn’t worked out. Tiny updates don’t actually compress into tiny zip files, the zip files are still pretty big. I don’t have a good understand of why that is yet though and I’m relatively confident when I figure that out I can solve the problem by tweaking what I already have, I won’t have to scrap the approach completely.
Another thing I’ve spent some time on is checking the departure boards I generate against the ones rejseplanen generates. I can now do that and it’s revealed some problems. Specifically, one thing I hadn’t been aware of was that sometimes two different stops should actually be considered the same one. Often when you’re at a stop there’ll be another stop with the same name just across the road, one for busses going one way and the other for busses going the other way. When you ask for a departure board for that stop you actually want to get results for both – at least, that’s what rejseplanen does. The data I get doesn’t explicitly say “this stop has a twin across the road”, I have to figure that out myself by noticing that they have the same name and are close together. And that’s actually not as easy as it sounds when there’s 60.000 stops and I can’t be sure stops that are together occur near each other in the list I’m given. I call this stop clustering because it’s more general than just twin stops across the road from each other. Sometimes a single place has more than one stop, for instance a train station may have multiple platforms that are listed as separate stops, but often you’ll want to view them all as a single station.
For the rest of this week I’ll keep tweaking the departure board code – I know I can make it a lot faster still – and the update mechanism to make updates smaller. And finally I want to finish stop clustering.