Testing departure lists

One thing I’ve spent a fair amount of time on is generating departure boards, lists of departures from a particular stop, and then comparing them to rejseplanen’s departure boards. Obviously I want the information I show to be correct so any difference between the two is a potential error with what I’m showing so I have to look into it. Sometimes it’s my code that’s wrong, sometimes – believe it or not – my response is correct and rejseplanen is wrong. I know because in some cases I’ve asked the transit companies and they agree with me, not them.

The images show what I’m looking at a lot of the time. The first one is an error: rejseplanen has a departure at 07:32 on the 10th of March from Rolighedsskolen (Thisted) but it’s missing from mine, the next one I have is at 8:32. The second image is how I visualize my data, it’s showing all routes that visit that stop according to the data I’m basing my response on, and there is indeed a bus visiting at 07:32. In this case the problem was that rejseplanen shows the departure twice, as if there were two busses arriving at 07:32, and I only showed one. So the bus wasn’t missing, it just wasn’t duplicated, but it’s something I have to check every time because just as often there would really be a problem with my response that I’d have to fix.

Obviously, I’ll keep doing these tests until I’m at a point where it’s consistently differences that I don’t have to fix, and I have to wait until then before I can release the app.

Searching for stops

This week I spent a lot of time working on the new user interface. Along the way I also implemented a bunch of small things. For instance, errors are now being collected and reported – that’s important because if I don’t know that errors are happening I can’t fix them.

As I’m getting closer to something that can actually be used it’s becoming clearer what’s missing. A big missing piece is that currently you can’t search for anything, you can only see nearby stops. I haven’t spent any time on text search so far but over the last week I started thinking about how I would implement that.

It’s still something I’m thinking about but there are some elements that I know I’d like my solution to include. For instance, if you can write the same letter in more than one way – for instance æ/ae or å/aa – you should be able to use either and get a sensible response. Similarly, if you make common misspellings – like using f instead of ff or vice versa – that’s okay, you should get a sensible response as well. And finally, whether you use an abbreviation or write out a word fully – skt/sankt, st/station, etc – shouldn’t make a difference.

Both google and rejseplanen are good at handling different ways of writing the same letter and typos but they’re both struggling with abbreviations. Google has no idea that st and station are the same and rejseplanen thinks it makes a big difference whether you abbreviate sankt. I’d like that to work more smoothly in my implementation.

This week I’ll keep fixing errors and finishing the new user interface and hopefully I’ll have some time to start implementing text search.

New UI live demo

This week I’ve spent a lot of time working on the UI. So far I’ve implemented some basic elements of what I designed last week and I’m not happy with how it works in practice – which is fine, it just means that I have to iterate some more on the designs.

I’ve also been trying to formulate some underlying principles of how the UI should work. One is to never require the user to provide input to the app. They can of course, but it should always be optional and the app should do its best to give a meaningful response regardless of how much or little it knows about what the user wants. One thing I find frustrating with other transit apps is to have to type a bunch of stuff – where I am, where I’m going, when I’m going, blah, blah, blah – before I get any response. I’d like the app to happily take any input you want to give it, but if you don’t it should still tell you something based on whatever it knows.

User interface design

This week I’ve changed gears completely.

I’ve spent all my time for months working on how to store timetables compactly and then how to retrieve the data again correctly and quickly. I also did some of that this week, notably I implemented calendar exceptions. Before, if a trip runs say monday to friday except on christmas and new year I would get the monday to friday bit right, it wouldn’t show up on weekends, but I was ignoring the christmas and new year exceptions and would show it as if it ran there too. Not anymore since this week though, that has now been fixed. I also finished stop clustering and it seems to be working just like you’d expect.

Finally, I spent some more time working on making updates smaller and picked some more low-hanging fruit there. At this point stops that haven’t changed basically cancel out and take hardly any space in the updates but trips still don’t cancel out right. I have a number of ideas for how to fix that, I’ve tried the easiest of them and they didn’t work, but I have some that will take more work but that I’m confident will do the trick.

Most of the week though I’ve been working on designing the user interface for the android app. I have literally no experience in android app design so I’ve spent a lot of time watching instruction videos where super friendly and helpful people have told me all about the tools and methods for that kind of thing. And in-between I’ve started working on the design. It’s a lot of work and I’m learning as I’m going but I’m enjoying it and I’m not unhappy with that I’ve ended up with so far. It’s really super motivating to see the actual app take shape, not just the placeholder stuff I’ve been using to test the data. Even if the final version probably won’t look anything like these first drafts.

Next week should be more of the same, more videos and design, and I plan to start implementing the designs as well.

Switching from protobufs to cap'n'proto

A couple of weeks ago I completed splitting stops and timetables into packages such that to find information for a particular place you only have to look at some of the data, just the packages that cover where the place is, not all of the data. That helped a lot to make the app faster – but alas not enough. Building the departures for a given stop still took ages.

Part of the time it took I can fix but part of it I can’t. Before I start using the timetables from a package I need to read it from a file and parse it so I know what is located where in the package. That takes an uncomfortably long time and the problem is, it’s not my code that does that so it’s not something I can just fix. Plus the code as it is is actually quite efficient, it’s limited how much faster I can make it. In other words: the problem is not that the code is inefficient at parsing the data, it does it more or less as efficiently as you could, the problem is that the data has to be parsed in the first place. That’s a result of how I’ve chosen to store the data, using protocol buffers. They make the data small but at the cost of making it more difficult to figure out where everything is.

What I’ve worked on for the last two weeks is changing how I store data completely, from using protocol buffers as my storage framework to cap’n proto. It makes some different tradeoffs where data is structured much more straightforwardly so the parsing step goes away completely, but on the other hand it takes up much more space. I work around that by compressing zipping the data afterwards – so in some sense I’ve traded unzipping for parsing but I actually zipped the data before too, and unzipping is much much faster than parsing. It’s taken a fair bit of work to make the transition but at this point everything is cap’n proto-ified and it works just as I’d hoped, the big expensive parse step has all but vanished.

Beside this I’ve started to benchmark the basic operations and I’ve made some significant improvements to my own part of the code. One improvement I’ve made is, it took a long time for a given stop to find the routes that visit the stop because I had to visit all routes an all packages that cover the stop and for each route see if it visited that stop. To speed that up I’ve given each route a bloom filter that allows you to ask, without looking through all its stops, does the route visit this stop? So you still have to scan through all the routes but the check for each route is lightning fast, where before it took a huge amount of work. The hitch with bloom filters is that they’ll sometimes say “yes this route visits your stop” when it actually doesn’t but this is rare. It just means that you can’t take the bloom filter’s word for it when it says yes, you have to check whether it really does visit the stop, but you can always trust a no and that’s what it says in the vast majority of cases.

While I’ve been changing data formats I’ve also been working on an update mechanism. I need to be able to update the timetables as the transit companies release new data, which they do every couple of weeks. The data stays almost the same between updates, usually it’s small tweaks, so I’d like to only send out the changes, not a completely new set of timetables that are almost identical to what you already have. And I’d like the update mechanism to be simple because if I come up with something elaborate it’s more likely to contain bugs, so it’s more likely there’ll be errors in your timetables, and people will end up missing their busses or waiting for trains that don’t exist. Which is just the absolute worst – it’d be better if I made no app at all than make one that people can’t trust.

Anyway, with the new data format I came up with a really super simple approach to updating where updates are small if not much has changed. The trick is that even though the update itself will be large – even the empty update where there are no changes will be as large as the data being updated – the fewer changes there are to the data the better zip will be able to compress it. That’s the idea. I’ve implemented the basics of it and it works – I can generate updates, apply them, and the result is what it should be – but the part where the updates compress well so far hasn’t worked out. Tiny updates don’t actually compress into tiny zip files, the zip files are still pretty big. I don’t have a good understand of why that is yet though and I’m relatively confident when I figure that out I can solve the problem by tweaking what I already have, I won’t have to scrap the approach completely.

Another thing I’ve spent some time on is checking the departure boards I generate against the ones rejseplanen generates. I can now do that and it’s revealed some problems. Specifically, one thing I hadn’t been aware of was that sometimes two different stops should actually be considered the same one. Often when you’re at a stop there’ll be another stop with the same name just across the road, one for busses going one way and the other for busses going the other way. When you ask for a departure board for that stop you actually want to get results for both – at least, that’s what rejseplanen does. The data I get doesn’t explicitly say “this stop has a twin across the road”, I have to figure that out myself by noticing that they have the same name and are close together. And that’s actually not as easy as it sounds when there’s 60.000 stops and I can’t be sure stops that are together occur near each other in the list I’m given. I call this stop clustering because it’s more general than just twin stops across the road from each other. Sometimes a single place has more than one stop, for instance a train station may have multiple platforms that are listed as separate stops, but often you’ll want to view them all as a single station.

For the rest of this week I’ll keep tweaking the departure board code – I know I can make it a lot faster still – and the update mechanism to make updates smaller. And finally I want to finish stop clustering.