This week was all about taking the raw scheduling data and processing it into a form that a mobile app can use. To give a sense of what that means, it helps to look at what the scheduling data looks like.
Below is a bus schedule I printed from the Midttrafik website. I’ve highlighted some features of a particular bus departure: bus 13 visiting Dokk1 at 6.34 on weekdays.
This is simple for a human to understand. There’s a bus route called 13. This bus makes a number of trips every weekday, one of which starts at 6.10 from Vejlby Nord and passes 16 other stops before ending up at Frydenlund at 6.54. Along the way it visits a stop called Dokk1 at 6.34.
The data that the transit companies provide for machines to read, which is the data I need to use, looks very different. There, all the different concepts (routes, trips, stops, stop times, etc.) are split into separate flat text files. Let me give you an example. Here’s what the data you get about the stop Dokk1 looks like:
…
38400,“Viborgvej/Frydenlunds Allé”,,“56.163513”,“10.174106”,0
38600,“Dokk1. Europaplads”,,“56.153984”,“10.212759”,0
38700,“Brendstrupgårdsvej/Skejby Busvej”,,“56.187187”,“10.171541”,0
…
The stops are given in a file called stops.txt, which is simply a list with one line for each stop. Each line gives an ID number for the stop, the name, the geographic position, and various other bits of information. The current file contains around 64,000 stops.
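As a rough illustration, rows like these could be read into a lookup table keyed by stop ID. This is only a sketch: the column names in STOP_COLUMNS are my guess from the sample rows (the real file has a header line that should be used instead), load_stops is a made-up helper, and the sample is written with plain straight quotes.

```python
import csv
import io

# Assumed column order, guessed from the sample rows. The real stops.txt
# begins with a header line, which should be trusted over this guess.
STOP_COLUMNS = ["stop_id", "stop_name", "stop_desc",
                "stop_lat", "stop_lon", "location_type"]

# The three sample rows from above, written with straight quotes.
SAMPLE_STOPS = '''38400,"Viborgvej/Frydenlunds Allé",,"56.163513","10.174106",0
38600,"Dokk1. Europaplads",,"56.153984","10.212759",0
38700,"Brendstrupgårdsvej/Skejby Busvej",,"56.187187","10.171541",0
'''


def load_stops(text):
    """Parse stops.txt-style rows into a dict keyed by stop ID."""
    stops = {}
    reader = csv.DictReader(io.StringIO(text), fieldnames=STOP_COLUMNS)
    for row in reader:
        # IDs stay strings so that leading zeros (e.g. 03300) survive.
        stops[row["stop_id"]] = {
            "name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
    return stops


stops = load_stops(SAMPLE_STOPS)
print(stops["38600"]["name"])  # Dokk1. Europaplads
```

Keeping the IDs as strings rather than integers is deliberate, since some stop IDs in the other files start with a zero.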
The route, 13, lives in a separate file, routes.txt, which looks just like stops.txt:
…
19133_3,281,“775”,“”,3,,
19308_3,281,“13”,“”,3,,
19309_3,281,“1A”,“”,3,,
…
Again, it gives an ID number for the route and its name. There are around 1,800 of those.
How do we know that bus 13 stops at Dokk1 at 6.34? That’s in another file, stop_times.txt, which is – you guessed it – a list of the times at which buses call at each stop, one per line, around 2,700,000 lines in total.
…
39185170,6:32:00,6:32:00,03300,21,0,0,“”
39185170,6:34:00,6:34:00,38600,22,0,0,“”
39185170,6:36:00,6:36:00,09200,23,0,0,“”
…
This says that the bus arrives at the stop with ID 38600 at 6:34 and leaves again at 6:34. It doesn’t mention route 13, though, so how do we know it’s the right bus and not some other one that happens to be there at the same time? For that we have to use the trip ID that’s given on the same line, 39185170. Trips are listed in another file – you guessed it, trips.txt – which I’m sure looks familiar:
…
19308_3,357,39185099,“Frydenlund”,“”,“0”,,
19308_3,18,39185170,“Frydenlund/Fuglebakkevej”,“”,“0”,,
19308_3,18,39185171,“Frydenlund/Fuglebakkevej”,“”,“0”,,
…
What this says is that there’s a bus trip belonging to the route with ID 19308_3, which we know from before is route 13, and that it goes to Frydenlund/Fuglebakkevej. The entry for Dokk1 at 6:34 that we found in stop_times.txt carries this trip’s ID, which tells us that the entry does indeed belong to route 13.
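The chain of lookups just described (stop time to trip, trip to route) can be sketched in a few lines of Python. The column positions are assumptions read off the sample rows rather than taken from the file headers, and rows and routes_at are names I made up for the illustration.

```python
import csv
import io

# Sample rows from the three files above, written with straight quotes.
ROUTES = '''19133_3,281,"775","",3,,
19308_3,281,"13","",3,,
19309_3,281,"1A","",3,,
'''
TRIPS = '''19308_3,357,39185099,"Frydenlund","","0",,
19308_3,18,39185170,"Frydenlund/Fuglebakkevej","","0",,
19308_3,18,39185171,"Frydenlund/Fuglebakkevej","","0",,
'''
STOP_TIMES = '''39185170,6:32:00,6:32:00,03300,21,0,0,""
39185170,6:34:00,6:34:00,38600,22,0,0,""
39185170,6:36:00,6:36:00,09200,23,0,0,""
'''


def rows(text):
    """Split CSV text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


# Assumed positions: the route ID is column 0 and the name is column 2 in
# routes.txt; the route ID is column 0 and the trip ID column 2 in trips.txt.
route_name = {r[0]: r[2] for r in rows(ROUTES)}
trip_route = {t[2]: t[0] for t in rows(TRIPS)}


def routes_at(stop_id, arrival):
    """Return the route names of all trips arriving at stop_id at arrival."""
    found = []
    for trip_id, arr, _dep, sid, *_rest in rows(STOP_TIMES):
        if sid == stop_id and arr == arrival:
            # stop time -> trip -> route -> route name
            found.append(route_name[trip_route[trip_id]])
    return found


print(routes_at("38600", "6:34:00"))  # ['13']
```

Scanning every stop time like this is fine for three rows, but it would be slow on the full file of around 2,700,000 lines, which is the motivation for reorganizing the data.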
Now, based on a handful of files like this – and there are more than the ones I’ve mentioned – we need to be able to answer questions like: I’m at some position, what’s the nearest bus stop? I’m at bus stop X, what’s the next bus that arrives? What’s the last bus that goes to stop Y today? What I’ll do is take the files as input but reorganize the data completely into a form that allows those questions to be answered efficiently. That’s what this week has been all about.
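To make the idea of reorganizing concrete, here is one possible shape such a structure could take. It is a sketch under my own assumptions and not necessarily the format the app ends up using: departures are grouped per stop and sorted so that the next bus can be found with a binary search, and the nearest stop is found with a plain scan using the haversine distance. All function names are hypothetical.

```python
import bisect
import math
from collections import defaultdict


def to_seconds(hms):
    """Convert 'H:MM:SS' into seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def build_departure_index(stop_times):
    """Map stop ID -> list of (departure seconds, trip ID), sorted by time."""
    index = defaultdict(list)
    for trip_id, departure, stop_id in stop_times:
        index[stop_id].append((to_seconds(departure), trip_id))
    for departures in index.values():
        departures.sort()
    return index


def next_departure(index, stop_id, now):
    """Return the first (seconds, trip ID) at or after now, or None."""
    departures = index.get(stop_id, [])
    # The empty string sorts before any trip ID, so a departure at exactly
    # `now` is still included.
    i = bisect.bisect_left(departures, (to_seconds(now), ""))
    return departures[i] if i < len(departures) else None


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_stop(positions, lat, lon):
    """Return the ID of the stop closest to (lat, lon) by linear scan."""
    return min(positions, key=lambda sid: haversine_m(lat, lon, *positions[sid]))


# (trip ID, departure time, stop ID) taken from the stop_times.txt sample.
sample_stop_times = [
    ("39185170", "6:32:00", "03300"),
    ("39185170", "6:34:00", "38600"),
    ("39185170", "6:36:00", "09200"),
]
# Stop ID -> (lat, lon) taken from the stops.txt sample.
sample_positions = {
    "38400": (56.163513, 10.174106),
    "38600": (56.153984, 10.212759),
    "38700": (56.187187, 10.171541),
}

index = build_departure_index(sample_stop_times)
print(next_departure(index, "38600", "6:30:00"))  # (23640, '39185170')
print(nearest_stop(sample_positions, 56.1540, 10.2127))  # 38600
```

The per-stop sorted lists make the "next bus" question a logarithmic lookup instead of a scan over every stop time. The nearest-stop scan is still linear in the number of stops, so with around 64,000 of them some kind of spatial grid or tree would probably be worth considering.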