The past weekend, Bodo, Lucas, Tobi and I participated in the 2012 Rails Rumble. We decided to write a geocoder based on OpenStreetMap data, since Google likes to change their terms of service from time to time and can start charging you once you exceed a certain number of requests, and since the geocoders that are currently out there sort of suck for Germany.
Attention: This post is quite long. Here’s the TL;DR: We created a geocoder based on OSM data that just works. It’s already pretty awesome, it runs on an awesome open source database, and it’s going to become better and better!
Before I give you an introduction to the OSM data and why the existing geocoders struggle with it, try it out for yourself. Go to any of the existing OSM-based geocoders (Nominatim, for example) and search for "Marburger Straße, Gießen".
All of them will give you around 50 results, only one of them in Gießen. The worst thing is that the street in Gießen itself, which is the one the user is most likely looking for, is weighted so low in the results that on Nominatim you have to go to the 10th result page to find it. When you do the same query on Google Maps, it gives you exactly one result, and that one is in Gießen. Exactly what we’ve been looking for.
Let me tell you why that’s the case.
Intro to the OSM data format
If you have ever looked at the data, please skip this part :D
OpenStreetMap data basically consists of three types of information: Points (called Nodes), Ways (which can also be Areas) and Relations (I’ll get to those in a moment). All of these can be tagged with additional information. Tags are technically a fourth type (didn’t I say there were only three? Oh, well …)
Nodes are basically points. They have a latitude and longitude and maybe some information attached to them in the form of tags. A tag on a node might say that at this exact location there is a trash can, a post box, a highway intersection, the entrance to a building or whatever …
In the OSM-XML format it looks like this (examples are reduced to an absolute minimum):
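```xml
<!-- a minimal node (IDs and coordinates are made up): a post box at one exact spot -->
<node id="123456789" lat="50.5840" lon="8.6784">
  <tag k="amenity" v="post_box"/>
</node>
```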
A way is a group of points that somehow belong together. It can be an open way (like a street, a footpath, etc.) or a closed way (called an area) like a forest, the outer walls of a building, a lake, etc. A way consists of a lot of references to nodes and almost always has a tag describing its purpose:
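```xml
<!-- a minimal way (IDs are made up): a residential street built from three node references -->
<way id="23456789">
  <nd ref="123456789"/>
  <nd ref="123456790"/>
  <nd ref="123456791"/>
  <tag k="highway" v="residential"/>
  <tag k="name" v="Marburger Straße"/>
</way>
```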
A relation is a collection of ways and nodes that serve a shared purpose. This might be something like a bus route (ways for the route the bus takes, and nodes for the bus stops):
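```xml
<!-- a minimal relation (IDs are made up): a bus route made of ways (the route) and a node (a stop) -->
<relation id="3456789">
  <member type="way" ref="23456789" role=""/>
  <member type="way" ref="23456790" role=""/>
  <member type="node" ref="123456791" role="stop"/>
  <tag k="type" v="route"/>
  <tag k="route" v="bus"/>
</relation>
```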
Aaaaaaand we’re done. That’s all there is to OSM. The only thing is … OSM has a huge amount of that data … No, it’s a really, really huge amount! The XML file for my state is around 5 GiB! The file for Germany is 19 GiB and contains more than 110,000,000 nodes.
How is this pile of data usually handled? Well, the OSM people who use that data for computation, map rendering and so on put it into a Postgres database with the PostGIS extension. But the data structure stays pretty much the same: there are three main tables that hold the data I described above.
As I already said, there’s a huge pile of data, so searching is really hard. So those guys thought about this and created some meta tables that hold information optimized for searching. They put together a string in the format “Street, Town, District, State, Country” (actually there is more info in there, but let’s keep it short) and store it in the database with a reference to the way entry that holds the geo-information about the street. When a search is issued, a fulltext search is run on these strings. This is fast and returns results quickly. But it is also where a lot of things go wrong in Germany.
Well, in Germany there’s a special thing about the district (I don’t know if it’s really special or if this might also happen in your country). The district (or Regierungsbezirk) is usually named after the biggest city in that district. Let me give you some examples:
- Marburger Straße, Gießen, Regierungsbezirk Gießen, Hessen, Deutschland
- Marburger Straße, Münchhausen, Regierungsbezirk Gießen, Hessen, Deutschland
So, for the first entry, the city is Gießen and the district is Regierungsbezirk Gießen. For the second entry, the city is Münchhausen, but the district is Regierungsbezirk Gießen, too. When doing a fulltext search on that field with “Marburger Straße, Gießen”, I get both results. And if there is no additional filtering done, which I guess there isn’t (please correct me if I’m wrong), you get the results in the order in which they are stored in the database. That is not really what the user wants …
Also, in Germany nobody ever uses the district for anything (except bureaucrats …), so matching on the district is just wrong. If I went outside and asked 20 people which district they live in, maybe 5 of them would know what their Regierungsbezirk is.
Enough ranting already, what are you doing about it?
So, for the Rails Rumble project we decided to write our own geocoding service using OSM data (also: with Black Jack and hookers!).
Some decisions were made upfront: We wanted to use the NoSQL database ArangoDB, since we think OSM’s data, with its open-ended number of tags per object, is an excellent fit for NoSQL. Also, the guys at triAGENS have quite the history of making excellent database products. One of their developers was with us for most of the Rumble, helping us with the tiny optimization tweaks we didn’t know about so we could scale the database to OSM level (thanks, Frank!). And of course, Lucas and Tobi are the developers and maintainers of Ashikawa, the Ruby driver for ArangoDB, so we could use their knowledge, too.
Enough marketing :D Let’s talk data
We decided that we were not going to stick with the OSM data format I described above! Why not? It is too general purpose. Some people use the data to create maps, others do geocoding, others extract data from it, and so on. We want to do one thing here: geocoding!
Our data format looks a little like this:
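(The exact field names are simplified for this post, but this is the gist of it: streets are stored as one representative point together with the names our parser matches against, and cities, states and zip code areas are stored as circles; more on the circles below.)

```json
[
  { "type": "street", "name": "Marburger Straße",
    "city": "Gießen", "zip": "35396",
    "lat": 50.596, "lon": 8.684 },

  { "type": "city", "name": "Gießen", "state": "Hessen",
    "lat": 50.584, "lon": 8.678, "radius": 5500 }
]
```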
You might notice three things:
a) “How awesome is Germany laid out??? Their cities, states and zip code areas are circles?” - Sorry, no. But ArangoDB currently only supports geo-queries like “is this point within a given area” for circular areas. So what we did is: find the centroid of the polygon that describes the state or city, use this as the center of the circle, then find the distance to the point that is furthest away and use that as the radius (there’s a small sketch of this approximation right after this list). We know this is a really crude and inaccurate approximation, but it was the fastest thing to do, and it actually works OK. Which doesn’t mean we are going to stick with it. We are already talking with the ArangoDB people about how we can improve this :)
b) “That is totally not usable for my case” - Yes, as I said before, we transform the data to make it fit our case :)
c) “Where are the house numbers?” - We know, and we are very sorry. A geocoder is not really good when you are looking for a house on a 10 km long street and we give you a point in the middle of it. We just didn’t have the time to figure out a good way to extract the house numbers in a meaningful manner, and instead of faking it for the judges by only putting in the data for some sample queries, we left them out. But we promise, we will work on that.
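To make point a) a bit more concrete, here’s a small sketch of that circle approximation. This is simplified Ruby, not our actual import code, but it shows the idea:

```ruby
# Approximate a polygon (a list of [lat, lon] pairs) by a circle:
# the center is the centroid of the points, the radius is the distance
# to the point that is furthest away from that center.
EARTH_RADIUS_M = 6_371_000

def haversine_distance(a, b)
  lat1, lon1, lat2, lon2 = [a, b].flatten.map { |deg| deg * Math::PI / 180 }
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h))
end

def bounding_circle(points)
  center = [
    points.inject(0.0) { |sum, (lat, _)| sum + lat } / points.size,
    points.inject(0.0) { |sum, (_, lon)| sum + lon } / points.size
  ]
  { lat: center[0], lon: center[1],
    radius: points.map { |point| haversine_distance(center, point) }.max }
end

# bounding_circle([[50.58, 8.66], [50.60, 8.70], [50.57, 8.69]])
# => { :lat => 50.58..., :lon => 8.68..., :radius => ... }
```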
Just as a side note: We only imported Nordrhein-Westfalen (the state we all live in), since we had to figure out the extraction and transformation of the data first, and we simply didn’t have the resources to parse through 20 GiB of XML, transform everything and upload it to our Linode. More data is coming!
Instead of doing a simple fulltext search, we decided to write a parser for the user input that knows about the cities, zip codes and states in Germany and splits the input into the right fields. So when a user searches for “Im Sohlgraben 20, 35043 Marburg”, the parser figures out that Marburg is the city, 35043 is the zip code, Im Sohlgraben is the street and 20 is the house number. It also does some normalization to remove ambiguities.
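Here’s a very stripped-down sketch of the idea; the real parser handles a lot more input variations and validates the parts against the imported data:

```ruby
# Split a German-style address like "Im Sohlgraben 20, 35043 Marburg"
# into street, house number, zip code and city. Heavily simplified.
ADDRESS = /\A\s*(?<street>.+?)\s+(?<number>\d+\w?)\s*,\s*(?<zip>\d{5})\s+(?<city>.+?)\s*\z/

def parse_address(query)
  match = ADDRESS.match(query)
  return nil unless match

  { street: match[:street], house_number: match[:number],
    zip: match[:zip], city: match[:city] }
end

parse_address("Im Sohlgraben 20, 35043 Marburg")
# => {:street=>"Im Sohlgraben", :house_number=>"20", :zip=>"35043", :city=>"Marburg"}
```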
The parser is also open source and can be found in our happy-geocode organization on GitHub. It is optimized for Germany and only works with our address format. PLEASE, if you want to help out and participate, fork the project, add a parser for your country and send a pull request! We are really looking forward to bringing this thing to other countries as well!
With the parsed data, we can now do some very detailed queries in AQL, the ArangoDB Query Language.
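The real queries are a bit longer than this, but the idea is roughly the following (collection and field names are the simplified ones from the sketch above, so take them with a grain of salt):

```
// 1) resolve the parsed city (or zip code) to its circle
FOR c IN cities
  FILTER c.name == "Marburg" || c.zip == "35043"
  RETURN { "lat": c.lat, "lon": c.lon, "radius": c.radius }

// 2) find streets with the parsed name; the result is then narrowed down
//    to the ones whose point falls inside that circle
FOR s IN streets
  FILTER s.name == "Im Sohlgraben"
  RETURN { "name": s.name, "lat": s.lat, "lon": s.lon }
```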
We split this into more than one query because that was the quickest (and, imho, also slightly dirty) implementation we could come up with, and it works. We will talk to the ArangoDB folks about where we can take advantage of more indexes and more AQL features in the future.
So, the final result is a random point on the street you’ve been searching for! Until we include house numbers, that’s a fun little thing to play with. We know it makes the service not really useful for anybody yet, but hey … 48 hours. I’m impressed with what we did there.
Parsing a huge pile of data takes CPU, RAM and TIME!
I knew this before, but I have a whole new appreciation for it now that I have experienced it myself. At the beginning we were talking about “Let’s do Germany, and if we have time, let’s get started on the US.” Well, I started with Germany, realized it was too much and reduced the dataset to one state! And even with that, we imported the last pieces of data around 11pm, 3 hours before the end of the competition.
Creating our own format … Such a good idea?
Since we didn’t start planning anything until 10am on Saturday (8 hours after the competition started), we started to implement the first solution we thought of. We refined it throughout the day, but it cost me some nerves (and I think I saw some gray hairs on my head this morning) to get everything going so that the others could work with the data.
Another problem I see is that we have a snapshot of the data from Friday. We cannot really link our data back to the original OSM data. So if we want to update our dataset, we have to throw away everything we have and start a fresh import.
So I think we have to rethink our data format.
Having local support beats everything
Having Frank on site with us for most of the weekend helped so much. He helped set up and tweak the database on our servers, and whenever we hit a speed bump, he was there to help, with all the details he knows about the inner workings of the database. And even after the competition, a lot of mails went back and forth between us and some of the triAGENS engineers, who are eager to help improve everything.
I have never worked so closely with the creators of a framework or piece of infrastructure, and I LOVE it! Support your local companies, guys!
Working with OSM is hard
The OpenStreetMap community has some great tools like the Osmosis command line utility to extract data from datasets. It uses all CPUs and all the available memory, and it was a huge asset in our process. But still, working with OSM data is hard. If you want to extract a certain category of data (like all gas stations in Germany), you have to download the whole OSM file for Germany and filter it yourself.
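For the gas station example, the Osmosis invocation looks something like this (task names as I remember them from the Osmosis docs, so double-check them before copying):

```sh
# keep only nodes tagged amenity=fuel from the Germany extract
osmosis --read-xml file="germany.osm" \
        --tf accept-nodes amenity=fuel \
        --tf reject-ways --tf reject-relations \
        --write-xml file="gas_stations.osm"
```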
We want to make a lot of those tasks easier, and I hope that one day OSM will be the first go-to place for people whenever they need geodata or other geographic information, instead of Google Maps.
If you really made it here, thanks for reading! It was an awesome weekend and we feel really good about what we’ve created. We will continue to improve the service!
Oh … One more thing: Even though we are not quite sure where we are going with this, one thing is for sure: Everything we do will be open source. So please contribute, give feedback and let us know how we can improve!