Wednesday, January 23, 2013

Too Much of a Good Thing

What makes OpenStreetMap great? Well there may be multiple answers from different people. But one thing is certain: Our geodata is awesome. I mean it is kind of the reason that we exist. So more is better, right? We are, after all, trying to map the entire planet.

But it turns out, there can actually be too much of a good thing. One of my first encounters with such a situation was while doing some fixing up of imported TIGER data somewhere out west. I came across this:

Notice the scale. We are looking at a little over 100 meters of country highway and there are over 300 nodes in this image. This is obviously beyond ridiculous and I can assure you that this particular instance was cleaned up long ago and those nodes met their appropriate end in a special level of /dev/null.

But this is not an isolated incident. I keep coming across such over-noding. So I decided to use my local planet database (which I previously blogged about setting up) to try and find out how much this really happens and where such things come from.

Well, the "ways" table in the database is 85 GB in size and has over 165 million entries in it. Doing anything with the whole data set tends to be slow. So I decided to filter out some things. While discussing it on IRC, Andrew Buck suggested starting by ignoring ways with fewer than 10 nodes in them. After some consideration this actually seemed reasonable. It is hard to over-node something with fewer than 10 nodes. So I made a new table with only the remaining ways in it and with only a subset of columns to make it smaller and more manageable. This actually ended up eliminating a lot more ways than I anticipated. It turns out, 75% of the ways in the OSM database have fewer than 10 nodes in them. This left me with about 40.5 million ways to investigate.

The next question is what does it mean to be "over-noded" anyway? Since this is a geospatial database, I can use functions to determine the length of a given way. And I also know how many nodes each way has in it. So a logical unit is nodes per meter.
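To make the unit concrete, here is a minimal Python sketch of the calculation. It is not the database query behind the numbers in this post: it assumes a way is simply a list of (lat, lon) pairs in degrees and approximates length with the haversine formula on a spherical Earth. The names `haversine_m`, `way_length_m` and `nodes_per_meter` are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def way_length_m(coords):
    """Sum of the segment lengths between consecutive nodes."""
    return sum(haversine_m(*a, *b) for a, b in zip(coords, coords[1:]))

def nodes_per_meter(coords):
    """Node count divided by way length in meters."""
    length = way_length_m(coords)
    if length == 0:
        raise ValueError("way has zero length")
    return len(coords) / length

# Hypothetical way: 11 nodes evenly spaced along 0.001 degrees of the equator.
example_way = [(0.0, i * 0.0001) for i in range(11)]
print(f"{way_length_m(example_way):.1f} m, {nodes_per_meter(example_way):.3f} nodes/m")
```

The example way is roughly 111 meters long, so 11 nodes put it a little under 0.1 nodes/meter.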

Next up is to determine what is "normal" for typical mapping. Taking the 40 million ways with 10 or more nodes, I ran some basic statistical functions and found the following numbers:
Minimum nodes/meter: 0.00000282
Median nodes/meter: 0.043
Average nodes/meter: 0.088
Maximum nodes/meter: 65.23

Well that is quite a range. Let's see what is going on here. The way with the fewest nodes/meter turns out to be way 105818922 which is part of the border of Indonesia. It has 14 nodes and stretches almost 5,000 kilometers, averaging one node every 350 kilometers. It could maybe use a little more refinement but it's a national border in the middle of the ocean... it is probably not too bad.

On the other end of the spectrum is way 44342937. It supposedly represents a round "building" with a diameter of about 1.5 meters. It has 339 nodes. What the... WHAT?! What is this, the world's most accurately mapped dog house? Since it is likely that this way is going to be deleted in the very near future, I include here a screen shot of it loaded up in JOSM:

Yes, that is a tiny circle in the middle of someone's back yard. All that red around it is JOSM trying to draw its normal way direction arrows on such a ridiculous object.
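A quick back-of-the-envelope check on those numbers, as a sketch. It assumes the 339 nodes are spread evenly along the way and that the way is a perfect circle, neither of which is guaranteed; the variable names are only for illustration.

```python
import math

node_count = 339         # nodes reported for way 44342937
nodes_per_meter = 65.23  # the maximum ratio found above

perimeter_m = node_count / nodes_per_meter  # about 5.2 m of outline
spacing_cm = 100 / nodes_per_meter          # about 1.5 cm between neighboring nodes
diameter_m = perimeter_m / math.pi          # about 1.65 m if it were a perfect circle

print(f"perimeter {perimeter_m:.2f} m, spacing {spacing_cm:.2f} cm, diameter {diameter_m:.2f} m")
```

In other words, the nodes sit only a centimeter or two apart.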

Now that we have examined both ends of the spectrum, what's in the middle? Well, there is way 101521179. It is about 175 meters long with 14 nodes, which yields a nodes/meter reading of 0.08000019. That is... well... entirely reasonable. The fact that the median is half of the average would seem to indicate that the numbers are skewed towards the high end by comparatively few ways with a very high number of nodes, like that one with 65 per meter.
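The effect of a few extreme ways on the average can be seen in a small Python sketch. The `ratios` list is made-up sample data, not values from the database; it was chosen only so that one outlier pulls the mean well above the median.

```python
import statistics

# Hypothetical nodes/meter values: typical mapping plus one heavily over-noded way.
ratios = [0.02, 0.03, 0.04, 0.04, 0.05, 0.06, 0.08, 0.10, 0.37]

summary = {
    "min": min(ratios),
    "median": statistics.median(ratios),
    "mean": statistics.mean(ratios),
    "max": max(ratios),
}

# The same mean with the single outlier removed, for comparison.
mean_without_outlier = statistics.mean(ratios[:-1])

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
print(f"mean without outlier: {mean_without_outlier:.3f}")
```

Dropping the single outlier brings the mean back down to roughly the median, which is the same pattern the real numbers hint at.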

So if 65 nodes/meter is the extreme upper limit and most people map at less than 0.1 then what exactly is "too much"? Well I don't really know but I'm just going to throw out a number for further analysis. Let's make it... 1 node per meter. This limits the number of ways down to about 105 thousand.

Since my database has geographic knowledge about these ways and I have the number down to a reasonable level, let's hook up QGIS to it and make a map of the ways with more than 1 node per meter. Let's start with the US:

Well that's actually not as dense as I was expecting. It looks like some of the TIGER over-noding might actually come in at under 1 node/meter. Which kind of makes sense if you think about it. A lot of TIGER ways are really long and the over-noding is sometimes clumped into one section so on average the nodes/meter ratio will be less than 1. But for fun let's look at what other interesting things might be in this data set. How about Europe?

Well now that's different. What's with the outbreak in France?! I have my suspicions. Let's see how this works out...

There is a "source" tag that is often used in OSM data to indicate where a certain map object came from. It is especially often used during imports. Technically it usually makes more sense to put the source tag on the changeset instead of the map object itself, especially for large imports but a lot of people still put it on the map objects anyway. So let's see what these over-noded ways reveal in their source tags. Here are the results of a query that groups things by source tag and counts how many ways with a node/meter ratio of over 1 have that source tag. I truncated the source information at 70 characters for display purposes.

 62995 | cadastre-dgi-fr source : Direction Générale des Impôts - Cadastre. Mis
 22713 | 
  3535 | extraction vectorielle v1 cadastre-dgi-fr source : Direction Générale 
  1313 | Bing
  1045 | 3dShapes
   744 | bing
   687 | Kolding Kommune
   628 | NHD
   608 | dcgis
   602 | WakeGIS
   585 | WroclawGIS
   526 | Planimetria de Vitoria
   482 | MassGIS Buildings (
   454 |
   393 | NextView
   390 | Regione Emilia Romagna
   369 | cadastre-dgi-fr source : Direction Générale des Impôts - Cadastre ; mi
   349 | kapor2
   336 | Bing Sat
   324 | SO!GIS Import
   260 |
   244 | Regione_del_Veneto_LR28_16.7.1976_Formazione_CTR_auth_39164-5700-1100_
   230 | SO!GIS Import
   225 | MGC
   214 | Kreis_Viersen_Katasteramt_2012_06
   214 | lukr
   206 | Ajuntament de Girona
   184 | CCH
   183 | NRCan-CanVec-10.0
   175 | Orthophotos 2011 du SITG (Système d'Information du Territoire Genevois
   133 | DEP Wetlands (1:12,000) - April 2007 (
   128 | vuv:dibavod:a05
   125 |
   115 | City of Kamloops
   112 | cadastre-dgi-fr source : PaysDeBrest - 20100331
   105 | OS_OpenData_VectorMapDistrict
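The grouping behind a table like the one above can be sketched in Python. This is an illustration, not the query that produced the table: it assumes each way is a dict with hypothetical `nodes_per_meter` and `tags` keys, and it groups on the truncated source string, whereas the original query may have truncated only for display.

```python
from collections import Counter

def count_sources(ways, threshold=1.0, width=70):
    """Count ways above the nodes/meter threshold, grouped by (truncated) source tag."""
    counts = Counter()
    for way in ways:
        if way["nodes_per_meter"] > threshold:
            source = way["tags"].get("source", "")  # a missing tag counts as blank
            counts[source[:width]] += 1
    return counts.most_common()

# Made-up sample data for illustration only.
sample_ways = [
    {"nodes_per_meter": 1.2, "tags": {"source": "example-import"}},
    {"nodes_per_meter": 1.5, "tags": {"source": "example-import"}},
    {"nodes_per_meter": 3.0, "tags": {"source": "Bing"}},
    {"nodes_per_meter": 2.0, "tags": {}},
    {"nodes_per_meter": 0.4, "tags": {"source": "Bing"}},
]

source_counts = count_sources(sample_ways)
for source, count in source_counts:
    print(f"{count:6d} | {source}")
```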

Well then. I recognize that top source tag as the ongoing import of French building outlines and some other features from some cadastre data they got their hands on over there. A spot check shows that some of this isn't actually because of excessive nodes being used but rather the odd way in which the import is creating building geometries. For example, here is way 67157454.

Note the wall=no tag. I believe this implies that it is some kind of porch or veranda. It has a rounded front which is where all the nodes are that make it have a nodes/meter ratio of just over 1.0. If a human mapper had mapped this building it would have likely been a single way that included the porch as well as the rest of the house instead of the 8 individual areas that the import created. This would have resulted in a single 70 meter long way with 47 nodes which comes out to 0.67 nodes/meter.

Checking a few more of these ways in France, it does look like some care was taken to prevent over-noding. Most of them are just barely over 1.0 nodes/meter. And at this point it should also be noted that closed ways are actually having their first/last node counted twice because of the way the database stores the node membership information. Since I was originally looking for over-noded highways I didn't take this into account. So technically a lot of these French buildings might be just under 1 node/meter. But the fact that there are so many of them right on this (arbitrary) limit is still interesting.
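Correcting for that double count is simple. A minimal sketch, assuming a way's membership is available as an ordered list of node IDs (the function name `unique_node_count` and the sample values are hypothetical):

```python
def unique_node_count(node_refs):
    """Count a way's nodes, counting the shared first/last node of a closed way once."""
    if len(node_refs) > 1 and node_refs[0] == node_refs[-1]:
        return len(node_refs) - 1
    return len(node_refs)

# Hypothetical closed way: four distinct nodes around a 4.5 meter outline.
closed_refs = [1, 2, 3, 4, 1]
perimeter_m = 4.5

raw_ratio = len(closed_refs) / perimeter_m                      # first/last node counted twice
corrected_ratio = unique_node_count(closed_refs) / perimeter_m  # counted once

print(f"raw {raw_ratio:.2f}, corrected {corrected_ratio:.2f}")
```

For a small closed way like this one, the correction is enough to move it from just over 1 node/meter to just under.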

There are a few other source tags I recognize as well. CanVec is imported from Canadian government data. NHD is the National Hydrography Dataset here in the US from which some people have imported rivers and lakes. I know several of these water feature imports have had problems with ridiculous over-noding. Some of it has been fixed but a lot remains. MassGIS is the Massachusetts GIS office, whose data has been used for some local imports. In particular, the most over-noded way that I showcased above is from this MassGIS building import.

Of course the second most popular source tag is blank which doesn't tell us much. A tiny random sample shows there are a couple of imports that didn't use any source tag and some are just very detailed manual mapping.

I think finding what I was originally looking for (over-noded TIGER ways) is going to take some more digging. This post kind of got hijacked by the big red blob in France but I think I'll call it quits for now and do some more poking to find the TIGERs I'm after. If there is a lesson so far, I think it can be summed up as:

  1. Sanity check your imports! Tiny objects with a large number of nodes are an obvious sign that there is something weird going on in your conversion to .osm format. It is possible that the object was represented as a parametric curve in the source and the conversion to OSM format tried to recreate that as closely as possible within the constraints of our x/y coordinate system.
  2. Imports are not human data. Even in France where this building import is generally viewed as a good thing and a lot of checking and verifying of the data has taken place, the data is still very distinctly different from most of the OSM database that has been created by hand. This may not be a bad thing in every case but it is definitely something to think about when proposing and executing an import.

Lastly, I leave you with one more way. As I was randomly browsing the 105 thousand ways, I happened to click on way number 129485933 which has 1.00526 nodes per meter (again, with the first/last node double counted). If you look back at versions 1 and 2 you will see a certain someone's user name. Turns out I am the culprit! This way was originally created on my phone during the baseball game we attended during SOTM 2011 in Denver. How random is that?


  1. I like that smallest house. You also notice that the nodes are aligned to a grid, far less precise than JOSM's precision.

    (also note the scale, I'm working on cm scales here)

    1. Yeah, I saw that. I was wondering if that was actually approaching the accuracy limit of the API. I'm not quite sure. I think it is somewhere around 1cm at the equator, getting smaller towards the poles.

  2. How about adding a Douglas-Peucker or similar line generalization routine for Potlatch and other tools?

  3. I was just editing around Kilimanjaro and spotted this (a way about 3km and over 1000 points, mostly just heading south)

    I've come across these before, where I suspect the original uploader has taken their GPS log and turned every GPS trackpoint into an OSM node.

    Running the JOSM Tool->Simplify Way reduces this to just 87 nodes.
    [And this way is now comparable to the other paths in this area, with no loss of detail]

    I assume the Simplify Way uses the Algorithm mentioned above.

    1. Yes, JOSM has a Douglas-Peucker simplification feature. You can even set how aggressive it is via the simplify-way.max-error setting in advanced preferences. Ideally it would be better to prevent this over-noding from getting into the database in the first place though :)

  4. That round object is a large sewer pipe on the U. of North Carolina campus. I am the one responsible for that bad import.

    I have noticed unwanted results with the simplification algorithm that may reflect user error. Many of the buildings that I imported will have, say, 5-10 nodes per straight side (rather than 2). When I simplify, the building is no longer right-angled and orthogonalize does not help. Is there a simplify that just gets rid of extra nodes on straight parts of ways?

    Also, it looks as though your first example is just Zeno's paradox of dichotomy (always getting halfway to your destination).

    1. Never mind about the simplification algorithm. I just realized that if you set the max-error setting to very small, it does the job.
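
For readers curious about the Douglas-Peucker simplification discussed in the comments, here is a minimal Python sketch. It is not JOSM's implementation: it assumes planar (x, y) coordinates in meters, and the `max_error` parameter is a hypothetical stand-in for the idea behind the setting mentioned above. With a very small tolerance, only nodes lying on straight sides are dropped and the corners survive.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, all given as (x, y) in meters."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # a and b coincide, e.g. the ends of a closed way
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, max_error):
    """Douglas-Peucker: keep the farthest point if it deviates more than max_error."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= max_error:
        return [first, last]
    left = simplify(points[:index + 1], max_error)
    right = simplify(points[index:], max_error)
    return left[:-1] + right  # the split point appears in both halves; keep it once

# Hypothetical 10 m x 6 m building outline with an extra node on every side.
outline = [(0, 0), (5, 0), (10, 0), (10, 3), (10, 6), (5, 6), (0, 6), (0, 3), (0, 0)]
simplified = simplify(outline, max_error=0.01)
print(simplified)
```

With a 1 cm tolerance the four mid-side nodes are removed and the rectangle's corners are left untouched.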