Saturday, June 25, 2011

Cleanup on the Map Aisle

Spills happen. A sticky honey jar in the grocery store. A glass of orange juice at breakfast. Oil spills in the Gulf of Mexico. Node spills in OpenStreetMap. Some can be cleaned up with a mop and some water. Others require a more technical solution. What is a "node spill" you ask? It's when nodes get uploaded to OSM that don't have any tags on them and aren't part of a way. These tagless, unconnected nodes add no useful information to the map and are just dead weight in the database. Where do they come from? There are at least a couple of common sources of node spills. One is editor bugs and simple user error. These are generally pretty small spills of tens or maybe hundreds of nodes. The bigger problem is imports.

The topic of imports deserves its own post. For now I'll just say that badly performed imports and insufficient checking afterwards can leave tens of thousands of empty nodes behind. The basic problem is that nodes get uploaded first and don't become part of a way until the way itself is uploaded later. So if something goes wrong with the way upload, the nodes are left orphaned.

I have come across a few of these nodes before but what really caught my attention was a failed NHD (National Hydrography Dataset) import in Oklahoma. It happened to poke up into Kansas a little bit where I noticed it next to a state highway I was editing. After a lengthy thread on the talk-us mailing list I eventually found a good way to detect these nodes. Since then I have made my way across the US and some of Canada deleting these useless nodes from the database. As detailed in this message it goes something like this:

  • Use the XAPI to perform a query of the form /node[not(way)][bbox=a,b,c,d]
  • Open the result of this query in JOSM and apply a filter to hide all nodes with tags
  • Perform some checks to make sure there really is no useful data
  • Delete nodes and upload

XAPI Query
The [not(way)] XAPI predicate is documented on the OSM wiki XAPI page. In addition you need to add a bounding box to the query. An easy way to come up with a bounding box is to use the uixapi page. Hold down shift and draw a box around the area you wish to query, then tick the "Search by Area" checkbox above the map. The URL at the bottom of the page will change to include a [bbox=...] portion. Note that the bbox won't update automatically if you draw another box. You need to uncheck and recheck the checkbox for it to update.

If you are using one of the public XAPI services then the size of the bounding box is limited to 10 square degrees. Since I am running these queries against my own XAPI installation, I removed that limitation, so I've just been doing entire states at a time. Query times vary widely depending on how much data is in a given area: all of Alaska took under 10 minutes while some of the small states on the east coast took about an hour.
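Checking the box size before firing off a query is easy. Here is a quick sketch in Python (my own illustration, not part of the original workflow), using the Kansas bounding box from the wget example below:

```python
# Back-of-envelope check of a bounding box against the 10 square degree
# limit enforced by the public XAPI servers.

def bbox_area_sq_deg(min_lon, min_lat, max_lon, max_lat):
    """Area of a bbox in 'square degrees': lon span times lat span."""
    return (max_lon - min_lon) * (max_lat - min_lat)

kansas = (-102.12, 36.87, -94.46, 40.11)
area = bbox_area_sq_deg(*kansas)
print(round(area, 2))   # 24.82 -- too big for a public XAPI server
print(area > 10)        # True
```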

I started out downloading the data straight into JOSM using the "Open Location..." feature in the File menu, but for particularly dense areas this caused some strange problems, so I switched to using wget to download the data to a file and then opening the file in JOSM. I learned something about wget: it has a default timeout of 15 minutes. If the query takes longer than that, wget gives up and tries again, which results in two identical queries running simultaneously and slows things down even more. So remember to use the --timeout=0 option to disable the timeout. Here is a sample wget command for Kansas:
wget --timeout=0 -O KS.osm "http://localhost:8080/xapi/api/0.6/node[not(way)][bbox=-102.12,36.87,-94.46,40.11]"

Loading into JOSM
The previous step gives you a .osm file that contains all nodes that are not part of a way in the area you queried. Obviously this includes all POIs that contain useful data. The file can be rather large and may require that you give JOSM a couple of GB of memory to use. Here is a screen shot of Maine:

Now let's filter out all nodes with useful information. In this case you can see a filter in the bottom right of the screen of the form "-untagged", which hides all nodes with tags except for nodes that only have the following tags: attribution, created_by, source, fixme and note. (I won't guarantee that this is a complete list.) While nodes with only these tags aren't really useful map data, you do need to be careful, because nodes with a fixme or a note tag could be useful information to other mappers. The other, stricter JOSM filter that I sometimes used was "tags:1-999", which filtered out all nodes with any tags on them. After filtering, things are a little more manageable:
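The same boilerplate-tag check can be sketched outside of JOSM. This is a minimal illustration, not a tool I actually used: it parses OSM XML with Python's standard library and collects the ids of nodes whose tags, if any, all come from the ignorable set above (which, as noted, may not be complete).

```python
# Find nodes that carry no tags, or only "boilerplate" tags, in an
# .osm file downloaded from XAPI. Mirrors the JOSM filter described
# above; the sample data below is made up for illustration.
import xml.etree.ElementTree as ET

IGNORABLE = {"attribution", "created_by", "source", "fixme", "note"}

def deletable_node_ids(osm_xml):
    """Return ids of nodes whose tags (if any) are all in IGNORABLE."""
    root = ET.fromstring(osm_xml)
    ids = []
    for node in root.iter("node"):
        keys = {tag.get("k") for tag in node.findall("tag")}
        if keys <= IGNORABLE:        # an empty tag set also passes
            ids.append(node.get("id"))
    return ids

sample = """<osm version="0.6">
  <node id="1" lat="38.5" lon="-98.0"/>
  <node id="2" lat="38.6" lon="-98.1">
    <tag k="created_by" v="some_editor"/>
  </node>
  <node id="3" lat="38.7" lon="-98.2">
    <tag k="amenity" v="school"/>
  </node>
</osm>"""

print(deletable_node_ids(sample))   # ['1', '2'] -- node 3 is a real POI
```

Remember that, as with the JOSM filter, anything carrying a fixme or note tag deserves a second look before deletion.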

Now for a bit of data analysis. Over on the right side of the window you can see I have the "Authors" panel open, which lists the last user to touch the currently selected objects. The most frequent user in the US was woodpeck_fixbot. After some digging I found out why. The Census Bureau TIGER data was originally imported with several tags, such as source and an upload_uuid tag, on every single node. After the import, it was determined that these tags on the tens of millions of nodes in the TIGER data were unnecessary and bloated the database and the planet file so much that it was decided to remove them. This removal was done by woodpeck_fixbot. So all these nodes are originally from the TIGER import, and I deleted them without a second thought since they have been sitting in the database for several years.

However for other cases I did more checking. I investigated the changesets that left a lot of empty nodes to determine what was being done in these uploads and whether the user noticed the failure. Sometimes it was obvious that the failure was noticed and the data was re-uploaded so that the empty nodes were actually duplicates of data that was successfully uploaded later. In this case deletion was also obviously warranted. One way to check this is to just zoom in on some of the nodes with the OSM tiles in the background:

Obviously these nodes were intended to be part of the border of this lake which exists. In this case it was doubly confirmed by looking at the changeset comment where it stated that this was an upload of NHD data.

Unfortunately sometimes the uploader did not notice the error so no re-uploading took place and there were no features present. In some of these cases I contacted the user to make sure they knew about the error and see if they had a way of recovering from it. A surprising number of them knew there was a problem during the upload process and thought they had fixed it but they didn't get everything. Just about everyone I contacted said to go ahead and delete the nodes.

The other thing you really have to watch out for is currently running imports. I hit one of those in St Louis. Someone was actively importing NHD data while I was cleaning up empty nodes in the area. If I hadn't noticed this and blindly deleted the nodes, it would have caused their upload to fail horribly and led to all kinds of headaches, including even more empty nodes. One good way to spot recent activity is to select all the nodes you are thinking about deleting and then use the JOSM search feature with the string "timestamp:2011" to find nodes last touched in 2011 within your current selection. These can then be more thoroughly investigated.
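That timestamp check is easy to sketch outside of JOSM as well. Again a minimal illustration of the idea rather than the actual tool: OSM XML carries a timestamp attribute on each node, so flagging recently touched candidates is a one-liner.

```python
# Flag candidate nodes whose last edit falls in a given year -- the
# same idea as JOSM's "timestamp:2011" search. Sample data is made up.
import xml.etree.ElementTree as ET

def recently_touched(osm_xml, year="2011"):
    """Ids of nodes whose last-edit timestamp starts with the year."""
    root = ET.fromstring(osm_xml)
    return [n.get("id") for n in root.iter("node")
            if n.get("timestamp", "").startswith(year)]

sample = """<osm version="0.6">
  <node id="10" lat="38.6" lon="-90.2" timestamp="2009-07-14T08:00:00Z"/>
  <node id="11" lat="38.7" lon="-90.3" timestamp="2011-06-20T21:30:00Z"/>
</osm>"""

print(recently_touched(sample))   # ['11'] -- investigate before deleting
```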

After sufficient checking had been performed, I deleted the nodes and uploaded the deletions with a helpful changeset comment.

There is one final kink that popped up a couple of times during upload. While none of these nodes are used in a way, a few of them were members of a relation. Most of these cases were clear errors. For example, some empty node in the middle of Nebraska was a member of an administrative boundary in France. No clue how that happened. I'm assuming some editor or import bug. There were several in California that were members of turn restriction relations with a "location_hint" role. Not sure what that's supposed to mean, but of course I left them alone. In these cases the API will return an error saying that the node is still in use by a relation and can't be deleted. To resolve the conflict, you can take the relation ID from that error message and download the relation using JOSM's "Download Object" feature. It will complain that there is a conflict, which can then be resolved in the conflict editor. Then the upload can continue.

Here is a shot of LiveMapViewer after a day of work near the Great Lakes. Obviously this includes the work of other mappers as well, but most of that red is the result of my deletions.

As stated above, I have covered the entire US with this cleanup. In most places I was pretty conservative in what I deleted, in that I only deleted those nodes that I positively identified as part of an import or otherwise determined what had happened and why. Other nodes that I wasn't sure about I left in place. I have yet to receive a single complaint about any of these edits, so I think I did ok. Despite the conservative approach I have nuked just over 411,000 nodes. This translates to a space savings of about 50 MB in the planet.osm file, assuming each empty node takes about 125 bytes of space. Guess that's not even a drop in the bucket. But still a worthwhile cleanup effort. It even got me pretty high up on the monthly list of most active OSM users on the data stats page!
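For the record, the arithmetic behind that estimate (125 bytes per empty node is the assumption stated above):

```python
# Space savings in the uncompressed planet.osm file from deleting
# ~411,000 empty nodes at an assumed ~125 bytes per node.
nodes_deleted = 411_000
bytes_per_node = 125
savings_mb = nodes_deleted * bytes_per_node / 1_000_000
print(savings_mb)   # 51.375 -- i.e. "about 50 MB"
```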

One of the more extreme cases of bad data I found was a failed upload of parcel data in Arkansas that was supposed to have been cleaned up by the original uploader, but was not due to a bug in the editor. It took three changesets to clean that up. Then there was a case of ridiculous over-noding: 14,000 nodes mostly contained in a roughly four-block area near San Bernardino, California. That amount of bad data can actually make it difficult for casual mappers to do anything in the area, since some editors may not handle that volume of data very well.

Time to go get some fresh mop water and continue the cleanup.
