Wednesday, February 5, 2014

Workflow For Fixing County Borders

Well, my last post got some requests for workflow and I got another snow day off from work today so why not! Follow along as I fix Wyoming's county borders.

Two things you will need to have open: JOSM with the mirrored_download plugin installed and the Wikipedia page for the list of counties in the state. Search Wikipedia for "List of counties in <state>" to find it. Here is the one for Wyoming.

Next you need to download all the county relations and their members. I do this using the Overpass API. JOSM has a "Download from Overpass API" option in the File menu (Update: turns out this is a feature of the mirrored_download plugin, not JOSM core), but we aren't going to use that one just yet because it always adds a bbox parameter to the query, which we don't want here. Instead, use the "Open Location" option and paste in the following URL after changing the 56 to the FIPS code for your state, which should be listed on the Wikipedia page.
http://overpass-api.de/api/interpreter?data=relation["nist:state_fips"="56"];>>;out meta;

This tells the Overpass API to return all relations with a nist:state_fips=56 tag plus all of the relation members (>>) and finally to add OSM metadata that is needed to edit and upload (out meta). This includes things like version number, last edit timestamp and last user to touch the object. By default it excludes this metadata to reduce download size.
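
If you would rather not edit the URL by hand for each state, here is a minimal Python sketch that builds it from the FIPS code. The function name county_relations_url is my own invention for illustration, and the sketch percent-encodes the query, which is an addition to the raw URL shown above; the endpoint and query text are the same.

```python
from urllib.parse import quote

# Same endpoint as the URL shown above.
OVERPASS_ENDPOINT = "http://overpass-api.de/api/interpreter"


def county_relations_url(state_fips):
    """Build the Overpass URL for every relation tagged with this state FIPS code.

    county_relations_url is a hypothetical helper, not part of JOSM or Overpass.
    """
    if not (len(state_fips) == 2 and state_fips.isdigit()):
        raise ValueError("state FIPS code must be two digits, e.g. '56'")
    # Same query as in the post, with only the FIPS code swapped in.
    query = f'relation["nist:state_fips"="{state_fips}"];>>;out meta;'
    # Percent-encode the query so quotes and brackets survive copy and paste.
    return f"{OVERPASS_ENDPOINT}?data={quote(query)}"


print(county_relations_url("56"))  # 56 is Wyoming
```

Paste the printed URL into the "Open Location" dialog as described above.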

After that finishes you should have a nice outline of the state and all its counties. Make sure the number of downloaded relations matches the number of counties in the state. If not, it probably means the nist:state_fips tag is missing from some counties. If you can manage to find the relation through another means, download it by ID using the "Download object" option in JOSM. Or... I don't know... Ping me on IRC I guess :)
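
The count can also be checked outside JOSM. This is only a sketch: it assumes you saved the Overpass response as OSM XML, and the tiny sample document and the function name county_relation_names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def county_relation_names(osm_xml, state_fips):
    """Return the sorted names of relations carrying the given nist:state_fips tag."""
    root = ET.fromstring(osm_xml)
    names = []
    for rel in root.iter("relation"):
        # Collect the relation's tags into a plain dict for easy lookup.
        tags = {t.get("k"): t.get("v") for t in rel.findall("tag")}
        if tags.get("nist:state_fips") == state_fips:
            names.append(tags.get("name", "(unnamed)"))
    return sorted(names)


# Hand-written sample in OSM XML form; the ids are placeholders, not real data.
SAMPLE = """<osm version="0.6">
  <relation id="1">
    <tag k="name" v="Albany County"/>
    <tag k="nist:state_fips" v="56"/>
  </relation>
  <relation id="2">
    <tag k="name" v="Big Horn County"/>
    <tag k="nist:state_fips" v="56"/>
  </relation>
</osm>"""

names = county_relation_names(SAMPLE, "56")
print(len(names), names)
```

Compare the printed count against the number of counties on the Wikipedia list; any county missing from the names is a candidate for a missing nist:state_fips tag.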

Next, let's load up all the place nodes in the state. For this I use the "Download from Overpass API" option in JOSM. Start off by drawing a box around the desired state on the map. So far I have only done mostly square states. If your state is an odd shape then you can download the bounding box, select things outside of your state and use the JOSM "Purge" option in the edit menu to remove them from your dataset. Or just ignore them. Then paste this text into the query box:
[timeout:300];node["place"];out meta;
Now the screen will look a little busy so I set up a filter to only show me place=county nodes until I need the other stuff. To do this, create a new filter with the filter string
place=county or type:way or type:relation
This will initially hide the county borders and nodes that you want to see, so after the filter is created, make sure all three checkboxes in the filter panel are checked. The far right one inverts the filter: instead of hiding whatever matches the filter string, it hides everything except the things that match. The second checkbox from the left makes hidden objects disappear completely instead of leaving them inactive but still visible. The first checkbox can be used to quickly disable the filter when you need to see everything.

At this point my JOSM window looks like this:

Note the "Relations: 23" on the relation toolbox matches the expected number of counties in Wyoming.

At this point I usually remove the is_in tag from all the place=county nodes just to get that out of the way. Do a search (CTRL-F) for "type:node place=county" and it should select all of the county nodes. If you have some for a neighboring state loaded that you don't want to touch you can deselect them by holding CTRL while clicking. Then just delete the is_in tag.

Now for the real meat. I just go down the list of counties in Wikipedia. Bring up the first county's Wikipedia page and copy the page title for use in the wikipedia=* tag and note the name of the county seat.

Now in JOSM hit CTRL-F to bring up the search dialog and search for the county seat. If you are lucky it will find exactly one match. Note that because of the filter, you won't actually see the selected object but it is still selected! This could be considered a bug but in this case it works out. If the search found multiple things then you will have to disable the filter and find the right one.

Now:
  • Hold down shift and select the county node in addition to the city node that is already selected
  • Select the county relation in the relation list and click the "Edit" button
  • Add the two selected nodes to the bottom of the relation member list using the 4th button down on the lower right panel of the window
  • Add the role "admin_centre" to the city node
  • Add the role "label" to the county node
  • Add the alt_name tag to the relation in the top panel. This is the county name without the word "County", for example alt_name=Albany
  • Add the wikipedia tag. Remember how you copied the page title from Wikipedia? Just type "en:" and then paste the title.
Here is a shot of Albany county (Wyoming, not New York!) when I'm done with it:



And that's pretty much it. Now just repeat the process through the list of counties.

After I'm done I go back and look for any place=hamlet or place=village county seats and bump them up to place=town. I do this by doing CTRL-A in the relation list, then right clicking and doing a "Select members" which selects all the relation members. Then I use the find dialog to find all place=village (and likewise place=hamlet) objects, making sure to select the "find in selection" radio button so that only things that are already selected are found. Otherwise you will promote ALL villages in the whole state to towns! As a sanity check, use the selection toolbox to make sure the number of objects selected matches your expectation. It should be fewer than the number of counties in the state.
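
The reason for "find in selection" is easier to see in code. Here is a small Python sketch of the same logic, not something JOSM runs: the dict layout, the ids and the placeholder names are all assumptions made up for the example. Only nodes that are admin_centre members of a county relation get promoted.

```python
def promote_county_seats(relations, nodes):
    """Set place=town on admin_centre members tagged place=village or place=hamlet.

    relations: list of dicts with a "members" list of {"ref", "role"} dicts.
    nodes: dict mapping node id to its tag dict.
    Returns the ids of the nodes that were changed.
    """
    promoted = []
    for rel in relations:
        for member in rel["members"]:
            # Restrict the search to county seats, like "find in selection".
            if member["role"] != "admin_centre":
                continue
            tags = nodes[member["ref"]]
            if tags.get("place") in ("village", "hamlet"):
                tags["place"] = "town"
                promoted.append(member["ref"])
    return promoted


# Made-up data: node 101 is a county seat, node 103 is an ordinary village.
nodes = {
    101: {"place": "village", "name": "Seat A"},
    102: {"place": "county", "name": "County A"},
    103: {"place": "village", "name": "Village B"},
}
relations = [
    {"members": [{"ref": 101, "role": "admin_centre"},
                 {"ref": 102, "role": "label"}]},
]

changed = promote_county_seats(relations, nodes)
print(changed, nodes[103]["place"])
```

Node 103 stays a village because it is not a member of any county relation, and the number of changed nodes can never exceed the number of relations with an admin_centre member.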

I will leave you with a quick video of me editing the next county in Wyoming.

County Borders in OpenStreetMap

I have had a fairly steady relationship with county borders in OSM. They were originally imported before I joined the project in 2010. However, they were imported as overlapping closed ways, which is not ideal. In some places people had started working on de-duplicating the ways and turning them into relations. However, this is tedious work and some of the people who attempted it didn't quite understand how relations worked, so it was pretty much a big mess.

Throughout 2011 and 2012 I periodically came back and worked on county borders and eventually got to a place where there was a relation for every county in the nation. There are still some minor differences in tagging but I wasn't as concerned with that and tried not to change the work of other mappers as long as the relation existed and had valid geometry. At the same time I also reduced complexity at state borders. There used to be a mess of admin boundary ways. One for the state, one for each of the counties on either side of the state border and possibly also some city boundaries. They usually didn't match up very well either so it looked terrible. I didn't mess with cities but at least I made the state and county borders share a single way to reduce clutter.

At some point I also noticed (thanks to the MapQuest Open rendering which renders county names) that the place=county nodes were all placed at the extreme eastern border of all the counties. This caused the county names to be rendered in unexpected locations. So I ended up fixing those as well. Here is a screenshot of the last changeset where I moved over 1,100 nodes to the middle(ish) of their respective counties:



The Effects

So that was nice. But there was still some weirdness with counties, particularly when geocoding in Nominatim, and especially if there was a city with the same name as a county. For example, I searched for "Kearney, NE" and the first result was the place=county node for Kearney County in Nebraska. What I was looking for and expected to find first was the city of Kearney, which is in Buffalo County. Then there is the issue of Nominatim finding both the county node and the boundary relation when you search for a county. The same thing often happens with cities. Really, they are the same thing and should show up as one result. And sometimes they might. Nominatim tries to be smart about this and does remarkably well but sometimes it needs help.

You might say we should just delete the node since we already have the boundary information. But very few renderers use boundary relations to render place names from. They all rely on these nodes. The same goes for cities, states and even countries I believe. There is also the oddness of what to use for county name. The relations all have a "name=Kearney County" tag while the nodes just have "name=Kearney".

After poking around a little and bugging lonvia (maintainer of nominatim.osm.org) about it a couple of times, I learned a few things about Nominatim and was able to make some fixes. It was my intent to come back to this some day, figure it all out and fix up county relations once and for all. I still hadn't gotten around to that, but in the last few days new user revent got annoyed at the geocoding problems in Texas and really dove into Nominatim to figure out the best practices. He wrote up a diary entry on the subject and we have since talked some more on IRC. Since I had work cancelled today because of snow, I thought I would write up this post about my current understanding of things:
  • Nominatim is able to "link" multiple elements into a single feature. It tries to do this automatically but doesn't always get it right.
  • Nominatim uses Wikipedia to help determine the importance of a place.
  • Nominatim makes use of the alt_name tag.
  • After editing the map, Nominatim does not always fully reindex an area. It might depend on current workload of the server so things can seem inconsistent. But a manual reindex can be forced by an admin after a large area has undergone a lot of edits.

The linking is important because this allows the county border and the node to be linked into a single feature so that only one result is returned when you search for a particular county. It also improves addressing since Nominatim no longer has to choose between two different things to associate addresses with. This linking can be accomplished by adding the node to the county border relation with a role of "label".

The use of Wikipedia is interesting and (I think) the source of my problems with Kearney, NE that I mentioned above. Nominatim was linking the county node to the Kearney, Nebraska Wikipedia page based on the name tag. This bumped its importance up to where it was returned before the city. Now that I have added the Kearney County node to the Kearney County relation and specified a "wikipedia=en:Kearney County, Nebraska" tag, it links things to the correct Wikipedia page and the search results return the city first, as expected.


The Fix

So to summarize, I believe revent and I have settled on the ideal way to map a county. Using Kearney County as an example:
One relation with the following tags:
type=boundary
boundary=administrative
admin_level=6
name=Kearney County
alt_name=Kearney
wikipedia=en:Kearney County, Nebraska

In addition, the following tags are in widespread use:
border_type=county
nist:state_fips=31
nist:fips_code=31099

I'm not sure the border_type tag is actually used for much but it has a fair amount of existing use and seems to further differentiate borders that are the same admin_level based on local conventions - county vs borough and such. The two nist: tags were part of the import and I have actually used the state_fips as an easy way to download all county relations in a given state to work on. These FIPS codes have been withdrawn by NIST recently but are still commonly used. We may need to look at replacing or augmenting these tags with an ISO standard at some point but for now they work.

In addition to the relation tags the following relation memberships are recommended:
  • Obviously the ways that make up the border with a role of "outer" in the correct order so that they form a closed ring
  • The node with the following two tags: place=county and name=Kearney with a role of "label"
  • The place=city/town node of the county seat with a role of "admin_centre"
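
To make the scheme concrete, here is a rough Python sketch that checks one relation against the tags and memberships listed above. It is only an illustration under my own assumptions: check_county_relation, the dict layout and the member ids are hypothetical, and the check ignores the ordering and closure of the outer ways.

```python
# Tags whose values are fixed for every county relation in the scheme above.
REQUIRED_TAGS = {
    "type": "boundary",
    "boundary": "administrative",
    "admin_level": "6",
}


def check_county_relation(tags, members):
    """Return a list of problems; an empty list means the relation looks complete."""
    problems = []
    for key, value in REQUIRED_TAGS.items():
        if tags.get(key) != value:
            problems.append(f"expected {key}={value}")
    # These tags must exist, but their values differ per county.
    for key in ("name", "alt_name", "wikipedia"):
        if key not in tags:
            problems.append(f"missing {key}")
    roles = [m["role"] for m in members]
    for role in ("outer", "label", "admin_centre"):
        if role not in roles:
            problems.append(f"no member with role {role}")
    return problems


kearney_tags = {
    "type": "boundary",
    "boundary": "administrative",
    "admin_level": "6",
    "name": "Kearney County",
    "alt_name": "Kearney",
    "wikipedia": "en:Kearney County, Nebraska",
}
# Member refs are placeholders, not real OSM ids.
kearney_members = [
    {"ref": 1, "role": "outer"},
    {"ref": 2, "role": "label"},
    {"ref": 3, "role": "admin_centre"},
]

print(check_county_relation(kearney_tags, kearney_members))
print(check_county_relation(kearney_tags, kearney_members[:1]))
```

The first call reports no problems, while the second, with only the outer way left, reports the missing label and admin_centre members.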

The county seat city node as an admin centre (yes, use the British spelling) is kind of a new thing to me. I'm not 100% sold on it but it does seem like useful information to have.

If you have looked at very many county nodes you might notice that the "is_in" tag is conspicuously absent. It turns out it causes more problems than it solves and should probably be left off or removed. I have always found it a bit silly anyway. OSM is a geodatabase - there is no need to indicate location in a tag.

I have also been promoting any county seats that were place=village to place=town. I know I saw this as a thing on the OSM wiki at one point although I can't find it right now. While they may not meet the population guidelines of a "town" according to the wiki, I think county seats carry extra importance due to their governmental functions. It also means that at least some city names are rendered on maps at low zoom levels in low population areas. Without this promotion you could pan around in the Kansas/Nebraska/Dakotas area for several screens at zoom level 9 or 10 and not see a single city name on the map on osm.org. In some ways this is a rendering issue but I don't feel like this is an egregious case of tagging for the renderer.

Your Turn!

So if you are looking for something to do, jump in and make sure your county is correctly linked and tagged. I have already done Kansas, Nebraska and South Dakota but I probably won't have time to do everything in the near future and I will concentrate on lower population areas first so please feel free to jump in! If there is interest, I may do another post about my workflow to do a state-wide edit. Otherwise you can always find me on the #osm or #osm-us IRC channels if you have questions.

UPDATE! I have since written another post detailing my workflow: http://ksmapper.blogspot.com/2014/02/workflow-for-fixing-county-borders.html

Incidentally, a lot of the same concepts apply to cities as well. If you do a Nominatim search for "Los Angeles, CA" right now it will return two results. One for the node and one for the boundary relation. Making the node a member of the relation should fix that.
(sigh. In doing that search I just saw that the relation for LA is completely broken anyway. Admin boundaries suck.)

In closing, here is a screen shot of the completed county relations in South Dakota: