Wednesday, December 15, 2010

Updated TIGER map

For a full history of this post you may want to read these three posts first: 1, 2 and 3.

When I first presented my nationwide TIGER map at wherecamp5280, Samat mentioned that he was surprised that his county wasn't more heavily edited since he has done a lot of work there. A few days ago we were chatting on IRC (#osm on oftc.net) and he asked a question that made me realize that the map did indeed have a serious flaw in it. Earlier today a discussion came up on the talk-us mailing list that made me notice that the same error exists in the TIGER edited map that MapQuest (re)set up. So I decided to look at correcting it.

The error is that I was only looking at the latest version of the ways. This means that any edits between the original import and the mass edit that expanded street names were not taken into account. Both the import and the abbreviation expansion happened well before I started mapping so I didn't even consider this fact. As Antony Pegg says later in the mailing list thread, it is too expensive to go back and look at all previous versions, especially since you would need a full-history planet file: the regular ones contain only the current version of each object.

But we don't really care about the contents of the edits or who performed them... we just care that they happened. The initial import obviously created version 1 of the ways and if nothing else changed, then version 2 was created by balrog-kun in the name expansion edit. If any edits happened between the import and the name edit, then the ways will be on version 3 or higher. Thus any TIGER way with a version higher than 2 must have been edited by someone other than the import account and balrog-kun, even if one of them was the last to touch the way.
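The rule is easy to state in code. What follows is only a minimal Python sketch of the reasoning above, not the SQL query that produced the map. The function name, the `BOT_USERS` set and the `tiger_import_account` placeholder are assumptions made for illustration; the only account actually named in this post is balrog-kun.

```python
# Hypothetical set of automated accounts. "tiger_import_account" is a
# placeholder for whatever account performed the import; only balrog-kun
# is named in the post.
BOT_USERS = {"tiger_import_account", "balrog-kun"}


def is_human_edited(version, last_user):
    """Guess whether a TIGER way has been touched by a person.

    version   -- current version number of the way
    last_user -- name of the account that made the latest edit
    """
    # Original check: the latest editor is not one of the bots.
    if last_user not in BOT_USERS:
        return True
    # Added check: import created v1 and the name expansion created v2,
    # so a version above 2 implies some edit happened in between,
    # even though a bot was the last to touch the way.
    return version > 2


print(is_human_edited(2, "balrog-kun"))  # import + name expansion only
print(is_human_edited(3, "balrog-kun"))  # something happened in between
```

The second check is the whole fix: it needs only the current version number, so it works against a regular planet file without any history.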

So here is the updated map which takes this into consideration:



Click for bigger version

There are indeed several noticeable differences. For one, Samat's county (Dona Ana, NM) turned two shades lighter! Also, a large chunk of Wyoming came out of the "virtually untouched" classification. The southern tip of Texas as well as the Dallas area lightened up a bit as well. But at the end of the day, there is still a LOT of untouched data out there.

This map still isn't perfect. For example, if someone split an imported way, creating a new version 1 way as well as version 2 of the existing one and then the name change edit happened, the new way would be counted as not having been touched by a human. Also, people deleting the tiger:county tag on ways will obviously throw numbers off a little. But this should be much better at least. I'm hoping this change can be incorporated into the MapQuest version as well.

For giggles I threw the data up on OpenHeatMap as well. I'm not sure what happened to Alaska and Louisiana. For some reason OpenHeatMap didn't recognize the county names. It also did its own data classification so it won't match my map exactly. But you can zoom in and mouse over counties and get a pop-up with the name and percentage of unedited ways so that is kind of neat.

The Data
For your further enjoyment, I am providing the CSV file that was produced as the result of my SQL query. The map is nice and all but some people might want cold hard numbers. I'm thinking some statistical analysis might be interesting. I originally tried to make a graph of the distribution of values but OpenOffice Calc wasn't cooperating that night and I had to pack for a trip so I dropped it.

Click here to download the file.

The first column is the value from the tiger:county tag. Following that is the percentage of unmodified ways in that county followed by the raw numbers of edited, unedited and total ways. I used the last three columns to get the data into a form I could join to the shapefile I had. I had to split out state code and county name and then expand the state to its full name using a table I generated from the source of a Wikipedia page that lists all states and their abbreviations.
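For anyone who wants to redo that cleanup outside of a spreadsheet, here is a rough Python sketch of the split-and-expand step and of the percentage column. It assumes the tag values look like "Riley, KS" (county name, comma, state abbreviation) and that total ways equals edited plus unedited. The function names are made up for illustration and `STATE_NAMES` is deliberately truncated to two entries.

```python
# Truncated, illustrative lookup table; a real one would list every state.
STATE_NAMES = {"KS": "Kansas", "NM": "New Mexico"}


def split_county(tag_value):
    """Split a tiger:county value into (county name, full state name).

    Returns None for values that do not look like "County, ST" or whose
    state abbreviation is not in the lookup table.
    """
    name, sep, state = tag_value.rpartition(", ")
    if not sep or state not in STATE_NAMES:
        return None
    return name, STATE_NAMES[state]


def percent_unedited(edited, unedited):
    """Percentage of ways in a county that were never touched by a person."""
    total = edited + unedited  # assumption: total is the sum of the two
    return 100.0 * unedited / total if total else 0.0


print(split_county("Riley, KS"))     # ('Riley', 'Kansas')
print(split_county("30 days late"))  # None
print(percent_unedited(25, 75))      # 75.0
```

Returning None for unmatched values mirrors what ArcMap did with rows it could not join: they are simply skipped rather than guessed at.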

I didn't clean it up at all so there are some nonsensical values in there. ArcMap just ignored the values it couldn't match to the shapefile. The first line has a county name of "30 days late" - I have no clue what that is doing in a tiger:county tag. Feel free to do an XAPI request and clean it up! There is also data there for Puerto Rico and D.C. that didn't get matched up. Puerto Rico wasn't in my shapefile and I don't think D.C. was correctly matched to the abbreviation.

Enjoy!

4 comments:

  1. Southern New Mexico gives you its thanks. Great re-analysis! It's too bad the same analysis won't be available for the Mapquest's TIGER-edited map.

  2. Toby, nice work!

    Your timing is perfect. We start a new sprint today, and I'll see if we can get your suggestion in.

    Ant

  3. I'm going to speculate that the reason LA and AK didn't work in OpenHeatMap is that in LA, they have "parishes" instead of counties, and in AK they have "boroughs", and somewhere there is a disconnect on this nomenclature.

  4. @Ant: Great! This should impact areas with older mapping communities the most.

    @Ed: Possibly. Although all it is matching on is county name and state abbreviation as in "Riley, KS" so in theory this should work no matter what you call the administrative unit. Maybe I'll send Pete my file and see what he thinks.
