Update 2: Here are two more posts with nationwide maps.
If you are at all familiar with OpenStreetMap, especially in the US, you will know that most of the roads here were imported from US Census Bureau TIGER data. The import was done in 2007 using the data that was current at the time. It is fairly complete, but the accuracy leaves a lot to be desired in some areas. For example, look at this screenshot comparing the TIGER data to reality (USGS aerial imagery in this case). Some roads are off by over 200 meters. And there are even worse spots; this was just one I was able to find quickly.
Figure: Example of bad TIGER data
Well that's great but the question is how do we use the new data to improve OSM? In the software world this would be an ideal situation to use a diff utility to determine what had changed between two versions and apply patches to update the old version. But I am not aware of any geo-spatial diff utilities.
So how about just deleting the old data and re-importing the new stuff? This would obviously be a disaster in areas with active mappers. All the work they have put into correcting existing TIGER ways would be destroyed and replaced with data that is probably better than the old TIGER import but also probably worse than what local mappers have done.
I'm in Kansas. There aren't a lot of mappers here, especially in the western part of the state, mostly because there just aren't a lot of people out there. Some counties have a population density of 6 people per square mile. Oh, and the TIGER data was imported in county-sized chunks. So how about handling this on a county-by-county basis? Counties that haven't been touched could just be blown away and refreshed with the new data. If only someone could determine how much TIGER data has changed since the initial import, on a per-county basis... Oh wait, I did!
I am taking a cartography class at Kansas State University this semester. After starting to contribute to OSM I decided to take advantage of the employee tuition assistance and learn more about the subject. For $150/semester, why not! 20% of the grade in the class is for a project where the instructor turned us loose to go get our own data and make some kind of interesting map. So I got my data by importing a Kansas extract from the OSM planet file and made this map:
A couple of notes about the map:
- I let ArcMap classify the data using natural breaks. There are 105 counties in Kansas. 104 of them still have at least 78% of their TIGER data in its original state. One has had 75% of it modified. Yes, this is the county I live in. Yes, most of it was done by me. Yes, this inflates my ego.
- The total count doesn't really do much for the map in my opinion but the class project called for a bivariate map, so a bivariate map I made! Obviously counties with bigger cities have more TIGER ways. This is your dose of non-surprise for the day.
- My method of detecting changes by "local" mappers is by no means fool-proof. Basically I looked at the user who last changed a given way. If the user was one of 4 users I identified that were definitely not local mappers, I counted that road as unchanged. (technical details below)
- Labels would add a lot of clutter to the map. If you want to see which counties are which, I suggest the Counties of Kansas Wikipedia page.
So I guess the question is how do we use this information? This will have to be a discussion amongst the OSM-US community. Due to time limitations I only did Kansas but this could certainly be done for other states (particularly the sparsely populated ones) to help local mappers decide what, if anything, to do with the new TIGER data. My suggestion would be to do a fresh import of any counties that have above a certain threshold of unchanged data. Say 95%? In Kansas that would be 59 counties. If a threshold were decided upon the map classifications could be altered to reflect that number.
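To make the threshold idea concrete, here is a minimal Python sketch of how candidate counties could be picked once per-county counts are available. The function name, the tuple layout, and the sample rows are all made up for illustration, and treating the cutoff as "at or above" rather than strictly "above" is an assumption too.

```python
# Hypothetical helper: the name, input shape and 0.95 default are illustrative only.
def refresh_candidates(rows, threshold=0.95):
    """Return counties whose share of untouched TIGER ways meets the threshold.

    rows: iterable of (county, bot_count, total_count) tuples, where
    bot_count is the number of ways last edited by a non-local user.
    """
    candidates = []
    for county, bot_count, total_count in rows:
        if total_count == 0:
            continue  # no TIGER ways at all, nothing to compare
        unchanged = bot_count / total_count
        if unchanged >= threshold:  # assumption: the cutoff is inclusive
            candidates.append((county, round(unchanged, 3)))
    return candidates


# Made-up rows, not real Kansas numbers.
sample = [
    ("County A", 980, 1000),
    ("County B", 870, 1000),
    ("County C", 955, 1000),
]
print(refresh_candidates(sample))
```

Changing the `threshold` argument is all it would take to see how many counties fall in or out at a different cutoff.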
For counties that HAVE had local activity, perhaps some process could be set up using tiles from the TIGER edited map that MapQuest has provided as a background layer in JOSM or P2. Then pull up the new TIGER data on top of it and compare. Import missing roads or ones that are more accurate in the new data and haven't been touched by local mappers. That is just a thought that popped into my head as I was writing this.
I imported the October 20th planet file into an apidb using osmosis. I originally tried to do the whole planet and intended to do a larger analysis of bot activity on a worldwide basis but after 4 weeks and 350 GB it showed no signs of being anywhere close to finished so I fell back to only importing Kansas. Luckily Kansas is pretty much a big rectangle so I just used a simple bounding box. This finished in about 2 hours.
Here is the final SQL query I came up with to get me all the data I needed in one result set. Keep in mind that I only imported a bounding box around Kansas:
select v as county,
       sum(CASE WHEN user_id in (147510,7168,20587,293105) THEN 1 ELSE 0 END) as bot_count,
       sum(CASE WHEN user_id in (147510,7168,20587,293105) THEN 0 ELSE 1 END) as user_count,
       count(*) as total_count
from ways, way_tags, changesets
where ways.id = way_tags.id
  and ways.changeset_id = changesets.id
  and k = 'tiger:county'
  and v like '%KS%'
  and v not like '%;%'
group by v
order by v
In less SQLish terms: I am primarily looking at the tiger:county tag to group the query by county. It contains values like "Riley, KS" so basically I am looking for any way with a 'tiger:county=*KS*' tag. This excludes the few ways around the Kansas border that are from other states since my bounding box was just a little bigger than the state borders. However I exclude ways that have multiple values in the tag, separated by a semicolon. Typically this would come from two ways in adjacent counties being joined together. This is pretty rare so I ignored them. Once I find those ways I examine the user who last touched the way by joining through the changesets table. If the user ID is one of 4 values then I count the way as not having been modified by a local mapper. Those 4 user IDs belong to the following users:
- 147510 = woodpeck_fixbot (this is a bot that has performed various automated edits as documented on the OSM wiki)
- 7168 = DaveHansenTiger (this is the user who did the initial TIGER import)
- 20587 = balrog-kun (this user did some mass editing of TIGER ways by expanding abbreviations in street names)
- 293105 = NHD edits (NHD = National Hydrography Dataset. I'm assuming this user probably imported some rivers and ended up splitting some TIGER ways to make bridges or something)
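For readers who think better in code than in SQL, the same counting logic can be sketched in Python. This is only an illustration: it assumes the join has already been flattened into hypothetical (tiger:county value, last user ID) pairs, and the input rows are made up, with 999999 standing in for a local mapper.

```python
from collections import defaultdict

# User IDs treated as non-local (the four listed above).
NON_LOCAL_USER_IDS = {147510, 7168, 20587, 293105}


def count_by_county(ways):
    """Tally ways per county, split by who last edited them.

    ways: iterable of (tiger_county, last_user_id) pairs, a hypothetical
    flattened form of the ways/way_tags/changesets join.
    """
    counts = defaultdict(lambda: {"bot_count": 0, "user_count": 0, "total_count": 0})
    for county, user_id in ways:
        # Mirror the SQL filters: Kansas only, no multi-valued tags.
        if "KS" not in county or ";" in county:
            continue
        bucket = "bot_count" if user_id in NON_LOCAL_USER_IDS else "user_count"
        counts[county][bucket] += 1
        counts[county]["total_count"] += 1
    return dict(counts)


# Made-up input rows for illustration.
example = [
    ("Riley, KS", 7168),
    ("Riley, KS", 999999),
    ("Riley, KS", 20587),
    ("Riley, KS; Geary, KS", 7168),
    ("Gage, NE", 7168),
]
result = count_by_county(example)
print(result)
```

The multi-valued tag and the out-of-state way are both dropped, just as the `like` and `not like` conditions drop them in the query.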
Looking at the data now, I should have maybe also excluded NE2. I believe he has done a lot of work on national highway/interstate routes. I don't think there is any reason to re-import ways that are part of those systems since they have since been added to route relations and such. So those should probably be excluded from the data. Hindsight and all that.
Anyway, this query gives me a result like this:
      county      | bot_count | user_count | total_count
------------------+-----------+------------+-------------
 Allen, KS        |      3062 |         33 |        3095
 Anderson, KS     |      2879 |         64 |        2943
 Atchison, KS     |      1235 |         37 |        1272
 Barber, KS       |      2173 |         26 |        2199
 Barton, KS       |      1764 |        254 |        2018
 Bourbon, KS      |      3815 |        137 |        3952
 Brown, KS        |       907 |         46 |         953
 . . .
From there I regex'd it into a CSV file and imported it into ArcMap and joined it to the shapefile of Kansas counties that was provided to us. Easy, right?
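The regex step could look something like the following Python sketch. This is not the expression I actually used, just one way to do it, and it assumes the usual psql aligned output with a header line, a dashed rule, and pipe-separated rows. The function name is made up.

```python
import csv
import io
import re

# Data rows look like "Allen, KS | 3062 | 33 | 3095": a name, then three integers.
ROW = re.compile(r"^\s*(.+?)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*$")


def psql_to_csv(text):
    """Convert psql's aligned table output to CSV text.

    The header and rule lines contain no integer columns, so they do not
    match ROW and are skipped; a fresh CSV header is written instead.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["county", "bot_count", "user_count", "total_count"])
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            writer.writerow(match.groups())
    return out.getvalue()


sample = """\
      county      | bot_count | user_count | total_count
------------------+-----------+------------+-------------
 Allen, KS        |      3062 |         33 |        3095
 Anderson, KS     |      2879 |         64 |        2943
"""
csv_text = psql_to_csv(sample)
print(csv_text)
```

Because the county values contain a comma, the `csv` module wraps them in quotes, which keeps "Allen, KS" in a single field when the file is loaded for the join.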