Tuesday, October 23, 2012

Licensed to Map (What happened to Los Angeles?!)

I recently got back from the State of the Map - USA conference. It was great and you should have been there! But for those who weren't... I presented one session on Saturday afternoon with the same title as this post. Instead of just putting my slides out there I thought I would write a blog post to tell more of the story. This is a fairly long post but don't worry, it has a lot of pictures! (click to see full resolution)

So, the license change. It happened. We lost some data. But what happened before that to try and save as much as we could? And what exactly did we lose?

First, a brief timeline:
  • Many moons ago, the OSM Foundation voted to change our license from Creative Commons to the Open Database License. This actually happened before I knew that OSM existed.
  • In order to make the change, permission had to be secured from everyone who had contributed map data to the database.
  • Contributions from anyone who did not agree to the new terms had to be removed from the database.
  • At the end of 2011 the board set April 1st, 2012 as a target date for the license switch. This turned out to be wildly optimistic but at least it was a concrete goal to shoot for instead of just an ongoing process with no end in sight.
  • The process of removing non-relicensable data was done by a bot that went through the entire database in the 2nd half of July.
  • After a little more cleanup, the first ODbL planet file was finally delivered in September.
Even though the April 1st target date was not really reasonable to hit (especially in hindsight, of course), it still gave the community something to work towards. And work we did.

Contacting inactive mappers

The first priority was to contact undecided users to make them aware of the change and let them know that they needed to log in and indicate a decision. The foundation sent out several rounds of emails to undecided users, trying to get them to respond. In addition to the emails from the foundation, users actively contacted other people in their area to try and make them aware of the decision. Lastly, the foundation supplied the account email addresses of some of the accounts with the most map data to a small group of volunteers for more targeted contact.

The result of this contact effort can be seen in this graph that I have been presenting since the beginning of the process:

The green line corresponds to the scale on the left while the red line follows the scale on the right. Of those who responded, it was a 99.5% landslide in favor of re-licensing their data. You can see that the people who were opposed to the change were very quick to log in and enter their decision. The big bumps in the green line clearly show when the mass emails were being sent out by the foundation.

I was the volunteer who did targeted contact here in the U.S. It was an interesting project. Simon Poole came up with a prioritized list of users to contact per country/region based on his site which shows undecided users ranked by how much map data they have in the database. (http://odbl.poole.ch/)

I ended up contacting people through a variety of websites including obvious social networking sites like Facebook, Google+, Twitter, Flickr and LinkedIn. I also sent a few messages through more obscure sites like tripadvisor and soundcloud. Beyond that, I attempted to determine their real name and any contact info I could get my hands on. A simple web search with their OSM user name plus the location of their first edit was remarkably effective. Often it would yield a Twitter account which would then give me a real name and lead to accounts on other sites. Here is a screenshot of the spreadsheet I used to track my contact with users:

I kept track of dates I sent messages, when I heard back and any accounts around the internet that were possibly related to the user in question. All in all, I attempted to contact a total of 168 users. 71% of them accepted the new contributor terms. Only one or two of those who responded did not accept the new terms. The rest I was unable to contact.

As part of my contact efforts I ended up calling 7 people on the telephone. This led to a few awkward conversations. One person said "I don't remember giving OSM my phone number" to which I replied "Well... you didn't. I stalked you across the internet." Another call landed me in a conversation with a high school student's mother. When I asked to speak to him she asked "who is this?" to which I responded that her son did not know me but that I am from a website that her son had used. I am happy to report that both of these users agreed to the new terms within 5 minutes of talking to me! However a few times I felt kind of like this:

Found on http://www.freewebs.com/jhnbytwoo/

Remapping

Besides contacting undecided users, there was also an effort to remove and replace the contributions of users who would not agree to the new terms. There were several tools to assist us in this effort. Unfortunately I didn't get good screenshots of most of them. At the time I wasn't planning on doing a presentation about this and now that the license change is over, they are either shut down or not displaying useful information.

But they basically all had a different way of highlighting map objects that were likely going to be either deleted or modified by the removal of license-tainted data. Both JOSM and Potlatch 2 had a license change plugin/mode that would draw red halos around objects that were in danger of deletion and yellow ones around things that would be modified. Next was the OSM Inspector license change view. It also displayed objects that would be touched by the bot and let you click on them to get details as well as links to the object history and edit links to go fix it. These were based on the "Quick History Service" which was a pretty good approximation of the algorithm the bot would use to remove data. There was also cleanmap/badmap, which were two different map renderings. Cleanmap showed a rendering of what the map would look like after the license change and badmap showed only things that needed to be remapped.

The license bot

In the second half of July, the "license bot" was run over the planet to remove all license tainted data from the database. After the full bot run there was some additional cleanup to do. For example there were some obvious cases of copy/paste remapping where people "remapped" by straight up copying the data of users who had not agreed to the new terms. This is obviously copyright infringement and needed to be taken care of. In the bot removal process, data which was originally created by decliners or non-responders was completely removed from the database. Objects which had been modified by decliners/non-responders were reverted to an older state to remove the data that couldn't be relicensed.
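The two rules in that last part (delete what decliners created, roll back what they modified) can be sketched roughly in Python. This is only an illustration of the idea and not the bot's actual code: the `Version` type, the `redact` function and the user names are made up for the example, and the sketch simply falls back to the last version before the first non-agreeing edit, which is cruder than what the real bot did.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Version:
    """One entry in an object's edit history (hypothetical, simplified)."""
    number: int
    user: str
    tags: dict


def redact(history: List[Version], agreed: Set[str]) -> Optional[Version]:
    """Return the version that may be kept, or None if the object must go.

    `history` is ordered oldest first. `agreed` holds the users who
    accepted the new contributor terms.
    """
    # Rule 1: an object created by a decliner or non-responder is removed.
    if history[0].user not in agreed:
        return None
    # Rule 2: otherwise keep the last version made before the first edit
    # by someone who did not agree (a deliberate simplification).
    kept = history[0]
    for version in history[1:]:
        if version.user not in agreed:
            break
        kept = version
    return kept


if __name__ == "__main__":
    agreed_users = {"alice", "carol"}
    road = [Version(1, "alice", {}), Version(2, "bob", {}), Version(3, "carol", {})]
    print(redact(road, agreed_users))
```

In this simplified form, carol's later edit is thrown away along with bob's, which is exactly the kind of collateral loss that made remapping ahead of time worthwhile.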

The official number I heard after the process was done is that OpenStreetMap lost 1.2% of its data. Of course removing 1% of the Interstate system makes it completely useless for routing across the country so the true impact of the bot was much higher but I'm not sure it is possible to put a specific number on it. The effects of the bot were also highly variable by location. Australia and Poland got hit disproportionately hard as did some cities here in the U.S.

The bot had a slight blind spot having to do with splitting ways. When a way is split in OSM, a brand new way is created and attributed entirely to the user who split the way. No link is kept to the user who originally created the long way. This issue was known about ahead of time but the effort required to try and guess on way splits was deemed too high in comparison to the difference it would have made on the data.
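To make the blind spot concrete, here is a toy model of a way split in Python. The `Way` class, the `split_way` function and the user names are invented for illustration and are not how any editor or the OSM API actually implements splitting. The point is only that the second piece starts a fresh history under the splitter's name.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Way:
    way_id: int
    creator: str          # user credited with version 1 of this way
    node_ids: List[int]


def split_way(way: Way, at_index: int, splitter: str, new_id: int) -> Tuple[Way, Way]:
    """Split `way` at the node at `at_index`.

    The first piece keeps the original id and creator. The second piece
    is a brand new way credited entirely to whoever did the split.
    """
    first = Way(way.way_id, way.creator, way.node_ids[: at_index + 1])
    second = Way(new_id, splitter, way.node_ids[at_index:])
    return first, second


if __name__ == "__main__":
    original = Way(1, "tiger_import", [10, 11, 12, 13])
    kept, new = split_way(original, 2, "decliner", 2)
    print(kept)
    print(new)
```

If "decliner" never agreed to the new terms, the second piece looks like their original work even though every node in it came from the earlier way.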

So, what does 1.2% of OSM data look like? Here are two graphs from http://osmstats.alltogetherlost.com/ which show the total number of nodes and ways in the OSM database over time:


You can clearly see a small dip in July. But really, it's little more than a blip. The growth of data in OSM appears to be kind of unstoppable.


The bot in America

Since I presented this information at SOTM-US, I focused on some U.S. specific issues. We have some unique considerations, mostly having to do with the fact that a lot of our road network is imported from the TIGER/Line data set which is public domain. This means that the vast majority of our road network is not impacted by license considerations. However some of the TIGER fixup work that has happened since the import was definitely impacted. For example, there was one armchair mapper from Europe who did a lot of work on the interstate system in the eastern 1/3 of the country which was lost in the license change. As I mentioned above, way splitting was not taken into account by the license bot. Since splitting ways is a relatively common thing to do when fixing up TIGER data, this did lead to some data being removed that was technically from a public domain source. Luckily it tends to be relatively easy to restore this data, especially on interstates where a lot of the splitting happened to map bridges which are clearly visible in aerial imagery.

To visualize the bot's activities, I made a map by analyzing the daily diffs that were published during the time the license bot was running in America. It only shows node edits. Red means the bot deleted a node, blue means it modified a node, likely reverting it to an earlier version.
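For anyone who wants to try something similar, this kind of analysis can be sketched with nothing but the Python standard library. This is a minimal sketch and not the script used for the map: it assumes the diffs are in the osmChange XML format, that deleted nodes still carry `lat`/`lon` attributes, and that the bot's edits can be picked out by user name. The name `example_bot` and the sample data are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# Tiny hand-written osmChange document used only for demonstration.
SAMPLE = """<osmChange version="0.6">
  <modify>
    <node id="1" version="3" user="example_bot" lat="39.1" lon="-96.6"/>
    <node id="2" version="2" user="someone_else" lat="39.2" lon="-96.5"/>
  </modify>
  <delete>
    <node id="3" version="2" user="example_bot" lat="34.0" lon="-118.2"/>
  </delete>
</osmChange>"""


def bot_node_edits(osc_xml: str, bot_user: str) -> Iterator[Tuple[str, float, float]]:
    """Yield (color, lat, lon) for each node the given user modified or deleted."""
    root = ET.fromstring(osc_xml)
    for action in root:
        if action.tag not in ("modify", "delete"):
            continue  # newly created nodes are not drawn on the map
        for node in action.findall("node"):
            if node.get("user") != bot_user:
                continue
            lat, lon = node.get("lat"), node.get("lon")
            if lat is None or lon is None:
                continue  # no position available, nothing to draw
            color = "red" if action.tag == "delete" else "blue"
            yield color, float(lat), float(lon)


if __name__ == "__main__":
    for edit in bot_node_edits(SAMPLE, "example_bot"):
        print(edit)
```

Real daily diffs are far too large to hold in a string like this, so a streaming parser such as `ET.iterparse` would be the more practical choice there.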

Well that looks kind of bad. But it is exaggerated a bit because of the rendering style and low zoom. Let's look at some individual locations. First up: Kansas. 


Well that's pretty good. I got most of the license problems here in Kansas taken care of either by contacting users and getting them to agree to the change or by remapping things.

Next up is Dallas. On the national map it looks like a fairly decent sized blob.
Upon closer inspection, it isn't really that bad. There are a few neighborhoods that were removed from the map and some damage to that highway heading east. But the overall structure of the road network isn't too bad.

Next, South Carolina:
That's a bit more serious. Obviously there was a lot of damage to the interstate system here.

And finally, Los Angeles:
OK, now that's pretty heavy. Most of it is due to a single, very prolific, mapper who refused to agree to the new terms.

So we know where the license bot was most active. But what exactly did it do to our data? Turns out, many different things. Here is an example of some pretty obvious damage:
The original TIGER import was pretty terrible in this area. A declining user corrected a bunch of the road geometry. Then an accepting user came along and added some details like turning circles and gates. When the bot ran, it did not touch the new additions made by the accepting user but moved all the nodes touched by the declining user back to their original positions, shearing the map rather severely.

Here is a somewhat amusing example:
In this case, the original TIGER had this onramp going the wrong direction. A declining user reversed the direction of the way. Then an accepting user came along and refined the shape of the way by adding more nodes to it. The bot reversed the order of the original nodes but not of the new nodes added by the accepting user, leading to the odd "zig" in the middle.

Now for a little more subtle damage:
You can see some obviously weird zigs and zags in the roads but nothing seems to be completely out of whack here. However, when viewed in JOSM, the damage becomes clearer.
In the upper left, there are nodes that used to be part of a way. Both the nodes themselves and the way made it through the license bot. But the fact that the nodes were a part of the way was lost leaving empty nodes and a misshapen way. In the lower part of the image you can see that there is a way which does not actually connect to any of the ways it crosses. So much for routability.

Fixing the damage

After the license bot ran, the community immediately started focusing on how to fix the damage. Several tools were developed or adapted to help guide and focus this effort. At the end of the day, we have risen to the challenge and I have actually been surprised at how quickly most of the damage has been fixed. 

I made a rough graph of remapping activity based on changeset comments. I used the weekly changeset dump along with my ChangesetMD tool to create a database of changeset metadata which I then queried for these keywords in the changeset comments: odbl, license, remapping, redaction.
It starts back in December 2011 when remapping started in earnest. The orange line follows the axis on the left and indicates the number of objects touched in these remapping changesets. This includes deletes, modifications and additions. The blue line follows the axis on the right and is just the number of remapping changesets created per day. 
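The keyword matching behind a graph like this is simple enough to sketch in plain Python. This is not the actual ChangesetMD query. It assumes the changeset metadata has already been reduced to `(day, comment, num_changes)` tuples, and the function names and sample rows are made up for the example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

KEYWORDS = ("odbl", "license", "remapping", "redaction")


def is_remapping(comment: str) -> bool:
    """True if the changeset comment mentions any of the keywords."""
    text = comment.lower()
    return any(keyword in text for keyword in KEYWORDS)


def daily_totals(
    changesets: Iterable[Tuple[date, str, int]]
) -> Dict[date, Tuple[int, int]]:
    """Map each day to (remapping changesets, objects touched)."""
    totals = defaultdict(lambda: [0, 0])
    for day, comment, num_changes in changesets:
        if is_remapping(comment):
            totals[day][0] += 1
            totals[day][1] += num_changes
    return {day: (count, objects) for day, (count, objects) in totals.items()}


if __name__ == "__main__":
    rows = [
        (date(2012, 1, 28), "ODbL remapping of coastline", 5000),
        (date(2012, 1, 28), "added a cafe", 1),
        (date(2012, 1, 28), "license cleanup", 40),
        (date(2012, 1, 29), "Redaction fixes on I-70", 120),
    ]
    print(daily_totals(rows))
```

A plain substring match like this is rough: it will miss remapping changesets that did not mention any of the four keywords and may catch a few unrelated ones, which is why the graph is only a rough one.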

I believe the huge spike in the orange line at the end of January was me and a few other people really putting our backs into getting the coastlines cleaned up. Paul Norman came up with a good way to highlight shoreline ways that were in danger of being damaged by the bot and it turns out there were a lot of them. I nuked most of the shoreline along the west coast and replaced it with NHD data. Some others did the same or manually remapped the shoreline around the Great Lakes and the east coast.

In May and June, remapping activity seems to have slowed to a crawl. I think people were just getting tired of it. I certainly was. The big blue spike on the right happened a couple of days after the license bot started running. It was actually still in progress at the time. But since Europe was hit first, it seems that the Europeans made quick work of cleaning up after it.

My own contribution to the remapping effort was a map that highlighted ways that seemed to be missing their highway=* tag. I noticed that a fair number of interstates and on/off ramps had a lot of their tags stripped but one of the following tags remained: oneway, lanes, bridge, tunnel. Any of these tags without a highway tag is rather suspicious so I put them on a map to highlight and guide people to them for remapping. Once in the area of such a way I often found other problems in addition to the missing highway tag. Here is an animation of how my map evolved between August 8th and September 30th:

It started out only showing ways with a oneway tag. You can see some new red appear on the map when I added the other tags to the list. By August 28th, the U.S. was completely clear on this map. The Europeans still have quite a mess to clean up. You can find the map still online here: http://ni.kwsn.net/~toby/OSM/maps/redaction.html
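The test behind that map boils down to a one-line check on a way's tags. Here is a minimal Python sketch of it. The function name `looks_stripped` is made up, and how the tags get extracted from the database in the first place is left out.

```python
SUSPICIOUS_KEYS = {"oneway", "lanes", "bridge", "tunnel"}


def looks_stripped(tags: dict) -> bool:
    """True if a way has road-like tags but no highway=* tag."""
    return "highway" not in tags and bool(SUSPICIOUS_KEYS.intersection(tags))


if __name__ == "__main__":
    print(looks_stripped({"oneway": "yes", "lanes": "2"}))
    print(looks_stripped({"highway": "motorway", "oneway": "yes"}))
```

A check like this is only a hint for a human mapper. A railway bridge, for example, legitimately carries a bridge tag without a highway tag, so some flagged ways will turn out to be fine.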

Another effort to detect problems on major roads was the routing grid that Kai Krueger set up. It queried OSRM periodically for a route between 40 cities and compared the distance to a reference distance. If it was off by a lot, it indicated a likely problem along an interstate or US highway between the two cities. The difference was color coded and put into a display matrix. Here is another animation showing how the routing grid evolved between July 21st and September 30th:
Again, you can see that things got fixed up pretty quickly. Also, I'm not sure why everything went orange again right at the very end.
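The comparison at the heart of a routing grid like this can be sketched in a few lines of Python. This is not Kai's code: the thresholds, the category names and the `classify` function are assumptions chosen for illustration, and the step of actually asking a routing engine for a distance is left out.

```python
from typing import Optional


def classify(route_km: Optional[float], reference_km: float) -> str:
    """Compare a routed distance to a known-good reference distance.

    `route_km` is None when the router could not find any route at all.
    The 5% and 25% thresholds are arbitrary example values.
    """
    if route_km is None:
        return "no route"
    deviation = abs(route_km / reference_km - 1)
    if deviation <= 0.05:
        return "ok"
    if deviation <= 0.25:
        return "suspect"
    return "broken"


if __name__ == "__main__":
    # Hypothetical city pair with a 1000 km reference distance.
    for routed in (1010.0, 1150.0, 1900.0, None):
        print(routed, classify(routed, 1000.0))
```

A route that comes back much longer than the reference usually means the router had to detour around a gap in a major road, which is exactly the kind of damage the grid was meant to surface.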

One other strategy I used to fill in hard hit areas was to re-import the area from new TIGER 2011 and even some 2012 data when it became available. To do this, I downloaded both the shapefile and the feature names relationship file from the Census Bureau. The feature names file allowed me to do a better job of expanding street names from their abbreviated forms. I joined the two files in QGIS and then ran them through ogr2osm using my own translation file which extracted only a few of the shapefile attributes into OSM tags and expanded abbreviations. This gave me a .osm file which I was able to load into JOSM and then selectively copy roads from there into OSM.
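To give a flavor of the abbreviation expansion step, here is a small Python sketch. It is not the translation file described above and it does not use the ogr2osm API. The lookup tables are deliberately tiny, the `expand_name` function is made up, and it guesses purely from word position, which is the crude approach that having the separate name parts from the feature names file helps you avoid.

```python
SUFFIXES = {
    "St": "Street",
    "Ave": "Avenue",
    "Dr": "Drive",
    "Rd": "Road",
    "Blvd": "Boulevard",
}
DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West"}


def expand_name(name: str) -> str:
    """Expand a leading direction and a trailing street type, if present."""
    words = name.split()
    if len(words) < 2:
        return name  # a single word is left alone
    if words[0] in DIRECTIONS:
        words[0] = DIRECTIONS[words[0]]
    if words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    return " ".join(words)


if __name__ == "__main__":
    for raw in ("N Main St", "Avenue E", "Sunset Blvd"):
        print(raw, "->", expand_name(raw))
```

Only the first and last words are touched so that a name like "Avenue E" is not mangled into "Avenue East".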

Here is a screen shot of JOSM while I had some TIGER data loaded in the background with my "inactive" color set to a bright green to contrast the red of selected ways in the OSM layer on top of it. This highlighted areas that were missing in OSM. My apologies to any red-green colorblind readers.
After zooming in on the desired area I used the search string "type:way allinview" to automatically select all the ways in the TIGER layer, copied them and then pasted them into the OSM layer. Then of course I had to do some quality assurance to connect the new ways to the existing data and remove some common TIGER problems such as over-noded ways and duplicate ways that were created when TIGER had more than one name for a road. Typically this was just things like "Drive" vs "Avenue" or some other minor spelling difference.

One excellent tool that Martijn van Exel came up with is called "Remap-a-tron." I did not talk about it at the conference at all because Martijn did a presentation about it right before my presentation. But you can read about it in his own blog post. Basically, he did some digging to find ways that had been deleted by the license bot. Then when you hit the web page, it would display one deleted way at a time and give you the opportunity to go remap it in your favorite editor using aerial imagery, surrounding data that wasn't removed or whatever other source of license acceptable data you could find. After remapping the way you would come back and indicate that you had remapped it. It would record the remapping and immediately show you the next way that needed attention. This tool has now been rebranded as "Maproulette" and is available at http://maproulette.org where it has been retasked to find other common map errors that are easy to fix.

Finally, there is also a "Redaction Bot" view in the Geofabrik OSM Inspector tool. It shows all objects that were touched by the bot. You can click on them to get details about what the bot did to a particular object. I don't think it tries to match things that were deleted by the bot to new mapping done to replace them so over time the signal to noise ratio of this tool will decrease. Here is a screenshot of it with one way selected.


The End

In conclusion, the license change definitely had a major impact on OpenStreetMap. Both our map data and the community were impacted, sometimes to great detriment. However we are getting past it remarkably quickly thanks to the hard work of mappers like you. Some damage will remain in the map for a long time to come. Deleted points of interest aren't nearly as obvious as broken interstates and will take much longer to fill back in. But at the end of the day, we will survive and continue to thrive. In some ways, 2012 really was the year of OpenStreetMap, despite the license change. Let's make 2013 even better!

Map the planet!

