Friday, April 8, 2011

Keeping a database up to date with osmosis

This is a follow-up to my previous post about importing a planet file into a pgsnapshot database schema. However, the same basic process applies to other schemas that osmosis can write to.

To briefly set the stage: after downloading the weekly planet file and processing/importing it into a database, the data is now probably 3 or 4 days old. OpenStreetMap provides both minutely and hourly change files that can be applied to the database to bring it up to date with what is live right now in the master OSM database. If you think about it, that is actually a pretty sweet deal. Someone on the other side of the planet can spot a new restaurant while on their lunch break, use an application on their phone to upload it to OSM, and you will get the change applied to your local database in less than 120 seconds. Does this blow anyone else's mind just a little? Open data is awesome like that.


Initializing the working directory

So what does it take to pull off this feat of mad wizardry? The "Replication Tasks" section of the osmosis usage page is of interest. Again, not the easiest read in the world. Let's break it down. First, you need to set up a working directory for osmosis to keep some information about the current state of the data. To do this, create a directory. I chose to put this in a hidden directory in my home dir: ~/.osmosis-minutely/. Next, enter this directory and execute this command (again, I am assuming you have osmosis installed in your home directory):

~/osmosis/bin/osmosis --rrii

This will create two files: download.lock and configuration.txt. You can ignore download.lock. It is used by osmosis to ensure that only one copy is running at a time. The interesting bits are in configuration.txt. First of all, this is where you have to decide if you want to do hourly or minutely updates. Minutely updates let you stay in sync with live changes to the map as people upload them, as described above. However, this means that your computer will always be spending anywhere from maybe 10 seconds to a full minute every minute importing OSM data. This won't make it unusable for other things but will certainly have an impact on the system. In theory, using the hourly files might be slightly more efficient. I haven't played with this to determine if this is true or not, but Paul Norman indicated that it didn't seem to be for him. Once you have decided which to use, put either http://planet.openstreetmap.org/minute-replicate or http://planet.openstreetmap.org/hour-replicate in the baseUrl line of the configuration.txt file.
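Editing the file by hand works fine, but if you would rather script it, here is a minimal sketch. It assumes configuration.txt holds plain key=value lines (possibly with spaces around the equals sign), and set_config_value is a made-up helper name, not something osmosis provides.

```shell
# Sketch: change one setting in osmosis' configuration.txt.
# Assumption: settings are stored as "key=value" lines, optionally with
# spaces around "=". set_config_value is a hypothetical helper name.
# If the key is not present in the file, nothing is changed.
set_config_value() {
    local file="$1" key="$2" value="$3"
    local tmp rc
    tmp="$(mktemp)" || return 1
    # "|" is the sed delimiter so URLs containing "/" need no escaping.
    # Values containing "|" or "&" would need escaping first.
    sed "s|^${key}[[:space:]]*=.*|${key}=${value}|" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

For example, from inside the working directory: set_config_value configuration.txt baseUrl http://planet.openstreetmap.org/minute-replicate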

Now you need to determine the time stamp of your planet file. It is in the 2nd line of the planet file. You can get it like this:

bunzip2 -c planet-110309.osm.bz2 | head

In my case it reads timestamp="2011-03-09T01:11:06Z"
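If you want just the value rather than eyeballing the header, a small filter can pull it out. This is only a sketch: it assumes the header carries a timestamp="..." attribute exactly as shown above, and planet_timestamp is a hypothetical name.

```shell
# Sketch: print the first timestamp="..." value found on standard input.
# Assumption: the planet header contains the attribute in exactly this form.
# planet_timestamp is a hypothetical helper name.
planet_timestamp() {
    sed -n 's/.*timestamp="\([^"]*\)".*/\1/p' | head -n 1
}
```

It is meant to sit at the end of the pipeline above: bunzip2 -c planet-110309.osm.bz2 | head | planet_timestamp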

Now you can do it the easy way or the hard way. The easy way is to use the replicate-sequences tool. Note: this tool only supports minutely updates. As mentioned in the osmosis documentation, I would suggest taking the time stamp from above and subtracting an hour and using that time just to make sure you don't miss an update. After you enter the time and click the button, take the text it returns and copy/paste it into a file named state.txt in the working directory (~/.osmosis-minutely/)
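Subtracting the hour by hand is easy enough, but it can also be scripted. The sketch below assumes GNU date (the one that ships with most Linux distributions); BSD and macOS date use different flags. rewind_timestamp is a hypothetical helper name.

```shell
# Sketch: move an ISO 8601 UTC timestamp back by a number of seconds
# (one hour by default).
# Assumption: GNU date from coreutils is available.
rewind_timestamp() {
    local ts="$1" seconds="${2:-3600}"
    local epoch
    # Convert to seconds since the epoch, subtract, and format again.
    epoch="$(date -u -d "$ts" +%s)" || return 1
    date -u -d "@$((epoch - seconds))" +%Y-%m-%dT%H:%M:%SZ
}
```

With the time stamp from my planet file, rewind_timestamp 2011-03-09T01:11:06Z gives a time one hour earlier to feed into the tool.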

If you want to know what is going on behind the scenes, read on here. Otherwise feel free to skip to the next section. Every minute (and every hour) files are created on http://planet.openstreetmap.org that contain the changes that have been uploaded to the database since the last minutely/hourly file was generated. These files contain only the new attributes of objects that have been changed. So applying a change twice is fine. It just updates the data to the same values it had before. This is why it is safe to go back an hour from your time stamp. However, if you miss an update, all the objects affected by that update will be out of sync with the master database until (if ever) they are updated again.

Along with the change file itself, there is also a file containing metadata about the change. For example, this one. It has a sequence number that increments for each change file plus a time stamp of when exactly the last change in the associated change file was applied. If you want to find the correct state file without using the tool listed above (or if you want to use hourly updates) you will need to browse through the directories to find an appropriate state file to start off from, again going back in time at least a few minutes or an hour. The time stamps listed next to the files are one hour off from the time stamp inside of the file (the time stamp in the file is in UTC and the server is one time zone away from UTC) so you can use it as a rough guide for locating the correct file, but be sure to check the time stamp inside of the file before using it! As above, paste the contents of the appropriate state file into your ~/.osmosis-minutely/ directory as state.txt.
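When browsing the directories it helps to know how a sequence number turns into a path. The sketch below assumes the layout AAA/BBB/CCC.state.txt, where AAABBBCCC is the sequence number zero-padded to nine digits; check this against the directory listing before relying on it. state_path is a hypothetical helper name.

```shell
# Sketch: map a replication sequence number to its state file path.
# Assumption: files are laid out as AAA/BBB/CCC.state.txt, where AAABBBCCC
# is the sequence number zero-padded to nine digits.
# Pass the number without leading zeros so printf does not read it as octal.
state_path() {
    local padded
    padded="$(printf '%09d' "$1")" || return 1
    printf '%s/%s/%s.state.txt\n' \
        "$(printf '%s' "$padded" | cut -c1-3)" \
        "$(printf '%s' "$padded" | cut -c4-6)" \
        "$(printf '%s' "$padded" | cut -c7-9)"
}
```

Appending the result to the baseUrl from configuration.txt should give the address of the matching state file, again assuming that layout.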

Catching up

You should have everything in place now! Time to fire up osmosis! Paul reached this point long before I did (because I was playing around with things) so I swiped this command from him:

~/osmosis/bin/osmosis --rri workingDirectory=~/.osmosis-minutely/ \
                      --sc \
                      --wpc user="xapi" database="xapi" password="xapi"

Let's look at the command:
  • --rri or "read replication interval": Not much to add to the wiki description. This makes osmosis download a set of change files and tells it where to look for the state.txt file to know which files to download.
  • --sc or "sort change": I'm not 100% sure about the function here. I'm assuming it puts the changes in chronological order so that you don't end up applying a newer change before an older change, which would lead to an outdated object in your database.
  • --wpc or "write pgsql change": Writes the change to the database using the given credentials (modify to match your database/username/password). This task can be replaced by other "write change" tasks like --wdc or --wxc to write changes to other destinations.

By default osmosis will download one hour's worth of changes (even if you are using minutely updates - it will just download 60 at a time) and then sit there applying them. This will take anywhere from a few minutes to a full hour. It just depends on how many changes are in the files it is processing and how complex the changes are. Applying changes to huge multipolygons is known to be slow, and weekends are always a lot busier than weekdays.

If you want quicker feedback to make sure it is working, open the ~/.osmosis-minutely/configuration.txt file and change the maxInterval value to something lower. This controls the maximum time span of changes, in seconds, that osmosis will download in one go. As mentioned above, the default is one hour (3600 seconds), so you can set it to 60 seconds to download just one minute at a time, which should only take a few seconds to apply. Once the process finishes you can check the state.txt file. The time stamp and sequence number should have changed to reflect the update that was applied.
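To compare state.txt before and after a run without opening it each time, a tiny reader helps. This is a sketch that assumes state.txt is a Java-style properties file with sequenceNumber= and timestamp= lines, where colons in the time stamp may be escaped with a backslash. state_value is a hypothetical helper name.

```shell
# Sketch: print one value from state.txt, e.g. sequenceNumber or timestamp.
# Assumption: state.txt is a Java-style properties file, so colons in the
# time stamp may appear as "\:"; the backslashes are stripped here.
state_value() {
    local file="$1" key="$2"
    sed -n "s/^${key}=//p" "$file" | tr -d '\\' | head -n 1
}
```

Running state_value ~/.osmosis-minutely/state.txt timestamp before and after an osmosis run should show the time stamp moving forward.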

Due to my tinkering around and multiple runs I did with osmosis, I had to apply 488 hours' worth of updates to my database. I did it by hand in 10 and 48 hour chunks by changing maxInterval to 36000 and 172800, respectively. After it was all said and done, it took about 163 hours to apply 488 hours of updates. That averages out to about 20 minutes of processing time per hour of updates. But I saw variation, especially in the 10 hour runs: anywhere from 1.8 to 4.5 hours to apply 10 hours of updates. As before, disk I/O tends to be the bottleneck, and I'm using a software RAID5 composed of three 1TB 7,200RPM drives, so not exactly tuned for speed.

Now all you need to do is put the osmosis command into a minutely or hourly cron job and you're done! And really you can do this before the database is all caught up. I just did the catching up by hand to monitor and record progress. As I mentioned above, it is safe to call osmosis every minute. Osmosis puts a lock on the download.lock file and if the previous run is still executing, the next one will exit immediately without doing anything. I chose to put the command in a script and then put a call to the script into cron. You might want to redirect stdout and stderr to either a log file or /dev/null to avoid cron sending out emails every time the script runs. My cron job looks like this, minus the newline put in for readability here:

* * * * * /home/toby/bin/osmosis-update.bash >> 
          /home/toby/.osmosis-minutely/osmosis.log 2>&1
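The script itself is not shown here, so the following is only a guess at what such a wrapper could look like, reusing the osmosis command from above. OSMOSIS_BIN, WORK_DIR and run_update are hypothetical names, and the credentials are the same placeholders as before.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper in the spirit of osmosis-update.bash. The real script
# is not shown in this post, so everything below is an assumption.
# OSMOSIS_BIN, WORK_DIR and run_update are hypothetical names.
OSMOSIS_BIN="${OSMOSIS_BIN:-$HOME/osmosis/bin/osmosis}"
WORK_DIR="${WORK_DIR:-$HOME/.osmosis-minutely}"

run_update() {
    # Stamp each run so the log file shows when it started.
    echo "=== update started $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
    # Same pipeline as above: read replication interval, sort, write to pgsql.
    "$OSMOSIS_BIN" --rri workingDirectory="$WORK_DIR" \
        --sc \
        --wpc user="xapi" database="xapi" password="xapi"
}

# When saving this as a script, end it with a call to run_update.
```

Since cron redirects the script's output to the log file, the echo line gives each entry a start time to match against state.txt.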

And that's all, folks! You now have an automatically updating OSM database to query as you please. Have fun!

5 comments:

  1. Sometimes when the OSM servers have downtime, replication can be delayed, and these delays can sometimes cause osmosis to get stuck on a particular changeset. The solution to this is to kill the associated processes and let them restart automatically. To catch these problems you can periodically check the date in state.txt, and if it's more than a day off (to account for time zone differences) something is stuck.

    What I found when comparing the time for 60 minutes of minutely updates vs. 1 hourly update was that the update-to-update variation was more than the difference. I found that each minute's worth of updates took 10-90 seconds on my 4 disk 7200 RPM RAID 10 array. Larger minutely files were from spikes from imports, and it would catch up over the next two files.

  2. Indeed, I have had osmosis hang on an update occasionally. I THINK most of those were caused by a bug in osmosis that has since been fixed although I haven't updated my copy yet. Apparently it wasn't setting any kind of timeout so under certain network conditions it would just sit there waiting forever for a connection to be established.

  3. Thanks for explaining this so well, Toby. Much appreciated. Did you put a link to this article on the Osmosis wiki?

  4. Thanks Martijn. No I haven't linked it in the wiki. Is linking to your own blog post in a wiki considered a conflict of interest? :)

  5. The replicate sequence url is broken. Use this instead, you can get minutely, hourly or daily there: https://osm.mazdermind.de/replicate-sequences/
