Sunday, April 3, 2011

Parse the Planet! (into a pgsnapshot database)

What makes OpenStreetMap... well... Open? In a link: http://planet.openstreetmap.org/

Yes, of course there are the legal issues of having an open license, the transparent community processes to govern the project and a whole host of open source software to create, manipulate, consume and display map data. But at the end of the day, none of that would matter if the raw data wasn't available.

As is obvious from my previous posts, I have dabbled in manipulating this raw data in the past, with varying degrees of success. After learning from my previous attempts and hanging around on the mailing lists and IRC channel, picking up information here and there I decided to take another crack at processing a planet file. I ended up finding a partner in crime: Paul Norman. He was wanting to import a planet file for use in a private XAPI server using the new java based XAPI implementation written by Ian Dees. This sounded like a good idea so I decided to follow along. We bounced commands and results off of each other for a while. I ended up going a slightly different route in some of the processing because I had some different hardware available to me.

The first thing to know about processing planet data is that there are several different database schemas that can be used to hold the data. This topic probably deserves its own post at some point. Here is a brief list of the ones I am at least minimally familiar with. I know there are some others (like the gazetteer database used by nominatim):
  • API database: Creates a duplicate of the master OSM database. Does not use any GIS features. Nodes are stored with lat/lon values as simple numbers. Contains tables for: ways, tags, users, changesets. Can track history of objects once you start applying updates.
  • "pgsnapshot" schema: creates a schema that utilizes the GIS features of postgis that allows you to run spatial queries against OSM data. Still maintains all tag information but they are in a single "hstore" column, not in their own relational tables. Changesets are not stored nor is any information about users except the ID and username. History is not maintained as updates are applied. There is also a related "pgsimple" schema that seems to be an older version of the pgsnapshot schema that may have some advantages. Differences are discussed in the "pgsnapshot_and_pgsimple.txt" file in the osmosis script directory.
  • osm2pgsql rendering schema: This schema is created by the osm2pgsql tool and used by mapnik to render map tiles. It uses GIS features as well but it is lossy and very de-normalized. Only objects and tags that you care about rendering are imported. This means it is also smaller and faster to get set up.

The XAPI uses the pgsnapshot schema so this is what I set up. The instructions on how to get going are provided in the xapi readme file but I will go ahead and copy it here with some additional notes.



Setting up the database

First of all, you will need a postgresql server with some spare drive space. If you are processing the entire planet file, you will end up creating a database that is about 430 GB in size. However during processing, you may end up needing anywhere between 200 and 400 GB of additional space for temporary files. Some of this depends on exactly how you execute the process. And some of the storage doesn't need to be fast. I used an external USB drive for some of the temporary files and it worked great.

Next, you might want to take a quick look at the "Configure the PostGIS database" section in this post by Richard Weait. I am pretty new to postgresql so I have no tuning recommendations of my own. I used the values he suggested and they seem to work.

As the postgres user on your database system execute the following commands from a shell:
createdb xapi
createlang plpgsql xapi
createuser xapi
Note: this may not be the best way to do things, but answer "yes" to the superuser question. Otherwise you have to modify postgres ownership which wasn't quite straightforward. Mostly because I'm new to postgres.

Now to set up the GIS part of postgis and the hstore for tags. This step sets up the database to hold spatial data instead of just being a bland relational database. Note that your exact path may vary depending on what Linux distro you are using. You can always try a quick locate command. On Ubuntu you have to install the postgresql-8.4-hstore-new package first. On Gentoo I actually had to compile the hstore-new plugin myself.
psql -d xapi -f /usr/share/postgresql/8.4/contrib/postgis.sql
psql -d xapi -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql
psql -d xapi -f /usr/share/postgresql/8.4/contrib/hstore-new.sql

Now we set up the actual OSM tables. You will need to download osmosis. The pgsnapshot schema was just recently introduced. I had to use a development snapshot but as of April 2, it is included in the release of osmosis 0.39. In this post I will assume that you have osmosis unpacked in your home directory.
psql -d xapi -f ~/osmosis/script/pgsnapshot_schema_0.6.sql
psql -d xapi -f ~/osmosis/script/pgsnapshot_schema_0.6_linestring.sql


Getting the data

Now that the database is ready, let's get some data to work on. For the boring people, just go to http://planet.openstreetmap.org/ and download the file planet-latest.osm.bz2. It will likely take several hours to get the 15 GB file. For the cool kids, check out http://osm-torrent.torres.voyager.hr/ where you can find a torrent file with the latest planet dump in it. Planet files are typically produced every Wednesday. The torrent is updated a day or two later. I downloaded several planet files via the torrent and each time I truly did get a torrent of data. During most of the transfer, my downstream bandwidth was maxed out at 2 megabytes/sec.

If this is your first time working with OSM data I would strongly recommend testing the waters with a country or state extract. Extract providers are listed on the Planet page on the wiki. I would look at geofabrik and cloudmade. Geofabrik has the data available in a format called PBF which is more compact and quicker to work with than the bzip2 compressed XML file you get from planet.osm.


Processing with Osmosis

You might want to go refill your drink. This is going to be long.

Osmosis is an amazing tool. Also, amazingly confusing. Just take a look at the "Detailed Usage" page on the wiki. The page summary is two full screens tall in my browser. And once you start to actually read things, it just gets worse. It talks about tasks, pipes and streams as casually as most people would discuss the weather. These are all useful things if you're an osmosis expert wanting to look up how to do a specific task like splitting a planet file into 16 polygon defined extracts, all at the same time - while filtering out relations. But it is a little daunting to new users.

One thing to keep in mind: argument order matters. For example, If you want to use the --log-progress option to output progress information, it must go between the read and write tasks - you can't simply add it to the end of the command line. There are also multiple ways to do some things. This can be a good thing as it lets you optimize the process based on what kind of hardware you have available to you.

Osmosis is a java program. As such, you may need to specify certain options to the JVM to make it do what you want. These JVM level options are controlled by a file. On linux this file is named .osmosis and needs to be in your home directory. On windows 7 it needs to be at C:\Users\<user name>\osmosis.bat. The most common option you may need to use is to give java more memory using the -Xmx option. The other one I needed for one of my attempts was -Djava.io.tmpdir=/some/other/path/ to set the location of the temporary directory to something other than /tmp/

So my ~/.osmosis file in linux looks like this to give it 2 GB of memory and redirect the temp directory
export JAVACMD_OPTIONS="-Xmx2G -Djava.io.tmpdir=/mnt/database/tmp/"

The format is identical in windows except you replace "export" with "set"

One more important note: Osmosis claims to be able to read bzip2 compressed files natively and will happily set about doing this for a few minutes. But do not attempt this! It will end in tears and very odd "invalid XML" errors being thrown. The reason is that java's native bzip2 implementation can not read files with multiple streams in them. So it reads the first stream and thinks it is done, truncating any XML elements it might be in the middle of at the time. Instead, use an external utility to unzip the file to standard out and then tell osmosis to read the uncompressed data from standard in. There is a note about this at the bottom of the osmosis page on the wiki but it is easy to miss (I did at first) so it bears repeating.

Now that we have some basics down (I told you this was going to be long) let's look at what it takes to actually process the data. The first thing we need to do is read the XML using the --read-xml task.

Next, there is a decision to make. There are two basic ways to get the data into the database. One is to let osmosis do it all automatically using the --write-pgsql task. This will cause osmosis to read the XML into some temporary files in your tmp directory and then shove it into the database. The first step may take up to several days (depending on if you are using memory node storage or not - keep reading) and create close to 200 GB of temporary files before it even touches the database. If something fails during the database load, the temporary files are deleted and you start over from scratch. Why the temporary files? Osmosis has to do some processing of the data to correctly construct ways out of nodes and relations out of ways and nodes before it can put them in the database.

The other option is to use the --write-pgsql-dump task to process the XML into a series of files that can be loaded into postgres using the \copy command which is much faster than issuing INSERT queries. It also gives you a fall back point if something goes wrong during the database import. The primary reason I chose to use it is that it lets you easily run this process on another computer and then take the resulting files and import them into postgres on a different computer. This is also the only reason I have been mentioning Windows. My file server that has the disk space and is running postgres is an older machine with only 3 GB of memory and a slower CPU. My windows desktop is a much newer machine since I use it for gaming. So it has a quad core CPU and 12 GB of memory. I found a good use for the memory.

I will focus on the --write-pgsql-dump task that I used but some of the same things apply to the --write-pgsql task. I will start out with the full command line I used and then explain the parts. As I mentioned above, I ran this on my windows box because it has the memory. So please excuse the ugly file paths :)
c:\bzcat.exe o:\planet-110309.osm.bz2 | bin\osmosis.bat 
  --read-xml file=- --log-progress interval=30 
  --write-pgsql-dump directory=g:\osmosis 
    enableLinestringBuilder=yes 
    nodeLocationStoreType=InMemory
Now for the parts:
  • c:\bzcat.exe o:\planet-110309.osm.bz2 | bin\osmosis.bat: As I mentioned, it is a bad idea to let osmosis unzip the file so we must do this using an external utility and pipe the result into osmosis
  • --read-xml file=-: This tells osmosis to read an XML document from standard in. It is easy to miss, but don't forget the - at the end!
  • --log-progress interval=30: This will cause osmosis to report its progress every 30 seconds. The output simply tells you which node/way/relation it is working on and how many per second it is processing so you can make sure it is actually doing something and possibly estimate completion time.
  • --write-pgsql-dump directory=g:\osmosis: Tells osmosis to format its output as a series of postgres dump files in the given directory. The following three sub-options affect the behavior of this task.
  • enableLinestringBuilder=yes: This builds a geometric representation of the way which is then stored in a "geometry" type column in postgres that can be used for spatial queries. Building the linestring can be done in the database but you may not live to see it finish if you are processing the entire planet. Seriously. 
  • nodeLocationStoreType=InMemory: This tells osmosis to hold geometric information about every node in memory. By default it writes this to a temporary file on disk. If you have enough memory you really, really, really want to do this as it cuts processing time from about a week down to, in my case, about 8 hours.

I mentioned "enough" memory. I did this on a computer with 12 GB of memory. The first time I tried this command I gave java 12 GB (-Xmx12G) and java threw an out of memory exception even though my system monitoring (I have zabbix set up) showed that there was still 1GB free. So the second time I gave java 16GB of memory. Yes, that is more memory than the system has and I would not recommend doing this on a regular basis... but it worked in this case. In fact, zabbix still showed that there was almost 1 GB free except for a brief dip to 200 MB. I'm guessing there is some interplay between the JVM and the operating system going on that complicated things a bit. Also, keep in mind that my 12 GB was able to process the planet file today with about 1 billion, 100 million nodes in it but the map is constantly growing (at about a million nodes per day!) so hardware requirements will go up over time. If you don't have enough memory, you can still use osmosis but you will need to use a different node location store type and this process will take several days - maybe up to a week - instead of 8 hours. Anyway, here is a picture of memory usage as this process ran:


The initial downhill slide is where osmosis reads all the nodes into memory. After that, memory usage is relatively constant as it processes ways and relations using the nodes it has in memory. The first upwards bump in the graph is when I closed Starcraft 2 and went to bed :)

One tip if you are doing this across the network like I was: have your output directory be a local disk or even an external USB drive. My windows system didn't have enough disk space so at first I tried putting the output on a network mounted drive that pointed to my file server which is running a samba share. Bandwidth was not a problem but network latency killed performance. Switching to an external USB drive almost tripled the speed at which osmosis processed objects. The planet file can reside on the network. But the output directory needs to be something local.

Here are the resulting files after this 8 hour long memory sucking excursion:

nodes.txt            105 GB
relation_members.txt 231 MB
relations.txt        165 MB
users.txt              2 MB
way_nodes.txt         26 GB
ways.txt              79 GB
---------------------------
total:               210 GB


Loading into postgres

Now for the long part. These 6 files need to be loaded into a database. Luckily there is another script in the osmosis distribution that will assist you. However it will need some modifications. The file is ~/osmosis/script/pgsnapshot_load_0.6.sql. If you open this file you will see that it first drops all constraints and indexes to make the mass inserts go faster. Then it actually imports the data, rebuilds indexes and constraints and finishes up with some database maintenance. The first thing I did was enable query timing just so I could see how long things took. To do this, insert the following line. All of the index/constraint drops will be instantaneous so you can put it after those:
\timing

Next, comment out the next two lines that start with SELECT DropGeometryColumn. The bbox one isn't needed for the xapi and, assuming you followed my advice above about building the linestring in osmosis, that should already be there. You will also need to comment out the two SELECT AddGeometryColumn statements and the two bigger UPDATE queries below that. These sections are all preceded by comments indicating that they may need to be commented out. To comment them out, just add two dashes at the beginning of the line.

I am also unconvinced as to the necessity of the CLUSTER queries. I let the one for the nodes table run and it took almost 3 days. Then I lost power during a spring thunderstorm while the one on the ways table was running. I didn't go back and re run it and my database still seems to be working fine. From what I learned in the postgres documentation on the subject, in theory it makes things faster by copying the entire table in an ordered fashion so that nodes that are geographically near each other are also close to each other in the database. This can increase the chances that a single disk read operation will load many of the nodes that you need into memory at once. In practice my use of the database via the XAPI tends to touch a wide area at once so I doubt I'm seeing much benefit from this. You will also see that the postgres documentation states that this needs to be run on a regular basis to maintain the theoretical benefit. As objects are added/deleted/changed, the clustering is slowly lost. And the CLUSTER query puts an exclusive lock on the table so your database will effectively be offline for at least a day any time you do this. So feel free to comment these out as well if you want to save some time. You can always run them by hand later if you determine that they are actually needed.

I also did not run the VACUUM command. VACUUM is a database maintenance task that compacts your tables to remove "holes" left by deleted and changed records. In my case, the CLUSTER query completely copied my nodes table, so I don't think that a VACUUM immediately following this would have done anything. Even without the CLUSTER, I'm not sure it would have done much to a brand new database. I may be wrong about that but I have yet to see any adverse effects from not running it. Again, you can always run it by hand later - and may need to do so periodically anyway.

However it is very important to run the ANALYZE query! This will cause postgres to do a quick (10 minutes or so) analysis of all the tables in the database and will store some basic information that postgres can use later to optimize your joins. I initially didn't do this either and some simple queries I tried looked like they were going to take days to finish. After running the ANALYZE query, they finished in seconds. It is THAT important!

But back to the import process itself. Now that you have tweaked the pgsnapshot_load_0.6.sql file, it is time to wield it. Change to the directory where the .txt files from osmosis are. Then execute:
psql -U xapi -d xapi -f ~/osmosis/script/pgsnapshot_load_0.6.sql

At this point you should go to sleep or spend time with your family or something. Here are the timings from my load:
users.txt:            20 seconds
nodes.txt:            13 hours
ways.txt:             20 hours
way_nodes.txt:         2 hours
relations.txt:        40 seconds
relation_members.txt: 80 seconds

Creating indexes:
node primary key:              1.5 hours
way primary key:               30 minutes
way_nodes primary key:         1.5 hours
relations primary key:         40 seconds
relation_members primary key:  50 seconds
nodes geometry index:          25 hours
way_nodes node ID index:       2 hours
realtion_members member index: 45 seconds
ways bounding box index:       1.7 hours
way linestring index:          1.6 hours

This ads up to about 70 hours. And that's without the CLUSTER query. If something goes wrong, you can truncate the tables (only the 6 affected by the load... leave the other GIS tables alone!) and then just rerun the script. It will spit out errors when it tries to drop the indexes at the beginning because they don't exist but that is fine.

You can monitor the progress of the \copy commands in linux by using
lsof -o -p <pid> 
where <pid> is the process ID of the pgsql process. It will give you a readout the position of the current read within the file in bytes - in hex. Convert to decimal and divide by 1073741824 to get it in gigabytes.

If you are tight on space, it is safe to delete or compress the input files after pgsql has finished the \copy command on that file.

Here is a pretty picture of my disk usage during the import process:


As you can see, this covers a 6 day period. 3 days (from some time on the 16th to pretty much the end) is the CLUSTER query that I'm not sure is needed. The occasional drops in space usage towards the end is me deleting old files from previous osmosis test runs to make sure it didn't fill up the partition. The big drop at the end is where it finally finished the CLUSTER query and was able to drop the temporary nodes table. Total partition size is 670 GB.


Conclusion

And that's it! Once that process finishes, you will have a complete pgsnapshot database containing the OpenStreetMap database! As I mentioned up towards the top, one use for this is to run your own XAPI server to do specific map queries however you could also use this database to do other things with the OSM data. Depending on what you are doing, you may need to set osmosis up to keep your database in sync with new edits. Obviously by the time this process finishes, the data will already be several days out of date. I will cover the syncing process in a separate post soon.

The total processing time for me was about 80 hours. To make that at least somewhat meaningful I should list the specs of the computer doing all the importing. I already mentioned that the computer I ran osmosis on was a quad core with 12 GB of memory. The computer that is running the database and that actually ran the import process is a little older. It has an AMD 4400+ (first generation dual core) with 3 GB of memory. But the most important part for the database processes is disk. Mine aren't really spectacular: 3 1TB Samsung Spinpoint F1 drives (7,200 RPM) in a software RAID-5 (so total of 2TB usable storage). I am also running LVM on top of the RAID which splits the space into two partitions. Pretty much all of the database related processes are slowed down by disk access times. Faster disks or a better RAID setup (0 or 10) would drop these times. Or SSDs. Donations welcome :)

As I said above, the total size of the database after it is all said and done is 430 GB. However if you execute the CLUSTER query on the nodes table, you will need an additional 180 GB or so because it makes a complete copy of the nodes table. Plus remember the temporary files created by osmosis need another 210 GB although they can be deleted/moved/compressed as the load script finishes using them.

Check back soon (or subscribe!) to catch Check out my next post about keeping the database up to date.

5 comments:

  1. "This can be done in the database but you may not live to see it finish if you are processing the entire planet. Seriously."

    Yes! OK i'll cancel this 17.5 day import and start over.

    ReplyDelete
  2. Yeah... I'm not sure about this but I THINK it was doing like 2-3 ways per hour. This would work out to over 3,500 years to do the whole database (100 million ways). However I think if an ANALYZE statement were put right before the big update query that does this, it might actually make it feasible to do. Didn't test that though and I can virtually guarantee that it will still be quicker in osmosis if you are using memory storage for the nodes.

    ReplyDelete
  3. Hey Toby, thanks for these great Planet Parse instructions. I think there's one line missing in the code snippet:
    You write:
    psql -d xapi -f ~/osmosis/script/pgsnapshot_schema_0.6_linestring.sql
    You should add:
    psql -d xapi -f ~/osmosis/script/pgsnapshot_schema_0.6_bbox.sql

    Otherwise the import for the ways table will fail due to wrong number of columns.

    ReplyDelete
  4. You are correct. However the bbox column actually isn't required (at least for the xapi). When I imported my database I thought it was required but then I found out it wasn't. It seems my blog post got caught in an in-between state. Guess I should probably edit it a bit to make it consistent.

    So if you DO need the bbox column, then yes, run the bbox.sql file. If not, leave off the enableBboxBuilder=yes option from osmosis. Having the bbox in there doesn't hurt anything although I guess it might slow down updates and increase database size a little.

    ReplyDelete