13 July 2008

Note this is an incomplete document. I have yet to confirm that this works 100%, but here are the things I noted while installing PostGIS' TIGER Geocoder. First and foremost, the source code is available via svn co http://svn.refractions.net/postgis/trunk/extras/tiger_geocoder. A lot of this information is an incomplete copy of what's available from : http://postgis.refractions.net/support/wiki/index.php?PostgisOnUbuntu

What follows are the instructions as present in the source code. I'll interpolate the bits that were a little unclear and required assistance for me so you can benefit from them.

  1. Install postgresql-8.2-postgis (For various GIS bits). For Ubuntu Hardy users, this was a mere sudo apt-get install postgresql-8.3 postgresql-contrib-8.3 postgresql-8.3-postgis. Invariably, you'll need some extra tool support so: sudo apt-get install gdal-bin python2.5-gdal (Obviously, you should have Python installed, too.). Interestingly, my installation has thus far been a failure since I chose to use Postgres 8.3. If you're on Gutsy or prior, you may have better luck (in fact, I'm almost 100% sure you will) with Postgres 8.2, against which the geocoder code seems to have been tested. I can't guaranteee these instructions are the same, however. They should be.. I can't imagine any of these dependencies not existing in the version of the distro that's just 6 months old :-), but the versions may be different. Anyway: take away is that you should install 8.2 but be cautious. The rest of these instructions are referencing my 8.3 installation, so please change them as necessary.
  2. Install PostGIS functions and spatial ref definitions. This enables PostGIS itself on your database.
    \i /usr/share/postgresql-8.3-postgis/lwpostgis.sql \i /usr/share/postgresql-8.3-postgis/spatial_ref_sys.sql
  3. Install postgresql-contrib-8.3 for soundex (from fuzzystrmatch)This looks something like: psql -U postgres your_database_name < /usr/share/postgresql/8.3/contrib/fuzzystrmatch.sql . There's all sorts of worthy things in the contrib modules, so take your time and do some research and install anything you think useful.
  4. Obtain the TIGER data from the Census site. I used wget -r http://www2.census.gov/geo/tiger/tiger2006se
  5. You need to modify, and then run, import/load_tiger.sh. . What this will do is load all the TIGER data from http://www2.census.gov/geo/tiger/tiger2006se/ into the database in a sort of raw unfettered way so that the processing to setup the Geocoder works. Here are the modifications I made before running it:
    #!/bin/bash

    FIRST=""
    for file in /home/jlong/Desktop/gis/www2.census.gov/geo/tiger/tiger2006se/*/*.ZIP; do
    ./tigerimport.sh append MYDATABASE $file $FIRST
    if [ -z "$FIRST" ]; then
    FIRST="-a"
    fi
    done
    You see? All I change was the value for MYDATABASE and the value for pointing tot he directory where I had downloaded all the data fromthe census site. Also, I modified tigerimport.sh itself. Whenever I saw lines starting with ogr2ogr, I modified the connection paramaters as appropriate: ogr2ogr $PRE $OGROPTS $SRS "PG:dbname=$DATABASE user=MYUSEr password=MYPW host=127.0.0.1" $TIGER CompleteChain $POST. For my setup, I needed to specify the host. Which I guessed was done using a 'host=127.0.0.1' parameter. That seemed to work. Simialarly, before you start importing everything, you may run into encoding issues down the line after this step. You can 'undo' it afterwards. This was necessary as the client's talking to my database were trying to send non UTF8 data to my database which was UTF8. ALTER DATABASE lemanger SET client_encoding=latin1; . Alright, finally, run the script above once you're confident everything's changed the way it needs to be! Warning On my personal workstation with a 3GHz Pentium 4 (Not dual core, but still, no slouch) and 2 GB of RAM, this process took 28 hours and 30GBs or so of space. This is not a 'grab a cup of coffee' situation. This is, 'go visit the family ... in the next state... and stop all other concurrently running programs on the workstation for a day' situation. In total -- by the way -- so you don't get in over your head, I needed upwards of 100 GB of free space to complete this process. Some of that space is eventually recoverable, because it's temporary. But still, if you want to preflight this process with just one state, that may be advisable if you have a small hard disk. Another thing to note by the way is that a 'State', in this case, is a state of the United States and includes in its definition 59 entities that aren't from the 50 proper US 'States', including Puerto Rico, for example. Very cool.
  6. Here's where I'm only a little sure of what to do and I could -- in fact -- be wrong. I did get pointed in the right direction (thank you Stephen Woodbridge!), so hopefully this is correct. The next step consists of creating tables that the geocoder will need from another set of data, the US Census Shape Boundry files. They are are at http://www.census.gov/geo/www/cob/bdy_files.html. You don't need them all, just a few. They organized according to a code, as described here: http://www.census.gov/geo/www/cob/filenames.html. Thus, each file is a data set, and some correspond to granularities smaller than 'national'. When you see some of the tables that you're expected to have installed, it's looking for 'national' data sets, and I couldn't find them for all of them. It needs tables of the following type: Mouse over the links to verify that data you're looking at has the code as the data you want. Since 99 means national, and 00 means year 2000, and the part in the begining is the abbreviation, then you know that cs99_d00 is probably county subdivisions from the year 2000 for the national level, which is http://www.census.gov/geo/www/cob/cs2000.html. Notice that there's no 99. And that there's no single set of data! in this case I think we're after just the .SHP files. For me, it was easier to download those listing pages to my local disk somewehre and then extract the URLs and then download them from that list. Here's I did that.
    #!/bin/bash
    cat cs_listing.html |grep -i shp |grep geo|cut -f2 -d\"| while read url;
    do
    nurl="http://www.census.gov$url"
    echo $nurl;
    done > base_of_zip_files.txt
    That got me a listing of the zip files that I wanted. The next trick was to download those zip files:
    #!/bin/bash
    cat base_of_zip_files.txt | while read url;
    do
    nurl="http://www.census.gov$url"
    echo $nurl;
    wget $nurl
    done
    That will take a few minutes. I couldn't get the data all in one for anything but the counties dataset. So, I had to download all the data and forge from them one aggregate data. It helps to be meticulous and create a 'boundry' folder inside of which you create separate 'cs', 'co', and 'pl' folders for these datasets.
  7. Once you've downloaded all the datasets, you'll need to install them. Here's my slightly modified version of the script from the installation instructions that come witht he Tiger geocoder. I created one for each dataset and ran it in the same folder as the particular dataset itself:
    #!/bin/bash
    find . -iname "*ZIP"|while read l;
    do
    mkdir -p x;
    rm -rf x ;
    mkdir x ;
    cd x;
    unzip .$l;
    BASE=`echo *.shp | cut -f1 -d.` ;
    ogr2ogr -append -f PostgreSQL -a_srs EPSG:4269 "PG:dbname=DATABASE host=127.0.0.1 \

    user=USER password=PASSWORD" $BASE.shp $BASE -nln TABLE_NAME -lco DIM=2
    cd ..;

    done
    Basically, TABLE_NAME is the name of the table (one of the three I mentioned above, dependant on which dataset you're dealing with.) and the rest is just the information needed to get into your database. By the way, if you didn't do the client encoding step I mentioned above, this is where it may bite you in the butt.
  8. Create geometry indexes on all the tables using GiST, eg:
     CREATE INDEX completechain_geom_idx ON completechain 
    USING GIST (wkb_geometry public.gist_geometry_ops);

    Naturally, since this is operating on the 66 million record behemoth you created in the beginning, this will take many hours. Go watch a movie... or two.
  9. NB: the following steps are pretty straight forward, if a bit tedious. In particular, one of them is REALLY slow and another one is going to break if you're trying this on 8.3 due to cast errors, complaining of comparisons of integers to char/text/varchar data types. The errors are pretty self explanatory and will show you what needs to be fixed. Typically it's just a matter of changing the expression in question to what it was, plus "::text" to incur a cast.
  10. Create lookup tables using tables/lookup_tables.sql.
  11. Create roads_local using roads_local.sql . I think this one is the one that does a derivation from a dataset produced by a join on one of the bigger tables, and so it too takes an incredibly long time. 5 hours, or so.
  12. Create 'tiger_geocode_roads' using tiger_geocode_roads.sql
  13. Run create_geocode.sql to load the functions
  14. Profit!