Indexing GeoNames into Solr
This post walks through a quick and easy way to index GeoNames.org locations into Solr 5.2.1. It uses the default Solr configuration for the gettingstarted collection.
For more on Solr collections vs. cores, see the Solr documentation.
Getting Solr 5.2.1 up and going
Download and unzip Solr 5.2.1
Start Solr
You should now be able to open http://127.0.0.1:8983/solr in a browser.
Formatting GeoNames.org data for Solr
GeoNames provides several data downloads on its website. This post focuses on indexing allCountries.txt, which includes all features from GeoNames. Unzipped, the file is ~1.2 GB, which may be too large to handle comfortably on some machines. Beginners may want to start with a smaller subset of the GeoNames data, such as cities1000.txt.
Someone out there could probably do all of this in one awesome one-liner. The steps are broken up here to make it easier to see what's going on. We first need to format the GeoNames data into something that Solr can index.
Download and unzip allCountries.zip
Download available from GeoNames.
allCountries.txt is a tab-delimited text file in UTF-8 encoding. The following fields are provided:
Field | Description |
---|---|
geonameid | integer id of record in geonames database |
name | name of geographical point (utf8) varchar(200) |
asciiname | name of geographical point in plain ascii characters, varchar(200) |
alternatenames | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) |
latitude | latitude in decimal degrees (wgs84) |
longitude | longitude in decimal degrees (wgs84) |
more … | we don’t need the rest of these |
We won’t use most of these columns, so let’s get rid of the ones we don’t need.
Get rid of columns we don’t need
We only need the 1st, 2nd, 5th, and 6th columns (geonameid, name, latitude, and longitude).
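As a rough illustration of this step, here is a minimal Python sketch that keeps only those four columns. It is not the command originally used in this post; the function name `keep_columns`, the output file name `geonames.tsv`, and the sample row are made up for illustration.

```python
def keep_columns(lines, keep=(0, 1, 4, 5)):
    """Yield tab-joined rows containing only the selected zero-based columns.

    The default keeps geonameid, name, latitude, and longitude
    (the 1st, 2nd, 5th, and 6th columns).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i] for i in keep)


def convert_file(src_path, dst_path):
    """Stream a GeoNames dump from src_path to a trimmed TSV at dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for row in keep_columns(src):
            dst.write(row + "\n")


# Made-up sample row in the GeoNames column order (not real data).
sample = ["1\tExample Town\tExample Town\tAlt A,Alt B\t10.5\t20.25\tP\tPPL\n"]
for row in keep_columns(sample):
    print(row)
```

Calling `convert_file("allCountries.txt", "geonames.tsv")` would process the full dump one line at a time, so the 1.2 GB file never has to fit in memory.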
Add a header row
Add a header row to the TSV file. Note that the delimiters between id, title_t, lat, and lng must be literal tab characters.
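One way to avoid typing tab literals by hand is to build the header in code. The sketch below is an assumption about how this step could be done in Python, not the original command; `with_header` is a hypothetical helper name.

```python
# Joining on "\t" guarantees real tab characters between the column names.
HEADER = "\t".join(["id", "title_t", "lat", "lng"])


def with_header(lines, header=HEADER):
    """Yield the header row first, then every data row unchanged."""
    yield header
    yield from lines


# Made-up data row for illustration.
for row in with_header(["1\tExample Town\t10.5\t20.25"]):
    print(row)
```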
Add a WKT column
This command requires the csvpys version of csvkit. Running it creates a new WKT point column, loc_srpt, from the existing lat and lng columns. *_srpt is a dynamic Solr field of the Spatial Recursive Prefix Tree field type, shipped with the default gettingstarted Solr schema.
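For readers without csvpys installed, the same column can be sketched in plain Python. This swaps in a different technique from the csvpys command described above, and `add_wkt_column` is a hypothetical name. The key detail is that WKT lists the x coordinate (longitude) before the y coordinate (latitude).

```python
def add_wkt_column(lines):
    """Append a loc_srpt column holding a WKT point built from lat and lng.

    Expects the first line to be a tab-delimited header that contains
    "lat" and "lng" columns.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    lat_i, lng_i = header.index("lat"), header.index("lng")
    yield "\t".join(header + ["loc_srpt"])
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # WKT order is "POINT(x y)", i.e. longitude first, then latitude.
        point = f"POINT({fields[lng_i]} {fields[lat_i]})"
        yield "\t".join(fields + [point])


# Made-up data row for illustration.
for row in add_wkt_column(["id\ttitle_t\tlat\tlng", "1\tExample Town\t10.5\t20.25"]):
    print(row)
```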
Only keep the columns we need
Get rid of the lat and lng columns, since loc_srpt now carries the coordinates.
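A minimal Python sketch of this step, dropping columns by header name rather than by position. It is an illustrative alternative to the original command, and `drop_columns` is a hypothetical name.

```python
def drop_columns(lines, drop=("lat", "lng")):
    """Yield rows without the named columns; the first line must be the header."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    # Remember the positions of the columns we want to keep.
    keep = [i for i, name in enumerate(header) if name not in drop]
    yield "\t".join(header[i] for i in keep)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i] for i in keep)


# Made-up data row for illustration.
rows = [
    "id\ttitle_t\tlat\tlng\tloc_srpt",
    "1\tExample Town\t10.5\t20.25\tPOINT(20.25 10.5)",
]
for row in drop_columns(rows):
    print(row)
```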
Convert the TSV to JSON
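The conversion can be sketched in Python as follows. This is an assumption about one workable approach, not the command originally used; `tsv_to_json` is a hypothetical name. It writes a JSON array of objects keyed by the header names, one document per line, and streams its output so the full dataset is never held in memory.

```python
import io
import json


def tsv_to_json(lines, out):
    """Write the TSV rows to `out` as a JSON array of objects.

    The first line must be the tab-delimited header. All values are
    kept as strings.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    out.write("[")
    for n, line in enumerate(lines):
        doc = dict(zip(header, line.rstrip("\n").split("\t")))
        # Separate documents with a comma, but not before the first one.
        out.write((",\n" if n else "\n") + json.dumps(doc, ensure_ascii=False))
    out.write("\n]\n")


# Made-up data row for illustration.
buf = io.StringIO()
tsv_to_json(["id\ttitle_t\tloc_srpt", "1\tExample Town\tPOINT(20.25 10.5)"], buf)
print(buf.getvalue())
```

Passing `ensure_ascii=False` keeps non-ASCII place names readable in the output, which should then be written to a UTF-8 encoded file.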
Index into Solr
If you index the full allCountries.txt file, this command can take a while (at least 5 minutes), since it adds over 10 million records to your Solr index. To check progress, watch the document count for your collection in the Solr admin interface; it should keep increasing while the command runs.
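As a hedged sketch of what the indexing request looks like, the Python below posts a JSON array of documents to the update handler of the gettingstarted collection. It assumes Solr is listening at http://127.0.0.1:8983 as above. The function names and the sample document are made up, and this is not the original command from this post.

```python
import json
import urllib.request

# Assumes the default host, port, and collection used in this post.
SOLR_UPDATE_URL = "http://127.0.0.1:8983/solr/gettingstarted/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request that sends `docs` to Solr as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_docs(docs):
    """Send the documents to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(docs)) as resp:
        return resp.status


# Made-up document; building the request does not contact Solr.
req = build_update_request(
    [{"id": "1", "title_t": "Example Town", "loc_srpt": "POINT(20.25 10.5)"}]
)
print(req.get_method(), req.full_url)
```

For the full dataset, sending the documents in batches (for example, a few thousand per request) is likely to be more practical than one very large request body.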
You should now have your GeoNames data indexed in Solr!
Check out a Solr query.
You can now do all sorts of fun spatial search things in Solr!
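As one example of such a spatial search, the sketch below builds a select URL that uses Solr's geofilt filter to find documents within a given distance of a point. The helper name `geofilt_url` and the coordinates are made up for illustration; the collection and the loc_srpt field follow the setup above.

```python
import urllib.parse

# Assumes the default host, port, and collection used in this post.
SOLR_SELECT_URL = "http://127.0.0.1:8983/solr/gettingstarted/select"


def geofilt_url(lat, lng, km, base=SOLR_SELECT_URL):
    """Return a query URL matching documents within `km` kilometers of a point.

    geofilt takes the center as "lat,lon", which is the reverse of the
    WKT order used when indexing.
    """
    params = {
        "q": "*:*",
        "fq": f"{{!geofilt sfield=loc_srpt pt={lat},{lng} d={km}}}",
        "wt": "json",
    }
    return base + "?" + urllib.parse.urlencode(params)


# Made-up coordinates; paste the printed URL into a browser to run the query.
print(geofilt_url(10.5, 20.25, 5))
```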