Converting shapefiles into GeoJSON (for Elasticsearch)

This post is all about finding, converting, and loading geo data from a commonly found format (shapefiles) into a format that JSON-based engines like Elasticsearch can use.

Finding Data

Open geo data can be found in a lot of places. Open city data is a great source of geo data in many jurisdictions. Searching for “open data <cityname>” can yield a lot of results. For example, https://datasf.org/opendata/ is San Francisco’s open data portal. Some jurisdictions will have dedicated GIS portals.

You’ll often find geo data in a few formats:

  1. A CSV of geo points
  2. Shapefiles for geo points
  3. Shapefiles for geo shapes
  4. WKT (Well-known text)
  5. GeoJSON

Elasticsearch natively supports WKT and GeoJSON; I’ll leave importing CSVs as an exercise for the reader for now. This post focuses on how to convert and import shapefiles. GeoJSON sometimes comes as a full FeatureCollection, which is essentially an entire layer of features and, at least for Elasticsearch, needs to be converted into a list of individual Features (points and shapes). I cover that below in Breaking a GeoJSON FeatureCollection Up.
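To give a sense of those last two formats, here is the same point (roughly downtown San Francisco, chosen purely for illustration; the location field name is hypothetical) as it might appear in a document in WKT and in GeoJSON:

"location": "POINT (-122.4194 37.7749)"

"location": { "type": "Point", "coordinates": [-122.4194, 37.7749] }

Both use longitude-first ordering.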

In this example, we’ll use the counties in the Atlanta region, which can be found at http://gisdata.fultoncountyga.gov/datasets/53ca7db14b8f4a9193c1883247886459_67. You can go to Download -> Shapefile to get the shapefile zip file. For this counties example, it looks like this once I’ve unzipped it:

$ ls  
Counties_Atlanta_Region.cpg Counties_Atlanta_Region.dbf Counties_Atlanta_Region.prj Counties_Atlanta_Region.shp Counties_Atlanta_Region.shx

The Wikipedia article on shapefiles has a breakdown of what each of these files actually contains. The .shp file, which holds the feature geometries themselves, is the one we point at for the rest of this exercise, though the tools below will also read the accompanying .dbf (attributes) and .prj (projection) files if they sit alongside it.
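If you already have GDAL installed (covered in the next section), ogrinfo gives a quick summary of the layer, its geometry type, feature count, and projection before you convert anything:

ogrinfo -so -al Counties_Atlanta_Region.shp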

Converting Shapefiles to GeoJSON

After you have a shapefile, the next step is to get the data into GeoJSON format.

Looking at the Atlanta counties again, the Counties_Atlanta_Region.shp file is the one that’s interesting to us. We’ll use a tool called ogr2ogr to convert .shp files to GeoJSON. ogr2ogr is part of GDAL and, on a Mac with Homebrew, can be installed with:

brew install gdal

Alternatively, you can install it manually. ogr2ogr is a wonderful tool to have on your laptop for using/testing geo data. Once you have it, continuing with our example, you should be able to run:

ogr2ogr -f GeoJSON -t_srs crs:84 output_counties.json Counties_Atlanta_Region.shp

This means:

  • -f GeoJSON: Output to GeoJSON format
  • -t_srs crs:84: Reproject the output to WGS84, the same coordinate reference system GPS uses (and the one GeoJSON expects). There are a lot of coordinate reference systems; if you know you need the data in a different one, you can override this, though that’s generally a highly specialized case.
  • output_counties.json is the output file
  • Counties_Atlanta_Region.shp is the input file.

After you run this, you now have a GeoJSON file! If you’ve never looked at one before, open it up to see what GeoJSON looks like.
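If you want to double-check the conversion, ogrinfo also reads GeoJSON and will report the feature count and the (now WGS84) extent:

ogrinfo -al -so output_counties.json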

Breaking a GeoJSON FeatureCollection Up

If we look at the resulting GeoJSON file from the previous step, we see at the top of it:

"type": "FeatureCollection", "name": "Counties_Atlanta_Region", "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } }, "features": [ …

Elasticsearch, like several systems that handle geo data, understands most GeoJSON, but a FeatureCollection is essentially an array of objects that we want to index separately for most purposes. FeatureCollections are sort of like a "bulk" dataset, and we need to get at the individual points/shapes (Features) so that we can search, filter, and use them. In this example, the individual features are the individual counties in the Atlanta region. This is where jq comes in handy.

jq can also be installed via homebrew:

brew install jq

Afterwards, you can "select each element of the features[] array from output_counties.json and output one feature per line" with:

jq -c '.features[]' output_counties.json

The -c flag means "compact": it outputs one feature per line, which is useful for what we’re about to do next.
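As a quick sanity check, jq can also tell you how many features you’re about to index:

jq '.features | length' output_counties.json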

Simultaneously Extracting Features and Converting to Bulk Format

We can do one step better than just extracting the features array by simultaneously converting the output to Elasticsearch’s bulk format with sed:

jq -c '.features[]' output_counties.json | sed -e 's/^/{ "index" : { "_index" : "geodata", "_type" : "_doc" } }\
/' > output_counties_bulk.json && echo "" >> output_counties_bulk.json

The sed bit just adds a bulk action header line (and a newline) before each record, and the echo "" >> output_counties_bulk.json makes sure the file ends in a newline, which the Elasticsearch bulk API requires.

Change geodata to an Elasticsearch index name of your choosing.
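If you’d rather skip sed, here’s a sketch of building the same bulk file with jq alone (same geodata index name), emitting the action header and the feature as two compact lines per feature:

jq -c '.features[] | ({ "index" : { "_index" : "geodata", "_type" : "_doc" } }, .)' output_counties.json > output_counties_bulk.json && echo "" >> output_counties_bulk.json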

Set Up Elasticsearch Mappings

At this point, I’d set up the Elasticsearch mappings for this “geodata” index (or whatever name you want to give it). Metadata related to the shape is often in .properties and geo shape data is often in .geometry. The county data here looks typical:

jq -c '.features[].properties' output_counties.json

This shows us a list of properties like:

{"OBJECTID":28,"STATEFP10":"13","COUNTYFP10":"013","GEOID10":"13013","NAME10":"Barrow","NAMELSAD10":"Barrow County","totpop10":69367,"WFD":"N","RDC_AAA":"N","MNGWPD":"N","MPO":"Partial","MSA":"Y","F1HR_NA":"N","F8HR_NA":"N","Reg_Comm":"Northeast Georgia","Acres":104266,"Sq_Miles":162.914993,"Label":"BARROW","GlobalID":"{36E2EA48-1481-44D7-91C9-7C51AC8AB9E9}","last_edite":"2015-10-14T17:19:34.000Z"}
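It’s also worth a quick look at which geometry types are present (Polygon vs. MultiPolygon, for example), since that’s what will end up in the geo_shape field:

jq -r '.features[].geometry.type' output_counties.json | sort | uniq -c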

At this point, you can add any mappings around these fields and/or use an ingest node pipeline to manipulate the data prior to indexing. For now, I’m just going to set up the geo_shape field, but you can add extras.

PUT /geodata  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}
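To confirm the index and mapping were created the way you intended, you can read the mapping back:

GET /geodata/_mapping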

Bulk Loading Data to Elasticsearch

And at this point, you can bulk-load the data:

curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@output_counties_bulk.json"

Then you can set up or reload Kibana index patterns for your index to make sure it shows up. Make sure any time filters make sense for the visualizations you build. For quick demos I often turn off the "time" field, since the dates can be inconsistent or missing (as I found this Atlanta county data to be).
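As a quick check that the documents are searchable, here is a sketch of a geo_shape query that finds the county containing a point, assuming the geodata index and geometry field from the mapping above (the coordinates are roughly downtown Atlanta and are purely for illustration):

GET /geodata/_search
{
  "query": {
    "geo_shape": {
      "geometry": {
        "shape": {
          "type": "point",
          "coordinates": [-84.39, 33.75]
        },
        "relation": "intersects"
      }
    }
  }
}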

Recap / TL;DR

Get a shapefile

ogr2ogr -f GeoJSON -t_srs crs:84 your_geojson.json your_shapefile.shp

jq -c '.features[]' your_geojson.json | sed -e 's/^/{ "index" : { "_index" : "your_index", "_type" : "_doc" } }\
/' > your_geojson_bulk.json && echo "" >> your_geojson_bulk.json

Set up your mappings. Often the following works, but you may need to check field names:

PUT /geodata  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}

curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@your_geojson_bulk.json"

Set up (or refresh) Kibana index patterns to include your_index.

Voila!