Elasticsearch has offered Hadoop InputFormat and OutputFormat implementations for quite some time. These made it possible to process Elasticsearch indices with Spark just as you would any other Hadoop data source. Here's an example of this in action, taken from Elastic's documentation:
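The example below is a sketch of that approach rather than a verbatim copy of Elastic's snippet; the index/type name test/docs and the application name are illustrative, and it assumes a local Elasticsearch instance:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.EsInputFormat

// "test/docs" is a hypothetical index/type for illustration
val spark = new SparkContext(new SparkConf().setAppName("es-inputformat-example"))
val conf = new Configuration()
conf.set("es.resource", "test/docs")

// Each record is a document ID (Text) paired with a map of field names to values
val docs = spark.newAPIHadoopRDD(conf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
println(docs.count())
```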
However, the latest versions of the Elasticsearch client offer more idiomatic Spark support to generate RDDs (and Spark data frames) directly from Elasticsearch indices without explicitly dealing with the Hadoop formatting machinery. This support is extremely cool, but the documentation is still a bit sparse and I hit a few snags using it. In this post, I'll cover the problems I encountered and explain how to make it work.
Native Spark support is available in
elasticsearch-hadoop version 2.1.0. If you’d tried to set this up last week, you’d have needed to run against a snapshot build to get support for Spark data frames. Fortunately, the most recent beta release of
elasticsearch-hadoop includes this support. Add the following library dependency to your project:
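In sbt form, the dependency looks something like this (the Beta4 version number is my assumption for the most recent 2.1.0 beta; substitute whatever beta is current):

```scala
// Assumes 2.1.0.Beta4 is the latest beta; adjust the version as needed
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark" % "2.1.0.Beta4"
```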
Since we won’t be using the lower-level Hadoop support, that’s the only library we’ll need.
We need to supply some configuration in order to create RDDs directly from indices.
elasticsearch-spark expects this configuration to be available as parameters on our Spark context object. Here’s an example of how you’d do that in a vanilla Spark application:
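A minimal sketch of that setup, assuming a local Elasticsearch instance (the application name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// es.* settings on the SparkConf are picked up by elasticsearch-spark
val conf = new SparkConf()
  .setAppName("es-spark-example")
  .set("es.nodes", "localhost") // where to find the Elasticsearch cluster

val spark = new SparkContext(conf)
```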
Note that since the default value for es.nodes is localhost, you don't need to set it at all if you're running against a local Elasticsearch instance. If you were running Elasticsearch behind a reverse proxy running on proxyhost on port 80, you might specify es.nodes as proxyhost:80 instead (and you would probably also want to set es.nodes.discovery to false so your client wouldn't discover nodes that weren't reachable outside of the private network). There are other options you can set, but es.nodes is the most critical.
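For the reverse-proxy scenario, the configuration might look like this (proxyhost and the port come from the example above; this is a sketch, not something tested against a real proxy):

```scala
import org.apache.spark.SparkConf

// Hypothetical setup: Elasticsearch is only reachable through proxyhost:80
val conf = new SparkConf()
  .setAppName("es-spark-proxy-example")
  .set("es.nodes", "proxyhost:80")
  .set("es.nodes.discovery", "false") // don't look up nodes behind the proxy
```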
elasticsearch-spark offers implicit conversions to make RDDs directly from Elasticsearch resources. (Careful readers may have already seen one in action!) The following code example assumes that you’ve initialized
spark as a
SparkContext with whatever Elasticsearch configuration data you need to set:
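A minimal sketch of the implicit-conversion interface; the index/type name blog/posts is hypothetical:

```scala
import org.elasticsearch.spark._ // adds esRDD (among others) to SparkContext

// "blog/posts" is a hypothetical index/type
val documents = spark.esRDD("blog/posts")

// Each element is a (document ID, Map[String, AnyRef] of fields) pair
documents.take(5).foreach(println)
```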
This is far more convenient than using the Hadoop
InputFormat, but the interface still leaves a little to be desired. Our data may have a sensible schema, but the result we're getting back is as general as possible: just a collection of key-value pairs, where the keys are strings and the values are maps from strings to arbitrary (and arbitrarily-typed) values. If we want more explicit structure in our data, we'll need to bring it into Spark in a different way.
Generating data frames
In order to benefit from the structure in our data, we can generate data frames that store schema information. The most recent release of
elasticsearch-spark supports doing this directly. Here’s what that looks like, again assuming that
spark is an appropriately-configured SparkContext:
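A sketch of the data-frame interface, assuming a SQLContext built from that SparkContext (blog/posts is again a hypothetical index/type):

```scala
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._ // adds esDF to SQLContext

val sqlContext = new SQLContext(spark)

// The resulting data frame carries a schema inferred from the index mapping
val df = sqlContext.esDF("blog/posts")
df.printSchema()
```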
Once we have a data frame for our ES resource, we can run queries against it in Spark:
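For example, with the hypothetical data frame df from an index of blog posts (the table name and the title field are assumptions for illustration):

```scala
// Register the frame as a temporary table and query it with Spark SQL
df.registerTempTable("posts")
val titled = sqlContext.sql("SELECT title FROM posts WHERE title IS NOT NULL")
titled.show()
```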
Elasticsearch’s Spark support offers many additional capabilities, including storing Spark RDDs to Elasticsearch and generating RDDs from queries. Check out the upstream documentation for more!