Elasticsearch and Spark 1.3

Elasticsearch has offered Hadoop InputFormat and OutputFormat implementations for quite some time. These made it possible to process Elasticsearch indices with Spark just as you would any other Hadoop data source. Here’s an example of this in action, taken from Elastic’s documentation:

val conf = new JobConf()
conf.set("es.resource", "radio/artists")
conf.set("es.query", "?q=me*")
val esRDD = sc.hadoopRDD(conf,
                classOf[EsInputFormat[Text, MapWritable]], 
                classOf[Text], classOf[MapWritable]))
val docCount = esRDD.count();

However, the latest versions of the Elasticsearch client offers more idiomatic Spark support to generate RDDs (and Spark data frames) directly from Elasticsearch indices without explicitly dealing with the Hadoop formatting machinery. This support is extremely cool, but the documentation is still a bit sparse and I hit a few snags using it. In this post, I’ll cover the problems I encountered and explain how to make it work.

Dependencies

Native Spark support is available in elasticsearch-hadoop version 2.1.0. If you’d tried to set this up last week, you’d have needed to run against a snapshot build to get support for Spark data frames. Fortunately, the most recent beta release of elasticsearch-hadoop includes this support. Add the following library dependency to your project:

libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark" % "2.1.0.Beta4"

Since we won’t be using the lower-level Hadoop support, that’s the only library we’ll need.

Configuration

We need to supply some configuration in order to create RDDs directly from indices. elasticsearch-spark expects this configuration to be available as parameters on our Spark context object. Here’s an example of how you’d do that in a vanilla Spark application:

val conf = new org.apache.spark.SparkConf()
 .setMaster("local[*]")
 .setAppName("es-example")
 .set("es.nodes", "localhost")

val spark = new org.apache.spark.SparkContext(conf)

Note that since the default value for es.nodes is localhost, you don’t need to set it at all if you’re running against a local Elasticsearch instance. If you were running Elasticsearch behind a reverse proxy running on proxyhost on port 80, you might specify proxyhost:80 instead (and you would probably also want to set es.nodes.discovery to false so your client wouldn’t discover nodes that weren’t reachable outside of the private network). There are other options you can set, but this is the most critical.

If you’re using the Spark app skeleton provided by the Silex library, you can add Elasticsearch configuration in config hooks, as follows:

import com.redhat.et.silex.app.AppCommon
import org.elasticsearch.spark._

object IndexCount extends AppCommon {
  addConfig { conf => 
    conf
     .set("es.nodes", sys.env.getOrElse("ES_NODES", "localhost")) 
     .set("es.nodes.discovery", "false")
  }
  
  override def appName = "example ES app"
  
  def appMain(args: Array[String]) {
    args match {
      case Array(index, indexType) =>
        val resource = s"$index/$indexType"
        val count = context.esRDD(resource).count()
        Console.println(s"$resource has $count documents")
      case _ =>
        Console.println("usage: IndexCount index type")
    }
  }
}

Generating RDDs

elasticsearch-spark offers implicit conversions to make RDDs directly from Elasticsearch resources. (Careful readers may have already seen one in action!) The following code example assumes that you’ve initialized spark as a SparkContext with whatever Elasticsearch configuration data you need to set:

// bring implicit conversions into scope to extend our
// SparkContext with the esRDD method, which creates a
// new RDD backed by an ES index or query
import org.elasticsearch.spark._

// telemetry-20080915/sar is an ES index of system telemetry data
// (collected from the sar tool)
val sarRDD = spark.esRDD("telemetry-20080915/sar")

// => sarRDD: org.apache.spark.rdd.RDD[(String, scala.collection.Map[String,AnyRef])] = ScalaEsRDD[4] at RDD at AbstractEsRDD.scala:17

This is far more convenient than using the Hadoop InputFormat, but the interface still leaves a little to be desired. Our data may have a sensible schema, but the result we’re getting back is as general as possible: just a collection of key-value pairs, where the keys are strings and the values are maps from strings to arbitrary (and arbitrarily-typed) values. If we want more explicit structure in our data, we’ll need to bring it in to Spark in a different way.

Generating data frames

In order to benefit from the structure in our data, we can generate data frames that store schema information. The most recent release of elasticsearch-spark supports doing this directly. Here’s what that looks like, again assuming that spark is an appropriately-configured SparkContext:

// bring implicit conversions into scope to extend our
// SQLContexts with the esDF method, which creates a
// new data frame backed by an ES index or query
import org.elasticsearch.spark.sql._

// construct a SQLContext
val sqlc = new org.apache.spark.sql.SQLContext(spark)

// telemetry-20080915/sar is an ES index of system telemetry data
// (collected from the sar tool)
val sarDF = sqlc.esDF("telemetry-20080915/sar")

// => org.apache.spark.sql.DataFrame = [_metadata: struct<file-date:timestamp,generated-by:string,generated-by-version:string,machine:string,nodename:string,number-of-cpus:int,release:string,sysdata-version:float,sysname:string>, cpu-load: string, cpu-load-all: string, disk: string, filesystems: string, hugepages: struct<hugfree:bigint,hugused:bigint,hugused-percent:double>, interrupts: string, interrupts-processor: string, io: struct<io-reads:struct<bread:double,rtps:double>,io-writes:struct<bwrtn:double,wtps:double>,tps:double>, kernel: struct<dentunusd:bigint,file-nr:bigint,inode-nr:bigint,pty-nr:bigint>, memory: struct<active:bigint,buffers:bigint,bufpg:double,cached:bigint,campg:double,commit:bigint,commit-percent:double,dirty:bigint,frmpg:double,inactive:bigint,memfree:bigint,memu...

Once we have a data frame for our ES resource, we can run queries against it in Spark:

sardf.registerTempTable("sardf")

sqlc.sql("select _metadata from sardf")

Next steps

Elasticsearch’s Spark support offers many additional capabilities, including storing Spark RDDs to Elasticsearch and generating RDDs from queries. Check out the upstream documentation for more!