Using apache spark for analytics in the cloud

 

William C. Benton

Principal Software Engineer

Red Hat Emerging Technology

June 24, 2015

About me

  • Distributed systems and data science in Red Hat's Emerging Technology group
  • Active open-source and Fedora developer
  • Before Red Hat:  programming language research

Forecast

  • Distributed data processing:  history and mythology
  • Data processing in the cloud
  • Introducing Apache Spark
  • How we use Spark for data science at Red Hat

Recent history & persistent mythology

Data processing

What makes distributed data processing difficult?

Challenges

MapReduce (2004)

  • A novel application of some very old functional programming ideas to distributed computing
  • All data are modeled as key-value pairs
  • Mappers transform pairs; reducers merge several pairs with the same key into one new pair
  • Runtime system shuffles data to improve locality

Word count

"a b"
"c e"
"a b"
"d a"
"d b"

Mapped inputs

​(a, 1)
(b, 1)
​(c, 1)
(e, 1)
​(a, 1)
(b, 1)
​(d, 1)
(a, 1)
​(d, 1)
(b, 1)

shuffled records

​(a, 1)
(a, 1)
(a, 1)
​(b, 1)
(b, 1)
(b, 1)
​(c, 1)
​(d, 1)
(d, 1)
​(e, 1)

reduced records

​(a, 3)
​(b, 3)
​(c, 1)
​(d, 2)
​(e, 1)

Hadoop (2005)

  • Open-source implementation of MapReduce, a distributed filesystem, and more
  • Inexpensive way to store and process data with scale-out on commodity hardware
  • Motivates many of the default assumptions we make about “big data” today

“Facts”

  • You need an architecture that will scale out to many nodes to handle real-world data analytics
  • Your network and disks probably aren't fast enough
  • Locality is everything:  you need to be able to run compute jobs on the nodes storing your data

“FACTS”

  • You need an architecture that will scale out to many nodes to handle real-world data analytics
  • Your network and disks probably aren't fast enough
  • Locality is everything:  you need to be able to run compute jobs on the nodes storing your data

...at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB.

Takeaway #1:  you may need petascale storage, but you probably don't even need terascale compute.

Takeaway #2:  moderately sized workloads benefit more from
scale-up than scale out.

“FACTS”

  • You need an architecture that will scale out to many nodes to handle real-world data analytics
  • Your network and disks probably aren't fast enough
  • Locality is everything:  you need to be able to run compute jobs on the nodes storing your data

Contrary to our expectations ... CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%

Takeaway #3:  I/O is not the bottleneck (especially in moderately-sized jobs); focus on CPU performance.

“FACTS”

  • You need an architecture that will scale out to many nodes to handle real-world data analytics
  • Your network and disks probably aren't fast enough
  • Locality is everything:  you need to be able to run compute jobs on the nodes storing your data

Takeaway #4:  collocated data and compute was a sensible choice for petascale jobs in 2005, but shouldn't necessarily be the default today.

Facts (revised)

  • You probably don't need an architecture that will scale out to many nodes to handle real-world data analytics (and might be better served by scaling up)
  • Your network and disks probably aren't the problem
  • You have enormous flexibility to choose the best technologies for storage and compute

Hadoop in 2015

  • MapReduce is low-level, verbose, and not an obvious fit for many interesting problems
  • No unified abstractions:  Hive or Pig for query, Giraph for graph, Mahout for machine learning, etc.
  • Fundamental architectural assumptions need to be revisited along with the “facts” motivating them

How our assumptions should change

Data processing
in the cloud

Collocated data and compute

Elastic resources

distinct storage and compute

Combine the best storage system for your application with elastic compute resources.

Introducing Spark

Apache Spark is a framework for distributed computing based on a high-level, expressive abstraction.

Language bindings for Scala, Java, Python, and R

Access data from JDBC, Gluster, HDFS, S3, and more

A resilient distributed dataset is a partitioned, immutable, lazy collection.

A resilient distributed dataset is a partitioned, immutable, lazy collection.

A resilient distributed dataset is a partitioned, immutable, lazy collection.

The PARTITIONS making up an RDD can be distributed across multiple machines

A resilient distributed dataset is a partitioned, immutable, lazy collection.

TRANSFORMATIONS create new (lazy) collections; ACTIONS force computations and return results

CREATING AN RDD


spark.parallelize(range(1, 1000))


spark.textFile("hamlet.txt")


spark.hadoopFile("...")
spark.sequenceFile("...")
spark.objectFile("...")

# from an in-memory collection

# from the lines of a text file

# from a Hadoop-format binary file

TRANSFORMING RDDs


numbers.map(lambda x: x + 1)


lines.flatMap(lambda s: s.split(" "))


vowels = ['a', 'e', 'i', 'o', 'u']
words.filter(lambda s: s[0] in vowels)


words.distinct()

# transform each element independently

# turn each element into zero or more elements

# reject elements that don't satisfy a predicate

# keep only one copy of duplicate elements

TRANSFORMING RDDs



pairs.sortByKey()



pairs.reduceByKey(lambda x, y: max(x, y))



pairs.join(other_pairs)

# return an RDD of key-value pairs, sorted by
# the keys of each

# combine every two pairs having the same key,
# using the given reduce function

# join together two RDDs of pairs so that
# [(a, b)] join [(a, c)] == [(a, (b, c))]

caching results



sorted_pairs = pairs.sortByKey()
sorted_pairs.cache()



sorted_pairs.persist(MEMORY_AND_DISK)


sorted_pairs.unpersist()

# tell Spark to cache this RDD in cluster
# memory after we compute it

# as above, except also store a copy on disk

# uncache and free this result

computing Results



numbers.count()



counts.collect()



words.saveAsTextFile("...")

# compute this RDD and return a
# count of elements

# compute this RDD and materialize it
# as a local collection

# compute this RDD and write each
# partition to stable storage

WORD COUNT EXAMPLE



f = spark.textFile("...")


words = f.flatMap(lambda line: line.split(" "))


occs = words.map(lambda word: (word, 1))


counts = occs.reduceByKey(lambda a, b: a + b)


counts.saveAsTextFile("...")

# create an RDD backed by the lines of a file

# ...mapping from lines of text to words

# ...mapping from words to occurrences

# ...reducing occurrences to counts

# POP QUIZ:  what have we computed so far?

petascale storage, IN-MEMORY COMPUTE

storage/compute nodes

compute-only nodes, which primarily operate on cached post-ETL data

Data Science at Red Hat

the Emerging tech data science team

  • Engineers with distributed systems, data science, and scientific computing expertise
  • Goal:  help internal customers solve data problems and make data-driven decisions
  • Principles:  identify best practices, question outdated assumptions, use best-of-breed technology

development

  • Six compute-only nodes 
  • Two nodes for Gluster storage
  • Apache Spark running under Apache Mesos
  • Open-source “notebook” interfaces to analyses

DATA SOURCES

FTP

S3

MySQL

MongoDB

ElasticSearch

interactive query

TWO case studies

role analysis

  • Data source:  historical configuration and telemetry data for internal machines from ElasticSearch
  • Data size:  hundreds of GB
  • Analysis:  identify machine roles based on the packages each has installed

budget forecasting

  • Data sources:  operational log data for OpenShift Online (from MySQL), actual costs incurred by OpenShift
  • Data size:  over 120 GB
  • Analysis:  identify operational metrics most strongly correlated with operating expenses; model daily operating expense as a function of these metrics

Aggregating performance metrics:
17 hours in MySQL, 15 minutes in Spark!

Next steps

DEMO VIDEO

See a video demo of Continuum Analytics, PySpark, and Red Hat Storage: h.264 or Ogg Theora

where from here

thanks