Using apache spark for analytics in the cloud

William C. Benton

Principal Software Engineer

Red Hat Emerging Technology

June 24, 2015

About me

Distributed systems and data science in Red Hat's Emerging Technology group
Active open-source and Fedora developer
Before Red Hat: programming language research

Forecast

Distributed data processing: history and mythology
Data processing in the cloud
Introducing Apache Spark
How we use Spark for data science at Red Hat

Recent history & persistent mythology

Data processing

What makes distributed data processing difficult?

Challenges

MapReduce (2004)

A novel application of some very old functional programming ideas to distributed computing
All data are modeled as key-value pairs
Mappers transform pairs; reducers merge several pairs with the same key into one new pair
Runtime system shuffles data to improve locality

Word count

"a b"

"c e"

"a b"

"d a"

"d b"

Mapped inputs

(a, 1)

(b, 1)

(c, 1)

(e, 1)

(a, 1)

(b, 1)

(d, 1)

(a, 1)

(d, 1)

(b, 1)

shuffled records

(a, 1)

(a, 1)

(a, 1)

(b, 1)

(b, 1)

(b, 1)

(c, 1)

(d, 1)

(d, 1)

(e, 1)

reduced records

(a, 3)

(b, 3)

(c, 1)

(d, 2)

(e, 1)

Hadoop (2005)

Open-source implementation of MapReduce, a distributed filesystem, and more
Inexpensive way to store and process data with scale-out on commodity hardware
Motivates many of the default assumptions we make about “big data” today

“Facts”

You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

“FACTS”

You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

...at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB.

Appuswamy et al., “Nobody ever got fired for buying a cluster.” Microsoft Research Tech Report.

Takeaway #1: you may need petascale storage, but you probably don't even need terascale compute.

Takeaway #2: moderately sized workloads benefit more from
scale-up than scale out.

“FACTS”

You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Contrary to our expectations ... CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%

Ousterhout et al., “Making Sense of Performance in Data Analytics Frameworks.” USENIX NSDI ’15.

Takeaway #3: I/O is not the bottleneck (especially in moderately-sized jobs); focus on CPU performance.

“FACTS”

You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Takeaway #4: collocated data and compute was a sensible choice for petascale jobs in 2005, but shouldn't necessarily be the default today.

Facts (revised)

You probably don't need an architecture that will scale out to many nodes to handle real-world data analytics (and might be better served by scaling up)
Your network and disks probably aren't the problem
You have enormous flexibility to choose the best technologies for storage and compute

Hadoop in 2015

MapReduce is low-level, verbose, and not an obvious fit for many interesting problems
No unified abstractions: Hive or Pig for query, Giraph for graph, Mahout for machine learning, etc.
Fundamental architectural assumptions need to be revisited along with the “facts” motivating them

How our assumptions should change

Data processing
in the cloud

Collocated data and compute

Elastic resources

distinct storage and compute

Combine the best storage system for your application with elastic compute resources.

Introducing Spark

Apache Spark is a framework for distributed computing based on a high-level, expressive abstraction.

Language bindings for Scala, Java, Python, and R

Access data from JDBC, Gluster, HDFS, S3, and more

A resilient distributed dataset is a partitioned, immutable, lazy collection.

The PARTITIONS making up an RDD can be distributed across multiple machines

A resilient distributed dataset is a partitioned, immutable, lazy collection.

TRANSFORMATIONS create new (lazy) collections; ACTIONS force computations and return results

CREATING AN RDD


spark.parallelize(range(1, 1000))


spark.textFile("hamlet.txt")


spark.hadoopFile("...")
spark.sequenceFile("...")
spark.objectFile("...")

# from an in-memory collection

# from the lines of a text file

# from a Hadoop-format binary file

TRANSFORMING RDDs


numbers.map(lambda x: x + 1)


lines.flatMap(lambda s: s.split(" "))


vowels = ['a', 'e', 'i', 'o', 'u']
words.filter(lambda s: s[0] in vowels)


words.distinct()

# transform each element independently

# turn each element into zero or more elements

# reject elements that don't satisfy a predicate

# keep only one copy of duplicate elements

TRANSFORMING RDDs



pairs.sortByKey()



pairs.reduceByKey(lambda x, y: max(x, y))



pairs.join(other_pairs)

# return an RDD of key-value pairs, sorted by
# the keys of each

# combine every two pairs having the same key,
# using the given reduce function

# join together two RDDs of pairs so that
# [(a, b)] join [(a, c)] == [(a, (b, c))]

caching results



sorted_pairs = pairs.sortByKey()
sorted_pairs.cache()



sorted_pairs.persist(MEMORY_AND_DISK)


sorted_pairs.unpersist()

# tell Spark to cache this RDD in cluster
# memory after we compute it

# as above, except also store a copy on disk

# uncache and free this result

computing Results



numbers.count()



counts.collect()



words.saveAsTextFile("...")

# compute this RDD and return a
# count of elements

# compute this RDD and materialize it
# as a local collection

# compute this RDD and write each
# partition to stable storage

WORD COUNT EXAMPLE



f = spark.textFile("...")


words = f.flatMap(lambda line: line.split(" "))


occs = words.map(lambda word: (word, 1))


counts = occs.reduceByKey(lambda a, b: a + b)


counts.saveAsTextFile("...")

# create an RDD backed by the lines of a file

# ...mapping from lines of text to words

# ...mapping from words to occurrences

# ...reducing occurrences to counts

# POP QUIZ: what have we computed so far?

petascale storage, IN-MEMORY COMPUTE

storage/compute nodes

compute-only nodes, which primarily operate on cached post-ETL data

Data Science at Red Hat

the Emerging tech data science team

Engineers with distributed systems, data science, and scientific computing expertise
Goal: help internal customers solve data problems and make data-driven decisions
Principles: identify best practices, question outdated assumptions, use best-of-breed technology

development

Six compute-only nodes
Two nodes for Gluster storage
Apache Spark running under Apache Mesos
Open-source “notebook” interfaces to analyses

DATA SOURCES

FTP

MySQL

MongoDB

ElasticSearch

interactive query

TWO case studies

role analysis

Data source: historical configuration and telemetry data for internal machines from ElasticSearch
Data size: hundreds of GB
Analysis: identify machine roles based on the packages each has installed

budget forecasting

Data sources: operational log data for OpenShift Online (from MySQL), actual costs incurred by OpenShift
Data size: over 120 GB
Analysis: identify operational metrics most strongly correlated with operating expenses; model daily operating expense as a function of these metrics

Aggregating performance metrics:
17 hours in MySQL, 15 minutes in Spark!

Next steps

DEMO VIDEO

See a video demo of Continuum Analytics, PySpark, and Red Hat Storage: h.264 or Ogg Theora

where from here

Check out the Emerging Technology Data Science team's library to help build your own data-driven applications: https://github.com/willb/silex/
See my blog for articles about open source data science: http://chapeau.freevariable.com
Questions?

Using apache spark for analytics in the cloud

William C. Benton

About me

Forecast

Data processing

Challenges

MapReduce (2004)

Word count

Mapped inputs

shuffled records

reduced records

Hadoop (2005)

“Facts”

“FACTS”

“FACTS”

“FACTS”

Facts (revised)

Hadoop in 2015

Data processing in the cloud

Collocated data and compute

Elastic resources

distinct storage and compute

Introducing Spark

CREATING AN RDD

TRANSFORMING RDDs

TRANSFORMING RDDs

caching results

computing Results

WORD COUNT EXAMPLE

petascale storage, IN-MEMORY COMPUTE

Data Science at Red Hat

the Emerging tech data science team

development

DATA SOURCES

interactive query

TWO case studies

role analysis

budget forecasting

Next steps

DEMO VIDEO

where from here

thanks

Data processing
in the cloud