Principal Software Engineer
Red Hat Emerging Technology
June 24, 2015
Recent history & persistent mythology
What makes distributed data processing difficult?
"a b"
"c e"
"a b"
"d a"
"d b"
(a, 1)
(b, 1)
(c, 1)
(e, 1)
(a, 1)
(b, 1)
(d, 1)
(a, 1)
(d, 1)
(b, 1)
(a, 1)
(a, 1)
(a, 1)
(b, 1)
(b, 1)
(b, 1)
(c, 1)
(d, 1)
(d, 1)
(e, 1)
(a, 3)
(b, 3)
(c, 1)
(d, 2)
(e, 1)
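The logic itself is simple; for reference, here is a minimal single-machine sketch of the same three stages (map, group by key, reduce) in plain Python. The difficulty comes from running these stages when the lines, the pairs, and the groups are spread across many machines.

from collections import defaultdict

lines = ["a b", "c e", "a b", "d a", "d b"]

# map: turn each line into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split(" ")]

# shuffle: group the pairs by key
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# reduce: sum the counts in each group
counts = {word: sum(cs) for word, cs in groups.items()}

print(sorted(counts.items()))   # [('a', 3), ('b', 3), ('c', 1), ('d', 2), ('e', 1)]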
...at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB.
Appuswamy et al., “Nobody ever got fired for buying a cluster.” Microsoft Research Tech Report.
Takeaway #1: you may need petascale storage, but you probably don't even need terascale compute.
Takeaway #2: moderately sized workloads benefit more from scale-up than scale-out.
Contrary to our expectations ... CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%
Ousterhout et al., “Making Sense of Performance in Data Analytics Frameworks.” USENIX NSDI ’15.
Takeaway #3: I/O is not the bottleneck (especially in moderately sized jobs); focus on CPU performance.
Takeaway #4: colocating data and compute was a sensible choice for petascale jobs in 2005, but it shouldn't necessarily be the default today.
How our assumptions should change
Combine the best storage system for your application with elastic compute resources.
Apache Spark is a framework for distributed computing based on a high-level, expressive abstraction.
Language bindings for Scala, Java, Python, and R
Access data from JDBC, Gluster, HDFS, S3, and more
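For the file-backed stores, access is usually just a matter of the URI passed to the RDD creation calls shown a few slides later. A hedged sketch, assuming a SparkContext named spark as in the other snippets (the paths, hostnames, and bucket names are placeholders):

# a file on a locally mounted volume (e.g. a Gluster mount)
spark.textFile("file:///mnt/gluster/data/hamlet.txt")

# a file in HDFS
spark.textFile("hdfs://namenode:8020/data/hamlet.txt")

# an object in S3 (circa-2015 s3n scheme; credentials assumed to be configured)
spark.textFile("s3n://some-bucket/data/hamlet.txt")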
A resilient distributed dataset (RDD) is a partitioned, immutable, lazy collection.

The PARTITIONS making up an RDD can be distributed across multiple machines.

TRANSFORMATIONS create new (lazy) collections; ACTIONS force computations and return results. (Both ideas are sketched just below.)
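A minimal sketch of both ideas, using the same spark context as the other snippets here; the partition count and the numbers themselves are arbitrary:

# an RDD split into four partitions, which Spark can spread across machines
numbers = spark.parallelize(range(1, 1000), 4)
numbers.getNumPartitions()              # => 4

# a transformation: returns a new lazy RDD immediately, computing nothing yet
doubled = numbers.map(lambda x: x * 2)

# an action: forces the computation and returns a result to the driver
doubled.count()                         # => 999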
# from an in-memory collection
spark.parallelize(range(1, 1000))

# from the lines of a text file
spark.textFile("hamlet.txt")

# from a Hadoop-format binary file
spark.hadoopFile("...")
spark.sequenceFile("...")
spark.objectFile("...")
# transform each element independently
numbers.map(lambda x: x + 1)

# turn each element into zero or more elements
lines.flatMap(lambda s: s.split(" "))

# reject elements that don't satisfy a predicate
vowels = ['a', 'e', 'i', 'o', 'u']
words.filter(lambda s: s[0] in vowels)

# keep only one copy of duplicate elements
words.distinct()
# return an RDD of key-value pairs, sorted by the keys of each
pairs.sortByKey()

# combine every two pairs having the same key, using the given reduce function
pairs.reduceByKey(lambda x, y: max(x, y))

# join together two RDDs of pairs so that [(a, b)] join [(a, c)] == [(a, (b, c))]
pairs.join(other_pairs)
sorted_pairs = pairs.sortByKey()

# tell Spark to cache this RDD in cluster memory after we compute it
sorted_pairs.cache()

# as above, except also store a copy on disk
# (in PySpark this needs: from pyspark import StorageLevel)
sorted_pairs.persist(StorageLevel.MEMORY_AND_DISK)

# uncache and free this result
sorted_pairs.unpersist()
# compute this RDD and return a count of elements
numbers.count()

# compute this RDD and materialize it as a local collection
counts.collect()

# compute this RDD and write each partition to stable storage
words.saveAsTextFile("...")
# create an RDD backed by the lines of a file
f = spark.textFile("...")

# ...mapping from lines of text to words
words = f.flatMap(lambda line: line.split(" "))

# ...mapping from words to occurrences
occs = words.map(lambda word: (word, 1))

# ...reducing occurrences to counts
counts = occs.reduceByKey(lambda a, b: a + b)

# POP QUIZ: what have we computed so far?
counts.saveAsTextFile("...")
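(The answer, per the laziness point above: nothing, until the saveAsTextFile action on the last line runs; everything before it is a lazy transformation.) A small sketch of inspecting and forcing the same pipeline another way; the exact lineage output varies by Spark version:

# counts is still just a recipe: print its lineage of pending transformations
print(counts.toDebugString())

# any action forces the work; take a few (word, count) pairs instead of saving
counts.take(3)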
storage/compute nodes
compute-only nodes, which primarily operate on cached post-ETL data
FTP
S3
MySQL
MongoDB
ElasticSearch
Aggregating performance metrics: 17 hours in MySQL, 15 minutes in Spark!
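The deck doesn't show that job's code, but as a rough, hypothetical illustration only, an aggregation of this shape in PySpark could look like the following (the records RDD and its (host, metric, value) layout are assumptions, not the actual pipeline):

# records: an RDD of (host, metric, value) tuples produced during ETL,
# e.g. ("db01", "latency_ms", 12.0)  -- hypothetical layout
keyed = records.map(lambda r: ((r[0], r[1]), (r[2], 1)))

# sum values and occurrence counts per (host, metric) key...
sums = keyed.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# ...then divide to get a mean per key, and pull back a sample
means = sums.mapValues(lambda s: s[0] / s[1])
means.take(5)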
See a video demo of Continuum Analytics, PySpark, and Red Hat Storage: H.264 or Ogg Theora