How to avoid columnar calamities: What no one told you about Apache Parquet
If you're dealing with structured data at scale, it's a safe bet that you're depending on Apache Parquet in at least a few parts of your pipeline. Parquet is a sensible default choice for storing structured data at rest because of two major advantages: its efficiency and its ubiquity. While Parquet's storage efficiency enables dramatically improved time and space performance for query jobs, its ubiquity may be even more valuable. Since Parquet readers and writers are available in a wide range of languages and ecosystems, the Parquet format can support a range of applications across the data lifecycle, including data engineering and ETL jobs, query engines, and machine learning pipelines.
However, the ubiquity of Parquet readers and writers hides some complexity: if you don't take care, some of the advantages of Parquet can be lost in translation as you move tables from Hadoop, Flink, or Spark jobs to Python machine learning code. This talk will help you understand Parquet more fully in order to use it more effectively, with an eye towards the special challenges that might arise in polyglot environments. We'll level-set with a quick overview of how Parquet works and why it's so efficient. We'll then dive into the type, encoding, and compression options available and discuss when each is most appropriate. You'll learn how to interrogate and understand Parquet metadata, and you'll learn about some of the challenges you'll run into when sharing data between JVM-based data engineering pipelines and Python-based machine learning pipelines. You'll leave this talk with a better understanding of Parquet and a roadmap pointing you away from some interoperability and performance pitfalls.
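To give a flavor of the metadata spelunking we'll cover, here's a minimal sketch using pyarrow, one of several Python Parquet readers (the file name is hypothetical); the talk digs into what these fields mean and why they matter for performance and interoperability:

    # A minimal sketch of inspecting Parquet metadata with pyarrow;
    # "events.parquet" is a hypothetical file name.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")
    print(pf.metadata)        # row groups, total rows, created_by, ...
    print(pf.schema_arrow)    # the logical schema as Arrow types

    # Per-column details for the first row group: physical type,
    # encodings, compression codec, and min/max statistics.
    rg = pf.metadata.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.physical_type, col.encodings,
              col.compression, col.statistics)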
Modernize Your Analytics Workloads for Apache Spark 3.0 and Beyond
Apache Spark 3.0 has been out for almost a year, and you’re probably running at least some production workloads against it today. However, many production Spark jobs may have evolved over the better part of a decade, and your code, configuration, and architecture may not be taking full advantage of all that Spark 3 has to offer.
In this talk, we’ll discuss the changes you might need to make to legacy applications in order to make the most of Apache Spark 3.0. You’ll learn about common sources of technical debt in mature Apache Spark applications and how to pay them down; when to replace hand-tuned configurations with Adaptive Query Execution; how to ensure that your queries can take advantage of columnar processing, including execution on GPUs; and how your Spark analytics workloads can directly incorporate accelerated ML training.
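As a taste of the configuration discussion, here's a hedged PySpark sketch that turns on Adaptive Query Execution instead of hand-tuning shuffle partition counts (the application name is made up and the settings shown are illustrative, not prescriptive):

    # Illustrative only: enable Adaptive Query Execution in Spark 3.x so the
    # planner can coalesce shuffle partitions and mitigate skewed joins at
    # runtime, rather than relying on a hand-tuned value of
    # spark.sql.shuffle.partitions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("modernized-analytics")   # hypothetical application name
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .config("spark.sql.adaptive.skewJoin.enabled", "true")
        .getOrCreate()
    )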
We’ll provide several concrete examples drawn from an end-to-end analytics application for customer churn modeling, from recent experience modernizing Apache Spark applications, and from lessons learned while maintaining a library of Apache Spark extensions across three major versions of Apache Spark.
Cloud Native Machine Learning Systems at Day Two and Beyond
You’re probably already convinced that Kubernetes is the right infrastructure for your next machine learning initiative, but you may not be ready for some of the speedbumps that await you on the way. This talk will introduce some of the challenges unique to machine learning systems, prepare you for the tradeoffs you’ll face supporting practitioners and putting systems in production, and present some of the additional tools you’ll need in your DevOps toolbox as your cloud-native machine learning systems mature. You’ll learn how to negotiate pitfalls related to interactive development, reproducibility, and monitoring machine learning systems in production with concrete solutions inspired by our experience with end-users in various industries.
Sketching Data and Other Magic Tricks
This hands-on tutorial explores ways to answer interesting queries about truly massive datasets almost instantly and with a fixed amount of space. It sounds like magic, but you’ll go hands-on to learn about the sketching data structures that work this magic and the key trick that makes them possible. Sophie and William introduce truly scalable techniques for several fundamental problems like set membership, set and document similarity, counting kinds of events, and counting distinct elements. You’ll learn how and when to use these structures as well as how they work. You’ll see how the same techniques work for parallel, distributed, and stream processing at scale. You’ll leave able to put these techniques to work in real data engineering and machine learning applications like join processing, document classification, and content personalization.
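To give a sense of what "hands-on" means here, the following is a tiny, illustrative Bloom filter in Python, roughly the level at which the tutorial treats set membership (the sizes and hash choices are not tuned for any real workload):

    # A minimal Bloom filter for approximate set membership: "no" answers are
    # always correct, while "yes" answers may occasionally be false positives.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 16, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from one strong hash of the item.
            digest = hashlib.sha256(item.encode("utf-8")).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("alice@example.com")
    print("alice@example.com" in bf)     # True
    print("mallory@example.com" in bf)   # almost certainly False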
Band-Aids Don't Fix Bullet Holes: Repairing the Broken Promises of Ubiquitous Machine Learning
Buoyed by expensive industrial research efforts, amazing engineering breakthroughs, and an ever-increasing volume of training data, machine learning techniques have recently seen successes on problems that seemed largely intractable twenty years ago. However, beneath awe-inspiring demos and impressive real-world results, there are cracks in the foundation: ordinary organizations struggle to get real insight or value out of their data and wonder how they’ve missed out on the promised democratization of AI and machine learning.
This talk will diagnose how we got to this point. You’ll see how the incentives and rhetoric of software and infrastructure vendors have led to inflated expectations. We’ll show how internal political pressures can encourage teams to aim for moonshots instead of realistic and meaningful goals. You’ll learn why contemporary frameworks that have enjoyed prominent successes on perception problems are almost certainly not the best fit for gleaning insights from structured business data. Finally, you’ll see why many of the solutions the industry has offered to real-world machine learning woes are essentially “bandages” that cover deep problems without addressing their causes.
This talk won’t merely offer a diagnosis without a prescription; we’ll conclude by showing that the way to avoid disappointing machine learning initiatives in the future isn’t a patchwork of superficial fixes to help us ignore that we’re solving the wrong problems. Instead, we need to radically simplify the way we approach learning from data by embracing broader definitions of “AI” and “machine learning.” Organizations should prioritize results over emulating research labs and practitioners should focus first on fundamental techniques including summaries, sketches, and straightforward models. These techniques are unlikely to attract acclaim on social media or in the technology press, but they are broadly applicable, allow practitioners to realize business value quickly, produce interpretable results, and truly democratize machine intelligence.
Machine learning and discovery with Kubernetes
Why Data Scientists Love Kubernetes
This talk will introduce the workflows and concerns of data scientists and machine learning engineers and demonstrate how to make Kubernetes a powerhouse for intelligent applications.
We’ll show how community projects like Kubeflow and radanalytics.io support the entire intelligent application development lifecycle. We’ll cover several key benefits of Kubernetes for a data scientist’s workflow, from experiment design to publishing results. You’ll see how well scale-out data processing frameworks like Apache Spark work in Kubernetes.
System operators will learn how Kubernetes can support data science and machine learning workflows. Application developers will learn how Kubernetes can enable intelligent applications and cross-functional collaboration. Data scientists will leave with concrete suggestions for how to use Kubernetes and open-source tools to make their work more productive.
Why Data Scientists Should Love Linux Containers
Learn how containers and automated build pipelines can realize the potential of interactive notebooks as truly reproducible research; how data scientists can use containers and workflows from the DevOps world to communicate with application development teams; how container platforms let data scientists scale experiments beyond their laptops with easy access to powerful and specialized hardware, simplify governing access to sensitive internal data, and provide a clearer path to regulatory compliance; and how to get started using key open source projects that enable data scientists and machine learning engineers to make the most of container technology.
Apache Spark for Library Developers
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover:
- issues to consider when developing parallel algorithms with Spark;
- designing generic, robust functions that operate on data frames and datasets;
- extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs);
- best practices around caching and broadcasting, and why these are especially important for library developers;
- integrating with ML pipelines;
- exposing key functionality in both Python and Scala; and
- how to test, build, and publish your library for the community.
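As a small preview of the UDF material, here's a hedged PySpark sketch of the kind of defensive, reusable function a library might expose (the function, application, and column names are hypothetical):

    # Minimal sketch of a library-friendly user-defined function in PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    @udf(returnType=StringType())
    def normalize_domain(email):
        # Library code should be defensive: columns may contain nulls or junk.
        if email is None or "@" not in email:
            return None
        return email.split("@", 1)[1].lower()

    df = spark.createDataFrame([("alice@Example.COM",), (None,)], ["email"])
    df.select(normalize_domain("email").alias("domain")).show()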
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Building Machine Learning Algorithms on Apache Spark: Scaling Out and Up
There are lots of reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try to reproduce results from a recent research paper, or simply use an existing technique that isn’t implemented in MLlib.
In this talk, we’ll walk through the process of developing a new machine learning algorithm for Spark. We’ll start with the basics, by considering how we’d design a scale-out parallel implementation of our unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark.
We’ll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. We’ll conclude by briefly examining some useful techniques to complement scale-out performance by scaling our code up, taking advantage of specialized hardware to accelerate single-worker performance.
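For a hint of that starting point, here's a hedged sketch of a single RDD-based update step for k-means, standing in for the unsupervised technique we'll build up in the talk (the data and cluster count are made up):

    # Illustrative only: one k-means assignment-and-update step over an RDD.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-kmeans-sketch")
    points = sc.parallelize(np.random.rand(1000, 2).tolist()).map(np.array).cache()
    centers = points.takeSample(False, 3)

    def closest(point, centers):
        # Index of the nearest center by squared Euclidean distance.
        return int(np.argmin([np.sum((point - c) ** 2) for c in centers]))

    # Assign each point to its nearest center, then average each cluster.
    sums = (points
            .map(lambda p: (closest(p, centers), (p, 1)))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    new_centers = [total / count for _, (total, count) in sorted(sums.collect())]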
You’ll leave this talk with everything you need to build a new machine learning technique that runs on Spark.
Probabilistic Structures for Scalable Computing
In this talk you'll learn about streaming algorithms and approximate data structures to characterize data sources that are too big to keep around or difficult to replay. We'll start simple, with an algorithm for on-line mean and variance estimates of a stream of samples. Then we'll look at Bloom filters (for approximate set membership), count-min sketch (for approximate counts of individual elements in a multiset), and HyperLogLog (for approximate set cardinality). We'll cover implementing these algorithms, using them for data analysis (and even machine learning), and provide some intuition for why they work at scale. Come with reading knowledge of Python and leave with some cool new options in your scalable data processing toolbox!
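As a preview of that opening example, here's a minimal Python sketch of online mean and variance estimation in the style of Welford's algorithm: one pass over the stream, constant space, and numerically stable updates:

    # Running mean and variance of a stream without storing the samples.
    class RunningStats:
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0   # sum of squared deviations from the running mean

        def push(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self):
            # Sample variance; defined once we've seen at least two values.
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    stats = RunningStats()
    for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.push(sample)
    print(stats.mean, stats.variance)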
Note that the YouTube video for this talk is audio-only; the actual talk was delivered without slides due to projector malfunction.
Building Machine Learning Algorithms on Apache Spark
There are many reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try to reproduce results from a recent research paper, or simply use an existing technique that isn’t implemented in MLlib. In this talk, we’ll walk through the process of developing a new machine learning model for Spark. We’ll start with the basics, by considering how we’d design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we’ll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. You’ll leave this talk with everything you need to build a new machine learning technique that runs on Spark.
The Revolution Will Be Containerized • Architecting the Intelligent Applications of Tomorrow
Linux containers are increasingly popular with application developers: they offer improved elasticity, fault-tolerance, and portability between different public and private clouds, along with an unbeatable development workflow. It’s hard to imagine a technology that has had more impact on application developers in the last decade than containers, with the possible exception of ubiquitous analytics. Indeed, analytics is no longer a separate workload that occasionally generates reports on things that happened yesterday; instead, it pulses beneath the rhythms of contemporary business and supports today’s most interesting and vital applications. Since applications depend on analytic capabilities, it makes good sense to deploy our data-processing frameworks alongside our applications.
In this talk, you’ll learn from our expertise deploying Apache Spark and other data-processing frameworks in Linux containers on Kubernetes. We’ll explain what containers are and why you should care about them. We'll cover the benefits of containerizing applications, architectures for analytic applications that make sense in containers, and how to handle external data sources. You’ll also get practical advice on how to ensure security and isolation, how to achieve high performance, and how to sidestep and negotiate potential challenges. Throughout the talk, we’ll refer back to concrete lessons we’ve learned about containerized analytic jobs ranging from interactive notebooks to production applications. You’ll leave inspired and enabled to deploy high-performance analytic applications without giving up the security you need or the developer-friendly workflow you want.
Some Things You Learn Running Apache Spark in Production for Three Years
Apache Spark is one of the most exciting open-source data-processing frameworks today. It features a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you might face while going from a proof of concept to a real-world application. This talk will distill our experiences as early adopters of Spark in production, present a case study where using Spark effectively provided huge benefits over legacy solutions, and provide concrete advice regarding:
- How to integrate Spark with external data sources
- How best to deploy and manage Spark in the cloud
- The tradeoffs of various archive storage options for Spark
- Configuring machines for data processing
- How to evaluate predictive models and make sense of the analytic components of insightful applications
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performance
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
Big Data In Production: Bare Metal to OpenShift
Apache Spark is one of the most exciting open-source data-processing frameworks today. It features a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you might face while going from a proof of concept to a real-world application. This talk will distill our experiences as early adopters of Spark in production, present a case study where using Spark effectively provided huge benefits over legacy solutions, explain why we migrated from a dedicated Spark cluster to OpenShift, and provide concrete advice regarding:
- how to integrate Spark with external data sources (including databases, in-memory data grids, and message queues),
- how best to deploy and manage Spark in the cloud,
- the tradeoffs of various archive storage options for Spark,
- how to evaluate predictive models and make sense of the analytic components of insightful applications, and
- how to integrate Spark into microservice applications on OpenShift.
This talk assumes some familiarity with Apache Spark but will provide context for attendees who are new to Spark. You’ll learn from a seasoned Red Hat engineer with over three years of experience running Spark in production and contributing to the Spark community.
Insightful Apps with Apache Spark and OpenShift
Nearly all of today’s most exciting applications are insightful applications: they employ machine learning and large-scale data processing to improve with longevity and popularity. It’s an easy bet that the important applications of tomorrow will be insightful as well. It’s also an easy bet that you’ll want to be deploying tomorrow’s applications on a contemporary container platform with a great developer workflow like OpenShift.
Insightful applications pose some new challenges for developers, but this hands-on workshop will show you how to navigate them confidently. You'll learn how to develop an insightful application on OpenShift with Apache Spark from the ground up. We’ll cover:
- architectures for analytic applications and microservices;
- a crash course in Apache Spark, some data science techniques, and OpenShift;
- how to deploy Apache Spark as part of an OpenShift application; and
- building a data-driven application from the ground up.
This workshop is largely self-contained: the only prerequisite is some familiarity with Python. Learn from the experience of Red Hat emerging technology engineers who are focused on bringing data-driven application development to OpenShift!
Containerized Spark on Kubernetes
Consider two recent trends in application development: more and more applications are taking advantage of architectures involving containerized microservices in order to enable improved elasticity, fault-tolerance, and scalability — whether in the public cloud or on-premise. In addition, analytic capabilities and scalable data processing have increasingly become a basic requirement for contemporary applications. The confluence of these trends suggests that there are a lot of good reasons to want to manage Spark with a container orchestration platform, but it’s not quite as simple as packaging up a standalone cluster in containers. This talk will present our team’s experiences migrating a production Spark cluster from a multi-tenant Mesos cluster to a shared compute resource managed by Kubernetes. We’ll explain the motivation behind microservices and containers and identify the architectures that make sense for containerized applications that depend on Spark. We’ll pay special attention to practical concerns of running Spark in containers, including networking, access control, persistent storage, and multitenancy. You’ll leave this talk with a better understanding of why you might want to run Spark in containers and some concrete ideas for how to get started doing it.
Big Data and Apache Spark on OpenShift Pt. II
The first meeting of the OpenShift Commons Big Data Special Interest Group expanded on a previous Commons session, Big Data and Apache Spark on OpenShift (Part 1), which kicked off the Big Data SIG.
In the previous session, Red Hat’s Will Benton gave us a vocabulary for talking about data-driven applications and outlined some example architectures for building data-driven applications with microservices. In this SIG session, he gave us an introduction to using Apache Spark on OpenShift and walked through an example data-driven application.
Big Data and Apache Spark on OpenShift Pt. I
In this introductory Big Data OpenShift Commons Briefing session, Red Hat’s Will Benton gave an overview of Big Data architecture and concepts to help level the playing field, to build a better understanding of what a data-intensive application should actually look like on a modern container orchestration platform, and to help kick off the OpenShift Commons Big Data SIG.
In this recording, you’ll learn about the anatomy of data-intensive applications, how they come to life, and what they have to accomplish. We walked through a few applications, explored their responsibilities, saw how they use data, discussed trade-offs they must negotiate, and pointed to some example architectures that make sense for realizing data-intensive applications on OpenShift.
Analyzing Log Data With Apache Spark
Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This aggregated “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured. This session will introduce the log processing domain and provide practical advice for analyzing log data with Apache Spark, including:
- how to impose a uniform structure on disparate log sources;
- machine-learning techniques to detect infrastructure failures automatically and characterize the text of log messages;
- best practices for tuning Spark, training models against structured data, and ingesting data from external sources like Elasticsearch; and
- a few relatively painless ways to visualize your results.
You’ll have a better understanding of the unique challenges posed by infrastructure log data after this session. You’ll also learn the most important lessons from our efforts both to develop analytic capabilities for an open-source log aggregation service and to evaluate these at enterprise scale.
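For a concrete taste of that first step, here's a hedged PySpark sketch that imposes a uniform structure on raw log lines with a regular expression (the input path and the pattern are illustrative, not what any particular log source requires):

    # Illustrative only: turn unstructured log lines into a structured DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.appName("log-structuring-sketch").getOrCreate()
    raw = spark.read.text("logs/*.log")   # hypothetical path; one line per row

    # A simplistic pattern for lines like "2017-03-01T12:00:00 host42 ERROR disk full"
    pattern = r"^(\S+)\s+(\S+)\s+(\w+)\s+(.*)$"
    structured = raw.select(
        regexp_extract("value", pattern, 1).alias("timestamp"),
        regexp_extract("value", pattern, 2).alias("host"),
        regexp_extract("value", pattern, 3).alias("level"),
        regexp_extract("value", pattern, 4).alias("message"),
    )
    structured.groupBy("host", "level").count().show()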
Data Science for the Datacenter: Analyzing Logs with Apache Spark
Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.
In this session, Will Benton will introduce the log processing domain and give you practical advice for using Apache Spark to analyze log data, including data engineering techniques to impose structure on disparate log sources; data science approaches to detect infrastructure failures; language-processing techniques to characterize the text of log messages; best practices for tuning Spark and using newer Spark features; and how to visualize your results. You’ll learn from Benton’s experience developing applications that analyze the vast log data generated within Red Hat’s network and leave well-prepared to analyze your own logs.
Diagnosing Open-Source Community Health with Spark
Successful companies use analytic measures to identify and reward their best projects and contributors. Successful open source developers often make similar decisions when they evaluate whether or not to reward a project or community by investing their time. This talk will show how Spark enables a data-driven understanding of the dynamics of open source communities, using operational data from the Fedora Project as an example. With thousands of contributors and millions of users, Fedora is one of the world’s largest open-source communities. Notably, Fedora also has completely open infrastructure: every event related to the project’s daily operation is logged to a public messaging bus, and historical event data are available in bulk. We’ll demonstrate best practices for using Spark SQL to ingest bulk data with rich, nested structure, using ML pipelines to make sense of software community data, and keeping insights current by processing streaming updates.
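To illustrate the ingest step, here's a hedged PySpark sketch of loading bulk message-bus events with nested structure into Spark SQL and flattening one nested field for aggregation (the path and field names are hypothetical, not Fedora's actual schema):

    # Illustrative only: ingest nested JSON records and flatten a nested array.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("community-health-sketch").getOrCreate()

    # Hypothetical bulk export of message-bus events, one JSON object per line.
    events = spark.read.json("fedmsg-export/*.json")
    events.printSchema()   # Spark infers the nested structure for us

    # Flatten a hypothetical nested "packages" array to aggregate per package.
    per_package = events.select(
        col("topic"),
        col("msg.agent").alias("contributor"),
        explode(col("msg.packages")).alias("package"),
    )
    per_package.groupBy("package").count().orderBy("count", ascending=False).show()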
Improving Spark Application Performance
Apache Spark presents an elegant and powerful set of high-level abstractions for developing distributed data-processing applications. Analysts who use Spark can rapidly prototype applications and experiment with new techniques at scale. However, to make the most of Spark, developers need to understand both the abstractions and how Spark will schedule and execute their code.
This talk will show you how to improve Spark application performance by working with, not against, Spark's operational model. We'll start with a real prototype Spark application and apply several simple, generally applicable transformations to make it more efficient and scalable. For each transformation, we'll look both at why it works, considering the relevant details of Spark's internals, and how well it works, considering its impact on overall application performance. You'll leave this talk with an improved understanding of how Spark runs your code and some additional tools to make your big data apps even more efficient.
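One example of the kind of simple, generally applicable transformation we'll discuss: caching a result that several downstream actions reuse, so Spark doesn't recompute the expensive upstream work each time (a hedged sketch; the input file and column names are hypothetical):

    # Illustrative only: cache a DataFrame that is reused by multiple actions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("perf-sketch").getOrCreate()
    events = spark.read.parquet("events.parquet")   # hypothetical input

    cleaned = events.filter(col("status").isNotNull()).cache()

    # Both actions below reuse the cached result instead of re-reading
    # and re-filtering the input.
    print(cleaned.count())
    cleaned.groupBy("status").count().show()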
Analyzing endurance-sports activity data with Spark
Spark’s support for efficient execution and rapid interactive prototyping enables novel approaches to understanding data-rich domains that have historically been underserved by analytical techniques. One such field is endurance sports, where athletes are faced with GPS and elevation traces as well as samples from heart rate, cadence, temperature, and wattage sensors. These data streams can be somewhat comprehensible at any given moment, when looking at a small window of samples on one’s watch or cycle computer, but are overwhelming in the aggregate.
In this talk, I’ll present my recent efforts using Spark and MLlib to mine my personal cycling training data for deeper insights and help me design workouts to meet particular fitness goals. This work incorporates analysis of geographic and time-series data, computational geometry, visualization, and domain knowledge of exercise physiology. I’ll show how Spark made this work possible, demonstrate some novel techniques for analyzing fitness data, and discuss how these approaches could be applied to make sense of data from an entire community of cyclists.