Data Science for the Datacenter: Analyzing Logs with Apache Spark


May 11, 2016

Presented at Apache: Big Data (Vancouver, British Columbia)

Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.

In this session, Will Benton will introduce the log processing domain and give you practical advice for using Apache Spark to analyze log data, including data engineering techniques to impose structure on disparate log sources; data science approaches to detect infrastructure failures; language-processing techniques to characterize the text of log messages; best practices for tuning Spark and using newer Spark features; and how to visualize your results. You’ll learn from Benton’s experience developing applications that analyze the vast log data generated within Red Hat’s network and leave well-prepared to analyze your own logs.

Slides Handout