I’m speaking at Spark Summit today about using Spark to analyze operational data from the Fedora project. Here are some links to further resources related to my talk:
- My talk slides are online; the deck includes some extra slides that I skipped when delivering the talk
- An earlier post provides some background on fedmsg and Spark
- You may also be interested in a higher-level discussion of issues with schema inference from the perspective of type theory
- Here’s the annotated source code for the ML pipeline transformer I discussed in my talk
You should also check out my team’s Silex library, which contains useful code factored out of real Spark applications we’ve built in Red Hat’s Emerging Technology group. It includes a lot of cool functionality, but the part I mentioned in the talk is this handy interface for preprocessing JSON data before inferring a schema.
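
To give a rough sense of why preprocessing before schema inference is useful, here is a minimal sketch in plain Python. It is not Silex's interface and does not use Spark; the `normalize` and `infer_schema` functions, the `tags` field, and the sample records are all hypothetical, made up for illustration. The idea is simply that a field whose type varies from record to record yields a conflicting inferred type unless each record is cleaned up first.

```python
import json


def normalize(record):
    """Hypothetical cleanup step: wrap a scalar 'tags' value in a list."""
    out = dict(record)
    if "tags" in out and not isinstance(out["tags"], list):
        out["tags"] = [out["tags"]]
    return out


def infer_schema(lines, preprocess=lambda r: r):
    """Map each top-level field to the sorted Python type names seen for it.

    `preprocess` is applied to every parsed record before its fields are
    inspected; by default records are used unchanged.
    """
    seen = {}
    for line in lines:
        record = preprocess(json.loads(line))
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items()}


# Made-up records: 'tags' is a string in one and a list in the other.
lines = [
    '{"id": 1, "tags": "build"}',
    '{"id": 2, "tags": ["build", "test"]}',
]

raw = infer_schema(lines)                          # 'tags' has two types
clean = infer_schema(lines, preprocess=normalize)  # 'tags' has one type
```

Without the preprocessing step, `tags` is reported with two conflicting types; with it, every field has a single type. A real schema inferencer has to resolve such conflicts somehow, often by falling back to a less useful type, which is why fixing records up front can pay off.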