I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:
- Our team’s Silex library is where I’ve published my ongoing work to develop a self-organizing map implementation for Spark and to extend it with support for data frames and ML pipelines
- I gave a talk about using self-organizing maps in Spark last year at Spark Summit
- If you like the idea of developing new ML techniques on Spark, you’ll also want to attend a session tomorrow in which my friend and teammate Erik Erlandson will be talking about using his parallel t-digest implementation to support feature importance and other applications.
- Finally, if you’re doing anything where parallelism and scale matter, especially in a cloud-native environment, you should also check out Mike McCune’s talk on Spark monitoring and metrics.