Over the last few weeks, a question I’ve regularly been asked is where to find resources about Spark, Machine Learning, and Distributed Systems. While they seem to be disparate problems, the fact is that as a Data Engineer (or someone in Data Sciences Engineering, or a Data Scientist who loves scalability and performance) you need to get your feet wet in all three disciplines to truly excel.
Apache Spark
Let’s start with Apache Spark (disclosure: I am with Databricks, the company founded by the creators of Apache Spark). I am a big fan of Apache Spark because it handles, in a performant manner, so many of the scenarios that a data engineer must deal with: data transformations; code in Python, Scala, Java, or R; fast structured querying with DataFrames; machine learning with MLlib; graph processing with GraphX; etc. A great resource to learn more about Apache Spark is the recently launched SparkHub: A Community Site for Apache Spark.
Machine Learning for Developers
There are some great machine learning resources for developers, including the edX MOOC Scalable Machine Learning (sponsored by Databricks). There is also a solid read that was recently shared with me by Alexander Stojanovic (a good friend who led the creation of Hadoop on Windows and Azure): Machine Learning for Developers by Mike de Waard. He does a great job of explaining Machine Learning concepts in a way that developers can understand.
Distributed Systems
Some friends of mine over at WearHacks were discussing distributed systems theory and called out Henry Robinson’s blog post listing great papers, Distributed systems theory for the distributed systems engineer. Some fun ones that I would include in the list are:
- Omega: flexible, scalable schedulers for large compute clusters
- Raft Consensus Algorithm
- Leslie Lamport’s Paxos Made Simple
I’m sure there are many more papers worth reading, but this should be a great starting point. Comment away if you think there are more papers that should be included, eh?!
Enjoy!