There are three key concepts that are essential for the beginner Apache Spark Developer, we will cover them here. If you want to receive a condensed summary of the most relevant news in Big Data, Data Science and Advanced Analytics do not forget to subscribe to our newsletter, we send it once a month so you get the very best, only once a month.
All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:
- Creating RDDs does not need to be a hard, involved process: For your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and enhances your productivity in your sandbox environment.
- Remember to persist your RDDs: This has to do with the fact that Spark RDDs are lazy, transformations are not executed as you define them, only once you ask Spark for a result. Experienced data scientists will define a base RDD and then create different sub-sets through transformations, every time you define one of these subsets by defining a new RDD remember to persist it, otherwise this will execute again and again every time you ask for the results of a downstream RDD.
- Remember that RDDs in Spark are immutable: A big reason why the previous point is difficult to digest for new Spark developers is that we are not accustomed to the Functional Programming paradigm that underlies Spark. In our regular programming languages we create a variable that references a specific space in memory and then we are able to assign distinct values to this variable in different parts of our program. In Functional programming each object or variable is immutable, every time you create a new RDD based on a the results of an upstream RDD Apache will execute all of the logic that led to the creation of the source RDD.
I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) Operations, stay tuned for additional coverage on best practices as well as for Apache Spark Data Frames.