Three key things to remember about Apache Spark RDD Operations

There are three key concepts that are essential for the beginner Apache Spark Developer, we will cover them here. If you want to receive a condensed summary of the most relevant news in Big Data, Data Science and Advanced Analytics do not forget to subscribe to our newsletter, we send it once a month so you get the very best, only once a month.

All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:

  • Creating RDDs does not need to be a hard, involved process: For your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and enhances your productivity in your sandbox environment.
  • Remember to persist your RDDs: This has to do with the fact that Spark RDDs are lazy, transformations are not executed as you define them, only once you ask Spark for a result.  Experienced data scientists will define a base RDD and then create different sub-sets through transformations, every time you define one of these subsets by defining a new RDD remember to persist it, otherwise this will execute again and again every time you ask for the results of a downstream RDD.
  • Remember that RDDs in Spark are immutable: A big reason why the previous point is difficult to digest for new Spark developers is that we are not accustomed to the Functional Programming paradigm that underlies Spark. In our regular programming languages we create a variable that references a specific space in memory and then we are able to assign distinct values to this variable in different parts of our program. In Functional programming each object or variable is immutable, every time you create a new RDD based on a the results of an upstream RDD Apache will execute  all of the logic that led to the creation of the source RDD.

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) Operations, stay tuned for additional coverage on best practices as well as for Apache Spark Data Frames.

 

 

What is an Apache Spark RDD?

Image result for what is a spark rdd
Apache Spark Resilient Distributed Dataset (RDD)

Apache Spark  Resilient Distributed Datasets (RDDs) are the main vehicle used by the processing engine to represent a dataset. Given that the name itself is pretty self explanatory let’s look into each of these attributes in additional detail:

  • Distributed: This is the key attribute of RDDs, an RDD is a collection of partitions or fragments distributed across processing nodes, this allows Spark to fit and process massive data sets in memory by distributing the workload in parallel across a collection of worker nodes.
  • Resilient: The ability to recover from processing from failure, this is achieved by storing multiple copies of each fragment on multiple working nodes, if a working node goes offline that workload can be relocated to another node containing the same fragment.

 

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs), stay tuned for additional coverage on RDD operations and best practices as well as for Apache Spark Data Frames.

 

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html