What is an Apache Spark RDD?

Apache Spark Resilient Distributed Dataset (RDD)

Apache Spark Resilient Distributed Datasets (RDDs) are the main vehicle the processing engine uses to represent a dataset. Given that the name itself is fairly self-explanatory, let’s look at each of its attributes in more detail:

  • Distributed: This is the key attribute of an RDD. An RDD is a collection of partitions, or fragments, spread across processing nodes, which allows Spark to fit and process massive datasets in memory by distributing the workload in parallel across a collection of worker nodes (a small PySpark sketch follows this list).
  • Resilient: The ability to recover from a processing failure. This is achieved by storing multiple copies of each fragment on multiple worker nodes; if a worker node goes offline, its workload can be relocated to another node that holds the same fragment.
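
To make this concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the collection, partition count, and app name are just illustrative) that creates an RDD from a plain Python collection and inspects how it is split into partitions:

```python
from pyspark import SparkContext

# Local SparkContext for experimentation; "local[4]" simulates four worker threads.
sc = SparkContext("local[4]", "rdd-intro")

# Distribute a plain Python collection across 4 partitions (the "distributed" attribute).
numbers = sc.parallelize(range(1, 1001), numSlices=4)

print(numbers.getNumPartitions())   # 4 partitions spread across the workers
print(numbers.sum())                # 500500, computed in parallel across partitions

sc.stop()
```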

 

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage of RDD operations and best practices, as well as of Apache Spark DataFrames.

 

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html

The Google File System’s conscious design tradeoffs

This is my first post on the Google File System, in which I will briefly touch on a specific set of features driven by conscious design tradeoffs that have made GFS, and the systems derived from it, so successful.

  1.  Highly Redundant Data vs. Highly Available Hardware

    When working with petabytes of data, hardware failure is the norm rather than the exception. Expensive, highly redundant hardware is replaced with commodity components, which allows the file system to store multiple copies of data across storage nodes and switches at a reasonable cost.

  2.  Store a small number of large files vs. millions of small individual documents

    With the need to store hundreds of terabytes composed of billions of small objects (e.g. email messages, web pages), GFS simplifies file system design by serializing these small individual objects and grouping them together into larger files. Keeping a small number of large files allows GFS to hold all file and namespace metadata in memory on the GFS master, which in turn lets the master use this global visibility to make smarter load balancing and redundancy decisions.

  3.  Generally Immutable data

    Once a serialized object or file record is written to disk, it is never updated again; as Google states in the research paper, random writes are practically non-existent. This is driven by application requirements: data is generally written once and then consumed by applications over time without alteration. Google describes application data as mutating only by inserting new records or by appending to the last “chunk” (block) of a file, and applications are encouraged to constrain their update strategies to these two operations (a short sketch of this append-only pattern follows this list).
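
GFS itself does not expose a public API, so the plain-Python sketch below only mimics the access pattern described above: many small records serialized and appended to one large file that is never rewritten in place. The file name, record layout, and helper names are hypothetical.

```python
import json

def append_records(path, records):
    """Serialize small objects (e.g. crawled pages) and append them to one large file."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")   # append-only: no seeks, no in-place updates

def read_records(path):
    """Consumers stream the file sequentially from beginning to end."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical usage: batch small crawled documents into a single large log file.
append_records("pages.log", [{"url": "http://example.com", "html": "<html>...</html>"}])
for rec in read_records("pages.log"):
    print(rec["url"])
```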

In my next series of posts I will analyze other architecture and performance characteristics that make the Google File System brilliantly innovative. Stay tuned!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

What is Apache Spark Streaming?

 

Continuing the rapid innovation of the Apache Spark code base, the Spark Streaming API allows enterprises to leverage the full power of the Spark architecture to process real-time workloads.

Built on the foundation of core Spark, Spark Streaming can consume data from common real-time pipelines such as Apache Kafka, Apache Flume, Amazon Kinesis, and TCP sockets, and run complex algorithms over that data (MLlib predictive models, GraphX algorithms). Results can then be displayed on real-time dashboards or stored in HDFS.
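
As a minimal illustration, the PySpark sketch below (assuming Spark 2.x with a local installation and a test TCP source such as `nc -lk 9999`; the host, port, and batch interval are just examples) counts words arriving on a socket in ten-second micro-batches:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Read lines from a TCP socket and count words over 10-second micro-batches.
sc = SparkContext("local[2]", "streaming-intro")
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each micro-batch; results could also be written to HDFS

ssc.start()
ssc.awaitTermination()
```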

Apache Spark Streaming Architecture
Reference:
  • Apache Spark Streaming Programming Guide:
    https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html

Three key things to remember about Apache Spark RDD Operations

There are three key concepts that are essential for the beginner Apache Spark developer, and we will cover them here. If you want a condensed summary of the most relevant news in Big Data, Data Science, and Advanced Analytics, don’t forget to subscribe to our newsletter; we send it just once a month, so you get only the very best.

All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:

  • Creating RDDs does not need to be a hard, involved process: In your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and boosts your productivity in your sandbox environment.
  • Remember to persist your RDDs: Spark RDDs are lazy; transformations are not executed as you define them, only once you ask Spark for a result. Experienced data scientists will define a base RDD and then create different subsets through transformations. Every time you define one of these subsets as a new RDD, remember to persist it; otherwise its lineage will execute again and again every time you ask for the results of a downstream RDD.
  • Remember that RDDs in Spark are immutable: A big reason the previous point is hard to digest for new Spark developers is that we are not accustomed to the functional programming paradigm that underlies Spark. In our usual programming languages we create a variable that references a specific space in memory and then assign distinct values to that variable in different parts of our program. In functional programming each object or variable is immutable; every time you create a new RDD from the results of an upstream RDD, Spark will re-execute all of the logic that led to the creation of the source RDD unless it has been persisted (the sketch after this list illustrates all three points).
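
Here is a small PySpark sketch that ties the three points together (assuming a local Spark installation; `sales.csv`, its column layout, and the filter threshold are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-operations")

# 1. Creating an RDD does not have to be hard: load a local CSV-style text file
#    (or parallelize a collection) -- no HDFS required in a sandbox.
rows = sc.textFile("sales.csv").map(lambda line: line.split(","))

# 2. Persist the subsets you will reuse: transformations are lazy, so without
#    persist() this lineage is recomputed for every downstream action.
large_orders = rows.filter(lambda cols: float(cols[2]) > 1000).persist()

# 3. RDDs are immutable: filter() never changes `rows`, it returns a new RDD.
print(large_orders.count())   # first action triggers the computation and caches the result
print(large_orders.take(5))   # served from the persisted copy, not recomputed

sc.stop()
```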

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) operations. Stay tuned for additional coverage of best practices, as well as of Apache Spark DataFrames.

Google File System Design Assumptions

 

In today’s post I want to highlight the brilliance of the Google Research team. Their ability to step back and question old assumptions reminds me of the Wright brothers realizing that the lift values from the 1700s, and other widespread assumptions of the time, were the main constraints holding them back from building the first airplane.

 

At Google Research something similar happened when they realized that traditional data storage and processing paradigms did not fit well with their applications’ processing workloads. Here are some of the design assumptions for the Google File System, taken straight from the published research paper, with my comments:

 

  1. Failure is an expectation, not an exception

    Google realized that the traditional way to address failure in the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost, both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is recognizing that when dealing with massive datasets (e.g. a copy of the entire web), hardware failure is a fact of life rather than an exception; once this observation is incorporated into the design, costs can be decreased by storing and processing data on very large clusters of commodity hardware, where redundancy and replication across processing nodes and racks allow seamless recovery from hardware failure.

  2. The system stores a modest number of large data files

    This observation comes from looking at the nature of the data being processed, such as HTML markup from crawling a large number of websites: what we would call “unstructured data” that is cleaned and serialized by the crawler before it is “batched” together into large files. Once again, by taking a step back and looking at the problem with fresh eyes, the researchers realized their design did not need to optimize for the storage of billions of small files. This is a great constraint to remove from the design, as we will see when we look at the GFS master’s ability to keep the metadata for all files in a cluster in memory, which allows it to make very smart load balancing, placement, and replication decisions.

  3. Workloads primarily consist of large streaming reads and small random reads

    By looking at actual application workloads, the researchers found that read operations generally fall into these two categories, and that successive read operations from the same client often read contiguous regions of a file. Performance-minded applications will also batch and sort their reads so that their progress through a dataset moves in one direction, from beginning to end, instead of going back and forth with random I/O operations.

  4. The workloads also have many large, sequential writes that append to data files

    Notice how “delete” and “update” operations are extremely rare to non-existent; this frees the system design from the onerous task of maintaining locks to ensure the atomicity of those two operations.

  5. Atomicity with minimal synchronization is essential

    The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned in the previous point.

  6. High sustained bandwidth is more important than low latency

    A good observation: when dealing with these large datasets, most applications are batch oriented and benefit most from high processing throughput, versus the traditional database application that places a premium on fast response times.

 

In hindsight these observations might seem obvious, especially as they have been incorporated into the design principles that drive other products such as Apache Hadoop. But Google’s decision to invest in a custom-made file system to fit their very specific needs, and the ability of the Google Research team to step back and start their design with fresh eyes, have truly revolutionized data processing forever. Cheers to them!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research