Ready – The Analytics Journal

The Google File System’s conscious design tradeoffs

Posted on May 12, 2018July 17, 2018 by admin

This is my first post on the Google File System where I will very briefly touch base on a very specific feature-set that is driven by conscious design tradeoffs that have made GFS and derived systems so successful.

Highly Redundant Data vs. Highly Available Hardware
When working with Petabytes of data hardware failure is a norm more than an exception, expensive highly redundant hardware is replaced with commodity components that allow the file system to store multiple copies of data across storage nodes and switches at a reasonable cost.
Store a small number of large files vs. millions of small individual documents
With the need to store hundreds of terabytes composed of billions of small objects (i.e. e-Mail Messages, Webpages), GFS attempts to simplify file system design by serializing these small individual objects to be grouped together into larger files. Having a small number of large files allows GFS to keep all file and namespace metadata in memory on the GFS master which in turn allows the master to leverage this global visibility to make smarter load balancing and redundancy decisions.
Generally Immutable data
Once a serialized object or file record is written to disk it will never be updated again, as Google states on their research paper random writes are practically non-existent. This is driven by application requirements where data is generally written once and then consumed by applications over time without alteration. Google describes the application data as mutating by either inserting new records or appending on the last “chunk” or block of a file, applications are encouraged to constrain their update strategies to these two operations.

On my next series of post I will analyze other architecture and performance characteristics that make the Google File System brilliantly innovative, stay tuned!

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Posted on February 10, 2018July 18, 2018 by admin

“ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. “ ^[1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

Locks: Zookeeper provides mechanisms to create an maintain globally distributed lock mechanisms, this allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
Queues: Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues where a list of messages or objects is held by a Zookeeper node that clients connect to to submit new queue member as well as to request a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two phase commit pattern through its infrastructure. Each client will apply the transaction tentatively on the first commit phase and notify the coordination node that will then let all parties involved know whether or not the transaction was globally successful or not.
Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point once all worker processes have executed their independent units of work.
Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, this helps applications running on a cluster optimize for locality and load balancing.

As you can see Zookeeper play a vital role as foundation service for distributed applications that need to coordinate independent, asynchronous processes across large computing nodes on a cluster environment.

References:

[1] Zookeeper Website, http://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Google File System Design Assumptions

Posted on October 7, 2017July 17, 2018 by admin

In today’s post I want to highlight the brilliance of the Google Research team, their ability to step back and look at old assumptions kind of reminds me of the Wright brothers realizing that lift values from the 1700’s and other widespread assumptions of the time were the main constrains holding them back from being able to come with the first airplane.

At Google Research something similar went on when they realized that traditional data storage and processing paradigms did not fit well with their application’s processing workloads. Here are some of the design assumptions for Google File System straight from the published research paper with my comments:

Failure is an expectation, not an exception
Google realized that the traditional way to address failure on the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is realizing that when dealing with massive datasets (i.e. downloading a copy of the entire web) hardware failure is a fact of life rather than an exception; once this observation is incorporated into their design costs can be decreased by storing and processing data on very large clusters of commodity hardware where redundancy and replication across processing nodes and racks allows for seamless recovery from hardware failure.
The system stores a modest number of large data files
This observation is arrived at by looking at the nature of the data being processed such as HTML markup from crawling a large number of websites, this is what we would call “unstructured data” that is cleaned and serialized by the crawler before it is “batched” together into large files. Once again, by taking a step back and looking at the problem with fresh eyes the researchers were able to realize their design did not need to optimize for the storage of billions of small files, this is a great constraint to remove from their design as we will explore when we look at the ability of the GFS master server to control and store metadata for all files in a cluster in memory, thus allowing it to make very smart load balancing, placement and replication decisions.
Workloads primarily consist of large streaming reads and small random reads
By looking at actual application workloads the researchers found that they could generally group read operations in these two categories and that sucessive read operations from the same client will often read contiguous regions of a file; also, performance minded applications will batch and sort their reads so that their progress through a dataset is one directional moving from beginning to end instead of going back and forth with random I/O operations.
The workloads also have many large, sequential writes that append to data files
Notice here how “delete” and “update” operations are extremely rare to non-existent, this frees up the system design from the onerous task of maintaining locks to ensure the atomicity of these two operations.
Atomicity with minimal synchronization is essential
The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned on the previous point.
High sustained bandwidth is more important than low latency
A good observation on the fact that when dealing with these large datasets most applications are batch oriented and benefit the most of high processing throughput versus the traditional database application that places a premium in fast response times.

In hindsight, these observations might seem obvious, specially as they have been incorporated into the design principles that drive other products such as Apache Hadoop; but, Google’s decision to invest into a custom made file system to fit their very specific needs and the ability of the Google Research team to step back and start their design with fresh eyes have truly revolutionized our data processing forever, cheers to them!

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Ignacio de la Torre

Posted on January 31, 2013 by admin

I am a technophile based in San Francisco, in my current endeavors take me into the realms of big data, data science and information discovery. I am documenting my findings and experiences on this wild new frontier in the history of computing in this blog and always looking for a conversation with those in the trenches that live and breathe the use case, send me an email to [email protected] and follow me on twitter at @binsights or LinkedIn at http://linkedin.com/in/idelatorre

Fresh news on big data, data science and cutting edge analytics!

Tag: Ready

The Google File System’s conscious design tradeoffs

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Google File System Design Assumptions

Ignacio de la Torre