The Google File System’s conscious design tradeoffs

The Google File System’s conscious design tradeoffs

 

 

This is my first post on the Google File System where I will very briefly touch base on a very specific feature-set that is driven by conscious design tradeoffs that have made GFS and derived systems so successful.

  1.  Highly Redundant Data vs. Highly Available Hardware

    When working with Petabytes of data hardware failure is a norm more than an exception, expensive highly redundant hardware is replaced with commodity components that allow the file system to store multiple copies of data across storage nodes and switches at a reasonable cost.

  2.  Store a small number of large files vs. millions of small individual documents

    With the need to store hundreds of terabytes composed of billions of small objects (i.e. e-Mail Messages, Webpages), GFS attempts to simplify file system design by serializing these small individual objects to be grouped together into larger files. Having a small number of large files allows GFS to keep all file and namespace metadata in memory on the GFS master which in turn allows the master to leverage this global visibility to make smarter load balancing and redundancy decisions.

  3.  Generally Immutable data

    Once a serialized object or file record is written to disk it will never be updated again, as Google states on their research paper random writes are practically non-existent. This is driven by application requirements where data is generally written once and then consumed by applications over time without alteration. Google describes the application data as mutating by either inserting new records or appending on the last “chunk” or block of a file, applications are encouraged to constrain their update strategies to these two operations.

On my next series of post I will analyze other architecture and performance characteristics that make the Google File System brilliantly innovative, stay tuned!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Apache Zookeeper Logo

Apache Zookeeper is a centralized service for distributed systems to a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.“ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. “ [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create an maintain globally distributed lock mechanisms, this allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
  • Queues:  Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues where a list of messages or objects is held by  a Zookeeper node that clients connect to to submit new queue member as well as to request  a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
  • Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two phase commit pattern through its infrastructure. Each client will apply the transaction tentatively on the first commit phase and notify the coordination node that will then let all parties involved know whether or not the transaction was globally successful or not.
  • Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point  once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, this helps applications running on a cluster optimize for locality and load balancing.

As you can see Zookeeper play a  vital role as foundation service for distributed applications that need to coordinate independent, asynchronous processes across large computing nodes on a cluster environment.

References:

[1] Zookeeper Websitehttp://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Apache Hive

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort, once business requirements align with the need to extract new insights from the stored data a development team can leverage the “schema on read” paradigm to create metadata about these files.

 

Having a SQL interpreter allows business analysts and power users to have access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.

Google File System Design Assumptions

 

The Google File System’s conscious design tradeoffs

In today’s post I want to highlight the brilliance of the Google Research team, their ability to step back and look at old assumptions kind of reminds me of the Wright brothers realizing that lift values from the 1700’s and other widespread assumptions of the time were the main constrains holding them back from being able to come with the first airplane.

 

At Google Research something similar went on when they realized that traditional data storage and processing paradigms did not fit well with their  application’s processing workloads. Here are some of the design assumptions for Google File System straight from the published research paper with my comments:

 

  1. Failure is an expectation, not an exception

    Google realized that the traditional way to address failure on the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is realizing that when dealing with massive datasets (i.e. downloading a copy of the entire web) hardware failure is a fact of life rather than an exception; once this observation is incorporated into their design costs can be decreased by storing and processing data on very large clusters of commodity hardware where redundancy and replication across processing nodes and racks allows for seamless recovery from hardware failure.

  2. The system stores a modest number of large data files

    This observation is arrived at by looking at the nature of the data being processed such as HTML markup from crawling a large number of websites, this is what we would call “unstructured data” that is cleaned and serialized by the crawler before it is “batched” together into large files.  Once again, by taking a step back and looking at the problem with fresh eyes the researchers were able to realize their design did not need to optimize for the storage of billions of small files, this is a great constraint to remove from their design as we will explore when we look at the ability of the GFS master server to control and store metadata for all files in a cluster in memory, thus allowing it to make very smart load balancing, placement and replication decisions.

  3. Workloads primarily consist of large streaming reads and small random reads

    By looking at actual application workloads the researchers found that they could generally group read operations in these two categories and that sucessive read operations from the same client will often read contiguous regions of a file; also, performance minded applications will batch and sort their reads so that their progress through a dataset is one directional moving from beginning to end instead of going back and forth with random I/O operations.

  4. The workloads also have many large, sequential writes that append to data files

    Notice here how “delete” and “update” operations are extremely rare to non-existent, this frees up the system design from the onerous task of maintaining locks to ensure the atomicity of these two operations.

  5. Atomicity with minimal synchronization is essential

    The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned on the previous point.

  6. High sustained bandwidth is more important than low latency

    A good observation on the fact that when dealing with these large datasets most applications are batch oriented and benefit the most of high processing throughput versus the traditional database application that places a premium in fast response times.

 

In hindsight, these observations might seem obvious, specially as they have been incorporated into the design principles that drive other products such as Apache Hadoop; but, Google’s decision to invest into a custom made file system to fit their very specific needs and the ability of the Google Research team to step back and start their design with fresh eyes have truly revolutionized our data processing forever, cheers to them!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Where to download older versions of Java?

I have found myself asking where can I download old versions of Java several times lately. They are generally found on Oracle’s website on a version archive page. To help with direct acess to versions here’s a list with a few versions:

 

Version 64-bit JDK 64-bit JRE 32-bit JDK 32-bit JRE
8u25 (1.8) JDK JRE JDK JRE
7u72 (1.7) JDK JRE JDK JRE
6u45 (1.6) JDK JRE JDK JRE
5.0u22 (1.5) JDK JRE JDK JRE

Issue/Error with ODI Studio right click

As I work with the Oracle Business Intelligence Applications (OBIA) repository in ODI studio I have recently noticed I am no longer able to right click on objects. I have found two solutions, the first one is a work-around:

 

Work Around:

Let’s assume you want to right click on a particular folder or scenario, you notice as you do so the context menu does not come up, go ahead and do the following:

  1. Select the object with a left click
  2. Move your mouse pointer outside the object’s boundary, I prefer a little bit to the right
  3. Right click, the context menu should come up now

This work around works if you are restricted on changing your installation’s settings or using a hosted platform such as Citrix

 

Solution:

In cases where you have access to install software your system then you should look into the compatibility matrix for ODI Studio and the version of Java you are working with. In my case I noticed the hosting provider for my environment has setup JDK 1.7  64-bit, I noticed for some versions of ODI JDK 1.6 was required so I downloaded both 32 and 64 bit versions and pointed my odi.conf file to them version. The 64 bit version did solve my issue, which is great since I can allocate more memory to the client under this bit version.

 

Related:

ODI Tip: How to make sure a “Select distinct” is issued and an ODI interface returns a unique dataset with no duplicates

PROBLEM

 

As a developer I do have a need to make sure that the subset of columns I am mapping through from source to target on my ODI interface is unique, in other words, I want ODI to include a DISTINCT clause on the SELECT statement that will be issued on the source database.

 

SOLUTION

  • Open my interface on the ODI Interface designer
  • Click on the Flow tab on the bottom
  • Click on the Target object
  • On the Property Inspector, click on the “Distinct Rows” checkbox

    image

ETL Tuning in ODI / BI Apps–The #ETL_ANALYZE_WORK_TABLE parameter

One of the first things I do when I run into performance issues with ETL loads is to look at the source and target table statistics. Have they been collected before the current select / insert statement was issued?

It turns out that in Oracle BI Apps the #ETL_ANALYZE_WORK_TABLE parameter is turned off by default when a load plan is generated. This can make doing a high level review of your load plan execution tricky since there will be steps that will seem to be gathering statistics, when in reality, the ODI code generator just puts a placeholder instead of the code for statistics. An example of this is shown below:

 

image

 

image

 

SOLUTION:

Once I realized what the issue was with statistics not being gathered for my work tables I was able to zoom into the ETL_ANALYZE_WORK_TABLE variable by looking at my generated load plan as depicted below, and change its default value to Y. The variable is defined globally so once you change the definition this new default value will apply to any newly generated load plans.

 

image  

 

image

Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

Sqoop Logo

 SQOOP is an open source project hosted by the Apache Foundation whose objective is to provide a tool that will allow users to move large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March of 2012 and it is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools where each of these sub-modules serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You do specify the tool you are invoking when you use Sqoop. In terms of syntax, each of these tools have a specific set of arguments while supporting global arguments as well.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose:

 

  • Sqoop import: Helps users import a single table into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Allows users to export a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Allows users to import relational data directly into Apache Hive