Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

Sqoop Logo

 SQOOP is an open source project hosted by the Apache Foundation whose objective is to provide a tool that will allow users to move large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March of 2012 and it is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools where each of these sub-modules serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You do specify the tool you are invoking when you use Sqoop. In terms of syntax, each of these tools have a specific set of arguments while supporting global arguments as well.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose:

 

  • Sqoop import: Helps users import a single table into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Allows users to export a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Allows users to import relational data directly into Apache Hive

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Apache Zookeeper Logo

image

“ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. “ [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create an maintain globally distributed lock mechanisms, this allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
  • Queues:  Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues where a list of messages or objects is held by  a Zookeeper node that clients connect to to submit new queue member as well as to request  a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
  • Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two phase commit pattern through its infrastructure. Each client will apply the transaction tentatively on the first commit phase and notify the coordination node that will then let all parties involved know whether or not the transaction was globally successful or not.
  • Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point  once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, this helps applications running on a cluster optimize for locality and load balancing.

As you can see Zookeeper play a  vital role as foundation service for distributed applications that need to coordinate independent, asynchronous processes across large computing nodes on a cluster environment.

References:

[1] Zookeeper Websitehttp://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Apache Hive

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort, once business requirements align with the need to extract new insights from the stored data a development team can leverage the “schema on read” paradigm to create metadata about these files.

 

Having a SQL interpreter allows business analysts and power users to have access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.