Hadoop Ecosystem: Zookeeper – The distributed coordination server

Apache Zookeeper Logo


“ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. “ [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create an maintain globally distributed lock mechanisms, this allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
  • Queues:  Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues where a list of messages or objects is held by  a Zookeeper node that clients connect to to submit new queue member as well as to request  a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
  • Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two phase commit pattern through its infrastructure. Each client will apply the transaction tentatively on the first commit phase and notify the coordination node that will then let all parties involved know whether or not the transaction was globally successful or not.
  • Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point  once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, this helps applications running on a cluster optimize for locality and load balancing.

As you can see Zookeeper play a  vital role as foundation service for distributed applications that need to coordinate independent, asynchronous processes across large computing nodes on a cluster environment.


[1] Zookeeper Websitehttp://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Oracle Open World 2017

We are beginning our coverage of Oracle Open World 2017 a on day second with our hit list of topics to cover this year, please send us suggestions on twitter @GreatAnalytics and we’ll add them to our list. This year we will be razor focused on advanced analytics, big data and new analytical and processing capabilities being added to Oracle Financial Services Analytics (OFSAA).

  • Creating custom credit risk models in OFSAA (In Process…)
  • OFSAA and Hadoop, the state of the union (In Process…)
  • OOW2017: Oracle Enterprise R state of the union (In Process…)
  • Oracle Fusion Applications: The evolution of cloud ERP embedded analytics (In Process…)
  • OOW2017: Our updated Big Data Reference Architecture, leveraging Oracle’s full stack of analytical capabilities (In Process…)
  • OOW2017: Oracle Data Integration and Big Data, latest updates in the adoption patterns and features
  • OOW2017: Industry Update – Oracle Higher Education Analytics
  • OOW2017: Industry Update – Oracle Financial Services and Insurance Analytics

The Google File System’s conscious design tradeoffs

Google File System Architecture

ProfProfile4-cartoon.jpgThis is my first post on the Google File System where I will very briefly touch base on a very specific feature-set that is driven by conscious design tradeoffs that have made GFS and derived systems so successful.

  1.  Highly Redundant Data vs. Highly Available HardwareWhen working with Petabytes of data hardware failure is a norm more than an exception, expensive highly redundant hardware is replaced with commodity components that allow the file system to store multiple copies of data across storage nodes and switches at a reasonable cost.
  2.  Store a small number of large files vs. millions of small individual documentsWith the need to store hundreds of terabytes composed of billions of small objects (i.e. e-Mail Messages, Webpages), GFS attempts to simplify file system design by serializing these small individual objects to be grouped together into larger files. Having a small number of large files allows GFS to keep all file and namespace metadata in memory on the GFS master which in turn allows the master to leverage this global visibility to make smarter load balancing and redundancy decisions.
  3.  Generally Immutable dataOnce a serialized object or file record is written to disk it will never be updated again, as Google states on their research paper random writes are practically non-existent. This is driven by application requirements where data is generally written once and then consumed by applications over time without alteration. Google describes the application data as mutating by either inserting new records or appending on the last “chunk” or block of a file, applications are encouraged to constrain their update strategies to these two operations.

On my next series of post I will analyze other architecture and performance characteristics that make the Google File System brilliantly innovative, stay tuned!



“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Where to download older versions of Java?

I have found myself asking where can I download old versions of Java several times lately. They are generally found on Oracle’s website on a version archive page. To help with direct acess to versions here’s a list with a few versions:


Version 64-bit JDK 64-bit JRE 32-bit JDK 32-bit JRE
8u25 (1.8) JDK JRE JDK JRE
7u72 (1.7) JDK JRE JDK JRE
6u45 (1.6) JDK JRE JDK JRE
5.0u22 (1.5) JDK JRE JDK JRE

What is Apache Spark Streaming?


Continuing with the rapid innovation of the Apache Spark code base the Spark Streaming API allows enterprises to leverage the full power of the Spark architecture to process real-time workloads.

Built upon the foundation of Core Spark, Spark Streams is able to consume data from common real time pipelines such as Apache Kafka, Apache Flume, Kinesis, TCP Sockets and run complex algorithms (MLib Predictive Models, GraphX Algorithms). Results can be then displayed in real time dashboards or be stored in HDFS.

Apache Spark Streaming Architecture
Apache Spark Streaming Architecture: Tutorial, interview questions, interview preparation, big data, Apache Kafka, real time
  • Apache Spark Streaming Programming Guide:

Issue/Error with ODI Studio right click

As I work with the Oracle Business Intelligence Applications (OBIA) repository in ODI studio I have recently noticed I am no longer able to right click on objects. I have found two solutions, the first one is a work-around:


Work Around:

Let’s assume you want to right click on a particular folder or scenario, you notice as you do so the context menu does not come up, go ahead and do the following:

  1. Select the object with a left click
  2. Move your mouse pointer outside the object’s boundary, I prefer a little bit to the right
  3. Right click, the context menu should come up now

This work around works if you are restricted on changing your installation’s settings or using a hosted platform such as Citrix



In cases where you have access to install software your system then you should look into the compatibility matrix for ODI Studio and the version of Java you are working with. In my case I noticed the hosting provider for my environment has setup JDK 1.7  64-bit, I noticed for some versions of ODI JDK 1.6 was required so I downloaded both 32 and 64 bit versions and pointed my odi.conf file to them version. The 64 bit version did solve my issue, which is great since I can allocate more memory to the client under this bit version.



ODI Tip: How to make sure a “Select distinct” is issued and an ODI interface returns a unique dataset with no duplicates



As a developer I do have a need to make sure that the subset of columns I am mapping through from source to target on my ODI interface is unique, in other words, I want ODI to include a DISTINCT clause on the SELECT statement that will be issued on the source database.



  • Open my interface on the ODI Interface designer
  • Click on the Flow tab on the bottom
  • Click on the Target object
  • On the Property Inspector, click on the “Distinct Rows” checkbox


ETL Tuning in ODI / BI Apps–The #ETL_ANALYZE_WORK_TABLE parameter

One of the first things I do when I run into performance issues with ETL loads is to look at the source and target table statistics. Have they been collected before the current select / insert statement was issued?

It turns out that in Oracle BI Apps the #ETL_ANALYZE_WORK_TABLE parameter is turned off by default when a load plan is generated. This can make doing a high level review of your load plan execution tricky since there will be steps that will seem to be gathering statistics, when in reality, the ODI code generator just puts a placeholder instead of the code for statistics. An example of this is shown below:







Once I realized what the issue was with statistics not being gathered for my work tables I was able to zoom into the ETL_ANALYZE_WORK_TABLE variable by looking at my generated load plan as depicted below, and change its default value to Y. The variable is defined globally so once you change the definition this new default value will apply to any newly generated load plans.





ODI: Purging OLD Sessions

One common administrative task that I find myself doing when I realize that my ODI logs are growing fairly large is purging old sessions from the log. The steps are fairly straightforward as follows:


  1. Login to your ODI Studio client
  2. To to the Operator View
  3. On the top right corner of your navigation pane, expand the menu and select purge log…


  4. On the Purge Log screen you can select which old sessions to remove by date, agent, context, status, user and session name


  5. Once you have set parameters as desired click on OK and the ODI session logs will be purged accordingly



Three key things to remember about Apache Spark RDD Operations

There are three key concepts that are essential for the beginner Apache Spark Developer, we will cover them here. If you want to receive a condensed summary of the most relevant news in Big Data, Data Science and Advanced Analytics do not forget to subscribe to our newsletter, we send it once a month so you get the very best, only once a month.

All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:

  • Creating RDDs does not need to be a hard, involved process: For your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and enhances your productivity in your sandbox environment.
  • Remember to persist your RDDs: This has to do with the fact that Spark RDDs are lazy, transformations are not executed as you define them, only once you ask Spark for a result.  Experienced data scientists will define a base RDD and then create different sub-sets through transformations, every time you define one of these subsets by defining a new RDD remember to persist it, otherwise this will execute again and again every time you ask for the results of a downstream RDD.
  • Remember that RDDs in Spark are immutable: A big reason why the previous point is difficult to digest for new Spark developers is that we are not accustomed to the Functional Programming paradigm that underlies Spark. In our regular programming languages we create a variable that references a specific space in memory and then we are able to assign distinct values to this variable in different parts of our program. In Functional programming each object or variable is immutable, every time you create a new RDD based on a the results of an upstream RDD Apache will execute  all of the logic that led to the creation of the source RDD.

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) Operations, stay tuned for additional coverage on best practices as well as for Apache Spark Data Frames.



How To: Manage your Oracle patch deployment life cycle using Oracle Support Patch Plans



As part of my writing I often try to document and share best practices I develop on my day to day work, this one relates to formalizing the patch deployment process for your oracle environments. This approach is developed for organizations that have formal release cycles and have established procedures to take patches through test life cycles that; at a minimum, begin in a develop environment, followed by integration testing in a QA and culminate when patches are promoted to production.

I will try to keep this post brief so, at a high level, I have found that the best way to manage patches is to use the Oracle support portal patch & upgrades functionality to create a patch plan for each environment in the life cycle for either each major release or at least each quarter. This process is always initiated by the need to apply a patch so whenever no patches are necessary during a release or quarter no patch plans are created.

The two main benefits of this approach is (1) that it brings transparency into which patches have been approved for each environment, (2) it is a straight forward process that does not carry a lot of overhead. The way patches make it to a patch plan is when a project manager requests a patch to be applied or promoted to each environment in your life cycle, this in turn is monitored using standard project management mechanisms such as issue, task and test management.



Creating your first patch plan is very simple, just take your first requested patch through the process outlined below.


  1. Login to http://support.oracle.com
  2. Click on the Patches & Updates tab
  3. Locate the appropriate version of your patch by specifying a patch number and operating system on the patch search interface

    Locate the appropriate version of your patch by specifying a patch number and operating system on the patch search interface

  4. Locate your patch on the search results screen and click on Add to Plan > Add to new …

    Locate your patch on the search results screen and click on Add to Plan > Add to new ...

  5. Locate the a valid target application server or host name using the search box
  6. Provide a patch plan name using your company’s naming standard and click create plan

    An example naming convention I have used in the past, this particular one allows system administrators to sort by date and to manage patch plans by product:

    – – – approved patches

    Provide a patch plan name using your company's naming standard and click create plan

  7. To add any additional requested patches to your plan go back to Patches & Updatesand select your plan from the Plans list and click on the Add Patch… button.

Having this patching plan makes it easy to manage patch deployment through your environments. As for the actual deployment of each patch, I am a command line geek and like the ability to make sure that each individual patch deployment works correctly by running OPatch for each individual package.

If you find this post useful please or Share our site!


As part of my writing I often try to document and share best practices I develop on my day to day work, this one relates to formalizing the patch deployment process for your oracle environments …