Hortonworks HDP / Ambari Install – Configure Open File Descriptors


This is a very simple script I run on the target hosts I am prepping for an Apache Ambari Server or Agent and Hadoop deployment. It displays the current Linux open file descriptor limits and asks whether they are below Ambari’s and Hadoop’s minimum system requirements; if they are, the script updates the settings for you:

 

#!/bin/bash
#Script Name: ignacio_ofd.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: http://linkedin.com/in/idelatorre
#################
#Configure Maximum Open File Descriptors
echo ">>> Configure Maximum Open File Descriptors..."
echo "! ! ! Pay attention to the output below; if either of the two numbers displayed is less than 10,000, enter y at the prompt:"

ulimit -Sn    #current soft limit
ulimit -Hn    #current hard limit
echo "Enter y if the limits are below 10,000:"
read var_yesno
if [ "$var_yesno" = "y" ]
then
    echo "Updating /etc/security/limits.conf"
    #this updates the limits globally; tee -a appends as root, so the file
    #permissions do not need to be relaxed (sudo echo ... >> would redirect
    #as the unprivileged user and fail)
    echo "ubuntu    hard    nofile    10000" | sudo tee -a /etc/security/limits.conf > /dev/null
    echo "ubuntu    soft    nofile    10000" | sudo tee -a /etc/security/limits.conf > /dev/null
    echo "root    hard    nofile    100000" | sudo tee -a /etc/security/limits.conf > /dev/null
    echo "root    soft    nofile    100000" | sudo tee -a /etc/security/limits.conf > /dev/null
else
    echo "ulimit not updated, not necessary"
fi
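
Changes to /etc/security/limits.conf only take effect for new login sessions (they are applied by PAM at login), so the shell that ran the script will still show the old numbers. A minimal sanity check, assuming you log out and back in as one of the users the script configured, looks like this:

#Run in a NEW login session for the ubuntu user
ulimit -Sn    #should now report the new soft limit (10000)
ulimit -Hn    #should now report the new hard limit (10000)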


Hortonworks HDP / Ambari Install – Configure Network Time Protocol (NTP)

Apache Ambari aims at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

This is a small script I developed to configure NTP on my hosts before deploying the Ambari server or agent and Hadoop:


#!/bin/bash
#Script Name: ignacio_config_ntp.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
#configure ntp to auto-start at boot time
#Install NTP
sudo yum install -y -q ntp

#Disable NTP autostart and stop systemd-managed time sync so that
#ntpdate can set the clock directly
sudo systemctl disable ntpd
sudo timedatectl set-ntp no

#Set the clock once from the public pool and configure the time zone
sudo ntpdate pool.ntp.org
sudo timedatectl set-timezone America/Los_Angeles

#Re-enable NTP autostart and start the daemon now
sudo systemctl enable ntpd
sudo systemctl start ntpd
sudo timedatectl set-ntp on
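
Once the daemon is started it can take a few minutes for ntpd to select a synchronization source. A quick way to confirm the configuration took effect (the ntpq client is installed along with the ntp package above) is:

#Confirm the time zone and NTP status reported by systemd
timedatectl status

#List the peers ntpd is polling; the line prefixed with '*' is the
#currently selected sync source
ntpq -p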

Google File System Design Assumptions

 

The Google File System’s conscious design tradeoffs

In today’s post I want to highlight the brilliance of the Google Research team. Their ability to step back and question old assumptions reminds me of the Wright brothers realizing that the lift values from the 1700s and other widespread assumptions of their time were the main constraints holding them back from building the first airplane.

 

At Google Research something similar happened when the team realized that traditional data storage and processing paradigms did not fit well with their applications’ workloads. Here are some of the design assumptions for the Google File System, straight from the published research paper, with my comments:

 

  1. Failure is an expectation, not an exception

    Google realized that the traditional way to address failure in the datacenter is to increase the sophistication of the hardware involved. This approach increases cost, both through the use of highly specialized hardware and through the need for system administrators with very sophisticated skills. The main innovation here is recognizing that when dealing with massive datasets (e.g., a copy of the entire web), hardware failure is a fact of life rather than an exception. Once this observation is incorporated into the design, costs can be reduced by storing and processing data on very large clusters of commodity hardware, where redundancy and replication across processing nodes and racks allow for seamless recovery from hardware failure.

  2. The system stores a modest number of large data files

    This observation comes from looking at the nature of the data being processed, such as HTML markup from crawling a large number of websites. This is what we would call “unstructured data” that is cleaned and serialized by the crawler before being batched together into large files. Once again, by taking a step back and looking at the problem with fresh eyes, the researchers realized their design did not need to optimize for the storage of billions of small files. This is a great constraint to remove from the design, as we will explore when we look at the GFS master server’s ability to keep the metadata for all files in a cluster in memory, which allows it to make very smart load-balancing, placement and replication decisions.

  3. Workloads primarily consist of large streaming reads and small random reads

    By looking at actual application workloads the researchers found that read operations generally fall into these two categories, and that successive reads from the same client often touch contiguous regions of a file. In addition, performance-minded applications batch and sort their reads so that they move through a dataset in one direction, from beginning to end, instead of going back and forth with random I/O operations.

  4. The workloads also have many large, sequential writes that append to data files

    Notice how “delete” and “update” operations are extremely rare to non-existent, which frees the system design from the onerous task of maintaining locks to guarantee the atomicity of those two operations.

  5. Atomicity with minimal synchronization is essential

    The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned in the previous point.

  6. High sustained bandwidth is more important than low latency

    A good observation: when dealing with these large datasets, most applications are batch oriented and benefit most from high sustained throughput, in contrast to the traditional database application that places a premium on fast response times.

 

In hindsight these observations might seem obvious, especially now that they have been incorporated into the design principles behind products such as Apache Hadoop. But Google’s decision to invest in a custom-built file system for their very specific needs, and the Google Research team’s ability to step back and start the design with fresh eyes, truly changed data processing forever. Cheers to them!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Where to download older versions of Java?

I have found myself asking where I can download old versions of Java several times lately. They are generally found on Oracle’s website on a version archive page. To help with direct access, here is a list with a few versions:

 

Version        64-bit JDK    64-bit JRE    32-bit JDK    32-bit JRE
8u25 (1.8)     JDK           JRE           JDK           JRE
7u72 (1.7)     JDK           JRE           JDK           JRE
6u45 (1.6)     JDK           JRE           JDK           JRE
5.0u22 (1.5)   JDK           JRE           JDK           JRE

Issue/Error with ODI Studio right click

As I work with the Oracle Business Intelligence Applications (OBIA) repository in ODI Studio, I recently noticed that I am no longer able to right-click on objects. I have found two solutions; the first one is a work-around:

 

Work Around:

Let’s assume you want to right-click on a particular folder or scenario and you notice that the context menu does not come up. Go ahead and do the following:

  1. Select the object with a left click
  2. Move your mouse pointer just outside the object’s boundary (I prefer a little bit to the right)
  3. Right-click; the context menu should come up now

This work-around is useful if you are restricted from changing your installation’s settings or are using a hosted platform such as Citrix.

 

Solution:

In cases where you can install software on your system, you should look into the compatibility matrix for ODI Studio and the version of Java you are working with. In my case the hosting provider for my environment had set up a 64-bit JDK 1.7, while some versions of ODI require JDK 1.6, so I downloaded both the 32-bit and 64-bit versions of JDK 1.6 and pointed my odi.conf file at the new JDK. The 64-bit version solved my issue, which is great since I can allocate more memory to the client under that bit version.
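
For reference, ODI Studio picks up its JDK from the SetJavaHome directive in odi.conf. The paths below are only an illustration; the odi.conf location and the JDK directory vary by ODI version and platform, so adjust them to your own install:

#Illustrative paths only -- adjust to your ODI install and JDK location
ODI_CONF=/opt/oracle/odi/oracledi/client/odi/bin/odi.conf

#See which JDK ODI Studio is currently configured to use
grep SetJavaHome "$ODI_CONF"

#Then edit that directive to point at the required JDK, for example:
#  SetJavaHome /usr/java/jdk1.6.0_45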

 


ODI Tip: How to make sure a “Select distinct” is issued and an ODI interface returns a unique dataset with no duplicates

PROBLEM

 

As a developer I need to make sure that the subset of columns I am mapping from source to target in my ODI interface is unique; in other words, I want ODI to include a DISTINCT clause in the SELECT statement that will be issued against the source database.

 

SOLUTION

  • Open the interface in the ODI interface designer
  • Click on the Flow tab at the bottom
  • Click on the Target object
  • In the Property Inspector, check the “Distinct Rows” checkbox

    image

ETL Tuning in ODI / BI Apps – The #ETL_ANALYZE_WORK_TABLE parameter

One of the first things I do when I run into performance issues with ETL loads is to look at the source and target table statistics. Have they been collected before the current select / insert statement was issued?
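
A quick way to answer that question is to query the LAST_ANALYZED column in the data dictionary. The sketch below uses sqlplus and assumes you connect as the warehouse schema owner; the connection string and the table-name filter are placeholders, so adjust them to your own environment and work-table naming convention:

#Placeholders: connection string and the LIKE filter on table names
sqlplus -s dw_user/password@ORCL <<'SQL'
SET PAGESIZE 100 LINESIZE 120
COLUMN table_name FORMAT A30
SELECT table_name, num_rows, last_analyzed
FROM   user_tab_statistics
WHERE  table_name LIKE 'W\_%' ESCAPE '\'
ORDER  BY last_analyzed NULLS FIRST;
SQL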

It turns out that in Oracle BI Apps the #ETL_ANALYZE_WORK_TABLE parameter is turned off by default when a load plan is generated. This can make a high-level review of your load plan execution tricky, since some steps will appear to be gathering statistics when in reality the ODI code generator just emits a placeholder instead of the statistics-gathering code. An example of this is shown below:

 

image

 

image

 

SOLUTION:

Once I realized that statistics were not being gathered for my work tables, I was able to zoom in on the ETL_ANALYZE_WORK_TABLE variable in my generated load plan, as depicted below, and change its default value to Y. The variable is defined globally, so once you change the definition the new default value applies to any newly generated load plans.

 

image  

 

image

Hadoop Ecosystem: SQOOP – The Data Mover


Sqoop is an open source project hosted by the Apache Software Foundation whose objective is to provide a tool for moving large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March of 2012 and is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools, where each sub-module serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You specify which tool you are invoking when you run Sqoop. In terms of syntax, each tool has its own set of arguments while also supporting a set of global arguments.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose; a sample import invocation follows the list:

 

  • Sqoop import: Imports a single table from a relational database into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Exports a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Creates a Hive table definition based on a source table in a relational database
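
As an illustration, a basic import of a single MySQL table into HDFS looks roughly like the sketch below; the JDBC URL, credentials, table name and target directory are all placeholders:

#All connection details below are placeholders for your own environment
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username sqoop_user -P \
  --table customers \
  --target-dir /user/hadoop/sales/customers \
  --num-mappers 4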

ODI: Purging OLD Sessions

One common administrative task I find myself doing when my ODI logs grow fairly large is purging old sessions from the log. The steps are fairly straightforward:

 

  1. Login to your ODI Studio client
  2. Go to the Operator view
  3. In the top right corner of the navigation pane, expand the menu and select Purge Log…

    image

  4. On the Purge Log screen you can select which old sessions to remove by date, agent, context, status, user and session name

    image

  5. Once you have set parameters as desired click on OK and the ODI session logs will be purged accordingly

 


How To: Manage your Oracle patch deployment life cycle using Oracle Support Patch Plans

Introduction

 

As part of my writing I often try to document and share best practices from my day-to-day work; this one relates to formalizing the patch deployment process for your Oracle environments. The approach is intended for organizations that have formal release cycles and established procedures for taking patches through a test life cycle that, at a minimum, begins in a development environment, continues with integration testing in QA, and culminates when patches are promoted to production.

I will try to keep this post brief. At a high level, I have found that the best way to manage patches is to use the Oracle Support portal’s Patches & Updates functionality to create a patch plan for each environment in the life cycle, either for each major release or at least each quarter. The process is always initiated by the need to apply a patch, so if no patches are needed during a release or quarter, no patch plans are created.

The two main benefits of this approach are that (1) it brings transparency into which patches have been approved for each environment, and (2) it is a straightforward process that does not carry a lot of overhead. Patches make it onto a patch plan when a project manager requests that a patch be applied or promoted to an environment in your life cycle, which in turn is tracked using standard project management mechanisms such as issue, task and test management.

 

Implementation

Creating your first patch plan is very simple: just take your first requested patch through the process outlined below.

 

  1. Log in to http://support.oracle.com
  2. Click on the Patches & Updates tab
  3. Locate the appropriate version of your patch by specifying a patch number and operating system on the patch search interface


  4. Locate your patch on the search results screen and click on Add to Plan > Add to new …


  5. Locate a valid target application server or host name using the search box
  6. Provide a patch plan name using your company’s naming standard and click Create Plan

    Here is an example naming convention I have used in the past; this particular one allows system administrators to sort by date and to manage patch plans by product:

    <date> – <product> – <environment> – approved patches


  7. To add any additional requested patches to your plan, go back to Patches & Updates, select your plan from the Plans list, and click on the Add Patch… button.

Having this patch plan makes it easy to manage patch deployment across your environments. As for the actual deployment of each patch, I am a command-line geek and like being able to confirm that each individual patch deploys correctly by running OPatch for each individual package.
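
For the command-line piece, the flow I follow per patch looks roughly like the sketch below. The ORACLE_HOME, staging path and patch number are placeholders, and the patch README should always be checked for prerequisites such as stopping services:

#Placeholders: ORACLE_HOME, staging directory and patch number
export ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
export PATH=$ORACLE_HOME/OPatch:$PATH

cd /stage/patches/12345678    #unzipped patch directory
opatch apply                  #apply the patch to the current ORACLE_HOME

opatch lsinventory            #verify the patch now appears in the inventory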

If you find this post useful please Like or Share our site!
