What is the Hive Metastore URI address? hive-site.xml? hive config resources?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

In configuring an Apache NiFi Data Flow (within Hortonworks Dataflow) I ran in to the need to configure the Hive Streaming component to connect to a Hive Table, this personal knowledge base article documents the the locations of the resources I needed.

What is my Hive Metastore URI?

This is located on your Hive Metastore host at port 9083 and uses the Thrift protocol, an example URI would look like this:

thrift://<host_name>:9083

 

Where is my hive-site.xml file located? What should I enter under Hive Config Resources?

When configuring Apache NiFi to connect to a Hive table using Hive Streaming you will need to enter the location of your hive-site.xml file under Hive config resources. Below you can see the location in my hadoop node, to find the location in your installation look under directory /etc/hive the script below can help you with this:

 

#find the Hive folder
cd /etc/hive
#run a search for the hive-site.xml file, starting at the current location
find . -name hive-site.xml

#in my case after examining the results from the command the file is located at:

/etc/hive/2.6.5.0-292/0/hive-site.xml


 

 

 

 

Ambari Agent Node OpenSSL / EOF / Failed to Connect Error

Apache Ambari Logo

 

I recently ran into an issue when deploying  Ambari Agent to a new host in my cluster. Here’s my personal Knowledge Base article on the issue.

Symptoms

When deploying Ambari Agent to a new node, the wizard fails. At the bottom of stderr I found the following :

 

INFO DataCleaner.py:122 - Data cleanup finished

INFO 11,947 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname '<my host>' using socket.getfqdn().

INFO 11,952 PingPortListener.py:50 - Ping port listener started on port: 8670

INFO 11,954 main.py:439 - Connecting to Ambari server at https://<my host>:8440 (172.31.42.192)

INFO 955 NetUtil.py:70 - Connecting to https://<my host>:8440/ca

ERROR 11,958 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)

ERROR 11,958 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.

Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.

WARNING 11,958 NetUtil.py:124 - Server at https://<my host>:8440 is not reachable, sleeping for 10 seconds...


 

Root Cause

After a solving the issue I can say the issue, in my case, was that there were previously installed versions of Java that conflicted with my preferred version even after selecting my version using the alternatives command.

Working through the issue I also found a suggestion to disable certificate validation that I implemented since this is not a production cluster, I am listing it as Solution 3 here.

Solution 1 – Deploy new hosts with no previous JDK

After much tinkering with the alternatives command to repair my JDK configuration I decided that it was easier to start with a new AWS set of nodes and ensure that no JDK was installed in my image before I began my prepping of each node. If you have nodes that are having the issue after an upgrade read Solution 2.

I am including the script I used to download and configure the correct JDK pre-requisite for my version of Ambari and HDP below for your consumption:

 

#!/bin/bash
#Script Name: ignacio_install_jdk.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
##Install Oracle JDK
export hm=/home/ec2-user
cd /usr
sudo mkdir -p jdk64/jdk1.8.0_112
cd jdk64
sudo wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
sudo gunzip jdk-8u112-linux-x64.tar.gz
sudo tar -xf jdk-8u112-linux-x64.tar

#configure paths
chmod 666 $hm/.bash_profile
echo export JAVA_HOME=/usr/jdk64/jdk1.8.0_112 >>  $hm/.bash_profile
echo export PATH=$PATH:/usr/jdk64/jdk1.8.0_112/bin >>  $hm/.bash_profile
chmod 640 /root/.bash_profile

#configure java version using alternatives
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

#if the link to /usr/bin/java is broken (file displays red), rebuild using:
#ln -s -f /usr/jdk64/jdk1.8.0_112/bin/java /usr/bin/java

 

Solution 2 – Install new JDK with utilities

I realize scrapping nodes is not an option, especially for those experiencing the issue after an install. Because of a tight deadline I did not try the solution displayed here but it addresses what I think the issue is.

 

Scenario: On my original nodes that had a previous non-compatible version of JDK installed I issued the following command to select my new JDK as preferred:

#Select Oracle's JDK 1.8 as preferred after install (see install script on Solution 1)
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

Issue: After selecting my new JDK I was able to see it listed in the configurations with the command below, BUT, all of the Java utilities such as jar, javadoc, etc. are pointing to null in my preferred JDK.

#list Java configured alternatives
sudo alternatives --display java

Proposed solution: Use the list of tools from the non-compatible version of JDK to install your new JDK with all the java utilities as slaves, please note that you cannot add slaves to an installed JDK, you need to issue the install command with all the utilities all at once. An example adding only JAR is displayed below:

#Select Oracle’s JDK 1.8 as preferred after install with a slave configuration for the JAR and javadoc utilities:

sudo alternatives --install "/usr/bin/java" "java" "/usr/jdk64/jdk1.8.0_112/bin/java" 1 \ 
--slave "/usr/bin/jar" "jar" "usr/jdk64/jdk1.8.0_112/bin/jar" \
--slave "/usr/bin/javadoc" "javadoc" "usr/jdk64/jdk1.8.0_112/bin/javadoc"

 

Solution 3 – Disable certificate validation

Like I said before, my cluster is not a production one and will not contain sensitive or confidential data so I opted to implement the suggestion to disable certificate validation as part of my troubleshooting. To do this you have to set verify=diable by editing the  /etc/python/cert-verification.cfg  file. Do this at your own risk.

 

 

What to enter under hadoop config resources? where to find core-site.xml and hdfs-xml.xml?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

In configuring an Apache NiFi Data Flow (within Hortonworks Dataflow) I ran in to the need to configure the the PutHDFS component to connect to HDFS, this personal knowledge base article documents the the locations of the resources I needed.

Where is are my core-site.xml and hdfs-site.xml files located? What should I enter under Hadoop Config Resources?

When configuring Apache NiFi to connect to HDFS using the  PutHDFS component you will need to enter the location of your core-site.xml and hdfs-site.xml files under Hadoop config resources. Below you can see the location in my hadoop node, to find the location in your installation look under directory /etc/hadoop

The script below can help you with this:

 

#find the hadoop folder
cd /etc/hadoop
#run a search for the core-site.xml file, starting at the current location
find . -name core-site.xml

#in my case after examining the results from the command the file is located at:
/etc/hadoop/2.6.5.0-292/0/core-site.xml

#I then went to the directory and listed its contents to find the location of my HDFS config file:
/etc/hadoop/2.6.5.0-292/0/hdfs-site.xml

 

 

 

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Apache Zookeeper Logo

Apache Zookeeper is a centralized service for distributed systems to a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.“ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. “ [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create an maintain globally distributed lock mechanisms, this allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
  • Queues:  Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues where a list of messages or objects is held by  a Zookeeper node that clients connect to to submit new queue member as well as to request  a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
  • Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two phase commit pattern through its infrastructure. Each client will apply the transaction tentatively on the first commit phase and notify the coordination node that will then let all parties involved know whether or not the transaction was globally successful or not.
  • Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point  once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, this helps applications running on a cluster optimize for locality and load balancing.

As you can see Zookeeper play a  vital role as foundation service for distributed applications that need to coordinate independent, asynchronous processes across large computing nodes on a cluster environment.

References:

[1] Zookeeper Websitehttp://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Error deploying Hive using the Ambari agent (MySQL JAVA Connector JAR file missing)

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

 

I recently ran into an issue when deploying Hive using Ambari. Here’s what my personal Knowledge Base article on the issue.

Symptoms

When deploying Hive and Hive Metastore to a new node, the “Hive Metastore Start” task fails. At the bottom of stderr I found the following message:

 

File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 52, in __call__

    return self.get_content()

  File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 197, in get_content

    raise Fail("Failed to download file from {0} due to HTTP error: {1}".format(self.url, str(ex)))

resource_management.core.exceptions.Fail: Failed to download file from http://<my_node_host>:8080/resources/ mysql-connector-java.jar due to HTTP error: HTTP Error 404: Not Found

 

Root Cause

I did deploy the Amari, HDP and HDP Utils repositories on a local mirror host. It seems as if the Ambari Agent assumes that the MySQL JAVA Connector JAR file would also be hosted by my local mirror.

This could also happen if the repositories configured during deployment no longer host the MySQL JAVA Connector JAR file.

Solution

Once I figured the root cause out it was easy to solve the issue by using YUM to manually install the MySQL JAVA Connector JAR file:

 

#On the Linux/Unix host(s) experiencing the install error, issue the following command:

sudo yum -y -q install mysql-connector-java

 

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Apache Hive

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort, once business requirements align with the need to extract new insights from the stored data a development team can leverage the “schema on read” paradigm to create metadata about these files.

 

Having a SQL interpreter allows business analysts and power users to have access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.

Hortonworks HDP / Ambari Install – Configure Open File Descriptors

Apache Ambari Logo

This is a very simple script I run in the target hosts that I am prepping for Apache Ambari Server or Agent and Hadoop deployment. The main thing it achieves is displaying the current settings for Linux Open File Descriptors, it then allows the user to specify whether they are below Ambari’s and/or Hadoop’s minimum system requirements, if that is the case the script will update the settings for you:

 

#!/bin/bash
#Script Name: ignacio_ofd.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: http://linkedin.com/in/idelatorre
#################
#Configure Maximum Open File Descriptors
echo ">>> Configure Maximum Open File Descriptors..."
echo "! ! ! Pay attention to the output below, if any of the two numbers displayed is less than 10,000, enter y at the prompt:"

ulimit -Sn
ulimit -Hn
echo "Enter y if the limits are below 10,000:"
read var_yesno
if [ "$var_yesno" = "y" ]
then
     echo "Updating /etc/security/limits.conf"
    #this updates the limits globally
    sudo chmod 666 /etc/security/limits.conf
    sudo echo "ubuntu    hard    nofile    10000" >> /etc/security/limits.conf
    sudo echo "ubuntu    soft    nofile    10000" >> /etc/security/limits.conf
    sudo echo "root    hard    nofile    100000" >> /etc/security/limits.conf
    sudo echo "root    soft    nofile    100000" >> /etc/security/limits.conf
    sudo chmod 644 /etc/security/limits.conf
else
    echo "ulimit not updated, not necessary"
fi

 

 

 

 

Hortonworks HDP / Ambari Install – Configure Network Time Protocol (NTP)

Apache Ambari aims at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

This is a small script I developed to configure NTP on my hosts before deploying the Ambari server or agent and Hadoop:

 

 

 

 

 

#!/bin/bash
#Script Name: ignacio_config_ntp.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
#configure ntp to auto-start at boot time
#Install NTP
sudo yum install -y -q ntp

#Disable autostart
sudo systemctl disable ntpd
sudo timedatectl set-ntp no

#configure NTP
sudo ntpdate pool.ntp.org
sudo timedatectl set-timezone America/Los_Angeles

#re-enable NTP autostart
sudo systemctl enable ntpd
sudo timedatectl set-ntp on

Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

Sqoop Logo

 SQOOP is an open source project hosted by the Apache Foundation whose objective is to provide a tool that will allow users to move large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March of 2012 and it is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools where each of these sub-modules serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You do specify the tool you are invoking when you use Sqoop. In terms of syntax, each of these tools have a specific set of arguments while supporting global arguments as well.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose:

 

  • Sqoop import: Helps users import a single table into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Allows users to export a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Allows users to import relational data directly into Apache Hive

Sqoop: Create-hive-table tool

Apache Sqoop Logo

Tool: CREATE-HIVE-TABLE

–table <source_table> : The name of the table on the originating database

— hive-table <target table> :  The name of the table to be created written to into

 

–enclosed-by

–escaped-by

–fields-terminated-by

–lines-terminated-by

–mysql-delimiters : ( , ) for fields, ( \n ) for lines, enclosed-by ( ‘ ), escaped-by ( \ )

 

Notes:

* The tool will fail if the target table exists