Hadoop – The Analytics Journal

Hive: How to drop partitions by range

Posted on February 15, 2019 by admin

This is the general syntax for dropping partitions in Apache Hive:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;

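So, to drop a range of partitions in a table that uses year as a partitioning column (here, every partition with a year greater than 2019), the statement looks like this. The square brackets in the general syntax only mark IF EXISTS as optional and are not typed. PURGE deletes the data immediately instead of moving it to the trash:

ALTER TABLE mytable DROP IF EXISTS PARTITION (year>2019) PURGE;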
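
When the range changes from run to run, it can help to generate the statement instead of typing it. The following is a minimal sketch, not part of Hive itself: build_drop_range is a hypothetical helper name, and it only prints the statement so it can be reviewed before anything is dropped. It assumes a numeric partition value that needs no quoting.

```shell
#!/bin/bash
# Build (but do not run) a Hive statement that drops every partition whose
# partition column satisfies a comparison, for example year > 2019.
# build_drop_range is a hypothetical helper name, not a Hive command.
build_drop_range() {
    local table="$1" column="$2" op="$3" value="$4"
    case "$op" in
        '<'|'<='|'>'|'>='|'=') ;;
        *) echo "unsupported operator: $op" >&2; return 1 ;;
    esac
    printf 'ALTER TABLE %s DROP IF EXISTS PARTITION (%s%s%s) PURGE;\n' \
        "$table" "$column" "$op" "$value"
}
```

Once the printed statement looks right, it could be passed to the Hive CLI with something like hive -e "$(build_drop_range mytable year '>' 2019)". Running SHOW PARTITIONS mytable; before and after is a cheap way to confirm which partitions were affected.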

What is the Hive Metastore URI address? hive-site.xml? hive config resources?

Posted on July 14, 2018 (updated July 17, 2018) by admin

While configuring an Apache NiFi data flow (within Hortonworks DataFlow), I ran into the need to configure the Hive Streaming component to connect to a Hive table. This personal knowledge base article documents the locations of the resources I needed.

What is my Hive Metastore URI?

The metastore listens on your Hive Metastore host, by default on port 9083, and uses the Thrift protocol. An example URI would look like this:

thrift://<host_name>:9083

Where is my hive-site.xml file located? What should I enter under Hive Config Resources?

When configuring Apache NiFi to connect to a Hive table using Hive Streaming, you will need to enter the location of your hive-site.xml file under Hive config resources. Below you can see the location on my Hadoop node. To find the location in your installation, look under the /etc/hive directory; the script below can help you with this:

#find the Hive folder
cd /etc/hive
#run a search for the hive-site.xml file, starting at the current location
find . -name hive-site.xml

#in my case after examining the results from the command the file is located at:

/etc/hive/2.6.5.0-292/0/hive-site.xml
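
Once hive-site.xml is located, the metastore URI can usually be read straight out of it, since Hive stores it in the hive.metastore.uris property. The following is a minimal sketch: get_config_property is a hypothetical helper name, and it assumes each <name> and <value> element sits on a single line, which may not hold for hand-edited files.

```shell
#!/bin/bash
# Print the value of one property from a Hadoop-style XML config file.
# get_config_property is a hypothetical helper; it assumes every <name>
# and <value> element fits on one line.
get_config_property() {
    local file="$1" wanted="$2"
    awk -v wanted="$wanted" '
        /<name>/ {
            # remember the most recent property name
            current = $0
            sub(/.*<name>[ \t]*/, "", current)
            sub(/[ \t]*<\/name>.*/, "", current)
        }
        /<value>/ && current == wanted {
            # first value following the wanted name
            value = $0
            sub(/.*<value>[ \t]*/, "", value)
            sub(/[ \t]*<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$file"
}
```

For the file found above, the call would be get_config_property /etc/hive/2.6.5.0-292/0/hive-site.xml hive.metastore.uris. An empty result means the property was not found in that file.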

Ambari Agent Node OpenSSL / EOF / Failed to Connect Error

Posted on June 9, 2018 (updated July 18, 2018) by admin

I recently ran into an issue when deploying Ambari Agent to a new host in my cluster. Here’s my personal Knowledge Base article on the issue.

Symptoms

When deploying Ambari Agent to a new node, the wizard fails. At the bottom of stderr I found the following:

INFO DataCleaner.py:122 - Data cleanup finished

INFO 11,947 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname '<my host>' using socket.getfqdn().

INFO 11,952 PingPortListener.py:50 - Ping port listener started on port: 8670

INFO 11,954 main.py:439 - Connecting to Ambari server at https://<my host>:8440 (172.31.42.192)

INFO 955 NetUtil.py:70 - Connecting to https://<my host>:8440/ca

ERROR 11,958 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)

ERROR 11,958 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.

Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.

WARNING 11,958 NetUtil.py:124 - Server at https://<my host>:8440 is not reachable, sleeping for 10 seconds...

Root Cause

After solving the issue, I can say that in my case the cause was previously installed versions of Java that conflicted with my preferred version, even after I selected my version using the alternatives command.

While working through the issue I also found a suggestion to disable certificate validation, which I implemented since this is not a production cluster. I am listing it as Solution 3 here.

Solution 1 – Deploy new hosts with no previous JDK

After much tinkering with the alternatives command to repair my JDK configuration, I decided that it was easier to start with a new set of AWS nodes and ensure that no JDK was installed in my image before I began prepping each node. If you have nodes that show the issue after an upgrade, read Solution 2.

I am including below the script I used to download and configure the correct JDK prerequisite for my version of Ambari and HDP:

#!/bin/bash
#Script Name: ignacio_install_jdk.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
##Install Oracle JDK
export hm=/home/ec2-user
cd /usr
sudo mkdir -p jdk64/jdk1.8.0_112
cd jdk64
sudo wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
sudo gunzip jdk-8u112-linux-x64.tar.gz
sudo tar -xf jdk-8u112-linux-x64.tar

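#configure paths
chmod 666 $hm/.bash_profile
#single quotes keep $PATH literal so it is expanded at login, not when this script runs
echo 'export JAVA_HOME=/usr/jdk64/jdk1.8.0_112' >>  $hm/.bash_profile
echo 'export PATH=$PATH:/usr/jdk64/jdk1.8.0_112/bin' >>  $hm/.bash_profile
#restore the permissions on the same file that was opened up above
chmod 640 $hm/.bash_profile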

#configure java version using alternatives
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

#if the link to /usr/bin/java is broken (file displays red), rebuild using:
#ln -s -f /usr/jdk64/jdk1.8.0_112/bin/java /usr/bin/java
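
One thing to watch with a script like this is that every rerun appends the same export lines to .bash_profile again. A minimal sketch of one way around that, assuming GNU or BSD grep is available; append_once is a hypothetical helper name and is not part of the original script:

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already present,
# so the install script can be rerun without duplicating entries.
# append_once is a hypothetical helper name.
append_once() {
    local file="$1" line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text literally, -q stays quiet
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

In the script above, the two echo lines could then be replaced with calls such as append_once "$hm/.bash_profile" 'export JAVA_HOME=/usr/jdk64/jdk1.8.0_112'.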

Solution 2 – Install new JDK with utilities

I realize scrapping nodes is not always an option, especially for those experiencing the issue after an install. Because of a tight deadline I did not try the solution described here, but it addresses what I think the issue is.

Scenario: On my original nodes that had a previous non-compatible version of JDK installed I issued the following command to select my new JDK as preferred:

#Select Oracle's JDK 1.8 as preferred after install (see install script on Solution 1)
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

Issue: After selecting my new JDK I was able to see it listed in the configurations with the command below, but all of the Java utilities, such as jar and javadoc, were pointing to null in my preferred JDK.

#list Java configured alternatives
sudo alternatives --display java

Proposed solution: Use the list of tools from the non-compatible version of the JDK to install your new JDK with all the Java utilities as slaves. Please note that you cannot add slaves to an alternative that is already installed; you need to issue the install command with all the utilities at once. An example adding only jar and javadoc is displayed below:

#Select Oracle's JDK 1.8 as preferred after install with a slave configuration for the jar and javadoc utilities:

sudo alternatives --install "/usr/bin/java" "java" "/usr/jdk64/jdk1.8.0_112/bin/java" 1 \
--slave "/usr/bin/jar" "jar" "/usr/jdk64/jdk1.8.0_112/bin/jar" \
--slave "/usr/bin/javadoc" "javadoc" "/usr/jdk64/jdk1.8.0_112/bin/javadoc"
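
Because every utility has to be listed in that single install command, typing it by hand for a dozen tools is error prone. The following is a minimal sketch that assembles the command from a list of tool names. build_alternatives_install is a hypothetical helper name; it only prints the command, and it skips any tool that is not actually present in the JDK's bin directory. Like the rest of this solution, it is untested against a real cluster.

```shell
#!/bin/bash
# Print an "alternatives --install" command for java with one --slave
# entry per utility that exists in the given JDK.
# build_alternatives_install is a hypothetical helper name.
build_alternatives_install() {
    local jdk_home="$1" priority="$2"
    shift 2
    local cmd="alternatives --install /usr/bin/java java ${jdk_home}/bin/java ${priority}"
    local tool
    for tool in "$@"; do
        if [ -x "${jdk_home}/bin/${tool}" ]; then
            cmd="${cmd} --slave /usr/bin/${tool} ${tool} ${jdk_home}/bin/${tool}"
        else
            echo "skipping ${tool}: not found in ${jdk_home}/bin" >&2
        fi
    done
    echo "$cmd"
}
```

A call such as build_alternatives_install /usr/jdk64/jdk1.8.0_112 1 jar javadoc javac keytool prints one long command that can be reviewed and then run with sudo.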

Solution 3 – Disable certificate validation

As I said before, my cluster is not a production one and will not contain sensitive or confidential data, so I opted to implement the suggestion to disable certificate validation as part of my troubleshooting. To do this you have to set verify=disable by editing the /etc/python/cert-verification.cfg file. Do this at your own risk.
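
If the change has to be made on several nodes, the edit can be scripted. The following is a minimal sketch, assuming the file already contains a line that starts with verify= and that sed is available. set_cert_verify is a hypothetical helper name, and the three accepted values are an assumption based on the comments found in that file on Red Hat style systems.

```shell
#!/bin/bash
# Set the verify= line in a cert-verification.cfg style file, keeping a
# .bak copy of the original next to it.
# set_cert_verify is a hypothetical helper name.
set_cert_verify() {
    local file="$1" value="$2"
    case "$value" in
        enable|disable|platform_default) ;;
        *) echo "unexpected value: $value" >&2; return 1 ;;
    esac
    sed -i.bak -E "s/^verify[[:space:]]*=.*/verify=${value}/" "$file"
}
```

On a real node the file is owned by root, so the function would have to run in a root shell. The .bak copy makes it easy to put the original setting back once troubleshooting is over.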

What to enter under Hadoop config resources? Where to find core-site.xml and hdfs-site.xml?

Posted on March 24, 2018 (updated July 17, 2018) by admin

While configuring an Apache NiFi data flow (within Hortonworks DataFlow), I ran into the need to configure the PutHDFS component to connect to HDFS. This personal knowledge base article documents the locations of the resources I needed.

Where are my core-site.xml and hdfs-site.xml files located? What should I enter under Hadoop Config Resources?

When configuring Apache NiFi to connect to HDFS using the PutHDFS component, you will need to enter the location of your core-site.xml and hdfs-site.xml files under Hadoop config resources. Below you can see the location on my Hadoop node. To find the location in your installation, look under the /etc/hadoop directory.

The script below can help you with this:

#find the hadoop folder
cd /etc/hadoop
#run a search for the core-site.xml file, starting at the current location
find . -name core-site.xml

#in my case after examining the results from the command the file is located at:
/etc/hadoop/2.6.5.0-292/0/core-site.xml

#I then went to the directory and listed its contents to find the location of my HDFS config file:
/etc/hadoop/2.6.5.0-292/0/hdfs-site.xml
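
The PutHDFS property takes both paths as one comma-separated value. A minimal sketch that builds that value and refuses to print it if either file is missing; hadoop_config_resources is a hypothetical helper name:

```shell
#!/bin/bash
# Print the comma-separated value for NiFi's Hadoop config resources field,
# but only if both files are really present in the given directory.
# hadoop_config_resources is a hypothetical helper name.
hadoop_config_resources() {
    local conf_dir="$1" name
    for name in core-site.xml hdfs-site.xml; do
        if [ ! -f "${conf_dir}/${name}" ]; then
            echo "missing ${conf_dir}/${name}" >&2
            return 1
        fi
    done
    echo "${conf_dir}/core-site.xml,${conf_dir}/hdfs-site.xml"
}
```

With the directory found above, the call would be hadoop_config_resources /etc/hadoop/2.6.5.0-292/0, and the printed line can be pasted into the processor configuration. Keep in mind that the paths must be readable from the host where NiFi runs.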

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Posted on February 10, 2018 (updated July 18, 2018) by admin

“ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem so let’s examine a couple of the services and constructs that it provides to distributed computing applications:

Locks: Zookeeper provides mechanisms to create and maintain globally distributed locks. This allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
Queues: Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues, where a list of messages or objects is held by a Zookeeper node. Clients connect to that node to submit new queue members as well as to request a list of the members pending processing. This allows applications to implement asynchronous processes where a unit of processing is placed on a queue and processed whenever the next worker process is available to take on the work.
Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two-phase commit pattern through its infrastructure. Each client applies the transaction tentatively in the first commit phase and notifies the coordination node, which then lets all parties involved know whether the transaction was globally successful.
Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point once all worker processes have executed their independent units of work.
Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes. This helps applications running on a cluster optimize for locality and load balancing.

As you can see, Zookeeper plays a vital role as a foundation service for distributed applications that need to coordinate independent, asynchronous processes across many computing nodes in a cluster environment.
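
To make the leader election item more concrete: in the recipe described in [2], each candidate creates an ephemeral sequential znode, and the candidate holding the lowest sequence number acts as leader. The following is a local simulation of just that selection rule. It does not talk to a Zookeeper server; elect_leader is a hypothetical helper name and the znode names are made-up examples.

```shell
#!/bin/bash
# Local simulation only: nothing here contacts a Zookeeper ensemble.
# Each argument stands for a sequential znode name such as n_0000000007.
# Sequence numbers are zero padded, so a plain byte-wise sort puts the
# lowest one first, and that candidate is treated as the leader.
elect_leader() {
    printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}
```

Calling elect_leader n_0000000012 n_0000000007 n_0000000031 picks n_0000000007. In a real deployment the ephemeral znode of a crashed leader disappears with its session, and the same rule applied to the remaining znodes yields the next leader.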

References:

[1] Zookeeper Website, http://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Error deploying Hive using the Ambari agent (MySQL JAVA Connector JAR file missing)

Posted on January 20, 2018 (updated July 17, 2018) by admin

I recently ran into an issue when deploying Hive using Ambari. Here’s my personal Knowledge Base article on the issue.

Symptoms

When deploying Hive and Hive Metastore to a new node, the “Hive Metastore Start” task fails. At the bottom of stderr I found the following message:

File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 52, in __call__

    return self.get_content()

  File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 197, in get_content

    raise Fail("Failed to download file from {0} due to HTTP error: {1}".format(self.url, str(ex)))

resource_management.core.exceptions.Fail: Failed to download file from http://<my_node_host>:8080/resources/mysql-connector-java.jar due to HTTP error: HTTP Error 404: Not Found

Root Cause

I had deployed the Ambari, HDP and HDP Utils repositories on a local mirror host. It seems that the Ambari Agent assumes the MySQL Java Connector JAR file is also hosted by my local mirror.

This could also happen if the repositories configured during deployment no longer host the MySQL JAVA Connector JAR file.

Solution

Once I figured the root cause out it was easy to solve the issue by using YUM to manually install the MySQL JAVA Connector JAR file:

#On the Linux/Unix host(s) experiencing the install error, issue the following command:

sudo yum -y -q install mysql-connector-java
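
Note that the 404 in the stack trace comes from the resources folder of the Ambari server (port 8080). If the error persists after the yum install, it may be that the server does not know about the driver yet. The following is a minimal sketch that prints the registration command, to be run on the Ambari server host. jdbc_setup_command is a hypothetical helper name, and the default jar path is an assumption about where the yum package places the file.

```shell
#!/bin/bash
# Print the Ambari server command that registers a MySQL JDBC driver jar.
# jdbc_setup_command is a hypothetical helper; the default path is an
# assumption about where the yum package installs the jar.
jdbc_setup_command() {
    local jar="${1:-/usr/share/java/mysql-connector-java.jar}"
    if [ ! -f "$jar" ]; then
        echo "driver jar not found: $jar" >&2
        return 1
    fi
    echo "ambari-server setup --jdbc-db=mysql --jdbc-driver=$jar"
}
```

The helper refuses to print anything when the jar is missing, which also serves as a quick check that the yum install actually delivered the file.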

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Posted on January 8, 2018 (updated July 17, 2018) by admin

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort. Once business requirements align with the need to extract new insights from the stored data, a development team can leverage the “schema on read” paradigm to create metadata about these files.

Having a SQL interpreter allows business analysts and power users to have access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.
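
To illustrate what “schema on read” can look like in practice, the following minimal sketch prints a HiveQL statement that lays a table definition over tab-delimited files already sitting in HDFS. Everything specific in it is a hypothetical example: the helper name external_table_ddl, the column names, and any table name or HDFS location passed in.

```shell
#!/bin/bash
# Print a HiveQL statement that projects a schema onto existing files.
# The helper name, the three columns, and the tab-delimited layout are
# all hypothetical examples.
external_table_ddl() {
    local table="$1" location="$2"
    cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS ${table} (
    event_time STRING,
    user_id    STRING,
    url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '${location}';
EOF
}
```

The output of external_table_ddl web_logs /data/raw/web_logs could be passed to hive -e. Because the table is declared EXTERNAL, the files stay where they are, and dropping the table later removes only the metadata.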

Hortonworks HDP / Ambari Install – Configure Open File Descriptors

Posted on December 9, 2017 (updated July 18, 2018) by admin

This is a very simple script I run on the target hosts that I am prepping for Apache Ambari Server or Agent and Hadoop deployment. It displays the current settings for Linux open file descriptors and then asks the user whether they are below Ambari’s and/or Hadoop’s minimum system requirements. If that is the case, the script updates the settings for you:

#!/bin/bash
#Script Name: ignacio_ofd.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: http://linkedin.com/in/idelatorre
#################
#Configure Maximum Open File Descriptors
echo ">>> Configure Maximum Open File Descriptors..."
echo "! ! ! Pay attention to the output below, if either of the two numbers displayed is less than 10,000, enter y at the prompt:"

ulimit -Sn
ulimit -Hn
echo "Enter y if the limits are below 10,000:"
read var_yesno
if [ "$var_yesno" = "y" ]
then
    echo "Updating /etc/security/limits.conf"
    #this updates the limits globally
    #tee -a runs under sudo, so the file permissions do not need to be loosened first
    printf '%s\n' \
        "ubuntu    hard    nofile    10000" \
        "ubuntu    soft    nofile    10000" \
        "root    hard    nofile    100000" \
        "root    soft    nofile    100000" | sudo tee -a /etc/security/limits.conf > /dev/null
else
    echo "ulimit not updated, not necessary"
fi
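
For unattended node prep, the manual y/n prompt could be replaced with a direct comparison. A minimal sketch, where below_minimum is a hypothetical helper name and 10000 is the threshold used in the script above:

```shell
#!/bin/bash
# Succeed (exit status 0) when a ulimit value is below the required minimum.
# ulimit can report the word "unlimited", which is never below the minimum.
# below_minimum is a hypothetical helper name.
below_minimum() {
    local current="$1" minimum="$2"
    if [ "$current" = "unlimited" ]; then
        return 1
    fi
    [ "$current" -lt "$minimum" ]
}
```

The prompt and read in the script could then become: if below_minimum "$(ulimit -Sn)" 10000 || below_minimum "$(ulimit -Hn)" 10000; then ... fi.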

Hortonworks HDP / Ambari Install – Configure Network Time Protocol (NTP)

Posted on November 17, 2017 (updated July 18, 2018) by admin

This is a small script I developed to configure NTP on my hosts before deploying the Ambari server or agent and Hadoop:

#!/bin/bash
#Script Name: ignacio_config_ntp.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
#configure ntp to auto-start at boot time
#Install NTP
sudo yum install -y -q ntp

#Disable autostart
sudo systemctl disable ntpd
sudo timedatectl set-ntp no

#configure NTP
sudo ntpdate pool.ntp.org
sudo timedatectl set-timezone America/Los_Angeles

#re-enable NTP autostart
sudo systemctl enable ntpd
sudo timedatectl set-ntp on
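
After the script has run, it is worth confirming that the clock is actually reported as synchronized. The following is a minimal sketch that inspects timedatectl output passed on stdin. clock_is_synchronized is a hypothetical helper name; it matches both the older "NTP synchronized" label and the newer "System clock synchronized" label.

```shell
#!/bin/bash
# Succeed if timedatectl-style text on stdin reports a synchronized clock.
# Older systemd versions print "NTP synchronized", newer ones print
# "System clock synchronized"; both labels are matched.
# clock_is_synchronized is a hypothetical helper name.
clock_is_synchronized() {
    grep -Eq '^[[:space:]]*(NTP synchronized|System clock synchronized): yes'
}
```

Usage would be along the lines of: timedatectl | clock_is_synchronized && echo "clock is in sync". Synchronization can take a few minutes after ntpd starts, so a "no" right after the script finishes is not necessarily a problem.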

Hadoop Ecosystem: SQOOP – The Data Mover

Posted on May 26, 2017 (updated July 17, 2018) by admin

SQOOP is an open source project hosted by the Apache Software Foundation whose objective is to provide a tool that allows users to move large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March of 2012 and is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools, where each of these sub-modules serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You specify the tool you are invoking when you use Sqoop. In terms of syntax, each of these tools has a specific set of arguments while supporting global arguments as well.

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose:

Sqoop import: Helps users import a single table into Hadoop
Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
Sqoop export: Allows users to export a set of files from HDFS back into a relational database
Sqoop create-hive-table: Creates a Hive table definition that matches a relational table, without importing the data itself
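
As an illustration of the "tool plus arguments" structure, the following minimal sketch assembles an import command for a single MySQL table. build_sqoop_import is a hypothetical helper name, and the host, database, user, table and target directory are placeholders supplied by the caller. It only prints the command so it can be checked before running.

```shell
#!/bin/bash
# Print (not run) a sqoop import command for a single MySQL table.
# build_sqoop_import is a hypothetical helper; all connection details are
# placeholders passed in by the caller. -P makes Sqoop prompt for the
# password instead of taking it on the command line.
build_sqoop_import() {
    local host="$1" db="$2" user="$3" table="$4" target_dir="$5" mappers="${6:-4}"
    echo "sqoop import --connect jdbc:mysql://${host}/${db} --username ${user} -P" \
         "--table ${table} --target-dir ${target_dir} --num-mappers ${mappers}"
}
```

Here import is the tool name, and everything after it is an argument of that tool. The --num-mappers value controls how many parallel map tasks read from the database; 4 is just the default chosen for this sketch.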
