What is the Hive Metastore URI address? Where is hive-site.xml? What goes under Hive Config Resources?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

While configuring an Apache NiFi data flow (within Hortonworks DataFlow) I ran into the need to configure the Hive Streaming component to connect to a Hive table. This personal knowledge base article documents the locations of the resources I needed.

What is my Hive Metastore URI?

The metastore service runs on your Hive Metastore host and, by default, listens on port 9083 using the Thrift protocol. An example URI looks like this:

thrift://<host_name>:9083
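
If you are not sure which host runs the metastore, the URI is defined in hive-site.xml under the hive.metastore.uris property. As a quick check (the /etc/hive/conf path is the usual HDP symlink and is an assumption here; see the next section for locating the file on your node):

# print the hive.metastore.uris property and its value from hive-site.xml
# (/etc/hive/conf is the typical HDP location; adjust to your installation)
grep -A 1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml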

Where is my hive-site.xml file located? What should I enter under Hive Config Resources?

When configuring Apache NiFi to connect to a Hive table using Hive Streaming, you will need to enter the location of your hive-site.xml file under Hive Config Resources. Below you can see the location on my Hadoop node; to find the location in your installation, look under the /etc/hive directory. The script below can help with this:

# find the Hive folder
cd /etc/hive

# run a search for the hive-site.xml file, starting at the current location
find . -name hive-site.xml

# in my case, after examining the results of the command, the file is located at:
/etc/hive/2.6.5.0-292/0/hive-site.xml
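
Enter the full path you found under Hive Config Resources in NiFi. The property accepts a comma-separated list of configuration files, and depending on your cluster you may also need to point it at core-site.xml and hdfs-site.xml. A sketch of what the value could look like (the Hadoop config paths below are typical HDP defaults, not taken from my node; verify them on yours):

# example value for NiFi's Hive Config Resources property; the core-site.xml
# and hdfs-site.xml paths are assumptions for a typical HDP installation
/etc/hive/2.6.5.0-292/0/hive-site.xml,/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml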

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
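
For example, HiveQL's TRANSFORM clause is how custom mapper logic can be plugged into an otherwise ordinary query. A minimal sketch, run through beeline (the JDBC URL, table, columns, and script name are placeholders; the script would also need to be shipped to the cluster first, e.g. with ADD FILE):

# stream rows of a (hypothetical) web_logs table through a custom script,
# the HiveQL equivalent of a hand-written mapper; 10000 is the default
# HiveServer2 port
beeline -u "jdbc:hive2://<host_name>:10000" -e "
  SELECT TRANSFORM (user_id, url)
         USING 'python parse_urls.py'
         AS (user_id, domain)
  FROM   web_logs;"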

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort. Once business requirements align with the need to extract new insights from the stored data, a development team can leverage the “schema on read” paradigm to create metadata about these files.
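
A minimal sketch of what that looks like in practice: a CREATE EXTERNAL TABLE statement projects a schema onto files already sitting in HDFS, without moving or rewriting them (the path, delimiter, and columns below are assumptions for illustration):

# project a tab-delimited schema onto existing HDFS files ("schema on read");
# dropping an EXTERNAL table removes only the metadata, not the files
beeline -u "jdbc:hive2://<host_name>:10000" -e "
  CREATE EXTERNAL TABLE web_logs (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw/web_logs';"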

Having a SQL interpreter gives business analysts and power users access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce, where a very specialized skill set was required to write multiple map and reduce functions to achieve the same results.
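
To make the contrast concrete, the query below expresses a top-ten aggregation, the kind of job that would otherwise require hand-written map and reduce functions, as a single familiar SQL statement (table and column names are the same hypothetical ones as above):

# a familiar SQL aggregation in place of a custom MapReduce job
beeline -u "jdbc:hive2://<host_name>:10000" -e "
  SELECT url, COUNT(*) AS hits
  FROM   web_logs
  GROUP  BY url
  ORDER  BY hits DESC
  LIMIT  10;"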