This is the general syntax for dropping a partition in Apache Hive:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;
So the syntax to drop a range of partitions in a table that uses year as a partitioning column is:
ALTER TABLE mytable DROP IF EXISTS PARTITION (year>2019) PURGE;
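If you drop partition ranges often, a small wrapper can build the statement so you can review it before running it. This is only a sketch: the function name build_drop_partitions_sql is hypothetical, and the commented-out call assumes the hive command-line client is on your PATH.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hive): prints the DROP PARTITION statement
# for every partition whose partitioning column is greater than a given value.
build_drop_partitions_sql() {
  local table="$1" column="$2" lower_bound="$3"
  printf 'ALTER TABLE %s DROP IF EXISTS PARTITION (%s>%s) PURGE;\n' \
    "$table" "$column" "$lower_bound"
}

# Print the statement so it can be reviewed first.
build_drop_partitions_sql mytable year 2019

# To execute it (assumes the hive CLI is installed):
# hive -e "$(build_drop_partitions_sql mytable year 2019)"
```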
I like a lot of detail about my script runs when I am doing unit testing. I usually append status and event messages to a summarized log file that I call my email file. This is the syntax I use to send an email with this summary when my script is done running:
mail -s "My script completed" [email protected] < $my_email_summary
So the general syntax for the command is:
mail -s "subject" myemail@mydomain < file_name
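The pattern of collecting messages in a summary file and mailing it at the end could look roughly like the sketch below. The log_event function, the temporary file location, and the MAIL_TO variable are all assumptions made for illustration; the mail step is skipped when no recipient is set or when mail is not installed.

```shell
#!/bin/bash
# Hypothetical location for the summary ("email") file.
my_email_summary="$(mktemp)"

# Append one timestamped status or event message to the summary file.
log_event() {
  printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$my_email_summary"
}

log_event "INFO script started"
# ... the real work of the script would go here ...
log_event "INFO script finished"

# Send the summary only if mail is available and MAIL_TO (an assumed variable) is set.
if command -v mail >/dev/null 2>&1 && [ -n "${MAIL_TO:-}" ]; then
  mail -s "My script completed" "$MAIL_TO" < "$my_email_summary"
fi
```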
As I was trying to automate my unit testing I ran into this requirement. The general syntax to execute a command remotely is:
ssh -t user@host command
So, for example, if I wanted to execute my_script.sh in my home folder I could do:
ssh -t myusername@myserver /home/my_user/my_script.sh
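When automating tests it helps to know whether the remote command succeeded. ssh exits with the status of the remote command (or 255 if the connection itself fails), so a thin wrapper can report failures. This is a sketch only: run_remote and the SSH_CMD override are hypothetical names, and SSH_CMD exists just so the wrapper can be dry-run with echo.

```shell
#!/bin/bash
# SSH_CMD is a hypothetical override; set it to "echo" to see the command
# that would be run without opening a connection.
SSH_CMD="${SSH_CMD:-ssh}"

# Run a command on a remote host and report a non-zero exit status.
run_remote() {
  local target="$1"
  shift
  "$SSH_CMD" -t "$target" "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "remote command on $target failed with status $status" >&2
  fi
  return "$status"
}

# Example (commented out because it needs a reachable server):
# run_remote myusername@myserver /home/my_user/my_script.sh
```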
In configuring an Apache NiFi Data Flow (within Hortonworks DataFlow) I ran into the need to configure the Hive Streaming component to connect to a Hive table. This personal knowledge base article documents the locations of the resources I needed.
What is my Hive Metastore URI?
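This is located on your Hive Metastore host at port 9083 and uses the Thrift protocol, so the URI takes the form thrift://<your-metastore-host>:9083, where <your-metastore-host> stands for the host name of your own Metastore server.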
Where is my hive-site.xml file located? What should I enter under Hive Config Resources?
When configuring Apache NiFi to connect to a Hive table using Hive Streaming you will need to enter the location of your hive-site.xml file under Hive Config Resources. Below you can see the location on my Hadoop node. To find the location in your installation, look under the /etc/hive directory; the script below can help you with this:
#find the Hive folder
cd /etc/hive
#run a search for the hive-site.xml file, starting at the current location
find . -name hive-site.xml
#in my case after examining the results from the command the file is located at: /etc/hive/184.108.40.206-292/0/hive-site.xml
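Since hive-site.xml normally carries the hive.metastore.uris property, the same search can also answer the Metastore URI question. The sketch below makes some assumptions: the function names are hypothetical, and the extraction only works when the <value> element sits on the line directly after the <name> element, which is the usual layout but is not guaranteed.

```shell
#!/bin/bash
# Print the path of every hive-site.xml found under a directory (default: /etc/hive).
find_hive_site() {
  find "${1:-/etc/hive}" -name hive-site.xml 2>/dev/null
}

# Print the value of hive.metastore.uris from a hive-site.xml file.
# Assumes <value> is on the line immediately after <name>.
metastore_uri() {
  grep -A1 '<name>hive.metastore.uris</name>' "$1" |
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# List each file found together with the Metastore URI it declares.
find_hive_site | while IFS= read -r site_file; do
  echo "$site_file -> $(metastore_uri "$site_file")"
done
```

If more than one hive-site.xml turns up, you still need to examine the results and pick the one that belongs to your active installation.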
Apache Spark Resilient Distributed Datasets (RDDs) are the main vehicle used by the processing engine to represent a dataset. Given that the name itself is pretty self-explanatory, let’s look into each of these attributes in additional detail:
- Distributed: This is the key attribute of RDDs. An RDD is a collection of partitions, or fragments, distributed across processing nodes. This allows Spark to fit and process massive data sets in memory by distributing the workload in parallel across a collection of worker nodes.
- Resilient: The ability to recover from failure. Spark achieves this primarily by tracking each RDD's lineage (the sequence of transformations that produced it), so if a worker node goes offline the lost partitions can be recomputed on another node. Replicated storage levels can additionally keep copies of each partition on more than one node.
I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage of RDD operations and best practices, as well as Apache Spark DataFrames.
Apache Spark Programming Guide
I recently ran into an issue when deploying Ambari Agent to a new host in my cluster. Here’s my personal Knowledge Base article on the issue.
When deploying Ambari Agent to a new node, the wizard fails. At the bottom of stderr I found the following:
INFO DataCleaner.py:122 - Data cleanup finished
INFO 11,947 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname '<my host>' using socket.getfqdn().
INFO 11,952 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 11,954 main.py:439 - Connecting to Ambari server at https://<my host>:8440 (172.31.42.192)
INFO 955 NetUtil.py:70 - Connecting to https://<my host>:8440/ca
ERROR 11,958 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
ERROR 11,958 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions. Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.
WARNING 11,958 NetUtil.py:124 - Server at https://<my host>:8440 is not reachable, sleeping for 10 seconds...
After solving the issue I can say that, in my case, the cause was previously installed versions of Java that conflicted with my preferred version, even after selecting my version using the alternatives command.
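A quick way to see which Java binary a host actually resolves to is to follow the symlink chain behind the java command. The sketch below assumes a Linux host where readlink supports the -f option; the function name resolve_binary is hypothetical.

```shell
#!/bin/bash
# Print the real file behind a command on the PATH, following all symlinks
# (for example /usr/bin/java -> /etc/alternatives/java -> the JDK binary).
resolve_binary() {
  local path
  path="$(command -v "$1")" || return 1
  readlink -f "$path"
}

if actual_java="$(resolve_binary java)"; then
  echo "java resolves to: $actual_java"
else
  echo "java is not on the PATH"
fi
```

If the printed path is not inside the JDK you expect, an older installation is still taking precedence.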
While working through the issue I also found a suggestion to disable certificate validation, which I implemented since this is not a production cluster. I am listing it as Solution 3 here.
Solution 1 – Deploy new hosts with no previous JDK
After much tinkering with the alternatives command to repair my JDK configuration I decided that it was easier to start with a new set of AWS nodes and ensure that no JDK was installed in my image before I began prepping each node. If you have nodes that are having the issue after an upgrade, read Solution 2.
I am including the script I used to download and configure the correct JDK pre-requisite for my version of Ambari and HDP below for your consumption:
#!/bin/bash
#Script Name: ignacio_install_jdk.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
##Install Oracle JDK
export hm=/home/ec2-user
cd /usr
sudo mkdir -p jdk64/jdk1.8.0_112
cd jdk64
sudo wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
sudo gunzip jdk-8u112-linux-x64.tar.gz
sudo tar -xf jdk-8u112-linux-x64.tar
#configure paths
chmod 666 "$hm/.bash_profile"
echo 'export JAVA_HOME=/usr/jdk64/jdk1.8.0_112' >> "$hm/.bash_profile"
#single quotes keep $PATH unexpanded until the profile is loaded
echo 'export PATH=$PATH:/usr/jdk64/jdk1.8.0_112/bin' >> "$hm/.bash_profile"
#restore the permissions on the same profile file
chmod 640 "$hm/.bash_profile"
#configure java version using alternatives
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1
#if the link to /usr/bin/java is broken (file displays red), rebuild using:
#ln -s -f /usr/jdk64/jdk1.8.0_112/bin/java /usr/bin/java
Solution 2 – Install new JDK with utilities
I realize scrapping nodes is not always an option, especially for those experiencing the issue after an install. Because of a tight deadline I did not try the solution described here, but it addresses what I think the issue is.
Scenario: On my original nodes, which had a previous incompatible version of the JDK installed, I issued the following command to select my new JDK as preferred:
#Select Oracle's JDK 1.8 as preferred after install (see install script on Solution 1)
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1
Issue: After selecting my new JDK I was able to see it listed in the configurations with the command below, but all of the Java utilities such as jar, javadoc, etc. were pointing to null in my preferred JDK.
#list Java configured alternatives
sudo alternatives --display java
Proposed solution: Use the list of tools from the incompatible version of the JDK to install your new JDK with all the Java utilities as slaves. Please note that you cannot add slaves to an already installed alternative; you need to issue the install command with all the utilities at once. An example adding only the jar and javadoc utilities is displayed below:
#Select Oracle’s JDK 1.8 as preferred after install with a slave configuration for the JAR and javadoc utilities:
sudo alternatives --install "/usr/bin/java" "java" "/usr/jdk64/jdk1.8.0_112/bin/java" 1 \
  --slave "/usr/bin/jar" "jar" "/usr/jdk64/jdk1.8.0_112/bin/jar" \
  --slave "/usr/bin/javadoc" "javadoc" "/usr/jdk64/jdk1.8.0_112/bin/javadoc"
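Because every utility has to be listed in a single install command, typing it by hand gets long quickly. The sketch below generates the command from a list of tool names so it can be reviewed before being run with sudo. Like the rest of this solution it is untested against a real cluster; build_alternatives_cmd is a hypothetical name, and the tool list is whatever you take from the old JDK's alternatives entry.

```shell
#!/bin/bash
# Print one "alternatives --install" command that registers java as the master
# link and every listed tool as a slave. Nothing is executed here.
build_alternatives_cmd() {
  local jdk_home="$1"
  shift
  local cmd="alternatives --install /usr/bin/java java $jdk_home/bin/java 1"
  local tool
  for tool in "$@"; do
    cmd="$cmd --slave /usr/bin/$tool $tool $jdk_home/bin/$tool"
  done
  printf '%s\n' "$cmd"
}

# Review the generated command, then run it with sudo if it looks right.
build_alternatives_cmd /usr/jdk64/jdk1.8.0_112 jar javadoc
```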
Solution 3 – Disable certificate validation
Like I said before, my cluster is not a production one and will not contain sensitive or confidential data, so I opted to implement the suggestion to disable certificate validation as part of my troubleshooting. To do this you have to set verify=disable by editing the /etc/python/cert-verification.cfg file. Do this at your own risk.
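The edit itself can be scripted. The sketch below assumes the file already contains a line starting with verify= (typically under an [https] section) and keeps a backup copy next to it; the function name disable_cert_verification is hypothetical. The same warning applies: this turns off HTTPS certificate checks for Python on that host.

```shell
#!/bin/bash
# Set verify=disable in a cert-verification.cfg style file.
# The original file is kept as <file>.bak.
disable_cert_verification() {
  local cfg="$1"
  [ -f "$cfg" ] || { echo "not found: $cfg" >&2; return 1; }
  sed -i.bak 's/^verify[[:space:]]*=.*/verify=disable/' "$cfg"
  # Show the resulting setting for confirmation.
  grep '^verify' "$cfg"
}

# Example (commented out; run as root on a non-production host only):
# disable_cert_verification /etc/python/cert-verification.cfg
```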
This is my first post on the Google File System, where I will very briefly touch on a very specific feature set driven by the conscious design tradeoffs that have made GFS and derived systems so successful.
- Highly Redundant Data vs. Highly Available Hardware: When working with petabytes of data, hardware failure is the norm rather than the exception. Expensive, highly redundant hardware is replaced with commodity components, which allow the file system to store multiple copies of data across storage nodes and switches at a reasonable cost.
- Store a small number of large files vs. millions of small individual documents: With the need to store hundreds of terabytes composed of billions of small objects (e.g. email messages, web pages), GFS attempts to simplify file system design by serializing these small individual objects and grouping them together into larger files. Having a small number of large files allows GFS to keep all file and namespace metadata in memory on the GFS master, which in turn allows the master to leverage this global visibility to make smarter load balancing and redundancy decisions.
- Generally Immutable data: Once a serialized object or file record is written to disk it is generally not updated again; as Google states in their research paper, random writes are practically non-existent. This is driven by application requirements where data is generally written once and then consumed by applications over time without alteration. Google describes the application data as mutating either by inserting new records or by appending to the last “chunk” or block of a file, and applications are encouraged to constrain their update strategies to these two operations.
In my next series of posts I will analyze other architecture and performance characteristics that make the Google File System brilliantly innovative. Stay tuned!
“The Google File System”; Ghemawat, Gobioff, Leung; Google Research
Continuing with the rapid innovation of the Apache Spark code base, the Spark Streaming API allows enterprises to leverage the full power of the Spark architecture to process real-time workloads.
Built upon the foundation of Core Spark, Spark Streaming is able to consume data from common real-time pipelines such as Apache Kafka, Apache Flume, Amazon Kinesis, and TCP sockets, and run complex algorithms (MLlib predictive models, GraphX algorithms). Results can then be displayed in real-time dashboards or stored in HDFS.
- Apache Spark Streaming Programming Guide
In configuring an Apache NiFi Data Flow (within Hortonworks DataFlow) I ran into the need to configure the PutHDFS component to connect to HDFS. This personal knowledge base article documents the locations of the resources I needed.
Where are my core-site.xml and hdfs-site.xml files located? What should I enter under Hadoop Config Resources?
When configuring Apache NiFi to connect to HDFS using the PutHDFS component you will need to enter the location of your core-site.xml and hdfs-site.xml files under Hadoop Config Resources. Below you can see the location on my Hadoop node. To find the location in your installation, look under the /etc/hadoop directory.
The script below can help you with this:
#find the hadoop folder
cd /etc/hadoop
#run a search for the core-site.xml file, starting at the current location
find . -name core-site.xml
#in my case after examining the results from the command the file is located at: /etc/hadoop/220.127.116.11-292/0/core-site.xml
#I then went to the directory and listed its contents to find the location of my HDFS config file: /etc/hadoop/18.104.22.168-292/0/hdfs-site.xml
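PutHDFS takes the two files as a comma-separated list, so the search can be extended to print a value that is ready to paste. This is a sketch under a few assumptions: hadoop_config_resources is a hypothetical name, hdfs-site.xml is expected in the same directory as core-site.xml, and when several copies of core-site.xml exist the function simply takes the first one found, so check that it is the one your cluster actually uses.

```shell
#!/bin/bash
# Print "<path>/core-site.xml,<path>/hdfs-site.xml" for the first core-site.xml
# found under a directory (default: /etc/hadoop).
hadoop_config_resources() {
  local root="${1:-/etc/hadoop}" core dir
  core="$(find "$root" -name core-site.xml 2>/dev/null | head -n 1)"
  [ -n "$core" ] || return 1
  dir="$(dirname "$core")"
  # Assumes hdfs-site.xml lives next to core-site.xml.
  [ -f "$dir/hdfs-site.xml" ] || return 1
  printf '%s,%s\n' "$core" "$dir/hdfs-site.xml"
}

hadoop_config_resources || echo "core-site.xml and hdfs-site.xml not found together" >&2
```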