The Analytics Journal

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Posted on February 10, 2018 (updated July 18, 2018) by admin

“ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem, so let’s examine some of the services and constructs that it provides to distributed computing applications:

Locks: Zookeeper provides mechanisms to create and maintain globally distributed locks. This allows applications to maintain transaction atomicity for any kind of object by ensuring that, at any point in time, no two clients or transactions hold a lock on the same resource.
Queues: Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues, where a list of messages or objects is held by a Zookeeper node. Clients connect to that node to submit new queue members and to request the list of members pending processing. This allows applications to implement asynchronous processes where a unit of work is placed on a queue and processed as soon as the next worker process is available.
Two-Phase Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two-phase commit pattern through its infrastructure. Each client applies the transaction tentatively in the first phase and notifies the coordination node, which then lets all parties involved know whether the transaction was globally successful.
Barriers: Zookeeper supports the creation of synchronization points called Barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point once all worker processes have executed their independent units of work.
Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes. This helps applications running on a cluster optimize for locality and load balancing.
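
To make the Locks and Leader Election constructs a little more concrete, here is a minimal sketch of the decision step described in the Zookeeper recipes [2]: every client creates an ephemeral sequential node under a shared parent, the node with the lowest sequence number holds the lock (or is the leader), and every other client watches the node just ahead of its own. The sketch only simulates that rule locally and does not talk to a Zookeeper server; the function names and the lock- node names are assumptions made for illustration.

```shell
# Simulates the decision step of the lock recipe locally.
# The names below stand in for the children a client would get back
# from Zookeeper; they are made up for this example.

# Print the child with the lowest sequence number: the current lock holder.
# A plain sort works because the sequence suffix is zero padded.
lock_holder() {
    printf '%s\n' "$@" | sort | head -n 1
}

# Print the child a waiting client should watch: the one just ahead of its own.
# First argument is the client's own node, the rest are all the children.
node_to_watch() {
    mine="$1"
    shift
    printf '%s\n' "$@" | sort |
        awk -v mine="$mine" '$0 == mine { print prev; exit } { prev = $0 }'
}

children="lock-0000000003 lock-0000000001 lock-0000000002"
lock_holder $children                      # prints lock-0000000001
node_to_watch lock-0000000003 $children    # prints lock-0000000002
```

Watching only the node directly ahead, rather than the lock holder, means that a released lock wakes up a single waiting client instead of all of them.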

As you can see, Zookeeper plays a vital role as a foundation service for distributed applications that need to coordinate independent, asynchronous processes across many computing nodes in a cluster environment.

References:

[1] Zookeeper Website, http://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Google File System Design Assumptions

Posted on October 7, 2017 (updated July 17, 2018) by admin

In today’s post I want to highlight the brilliance of the Google Research team. Their ability to step back and question old assumptions reminds me of the Wright brothers realizing that lift values from the 1700s, along with other widespread assumptions of the time, were the main constraints holding them back from building the first airplane.

At Google Research something similar went on when they realized that traditional data storage and processing paradigms did not fit well with their applications’ processing workloads. Here are some of the design assumptions for Google File System, straight from the published research paper, with my comments:

Failure is an expectation, not an exception
Google realized that the traditional way to address failure in the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is realizing that when dealing with massive datasets (e.g. downloading a copy of the entire web) hardware failure is a fact of life rather than an exception. Once this observation is incorporated into the design, costs can be decreased by storing and processing data on very large clusters of commodity hardware, where redundancy and replication across processing nodes and racks allow for seamless recovery from hardware failure.
The system stores a modest number of large data files
This observation comes from looking at the nature of the data being processed, such as HTML markup from crawling a large number of websites. This is what we would call “unstructured data”, which is cleaned and serialized by the crawler before it is “batched” together into large files. Once again, by taking a step back and looking at the problem with fresh eyes, the researchers realized their design did not need to optimize for the storage of billions of small files. This is a great constraint to remove, as we will see when we look at the ability of the GFS master server to keep the metadata for all files in a cluster in memory, which allows it to make very smart load balancing, placement and replication decisions.
Workloads primarily consist of large streaming reads and small random reads
By looking at actual application workloads the researchers found that they could generally group read operations into these two categories, and that successive read operations from the same client often read contiguous regions of a file. Also, performance-minded applications batch and sort their reads so that they progress steadily through a dataset from beginning to end instead of going back and forth with random I/O operations.
The workloads also have many large, sequential writes that append to data files
Notice how “delete” and “update” operations are extremely rare to non-existent. This frees the system design from the onerous task of maintaining locks to ensure the atomicity of these two operations.
Atomicity with minimal synchronization is essential
The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned on the previous point.
High sustained bandwidth is more important than low latency
A good observation: when dealing with these large datasets most applications are batch oriented and benefit most from high processing throughput, unlike the traditional database application that places a premium on fast response times.

In hindsight these observations might seem obvious, especially as they have been incorporated into the design principles that drive other products such as Apache Hadoop. But Google’s decision to invest in a custom-made file system to fit their very specific needs, and the ability of the Google Research team to step back and start their design with fresh eyes, have truly revolutionized data processing. Cheers to them!

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

How to write a LinkedIn recommendation? What are the key things to touch on or highlight?

Posted on September 29, 2023 by admin

In an earlier post I covered the benefits of endorsing trusted teammates’ skills and writing recommendations for close professional associates as tools to build goodwill, drive engagement with your LinkedIn profile and boost its SEO relevance for recruiters using search engines to scout for candidates in your field of work. Make sure to also read my post on ways to boost your profile as a passive job seeker.

Writing a LinkedIn recommendation for a co-worker involves highlighting their key skills, achievements, and your personal experience working with them.

Here is a list of topics to touch on for inspiration as you write your recommendation. At the end I include a sample recommendation. The key elements to include are:

1. Professional Relationship:

Begin with Context: Start by explaining how you know the person and your professional relationship.

2. Key Skills & Strengths:

Highlight Skills: Focus on 2-3 key skills or strengths that make your co-worker stand out.
Provide Examples: Use specific examples to demonstrate their skills and expertise.

3. Achievements:

Mention Accomplishments: Outline any significant accomplishments, contributions, or projects they’ve led or contributed to.
Results and Impacts: If possible, quantify the impacts and results of their work within your project, organization or internal to your team’s culture.

4. Personal Qualities:

Interpersonal Skills: Touch on their teamwork, mentorship, leadership, and communication skills.
Character Attributes: Discuss attributes like dependability, creativity, or problem-solving abilities.

5. Endorsement:

Personal Endorsement: End with a strong statement of recommendation, indicating your confidence in their abilities and contributions to future employers.
- Would you work with them again?
- Would you recommend them for leadership positions and as mentors or sponsors?
- Would you trust them to lead large projects in your organization?

Sample Recommendation:

I had the pleasure of working with [Co-worker’s Name] for [X years] at [Company Name], where we were both [Your Roles]. [Co-worker’s Name] is not only adept at [Key Skill 1] and [Key Skill 2], but also is a natural leader and team player.

One of [Co-worker’s Name]’s standout qualities is their ability to [Specific Skill or Achievement, e.g., “turn complex problems into actionable, manageable tasks”]. In our time working together, they spearheaded a project that [Explain the Project, Result, and Impact, e.g., “led to a 30% increase in customer satisfaction”].

On a personal level, [Co-worker’s Name]’s ability to build and maintain relationships is unparalleled. They are [Include Personal Qualities, e.g., “dependable, creative, and solution-oriented”], making them a favorite among clients and colleagues alike.

I wholeheartedly recommend [Co-worker’s Name] for any team looking to add a dedicated and innovative professional. They would no doubt be a valuable asset to any organization.
Modify this template according to the specific skills, achievements, and personal qualities of your co-worker to make your recommendation personal and impactful.

How does giving recommendations and endorsing teammate skills on LinkedIn benefit my profile ranking?

Posted on September 22, 2023 (updated September 29, 2023) by admin

Giving recommendations and endorsing skills on LinkedIn can positively influence your professional reputation and visibility. Although it may not directly boost your profile ranking within LinkedIn, it can benefit its ranking on search engines like Google, whose advanced search features are preferred by some recruiters scouting for candidates in your field of work. Here are specific ways it can benefit you:

1. Professional Reputation:

Credibility: By endorsing skills and writing recommendations, you show that you acknowledge and appreciate the expertise of others, which enhances your credibility.
Authority: Giving well-articulated recommendations can position you as an authority in your field.

2. Networking and Relationships:

Reciprocity: Endorsing or recommending someone can often lead to them returning the favor, which can enhance your profile’s credibility.
Networking: This practice strengthens your professional relationships and keeps you top of mind among your connections.

3. Engagement:

Visibility: Your activity is visible to your connections, keeping your profile active and engaging.
Content Creation: Recommendations contribute to content on LinkedIn, boosting your overall activity and engagement.

4. Skills Validation:

Trust: When you endorse skills, it’s a validation from a real person, building trust and authenticity.
Expertise: Your endorsements can sometimes reflect your ability to judge and recognize skills, contributing to your image as an expert.

5. SEO Benefits:

Keywords: Recommendations often contain relevant keywords that can enhance the search visibility of your profile indirectly.
Backlinks: Each time a connection publishes your recommendation on their profile, a backlink to your profile is created. This is an implicit endorsement of your profile for Google’s algorithm. It will also, on occasion, drive recruiters and other connections browsing that profile back to yours.
Search Ranking: Increased engagement and activity can potentially improve your visibility in LinkedIn search results.

6. Professional Image:

Positivity: Giving recommendations and endorsements reflects positively on your professional image as someone supportive and appreciative.
Community Engagement: It shows you’re actively engaged in supporting and uplifting your professional community.

Key Takeaways:

Reciprocal Benefits: Giving recommendations and endorsements can often result in receiving them, enhancing the richness and credibility of your profile.
Engagement: Active participation on LinkedIn, including giving recommendations and endorsements, increases your profile’s visibility and engagement.

While these activities may not directly impact your “ranking” on LinkedIn, they contribute to a richer, more credible, and engaging profile, increasing your visibility and attractiveness to other professionals, recruiters, and potential employers. They also foster a positive professional image and strengthen your network. Ensure your endorsements and recommendations are genuine to maximize their impact.

What are ways to boost your LinkedIn profile as a passive job seeker?

Posted on September 15, 2023 (updated September 29, 2023) by admin

Even as a passive job seeker, it’s essential to maintain and boost your LinkedIn profile to stay visible and attractive to potential employers. Here are some strategies to build goodwill and enhance your profile:

1. Profile Optimization:

Professional Photo: Use a clear, professional photo to make a positive first impression.
Update Information: Ensure your profile is up-to-date with your latest achievements, skills, and experiences.

2. Networking:

Connect: Grow your network by connecting with professionals in and outside of your industry.
Engage: Comment on, like, and share posts from close connections and past associates that are relevant to your field. This helps their content and surfaces it to your own network.

3. Content Sharing and Creation:

Share Useful Content: Regularly share articles, news, and updates related to your industry.
Write Articles: Publish articles on LinkedIn to showcase your knowledge and insights.

4. Recommendations and Endorsements:

Give Recommendations: Write recommendations for your colleagues to build goodwill. This also creates backlinks to your profile, boosting it on search engines such as Google.
Give Endorsements: Networking is a team sport. Make it a habit to endorse the skills you know your close associates possess when visiting their profiles. When you endorse the skills of close associates you boost their profile and your own, as a link back to your profile is created for each skill you endorse.
Request Recommendations: Politely ask for recommendations to strengthen your profile. Share these articles with them so they can realize the benefits of endorsing trusted professional associates.

5. Skills and Accomplishments:

List Skills: Make sure to list all relevant skills and get them endorsed by your connections.
Highlight Achievements: Update your accomplishments, certificates, and projects.
Reorder your skills: Make sure to re-order your listed skills from time to time, boosting unendorsed skills to the top so that people visiting your profile view them and endorse them first.

6. Join Groups:

Participate: Join and actively participate in LinkedIn groups related to your field.
Network: Use groups to network with like-minded professionals and potential employers.

7. Profile Badge:

LinkedIn Badge: Create a LinkedIn Badge and add it to your email signature, blog, or website.

8. Learning:

LinkedIn Learning: Complete LinkedIn Learning courses and add certificates to your profile.

9. Career Interests:

Settings: Update “Career interests” settings to be approachable by recruiters, even if you’re not actively looking.

10. Volunteer Work:

Add Volunteer Experience: This can be a great way to showcase your skills and contributions to the community.

Bonus Tip:

Analytics: Use LinkedIn analytics to understand your profile’s reach and engagement, and adjust your strategies accordingly.

Implementing these strategies will not only enhance your LinkedIn profile but also build goodwill in your professional network. Even as a passive job seeker, staying active and engaged on LinkedIn can open up unexpected opportunities.

ODI Tutorial–Reverse Engineer a Database Table

Posted on December 31, 2020 by admin

Hive: How to drop partitions by range

Posted on February 15, 2019 by admin

This is the general syntax of the DROP PARTITION statement in Apache Hive. The PURGE clause is optional; when it is present the data is deleted immediately instead of being moved to the trash:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec [PURGE];

So the statement to drop a range of partitions in a table that uses year as a partitioning column is:

ALTER TABLE mytable DROP IF EXISTS PARTITION (year>2019) PURGE;
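
Since PURGE skips the trash, it can be worth previewing the statement before running it. Below is a minimal sketch that builds the range drop from a table name, partition column, comparison operator and value. The function name build_drop_range is an assumption made for this example, and the commented-out call assumes the hive command line client is on your PATH.

```shell
# Build a DROP PARTITION statement for a range of partitions.
# build_drop_range is a hypothetical helper name, not part of Hive.
build_drop_range() {
    table="$1"
    column="$2"
    operator="$3"
    value="$4"
    printf 'ALTER TABLE %s DROP IF EXISTS PARTITION (%s%s%s) PURGE;\n' \
        "$table" "$column" "$operator" "$value"
}

# Preview the statement first
build_drop_range mytable year '>' 2019
# prints: ALTER TABLE mytable DROP IF EXISTS PARTITION (year>2019) PURGE;

# Once it looks right, hand it to the Hive CLI (assumes hive is on the PATH):
# hive -e "$(build_drop_range mytable year '>' 2019)"
```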

How to send email from a unix / linux script

Posted on February 15, 2019 by admin

I like a lot of detail about my script runs when I am doing unit testing. I will usually append status and event messages to a summarized log file that I call my email file. This is the syntax I use to send an email with this summary when my script is done running:

mail -s "My script completed" myemail@mydomain < "$my_email_summary"

So the general syntax for the command is:

mail -s "subject" myemail@mydomain < file_name
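
If the summary file was never written, the command above fails on the redirect or sends an empty message. Here is a minimal sketch of a wrapper that checks the file first. The names send_summary and MAIL_CMD are assumptions made for this example; MAIL_CMD only exists so the mail program can be swapped out while testing.

```shell
# send_summary is a hypothetical wrapper around the mail command.
# Usage: send_summary "subject" recipient summary_file
send_summary() {
    subject="$1"
    recipient="$2"
    summary_file="$3"

    # -s is true only when the file exists and is not empty
    if [ ! -s "$summary_file" ]; then
        echo "summary file is missing or empty: $summary_file" >&2
        return 1
    fi

    # MAIL_CMD defaults to mail; override it to test without sending anything
    "${MAIL_CMD:-mail}" -s "$subject" "$recipient" < "$summary_file"
}

# Example call at the end of a script:
# send_summary "My script completed" myemail@mydomain "$my_email_summary"
```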

How to run a script on a remote host via SSH

Posted on February 15, 2019 by admin

While trying to automate my unit testing I ran into this requirement. The general syntax to execute a command remotely is:

ssh -t user@host command

So, for example, if I wanted to execute my_script.sh in my home folder I could do:

ssh -t myusername@myserver /home/myusername/my_script.sh
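
When automating tests it helps to know whether the remote script succeeded. ssh passes back the exit status of the remote command, and returns 255 when ssh itself fails (for example when it cannot connect). Here is a minimal sketch that reports the difference; run_remote and SSH_CMD are names made up for this example, and SSH_CMD only exists so the ssh program can be replaced while testing.

```shell
# run_remote is a hypothetical wrapper around ssh.
# Usage: run_remote user@host command [args...]
run_remote() {
    remote_target="$1"
    shift

    "${SSH_CMD:-ssh}" -t "$remote_target" "$@"
    remote_status=$?

    if [ "$remote_status" -eq 255 ]; then
        # 255 usually means ssh itself failed, not the remote command
        echo "ssh could not run the command on $remote_target" >&2
    elif [ "$remote_status" -ne 0 ]; then
        echo "remote command failed with status $remote_status" >&2
    fi
    return "$remote_status"
}

# Example:
# run_remote myusername@myserver /home/myusername/my_script.sh || echo "test run failed"
```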

What is the Hive Metastore URI address? hive-site.xml? hive config resources?

Posted on July 14, 2018 (updated July 17, 2018) by admin

While configuring an Apache NiFi data flow (within Hortonworks DataFlow) I ran into the need to configure the Hive Streaming component to connect to a Hive table. This personal knowledge base article documents the locations of the resources I needed.

What is my Hive Metastore URI?

The metastore service runs on your Hive Metastore host, by default on port 9083, and uses the Thrift protocol. An example URI looks like this:

thrift://<host_name>:9083

Where is my hive-site.xml file located? What should I enter under Hive Config Resources?

When configuring Apache NiFi to connect to a Hive table using Hive Streaming you will need to enter the location of your hive-site.xml file under Hive config resources. Below you can see the location on my Hadoop node. To find the location in your installation, look under the /etc/hive directory; the script below can help you with this:

#find the Hive folder
cd /etc/hive
#run a search for the hive-site.xml file, starting at the current location
find . -name hive-site.xml

#in my case after examining the results from the command the file is located at:

/etc/hive/2.6.5.0-292/0/hive-site.xml
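
Once hive-site.xml is located, the metastore URI can also be read from its hive.metastore.uris property instead of being assembled by hand. Here is a minimal sketch; metastore_uri is a name made up for this example, and it assumes the value element sits on the line right after the name element, which may not hold for every installation.

```shell
# metastore_uri is a hypothetical helper name.
# Prints the value of hive.metastore.uris from the given hive-site.xml.
# Assumes <value> is on the line immediately after <name>.
metastore_uri() {
    site_file="$1"
    grep -A1 '<name>hive.metastore.uris</name>' "$site_file" |
        sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example, using the location found above:
# metastore_uri /etc/hive/2.6.5.0-292/0/hive-site.xml
```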

What is an Apache Spark RDD?

Posted on June 26, 2018 (updated July 17, 2018) by admin

[Image: Apache Spark Resilient Distributed Dataset (RDD)]

Apache Spark Resilient Distributed Datasets (RDDs) are the main vehicle used by the processing engine to represent a dataset. Given that the name itself is pretty self-explanatory, let’s look at each of these attributes in more detail:

Distributed: This is the key attribute of RDDs. An RDD is a collection of partitions or fragments distributed across processing nodes. This allows Spark to fit and process massive datasets in memory by distributing the workload in parallel across a collection of worker nodes.
Resilient: The ability to recover from failure during processing. Spark tracks the lineage of each RDD, that is, the sequence of transformations used to build it, so if a worker node goes offline the lost fragments can be recomputed on another node. Fragments can optionally also be persisted with a replicated storage level, in which case a second copy is kept on another worker node.

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage on RDD operations and best practices as well as for Apache Spark DataFrames.

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html

Ambari Agent Node OpenSSL / EOF / Failed to Connect Error

Posted on June 9, 2018 (updated July 18, 2018) by admin

I recently ran into an issue when deploying Ambari Agent to a new host in my cluster. Here’s my personal Knowledge Base article on the issue.

Symptoms

When deploying Ambari Agent to a new node, the wizard fails. At the bottom of stderr I found the following:

INFO DataCleaner.py:122 - Data cleanup finished

INFO 11,947 hostname.py:67 - agent:hostname_script configuration not defined thus read hostname '<my host>' using socket.getfqdn().

INFO 11,952 PingPortListener.py:50 - Ping port listener started on port: 8670

INFO 11,954 main.py:439 - Connecting to Ambari server at https://<my host>:8440 (172.31.42.192)

INFO 955 NetUtil.py:70 - Connecting to https://<my host>:8440/ca

ERROR 11,958 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)

ERROR 11,958 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.

Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.

WARNING 11,958 NetUtil.py:124 - Server at https://<my host>:8440 is not reachable, sleeping for 10 seconds...

Root Cause

After solving the issue I can say that, in my case, the root cause was that previously installed versions of Java conflicted with my preferred version, even after I selected my version using the alternatives command.

While working through the issue I also found a suggestion to disable certificate validation, which I implemented since this is not a production cluster. I am listing it as Solution 3 here.

Solution 1 – Deploy new hosts with no previous JDK

After much tinkering with the alternatives command to repair my JDK configuration I decided that it was easier to start with a new set of AWS nodes and ensure that no JDK was installed in my image before I began prepping each node. If you have nodes that are having the issue after an upgrade, read Solution 2.

I am including the script I used to download and configure the correct JDK pre-requisite for my version of Ambari and HDP below for your consumption:

#!/bin/bash
#Script Name: ignacio_install_jdk.scr
#Author: Ignacio de la Torre
#Independent Contractor Profile: https://linkedin.com/in/idelatorre
#################
##Install Oracle JDK
export hm=/home/ec2-user
cd /usr
sudo mkdir -p jdk64/jdk1.8.0_112
cd jdk64
sudo wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
sudo gunzip jdk-8u112-linux-x64.tar.gz
sudo tar -xf jdk-8u112-linux-x64.tar

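#configure paths
#single quotes keep $PATH from expanding until the profile is read at login
chmod 666 $hm/.bash_profile
echo 'export JAVA_HOME=/usr/jdk64/jdk1.8.0_112' >>  $hm/.bash_profile
echo 'export PATH=$PATH:/usr/jdk64/jdk1.8.0_112/bin' >>  $hm/.bash_profile
#tighten permissions again on the same profile file
chmod 640 $hm/.bash_profile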

#configure java version using alternatives
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

#if the link to /usr/bin/java is broken (file displays red), rebuild using:
#ln -s -f /usr/jdk64/jdk1.8.0_112/bin/java /usr/bin/java

Solution 2 – Install new JDK with utilities

I realize scrapping nodes is not an option, especially for those experiencing the issue after an install. Because of a tight deadline I did not try the solution displayed here but it addresses what I think the issue is.

Scenario: On my original nodes that had a previous non-compatible version of JDK installed I issued the following command to select my new JDK as preferred:

#Select Oracle's JDK 1.8 as preferred after install (see install script on Solution 1)
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1

Issue: After selecting my new JDK I was able to see it listed in the configurations with the command below, BUT, all of the Java utilities such as jar, javadoc, etc. are pointing to null in my preferred JDK.

#list Java configured alternatives
sudo alternatives --display java

Proposed solution: Use the list of tools from the non-compatible version of the JDK to install your new JDK with all the Java utilities as slaves. Please note that you cannot add slaves to an already installed alternative; you need to issue the install command with all the utilities at once. An example adding only the jar and javadoc utilities is displayed below:

#Select Oracle’s JDK 1.8 as preferred after install with a slave configuration for the jar and javadoc utilities:

sudo alternatives --install "/usr/bin/java" "java" "/usr/jdk64/jdk1.8.0_112/bin/java" 1 \
--slave "/usr/bin/jar" "jar" "/usr/jdk64/jdk1.8.0_112/bin/jar" \
--slave "/usr/bin/javadoc" "javadoc" "/usr/jdk64/jdk1.8.0_112/bin/javadoc"
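
Typing one --slave entry per utility gets tedious, so here is a minimal sketch that generates the entries from the contents of the new JDK's bin directory. Like the proposed solution itself it is untested against a real cluster. The function name build_slave_args is an assumption made for this example, and the sketch assumes every executable in the bin directory should be linked under /usr/bin and that the paths contain no spaces.

```shell
# build_slave_args is a hypothetical helper name.
# Prints a --slave entry for every executable in the JDK bin directory
# except java itself, which is the master link of the alternative.
build_slave_args() {
    jdk_bin="$1"
    for tool in "$jdk_bin"/*; do
        tool_name=$(basename "$tool")
        if [ "$tool_name" = "java" ] || [ ! -x "$tool" ]; then
            continue
        fi
        printf '%s /usr/bin/%s %s %s ' --slave "$tool_name" "$tool_name" "$tool"
    done
}

# Review the generated entries first:
# build_slave_args /usr/jdk64/jdk1.8.0_112/bin
#
# Then install everything in one command (the substitution is left
# unquoted on purpose so each entry becomes separate arguments):
# sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.8.0_112/bin/java 1 \
#     $(build_slave_args /usr/jdk64/jdk1.8.0_112/bin)
```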

Solution 3 – Disable certificate validation

As I said before, my cluster is not a production one and will not contain sensitive or confidential data, so I opted to implement the suggestion to disable certificate validation as part of my troubleshooting. To do this, set verify=disable by editing the /etc/python/cert-verification.cfg file. Do this at your own risk.
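
For anyone applying the same change to several non-production nodes, here is a minimal sketch of the edit as a script. The function name disable_cert_verification is an assumption made for this example, and it assumes the file already contains a line starting with verify=. A .bak copy is kept so the change is easy to revert.

```shell
# disable_cert_verification is a hypothetical helper name.
# Rewrites the verify= line and keeps the original file as <file>.bak.
disable_cert_verification() {
    cfg_file="${1:-/etc/python/cert-verification.cfg}"
    sed -i.bak 's/^verify=.*/verify=disable/' "$cfg_file" &&
        grep '^verify=' "$cfg_file"
}

# Example (the real file is owned by root, so run the script with sudo),
# followed by an agent restart so the change is picked up:
# disable_cert_verification /etc/python/cert-verification.cfg
# sudo ambari-agent restart
```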