What is an Apache Spark RDD?

Apache Spark Resilient Distributed Dataset (RDD)

Apache Spark Resilient Distributed Datasets (RDDs) are the main abstraction the processing engine uses to represent a dataset. Given that the name itself is fairly self-explanatory, let’s look at each of these attributes in more detail:

  • Distributed: This is the key attribute of RDDs. An RDD is a collection of partitions, or fragments, distributed across processing nodes, which allows Spark to fit and process massive datasets in memory by spreading the workload in parallel across a collection of worker nodes (a short sketch follows this list).
  • Resilient: The ability to recover from a processing failure. Spark tracks the lineage of transformations that produced each RDD, so if a worker node goes offline the partitions it held can be recomputed on another node (or, when a replicated persistence level is used, read from another node that holds a copy) and the workload relocated there.
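
To make these ideas concrete, here is a minimal PySpark sketch, assuming a local sandbox SparkContext; the application name, data, and partition count are purely illustrative:

    from pyspark import SparkContext

    # Local sandbox context; on a real cluster the partitions below would be
    # spread across worker nodes rather than local threads
    sc = SparkContext("local[4]", "rdd-intro")

    # Distribute a small collection across 4 partitions (fragments)
    numbers = sc.parallelize(range(1, 1001), numSlices=4)

    print(numbers.getNumPartitions())  # 4 -- the "distributed" part
    print(numbers.sum())               # the action runs in parallel across the partitions

    sc.stop()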

 

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage on RDD operations and best practices, as well as on Apache Spark DataFrames.

 

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html

Three key things to remember about Apache Spark RDD Operations

There are three key concepts that are essential for the beginner Apache Spark developer, and we will cover them here. If you want to receive a condensed summary of the most relevant news in Big Data, Data Science and Advanced Analytics, do not forget to subscribe to our newsletter; we send it only once a month, so you get just the very best.

All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:

  • Creating RDDs does not need to be a hard, involved process: In your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and improves your productivity in your sandbox environment.
  • Remember to persist your RDDs: Spark RDDs are lazy; transformations are not executed as you define them, only once you ask Spark for a result. Experienced data scientists will define a base RDD and then create different subsets through transformations. Every time you define one of these subsets as a new RDD, remember to persist it; otherwise it will be recomputed again and again each time you ask for the results of a downstream RDD.
  • Remember that RDDs in Spark are immutable: A big reason the previous point is hard to digest for new Spark developers is that we are not accustomed to the functional programming paradigm that underlies Spark. In our usual programming languages we create a variable that references a specific space in memory and then assign different values to that variable in different parts of our program. In functional programming each object is immutable, so every time you create a new RDD from the results of an upstream RDD that has not been persisted, Spark will re-execute all of the logic that led to the creation of the source RDD (see the sketch after this list).
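
The following minimal PySpark sketch ties the three points together; the file name orders.csv and the filter logic are hypothetical placeholders for whatever you have in your sandbox:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "rdd-operations")

    # 1. Creating an RDD is easy: from a local collection or a CSV/text file
    orders = sc.textFile("orders.csv")  # hypothetical sandbox file

    # 2. Transformations are lazy: nothing runs until an action is requested
    large_orders = orders.filter(lambda line: "LARGE" in line)

    # Persist the subset so downstream actions do not recompute it from the file
    large_orders.persist(StorageLevel.MEMORY_ONLY)  # or simply large_orders.cache()

    # 3. RDDs are immutable: each transformation returns a new RDD, and without
    #    persist() its full lineage would be re-executed on every action
    print(large_orders.count())  # first action: reads, filters, and caches
    print(large_orders.take(5))  # second action: served from the cached partitions

    sc.stop()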

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) operations. Stay tuned for additional coverage on best practices, as well as on Apache Spark DataFrames.

 

 

Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

Sqoop is an open source project hosted by the Apache Software Foundation whose objective is to provide a tool for moving large volumes of data in bulk between structured data sources and the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March 2012 and is now a top-level Apache project.

The best way to look at Sqoop is as a collection of related tools, where each sub-module serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You specify which tool you are invoking when you run Sqoop. In terms of syntax, each tool has its own set of arguments while also supporting a set of global arguments.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5, with a brief description of their purpose (a sample invocation follows the list):

 

  • Sqoop import: Imports a single table from a relational database into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Exports a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Creates a Hive table definition that matches a table in the source relational database
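
For illustration, a basic import invocation looks roughly like the one below; the JDBC URL, user, table, and target directory are hypothetical and only meant to show how tool arguments are combined:

    sqoop import \
      --connect jdbc:mysql://dbserver/salesdb \
      --username report_user -P \
      --table customers \
      --target-dir /user/etl/customers \
      --num-mappers 4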

ODI: Purging Old Sessions

One common administrative task I find myself doing when I realize that my ODI logs are growing fairly large is purging old sessions from the log. The steps are fairly straightforward:

 

  1. Log in to your ODI Studio client
  2. Go to the Operator view
  3. In the top-right corner of the navigation pane, expand the menu and select Purge Log…


  4. On the Purge Log screen you can select which old sessions to remove by date, agent, context, status, user and session name


  5. Once you have set the parameters as desired, click OK and the ODI session logs will be purged accordingly

 

Related:

Error when importing work repository in ODI Studio (java.lang.OutOfMemoryError: Java heap space)

INTRO

 

I am having an issue with the work repository in one of my environments this week, to the point where I had to rebuild it. After dropping and recreating the schema I am running into a Java heap space error.


 

SOLUTION

 

In my case the issue went away with the following steps:

  1. Unpack the repository content ZIP file I was importing into an uncompressed folder
  2. Increase the MaxPermSize parameter in my ODI\client\odi\bin\odi.conf file from 512M to 1024M (see the snippet below)
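
    For reference, the line I changed in odi.conf looks roughly like the following; the exact set of AddVMOption entries varies by installation:

        # before
        AddVMOption -XX:MaxPermSize=512M

        # after
        AddVMOption -XX:MaxPermSize=1024M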

 

FULL ERROR MESSAGE

 


java.lang.OutOfMemoryError: Java heap space
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
    at java.lang.Class.getDeclaredMethod(Class.java:1935)
    at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:370)
    at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:392)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.processValue(SnpsXmlObjectParser.java:611)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.endElement(SnpsXmlObjectParser.java:270)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1588)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:442)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:388)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:232)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:390)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:337)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:347)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6747)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6620)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6578)
    at com.sunopsis.repository.manager.RepositoryManager.importObjectsUsingDoImport(RepositoryManager.java:5918)
    at com.sunopsis.repository.manager.RepositoryManager.treatObjectListGeneral(RepositoryManager.java:3985)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4506)
    at com.sunopsis.repository.manager.RepositoryManager.access$7(RepositoryManager.java:4395)
    at com.sunopsis.repository.manager.RepositoryManager$2.doAction(RepositoryManager.java:4369)
    at oracle.odi.core.persistence.dwgobject.DwgObjectTemplate.execute(DwgObjectTemplate.java:216)
    at oracle.odi.core.persistence.dwgobject.TransactionalDwgObjectTemplate.execute(TransactionalDwgObjectTemplate.java:64)
    at com.sunopsis.repository.manager.RepositoryManager.internalWorkRepositoryImportWithCommit(RepositoryManager.java:4357)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4661)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFile(RepositoryManager.java:4814)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4884)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4939)
    at com.sunopsis.graphical.dialog.SnpsDialogImportWork$1.run(SnpsDialogImportWork.java:155)
    at oracle.ide.dialogs.ProgressBar.run(ProgressBar.java:655)
    at java.lang.Thread.run(Thread.java:662)

 

RELATED

 

I came across a number of other related issues while researching this solution.

Sqoop: Create-hive-table tool

Apache Sqoop Logo

Tool: CREATE-HIVE-TABLE

--table <source_table>: The name of the table in the source database

--hive-table <target_table>: The name of the Hive table to be created

 

--enclosed-by <char>: Sets a required field-enclosing character

--escaped-by <char>: Sets the escape character

--fields-terminated-by <char>: Sets the field separator character

--lines-terminated-by <char>: Sets the end-of-line character

--mysql-delimiters: Uses MySQL's default delimiter set: fields terminated by ( , ), lines terminated by ( \n ), enclosed by ( ' ), escaped by ( \ )

 

Notes:

* The tool will fail if the target Hive table already exists
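
Putting the arguments together, a typical invocation might look like the example below; the JDBC URL, user, and table names are hypothetical:

    sqoop create-hive-table \
      --connect jdbc:mysql://dbserver/salesdb \
      --username report_user -P \
      --table customers \
      --hive-table customers \
      --fields-terminated-by ','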

How to define an index on a source or target table in ODI

Oracle Data Integrator (ODI) Logo

The topic of indexes in Oracle Data Integrator (ODI) is an easy one once you learn your way around ODI Studio. Indexes are defined under Designer > Models > [your model] > [Table] > Constraints. They are largely independent of the column definitions, although they must reference valid columns.

 

  • The first step in creating the metadata for an index is to add a new constraint and name it.

  • On the Columns tab, define the list of columns to be indexed.

  • On the Control tab, ensure that the index is active and marked for creation in the database.

  • Review the additional options on the Flexfields tab; for some of them, such as Index Type, you do need to type your selection manually.

 

 

After you have defined the index metadata, you need to run a new load for the index to be created in the database.

 

My Tools:

Oracle Business Intelligence Applications (OBIA) 11.1.1.7.0

Oracle Data Integrator 11g (ODI Studio, ODI Server, no clustering)

 

How to make sure an index is defined as unique in ODI

Oracle Data Integrator

Indexes are defined as constraints in the Models view in ODI Studio. To make a field unique, follow one of the two alternatives listed below.

 

PRIMARY KEY

 

Define the index as a Primary Key constraint object in the Models area within the Designer view

Description view on primary key index definition in ODI

 

Add the unique columns on the Columns tab


 

 

On the Control tab, make sure the constraint is active and marked to be defined in the database, with both the Flow Control and Static Control check boxes selected

Review that the correct settings are configured in the Flexfields tab

Flexfields view on Primary Key Index in ODI

 

ALTERNATE KEY

 

Define the constraint object for your table as an alternate key in the ODI models area

Defining a unique index in Oracle ODI

 

Add the unique columns on the Columns tab


 

On the Control tab, make sure the constraint is active and marked to be defined in the database, with both the Flow Control and Static Control check boxes selected

 


Review that the correct settings are configured in the Flexfields tab

Flexfields view on unique index in ODI