What is an Apache Spark RDD?

Apache Spark Resilient Distributed Dataset (RDD)

Apache Spark Resilient Distributed Datasets (RDDs) are the main abstraction the processing engine uses to represent a dataset. Given that the name itself is fairly self-explanatory, let’s look at each of these attributes in more detail:

  • Distributed: This is the key attribute of RDDs. An RDD is a collection of partitions, or fragments, distributed across processing nodes, which allows Spark to fit and process massive datasets in memory by spreading the workload in parallel across a collection of worker nodes (a short sketch follows this list).
  • Resilient: The ability to recover from a processing failure. Spark tracks the lineage of transformations that produced each RDD, so if a worker node goes offline the partitions it held can be recomputed on another node (or, when a replicated persistence level is used, read from another node that holds a copy) and the workload relocated there.
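
To make these ideas concrete, here is a minimal PySpark sketch, assuming a local sandbox SparkContext; the application name, data, and partition count are purely illustrative:

    from pyspark import SparkContext

    # Local sandbox context; on a real cluster the partitions below would be
    # spread across worker nodes rather than local threads
    sc = SparkContext("local[4]", "rdd-intro")

    # Distribute a small collection across 4 partitions (fragments)
    numbers = sc.parallelize(range(1, 1001), numSlices=4)

    print(numbers.getNumPartitions())  # 4 -- the "distributed" part
    print(numbers.sum())               # the action runs in parallel across the partitions

    sc.stop()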

 

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage on RDD operations and best practices, as well as on Apache Spark DataFrames.

 

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html

Three key things to remember about Apache Spark RDD Operations

There are three key concepts that are essential for the beginner Apache Spark developer, and we will cover them here. If you want to receive a condensed summary of the most relevant news in Big Data, Data Science and Advanced Analytics, do not forget to subscribe to our newsletter; we send it only once a month, so you get just the very best.

All right, getting back to our topic, the three key things to remember when you begin working with Spark RDDs are:

  • Creating RDDs does not need to be a hard, involved process: In your learning environment you can easily create an RDD from a collection or by loading a CSV file. This saves you the step of transferring files to Hadoop’s HDFS file system and improves your productivity in your sandbox environment.
  • Remember to persist your RDDs: Spark RDDs are lazy; transformations are not executed as you define them, only once you ask Spark for a result. Experienced data scientists will define a base RDD and then create different subsets through transformations. Every time you define one of these subsets as a new RDD, remember to persist it; otherwise it will be recomputed again and again each time you ask for the results of a downstream RDD.
  • Remember that RDDs in Spark are immutable: A big reason the previous point is hard to digest for new Spark developers is that we are not accustomed to the functional programming paradigm that underlies Spark. In our usual programming languages we create a variable that references a specific space in memory and then assign different values to that variable in different parts of our program. In functional programming each object is immutable, so every time you create a new RDD from the results of an upstream RDD that has not been persisted, Spark will re-execute all of the logic that led to the creation of the source RDD (see the sketch after this list).
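
The following minimal PySpark sketch ties the three points together; the file name orders.csv and the filter logic are hypothetical placeholders for whatever you have in your sandbox:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "rdd-operations")

    # 1. Creating an RDD is easy: from a local collection or a CSV/text file
    orders = sc.textFile("orders.csv")  # hypothetical sandbox file

    # 2. Transformations are lazy: nothing runs until an action is requested
    large_orders = orders.filter(lambda line: "LARGE" in line)

    # Persist the subset so downstream actions do not recompute it from the file
    large_orders.persist(StorageLevel.MEMORY_ONLY)  # or simply large_orders.cache()

    # 3. RDDs are immutable: each transformation returns a new RDD, and without
    #    persist() its full lineage would be re-executed on every action
    print(large_orders.count())  # first action: reads, filters, and caches
    print(large_orders.take(5))  # second action: served from the cached partitions

    sc.stop()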

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Dataset (RDD) operations. Stay tuned for additional coverage on best practices, as well as on Apache Spark DataFrames.

 

 

Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

Sqoop is an open source project hosted by the Apache Software Foundation whose objective is to provide a tool for moving large volumes of data in bulk between structured data sources and the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March 2012 and is now a top-level Apache project.

The best way to look at Sqoop is as a collection of related tools, where each sub-module serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You specify which tool you are invoking when you run Sqoop. In terms of syntax, each tool has its own set of arguments while also supporting a set of global arguments.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5, with a brief description of their purpose (a sample invocation follows the list):

 

  • Sqoop import: Imports a single table from a relational database into Hadoop
  • Sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • Sqoop export: Exports a set of files from HDFS back into a relational database
  • Sqoop create-hive-table: Creates a Hive table definition that matches a table in the source relational database
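
For illustration, a basic import invocation looks roughly like the one below; the JDBC URL, user, table, and target directory are hypothetical and only meant to show how tool arguments are combined:

    sqoop import \
      --connect jdbc:mysql://dbserver/salesdb \
      --username report_user -P \
      --table customers \
      --target-dir /user/etl/customers \
      --num-mappers 4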

ODI: Purging Old Sessions

One common administrative task I find myself doing when I realize that my ODI logs are growing fairly large is purging old sessions from the log. The steps are fairly straightforward:

 

  1. Log in to your ODI Studio client
  2. Go to the Operator view
  3. In the top-right corner of the navigation pane, expand the menu and select Purge Log…


  4. On the Purge Log screen you can select which old sessions to remove by date, agent, context, status, user and session name


  5. Once you have set the parameters as desired, click OK and the ODI session logs will be purged accordingly

 

Related:

Error when importing work repository in ODI Studio (java.lang.OutOfMemoryError: Java heap space)

INTRO

 

I am having an issue with the work repository in one of my environments this week, to the point where I had to rebuild it. After dropping and recreating the schema I am running into a Java heap space error.


 

SOLUTION

 

In my case the issue went away with the following steps:

  1. Unpack the repository content ZIP file I was importing into an uncompressed folder
  2. Increase the MaxPermSize parameter in my ODI\client\odi\bin\odi.conf file from 512M to 1024M (see the snippet below)
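
    For reference, the line I changed in odi.conf looks roughly like the following; the exact set of AddVMOption entries varies by installation:

        # before
        AddVMOption -XX:MaxPermSize=512M

        # after
        AddVMOption -XX:MaxPermSize=1024M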

 

FULL ERROR MESSAGE

 


java.lang.OutOfMemoryError: Java heap space
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
    at java.lang.Class.getDeclaredMethod(Class.java:1935)
    at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:370)
    at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:392)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.processValue(SnpsXmlObjectParser.java:611)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.endElement(SnpsXmlObjectParser.java:270)
    at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1588)
    at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:442)
    at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:388)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:232)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:390)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:337)
    at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:347)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6747)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6620)
    at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6578)
    at com.sunopsis.repository.manager.RepositoryManager.importObjectsUsingDoImport(RepositoryManager.java:5918)
    at com.sunopsis.repository.manager.RepositoryManager.treatObjectListGeneral(RepositoryManager.java:3985)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4506)
    at com.sunopsis.repository.manager.RepositoryManager.access$7(RepositoryManager.java:4395)
    at com.sunopsis.repository.manager.RepositoryManager$2.doAction(RepositoryManager.java:4369)
    at oracle.odi.core.persistence.dwgobject.DwgObjectTemplate.execute(DwgObjectTemplate.java:216)
    at oracle.odi.core.persistence.dwgobject.TransactionalDwgObjectTemplate.execute(TransactionalDwgObjectTemplate.java:64)
    at com.sunopsis.repository.manager.RepositoryManager.internalWorkRepositoryImportWithCommit(RepositoryManager.java:4357)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4661)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFile(RepositoryManager.java:4814)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4884)
    at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4939)
    at com.sunopsis.graphical.dialog.SnpsDialogImportWork$1.run(SnpsDialogImportWork.java:155)
    at oracle.ide.dialogs.ProgressBar.run(ProgressBar.java:655)
    at java.lang.Thread.run(Thread.java:662)

 

RELATED

 

I came across a number of other related issues while researching this solution.

Sqoop: Create-hive-table tool

Apache Sqoop Logo

Tool: CREATE-HIVE-TABLE

--table <source_table>: The name of the table in the source database

--hive-table <target_table>: The name of the Hive table to be created

 

--enclosed-by <char>: Sets a required field-enclosing character

--escaped-by <char>: Sets the escape character

--fields-terminated-by <char>: Sets the field separator character

--lines-terminated-by <char>: Sets the end-of-line character

--mysql-delimiters: Uses MySQL's default delimiter set: fields terminated by ( , ), lines terminated by ( \n ), enclosed by ( ' ), escaped by ( \ )

 

Notes:

* The tool will fail if the target Hive table already exists
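
Putting the arguments together, a typical invocation might look like the example below; the JDBC URL, user, and table names are hypothetical:

    sqoop create-hive-table \
      --connect jdbc:mysql://dbserver/salesdb \
      --username report_user -P \
      --table customers \
      --hive-table customers \
      --fields-terminated-by ','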

How to define an index on a source or target table in ODI

Oracle Data Integrator (ODI) Logo

The topic of indexes in Oracle Data Integrator (ODI) is an easy one once you learn your way around ODI Studio. Indexes are defined under Designer > Models > [your model] > [Table] > Constraints. They are largely independent of the column definitions, although they must reference valid columns.

 

  • The first step in creating the metadata for an index is to add a new constraint and name it.

  • On the Columns tab, define the list of columns to be indexed.

  • On the Control tab, ensure that the index is active and marked for creation in the database.

  • Review the additional options on the Flexfields tab; for some of them, such as Index Type, you do need to type your selection manually.

 

 

After you have defined the index metadata, you need to run a new load for the index to be created in the database.

 

My Tools:

Oracle Business Intelligence Applications (OBIA) 11.1.1.7.0

Oracle Data Integrator 11g (ODI Studio, ODI Server, no clustering)

 

How to make sure an index is defined as unique in ODI

Oracle Data Integrator

Indexes are defined as constraints in the Models view in ODI Studio. To make a field unique, follow one of the two alternatives listed below.

 

PRIMARY KEY

 

Define the index as a Primary Key constraint object in the Models area within the Designer view

Description view on primary key index definition in ODI

 

Add the unique columns on the Columns tab


 

 

On the Control tab, make sure the constraint is active and marked to be defined in the database, with both the Flow Control and Static Control check boxes selected

Review that the correct settings are configured in the Flexfields tab

Flexfields view on Primary Key Index in ODI

 

ALTERNATE KEY

 

Define the constraint object for your table as an alternate key in the ODI models area

Defining a unique index in Oracle ODI

 

Add the unique columns on the Columns tab


 

On the Control tab, make sure the constraint is active and marked to be defined in the database, with both the Flow Control and Static Control check boxes selected

 


Review that the correct settings are configured in the Flexfields tab

Flexfields view on unique index in ODI