Did you run out of time in Oracle Business Intelligence Applications (OBIA)?

There’s nothing worse than this, right? I’m with you, man (or sister)!

All right, in all seriousness, this can be an awkward situation: you come in Monday morning and your business stakeholders look deeply angry because none of their reports look right and they need to finish the close of the month/quarter/year.

Anyway, in both OBIA 7.9.x and 11g this is most likely because the variables that control the generation of the calendar (time dimension) tables are set to an end date that is now in the past.

 

OBIA 11.1.1.7 Solution

  1. Log in to OBIA Configuration Manager (BIACM)
  2. Go to Manage Data Load Parameters
  3. Look for Configure Time Dimension > Gregorian Date Range End
  4. Change the END_DATE parameter to a date far far in the future
  5. Put on a contrite face and let your users know this will be fixed next time your load runs

OBIA 7.9.x Solution

The same general steps need to be applied in your DAC client to fix this issue in versions of OBIA that use Informatica PowerCenter.

  1. Open your DAC console client
  2. Navigate to Design > Tasks
  3. Look up the SIL_DayDimension task
  4. Look up the Parameters tab on the bottom panel
  5. Change the $$END_DATE parameter to a future date
  6. Same deal: contrite face, deep breath, and break the news that this won’t be fixed until tomorrow morning

 


Hadoop Ecosystem: SQOOP – The Data Mover

Apache Sqoop Logo

SQOOP is an open source project hosted by the Apache Software Foundation whose objective is to provide a tool that allows users to move large volumes of data in bulk from structured data sources into the Hadoop Distributed File System (HDFS). The project graduated from the Apache Incubator in March 2012 and is now a Top-Level Apache project.

The best way to look at Sqoop is as a collection of related tools, where each sub-module serves a specific use case such as importing into Hive or leveraging parallelism when reading from a MySQL database. You specify the tool you are invoking each time you run Sqoop. In terms of syntax, each of these tools has its own set of arguments while also supporting a set of global arguments.

 

Below is a list of the most frequently used Sqoop tools as of version 1.4.5 with a brief description of their purpose:

 

  • sqoop import: Imports a single table into Hadoop (see the example below)
  • sqoop import-all-tables: Imports all tables in a database schema into Hadoop
  • sqoop export: Exports a set of files from HDFS back into a relational database
  • sqoop create-hive-table: Creates a Hive table definition based on the schema of a table in a relational database
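
For illustration, a minimal invocation of the import tool might look like the following; this is only a sketch, and the connection string, credentials, table, and target directory are made-up values, so substitute your own:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 4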

Error when importing work repository in ODI Studio (java.lang.OutOfMemoryError: Java heap space)

INTRO

 

I ran into an issue with the work repository in one of my environments this week, to the point where I had to rebuild it. After dropping and recreating the schema, I am now hitting a Java heap space error while importing the work repository content.


 

SOLUTION

 

In my case the issue went away with the following steps:

  1. Unpack the repository content ZIP file I was importing into an uncompressed folder
  2. Increase the MaxPermSize parameter in my ODI\client\odi\bin\odi.conf file from 512M to 1024M (a sketch of the change is below)
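
    In my copy of ODI Studio the setting used the AddVMOption syntax common to these .conf files; your file may look slightly different, but the change was roughly:

      Before: AddVMOption -XX:MaxPermSize=512M
      After:  AddVMOption -XX:MaxPermSize=1024M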

 

FULL ERROR MESSAGE

 


java.lang.OutOfMemoryError: Java heap space

                at java.lang.Class.getDeclaredMethods0(Native Method)

                at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)

                at java.lang.Class.getDeclaredMethod(Class.java:1935)

                at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:370)

                at com.sunopsis.tools.core.SnpsTools.getMethodFromHierarchy(SnpsTools.java:392)

                at com.sunopsis.tools.xml.SnpsXmlObjectParser.processValue(SnpsXmlObjectParser.java:611)

                at com.sunopsis.tools.xml.SnpsXmlObjectParser.endElement(SnpsXmlObjectParser.java:270)

                at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1588)

                at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:442)

                at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:388)

                at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:232)

                at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:390)

                at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:337)

                at com.sunopsis.tools.xml.SnpsXmlObjectParser.parseXmlFile(SnpsXmlObjectParser.java:347)

                at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6747)

                at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6620)

                at com.sunopsis.dwg.DwgObject.doImport(DwgObject.java:6578)

                at com.sunopsis.repository.manager.RepositoryManager.importObjectsUsingDoImport(RepositoryManager.java:5918)

                at com.sunopsis.repository.manager.RepositoryManager.treatObjectListGeneral(RepositoryManager.java:3985)

                at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4506)

                at com.sunopsis.repository.manager.RepositoryManager.access$7(RepositoryManager.java:4395)

                at com.sunopsis.repository.manager.RepositoryManager$2.doAction(RepositoryManager.java:4369)

                at oracle.odi.core.persistence.dwgobject.DwgObjectTemplate.execute(DwgObjectTemplate.java:216)

                at oracle.odi.core.persistence.dwgobject.TransactionalDwgObjectTemplate.execute(TransactionalDwgObjectTemplate.java:64)

                at com.sunopsis.repository.manager.RepositoryManager.internalWorkRepositoryImportWithCommit(RepositoryManager.java:4357)

                at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImport(RepositoryManager.java:4661)

                at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFile(RepositoryManager.java:4814)

                at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4884)

                at com.sunopsis.repository.manager.RepositoryManager.workRepositoryImportFromZipFileWithCommit(RepositoryManager.java:4939)

                at com.sunopsis.graphical.dialog.SnpsDialogImportWork$1.run(SnpsDialogImportWork.java:155)

                at oracle.ide.dialogs.ProgressBar.run(ProgressBar.java:655)

                at java.lang.Thread.run(Thread.java:662)

 

RELATED

 

Other related issues I found when researching the solution are:

 

 

 

 

What is an Apache Spark RDD?

Apache Spark Resilient Distributed Dataset (RDD)

Apache Spark Resilient Distributed Datasets (RDDs) are the main vehicle used by the processing engine to represent a dataset. Given that the name itself is pretty self-explanatory, let’s look into each of these attributes in additional detail:

  • Distributed: This is the key attribute of RDDs. An RDD is a collection of partitions or fragments distributed across processing nodes, which allows Spark to fit and process massive datasets in memory by spreading the workload in parallel across a collection of worker nodes.
  • Resilient: This is the ability to recover from failure. Spark tracks the lineage of transformations used to build each partition, and the underlying storage layer typically keeps copies of each data block on multiple worker nodes, so if a worker goes offline its share of the work can be recomputed or reassigned to another node.

 

I hope you enjoyed this introduction to Apache Spark Resilient Distributed Datasets (RDDs). Stay tuned for additional coverage of RDD operations and best practices, as well as Apache Spark DataFrames.

 

Reference:
Apache Spark Programming Guide
http://spark.apache.org/docs/2.1.1/programming-guide.html

Google File System Design Assumptions

Ignacio de la Torre, Editor, The Analytics Journal

 


In today’s post I want to highlight the brilliance of the Google Research team. Their ability to step back and question old assumptions reminds me of the Wright brothers realizing that lift values from the 1700s, along with other widespread assumptions of the time, were the main constraints holding them back from coming up with the first airplane.

 

At Google Research something similar happened when they realized that traditional data storage and processing paradigms did not fit well with their applications’ processing workloads. Here are some of the design assumptions for the Google File System, straight from the published research paper, with my comments:

 

  1. Failure is an expectation, not an exception
    Google realized that the traditional way to address failure in the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is realizing that when dealing with massive datasets (e.g., downloading a copy of the entire web) hardware failure is a fact of life rather than an exception. Once this observation is incorporated into the design, costs can be decreased by storing and processing data on very large clusters of commodity hardware, where redundancy and replication across processing nodes and racks allow for seamless recovery from hardware failure.
  2. The system stores a modest number of large data files
    This observation is arrived at by looking at the nature of the data being processed, such as HTML markup from crawling a large number of websites. This is what we would call “unstructured data” that is cleaned and serialized by the crawler before being “batched” together into large files. Once again, by taking a step back and looking at the problem with fresh eyes, the researchers realized their design did not need to optimize for the storage of billions of small files. This is a great constraint to remove from the design, as we will explore when we look at the ability of the GFS master server to keep the metadata for every file in the cluster in memory, which allows it to make very smart load balancing, placement, and replication decisions.
  3. Workloads primarily consist of large streaming reads and small random reads
    By looking at actual application workloads the researchers found that they could generally group read operations into these two categories, and that successive read operations from the same client often read contiguous regions of a file; also, performance-minded applications will batch and sort their reads so that their progress through a dataset is one-directional, moving from beginning to end instead of going back and forth with random I/O operations.
  4. The workloads also have many large, sequential writes that append to data files
    Notice here how “delete” and “update” operations are extremely rare to non-existent; this frees the system design from the onerous task of maintaining locks to ensure the atomicity of these two operations.
  5. Atomicity with minimal synchronization is essential
    The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned on the previous point.
  6. High sustained bandwidth is more important than low latency
    A good observation on the fact that when dealing with these large datasets most applications are batch oriented and benefit most from high processing throughput, versus the traditional database application that places a premium on fast response times.

 

In hindsight, these observations might seem obvious, especially as they have been incorporated into the design principles that drive other products such as Apache Hadoop; but Google’s decision to invest in a custom-made file system to fit their very specific needs, and the ability of the Google Research team to step back and start their design with fresh eyes, have truly revolutionized data processing forever. Cheers to them!

 

Reference:

“The Google File System”; Ghemawat, Gobioff, Leung; Google Research

Hadoop Ecosystem: Zookeeper – The distributed coordination server

Apache Zookeeper Logo

“ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem, so let’s examine some of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create and maintain globally distributed locks, which allows applications to maintain transaction atomicity for any kind of object by ensuring that at any point in time no two clients or transactions can hold a lock on the same resource.
  • Queues: Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues, where a list of messages or objects is held by a Zookeeper node that clients connect to in order to submit new queue members as well as to request the list of members pending processing. This allows applications to implement asynchronous processes where a unit of work is placed on a queue and processed whenever the next worker process is available to take it on.
  • Two-Phased Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two-phase commit pattern through its infrastructure. Each client applies the transaction tentatively during the first commit phase and notifies the coordination node, which then lets all parties involved know whether or not the transaction was globally successful.
  • Barriers: Zookeeper supports the creation of synchronization points called barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes, which helps applications running on a cluster optimize for locality and load balancing.

As you can see, Zookeeper plays a vital role as a foundation service for distributed applications that need to coordinate independent, asynchronous processes across large numbers of compute nodes in a cluster environment.
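
All of these recipes are built out of the same primitive: znodes, which the ZooKeeper command-line shell (bin/zkCli.sh) lets you explore interactively. As a rough sketch (the paths below are invented for illustration, and the trailing annotations after the # signs are not part of the commands), the ephemeral, sequential znodes at the heart of the lock and leader-election recipes can be created like this:

    create /app1 ""              # a persistent parent znode for the application
    create -e -s /app1/lock- ""  # an ephemeral, sequential child; it disappears if the client session dies
    ls /app1                     # the client holding the lowest sequence number owns the lock / is the leader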

References:

[1] Zookeeper Website, http://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Sqoop: Create-hive-table tool

Apache Sqoop Logo

Tool: CREATE-HIVE-TABLE

--table <source_table>: The name of the table in the originating database

--hive-table <target_table>: The name of the Hive table to be created

 

--enclosed-by: The character used to enclose field values

--escaped-by: The character used to escape special characters within field values

--fields-terminated-by: The character used to separate fields

--lines-terminated-by: The character used to terminate lines

--mysql-delimiters: Uses MySQL's default delimiters: ( , ) for fields, ( \n ) for lines, enclosed-by ( ' ), escaped-by ( \ )

 

Notes:

* The tool will fail if the target table exists
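
To make the syntax concrete, here is a hedged example invocation; the connection string, credentials, and table names are placeholders for illustration, not values from a real system:

    sqoop create-hive-table \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --hive-table sales_orders \
      --fields-terminated-by ',' \
      --lines-terminated-by '\n'

This only creates the Hive table definition; it is typically used together with a separate sqoop import (or the import tool's --hive-import option) to actually move the data.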

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort; once business requirements call for extracting new insights from the stored data, a development team can leverage the “schema on read” paradigm to create metadata about these files (a sketch of what that looks like follows below).
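
As a rough sketch of that schema-on-read flow (the table name, columns, and HDFS path below are invented for illustration), you can project a schema over files already sitting in HDFS and query them immediately from the command line:

    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        log_ts  STRING,
        user_id STRING,
        url     STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/raw/web_logs';

      SELECT url, COUNT(*) AS hits
      FROM web_logs
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10;"

The CREATE EXTERNAL TABLE statement only records metadata; the underlying files are left untouched, which is exactly the point of schema on read.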

 

Having a SQL interpreter allows business analysts and power users to have access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.

Configuring OBIEE to run as a Windows Service

Introduction

One key step in configuring an enterprise deployment of Oracle Business Intelligence is to set up your services to run in the background as Windows services and start automatically with your server. By default the installer creates Windows services for Oracle Process Manager (OPMN) and the WebLogic Node Manager, which leaves us with the need to configure services for the WebLogic AdminServer and the BI Managed Server.

Pre-requisites

  • Verify boot.properties files exist for both Weblogic Servers

    AdminServer:
    %mw_home%\user_projects\domains\bifoundation_domain\servers\AdminServer\security\boot.properties

    BI Managed Server:
    %mw_home%\user_projects\domains\bifoundation_domain\servers\bi_server1\security\boot.properties

  • Define the MW_HOME Variable
  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to redirect standard output to a file and to set the service name prefix to “Oracle_”

    Log Syntax:
    -log:"%MW_HOME%\user_projects\domains\bifoundation_domain\servers\%SERVER_NAME%-stdout.txt"

    Example Customization:

    set MW_HOME=%WL_HOME%\..\

    rem *** Install the service
    "%WL_HOME%\server\bin\beasvc" -install -svcname:"Oracle_%DOMAIN_NAME%_%SERVER_NAME%" -javahome:"%JAVA_HOME%" -execdir:"%USERDOMAIN_HOME%" -maxconnectretries:"%MAX_CONNECT_RETRIES%" -host:"%HOST%" -port:"%PORT%" -extrapath:"%EXTRAPATH%" -password:"%WLS_PW%" -cmdline:%CMDLINE% -log:"%MW_HOME%\user_projects\domains\bifoundation_domain\servers\%SERVER_NAME%-stdout.txt"

    Note: Make sure you replace "beasvc" and the space that follows it with "Oracle_" in the -svcname argument

  • Change the environment script

    %MW_HOME%\user_projects\domains\bifoundation_domain\bin\setOBIDomainEnv.cmd

  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to ensure the correct Java memory arguments are utilized by your windows service

    Old Code:
    call "%WL_HOME%\common\bin\commEnv.cmd"
    New Code:
    call "%WL_HOME%\..\user_projects\domains\bifoundation_domain\bin\setOBIDomainEnv.cmd"

  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to implement a workaround for the Windows limitation of the maximum length of the command line being 2KB
    • Locate the two instances where the script sets the value of the CMDLINE variable
    • Add the code below before each instance; this code will output the current value of CLASSPATH to a text file

      REM –
      REM output the class path to text file and change reference to file on CMDLINE variable
      REM this is a workaround to a limit on windows command line to 2KB
      echo %CLASSPATH% > %WL_HOME%\server\bin\classpath.txt

    • Replace the class path variable reference \"%CLASSPATH%\" with @%WL_HOME%\server\bin\classpath.txt as depicted in the example below

      set CMDLINE="%JAVA_VM% %MEM_ARGS% %JAVA_OPTIONS% -classpath @%WL_HOME%\server\bin\classpath.txt -Dweblogic.Name=%SERVER_NAME% -Dweblogic.management.username=%WLS_USER% -Dweblogic.ProductionModeEnabled=%PRODUCTION_MODE% -Djava.security.policy=\"%WL_HOME%\server\lib\weblogic.policy\" weblogic.Server"

  • Read the Microsoft Support article on specifying the startup order of Windows Services
  • Using regedit, add one group for each of the OBIEE processes to be started to the end of the list in the entry at:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServiceGroupOrder

    Example groups:

    OBI Node Manager
    OBI AdminServer
    OBI Managed Server
    OBI OPMN

    This will sequence the startup of your services based on group

  • Note down the names of the OPMN and Node Manager Services from the registry

    Registry Location:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

    Sample Service Names:

    OracleProcessManager_instance1
    Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)

  • For each of the two services above add a string value (right-click the registry folder and follow New > String Value) named Group and provide the corresponding group value for each service (i.e. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)\Group=OBI Node Manager). The same values can also be added from the command line, as sketched below.

    This will work along with the ServiceGroupOrder configuration to ensure the startup order of your services
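
    If you prefer the command line over regedit, the same Group values can be added with reg add from an elevated prompt. This is only a sketch; the service names are the samples noted above, so substitute the ones from your own registry:

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\OracleProcessManager_instance1" /v Group /t REG_SZ /d "OBI OPMN" /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)" /v Group /t REG_SZ /d "OBI Node Manager" /f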

Implementation


AdminServer Service

  • Create a new script named %MW_HOME%\wlserver_10.3\server\bin\installAdminServer_svc.cmd using the code below:

    SETLOCAL
    @echo off
    set MW_HOME=d:\obi_mw
    set DOMAIN_NAME=bifoundation_domain
    set USERDOMAIN_HOME=%MW_HOME%\user_projects\domains\%DOMAIN_NAME%
    set SERVER_NAME=AdminServer
    set PRODUCTION_MODE=true
    call "%MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd"
    ENDLOCAL

  • Run the installAdminServer_svc.cmd script
  • Using regedit, verify that a service named Oracle_bifoundation_domain_AdminServer now exists under the following location

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

  • Right click your service folder and follow the context menus New > String Value to add a new entry in your service folder, call your new string value Group and double click on it to add OBI AdminServer as a value. This will work in tandem with the ServiceGroupOrder configuration to ensure the startup order of your services.
  • Right click your service folder and follow the context menus New > Multi-String Value to add a new entry in your service folder, call your new string value DependOnService and double click on it to add the node manager service as a dependency, the node manager service must match the service listing you noted down as part of the pre-requisite preparation steps (ie. Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)). Using this dependency value will cause Windows to verify that dependent services have been started before attempting to start this service.
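
    As an alternative to editing DependOnService by hand, the same dependency can be set from an elevated command prompt with sc config. This is only a sketch; note the mandatory space after depend=, and that the service names are the ones used throughout this post:

    sc config Oracle_bifoundation_domain_AdminServer depend= "Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)"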

BI Managed Server Service

  • Create a new script named %MW_HOME%\wlserver_10.3\server\bin\installbi_server1_svc.cmd using the code below:

    SETLOCAL
    @echo off
    set MW_HOME=d:\obi_mw
    set DOMAIN_NAME=bifoundation_domain
    set USERDOMAIN_HOME=%MW_HOME%\user_projects\domains\%DOMAIN_NAME%
    set SERVER_NAME=bi_server1
    set PRODUCTION_MODE=true
    set ADMIN_URL=http://localhost:7001
    call "%MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd"
    ENDLOCAL

  • Run the installbi_server1_svc.cmd script
  • Using regedit, verify that a service named Oracle_bifoundation_domain_bi_server1 now exists under the following location

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

  • Right click your service folder and follow the context menus New > String Value to add a new entry in your service folder, call your new string value Group and double click on it to add OBI Managed Server as a value. This will work in tandem with the ServiceGroupOrder configuration to ensure the startup order of your services.
  • Right click your service folder and follow the context menus New > Multi-String Value to add a new entry in your service folder, call your new string value DependOnService and double click on it to add the node manager and WebLogic AdminServer services as dependencies (i.e. Oracle_bifoundation_domain_AdminServer). Using this dependency value will cause Windows to verify that dependent services have been started before attempting to start this service.
  • Right click the service folder for OPMN and follow the context menus New > Multi-String Value to add a new entry in your service folder, call your new string value DependOnService and double click on it to add the node manager, WebLogic AdminServer, and your new BI Managed Server services as dependencies (i.e. Oracle_bifoundation_domain_bi_server1). Using this dependency value will cause Windows to verify that dependent services have been started before attempting to start this service.
  • In the Administrative Tools > Services application, verify that all of the following services are configured to start automatically and, optionally, configure what actions are taken when a service fails to start.

    Oracle WebLogic NodeManager
    Oracle_bifoundation_domain_AdminServer
    Oracle_bifoundation_domain_bi_server1
    Oracle Process Manager (instance 1)

Setup Validation

  • Restart your windows server and monitor the order in which services are started
  • If you see issues with OPMN starting at the same time as your WebLogic servers, you might need to set the AdminServer, BI Managed Server, and OPMN services to start manually and use the code below to create a batch command file that is executed by a scheduled task each time the computer starts:

    net start Oracle_bifoundation_domain_AdminServer
    timeout 300
    net start Oracle_bifoundation_domain_bi_server1
    timeout 300
    net start OracleProcessManager_instance1

    This script uses the timeout DOS command to institute a five-minute wait between each of the OBI services being started.
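
    One way to wire that batch file to system startup is a scheduled task that runs at boot under the SYSTEM account. This is only a sketch, and the path D:\scripts\start_obiee.cmd is a made-up location standing in for wherever you save the script:

    schtasks /create /tn "Start OBIEE" /tr "D:\scripts\start_obiee.cmd" /sc onstart /ru SYSTEM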


Error: Unable to access Oracle Data Integrator repository. You will not be able generate or execute load plans.

Oracle Data Integrator (ODI) Logo

Upon logging in to the Oracle Business Intelligence Applications Configuration Manager I was greeted with an error message stating the following: “Unable to access Oracle Data Integrator repository. You will not be able generate or execute load plans.”

 

After some triage I was able to determine a few possible root causes for my issue.

 

  1. Not using the BIAdmin user created during the install process, or the BIAdmin user account was not created
  2. The current session was initiated with a set of credentials that have not been granted the BIA_ADMINISTRATOR_DUTY role in WebLogic security
  3. Additional roles are missing or not assigned to the credentials initiating the current session

 

After reviewing all possible options and confirming I was using a valid account with the proper roles and permissions, I asked my system administrator to restart the server, which fixed my issue. I should have started there, but at least I came out of the experience with a better understanding of the roles that control security in my installation of OBIA 11.1.1.7.0.

 

Last resort, if all else fails:

* Make sure that, after you have regenerated and moved the security files, the odi.conf file is updated to refer to jps-config-jse.xml.