Google File System Design Assumptions

Ignacio de la Torre, Editor, The Analytics Journal

 


In today’s post I want to highlight the brilliance of the Google Research team. Their ability to step back and question old assumptions reminds me of the Wright brothers realizing that lift tables from the 1700s, along with other widespread assumptions of their time, were the main constraints holding them back from building the first airplane.

 

At Google Research something similar went on when they realized that traditional data storage and processing paradigms did not fit well with their applications’ processing workloads. Here are some of the design assumptions for the Google File System, straight from the published research paper, with my comments:

 

  1. Failure is an expectation, not an exception
    Google realized that the traditional way to address failure in the datacenter is to increase the sophistication of the hardware platforms involved. This approach increases cost both by using highly specialized hardware and by requiring system administrators with very sophisticated skills. The main innovation here is realizing that when dealing with massive datasets (e.g. a copy of the entire web) hardware failure is a fact of life rather than an exception. Once this observation is incorporated into the design, costs can be decreased by storing and processing data on very large clusters of commodity hardware, where redundancy and replication across processing nodes and racks allow for seamless recovery from hardware failure.
  2. The system stores a modest number of large data files
    This observation comes from looking at the nature of the data being processed, such as HTML markup from crawling a large number of websites: what we would call “unstructured data” that is cleaned and serialized by the crawler before it is “batched” together into large files. Once again, by taking a step back and looking at the problem with fresh eyes, the researchers realized their design did not need to optimize for the storage of billions of small files. This is a great constraint to remove from the design, as we will see when we look at the ability of the GFS master server to keep the metadata for all files in a cluster in memory, allowing it to make very smart load balancing, placement and replication decisions.
  3. Workloads primarily consist of large streaming reads and small random reads
    By looking at actual application workloads the researchers found that read operations generally fall into these two categories, and that successive read operations from the same client often read contiguous regions of a file. Also, performance-minded applications batch and sort their reads so that their progress through a dataset moves in one direction, from beginning to end, instead of going back and forth with random I/O operations.
  4. The workloads also have many large, sequential writes that append to data files
    Notice here how “delete” and “update” operations are extremely rare to non-existent; this frees the system design from the onerous task of maintaining locks to ensure the atomicity of these two operations.
  5. Atomicity with minimal synchronization is essential
    The system design focuses on supporting large writes by batch processes and “append” operations by a large number of concurrent clients, freeing itself from the constraints mentioned in the previous point.
  6. High sustained bandwidth is more important than low latency
    A good observation: when dealing with these large datasets most applications are batch oriented and benefit most from high processing throughput, unlike the traditional database application that places a premium on fast response times.

 

In hindsight, these observations might seem obvious, especially as they have been incorporated into the design principles that drive other products such as Apache Hadoop. But Google’s decision to invest in a custom-made file system to fit its very specific needs, and the ability of the Google Research team to step back and start their design with fresh eyes, have truly revolutionized our data processing forever. Cheers to them!

 

Reference:

“The Google File System”; Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung; Google Research, 2003

Hadoop Ecosystem: Zookeeper – The distributed coordination server


“ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” [1]

At first it is hard to visualize the role of Zookeeper as a component in the Hadoop ecosystem, so let’s examine some of the services and constructs that it provides to distributed computing applications:

  • Locks: Zookeeper provides mechanisms to create and maintain globally distributed locks. This allows applications to maintain transaction atomicity for any kind of object by ensuring that at no point in time can two clients or transactions hold a lock on the same resource (a minimal code sketch appears after this list).
  • Queues: Zookeeper allows distributed applications to maintain regular FIFO and priority-based queues, where a list of messages or objects is held by a Zookeeper node that clients connect to in order to submit new queue members as well as to request the list of members pending processing. This allows applications to implement asynchronous processes where a unit of work is placed on a queue and processed whenever the next worker process is available to take it on.
  • Two-Phase Commit Coordination: Zookeeper allows applications that need to commit or abort a transaction across multiple processing nodes to coordinate the two-phase commit pattern through its infrastructure. Each client applies the transaction tentatively during the first phase and notifies the coordination node, which then lets all parties involved know whether the transaction was globally successful.
  • Barriers: Zookeeper supports the creation of synchronization points called barriers. This is useful when multiple asynchronous processes need to converge on a common synchronization point once all worker processes have executed their independent units of work.
  • Leader Election: Zookeeper allows distributed applications to automate leader election across a list of available nodes; this helps applications running on a cluster optimize for locality and load balancing.

As you can see, Zookeeper plays a vital role as a foundation service for distributed applications that need to coordinate independent, asynchronous processes across a large number of computing nodes in a cluster environment.
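
To make the lock construct concrete, here is a minimal sketch of the classic lock recipe [2] using the ZooKeeper Java client. The class name and structure are my own illustration, and it assumes the parent znode (lockRoot) already exists; a production application would normally reach for a tested implementation such as Apache Curator instead.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical helper class, for illustration only.
    public class SimpleDistributedLock {
        private final ZooKeeper zk;
        private final String lockRoot; // parent znode, e.g. "/locks/my-resource"
        private String myNode;

        public SimpleDistributedLock(ZooKeeper zk, String lockRoot) {
            this.zk = zk;
            this.lockRoot = lockRoot;
        }

        // Blocks until this client holds the lock.
        public void lock() throws KeeperException, InterruptedException {
            // Ephemeral: the znode vanishes if our session dies, releasing the lock.
            // Sequential: ZooKeeper appends a monotonically increasing counter.
            myNode = zk.create(lockRoot + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = myNode.substring(lockRoot.length() + 1);
            while (true) {
                List<String> children = zk.getChildren(lockRoot, false);
                Collections.sort(children);
                if (children.get(0).equals(myName)) {
                    return; // lowest sequence number owns the lock
                }
                // Watch only the znode just before ours to avoid a thundering herd.
                String prev = lockRoot + "/" + children.get(children.indexOf(myName) - 1);
                CountDownLatch gone = new CountDownLatch(1);
                if (zk.exists(prev, event -> gone.countDown()) != null) {
                    gone.await(); // woken when the predecessor goes away
                }
            }
        }

        public void unlock() throws KeeperException, InterruptedException {
            zk.delete(myNode, -1); // -1 matches any znode version
        }
    }

A client simply brackets its critical section with lock() and unlock() against the same lockRoot; because the lock znode is ephemeral, a crashed client releases the lock automatically when its session expires.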

References:

[1] Zookeeper Website, http://zookeeper.apache.org/

[2] Zookeeper Recipes, http://zookeeper.apache.org/doc/trunk/recipes.html

Sqoop: Create-hive-table tool


Tool: CREATE-HIVE-TABLE

--table <source_table> : The name of the table on the originating database

--hive-table <target_table> : The name of the Hive table to be created

 

--enclosed-by : The character used to enclose (quote) field values

--escaped-by : The character used to escape instances of the delimiters inside field data

--fields-terminated-by : The character used to separate fields

--lines-terminated-by : The character used to terminate lines

--mysql-delimiters : Use MySQL's default delimiter set: ( , ) for fields, ( \n ) for lines, enclosed-by ( ' ), escaped-by ( \ )

 

Notes:

* The tool will fail if the target Hive table already exists
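
Putting the options together, a hypothetical invocation might look like the example below; the connection string, credentials and table names are made up for illustration:

    sqoop create-hive-table \
      --connect jdbc:mysql://db.example.com/sales \
      --username reporting \
      --table ORDERS \
      --hive-table orders \
      --fields-terminated-by ',' \
      --lines-terminated-by '\n'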

Hadoop Ecosystem: Hive – the Data Warehouse and SQL interface


The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

 

Hive is both a metadata layer on top of HDFS and a SQL interpreter. This allows companies to store structured or semi-structured data as files on Hadoop without a large initial data modeling effort; once business requirements align with the need to extract new insights from the stored data, a development team can leverage the “schema on read” paradigm to create metadata about these files.

 

Having a SQL interpreter gives business analysts and power users access to terabytes or petabytes of information through a familiar query language. This is a dramatic departure from MapReduce, where a very specialized skill set would be required to write multiple Map and Reduce functions in order to achieve the same results.
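
As a quick illustration of the schema-on-read idea, the hypothetical HiveQL below projects a table definition onto tab-delimited log files already sitting in HDFS and then queries them with plain SQL; the table name, columns and path are made up:

    -- Project structure onto existing files; no data is moved or converted
    CREATE EXTERNAL TABLE web_logs (
      ip     STRING,
      ts     STRING,
      url    STRING,
      status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_logs';

    -- Analysts can now answer questions with familiar SQL
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    WHERE status = 404
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;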

Configuring OBIEE to run as a Windows Service

Introduction

One key step in configuring an enterprise deployment of Oracle Business Intelligence is to set up your services to run in the background as Windows services and start automatically with your server. By default the installer creates Windows services for Oracle Process Manager (OPMN) and the WebLogic Node Manager; this leaves us with the need to configure services for the WebLogic AdminServer and the BI Managed Server.

Pre-requisites

  • Verify boot.properties files exist for both Weblogic Servers

    AdminServer:
    %mw_home%\user_projects\domains\bifoundation_domain\servers\AdminServer\security\boot.properties

    BI Managed Server:
    %mw_home%\user_projects\domains\bifoundation_domain\servers\bi_server1\security\boot.properties
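
    If a boot.properties file is missing you can create it with just the two entries below (example credentials shown; WebLogic encrypts the values on the next server start):

    username=weblogic
    password=<your_admin_password>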

  • Define the MW_HOME Variable
  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to redirect standard output to a file and to set the service name prefix to “Oracle_”

    Log Syntax:
    -log:"%MW_HOME%\user_projects\domains\bifoundation_domain\servers\%SERVER_NAME%-stdout.txt"

    Example Customization:

    set MW_HOME=%WL_HOME%\..\

    rem *** Install the service
    "%WL_HOME%\server\bin\beasvc" -install -svcname:"Oracle_%DOMAIN_NAME%_%SERVER_NAME%" -javahome:"%JAVA_HOME%" -execdir:"%USERDOMAIN_HOME%" -maxconnectretries:"%MAX_CONNECT_RETRIES%" -host:"%HOST%" -port:"%PORT%" -extrapath:"%EXTRAPATH%" -password:"%WLS_PW%" -cmdline:%CMDLINE% -log:"%MW_HOME%\user_projects\domains\bifoundation_domain\servers\%SERVER_NAME%-stdout.txt"

    Note: In the -svcname argument, make sure you replace the default beasvc prefix (and the space after it) with Oracle_ as shown above

  • Locate the domain environment script (it will be referenced in the next step)

    %MW_HOME%\user_projects\domains\bifoundation_domain\bin\setOBIDomainEnv.cmd

  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to ensure the correct Java memory arguments are utilized by your windows service

    Old Code:
    call "%WL_HOME%\common\bin\commEnv.cmd"
    New Code:
    call "%WL_HOME%\..\user_projects\domains\bifoundation_domain\bin\setOBIDomainEnv.cmd"

  • Edit %MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd to work around the Windows limit of 2KB on the maximum length of the command line
    • Locate the two instances where the script sets the value of the CMDLINE variable
    • Add the code below before each instance; this code outputs the current value of CLASSPATH to a text file

      REM –
      REM output the class path to text file and change reference to file on CMDLINE variable
      REM this is a workaround to a limit on windows command line to 2KB
      echo %CLASSPATH% > %WL_HOME%\server\bin\classpath.txt

    • Replace the class path variable reference \"%CLASSPATH%\" with @%WL_HOME%\server\bin\classpath.txt as depicted in the example below

      set CMDLINE="%JAVA_VM% %MEM_ARGS% %JAVA_OPTIONS% -classpath @%WL_HOME%\server\bin\classpath.txt -Dweblogic.Name=%SERVER_NAME% -Dweblogic.management.username=%WLS_USER% -Dweblogic.ProductionModeEnabled=%PRODUCTION_MODE% -Djava.security.policy=\"%WL_HOME%\server\lib\weblogic.policy\" weblogic.Server"

  • Read the Microsoft Support article on specifying the startup order of Windows Services
  • Using regedit, add one group for each of the OBIEE processes to be started to the end of the list entry at:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServiceGroupOrder

    Example groups:

    OBI Node Manager
    OBI AdminServer
    OBI Managed Server
    OBI OPMN

    This will sequence the startup of your services based on group

  • Note down the names of the OPMN and Node Manager Services from the registry

    Registry Location:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

    Sample Service Names:

    OracleProcessManager_instance1
    Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)

  • For each of the two services above add a string value (right click the registry key and follow New > String Value) named Group and provide the corresponding group value for each service (i.e. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)\Group = OBI Node Manager). If you prefer the command line, see the reg add example below.

    This will work along with the ServiceGroupOrder configuration to ensure the startup order of your services
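
    The same Group values can also be scripted with reg add; the example below uses the sample service names from the previous step, so adjust them to your own registry listing:

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)" /v Group /t REG_SZ /d "OBI Node Manager"
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\OracleProcessManager_instance1" /v Group /t REG_SZ /d "OBI OPMN"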

Implementation


AdminServer Service

  • Create a new script named %MW_HOME%\wlserver_10.3\server\bin\installAdminServer_svc.cmd using the code below:

    SETLOCAL
    @echo off
    set MW_HOME=d:\obi_mw
    set DOMAIN_NAME=bifoundation_domain
    set USERDOMAIN_HOME=%MW_HOME%\user_projects\domains\%DOMAIN_NAME%
    set SERVER_NAME=AdminServer
    set PRODUCTION_MODE=true
    call "%MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd"
    ENDLOCAL

  • Run the installAdminServer_svc.cmd script
  • Using regedit, verify that a service named Oracle_bifoundation_domain_AdminServer now exists under the following location

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

  • Right click the service key and follow the context menus New > String Value to add a new entry; name the new string value Group and double click it to set OBI AdminServer as its value. This works in tandem with the ServiceGroupOrder configuration to ensure the startup order of your services.
  • Right click the service key and follow the context menus New > Multi-String Value to add a new entry; name the new value DependOnService and double click it to add the Node Manager service as a dependency. The value must match the service name you noted down as part of the pre-requisite preparation steps (i.e. Oracle WebLogic NodeManager (d_obi_mw_wlserver_10.3)). This dependency causes Windows to verify that dependent services have been started before attempting to start this service.

BI Managed Server Service

  • Create a new script named %MW_HOME%\wlserver_10.3\server\bin\installbi_server1_svc.cmd using the code below:

    SETLOCAL
    @echo off
    set MW_HOME=d:\obi_mw
    set DOMAIN_NAME=bifoundation_domain
    set USERDOMAIN_HOME=%MW_HOME%\user_projects\domains\%DOMAIN_NAME%
    set SERVER_NAME=bi_server1
    set PRODUCTION_MODE=true
    set ADMIN_URL=http://localhost:7001
    call "%MW_HOME%\wlserver_10.3\server\bin\installSvc.cmd"
    ENDLOCAL

  • Run the installbi_server1_svc.cmd script
  • Using regedit, verify that a service named Oracle_bifoundation_domain_bi_server1 now exists under the following location

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\

  • Right click the service key and follow the context menus New > String Value to add a new entry; name the new string value Group and double click it to set OBI Managed Server as its value. This works in tandem with the ServiceGroupOrder configuration to ensure the startup order of your services.
  • Right click the service key and follow the context menus New > Multi-String Value to add a new entry; name the new value DependOnService and double click it to add the Node Manager and WebLogic AdminServer services as dependencies (i.e. Oracle_bifoundation_domain_AdminServer). This dependency causes Windows to verify that dependent services have been started before attempting to start this service.
  • Right click the service key for OPMN and follow the context menus New > Multi-String Value to add a new entry; name the new value DependOnService and double click it to add the Node Manager, WebLogic AdminServer and your new BI Managed Server services as dependencies (i.e. Oracle_bifoundation_domain_bi_server1). This dependency causes Windows to verify that dependent services have been started before attempting to start this service.
  • In the Administration Tools > Services application verify that all of the following services are configured to start automatically and, optionally, configure what actions are taken on failure to start each service.

    Oracle WebLogic NodeManager
    Oracle_bifoundation_domain_AdminServer
    Oracle_bifoundation_domain_bi_server1
    Oracle Process Manager (instance 1)

Setup Validation

  • Restart your windows server and monitor the order in which services are started
  • If you see issues with OPMN starting at the same time as your WebLogic servers, you might need to set the AdminServer, BI Managed Server and OPMN services to start manually and use the code below to create a batch file that a scheduled task executes each time the computer starts:

    net start Oracle_bifoundation_domain_AdminServer
    timeout 300
    net start Oracle_bifoundation_domain_bi_server1
    timeout 300
    net start OracleProcessManager_instance1

    This script uses the timeout DOS command to institute a five-minute wait between each of the OBI services being started.
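
    One hypothetical way to wire this batch file to computer startup is a scheduled task that runs at boot; the script path below is made up, so adjust it to wherever you save the file:

    schtasks /create /tn "Start OBI Services" /tr "d:\obi_mw\scripts\start_obi_services.cmd" /sc onstart /ru SYSTEM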


Error: Unable to access Oracle Data Integrator repository. You will not be able generate or execute load plans.


Upon logging in to the Oracle Business Intelligence Applications Configuration Manager I was greeted with an error message stating the following: “Unable to access Oracle Data Integrator repository. You will not be able generate or execute load plans.”

 

After some triage I was able to determine a few possible root causes for my issue.

 

  1. Not using the BIAdmin user created during the install process, or the BIAdmin user account was not created.
  2. The current session was initiated with a set of credentials that haven’t been granted the BIA_ADMINISTRATOR_DUTY role in WebLogic security.
  3. Additional roles are missing or not assigned to the credentials initiating the current session.

 

After reviewing all possible options and confirming I was using a valid account with the proper roles and permissions, I asked my system administrator to restart the server, which fixed my issue. I should have started there, but at least I came out of the experience with a better understanding of the roles that control security in my installation of OBIA 11.1.1.7.0.

 

Last resort, if all else fails:

* Make sure that after you have regenerated and moved the security files, the odi.conf file is updated to refer to jps-config-jse.xml.

How to define an index on a source or target table in ODI


The topic of indexes in Oracle Data Integrator (ODI) is an easy one once you learn your way around ODI Studio. Indexes are defined under Design > Models > [your model] > [Table] > Constraints. They are independent of the column definitions for the most part, although they do reference valid column definitions.

 

  • The first step in creating metadata for an index is to add a new constraint and name it.

  • On the Columns tab, define the list of columns to be indexed.

  • On the Control tab, ensure that the index is active and marked for creation on the database.

  • Review all additional options on the Flexfields tab; for some of the options, such as Index Type, you do need to manually type your selection.

 

 

After you have defined an index in metadata you need to run a new load for the index to be created on the database.

 

My Tools:

Oracle Business Intelligence Applications (OBIA) 11.1.1.7.0

Oracle Data Integrator 11g (ODI Studio, ODI Server, no clustering)

 

Customizing your DOS prompt

Being back in a Windows shop I find the DOS prompt sort of gets in my way, so I have customized it to display my current path and then let me type on a new line with this command:

PROMPT Ignacio's rocking at $P$G$_

The different formatting options for the command are:

$A & (Ampersand)
$B | (pipe)
$C ( (Left parenthesis)
$D Current date
$E Escape code (ASCII code 27)
$F ) (Right parenthesis)
$G > (greater-than sign)
$H Backspace (erases previous character)
$L < (less-than sign)
$N Current drive
$P Current drive and path
$Q = (equal sign)
$S (space)
$T Current time
$V Windows XP version number
$_ Carriage return and linefeed
$$ $ (dollar sign)

To make your changes permanent go to My Computer > Properties > Advanced > Environment Variables and create a new variable called PROMPT.
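
On more recent versions of Windows you can also do this from the prompt itself with setx, which stores the variable for future sessions; the example below uses the plain path-and-newline prompt:

    setx PROMPT $P$G$_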


How to make sure an index is defined as unique in ODI


Indexes are defined as constraints in the Models view in ODI Studio. To make a field unique, follow one of the two alternatives listed below.

 

PRIMARY KEY

 

Define the index as a Primary Key constraint object in the Models area within the Design view

Description view on primary key index definition in ODI

 

Add the unique columns on the Columns tab

In the Control tab, make sure the constraint is active and marked to be defined in the database for both the flow and static control checkboxes.

 

Review that the correct settings are configured in the Flexfields tab

Flexfields view on Primary Key Index in ODI

 

ALTERNATE KEY

 

Define the constraint object for your table as an Alternate Key in the ODI Models area

Defining a unique index in Oracle ODI

 

Add the unique columns on the Columns tab


 

In the Control tab, make sure the constraint is active and marked to be defined in the database for both the flow and static control checkboxes.

Review that the correct settings are configured in the Flexfields tab

Flexfields view on unique index in ODI

How To: Download, install, configure and verify you have the latest version of OPatch

In this post we will discuss all the steps necessary to ensure you have the correct version of the OPatch patching utility for the Oracle software running on your system.

Environment Variables:

Set your ORACLE_INSTANCE path to a valid OBIEE 11g instance

Set your ORACLE_HOME path to the OBIEE 11g home (Oracle_BI1)
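
For example, on a hypothetical installation under d:\obi_mw the session setup might look like the commands below; adjust the paths to your own Middleware home:

    set ORACLE_HOME=d:\obi_mw\Oracle_BI1
    set ORACLE_INSTANCE=d:\obi_mw\instances\instance1
    set PATH=%ORACLE_HOME%\OPatch;%PATH%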

Validating Your Environment:

OPatch is installed by default; to locate it, go to the ORACLE_HOME path for your OBIEE installation:

\Oracle_BI1\OPatch

To verify that your current session is correctly configured, check the installed version by running the following command:

opatch version

Minimum Required Versions:

OBIEE 11.1.1.5.0 -> OPatch version 11.1.0.8.3 or higher (do NOT use OPatch 12.x)

OBIEE 11.1.1.6.0 -> No OPatch packages available as of March 12, 2012

Downloading a Newer Version of OPatch:

To find and download the appropriate version of OPatch for your system, go to Oracle Support and find the knowledge base article below:

Note 224346.1 – Opatch – Where Can I Find the Latest Version of Opatch?

Installing OPatch:

To install OPatch once you have downloaded the appropriate version follow the steps below:

  1. Rename your current OPatch directory (ORACLE_HOME\OPatch)
  2. Copy the zip file to your ORACLE_HOME
  3. Unzip the zip file
  4. Verify that the upgrade succeeded

    cd OPatch
    opatch version
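
On a Windows host the steps above might look like the sketch below; OPatch is distributed as patch 6880880 on Oracle Support, and the backup directory name is my own choice:

    cd /d %ORACLE_HOME%
    ren OPatch OPatch.old
    rem extract the downloaded OPatch zip file (p6880880) here with your archive tool of choice
    cd OPatch
    opatch version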

File System Access:

OPatch will update the local Oracle inventory, so the user account running OPatch must have access to the location of the OUI inventory. To verify this you can run the following command:

opatch lsinventory

Patch Directory Location (PATCH_TOP):

If you have a centralized location where you store code / releases it is a good idea to create a directory called PATCH_TOP to store patches as they are applied to each environment (DEV, QA, STAGE, PRD).
