hadoop

More Than Just a Lake

Data lakes, like legacy storage arrays, are passive. They only hold data. Hadoop is an active reservoir, not a passive data lake. HDFS is a computational file system that can digest, filter, and analyze any data in the reservoir, not just store it. HDFS is the only economically sustainable computational file system in existence. Some […]

Hadoop is Not Merely a SQL EDW Platform

I was recently invited to speak about big data at the Rocky Mountain Oracle Users Group. I presented to Oracle professionals who are faced with an onslaught of hype and mythology regarding Big Data in general and Hadoop in particular. Most of the audience was familiar with the difficulty of attempting to engineer even Modest Data on […]

20th Century Stacks vs. 21st Century Stacks

In the 1960s, Bank of America and IBM built one of the first credit card processing systems. Although those early mainframes processed just a fraction of the data compared to that of eBay or Amazon, the engineering was complex for the day. Once credit cards became popular, processing systems had to be built to handle […]

C14 OakTable World Las Vegas

class="l-submain">
class="l-submain-h g-html i-cf">

If you haven’t yet made the decision to attend COLLABORATE 14 – IOUG Forum in Las Vegas, taking place on 7-11 April 2014 at the Venetian Hotel, this might just help you make the call. You know you want to be there.

OakTable Network will be holding its OakTable World for the very first time during the COLLABORATE conference. While it’s a little bit last-minute, IOUG was able to provide a room for us to use for the whole day, and we at OakTable quickly put the schedule together. The agenda was selected by the OakTable speakers, covering topics they are really passionate about. As history shows, this is generally what you also want to hear about.

Kafka or Flume?

A question that keeps popping up is “Should we use Kafka or Flume to load data to Hadoop clusters?”

This question implies that Kafka and Flume are interchangeable components. It makes as much sense to me as “Should we use cars or umbrellas?”. Sure, you can hide from the rain in your car and you can use your umbrella when moving from place to place. But in general, these are different tools intended for different use-cases.

Flume’s main use-case is to ingest data into Hadoop. It is tightly integrated with Hadoop’s monitoring system, file system, file formats, and utilities such as Morphlines. A lot of the Flume development effort goes into maintaining compatibility with Hadoop. Sure, Flume’s design of sources, sinks, and channels means that it can be used to move data between other systems flexibly, but the important feature is its Hadoop integration.
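Flume agents are wired together in a properties file that connects sources, channels, and sinks. As a minimal hypothetical sketch (the agent name, source command, and HDFS path are placeholders, not from any specific deployment), an agent that tails a log file into HDFS could look like:

    # hypothetical agent: tail a log file and land events in HDFS
    agent1.sources  = logsrc
    agent1.channels = memch
    agent1.sinks    = hdfssink

    # exec source turns the output of a command into Flume events
    agent1.sources.logsrc.type = exec
    agent1.sources.logsrc.command = tail -F /var/log/app.log
    agent1.sources.logsrc.channels = memch

    # memory channel buffers events between source and sink
    agent1.channels.memch.type = memory
    agent1.channels.memch.capacity = 10000

    # the HDFS sink is the Hadoop integration point
    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.channel = memch
    agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/events

The same source -> channel -> sink wiring can point at other systems, but as noted above, the HDFS sink and its Hadoop integration are the main draw.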

Quick Tip on Using Sqoop Action in Oozie

Another Oozie tip blog post.

If you use the Sqoop action in Oozie, you may know that you can use the "command" format, with the entire Sqoop configuration on a single line:


    ...
    <action name="myfirsthivejob">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <command>import --connect jdbc:hsqldb:file:db.hsqldb --table TT --target-dir hdfs://localhost:8020/user/tucu/foo -m 1</command>
        </sqoop>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...

This is convenient, but can be difficult to read and maintain. I prefer the "arg" syntax, with each argument on its own line:
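For reference, a minimal sketch of the same action rewritten with "arg" elements, following the Oozie sqoop-action schema and reusing the command from the example above:

    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>foo:8021</job-tracker>
        <name-node>bar:8020</name-node>
        <!-- each token of the Sqoop command goes in its own <arg> element -->
        <arg>import</arg>
        <arg>--connect</arg>
        <arg>jdbc:hsqldb:file:db.hsqldb</arg>
        <arg>--table</arg>
        <arg>TT</arg>
        <arg>--target-dir</arg>
        <arg>hdfs://localhost:8020/user/tucu/foo</arg>
        <arg>-m</arg>
        <arg>1</arg>
    </sqoop>

Besides readability, the "arg" form avoids surprises with arguments that contain spaces, since each <arg> is passed to Sqoop as-is rather than being split on whitespace.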

Why Oozie?

That’s a really frequently asked question. Oozie is a workflow manager and scheduler. Most companies already have a workflow scheduler – ActiveBatch, AutoSys, UC4, HP Orchestration. These workflow schedulers run jobs on all their existing databases – Oracle, Netezza, MySQL. Why does Hadoop need its own special workflow scheduler?

As usual, it depends. In general, you can keep using any workflow scheduler that works for you. No need to change, really.
However, Oozie does have some benefits that are worth considering:

Parameterizing Hive Actions in Oozie Workflows

A very common request I get from my customers is to parameterize the query executed by a Hive action in their Oozie workflow.
For example, the dates used in the query depend on the result of a previous action. Or maybe they depend on something completely external to the system – the operator just decides to run the workflow on specific dates.

There are many ways to do this, including using EL expressions or capturing the output of a shell or Java action.
Here’s an example of how to pass the parameters through the command line. This assumes that whoever triggers the workflow (a human or an external system) has the correct value and just needs to pass it to the workflow so it can be used by the query.

Here’s what the query looks like:
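As a hypothetical stand-in (the script name, table, and targetDate parameter are invented for illustration, not taken from the original post), a parameterized Hive script and the action that feeds it could look like:

    <!-- count_by_day.hql (hypothetical) might contain:
         SELECT ds, COUNT(*) FROM events WHERE ds = '${targetDate}' GROUP BY ds;
    -->
    <action name="hive-example">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>count_by_day.hql</script>
            <!-- pass the workflow-level value down to the Hive script -->
            <param>targetDate=${targetDate}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

Whoever triggers the workflow then supplies the value at submission time, for example: oozie job -run -config job.properties -DtargetDate=2013-11-01.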

Using Oozie in Kerberized Cluster

In general, most Hadoop ecosystem tools work rather transparently in a kerberized cluster. Most of the time things “just work”. This includes Oozie. Still, when things don’t “just work”, they tend to fail with slightly alarming and highly ambiguous error messages. Here are a few tips for using Oozie when your Hadoop cluster is kerberized. Note that this is a client/user guide; I assume you have already followed the documentation on how to configure the Oozie server in a kerberized cluster (or you are using Cloudera Manager, which magically configures it for you).
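One example of the kind of configuration this can involve: in a kerberized cluster, actions that talk to the Hive metastore need explicit credentials defined in the workflow. A minimal sketch using Oozie’s HCatalog credential type (the metastore URI, principal, and all names here are placeholders):

    <workflow-app name="kerberized-wf" xmlns="uri:oozie:workflow:0.4">
        <credentials>
            <credential name="hive_creds" type="hcat">
                <property>
                    <name>hcat.metastore.uri</name>
                    <value>thrift://metastore-host:9083</value>
                </property>
                <property>
                    <name>hcat.metastore.principal</name>
                    <value>hive/_HOST@EXAMPLE.COM</value>
                </property>
            </credential>
        </credentials>
        <start to="hive-action"/>
        <!-- the action opts into the credential via the cred attribute -->
        <action name="hive-action" cred="hive_creds">
            ...
        </action>
    </workflow-app>

And don’t forget a valid Kerberos ticket (kinit) on the client before submitting the job.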

My Strata + Hadoop World Schedule

Next week is Strata + Hadoop World, which is bound to be exciting for those who deal with big data on a daily basis. I’ll be spending my time talking about Cloudera Impala at various places, so I’m posting my schedule for those interested in catching up on fast SQL on Hadoop. Hope to see you there!

Office Hour with Greg Rahn @ the Cloudera Booth
10/29/2013 11:00am – 11:30am EDT (30 minutes)
Room: 3rd Floor, Mercury Ballroom, Booth #403
http://www.cloudera.com/content/cloudera/en/new/strata-hadoop-world.html