Top 60 Oracle Blogs

Oracle Big Data Appliance — Oracle’s Bold Move Into Big Data Space

Oracle Big Data Appliance (BDA) is being announced at the Oracle OpenWorld keynote as I’m posting this. It will take some time before it is actually available for shipment, and some details will likely change, but here is what we have so far about Oracle Big Data Appliance. A rack with InfiniBand, full of [...]

Oracle Big Data Appliance — What’s Cooking?

Many analysts are suggesting that a big data appliance will be announced at this OOW. Based on the published Oracle OpenWorld focus sessions (PDF documents), the following technologies will most likely be the key ones — Hadoop, NoSQL, a Hadoop data loader for Oracle, and the R language. Want more details? You’ll have to wait for them. [...]

It’s the End of the World As We Know It (NoSQL Edition)

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss; go take a look!

Everyone knows that seminal papers need a simple, descriptive title. “A Relational Model of Data for Large Shared Data Banks”, for example. I think Michael Stonebraker overshot the target in his 2007 paper titled “The End of an Architectural Era”.

Why is this The End? According to Michael Stonebraker, “current RDBMS code lines, while attempting to be a ‘one size fits all’ solution, in fact excel at nothing. Hence, they are 25-year-old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engines”.

On the Difficulty of Data Migrations (Especially to NoSQL Databases)

(Originally posted at Pythian Blog)

I’ve been reading a lot of NoSQL blogs recently, and one thing that bothers me is that many of the leading NoSQL bloggers seem to have had very different operational experience than I have.

Here’s an example:
Over at the O’Reilly community blogs, Andy Oram interviewed two MongoDB experts about migrating from a relational database to MongoDB.

Here’s what the experts said:

Mapping the NoSQL space

NoSQL is an unfortunate name – it doesn’t describe what these products do, only which query language they will not support. What’s worse, it makes people treat the various non-relational databases as interchangeable, while in fact many of them solve completely different problems and have different trade-offs, strengths, etc.

What is common to all these DBs is that they give up ACID in an attempt to improve scalability; most of them are distributed, and most were built to handle semi-structured or unstructured data.

The theoretical case for these databases starts from the CAP theorem which says you can’t have consistency, availability and partition tolerance all at once. Partition tolerance is the prevention of split-brain in a cluster or distributed system – you don’t want network failures to allow data corruptions or incorrect results.

Since you can’t have all three, you choose two. So RAC does partition tolerance and consistency at the expense of availability – if the voting disk crashes or loses network connectivity, the entire cluster will go down.

NoSQL databases keep availability and partition tolerance at the expense of consistency. They have something called “Soft State” and “Eventual Consistency”. To the best of my understanding, “Eventual Consistency” means that all the DML statements in the transaction are inserted into a queue (or some equivalent), from which they are executed at different times by different servers. Eventually they are all executed and you reach a consistent state, but you don’t know when. Of course, with such a system, it appears nearly impossible to prevent lost updates.
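The lost-update problem can be sketched in a few lines. This is a toy model (my own illustration, not the behavior of any specific NoSQL engine): every replica eventually applies the same queued writes in timestamp order, so the replicas converge — yet a concurrent read-modify-write still disappears.

```python
# Toy model of eventual consistency with last-writer-wins replication.
# Illustration only; no real NoSQL product is modeled here.

class Replica:
    def __init__(self, value):
        self.value = value
        self.pending = []            # queued (timestamp, new_value) writes

    def apply_pending(self):
        # "Eventually" each replica drains its queue in timestamp order.
        for _, new_value in sorted(self.pending):
            self.value = new_value
        self.pending.clear()

a, b = Replica(10), Replica(10)

# Two clients read counter=10 from different replicas concurrently,
# then each issues a blind write of its computed result.
w1 = (1, 10 + 1)   # client 1: increment by 1 -> writes 11
w2 = (2, 10 + 5)   # client 2: add 5          -> writes 15

# Both writes replicate to both queues.
for replica in (a, b):
    replica.pending += [w1, w2]
    replica.apply_pending()

# The replicas agree (eventual consistency), but the intent was 16:
# client 1's increment is lost under last-writer-wins.
print(a.value, b.value)   # 15 15
```

The replicas reach the same state, which is all eventual consistency promises; detecting that the increment was lost would require something like transactions or version vectors, which these systems deliberately avoid.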

Notes about Hadoop

My notes from two presentations given to the data mining SIG of the local ACM chapter.

Hadoop is a scalable fault-tolerant grid operating system for data storage and processing.

It is not a database. It is more similar to an operating system: Hadoop has a file system (HDFS) and a job scheduler (Map-Reduce). Both are distributed. You can load any kind of data into Hadoop.
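The Map-Reduce part of that model is easy to sketch in miniature. The following is a single-process illustration of the programming model only (real Hadoop distributes both phases across the cluster and runs them as Java jobs); the word-count task is the customary example:

```python
# Minimal single-process sketch of the MapReduce programming model.
# Illustration only; Hadoop runs map and reduce tasks across many nodes.
from collections import defaultdict

def map_phase(records):
    # map: emit (key, value) pairs -- here, (word, 1) for each word
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # shuffle: group values by key; reduce: aggregate each group
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

data = ["big data big clusters", "big jobs"]
print(reduce_phase(map_phase(data)))
# {'big': 3, 'data': 1, 'clusters': 1, 'jobs': 1}
```

Because the map function sees one record at a time and the reduce function sees one key at a time, both phases parallelize trivially, which is what lets Hadoop spread them over hundreds of machines.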

It is quite popular – the last Hadoop Summit had 750 attendees. Not bad for a new open-source technology. It is also quite efficient for some tasks: a Hadoop cluster of 1,460 nodes can sort a terabyte of data in 62 seconds – currently the world record for sorting a terabyte.

Hadoop Design Axioms:

  • The system will manage and heal itself. (Because it uses commodity hardware, failure is inevitable.)
  • Performance will scale linearly. (With a few limitations.)
  • Compute should move to data. (The processing job should run on the machine holding the data it processes.)
  • Simple core, modular and extensible.

Distributed file system. The block size is 64M (!). The user configures the replication factor – each block is replicated on K machines (K chosen by the user). Additional replicas can be configured for hot blocks.
A name node keeps track of the blocks, and if a node fails, the data on it is re-replicated to other nodes.
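The name node's bookkeeping can be sketched as follows. This is a simplification under stated assumptions (naive round-robin placement, no rack awareness, names like `place_block` and `handle_failure` are my own): files split into fixed-size blocks, each block placed on K nodes, and re-replication when a node dies.

```python
# Sketch of name-node style block bookkeeping. Simplified illustration:
# real HDFS placement is rack-aware and far more involved.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as in the notes
K = 3                            # replication factor, chosen by the user

def split_into_blocks(file_size):
    # number of blocks a file of file_size bytes occupies (round up)
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_block(block_id, nodes):
    # naive round-robin placement of K replicas on distinct nodes
    return [nodes[(block_id + i) % len(nodes)] for i in range(K)]

def handle_failure(placements, dead, nodes):
    # re-replicate every block that lost a copy onto a surviving node
    alive = [n for n in nodes if n != dead]
    for replicas in placements.values():
        if dead in replicas:
            replicas.remove(dead)
            spare = next(n for n in alive if n not in replicas)
            replicas.append(spare)
    return placements

nodes = ["dn1", "dn2", "dn3", "dn4"]
# A 200 MB file occupies 4 blocks of 64 MB.
placements = {b: place_block(b, nodes)
              for b in range(split_into_blocks(200 * 1024 * 1024))}
handle_failure(placements, "dn2", nodes)   # every block is back to K copies
```

The point of the sketch is the invariant, not the placement policy: after any single node failure, every block ends up with K replicas again, which is why losing commodity hardware is routine rather than catastrophic.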

Distributed jobs. Hadoop tries to run jobs local to their data to avoid network overhead. It also detects failures, and even servers lagging behind in their processing: if a part of the job is lagging, it will start copies of that part on other servers, hoping that one of the copies will finish faster.
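That "start copies and take the fastest" idea (speculative execution) is simple to model. A minimal sketch, with my own toy data rather than anything from a real cluster:

```python
# Toy model of speculative execution: a task finishes as soon as its
# fastest launched copy does, so one straggling node no longer gates
# the whole job. Illustration only, not Hadoop's actual scheduler.

def run_with_speculation(task_durations):
    # task_durations: per task, the runtimes of each launched copy
    return [min(copies) for copies in task_durations]

# Task 3 is a straggler (60s on a slow node); a speculative copy on a
# healthy node takes 12s, so the job completes in 12s, not 60s.
durations = [(10,), (11,), (60, 12)]
print(max(run_with_speculation(durations)))  # 12
```

The trade-off is wasted work: the slow copy's effort is thrown away, which Hadoop accepts because a whole job waiting on one lagging server is usually worse.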

Hadoop Ecosystem: