Cost Is Time: Next Generation

It looks like Oracle has introduced a new "cost is time" model for the time estimate of the Cost-Based Optimizer (CBO) with the 11.2.0.2 patch set.

In order to understand the implications let me summarize the evolution of the CBO in terms of cost / time estimate so far:

1. Oracle 7 and 8

The cost estimate generated by the Cost-Based Optimizer (CBO) has always been a time estimate, although expressed in a slightly obscure unit, which is the number of single block reads.
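As a rough illustration (my own sketch, not from the post itself): under the traditional model the TIME column of an execution plan is essentially the COST multiplied by the single block read time, SREADTIM, recorded in the system statistics – in other words, the cost is the estimated time expressed as a number of single block reads. With an SREADTIM of 8 ms, a cost of 1,000 corresponds to an estimated elapsed time of roughly 1000 * 8 / 1000 = 8 seconds. You can look up SREADTIM like this, assuming you have access to SYS.AUX_STATS$:

-- single block read time (in milliseconds) from the gathered system statistics
select pname, pval1
from   sys.aux_stats$
where  sname = 'SYSSTATS_MAIN'
and    pname = 'SREADTIM';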

Alternative opinions about Oracle Linux vs. RHEL…

Jay Weinshenker has written a couple of good posts in response to my recent post on Oracle Linux vs. RHEL.

I don’t agree 100% with all his points, but I always think it’s good to hear different sides of the story, and I certainly enjoyed reading them.

Cheers

Tim…




The FizzBuzz Oracle Database Coding Challenge

July 26, 2011 Through a web search I located a page titled “Coding Horror: Why Can’t Programmers.. Program?“  A simple question was asked in an interview, and apparently 199 of 200 programmers struggled to build a solution for the problem in less than ten minutes.  The problem must be that the 199 people who did not succeed [...]
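For context, the classic FizzBuzz task is: print the numbers 1 to 100, but print “Fizz” for multiples of 3, “Buzz” for multiples of 5 and “FizzBuzz” for multiples of both. One of many possible single-statement Oracle SQL solutions (my own sketch, not taken from the challenge itself) would be:

-- generate 1..100 and substitute Fizz/Buzz/FizzBuzz as appropriate
select case
         when mod(level,15) = 0 then 'FizzBuzz'
         when mod(level,3)  = 0 then 'Fizz'
         when mod(level,5)  = 0 then 'Buzz'
         else to_char(level)
       end as fizzbuzz
from dual
connect by level <= 100;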

IOT 2 – First examples and proofs

<.. IOT1 – Basics
IOT3 – Great reductions in IO for IOTs..>
IOT4 – Boosting Buffer Cache Efficiency….>
IOT5 – Primary Key issues……>

In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.

I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.

All of this is on Oracle 11.1 but it is exactly the same on 10.2. The block size is 8K and the tablespaces are auto segment space managed.

Here are the creation statements:

--first create the parent table, keyed by ID.
-- The other columns are not significant, they just represent "information"
create table mdw.parent
(id       number(10)    not null
,name     varchar2(100) not null
,date_1   date          not null
,num_1    number(2)
,num_2    number(2)
,constraint pare_pk primary key(id)
 using index tablespace index_01
)
tablespace data_01
/
--
--Now put my 9999 parents into the table.
insert into parent
select rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level < 10000
/
--
-- create the table to hold the children as a heap table
create table child_heap
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chhe_pk primary key(pare_id,cre_date)
 using index tablespace index_01
)
tablespace data_01
/
--
-- create the table to hold the children as an IOT table
create table child_iot
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chio_pk primary key(pare_id,cre_date)
-- using index tablespace index_01 -- CANNOT STATE for IOT. State in table definition
)
ORGANIZATION INDEX -- This is it. This makes the table an IOT
tablespace data_01
/

There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.

The main one is the inclusion of the line ORGANIZATION INDEX, which is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state one; the IOT is created based on the Primary Key.
The other change is that you can no longer state the tablespace for the Primary Key index. I’ve not played with this at all, but I don’t think you can state anything with the “using index” clause, as the table storage clauses are used for the Primary Key index. I personally find this a little illogical, as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.

When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.

I then populated the two CHILD tables with data. The method of creating this test data is very important.

I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is arriving as a record or set of records each day, i.e. not where the parent and all of its child records are created at one time, like an order and its order lines. I am simulating where the child data is created a few records at a time, not all in one go.

The code is simple. It loops for one hundred days and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average, each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.

The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same but this will do}. Below is the statement for one table, you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and is neat, once you have seen it once}.

declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into CHILD_HEAP
    (pare_id,cre_date,vc_1,date_1,num_1,num_2)
  select
    trunc(dbms_random.value(1,v_num))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/

I then gathered object stats on the tables.
Let’s check the size of the tables:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CHILD%';

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHILD_HEAP      TABLE           DATA_01              12288

1 row selected.

ONE row? Where is the other table, where is CHILD_IOT? It does not exist.

Remember from my first post that I made the comment that I would have preferred it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway and I’ve been using them on and off for years :-) .

Let’s look at the size of the primary key indexes:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CH%PK'
and segment_name not like '%ORD%'

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHHE_PK         INDEX           INDEX_01              4224
CHIO_PK         INDEX           DATA_01              19456

2 rows selected.

Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, whereas the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the combined size of the CHILD_HEAP table and the CHHE_PK index. So much for my saying last post that IOTs can save disk space? I’ll come back to that in a later post…

Here are some other stats from one of my scripts:

mdw11> @tab_sci_own
owner for Table: mdw
Name for Table: child_heap

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_HEAP          1000,000      12,137    83 YES NO  250711 22:01 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHHE_PK         NOR NO  UNI  2      4,034    1000,000      995,857          1          1 250711 22:02

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHHE_PK                      CHILD_HEAP       1   PARE_ID
CHHE_PK                      CHILD_HEAP       2   CRE_DATE

--
--
owner for Table: mdw
Name for Table: child_iot

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_IOT           1000,000                83 YES NO  250711 22:03 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHIO_PK         IOT NO  UNI  2     17,855     910,881            0          1          1 250711 22:03

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHIO_PK                      CHILD_IOT        1   PARE_ID
CHIO_PK                      CHILD_IOT        2   CRE_DATE

Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.

The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different Table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.

If the clustering factor is close to the number of records in the table then it means there is no correlation between index order and table row order, and such index range scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.
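If you want to compare the clustering factor to the table blocks and row counts for yourself, a query along these lines (against the standard USER_INDEXES and USER_TABLES views, nothing specific to my scripts) will do it:

-- clustering factor vs table blocks and rows for the two primary keys
select i.index_name, i.clustering_factor, i.leaf_blocks, t.blocks, t.num_rows
from   user_indexes i
       join user_tables t on t.table_name = i.table_name
where  i.index_name in ('CHHE_PK','CHIO_PK');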

The depth of the index does not change, being 3 in each case (a BL, or blevel, of 2).

So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the first post? Here we go {oh, usual disclaimer, I ran the code twice and show the second run, to remove the parsing overhead}:

-- First the Heap table
select * from child_HEAP where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- ------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11  LUTFHOCIJNYREYICQNORREAJOVBRIHFVLXNIGIVZDMFJCTGYFWC
25-JUN-11         11         16
1 row selected.

Execution Plan
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |     1 |    83 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |     1 |    83 |     3   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CHHE_PK    |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

--and now the IOT table

select * from child_IOT where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 21:23:41','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- -------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11
CSIGBHSXWNDDTCFRCNWYPRNLEQWPCRYTXQQZHACDEXHOBEYXLNYBHRUHJ
27-JUN-11          7         52
1 row selected.

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |    83 |     2   (0)| 00:00:01 |
|*  1 |  INDEX UNIQUE SCAN| CHIO_PK |     1 |    83 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}

To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid, and this took 4 consistent gets: 3 to walk down the index and get the rowid, and one to look up the table block holding the row.

For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All data for the row was there in the index entry rather than the rowid. Thus only 3 consistent gets were required.

For single row lookups on the Primary Key, IOTs are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.

Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post :-) .

For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.

select count(*) from child_heap where pare_id = 2

  COUNT(*)
----------
        98

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHHE_PK |   100 |   400 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

--
--

select count(*) from child_iot where pare_id = 2

  COUNT(*)
----------
        93

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |   100 |   400 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no surprise as we have an 8k block size and only around 100 rows for a given parent, so they happen to fit into one block of the index. So Oracle works down the depth of the index and looks at one leaf block.

For the IOT the scan works down the index in the same way but has to scan two leaf blocks. Even though there are fewer entries, 93 compared to 98, they span more than one leaf block and thus the total number of consistent gets is 4.

Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will span 2 blocks of the heap table’s Primary Key and occasionally the entries for the IOT will fit into 2 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal index and 17,855 for the IOT, both for the same one million entries), usually the 100 or so entries for a single parent in the normal index will all fall into one block and the entries for the IOT will fall into between 2 and 3 blocks.
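As a back-of-the-envelope check (my arithmetic, based on the leaf block counts above): 1,000,000 entries over 4,034 leaf blocks is roughly 250 entries per leaf block for CHHE_PK, so the ~100 entries for one parent will usually sit in a single leaf block; 1,000,000 entries over 17,855 leaf blocks is roughly 56 entries per leaf block for CHIO_PK, so the same ~100 entries will typically span two to three leaf blocks.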

A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:

mdw11> select count(*) from child_heap

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |   989   (1)| 00:00:15 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHHE_PK |  1000K|   989   (1)| 00:00:15 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          2  db block gets
       4109  consistent gets
       4088  physical reads

mdw11> select count(*) from child_iot

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |  4359   (1)| 00:01:05 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHIO_PK |  1000K|  4359   (1)| 00:01:05 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      19298  consistent gets
      19246  physical reads

The number of consistent gets (and physical reads) is close to the number of leaf blocks in the two segments, though a little higher. This is because Oracle is scanning the whole index, leaf blocks and branch blocks. The scan is far more expensive for the IOT, simply because the index is so much larger. I’ve not shown timings, but on my little laptop the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on CHILD_IOT.

That is enough for one post.


O-1 Again

I don’t often waste time searching the internet for evidence of my brilliance – but sometimes you’ve just got to do it ;) My O-1 visa (aliens of exceptional ability) was up for renewal, and the immigration authority decided to ask me for more evidence of the significance of my work.

In part, I think, this was because they had decided that Cary Millsap and Tom Kyte were colleagues of mine and therefore couldn’t be used as referees for Cost-Based Oracle. This put me into a weird “Catch-22” situation – everyone I know who is in a position to act as a referee is someone I know quite well, which means they’re just as much a “colleague” as Tom or Cary. So anyone I could ask to write a reference presumably wouldn’t be acceptable because I’d asked them!

In the end I asked the presidents of a number of international Oracle User Groups and editors of various magazines to make a statement for me – and they were all very helpful, for which I am grateful.

I also found a couple of dozen books and a handful of academic papers that reference my work – and that must have helped too. I also did a search for salary surveys for the USA – it looks as if my fee income puts me somewhere between a top-rate gynaecologist and a top-rate surgeon (there’s an analogy lurking there somewhere).

Anyway, my passport arrived back from the US Embassy early this morning – so I’m ready for my next trip to the USA (California and Minnesota – 16th Aug)

OpenWorld Suggest-a-Session @ Oracle Mix

There has been quite a bit written about Oracle Mix Voting (let me refer to it as OMV for short) this year and I don’t think I should waste your time restating any of this. What I wanted to do is to take a step back and try to list “stakeholders” of OMV. Before I [...]

Method-R Tools

It’s rare to find professional-level tools that are written by the people who use them. Most professional tools are written by teams of people who have little idea of how they are used, and many users write tools that never reach a professional level. Thus it is a pleasure to see the tools put out by Method R with Cary Millsap and Jeff Holt. Jeff and Cary have been well known in the industry for their work and contributions to the world of Oracle performance tuning. With Method R Tools they not only share the know-how and tools they use to tune systems, but they have polished them into solid, clean, tight professional-level tools that are dependable, high quality and written with integrity.

See: Pythian’s blog on Method-R tools

Method R Tools

  1. MR Trace ($49.95) – automatically creates, collects and manages trace files for queries run in Oracle’s SQL Developer, using extended trace file options targeted at collecting pertinent performance information for optimizing SQL statements.
  2. MR Tools (starting at $397)

mrskew - your trace file profiler and data miner, the core MR Tools product, check out Karl Arao’s examples

other tools

  • mrls - trace file lister, examples, list trace files sorted by  LIO or sorted by rows returned
  • mrcallrm - trace file cropper – retaining required info and filtering undesired

Trace file timings:

  • mrtim - timestamp converter
  • mrtimfix - your trace file tim value fixer, fix 11gR1 bug in trace files & change timing to UNIX epoch
  • mrnl - pre-pend lines in trace with line #, time, duration and other options as well
  3. Method R Profiler (starting at $2,500)

The Method R Profiler is the gold standard of the Method R tools. The Profiler takes all the know-how and analysis and wraps it up in a clean, crisp HTML UI that makes for easy browsing, exploration and analysis.

I look forward to blogging soon on some of the innovative and powerful features of Method R Profiler.

A quick summary of things in the report:

1. Task-Level Profiles
1.1 Profile by subroutine
   subroutines:
     events like:
      db file sequential read
      db file scattered read
   spark line
   table, each line is a bucket between 1us and 1000 secs, in orders of 10
      call count
      total seconds
1.2 next subroutine

3. Contribution to Subroutine by Statement
3.1 Statement Contributions to <>
     statement text (with sql hash)
       seconds duration
       count
       mean
       min
       max
     drill down to stats
3.2 next subroutine

4. Statement Detail
4.1 Statement Report for <>
   Oracle hash value:
   Full statement text: drilldown
   Re-use: # distinct text
   Text size: x lines, y bytes
   # of distinct plans: X
   Optimizer goals:
   response time contribution:
4.1.1 Profile by Contribution to Parent Cursor
4.1.2 Profile by Subroutine
     waits, CPU, time by
     seconds, count, mean, min, max, skew sparkline
4.1.3 Profile by Database Call
      for
        FETCH
        between calls
        EXEC
        PARSE
        cursor close
     show
       seconds
       CPU
        rows processed
        buffer cache access total
         CR
         CU
      PIOs
      Library cache misses
4.1.4 Execution Plans and Placeholder Values
      Execution plan
        seconds for row
        seconds including descendants
        rows returned
        CR LIOs
        PIOs read
        PIOs writes
4.1.5 Statement Text
4.2 ... next statement ...
5. Configuration information

The Oracle Exadata IO Resource Manager Limit Clause, part 2

In my previous post I described how the IO Resource Manager (IORM) in the Exadata storage can be used both to guarantee a minimum amount of IO a database can get (which is what is covered in most material available) and to set a maximum amount of IO. This is what Oracle calls an inter-database resource manager plan. It is set and configured at the cell level using cellcli with ‘alter iormplan dbplan’.
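As a rough sketch of the kind of command involved (the database names here are made up, and the exact directive syntax should be checked against the Exadata documentation for your cell software version), an inter-database plan combining an allocation with the limit clause might look like:

CellCLI> alter iormplan dbplan=((name=prod, level=1, allocation=75, limit=60), -
                                (name=other, level=2, allocation=100))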

So you don’t see any disks when trying to install ASM?

This is a post that highlights the differences between operating systems, but also the fact that sometimes it is hard to break out of a habit once you have got used to it. My background is that of a Linux enthusiast; even though I equally like Solaris and AIX, I have a little less exposure to those.

Background

I have recently been asked to look at RAC on SPARC, which I gladly did. The system I was given had the usual software stack for RAC at this customer’s site. It comprised:

  • Solaris 10 Update 9 64bit on SPARC
  • EMC Power Path 5.1
  • EMC VMAX storage – 10x10G LUNs for a specific performance test

The Power Path configuration had already been in place when I got the machine, and I was allocated /dev/emcpower2[0-9] for my ASM testing. For the experienced Linux user who relies on device-mapper-multipath, the Power Path naming convention can be a bit confusing at first. For reference, the pseudo devices we are interested in for ASM are created under /dev/rdsk/ – the “raw” device directory for “character” based access, rather than the block devices in /dev/dsk/. By default, the Power Path devices are called “emcpower”, followed by a number and a letter (on SPARC). An example would be /dev/rdsk/emcpower20c.

The number (20) is just a sequence number and doesn’t have any other meaning as far as I know. It also doesn’t seem to be consistent across the cluster nodes by default. The suffix letter (“c”) does have a meaning though: it indicates the slice of the device. For example, the letter “a” indicates slice 0, xxxc is the whole disk, and xxxg is slice 6.

You can use the format command to have a look at the slices currently defined. Simply type “partition” and then “print” to print it.

Now all the disks I was given had a slice 6, from cylinder 274 to the end of the disk:

# format
[disk selection not shown here]
format> partition
partition> print
Current partition table (original):
Total disk cylinders available: 10238 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 -   136      128.44MB    (137/0/0)     263040
  1       swap    wu     137 -   273      128.44MB    (137/0/0)     263040
  2     backup    wu       0 - 10237        9.37GB    (10238/0/0) 19656960
  3 unassigned    wm       0                0         (0/0/0)            0
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6        usr    wm     274 - 10237        9.12GB    (9964/0/0)  19130880
  7 unassigned    wm       0                0         (0/0/0)            0

partition> quit

If you’d like to see how the disks are connected to the array, or if there are any faults, use powermt display dev=all, or alternatively look at a specific device:

# powermt display dev=emcpower20
Pseudo name=emcpower20a
Symmetrix ID=00x2x4x0x6x4
Logical device ID=0x0x
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
3072 pci@400/pci@0/pci@c/fibre-channel@0/fp@0,0 c2t50000974C00A6118d17s0 FA  7eA   active  alive       0      0
3077 pci@500/pci@0/pci@9/fibre-channel@0/fp@0,0 c3t50000974C00A611Cd17s0 FA  8eA   active  alive       0      0

If you care to learn more about how your system administrator configured the paths (I have 2 HBAs as you can see), use “powermt display”. Also note the disks c2t5….s0 and c3t5…s0 are the native devices aggregated into emcpower20a.

I have changed the permissions on the pseudo devices to 664 and made them owned by oracle:oinstall before launching OUI. In the disk discovery screen I changed the disk discovery string to /dev/rdsk/emcpower2* but didn’t see a single disk in the OUI window. That was odd.
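For reference, that amounted to something along these lines, run as root (the device list will vary; the /dev/rdsk entries are symlinks on Solaris, and chown/chmod follow them through to the underlying /devices nodes):

# chown oracle:oinstall /dev/rdsk/emcpower2*
# chmod 664 /dev/rdsk/emcpower2*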

Troubleshooting disk discovery

This error led me to look at /tmp/OraInstall/installActions.log which referenced a call to kfod, the disk discovery tool. The entry in the log file showed this content:

WARNING: SRVCTL not found as there is no Grid Infrastructure home on the box.
 INFO: Executing [/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod, nohdr=true, verbose=true, disks=all, status=true\
 op=disks, asm_diskstring='/dev/rdsk/emcpower2*']
 with environment variables {XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt, DISPLAY=localhost:12.0,\
 PWD=/u01/app/oracle/stage/grid, LC_CTYPE=en_GB.ISO8859-1, _=./runInstaller, USER=oracle,\
 NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat, LC_COLLATE=en_GB.ISO8859-1, CLASSPATH=, HOME=/export/home/oracle,\
 LC_NUMERIC=en_GB.ISO8859-1, SSH_CONNECTION=10.128.46.56 52883 10.129.48.214 22,\
 LD_LIBRARY_PATH=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/../lib,\
 ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin, SHELL=/usr/bin/bash, LOGNAME=oracle,\
 SSH_TTY=/dev/pts/3, SHLVL=1, LC_MONETARY=en_GB.ISO8859-1, MANPATH=/usr/share/man:/usr/openwin/share/man:\
 /db/pub/man:/db/pub/infra/samba/current/man:/db/pub/infra/rsync/current/man:/db/pub/infra/rcs/current/man:\
 /db/pub/infra/tcpdump/current/man:/usr/dt/share/man:/usr/java1.2/man:/usr/apache/man:/usr/perl5/man:/usr/j2se/man:\
 /usr/local/man:/opt/SUNWconn/ge/man:/opt/SUNWconn/man:/opt/SUNWsan/man:/opt/VRTS/man:/opt/DBpam/man:\
 /opt/SUNWexplo/man:/opt/SUNWcstu/man:/opt/VRTSvlic/man, SSH_CLIENT=10.128.46.56 52883 22, \
 MAIL=/var/mail//oracle, TZ=GB-Eire, LC_TIME=en_GB.ISO8859-1, JAVA_HOME=, LC_MESSAGES=C, PS1=[\u@\h \W]> , \
 OLDPWD=/u01/app/oracle/stage, LC_ALL=, TERM=xterm, TMOUT=0, PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/opt/EMCpower/bin:\
 /etc/emc/bin:/etc:/zones/app/oracle/stage/grid/install}
 INFO: Starting Output Reader Threads for process /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod
 INFO: Parsing KFOD-00311: Error scanning device /dev/rdsk/c3t50000974C00A611Cd16s7
[more errors skipped]
..

(Re-) Discovery of KFOD

So therein lies the problem! As a side effect I rediscovered the command line options used with kfod! Maybe I can use this for some experiments… All you need to do is to set environment variables for ORACLE_HOME and LD_LIBRARY_PATH as per the above output.
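Based on the environment shown in the log output above, that boils down to something like:

$ export ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin
$ export LD_LIBRARY_PATH=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/../lib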

After a little fiddling around I remembered that, unlike the tests with ZFS I did before, I can’t pass the whole disk to ASM (the VTOC is in the first part of the disk!) but rather have to use a slice. So by changing the disk string to /dev/rdsk/emcpower2*g I got a successful report from kfod, executed as “oracle”:

$ /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod verbose=true, disks=all status=true op=disks \
> asm_diskstring='/dev/rdsk/emcpower2*g'
--------------------------------------------------------------------------------
 Disk          Size Header    Path                                     User     Group
================================================================================
   1:       9341 Mb CANDIDATE /dev/rdsk/emcpower20g                    oracle   oinstall
   2:       9341 Mb CANDIDATE /dev/rdsk/emcpower21g                    oracle   oinstall
   3:       9341 Mb CANDIDATE /dev/rdsk/emcpower22g                    oracle   oinstall
   4:       9341 Mb CANDIDATE /dev/rdsk/emcpower23g                    oracle   oinstall
   5:       9341 Mb CANDIDATE /dev/rdsk/emcpower24g                    oracle   oinstall
   6:       9341 Mb CANDIDATE /dev/rdsk/emcpower25g                    oracle   oinstall
   7:       9341 Mb CANDIDATE /dev/rdsk/emcpower26g                    oracle   oinstall
   8:       9341 Mb CANDIDATE /dev/rdsk/emcpower27g                    oracle   oinstall
   9:       9341 Mb CANDIDATE /dev/rdsk/emcpower28g                    oracle   oinstall
  10:       9341 Mb CANDIDATE /dev/rdsk/emcpower29g                    oracle   oinstall
KFOD-00311: Error scanning device /dev/rdsk/emcpower2g
ORA-15025: could not open disk "/dev/rdsk/emcpower2g"
SVR4 Error: 13: Permission denied
Additional information: 42
Additional information: 272256
Additional information: 12699536

I haven’t worked out which wildcard characters are valid in the ASM_DISKSTRING (‘/dev/rdsk/emcpower2[0-9]g’ didn’t work), so I left it there. The emcpower2g device was not meant to be used in the test anyway.

With this setting I returned to the OUI window and continued the installation.

Lesson Learned

Oracle doesn’t always have useful logs, but in this case it was very useful to look at the install actions file in /tmp. Also remember that preparing ASM disks on Solaris and other platforms is different from doing so on Linux.

Logical I/O - Evolution: Part 2 - 9i, 10g Prefetching

In the initial part of this series I've explained some details regarding logical I/O using a Nested Loop Join as example.

To recap I've shown in particular:

- Oracle can re-visit pinned buffers without performing logical I/O

- There are different variants of consistent gets: a "normal" one involving buffer pin/unpin cycles that requires two latch acquisitions, and a short-cut variant ("examination") that visits the buffer while holding the corresponding "cache buffers chains" child latch and therefore requires only a single latch acquisition (see the query sketch after this list)

- Although two statements use a similar execution plan and produce the same number of logical I/Os, one is significantly faster and scales better than the other one
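As a small aside (my own sketch, not from the original article): you can watch both flavours of consistent gets in your own session via the standard statistics views, assuming the usual V$ privileges:

-- "consistent gets - examination" counts the single-latch short-cut variant
select sn.name, ms.value
from   v$statname sn
       join v$mystat ms on ms.statistic# = sn.statistic#
where  sn.name in ('consistent gets', 'consistent gets - examination');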