Parallel Query 10046 trace

Continuing from the previous post,

Parallel Query Trace

Without the USE_CONCAT hint, PX-related events dominated the wait events:

select A.* from lineitem A, lineitem_org B, lineitem_comp C
where A.l_orderkey = B.l_orderkey
and A.l_orderkey = C.l_orderkey
and (
(A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%')
or (A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%')
or (A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%')
or (A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%')
or (A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%')
)

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 3.21 222.80 57423 57838 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 3.21 222.80 57423 57838 0 0
...
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
PX Deq: Execution Msg 67 0.01 0.03
Disk file operations I/O 1 0.00 0.00
direct path read 506 0.00 1.16
PX Deq Credit: need buffer 253 0.06 0.64
PX Deq Credit: send blkd 8375 0.25 217.64
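For reference, the numbers above come from a 10046 trace formatted with tkprof. A minimal sketch of capturing such a trace (the tracefile identifier `px_test` is just an illustrative name):

```sql
-- Tag the trace file so it is easy to find in the trace directory
ALTER SESSION SET tracefile_identifier = 'px_test';

-- Level 8 includes wait events (level 12 would also add bind values)
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

-- ... run the query under investigation ...

ALTER SESSION SET EVENTS '10046 trace name context off';
```

Note that with Parallel Query each PX slave writes its own trace file; a tkprof report like the one above only covers what the traced session (the query coordinator) saw.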

Tuning the Parallel Query

The problem SQL, thrown together off the top of my head:

SQL> select A.* from lineitem A, lineitem_org B, lineitem_comp C
2 where A.l_orderkey = B.l_orderkey
3 and A.l_orderkey = C.l_orderkey
4 and (
5 (A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%')
6 or (A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%')
7 or (A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%')
8 or (A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%')
9 or (A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%')
10 )


Let's change the approach slightly:


SQL> select A.* from lineitem A, lineitem_org B, lineitem_comp C
2 where A.l_orderkey = B.l_orderkey
3 and A.l_orderkey = C.l_orderkey
4 and A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%'
5 union
6 select A.* from lineitem A, lineitem_org B, lineitem_comp C
7 where A.l_orderkey = B.l_orderkey
8 and A.l_orderkey = C.l_orderkey
9 and A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%'
10 union
11 select A.* from lineitem A, lineitem_org B, lineitem_comp C
12 where A.l_orderkey = B.l_orderkey
13 and A.l_orderkey = C.l_orderkey
14 and A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%'
15 union
16 select A.* from lineitem A, lineitem_org B, lineitem_comp C
17 where A.l_orderkey = B.l_orderkey
18 and A.l_orderkey = C.l_orderkey
19 and A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%'
20 union
21 select A.* from lineitem A, lineitem_org B, lineitem_comp C
22 where A.l_orderkey = B.l_orderkey
23 and A.l_orderkey = C.l_orderkey
24 and A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%'
25 /

no rows selected.
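The manual UNION rewrite above is essentially what the USE_CONCAT hint asks the optimizer to do: expand the OR predicates into a concatenation of subplans. A hedged sketch of the hinted form of the original query (whether the optimizer actually accepts the OR-expansion depends on the predicates and the version):

```sql
select /*+ USE_CONCAT */ A.*
from lineitem A, lineitem_org B, lineitem_comp C
where A.l_orderkey = B.l_orderkey
and A.l_orderkey = C.l_orderkey
and (
   (A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%')
or (A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%')
or (A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%')
or (A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%')
or (A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%')
);
```

The hinted form keeps the query declarative, while the UNION rewrite makes the expansion explicit at the cost of five copies of the join.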

Alex Hutton Podcast on data breach

Lindsay blogged about the recent data breach report from Verizon last week. Alex Hutton, one of the authors, has just re-tweeted DennisF's tweet that he has done a podcast about the data breach report. Enjoy! EDITED: I incorrectly added....[Read More]

Posted by Pete On 16/08/10 At 07:46 PM

Verifying Parallel Query

Continuing the TPC-H benchmark:

The CPUs barely got busy, and it was nothing but direct path read waits...
Disk throughput, far from approaching its limit, actually dropped.
Something is definitely wrong!!!


Going back to basics, I set out to verify just how much value Parallel Query really delivers.

Using the large tables built during past tests, I threw together some SQL that looked like a heavy load:
LINEITEM      : partitioning + compress
LINEITEM_COMP : compress
LINEITEM_ORG  : original

SQL> select A.* from lineitem A, lineitem_org B, lineitem_comp C
2 where A.l_orderkey = B.l_orderkey
3 and A.l_orderkey = C.l_orderkey
4 and (
5 (A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%')
6 or (A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%')
7 or (A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%')
8 or (A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%')
9 or (A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%')
10 )


First, run it with parallelism disabled (NO_PARALLEL):

SQL> ALTER SESSION FORCE PARALLEL QUERY PARALLEL 1;

SQL> select A.* from lineitem A, lineitem_org B, lineitem_comp C
2 where A.l_orderkey = B.l_orderkey
3 and A.l_orderkey = C.l_orderkey
4 and (
5 (A.l_receiptdate < B.l_commitdate and A.l_comment like '%obata%')
6 or (A.l_receiptdate > B.l_commitdate and B.l_comment like '%tanaka%')
7 or (A.l_receiptdate <> B.l_commitdate and A.l_comment like '%ichiro%')
8 or (A.l_receiptdate > C.l_commitdate and C.l_comment like '%tanaka%')
9 or (A.l_receiptdate <> C.l_commitdate and C.l_comment like '%ichiro%')
10 )
11 /

no rows selected.

Elapsed: 00:14:33.61


Execution Plan
----------------------------------------------------------
Plan hash value: 2728639899
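The serial baseline above took about 14.5 minutes. To compare against a parallel run, the same FORCE PARALLEL mechanism can request a higher degree; a sketch (the DOP of 8 is an arbitrary choice for illustration):

```sql
-- Request a degree of parallelism of 8 for subsequent queries in this session
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;

-- ... re-run the test query ...

-- Return to the default (object-level) parallelism behaviour
ALTER SESSION ENABLE PARALLEL QUERY;
```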

Joins – MJ

The final join mechanism in my “all joins are nested loop joins” argument is the Merge Join – a join mechanism that depends on both its row sources being pre-sorted on the join columns. The note on hash joins pointed out that a “traditional” nested loop join may result in repeated visits to data blocks [...]

InSync10 Conference Melbourne 16-17 August 2010

Looking forward to catching up with everyone at the InSync10 Conference in Melbourne tomorrow. Should be a great event with lots of interesting presentations. I’ll be flying in from (not so) sunny Canberra tomorrow morning in time for most of the morning sessions. My presentation on “Indexing New Features in Oracle11g R1 and R2” is on the Tuesday morning at [...]

The renamedg command revisited: ASMLib

I have already written about the renamedg command, but have since fallen in love with ASMLib. Using ASMLib introduces a few caveats you should be aware of.

USAGE NOTES

This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this in production: the renamedg command is still buggy, and you should not mess with ASM disk headers on an important system such as production or staging/UAT. You decide what counts as important! The recommended setup for cloning disk groups is to use a Data Guard physical standby database on a different storage array to create a real-time copy of your production database on that array. Again, do not use your production array for this!

Walking through a renamedg session

Oracle ASMLib adds a new value, the provider string, to the ASM disk header, as the following example shows:

[root@asmtest ~]# kfed read /dev/oracleasm/disks/VOL1 | grep prov
kfdhdb.driver.provstr:     ORCLDISKVOL1 ; 0x000: length=12

This can be verified with ASMLib:

[root@asmtest ~]# /etc/init.d/oracleasm querydisk /dev/xvdc1
Device "/dev/xvdc1" is marked an ASM disk with the label "VOL1"

The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.

The problem with ASMLib is that the renamedg command does NOT update the provider string, which I'll illustrate by walking through an example session. Disk group "DATA", set up with external redundancy and two disks, DATA1 and DATA2, is to be cloned to "DATACLONE".

The renamedg command requires the disk group being cloned to be dismounted. To prevent nasty surprises, you should first manually stop the databases using that disk group.

[grid@rac11gr2drnode1 ~]$ srvctl stop database -d dev
[grid@rac11gr2drnode1 ~]$ ps -ef | grep smon
grid      3424     1  0 Aug07 ?        00:00:00 asm_smon_+ASM1
grid     17909 17619  0 15:13 pts/0    00:00:00 grep smon
[grid@rac11gr2drnode1 ~]$ srvctl stop diskgroup -g data
[grid@rac11gr2drnode1 ~]$

You can use the new “lsof” command of asmcmd to check for open files:

ASMCMD> lsof
DB_Name  Instance_Name  Path
+ASM     +ASM1          +ocrvote.255.4294967295
asmvol   +ASM1          +acfsdg/APACHEVOL.256.724157197
asmvol   +ASM1          +acfsdg/DRL.257.724157197
ASMCMD>

So apart from files in other disk groups, nothing is open, and in particular nothing referring to disk group DATA.

Now comes the part where you copy the LUNs, which depends entirely on your storage. The EVA series of storage arrays I worked with in this particular project offered a "snapclone" function, which used copy-on-write to create an identical copy of the source LUN with a new WWID (which can be an input parameter to the snapclone call). If you are using device-mapper-multipath, make sure your sysadmins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!

I am using Xen in my lab, which makes it simpler: all I need to do is copy the disk containers on the dom0 and then add the new block devices to the running domU ("virtual machine" in Xen terminology). This can be done easily, as the following example shows:

Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode> [BackDomain]

xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata1.clone xvdg w!

xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!
xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/rac11gr2drShared/oradata2.clone xvdh w!

In the example, rac11gr2drnode{1,2} are the domUs, the backend device is the copied file on the file system, the frontend device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark is crucial; without it the second domU can't attach the new block device because it is already exclusively attached to another domU.

In my example the fdisk command immediately "sees" the new LUNs; with device-mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. Again, it is very important to have all disks presented to all cluster nodes!

Here’s the sample output from my system:

[root@rac11gr2drnode1 ~]# fdisk -l | grep Disk | sort
Disk /dev/xvda: 4294 MB, 4294967296 bytes
Disk /dev/xvdb: 16.1 GB, 16106127360 bytes
Disk /dev/xvdc: 5368 MB, 5368709120 bytes
Disk /dev/xvdd: 16.1 GB, 16106127360 bytes
Disk /dev/xvde: 16.1 GB, 16106127360 bytes
Disk /dev/xvdf: 10.7 GB, 10737418240 bytes
Disk /dev/xvdg: 16.1 GB, 16106127360 bytes
Disk /dev/xvdh: 16.1 GB, 16106127360 bytes

I cloned /dev/xvdd and /dev/xvde to /dev/xvdg and /dev/xvdh.

Do NOT run /etc/init.d/oracleasm scandisks yet! Otherwise the renamedg command will complain about duplicate disk names, which is entirely reasonable.

I dumped all headers for disks /dev/xvd{d,e,g,h}1 to /tmp to be able to compare.

[root@rac11gr2drnode1 ~]# kfed read /dev/xvdd1 > /tmp/xvdd1.header
# repeat with the other disks

Start with phase one of the renamedg command:

[root@rac11gr2drnode1 ~]# renamedg phase=one dgname=DATA newdgname=DATACLONE \
> confirm=true verbose=true config=/tmp/cfg

Parsing parameters..

Parameters in effect:

 Old DG name       : DATA
 New DG name          : DATACLONE
 Phases               :
 Phase 1
 Discovery str        : (null)
 Confirm            : TRUE
 Clean              : TRUE
 Raw only           : TRUE
renamedg operation: phase=one dgname=DATA newdgname=DATACLONE confirm=true
  verbose=true config=/tmp/cfg
Executing phase 1
Discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
  disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
  disk number:1 and timestamp (32940276 1937075200)
Checking for hearbeat...
Re-discovering the group
Performing discovery with string:
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA1 with
  disk number:0 and timestamp (32940276 1937075200)
Identified disk ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:ORCL:DATA2 with
  disk number:1 and timestamp (32940276 1937075200)
Checking if the diskgroup is mounted
Checking disk number:0
Checking disk number:1
Checking if diskgroup is used by CSS
Generating configuration file..
Completed phase 1
Terminating kgfd context 0x2b7a2fbac0a0
[root@rac11gr2drnode1 ~]#

You should always check "$?" for errors: the message "terminating kgfd context" sounds bad, but isn't. At the end of phase one there is no change to the header; only phase two changes it:

[root@rac11gr2drnode1 ~]# renamedg phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg

Parsing parameters..
renamedg operation: phase=two dgname=DATA newdgname=DATACLONE config=/tmp/cfg
Executing phase 2
Completed phase 2

Now there are changes:

[root@rac11gr2drnode1 tmp]# grep DATA *header
xvdd1.header:kfdhdb.driver.provstr:    ORCLDISKDATA1 ; 0x000: length=13
xvdd1.header:kfdhdb.dskname:                   DATA1 ; 0x028: length=5
xvdd1.header:kfdhdb.grpname:               DATACLONE ; 0x048: length=9
xvdd1.header:kfdhdb.fgname:                    DATA1 ; 0x068: length=5
xvde1.header:kfdhdb.driver.provstr:    ORCLDISKDATA2 ; 0x000: length=13
xvde1.header:kfdhdb.dskname:                   DATA2 ; 0x028: length=5
xvde1.header:kfdhdb.grpname:               DATACLONE ; 0x048: length=9
xvde1.header:kfdhdb.fgname:                    DATA2 ; 0x068: length=5
xvdg1.header:kfdhdb.driver.provstr:    ORCLDISKDATA1 ; 0x000: length=13
xvdg1.header:kfdhdb.dskname:                   DATA1 ; 0x028: length=5
xvdg1.header:kfdhdb.grpname:                    DATA ; 0x048: length=4
xvdg1.header:kfdhdb.fgname:                    DATA1 ; 0x068: length=5
xvdh1.header:kfdhdb.driver.provstr:    ORCLDISKDATA2 ; 0x000: length=13
xvdh1.header:kfdhdb.dskname:                   DATA2 ; 0x028: length=5
xvdh1.header:kfdhdb.grpname:                    DATA ; 0x048: length=4
xvdh1.header:kfdhdb.fgname:                    DATA2 ; 0x068: length=5

Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we issued a scandisks command now through /etc/init.d/oracleasm, there would still be duplicate disk names. In my opinion this is a bug, and a bad one.

Renaming the disks is straightforward; the difficult bit is finding out which disks have to be renamed. Again, you can use kfed to figure that out. After consulting the header information, I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1.

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
Renaming disk "/dev/xvdd1" to "DATACLONE1":                [  OK  ]
[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
Renaming disk "/dev/xvde1" to "DATACLONE2":                [  OK  ]

I then performed a scandisks operation on all nodes just to be sure… I had corruption of the disk group before :)

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac11gr2drnode1 tmp]#

[root@rac11gr2drnode2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac11gr2drnode2 ~]#

The output should be identical on all cluster nodes; on my system I found the following disks:

[root@rac11gr2drnode1 tmp]# /etc/init.d/oracleasm listdisks
ACFS1
ACFS2
ACFS3
ACFS4
DATA1
DATA2
DATACLONE1
DATACLONE2
VOL1
VOL2
VOL3
VOL4
VOL5

Sure enough, the cloned disks were present. Although everything seemed OK at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify that problem. Maybe some not-so-transient information about ASM disks is stored somewhere. After the reboot, CRS started my database correctly, with all dependent resources:

[oracle@rac11gr2drnode1 ~]$ srvctl status database -d dev
Instance dev1 is running on node rac11gr2drnode1
Instance dev2 is running on node rac11gr2drnode2
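As a cross-check from the ASM side, disk names, paths, and header states can also be inspected through the V$ASM_DISK view; a sketch of the kind of query used (run from the ASM instance):

```sql
-- HEADER_STATUS should be MEMBER for disks belonging to a mounted disk group
select group_number, name, path, header_status
from v$asm_disk
order by group_number, name;
```

This makes it easy to confirm that the DATACLONE disks were discovered and are not still reporting the old disk group's identity.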

Stock Market Patterns: High Frequency Trade 'Crop Circle'


You'd think I'd find this on "Information is Beautiful", but I got the link from a friend on Facebook. The images below visually display the patterns of "High Frequency Trading" (HFT), which can involve bursts of high-frequency quotes used to manipulate the markets.

See original article (article part 2) (flash crash article)

"Aside from the fact that it is illegal to indicate a quote without a trade intent, this form of quote stuffing is in fact manipulative when conducted by HFT repeaters in specific "shapes" as it actually moves the NBBO actively higher or lower, in cases pushing the bid/offer range up to 10% higher without even one trade ever having occurred, simply by masking a big block order which other algos interpret as bid interest and pull all offers progressively or step function higher (or vice versa, although we have rarely if ever seen the walking down of a stock over the past 18 months)."

BATS "Flag Repeater". 15,000 quotes in 11 seconds, dropping the ASK price 1 penny each quote from $9.36 to $8.58 and back up again.

07-29-10
"The Crown". While not a large number of quotes, this NASDAQ/BATS Bidsize sequence was just too unusual to bypass.

07-27-10

NASDAQ "Blotter". One of the more unusual repeating Asksize cycles.


Tuning TPC-H (a 200% improvement?)

Recap of the story so far

  1. To raise disk throughput, I have been running single-session tests
  2. That yielded 420 to 450 MB/s
  3. To raise qph, I applied partitioning + compress and achieved up to a 300% improvement

Initially I assumed that making direct path reads more efficient would naturally raise disk throughput, and with it the TPC-H benchmark score. Now, however:

  • qph went up, but disk throughput settled at around 200 MB/s
    Is the drop only natural, since compression reduced the amount of data read?
    Or is it because Parallel Query worked effectively with the monthly partitioning?
  • Yet CPU% peaks at only about 20%
    If compression reduced the amount read, I want the next direct path read issued sooner!
    Since monthly partitioning made Parallel Query efficient, I want a higher direct path read volume!

Above all, the part that is not stable is that cheating 300% improvement!

Increasing the session count

With DOP=4, increasing the session count 2 -> 4 -> 8: the CPUs did get busy, but we fell right back into the old bad situation of mostly direct path read waits.

DB Optimizer tracks CPU Max (the red line) via the Processor Queue Length.
On Unix this would be the run queue.
This blog has been using CPU% as its metric, but Processor Queue Length is the more accurate indicator.
In this test, with 8 sessions, CPU% maxed out at around 70%; yet the Processor Queue Length exceeded the number of physical cores, showing that work was simply queuing.

CPU% cannot climb: sessions are stuck waiting on direct path read and queuing for the processors.

Finally,

Driving TPC-H hard on this prototype machine with no think time, two concurrent sessions turn out to be the limit.
Even with four SSDs in RAID 0 providing 520 MB/s of sequential read bandwidth, direct path read waits grow exponentially as the session count increases. 520 MB/s is not enough...
So the limit is around two concurrent sessions?
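The direct path read waits discussed throughout this series can be tracked at the instance level through V$SYSTEM_EVENT; a sketch of the kind of query run between test runs (what actually matters is the delta between two snapshots, not the cumulative totals):

```sql
-- Cumulative instance-wide wait statistics since startup
select event, total_waits, time_waited_micro
from v$system_event
where event in ('direct path read', 'PX Deq Credit: send blkd')
order by time_waited_micro desc;
```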

The Last Airbender…

The Last Airbender is really hard to make a judgement on. At some points the acting is really cringeworthy, followed by scenes with Dev Patel who is really cool. There are sections of dialog that could have been written by a five year old, then visuals that are absolutely fantastic.

I’m sure kids will like it as the problems with the dialog will probably not matter to them, and judging by the popularity of the Harry Potter and Twilight series, acting is not of prime importance either.

I hope they make the next two in the trilogy as I would really like to see what other visuals they come up with.

Cheers

Tim…