Investigating the full table direct path / buffered decision.

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication on how much potentially could be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and get an understanding when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information which the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It got a minus sign as first, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo here is by Oracle. Also this describes MTT, medium table threshold, which is 5 times small table threshold. It says that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the exadata code base is always enabled. Probably in version 10 this had to be added by adding code, alike linking in functionality such as RAC and SDO (spatial), by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line which is described above. Here islarge is set to 1. This means the objects is considered too large for being a small segment.
The next line is the kcbimd function, based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0, this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1, this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold being 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down the kcbivlo line says it is not large here too (is_large 0), which means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is the switch to considering doing a direct path read, alias a segment is considered medium sized, is simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line with segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true, the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database depended.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count) which is set to the multiblock read count value. The other one is dios (quite probably direct IO slots), which shows the starting value of the direct IO slots, which is the amount of IOs the database will issue asynchronously when starting a full segment scan. Fully automatic Oracle will measure throughput and CPU usage, and determine if more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless there was 99% or more of the blocks in the local cache. This is quite contrary to popular believe that the threshold is 50% of the blocks in cache to switch to reading blocks into the buffer cache. In all honesty, I have presented on the switch point value being 50% too.

When adding in writes to the mix it gets even more interesting. I first done an update of approximately 40% of the blocks, and did not commit. When tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information) it shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, doing direct path reads, and chkpt is set to 1 (I think indicating the need to checkpoint), which seems logical, if my session wants to do a direct path read of modified blocks.

Now this is how it looks like when I update 50% of the table:
First select count(*) from table:
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd…I first do a direct path read, and the second time I am not doing a no-direct alias buffered read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so should be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I can not see how it would be possible to do a direct path scan when there are changes on blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait events ‘cell multilbock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ from the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read for full table scan into the buffercache.
– A segment that is bigger than 500% of the buffercache is always scanned for read for full table scan via direct path.
– A segment that is sized between _small_table_threshold and 500% of the buffer cache is medium and could be full table scanned using direct path reads and using buffered reads.
– The tracing on 12.1.0.2 gives a hint there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and bigger than it. This is because of the function kcbcmt1 showing aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a scan for reading a full table scan is done via direct path, unless there are more than 99% of the blocks in the cache.
– For a medium sized segment, a scan for reading a full table scan is done via the buffercache if the amount of blocks that is “dirty” is 50% or more.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

Tagged: buffer cache, exadata, nsmtio, oracle, performance, trace