Oakies Blog Aggregator

SLOB Can Now Be Downloaded From GitHub.

This is a quick blog entry to announce that the SLOB distribution will no longer be downloadable from Syncplicity. Based on user feedback I have switched to making the kit available on GitHub. I’ve updated the LICENSE.txt file to reflect this distribution locale as authorized and the latest SLOB version is 2.4.2.1.

Please visit kevinclosson.net/slob for more information.

#000000;" src="https://kevinclosson.files.wordpress.com/2018/04/git.png?w=500&h=211" alt="" width="500" height="211" srcset="https://kevinclosson.files.wordpress.com/2018/04/git.png?w=500&h=211 500w, https://kevinclosson.files.wordpress.com/2018/04/git.png?w=150&h=63 150w, https://kevinclosson.files.wordpress.com/2018/04/git.png?w=300&h=126 300w, https://kevinclosson.files.wordpress.com/2018/04/git.png 527w" sizes="(max-width: 500px) 100vw, 500px" />

#000000;" src="https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920808264.png?w=500" alt="downloadbutton" srcset="https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920808264.png?w=500 500w, https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920... 150w, https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920... 300w, https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920... 768w, https://kevinclosson.files.wordpress.com/2018/04/downloadbutton-e1523920... 823w" sizes="(max-width: 500px) 100vw, 500px" />

A look into oracle redo, part 10: commit_wait and commit_logging

The redo series would not be complete without writing about changing the behaviour of commit. There are two ways to change commit behaviour:

1. Changing whether the session waits for notification from the logwriter that the generated redo has been persisted. The default is ‘wait’; this can be set to ‘nowait’.
2. Changing the way the logwriter handles the generated redo. The default is ‘immediate’; this can be set to ‘batch’.

There are actually three ways these changes can be made:
1. As an argument of the commit statement: ‘commit’ can be written as ‘commit write wait immediate’ (statement level).
2. As a system-level setting: when no explicit commit mode is given with the commit command, the values of the parameters commit_wait (default: wait) and commit_logging (default: immediate) apply.
3. As a session-level setting: when no explicit commit mode is given, setting commit_wait or commit_logging at session level overrides the system-level settings.
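
To make the three levels concrete, here is a minimal sketch of what each looks like in SQL (the syntax is standard, but verify the parameter behaviour on your own version):

commit write wait immediate;                 -- statement level: the default, spelled out
commit write nowait;                         -- statement level: do not wait for the logwriter

alter session set commit_wait = 'NOWAIT';    -- session level: overrides the system setting
alter session set commit_logging = 'BATCH';

alter system set commit_wait = 'WAIT';       -- system level: the default for all sessions
alter system set commit_logging = 'IMMEDIATE';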

At this point I should say that, in my personal opinion, if you need to change this, there is something very wrong with how the database is used in the first place. It can enhance performance a bit (depending entirely on what you are doing and what your hardware looks like), but it does nothing magical, as you will see.

a) commit wait/nowait
I ran a pin tools debugtrace on a session that commits explicitly with the write mode set to wait (the default), and on a session that commits explicitly with the write mode set to nowait. If you took the time to read the other redo-related articles, you know that a commit generates change vectors that are written into the public redo strand, changes the transaction table in the undo segment header, signals the logwriter to write in kcrf_commit_force_int, releases all transactional control on the rows in the committed transaction, and then calls kcrf_commit_force_int again in order to wait for the logwriter notification that the change vectors have been persisted.

When commit is set to nowait, what happens is actually very simple: everything that is executed in ‘wait mode’ is executed in ‘nowait mode’ too, except for the second call to kcrf_commit_force_int, which is the call that waits for the notification from the logwriter.

commit wait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x7a87a9a0, ...)
 | | | > ksugit_i(0x11526940, 0x7a87a9a0, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7ffda5959c50, 0x7a87a9a0, ...)
 | | | <> __sigsetjmp(0x7ffda5959c50, 0, ...)
 | | | <> __sigjmp_save(0x7ffda5959c50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kcbdsy(0x7ffda5959c50, 0x7f3011cbc028, ...)
 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
...
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | > kslws_check_waitstack(0x3, 0x7f3011d82f40, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x7a87a9a0, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7ffda5959b50, 0x7f3011d82f40, ...)
 | | | | | > kpdbUidToId(0, 0x7f3011d82f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x7a87a9a0, 0x1, ...)

commit nowait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x63c82a38, ...)
 | | | > ksugit_i(0x11526940, 0x63c82a38, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7fff43332a50, 0x63c82a38, ...)
 | | | <> __sigsetjmp(0x7fff43332a50, 0, ...)
 | | | <> __sigjmp_save(0x7fff43332a50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kslws_check_waitstack(0x3, 0x7fd1cea22028, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x63c82a38, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7fff43332950, 0x7fd1ceae8f40, ...)
 | | | | | > kpdbUidToId(0, 0x7fd1ceae8f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x63c82a38, 0x1, ...)

Yes, it’s that simple. In the normal commit mode, commit wait, ksupop (kernel service user pop (restore) user or recursive call) executes a call to kcbdsy, which performs a tail call to kcrf_commit_force_int. In nowait commit mode, kcbdsy is simply not called in ksupop, which is exactly what nowait means: the wait for the logwriter notification is skipped.

b) commit immediate/batch
I ran a pin tools debugtrace on a session that commits explicitly with the write mode set to immediate, and on a session that commits explicitly with the write mode set to batch. If you read the other redo-related articles, you know that a commit generates change vectors that are written into the public redo strand, changes the transaction table in the undo segment header, signals the logwriter to write in kcrf_commit_force_int, then releases all transactional control on the rows in the committed transaction, after which kcrf_commit_force_int is called again in order to wait for the logwriter notification that the change vectors have been persisted.

When commit is set to batch, what happens is actually very simple: everything is done exactly as in ‘immediate mode’, except for the first call to kcrf_commit_force_int, which is the call that triggers the logwriter to write. So ‘batch mode’ does not explicitly batch writes for the logwriter; rather, it disables the signal to the logwriter to write right after the change vectors have been copied and the blocks have been changed. But that is not all…

I noticed something weird when analysing the calls in the debugtrace of ‘commit write batch’: not only was the first invocation of kcrf_commit_force_int gone, the second invocation was gone too! That is weird, because the Oracle documentation says:

WAIT | NOWAIT

Use these clauses to specify when control returns to the user.

The WAIT parameter ensures that the commit will return only after the corresponding redo is persistent in the online redo log. Whether in BATCH or IMMEDIATE mode, when the client receives a successful return from this COMMIT statement, the transaction has been committed to durable media. A crash occurring after a successful write to the log can prevent the success message from returning to the client. In this case the client cannot tell whether or not the transaction committed.

The NOWAIT parameter causes the commit to return to the client whether or not the write to the redo log has completed. This behavior can increase transaction throughput. With the WAIT parameter, if the commit message is received, then you can be sure that no data has been lost.

If you omit this clause, then the transaction commits with the WAIT behavior.

The important, and WRONG, thing is in the last line: ‘if you omit this clause, then the transaction commits with the WAIT behavior’. Actually, if the commit mode is set to batch, the commit wait mode flips to nowait with it. This is the ultimate batching: no signal is sent to the logwriter at all, so the change vectors in the public redo strands are written to disk by the logwriter only every three seconds, because that is the timeout of the logwriter sleeping on its semaphore, after which it picks up any redo to write via information in the kcrfsg_ and KCRFA structures. This is important, because with NOWAIT behaviour there is no guarantee that the changes have been persisted for the committing session.

I was surprised to find this; it meant that I searched for ‘kcrf_commit_force_int’ in the debugtrace of a commit with the ‘write batch’ arguments and did not find a single occurrence. This has actually been reported before by Marcin Przepiorowski, in a comment on an article by Christian Antognini on this topic.
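
You don’t need a debugtrace to see the missing wait. A hedged way to verify it yourself, assuming you have select privileges on v$session_event, is to watch your own session’s ‘log file sync’ waits around a commit:

select event, total_waits
from   v$session_event
where  sid = sys_context('userenv','sid')
and    event = 'log file sync';

-- perform some DML, then:
commit write batch;

-- re-running the query above should show an unchanged total_waits,
-- while a plain 'commit' (write wait) increments it.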

Can this commit batching be changed to include waiting for the logwriter? Yes, it can, if you explicitly include ‘wait’ with the commit write batch. Interestingly, the kcrf_commit_force_int function then comes back at a totally different place:

 | | | | | | | | | | | | | < ktuulc+0x000000000119 returns: 0
 | | | | | | | | | | | | | > ktudnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | | > ktuIMTabCacheCommittedTxn(0x69fc8eb0, 0x7ffe9eb79e74, ...)
 | | | | | | | | | | | | | | < ktuIMTabCacheCommittedTxn+0x000000000071 returns: 0
 | | | | | | | | | | | | | | > kslgetl(0x6ab9d6e8, 0x1, ...)
 | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | > kslfre(0x6ab9d6e8, 0x6ab9ce00, ...)
 | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | | | | | | | | < ktudnx+0x0000000005e4 returns: 0
 | | | | | | | | | | | | | > ktuTempdnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | < ktuTempdnx+0x000000000083 returns: 0
 | | | | | | | | | | | | | > kcb_sync_last_change(0x69fc8eb0, 0x6df64df8, ...)
 | | | | | | | | | | | | | <> kcrf_commit_force_int(0x7f525ba19c00, 0x1, ...)
...
 | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | | | | | | | | | | | > kghstack_free(0x7f525bb359a0, 0x7f525690ead8, ...)
 | | | | | | | | | | | | | < kghstack_free+0x00000000005a returns: 0
 | | | | | | | | | | | | < ktucmt+0x000000000e0c returns: 0

With commit wait/nowait, described above, the wait is a separate call after the transaction in the ksupop function: kcrf_commit_force_int with the second argument set to 1, which means it notifies the logwriter as well as waits for the logwriter notification of the write. With ‘commit write batch wait’, kcrf_commit_force_int is instead called after the TX enqueue has been cleared (ktuulc) and the undo transaction count has been lowered (ktudnx), at the end of the ktucmt function, as a tail call of kcb_sync_last_change, which wasn’t called before. Of course this limits the IO batching opportunities.

Conclusion
Do not change your database or even your session to make your commits faster. If you must, read this article carefully and understand the trade-offs. One trade-off that hasn’t been highlighted: this behaviour might change in a different version, and it takes some effort to investigate. And again, if you are still considering this, you probably have a different problem that you should look at. Do not take this option in desperation, hoping for a magical restoration of performance.

The commit_wait option nowait does trigger the logwriter to write (the first invocation of the kcrf_commit_force_int function), but it does not wait for the write confirmation. The commit_logging option batch does something different from what the documentation says: it neither signals the logwriter nor waits for it. This way the logwriter can wait the full three seconds before it times out on its semaphore and writes what is in the public redo strands. But there is then no way to tell whether the redo for your change has been persisted, because that wait is gone too (that wait is the infamous ‘log file sync’ wait). If you want batching but still want a write notification, you must set commit_wait to wait explicitly. By doing that you do not get the optimal batching, because the wait for the logwriter, including sending it the signal to write, is executed; I suspect this to be in the same ballpark as regular committing, but I haven’t checked.

Application Engine in Process Scheduler: PSAESRV Server Process -v- Standalone PSAE executable

Whether to use the Application Engine server process (PSAESRV) in the Process Scheduler Tuxedo domain or the stand-alone PSAE executable is a frequently discussed point amongst PeopleSoft administrators.  Over the years, I have written various things on the subject.  I am going to draw them together in this blog post and restate Oracle’s now clear advice about when to use which option.

In PeopleTools 8.4, the Process Scheduler became a fully fledged Tuxedo domain.  The PSAESRV process was also introduced at this time.  It is a persistent process that handles both Application Engine and Optimization Engine requests.  Each server process creates persistent database connections.  A number of these server processes are started with the domain.   The PSAESRV process does not spawn like other Tuxedo server processes.  Instead, you must configure the number of server processes to match the maximum number of concurrent Application Engine process requests and concurrent Optimization Engine requests that the process scheduler can run.  The server was introduced to handle very short-lived Application Engine programs thus avoiding the overhead of instantiating a new process and new database sessions for each process request.  CRM typically uses Application Engine in this fashion, but generally, you do not see this in other PeopleSoft products.

Oracle has not always been clear about what they mean by a short-lived process.  It has been suggested that you should use PSAESRV if Application Engine processes typically take less than 10-30 seconds, or if you run more than 1000 Application Engine process requests per hour (note 651970.1).
PeopleBooks advises you to use PSAESRV because it delivers improved system performance.  However, the PeopleTools Performance Guidelines Red Paper (Doc ID 747389.1) contradicts this somewhat.  Ultimately, if you have any doubts, you should test each way and determine whether one gives a clear advantage over the other.

Oracle Support Note "What will be the impact of disabling PSAESRV on the Process Scheduler (Doc ID 651970.1)" explains that if PSAESRV is disabled in the Tuxedo domain configuration, the Process Scheduler goes back to the legacy behaviour and spawns a stand-alone PSAE process for each Application Engine request.  “The Application Engine will take a bit longer to start, [the] time delay may be range from millisecond to seconds” depending on hardware and configuration.

The stand-alone process has several advantages.

  • At the end of the Application Engine program, it disconnects from the database and terminates, thus releasing both the process resources and the database session.  The persistent PSAESRV process, by contrast, has been reported to accumulate allocated memory over time.
  • If you are using Oracle database Global Temporary Tables in an Application Engine program, then you should not use PSAESRV, because the tables are created ON COMMIT PRESERVE ROWS and so the rows are only released when the database session terminates (see the sketch after this list).
  • If you set any session parameters within an Application Engine program run via PSAESRV, or enable database trace, then these settings will carry forward from one Application Engine program to the next unless you reset the parameter at the end of the program or at the start of the next.  This is not a concern with stand-alone PSAE processes.
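
To illustrate the temporary table point, here is a minimal sketch (the table name is hypothetical, not a delivered PeopleSoft object) of why ON COMMIT PRESERVE ROWS matters for a persistent server process:

create global temporary table ae_work_tmp
( process_instance number
, emplid           varchar2(11)
) on commit preserve rows;

insert into ae_work_tmp values (1234, 'KU0001');
commit;

-- The rows survive the commit. A stand-alone PSAE process releases them
-- when it terminates, but a persistent PSAESRV session keeps them until
-- they are explicitly deleted or truncated.
select count(*) from ae_work_tmp;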

However, there is at least one case where you must use the server process:

  • If you are using Oracle Active Data Guard and wish to mark some Application Engine programs as read-only, then they must be run via the PSAESRV process.

Conclusion

  • The PeopleTools Performance Guidelines Red Paper (Doc ID 747389.1) sums it up very nicely: “PSAE is as good as PSAESRV for most practical purposes.  If you have an application engine job that runs longer than 10 seconds, PSAE is equivalent to PSAESRV.  PSAE has the added advantage of being recycled at the end of each application engine job, cleaning up any outstanding SQL cursors to the database that may have been left behind.  Because PSAE recycles after each use, PSAE does not have any possible memory leakage problem that may occupy the precious system memory.  In short, PSAE is a cleaner workhorse.”
  • I think it is reasonable to use PSAESRV in CRM.  For all other products, I recommend that PSAESRV should be disabled from all Process Schedulers.
    • If you do have some Application Engine processes that are both short-lived (i.e. less than 10 seconds) and run frequently, then consider creating separate Process Schedulers with PSAESRV processes that are dedicated to running only these processes.  You can then move these processes to a new Process Scheduler category that runs only on these new Process Schedulers.
  • PSAESRV is configured by default, so if you don’t want to use it, and mostly you won’t, then you have to remember to disable it.

Covering indexes in Oracle, and branch size

A covering index is an index that contains all the columns required by your query, so that you don’t have to do a TABLE ACCESS BY INDEX ROWID, which is the major cost of an index range scan. You don’t need any special feature to do that in Oracle. Just add the required columns at the end of the index. In the execution plan you will see the columns used as index keys for the range scan displayed in ‘access’ predicates, and the further filtering done on the remaining columns with ‘filter’ predicates. The ‘projection’ shows the columns that are returned in the rowset result.
However, you may have seen that SQL Server has a special INCLUDE clause to separate out those non-key columns, added only for filtering or projection but not for access. What does it bring that Oracle doesn’t have?
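
For reference, here is roughly how the two approaches compare; a sketch with hypothetical table and column names (the first statement is SQL Server syntax, the second the Oracle equivalent):

-- SQL Server: STATUS and AMOUNT are stored in the leaf blocks only, not in the key
create index orders_cov on orders (customer_id) include (status, amount);

-- Oracle: no INCLUDE clause, so the extra columns simply go at the end of the key
create index orders_cov on orders (customer_id, status, amount);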

An index entry is composed of a key and data associated with the key. The index is sorted on the key. The data for each key has no special order, as in a heap table. The idea of the SQL Server INCLUDE clause is to separate the columns belonging to the key from the columns belonging to the data. It is not mandatory: you can add all columns to the key, but depending on the implementation, the benefit of separating them can be:

  • some data types may not be allowed in the key but allowed as data
  • sorting the data when not required may be a performance overhead
  • there can be limitations on the size of the key
  • having a larger key may require more space in the branches
  • adding sorted columns may change the clustering factor

In Oracle, there are very few data types that cannot be indexed (LONG being one). The limitation on the size of the key may come into play for large 12c extended data types. You can substring them, but that defeats the goal of covering indexes. I see two reasons why ‘INCLUDE’ indexes can be useful. The first reason is about the clustering factor. The second is about sorting the whole index entry and referencing it from the branches. I’ll detail those reasons later, but first here is an example.


SQL> create table DEMO (UNIQU ,RANGE ,RANDOM_TEXT ,CONSTANT_TEXT ) as select rownum UNIQU , mod(rownum,4) RANGE , dbms_random.string('u',80) RANDOM_TEXT , lpad('x',80,'x') CONSTANT_TEXT from xmltable('1 to 100000');
Table DEMO created.
SQL> commit;
Commit complete.

This table has an all-distinct-values column (UNIQU) and a few-distinct-values column (RANGE), and I’ll use them as the index keys. And I have two columns I’ll add as additional columns for covering queries: one with a lot of distinct values (RANDOM_TEXT) and one with few distinct values (CONSTANT_TEXT).
The first rows look like this:

SQL> select * from DEMO order by ROWID fetch first 5 rows only;
UNIQU RANGE RANDOM_TEXT CONSTANT_TEXT
----- ----- -------------------------------------------------------------------------------- --------------------------------------------------------------------------------
1 1 XCFNWCRCFBEPJPSHREUVVVTBUCCXLZMRPJPNQDTHWYRZRUORBPDOBCIRFHLICETULTCZTMPOCMUNQITV xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2 2 XUSPNDOMPQKOIRCVDDTVYAGKRDGIXOSVUNMRAQLSRQGYKOFEXRQMCPXPYZYKRHHKDXGIINOUUAUJOLOO xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
3 3 ZBCVFTDSRUFIUTSIWOOOBWIRMEFUXNWLADAPUPFNPVYDLPQTOUZVXJKMGIPCGZESXFXOIYVMKNSMMZKB xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
4 0 VOIRCXFVSRVZQRZDRLQRHZWNGQJAAWJXWXJKRCJVPWYDJSZLJIOEWAMCFSRCUPSPPEKITJYHHOUQSVYQ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
5 1 UUSAMEVRWNLPGCUVMJWVVPDAENRYKIWWMIHTUJSZRQASMTYOVQNCGZGZIJZWNSOJVSIBMMUEAXOHJCOA xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I’m adding indexes for access on RANGE as the index key, either with only the key, or covering the random or constant text:

SQL> create index DEMO_RANGE on DEMO(RANGE) pctfree 50;
Index DEMO_RANGE created.
SQL> create index DEMO_RANGE_COVERING_RANDOM on DEMO(RANGE,RANDOM_TEXT) pctfree 50;
Index DEMO_RANGE_COVERING_RANDOM created.
SQL> create index DEMO_RANGE_COVERING_CONSTANT on DEMO(RANGE,CONSTANT_TEXT) pctfree 50;
Index DEMO_RANGE_COVERING_CONSTANT created.

An additional one adding the unique column in-between:

SQL> create index DEMO_RANGE_COVERING_WITH_PK on DEMO(RANGE,UNIQU,CONSTANT_TEXT) pctfree 50;
Index DEMO_RANGE_COVERING_WITH_PK created.

And now for access with the unique column as a key:

SQL> create index DEMO_UNIQU_COVERING_RANDOM on DEMO(UNIQU,RANDOM_TEXT) pctfree 50;
Index DEMO_UNIQU_COVERING_RANDOM created.
SQL> create index DEMO_UNIQU_COVERING_CONSTANT on DEMO(UNIQU,CONSTANT_TEXT) pctfree 50;
Index DEMO_UNIQU_COVERING_CONSTANT created.

Here are some interesting stats:

SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.
SQL> select index_name,blevel,leaf_blocks,num_rows,clustering_factor from user_indexes where table_name='DEMO' order by 2,3;
INDEX_NAME BLEVEL LEAF_BLOCKS NUM_ROWS CLUSTERING_FACTOR
-------------------------------- ------ ----------- -------- -----------------
DEMO_RANGE 1 353 100000 9757
DEMO_RANGE_COVERING_RANDOM 2 2440 100000 99967
DEMO_RANGE_COVERING_CONSTANT 2 2440 100000 9757
DEMO_UNIQU_COVERING_RANDOM 2 2500 100000 2440
DEMO_UNIQU_COVERING_CONSTANT 2 2500 100000 2440
DEMO_RANGE_COVERING_WITH_PK 2 2565 100000 9757
6 rows selected.

Leaf size

Regarding size, the covering indexes have approximately the same number of leaf blocks because the included columns (RANDOM_TEXT and CONSTANT_TEXT) have the same size (80 bytes). Of course, the non-covering index is smaller (but will need a table access to query the additional column). The key on UNIQU is slightly larger than the one on RANGE because the numbers go higher (and so need more bytes). The index with 3 columns is the largest.

Clustering factor

Regarding the clustering factor, there’s one outlier here which deserves an explanation. But before that, you must understand that this higher clustering factor is not important for a query using the covering index, such as a SELECT RANDOM_TEXT … WHERE RANGE=0, because in that case you don’t read the table. However, for some queries you may cover only the filter predicates and still go to the table for the projection.
But the big problem is that when you add a column to an index to address a specific query, you don’t want to risk a side effect on another query, and changing the clustering factor is exactly such a risk. One solution is to keep the old non-covering index (DEMO_RANGE), but then the side effect is extra DML overhead.
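
To make the covering behaviour concrete, here is what the plan for such a query looks like; the output is abbreviated and illustrative only, as the optimizer choice may vary with version and statistics:

SQL> select RANDOM_TEXT from DEMO where RANGE=0;
SQL> select * from table(dbms_xplan.display_cursor(format=>'BASIC +PREDICATE'));

--------------------------------------------------------
| Id  | Operation        | Name                        |
--------------------------------------------------------
|   0 | SELECT STATEMENT |                             |
|*  1 |  INDEX RANGE SCAN| DEMO_RANGE_COVERING_RANDOM  |
--------------------------------------------------------
   1 - access("RANGE"=0)

-- no TABLE ACCESS BY INDEX ROWID step: the index alone covers the query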

To understand the change in clustering factor, we must go deeper into the Oracle implementation of the index key and data parts. The ‘data’ part exists in Oracle indexes even when not specified explicitly with an INCLUDE clause: the ROWID is the data part. An index entry associates a key (the indexed columns) with a pointer to the table row (the ROWID). At least, this is the case for UNIQUE indexes, where each key is unique.

Non-unique indexes are a special case. Actually, Oracle implements only unique key indexes: when the indexed columns are not unique, the ROWID is stored in the key part of the index entry, and there is no data part. You should read Richard Foote, Differences between Unique and Non-Unique Indexes, for a detailed explanation.

Branch size

The previous statistics displayed only the number of branch levels (BLEVEL), which was the same, but we can get more detail about the branch size with an ANALYZE INDEX.

The non-covering index has only one branch block, the root, which references all 353 leaf blocks containing the 100000 entries, with an average of 5479/352=15 bytes per branch entry:

SQL> analyze index DEMO_RANGE validate structure offline;
Index DEMO_RANGE analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
2 384 353 1375000 352 1 5479 25000 2830616 1380479 49 25000 12502.5 19 1375000 353

The covering index with a lot of distinct values in the non-key column has more branch blocks, with an average of 34623/2439=14 bytes per branch entry:

SQL> analyze index DEMO_RANGE_COVERING_RANDOM validate structure offline;
Index DEMO_RANGE_COVERING_RANDOM analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
3 2560 2440 9475000 2439 6 34623 1 19558408 9509623 49 1 4 2 9475000 2440

Here the number of branch blocks is higher only because there are more leaves (as we have more columns), not because of the size of the branch entries, which are even smaller. They are smaller because a branch does not have to store the full value of all columns in order to identify one leaf block: only the first bytes are needed, not the full 80 bytes.

The covering index with few distinct values in the non-key column has a lot more branch blocks, with an average of 234755/2439=96 bytes per branch entry:

SQL> analyze index DEMO_RANGE_COVERING_CONSTANT validate structure offline;
Index DEMO_RANGE_COVERING_CONSTANT analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
 
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
3 2560 2440 9475000 2439 31 234755 25000 19759108 9709755 50 25000 12503.5 86 9475000 2440

So here the size of the branch blocks is higher because we have multiple leaf blocks with the same value of the constant second column, which is therefore not sufficient to identify only one leaf block. The full 80 bytes must be stored, plus the rowid in addition.

When the indexed column has only unique values, there is no need to store more in the branches (neither the additional columns nor the rowid), and only 12 bytes are needed here on average:

SQL> analyze index DEMO_UNIQU_COVERING_RANDOM validate structure offline;
Index DEMO_UNIQU_COVERING_RANDOM analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
 
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
3 2560 2500 9688892 2499 5 29737 1 20030140 9718629 49 1 4 0 9688892 2500

As the second column is not needed, the size of the branches is the same whether we use RANDOM_TEXT or CONSTANT_TEXT:

SQL> analyze index DEMO_UNIQU_COVERING_CONSTANT validate structure offline;
Index DEMO_UNIQU_COVERING_CONSTANT analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
 
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
3 2560 2500 9688892 2499 5 29737 1 20030140 9718629 49 1 4 0 9688892 2500

Now, the last one is my workaround for the higher branch size when adding a column that does not have a lot of distinct values: just add a column with more distinct values before it. Here I use the UNIQU one, but you probably have one that can be useful for your queries.

SQL> analyze index DEMO_RANGE_COVERING_WITH_PK validate structure offline;
Index DEMO_RANGE_COVERING_WITH_PK analyzed.
SQL> select height,blocks,lf_blks,lf_rows_len,br_rows,br_blks,br_rows_len,most_repeated_key,btree_space,used_space,pct_used,rows_per_key,blks_gets_per_access,opt_cmpr_pctsave,lf_uncmp_rows_len,lf_uncmp_blks from index_stats
 
HEIGHT BLOCKS LF_BLKS LF_ROWS_LEN BR_ROWS BR_BLKS BR_ROWS_LEN MOST_REPEATED_KEY BTREE_SPACE USED_SPACE PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS OPT_CMPR_PCTSAVE LF_UNCMP_ROWS_LEN LF_UNCMP_BLKS
------ ------ ------- ----------- ------- ------- ----------- ----------------- ----------- ---------- -------- ------------ -------------------- ---------------- ----------------- -------------
3 2688 2565 9963892 2564 6 37456 1 20557908 10001348 49 1 4 2 9963892 2565

Now you get the idea. When creating an index, or adding columns for a covering index, and you have the choice of column order, try to make their first bytes selective enough that a branch needs only a small substring to identify each leaf block (or lower-level branch).

Block dumps

If you want to see the details about the branch length, here is some information from block dumps. I got them with the following:

SQL> column value new_value tracefile
SQL> select value from v$diag_info where name='Default Trace File';
VALUE
/u01/app/oracle/diag/rdbms/cdb1/CDB1/trace/CDB1_ora_6799.trc
SQL> exec for i in (select header_file, header_block from dba_segments where owner='DEMO' and segment_name='DEMO_RANGE') loop execute immediate 'alter system dump datafile '||i.header_file||' block '||(i.header_block+1); end loop;
PL/SQL procedure successfully completed.
SQL> host tail -20 &tracefile

Here is the last branch entry for the root block of DEMO_RANGE where the first column is not very selective and then the rowid is required in the branch:

row#351[3279] dba: 113261807=0x6c03cef
col 0; len 2; (2): c1 04
col 1; len 6; (6): 07 00 05 7b 00 25

Here is the last branch entry for the root block of DEMO_RANGE_COVERING_RANDOM, where instead of the rowid the first 3 bytes of the RANDOM_TEXT column are sufficient:

row#3[8006] dba: 113263037=0x6c041bd
col 0; len 2; (2): c1 04
col 1; len 3; (3): 53 51 52
col 2; TERM

Here is the last branch entry for the root block of DEMO_RANGE_COVERING_CONSTANT where the full 80 bytes of CONSTANT_TEXT are not even sufficient, and the ROWID is needed as a 3rd column:

row#28[5316] dba: 117444566=0x7000fd6
col 0; len 2; (2): c1 04
col 1; len 80; (80):
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
78 78 78 78 78
col 2; len 6; (6): 07 00 05 43 00 25

Here is the last branch entry for the root block of DEMO_UNIQU_COVERING_CONSTANT where the first column is sufficient:

row#2[8026] dba: 117447160=0x70019f8
col 0; len 4; (4): c3 09 0d 04
col 1; TERM

So what?

We probably don’t need a feature like SQL Server INCLUDE indexes in most cases. However, this may require thinking about the order of columns, mainly:

  • ensure that selective columns appear as early as possible (without compromising the index access efficiency, of course) in order to lower the number of bytes required to address branches and leaves
  • when adding columns, try to add first a column that will keep the clustering factor you had with the rowid, such as a date of insert

Added 14-APR-2018

The conclusion above was focused only on columns added for covering indexes (I wrote it after reading wrong claims in this Stack Overflow thread), and it is not a general statement about putting selective columns first, which is a common misconception. Columns like this CONSTANT_TEXT (an extreme case of non-selectivity) can get better index compression (an Enterprise Edition feature) when placed in front. Read this thread, with answers and links from Richard Foote: https://twitter.com/OracleSK/status/984906294879539200

 

This article, Covering indexes in Oracle, and branch size, appeared first on Blog dbi services.

New Oracle Security Public Training Dates Available

Due to some very critical close family health issues in the last few months I have delayed advertising any public training dates this year for my Oracle Security classes as I have had to be available for family support during....[Read More]

Posted by Pete On 13/04/18 At 10:10 AM

Concurrency … the path to success and the path the failure

Let’s face it: concurrency is a good thing when it comes to database applications. After all, if there is only a single user of your application, then chances are it is not a successful application :). Of course there are exceptions to this rule, but by and large, most of the applications we build are going to be used by large populations of users.  And given the recent publicity around users’ data and privacy, we can also be pretty confident that we want the data in our applications to be correct at all times.

As a developer, it is absolutely critical to keep concurrency in mind when building applications.  Here is a simple demonstration to drive the point home.

The task assigned to the developer here is simple: move all of the rows satisfying a particular condition from table T1 to table T2.  Let’s create our database objects first:


SQL> create table t1 ( x int, y int, z int );

Table created.

SQL> create table t2 as select * from t1 where 1=0;

Table created.

SQL>
SQL> insert into t1
  2  select rownum, rownum, rownum
  3  from dual connect by level <= 10000;

10000 rows created.

SQL> commit;

Commit complete.

So table T1 has 10,000 rows and T2 is empty, and our condition to determine which rows to transfer will be that X is non-zero.  That happens to be all of the rows, but of course, it could be any filtering rule on the existing data. I’ve chosen all of the rows so I can keep the numbers easy to digest in the upcoming examples.

There is no such thing as a MOVE command in SQL, so we can implement this with two statements: 

  • INSERT-SELECT to copy the rows from T1 to T2
  • DELETE to remove the rows we just copied.

Here is the simple PL/SQL routine to do that, and a first test run of the code.


SQL>
SQL> set serverout on
SQL> begin
  2      insert into t2 select * from t1 where x != 0;
  3      delete from t1 where x != 0;
  4      commit;
  5  end;
  6  /

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from t1;

  COUNT(*)
----------
         0

1 row selected.

SQL> select count(*) from t2;

  COUNT(*)
----------
     10000

1 row selected.

SQL>

Everything looks fine, doesn’t it? But it’s not.  In fact, I’ll do what any good developer should do and add some instrumentation to the code to get more information about its operations.


SQL> create table t1 ( x int, y int, z int );

Table created.

SQL> create table t2 as select * from t1 where 1=0;

Table created.

SQL>
SQL> insert into t1
  2  select rownum, rownum, rownum
  3  from dual connect by level <= 10000;

10000 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> set serverout on
SQL> begin
  2      insert into t2 select * from t1 where x != 0;
  3      dbms_output.put_line(sql%rowcount);           
  4      delete from t1 where x != 0;
  5      dbms_output.put_line(sql%rowcount);
  6      commit;
  7  end;
  8  /
10000
10100      <=========== !!!!!

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from t1;

  COUNT(*)
----------
         0

1 row selected.

SQL> select count(*) from t2;

  COUNT(*)
----------
     10000

1 row selected.

SQL>
SQL>

Look at the output from the PL/SQL block.  I copied (inserted) 10,000 rows from T1 to T2, but the deletion phase removed more than 10,000 rows. How was that possible?  It is because databases are about concurrency.  In another session, I had a simple script running that just adds new rows to the table continuously.  As a developer, it is vital that you do not assume that you are the only person accessing or manipulating a table.  My code in its existing form has corrupted the database, because 100 rows have been deleted that were never copied to T2.  They’re gone forever!
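
The concurrent load script itself is not shown here; a minimal sketch of what such a session could look like (the values and pacing are my assumptions, not the original script) is:

begin
  for i in 1 .. 1000 loop
    insert into t1 values (-i, -i, -i);  -- x != 0, so these rows match the transfer condition
    commit;
    dbms_lock.sleep(0.01);               -- requires execute privilege on dbms_lock
  end loop;
end;
/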

So… on to some potential solutions.  Let’s try locking the rows first.


SQL> create table t1 ( x int, y int, z int );

Table created.

SQL> create table t2 as select * from t1 where 1=0;

Table created.

SQL>
SQL> insert into t1
  2  select rownum, rownum, rownum
  3  from dual connect by level <= 10000;

10000 rows created.

SQL> commit;

Commit complete.

SQL> set serverout on
SQL> declare
  2    cursor c is select * from t1 where x != 0 FOR UPDATE;
  3  begin
  4    open c;
  5    insert into t2 select * from t1 where x != 0;
  6    dbms_output.put_line(sql%rowcount);               
  7    delete from t1 where x != 0;
  8    dbms_output.put_line(sql%rowcount);
  9    commit;
 10    close c;
 11  end;
 12  /
10000
10100

PL/SQL procedure successfully completed.

SQL>
SQL> select count(*) from t1;

  COUNT(*)
----------
         0

1 row selected.

SQL> select count(*) from t2;

  COUNT(*)
----------
     10000

1 row selected.

As you can see, that has not yielded any benefit, because locking only protects existing rows.  The new rows arriving from a separate session are unhindered, and hence my DELETE statement still picks up the new rows when it should not. I could work around this by locking the entire table, but what if my filtering condition matched only a small percentage of the rows?  It seems overly aggressive to lock everyone out of the table just to copy a small subset of rows.  In this case, I can use PL/SQL to help me out. Using the BULK COLLECT / BULK BIND facilities, I can restrict my DML operations to just the ROWIDs of the rows I am copying.


SQL> create table t1 ( x int, y int, z int );

Table created.

SQL> create table t2 as select * from t1 where 1=0;

Table created.

SQL>
SQL> insert into t1
  2  select rownum, rownum, rownum
  3  from dual connect by level <= 10000;

10000 rows created.

SQL> commit;

Commit complete.

SQL> set serverout on
SQL> declare
  2    cursor c is select rowid rid, t1.* from t1 where x != 0 for update;
  3    type array is table of c%rowtype index by pls_integer;
  4    l_rows array;
  5    l_idx int;
  6  begin
  7    open c;
  8    loop
  9      fetch c bulk collect into l_rows limit 20000;
 10      exit when l_rows.count = 0;
 11
 12      forall i in 1 .. l_rows.count
 13         insert into t2 values (
 14            l_rows(i).x,
 15            l_rows(i).y,
 16            l_rows(i).z );
 17      dbms_output.put_line(sql%rowcount);       
 18      forall i in 1 .. l_rows.count
 19        delete from t1
 20        where rowid = l_rows(i).rid;
 21      dbms_output.put_line(sql%rowcount);
 22    end loop;
 23  end;
 24  /
10000
10000

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> select count(*) from t1;

  COUNT(*)
----------
       147

1 row selected.

SQL> select count(*) from t2;

  COUNT(*)
----------
     10000

1 row selected.

I am collecting the 10,000 ROWIDs of interest into an array and then deleting only those rows.  So even though new rows are arriving, I will never see them, because they will not be in the set of ROWIDs that I am moving to table T2.  We can see that during the operation, 147 new rows arrived in T1 from other sessions.

(Note: normally you would choose a bulk collect size of perhaps 1000 or similar, and loop around repeatedly to avoid excessive PGA use.  I bulk collected the entire set here to show the rowcounts more intuitively.)

You might be thinking that perhaps we no longer need the “FOR UPDATE” clause on the cursor, but it is still a required element.  Without it, it is possible that a different session could update a row from the candidate data set, and that updated row would still be deleted.  Or a different session could even delete, commit, and insert a fresh row that happens to pick up the same ROWID as one from the original set.

Bottom line: keep concurrency front and foremost in your mind when you are writing code to manipulate data.

New option for configuring multipathing in Oracle Linux 7

Teaching is and remains the best way of picking up new things :) While updating notes for a class, I came across an interesting change in the way device-mapper multipath works in Oracle Linux 7.4.

In the past, everyone including me used scsi_id to get the WWID of a LUN for use with dm-multipath. This is still the way Oracle documents it for Oracle Linux 7.

The utility was specified as an argument to getuid_callout in /etc/multipath.conf, as shown here:

defaults {
[...]
    getuid_callout        "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
[...]
}

This is how it was done more or less since Red Hat 5 days. The location of scsi_id has changed over time, as expertly described on Oracle Base for example.

I don’t exactly know what happened to getuid_callout, but I can’t find it in the DM Multipath documentation for Red Hat Enterprise Linux 7. It certainly is present in its counterpart for Red Hat Enterprise Linux 6. A quick search using my favourite search engine did not reveal any clues as to what happened. The closest I got to an answer was a page in the SLES 11 SP3 documentation, which indicated that you should use a different parameter than getuid_callout.

Now I was curious! If I continued using getuid_callout, would the service stop working? Or just throw a warning?

The experiment

I quickly spun up a VM on KVM, running Oracle Linux 7.4 which I updated to the latest and greatest version of each package as of April 3rd 2018. The current versions of dm-multipath as of April 3rd 2018 are:

[root@server6 ~]# rpm -qa | grep -i multipath
device-mapper-multipath-0.4.9-111.el7_4.2.x86_64
device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64
[root@server6 ~]# 

I have presented a small “LUN” to the VM, using the virtual SCSI driver. I initially tried the virtio driver but found it was not exporting the properties back to the VM that I need later in this post. It shouldn’t matter though; all of this is a playground, and I am most interested in the concepts, not the details.

You will also notice that I am only showing snippets of the multipath.conf file. The reason for this is simple: if you don’t follow your storage vendor’s documentation detailing how to configure dm-multipath for the storage system in use, you might run into issues. Always consult your storage vendor’s official documentation!

Installing dm-multipath

I didn’t have multipathing installed initially and had to install it first. Here is the transcript:

[root@server6 ~]# yum install device-mapper-multipath

[...]

--> Running transaction check
---> Package boost-system.x86_64 0:1.53.0-27.el7 will be installed
---> Package boost-thread.x86_64 0:1.53.0-27.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================
 Package                            Arch         Version                 Repository        Size
================================================================================================
Installing:
 device-mapper-multipath            x86_64       0.4.9-111.el7_4.2       ol7_latest       134 k
Installing for dependencies:
 boost-system                       x86_64       1.53.0-27.el7           ol7_latest        39 k
 boost-thread                       x86_64       1.53.0-27.el7           ol7_latest        57 k
 device-mapper-multipath-libs       x86_64       0.4.9-111.el7_4.2       ol7_latest       251 k
 librados2                          x86_64       1:0.94.5-2.el7          ol7_latest       1.7 M
Updating for dependencies:
 kpartx                             x86_64       0.4.9-111.el7_4.2       ol7_latest        72 k

Transaction Summary
================================================================================================
Install  1 Package  (+4 Dependent packages)
Upgrade             ( 1 Dependent package)

Total download size: 2.2 M

[...]

Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : boost-system-1.53.0-27.el7.x86_64                                            1/7 
  Installing : boost-thread-1.53.0-27.el7.x86_64                                            2/7 
  Installing : 1:librados2-0.94.5-2.el7.x86_64                                              3/7 
  Installing : device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64                        4/7 
  Updating   : kpartx-0.4.9-111.el7_4.2.x86_64                                              5/7 
  Installing : device-mapper-multipath-0.4.9-111.el7_4.2.x86_64                             6/7 
  Cleanup    : kpartx-0.4.9-111.el7.x86_64                                                  7/7 
  Verifying  : boost-system-1.53.0-27.el7.x86_64                                            1/7 
  Verifying  : 1:librados2-0.94.5-2.el7.x86_64                                              2/7 
  Verifying  : boost-thread-1.53.0-27.el7.x86_64                                            3/7 
  Verifying  : device-mapper-multipath-0.4.9-111.el7_4.2.x86_64                             4/7 
  Verifying  : kpartx-0.4.9-111.el7_4.2.x86_64                                              5/7 
  Verifying  : device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64                        6/7 
  Verifying  : kpartx-0.4.9-111.el7.x86_64                                                  7/7 

Installed:
  device-mapper-multipath.x86_64 0:0.4.9-111.el7_4.2                                            

Dependency Installed:
  boost-system.x86_64 0:1.53.0-27.el7                      boost-thread.x86_64 0:1.53.0-27.el7 
  device-mapper-multipath-libs.x86_64 0:0.4.9-111.el7_4.2  librados2.x86_64 1:0.94.5-2.el7     

Dependency Updated:
  kpartx.x86_64 0:0.4.9-111.el7_4.2                                                             

Complete!
[root@server6 ~]# 

First configuration attempt

Next I quickly created a config file based on what I used in the past, making sure I added getuid_callout to the defaults {} section. And sure enough, when starting the service it complains about an invalid keyword:

[root@server6 ~]# systemctl status multipathd
● multipathd.service - Device-Mapper Multipath Device Controller
   Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-04-03 14:56:19 BST; 1min 28s ago
  Process: 586 ExecStart=/sbin/multipathd (code=exited, status=0/SUCCESS)
  Process: 568 ExecStartPre=/sbin/multipath -A (code=exited, status=0/SUCCESS)
  Process: 564 ExecStartPre=/sbin/modprobe dm-multipath (code=exited, status=0/SUCCESS)
 Main PID: 596 (multipathd)
   CGroup: /system.slice/multipathd.service
           └─596 /sbin/multipathd

Apr 03 14:56:19 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 14:56:19 server6 multipath[568]: Apr 03 14:56:19 | /etc/multipath.conf line 24, invalid keyword: getuid_callout
Apr 03 14:56:19 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 14:56:19 server6 multipathd[596]: /etc/multipath.conf line 24, invalid keyword: getuid_callout

Note the line reading “/etc/multipath.conf line 24, invalid keyword: getuid_callout”. It didn’t seem to prevent the meta-device from being created, but I didn’t like the error message popping up all the time.

After removing the getuid_callout from my multipath.conf file to let the defaults take over, the error went away after a restart of the service:

Apr 03 15:02:54 server6 systemd[1]: Stopped Device-Mapper Multipath Device Controller.

Apr 03 15:10:43 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 15:10:43 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 15:10:43 server6 multipathd[11593]: mpatha: load table [0 4194304 multipath 0 0 1 1 service-time 0 1 1 8:0 1]
Apr 03 15:10:43 server6 multipathd[11593]: mpatha: event checker started
Apr 03 15:10:43 server6 multipathd[11593]: path checkers start up

So I guess I have to rethink my approach to dm-multipath and device name persistence if I want to avoid this error message.

The New Way of doing things

I was once told by a good friend: “when unsure, consult the documentation”. On my Linux system, my first stop is the man page. After a quick “man multipath.conf” I learned that getuid_callout is nowhere to be found. The SUSE note mentioned a new attribute, named uid_attribute, which seems to have taken its place:

uid_attribute:
The udev attribute providing a unique path identifier. Default value is ID_SERIAL

I also found this valuable information about the “multipaths” section:

This section defines the multipath topologies. They are indexed by a World Wide Identifier (wwid), which is taken to be the value of the udev attribute given by the uid_attribute keyword.

Aha! The new model is potentially much more flexible than the old one.
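
So the modern equivalent of the old defaults section looks something like the snippet below; a minimal sketch, and since ID_SERIAL is the default value anyway, you would normally set uid_attribute only to override it:

defaults {
[...]
    # replaces getuid_callout; the value is the name of a udev attribute
    uid_attribute "ID_SERIAL"
[...]
}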

Controlling aliases with multipath {} sections

I really like to assign alias names to LUNs mapped to my database systems. This way, I know that a device named “/dev/mapper/oracle_asm_data_001” is most likely used as the first ASM disk in disk group DATA. If I see /dev/mapper/mpatha instead, I know precisely nothing about its purpose.

Giving LUNs a name is not hard: you can do this in the multipaths {} section of /etc/multipath.conf. I really like doing this for the reasons just mentioned. Which options do I have?

The easiest one is quite simply a change from getuid_callout to uid_attribute, or its removal. Thankfully, at least in my case, the output of scsi_id and the ID_SERIAL attribute provided by udev are nearly identical:

[root@server6 ~]# /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sda
0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0
[root@server6 ~]# udevadm info /dev/sda | grep ID_SERIAL
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0
E: ID_SERIAL_SHORT=drive-scsi0-0-0-0
[root@server6 ~]# 

Note that I included the --replace-whitespace option in the scsi_id command, otherwise the output wouldn’t be identical. This is a concession to my lab; on real hardware I haven’t seen white space in WWIDs (which doesn’t mean there isn’t any). Removing the getuid_callout from the defaults {} section was actually enough to create a persistent device name.

[...]
multipaths {

  multipath {
    wwid  "0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0"
    alias "oracle_asm_data_001"
  }
}

And yes, I get my alias after restarting the multipathing daemon.

Apr 03 16:40:49 server6 systemd[1]: Starting Device-Mapper Multipath Device Controller...
Apr 03 16:40:49 server6 systemd[1]: Started Device-Mapper Multipath Device Controller.
Apr 03 16:40:49 server6 multipathd[12271]: oracle_asm_data_001: load table [0 4194304 multipath 0 0 1 1 service-time 0 1 1 8:0 1]
Apr 03 16:40:49 server6 multipathd[12271]: oracle_asm_data_001: event checker started
Apr 03 16:40:49 server6 multipathd[12271]: path checkers start up

[root@server6 ~]# multipath -l
oracle_asm_data_001 (0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-0) dm-3 QEMU    ,QEMU HARDDISK   
size=2.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 0:0:0:0 sda 8:0 active undef unknown
[root@server6 ~]# 

There is of course no second path, and there is a fair bit of information missing that you would otherwise get from an enterprise deployment, but you get the idea.

Answer: Anything Wrong With Query Performance? (Red Right Hand)

I of course attract a highly clever readership :). As some have commented, for a single table to require 1000+ consistent gets to retrieve 1000 rows implies that each row needs to be accessed from a different block. This in turn implies the Clustering Factor for this index to be relatively bad and the associated […]

exp catch

No one should be using exp/imp to export and import data any more; they should be using the Data Pump equivalents expdp/impdp. But if you’re on an older (pre-12c) version of Oracle and still using exp/imp to do things like moving tables with their production statistics over to test systems, then be careful that you don’t fall into an obsolescence trap when you finally upgrade to 12c (or Oracle 18).

exp/imp will mess up some of your histograms if you’re still using them to move tables/statistics in 12c.

Remember that 12c can create “Top-N” (top-frequency) and “hybrid” histograms, and exp/imp were written long before these new histogram types came into existence. The code has not been updated to allow for the new histogram types, so if you happen to generate any histograms of these types in a 12c system and then use exp/imp to move some table stats (and it’s particularly an issue relating to stats) from one system to another, the stats that arrive at the destination system won’t match the stats that left the source system.
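
If you cannot yet switch the whole transfer to expdp/impdp, a safer route is to stage the statistics in a dbms_stats staging table and move that table with Data Pump; here is a hedged sketch (the staging table name is a placeholder):

rem     On the source system: stage the statistics in a regular table.
begin
        dbms_stats.create_stat_table(user, 'MY_STATTAB');
        dbms_stats.export_table_stats(
                ownname => user,
                tabname => 'T1',
                stattab => 'MY_STATTAB'
        );
end;
/

rem     Move MY_STATTAB with expdp/impdp, then on the destination:
begin
        dbms_stats.import_table_stats(
                ownname => user,
                tabname => 'T1',
                stattab => 'MY_STATTAB'
        );
end;
/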

Here’s a little sample code to build a model that I can use to demonstrate the problem. It creates a table with three columns that will make it easy for me to create one frequency histogram, one Top-N histogram and one hybrid histogram. I’ve included a couple of substitution variables in the code so that you can specify an Oracle instance to connect to and a directory for the export file that expdp is going to produce. Don’t forget to check that the directory I create in this script doesn’t overwrite a directory that already exists for other reasons on your test system.


rem
rem     Script:         12c_histograms.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2015
rem
rem     Define m_service to be the service name you connect to
rem     Define m_directory to be the O/S directory you want to use
rem     for the export/import/log files
rem
rem     Make sure this code is not over-writing an existing 
rem     definition for a directory called DMPDIR before you 
rem     start


define m_service = 'orcl'
define m_service = 'or32'

define m_directory = '/mnt/working'

host rm &m_directory/expdat.dmp
host rm &m_directory/expdp.dmp

create or replace directory dmpdir as '&m_directory';

drop table t1 purge;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        trunc(sqrt(rownum + 0)) frequency,
        trunc(sqrt(rownum + 0)) top_n,
        trunc(sqrt(rownum + 0)) hybrid
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for columns frequency size 254 for columns top_n size 95 for columns hybrid size 50'
        );
end;
/

select
        column_name,
        num_distinct,
        histogram,
        num_buckets
from
        user_tab_cols
where
        table_name = 'T1'
;

column endpoint_actual_value format a22
break on column_name skip 1

select 
        column_name, 
        endpoint_number, endpoint_value, endpoint_actual_value, 
        endpoint_repeat_count 
from 
        user_tab_histograms 
where 
        table_name = 'T1'
order by 
        column_name, endpoint_number
;

Here’s an extract, from a 12.1.0.2 instance, of the results of the two queries with a large number of the rows from the histogram data deleted:


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 TOP-FREQUENCY           95
HYBRID                        100 HYBRID                  50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1 1                                          0
                                   8              2 2                                          0
                                  15              3 3                                          0
...
                                9800             98 98                                         0
                                9999             99 99                                         0
                               10000            100 100                                        0


HYBRID                             3              1 1                                          3
                                 224             14 14                                        29
                                 440             20 20                                        41
...
                                9603             97 97                                       195
                                9800             98 98                                       197
                               10000            100 100                                        1

TOP_N                              1              1 1                                          0
                                  16              7 7                                          0
                                  33              8 8                                          0
...
                                9753             98 98                                         0
                                9952             99 99                                         0
                                9953            100 100                                        0

The most important detail is the endpoint_repeat_count column of the hybrid histogram, although you should note that the endpoint_actual_value column is populated with a copy of the endpoint_value for all three histograms.

Now I’m going to use exp / drop table / imp (or the datapump equivalents) to export, drop, and re-import the table with (one hopes) the exact same statistics. To do this I’ll be using the imp command with the option “statistics=always” with the intention of copying the stats from the export file into the destination database (you’ll have to substitute your own userid/password, of course):


host exp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)'
-- host expdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

drop table t1 purge;

host imp   userid=test_user/test@&m_service file=expdat.dmp tables='(t1)' statistics=always
-- host impdp userid=test_user/test@&m_service DIRECTORY=dmpdir DUMPFILE=expdp.dmp TABLES='(t1)'

So what do we see now when we re-run the two queries to report the histogram information?


COLUMN_NAME              Distinct HISTOGRAM          Buckets
-------------------- ------------ --------------- ----------
FREQUENCY                     100 FREQUENCY              100
TOP_N                         100 FREQUENCY               95
HYBRID                        100 FREQUENCY               50

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE  ENDPOINT_REPEAT_COUNT
-------------------- --------------- -------------- ---------------------- ---------------------
FREQUENCY                          3              1                                            0
                                   8              2                                            0
                                  15              3                                            0
...
                                9800             98                                            0
                                9999             99                                            0
                               10000            100                                            0

HYBRID                             3              1                                            0
                                 224             14                                            0
                                 440             20                                            0
...
                                9603             97                                            0
                                9800             98                                            0
                               10000            100                                            0

TOP_N                              1              1                                            0
                                  16              7                                            0
                                  33              8                                            0
...
                                9753             98                                            0
                                9952             99                                            0
                                9953            100                                            0

– The histograms on all three columns are now labelled as FREQUENCY.
– The endpoint_actual_value is null for all three – but that may be a purely cosmetic detail with no side effects.
– The “hybrid” column really has become a frequency histogram – and that’s the critical one – the endpoint_repeat_count columns are all zero.

tl;dr

If you’re still using exp/imp instead of expdp/impdp to move tables (and, more importantly, their statistics) from one database to another, then the upgrade to 12c may mean you end up with hybrid (and top-frequency) histograms on the source system that are “downgraded” to frequency histograms on the destination system, with the effect that execution plans vary between the two systems.
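If you want to know in advance whether you’re exposed, a quick check on the source system is to list any histograms of the new types before moving stats around. A minimal sketch (assuming you have access to dba_tab_cols; adjust the owner filter to taste):

select
        owner, table_name, column_name, histogram
from
        dba_tab_cols
where
        histogram in ('HYBRID', 'TOP-FREQUENCY')
order by
        owner, table_name, column_name
;

Any rows returned represent histograms that exp/imp would silently turn into simple frequency histograms on the destination.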

A look into oracle redo, part 9a: commit – concurrency considerations

During the investigations of my previous blogpost about what happens during a commit and when the data becomes available, I used breaks in gdb (GNU debugger) at various places of the execution of an insert and a commit to see what is visible for other sessions during the various stages of execution of the commit.

However, I did find something else, which is very logical but easily overlooked: at certain moments access to the table is blocked/serialised in order to let a session make changes to blocks belonging to the table, or to peripheral blocks like undo, for the sake of consistency. These changes are made at the physical layer of an Oracle segment; the logical model of Oracle says that writers don’t block readers.

The first question is whether this is a big issue. Well, for that you should simply look at any normal usage of an Oracle database. I don’t believe these serialisation moments are an issue, because the majority of my clients do not experience serialisation problems.

However, this can become an issue if database processes cannot run at will. What that means is that if processes are randomly stopped from execution by the operating system or hypervisor, a database process could be stopped at one of the ‘serialisation points’ mentioned earlier, which means that access for all other processes remains blocked until the stopped process is made running again and gets the opportunity to release its exclusive access to the resource.

What can prevent a database process from running at will?
– CPU oversubscription. If more processes are runnable than CPUs are available (think about cores and threads, and what they mean for your CPU architecture), the operating system needs to decide which processes get to run and which have to wait. No operating system scheduler understands or can know when an Oracle database process is running in a critical part of code and therefore should not be stopped.
– Memory oversubscription. Whenever your system starts actively swapping, it is moving active pages onto the swap device and reading active pages back from the swap device for the processes that demand them. This activates additional tasks, and makes processes get stuck waiting for memory to become available at all kinds of points during execution. Often, swapping is the beginning of the end, meaning the services on the server stop functioning altogether for multiple reasons, like the Out Of Memory killer, or clients of a process simply timing out.
– Oversubscription of CPU or memory using virtualisation. As a troubleshooter, I would say this is actually worse than the ‘simple oversubscription’ mentioned in the two earlier points, because typically only the virtual machine is visible, and the virtual machine itself is not oversubscribed: the oversubscription takes place at a layer invisible to the virtual machine, which then inexplicably performs unpredictably. It is simply not okay to oversubscribe any machine that needs to run something as latency sensitive as the Oracle database, unless you fully understand all the trade-offs that come with it and have the ability to detect when it starts to occur.

Whenever database processes cannot run at will, you will see waits you didn’t see when processes could run at will. Typical waits in that case are (not an exhaustive list):
– any latch waits
– buffer busy waits
– any cursor: pin waits
Please mind these waits are not unique to this issue; any of these waits can occur for other reasons too. A quick way to check for them is shown below.
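If you want a quick impression of whether sessions are currently stuck in any of these waits, a simple sketch against v$session could look like this (the event names are standard; the selection of events is just my choice):

select
        event, count(*) sessions
from
        v$session
where
        state = 'WAITING'
and     ( event like 'latch%'
       or event like 'buffer busy%'
       or event like 'cursor: pin%' )
group by
        event
order by
        sessions desc
;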

Let me show a couple of examples in which the wait is artificially created by stopping execution at a point where this serialisation takes place, thereby mimicking a process getting put off CPU due to load, using the earlier insert and commit execution:

a) latch: cache buffers chains
Whenever a process needs to read or modify a database buffer, it needs to ‘pin’ the buffer. Such a pin is also known as a buffer handle or a buffer state object. The pin, which is a memory area, must be obtained from a freelist and attached to the buffer header in order to pin the buffer. A pin is obtained using the kcbzgs (kernel cache buffers helper functions get state object) function, which calls the kssadf_numa_intl (kernel service state objects add from freelist numa internal) function which initialises the state object.

The function that performs the actual pin of a block is not a single one; pinning a block is done in multiple functions. Two of them are kcbgtcr (kernel cache buffers get consistent read) and kcbgcur (kernel cache buffers get current). Under normal circumstances, concurrency for a consistent read of a buffer (kcbgtcr) does not lead to any blocking, because the latch (cache buffers chains) can be taken in shared mode, and the pinning is done in ‘fast mode’ (my interpretation):

> kcbgtcr(0x7f88a8b79db0, 0, ...)
| > kscn_to_ub8_impl(0x7f88a8b88d4c, 0x6baefdc0, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0x287562
| > ktrexf(0x7fff998123f8, 0x7f88ae5d7f30, ...)
| | > ktrEvalBlockForCR(0x7f88a8b88d4c, 0x7f88ae5d7f30, ...)
| | < ktrEvalBlockForCR+0x0000000000f4 returns: 0x1
| < ktrexf+0x0000000000a4 returns: 0x9
| > kcbzgsf(0x1, 0x7fff998120b8, ...)
| | > kssadf_numa_intl(0x36, 0x6ddc0b50, ...)
| | | > kpdbIdToUid(0, 0x6ddc0b50, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x6bf84908
| < kcbzgsf+0x0000000001a0 returns: 0x6bf84908
| > kcbz_fp_cr(0x8dff9478, 0x6bf84998, ...)
| < kcbz_fp_cr+0x000000000089 returns: 0
| > kscn_to_ub8_impl(0x7f88a8b88d70, 0x1, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0
< kcbgtcr+0x000000000b38 returns: 0x8df94014

Inside the kcbgtcr function, kcbzgsf (kernel cache buffers helper functions get state object fast) is called, and after that, the state object is pinned to the block in kcbz_fp_cr (kernel cache buffers helper functions fast pin for consistent read). Do you see the absence of a ksl_get_shared_latch call to serialise adding this pin? The cache buffers chains latch is actually taken, but it’s done inside the kcbgtcr function itself. The latch is taken in shared mode before ktrexf (kernel transaction redo examine block fast), and released after the kcbz_fp_cr function. Multiple processes can pin the block for consistent read without serialising or blocking each other, because the cache buffers chains latch is a shared latch and it is taken in shared mode.

Why do I show this? Well, in the “old” days, meaning Oracle 8.1, Oracle did not use shared latches (at least on the platforms I worked on; if my memory serves me right, shared latches were introduced in Oracle 9), which meant that for any logical read the cache buffers chains latch had to be obtained. This could get painful for very frequently visited blocks, like index root blocks, under high concurrency; in addition, because there was no post/wait messaging for latch waits but only spinning, it meant lots of processes waiting for latch number 98, and CPU usage going through the roof.

Back to today. Below is how the function calls look when DML is done. When a buffer change is done (data is changed), the buffer has to be obtained in current mode, which is done using the kcbgcur function:

> kcbgcur(0x7ffda5957c70, 0x2, ...)
| > kcbzgs(0x1, 0x2, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kti_is_imu(0x7867ca50, 0xb0f71018, ...)
| | > ktcgtx(0x76f29d50, 0xb0f71018, ...)
| | < ktcgtx+0x00000000001e returns: 0x76f29d50
| < kti_is_imu+0x000000000039 returns: 0
< kcbgcur+0x0000000009fb returns: 0xb0302014

This looks much the same as the previous call overview, which showed kcbgtcr; however, the kcbzgsf function changed to kcbzgs, so minus the ‘f’ for fast, and a few other functions are missing. Another interesting thing (not visible) is that the cache buffers chains latch is obtained after the kcbzgs function that obtains the buffer state object, and the latch is obtained with the special bit 0x1000000000000000 set, to indicate the shared latch is taken in non-shared mode. After the kti_is_imu function the pin is attached to the buffer and the latch is freed.

To make things a little more complicated, a buffer can be obtained in current mode, but with the cache buffers chains latch still taken in shared mode. This is how that looks:

> kcbgcur(0x7ffda5955620, 0x1, ...)
| > kcbzgs(0x1, 0x1, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kcbz_fp_shr(0xb0fa3cb8, 0x7867ca50, ...)
| < kcbz_fp_shr+0x000000000085 returns: 0x7f3011d82f40
< kcbgcur+0x0000000009fb returns: 0xb07a0014

The function kcbzgs without the ‘f’ is called, but the cache buffers chains latch is taken in shared mode (not visible; done in the kcbgcur function after kcbzgs), and there’s the function kcbz_fp_shr (kernel cache buffers helper function fast pin shared) to pin the buffer in shared mode.

In order to be allowed to change a buffer, the cache buffers chains latch must be taken in non-shared mode, to change the buffer header to indicate the buffer is busy and to guarantee only the process itself can access it. Obtaining the cache buffers chains latch in non-shared mode means that access to the hash bucket (multiple hash buckets, actually) to which the buffer is linked is blocked until the latch is freed. Please mind the period for which the latch is held here is very short.
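To get an impression of how contended this latch is on your own system, you can look at the sleep figures in v$latch; a minimal sketch (the figures are cumulative since instance startup):

select
        name, gets, misses, sleeps, immediate_gets, immediate_misses
from
        v$latch
where
        name = 'cache buffers chains'
;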

However, this blocking scenario can be easily replayed:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on the kti_is_imu function (break kti_is_imu) and continue.
4. session 1: insert a row into a table (insert into test values (‘a’);). The breakpoint will fire in gdb, stopping execution.
5. session 3: startup another database foreground session.
6. session 3: select the table on which the insert just broke (select * from test;). This will hang.
7. session 4: startup a SYSDBA session, and obtain the wait state of the foreground sessions:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------
59,53450,@1     28166    SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:98
102,53150,@1    28083    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:66867.83ms,event:ON CPU:SQL,seq#:33
101,6637,@1     28186    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:344013.82ms,event:latch: cache buffers chains,p1:1811473056,p2:289,p3:24599 > Blocked by (inst:sid): 1:102,seq#:39

SID 101 is waiting on the cache buffers chains latch, because SID 102 is holding that. SID 102 is showing ON CPU because we broke execution outside of a wait event, so Oracle thinks it’s busy executing, while it is actually stopped by gdb.

Again, this is a wait (latch: cache buffers chains) that is unlikely to be an issue, and if it is, you either ran into a bug, or it is only the way a much bigger problem (oversubscription, bad application design) is showing itself.

b) buffer busy waits
The name of the wait event ‘buffer busy wait’ is self-explanatory. The technical implementation is far less well known. What technically happens during a buffer busy wait is that a session needs to read a certain buffer, and in order to read it, it performs the usual actions of selecting the correct cache buffers chains bucket, following the doubly linked list to the buffer headers to find the best buffer, and then obtaining a buffer state object. The critical part for running into the buffer busy wait is when the session checks the state of the buffer via the buffer header and finds the change state (X$BH.CSTATE) to be higher than 0, indicating the block is currently being changed. At this point the session calls kcbzwb (kernel cache buffers helper function wait for buffer) and must wait for the buffer state to be changed back, and the buffer therefore to be accessible again; the time waited is accounted to the buffer busy wait event. When the change state is reverted to 0, the buffer is accessible again.
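This change state can be inspected directly (as SYSDBA) in X$BH; a simplified version of the query I use later in this post, just listing buffers that are currently being changed:

select
        dbablk, class, state, mode_held, cstate
from
        x$bh
where
        cstate > 0
;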

Once a buffer is pinned in current mode using kcbgcur and kcbzgs, the buffer is still fully accessible for all sessions. Only once the database truly starts to change the buffer, which is done in kcbchg1_main, is the cache buffers chains latch taken non-shared, the CSTATE field in the buffer header changed to 1 and the latch freed again. My current investigations show that CSTATE is changed to 2 after the kcopcv and kcrfw_copy_cv functions, so when the redo has been copied into the public redo strand, and set to 4 after the changes have been applied to the buffers, which is done in kcbapl. At the end of the kcbchg1_main function, when all the changes are done, the cache buffers chains latch is taken non-shared again, the CSTATE field is set to 0 and the latch is freed to enable access to the buffer again.

68801:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0/0()
68802:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0x1000000000000022/1152921504606847010()
..
68897:kcbchg1_main+3318:0x00000000967eddda(Variable Size(pgsz:2048k)|+901701082 buffer header|9668A000+66 ):W:1:0x1/1()
..
68939:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0x1000000000000022/1152921504606847010()
68940:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0/0()

Above you see a cache buffers chains latch being obtained non-shared (special bit 0x1000000000000000 is set); a little later the buffer header is updated, which sets the value 1 at offset 66. Once this happens, it is visible in the CSTATE column of X$BH. A little further on, the latch is freed.

Why am I showing all of this? Well, there is quite some work being done while the buffer is in a changed state (CSTATE>0) and therefore inaccessible. In normal situations this is not a problem, because despite all the work it is done very quickly, and is therefore hardly noticeable or measurable, so there’s little to no time spent in buffer busy waits in most databases.
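You can verify this for your own database by looking at the cumulative figures for the wait event; a minimal sketch against v$system_event (no rows returned simply means the event never occurred since startup):

select
        event, total_waits, time_waited_micro
from
        v$system_event
where
        event = 'buffer busy waits'
;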

However… if processes do get stuck randomly, it’s possible that a process gets stuck while making changes to a buffer, which means any additional access to the buffer results in buffer busy waits. Also mind that a change to a table buffer requires (at least, in the simplest case) three buffers to be changed: the buffer containing the table data, the buffer containing the undo segment transaction table and the buffer containing the actual undo. A commit requires one buffer: the undo segment transaction table.

Let me show you an example:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on ktugur and kcopcv and continue.
4. session 3: startup a database foreground session.
5. session 1: insert a row into a table: insert into test values (‘a’);
6. session 2: gdb breaks execution of session 1 because it encountered ktugur.
7. session 3: select * from test; this runs unblocked.
8. session 4: startup a database foreground session as sysdba.
9. session 4: select * from v$lock; this will hang.

This one wasn’t obvious; I found it while doing the tests. When a foreground session is in ktugur, it is creating the change vectors (kernel transaction undo generate undo and redo), and it holds two buffers in current mode:

SQL> l
  1  select 	ts.name tablespace,
  2  	dbablk,
  3  	class,
  4  	state,
  5  	decode(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',6,'irec',7,'write',8,'pi', 9,'memory',10,'mwrite',11,'donated', 12,'protected',  13,'securefile', 14,'siop',15,'recckpt', 16, 'flashfree',  17, 'flashcur', 18, 'flashna') state_decoded,
  6  	mode_held,
  7  	changes,
  8  	cstate,
  9  	flag,
 10  	flag2,
 11  	rflag,
 12  	sflag,
 13  	lru_flag,
 14  	dirty_queue
 15  from 	  x$bh b
 16  	, v$tablespace ts
 17  where 	1=1
 18  and	not (us_nxt = us_prv and wa_nxt = wa_prv and to_number(wa_nxt,'xxxxxxxxxxxxxxxx') = to_number(us_nxt,'xxxxxxxxxxxxxxxx') + 16)
 19* and	b.ts#=ts.ts#
SQL> /

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         17          1 xcur                2          1          0    2097152          0          0          0          0           0
TS                                  24597          1          1 xcur                2          1          0    2097153          0          0          0          0           0

The thing that was not obvious at first was that the blocks have CSTATE 0, so they can be accessed and read for a consistent read. Then I looked at the wait and the p1/2/3 of the wait:

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------------------------------------------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:2519.48ms,event:buffer busy waits,p1:3,p2:128,p3:17 > Blocked by (inst:sid): 1:59,seq#:476
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:47115.23ms,event:SQL*Net message from client,seq#:226
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3165
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:31292.51ms,event:ON CPU:SQL,seq#:111

Above, SID 100 is session 4; it hangs in a buffer busy wait. The datafile is #3, which is the undo tablespace UNDOTBS1, and the buffer it is waiting for is block 128. Why does SID 100 need to wait for the buffer, while another session (session 3 in the above runbook) can run without getting blocked for a buffer that is held in exactly the same state (xcur and CSTATE zero)?

The answer can be obtained by executing pstack on the waiting process (or attaching with gdb and obtaining a backtrace):

(gdb) bt
#0  0x00007fd6a085bbda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000015c5e5e in kcbzwb ()
#6  0x0000000010b3748e in kcbgcur ()
#7  0x0000000010ab9b17 in ktuGetUsegHdr ()
#8  0x00000000014075d3 in ktcxbcProcessInPDB ()
#9  0x0000000001406de2 in ktcxbc ()
#10 0x0000000010ea8781 in qerfxFetch ()
#11 0x0000000010bb919c in rwsfcd ()
#12 0x00000000035ae618 in qeruaFetch ()
#13 0x00000000035ad837 in qervwFetch ()
#14 0x0000000010bb919c in rwsfcd ()
#15 0x0000000010ea1528 in qerhnFetch ()
#16 0x0000000010bb919c in rwsfcd ()
#17 0x0000000010ea1528 in qerhnFetch ()
#18 0x0000000010c9eeb3 in opifch2 ()
#19 0x000000000268c3de in kpoal8 ()
#20 0x0000000010ca5b3d in opiodr ()
#21 0x0000000010f37a29 in ttcpip ()
#22 0x0000000001eb4674 in opitsk ()
#23 0x0000000001eb914d in opiino ()
#24 0x0000000010ca5b3d in opiodr ()
#25 0x0000000001eb04ed in opidrv ()
#26 0x0000000002afedf1 in sou2o ()
#27 0x0000000000d05577 in opimai_real ()
#28 0x0000000002b09b31 in ssthrdmain ()
#29 0x0000000000d05386 in main ()

The important bit is at #6: kcbgcur (kernel cache buffers get buffer in current mode)! In other words: the block is being obtained in current mode while another session is holding it in current mode, which means it has to wait. The function kcbzwb means kernel cache buffers helper function wait for buffer.

I really wondered why the undo segment buffer is obtained in current mode; this is a read, there is no intention to change anything. My current educated guess is that this is a way to be sure the buffer contains the current state of the transactions across all nodes in a RAC cluster. So this might be a clever optimisation to be absolutely sure the current state of the undo segment transaction table is read.

10. session 2: continue. This will break on kcopcv.
11. session 3: select * from test; this will hang.

Now we have two hanging sessions waiting for the wait event buffer busy wait:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -----------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:120790ms,event:buffer busy waits,p1:3,p2:128,p3:35 > Blocked by (inst:sid): 1:59,seq#:501
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:1175715.03ms,event:buffer busy waits,p1:4,p2:24597,p3:1 > Blocked by (inst:sid): 1:59,seq#:250
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3309
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:9544.97ms,event:ON CPU:SQL,seq#:116

SID 100 still hangs in a buffer busy wait for file 3, block 128, but now the session doing the full table scan also hangs in a buffer busy wait, this time for file 4, block 24597. Let’s look at the buffers held in current mode again:

SQL> @pinned_blocks

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         35          1 xcur                2          0          1    2097156          0          0          0          0           0
UNDOTBS1                             3127         36          1 xcur                2          0          1    2097156          0          0          0          0           0
TS                                  24597          1          1 xcur                2          0          1    2097157          0          0          0          0           0

Because the insert progressed to kcopcv, the CSTATE of all three buffers involved in the insert (the undo segment transaction table, the undo segment undo entry and the buffer containing the inserted row) is set to 1. This means the two buffer busy waits actually have two different causes: the select * from v$lock still waits to get a buffer in current mode, while the select * from test, which does a consistent read, is waiting for CSTATE to be set back to 0. In order to resolve the hang, go to session 2 and continue. This will break again on kcopcv, and again if you continue, because there are 3 blocks involved in this insert. Another way to make the insert finish is to disable all breakpoints (the disable command) or to quit gdb.

Just for the fun of it, here is another example of how a stuck session can escalate: a process that gets stuck executing kcbapl (kernel cache buffers apply a change to a buffer). This scenario can be replayed using the following steps:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on kcbapl and continue.
4. session 3: startup a database foreground session.
5. session 3: insert a row into a table: insert into test values (‘a’);
6. session 1: insert a row into a table: insert into test values (‘a’);
7. session 2: gdb breaks execution of session 1 because it encountered kcbapl.
8. session 3: commit; this will hang.

Can you reason why session 3 is hanging? Both sessions are doing inserts into the same table… Give it a thought and replay it yourself.

Explanation: the breakpoint stopped execution before the changes were actually made. CSTATE for the buffer is set to 1, and it will get set to 2 inside kcbapl to indicate the change vectors have been copied. But the important bit is that a redo copy latch was taken before executing kcbapl. The commit in session 3 sent a message to the logwriter asynchronously asking it to write (the first invocation of kcrf_commit_force_int), then freed the logical locking of the insert, and then invoked kcrf_commit_force_int again to wait for the logwriter to finish writing and get notified. However, the logwriter needs all the redo copy latches to be free (it actually reads a metadata structure that indicates whether the redo copy latches are held or not), so it has to wait, because session 1 is stuck holding a redo copy latch. Not only does this show how dangerous it is when sessions can’t run, it also shows how dangerous it can be to run gdb against a production database: it can be quite hard to figure out what the implications of attaching gdb to a process will be. So the actual outcome of the above is that session 3 waits on the event ‘log file sync’, while the logwriter is waiting on ‘LGWR wait for redo copy’.
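While this scenario is in progress, the combination of both waits can be seen from the SYSDBA session; a simple sketch against v$session (the event names are exactly as mentioned above):

select
        sid, program, event, seconds_in_wait
from
        v$session
where
        event in ('log file sync', 'LGWR wait for redo copy')
;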

Summary
The Oracle database is carefully set up to handle concurrency. In most normal situations you will see no, or maybe very little, time spent in wait events that indicate concurrency, like latches or buffer busy waits. If an Oracle database is on a machine that has more active processes than available CPU threads or cores (this depends on the hardware and CPU architecture; for Intel based architectures you should lean towards cores, for SPARC you can lean more towards threads), the operating system needs to make choices, and will do that trying to give all processes an equal share. This means that processes will be put off CPU in order to give all processes a fair share of CPU run time. Also mind other oversubscription possibilities, like using more memory than physically available, and doing the same using virtualisation: handing out more vCPUs than CPU threads, and handing out more memory than physically available. Because you can’t look outside your virtual machine, it’s often harder to troubleshoot virtualisation oversubscription issues.

Oracle really tries to minimise the amount of time spent holding a non-shared latch. Therefore, with modern Oracle databases it’s unlikely that latch waits show up, even if a system is (not too heavily) oversubscribed. If you do have a latching problem, it’s most likely you have another problem which causes it.

The most well documented reason for buffer busy waits is a buffer being written to persistent media, the buffer being marked busy for the duration of the write. Another documented reason is requesting a buffer in a mode that is not compatible with the mode it is held in: having a buffer in current mode will block another request for the buffer in current mode. But having a buffer pinned in current mode does not block consistent read access; it’s only when the buffer is being changed that it has to be marked busy. This state is externalised in the CSTATE column of the view X$BH, which shows fields of the buffer header struct used to manage the buffer cache buffers. Because there is quite some work to be done when a buffer is changed, and while these changes are being made the buffer is not usable (‘busy’) for other processes, a buffer busy wait is more likely to occur when oversubscription is taking place.

Thanks to Jonathan Lewis for explaining how to get the buffer header addresses.
Thanks to Abel Macias for brainstorming.