Oakies Blog Aggregator

Advanced usage of gdb for profiling

This post is about how to use gdb, a debugger (very simplistically put, an aid for looking at C programs), as a profiler. I use gdb quite a lot for profiling because, for me, it’s the easiest way to profile.

Lots of people whom I know use other tools like perf, systemtap and dtrace for the same purpose, and that’s fine. Each tool has its own advantages and disadvantages. One disadvantage of gdb is that it uses ptrace to attach to a process, which makes the traced process dead slow from a machine perspective, because everything it then does goes via another process, the debugger. That is simply how a debugger works.

Also, lots of people use gdb like I do, using basic functionality: breaking at functions, which makes it possible to find out the sequence in which functions are called, generating backtraces (stack traces) to understand the stack, and maybe looking at function arguments.

But there is more that you can do. Way more actually. These examples use the gdb that comes with RHEL7/OL7 via yum.

You can break conditionally at a function, which means that a break will be executed only if the condition is met. For example:

break sltrgatime64 if $caller_is("ksbabs")

This only breaks at a function called sltrgatime64 if it is called from a function that is called ksbabs.

But for what I wanted to do and investigate, I didn’t only want to do “something” when the sltrgatime64 function was called from ksbabs, I wanted to be able to distinguish between the first and the second of the two times it was called.

I knew that the “first time” was actually right after a function with the name of “dbktFlush” was called. So here’s what I did:

set $status = -1

break dbktFlush if $caller_is("ksbabs")
commands
 silent
 set $status = 0
 c
end

break sltrgatime64 if $caller_is("ksbabs")
commands
 silent
 if $status == 0
   set $status = 1
   printf "first invocation\n"
   c
 end
 if $status == 1
  set $status = 2
  printf "second invocation\n"
  c
 end
 if $status > 1
  printf "error. third or more invocations\n"
 end
end

This is much more advanced than using a single command, but I think that for anyone who is able to program in a few languages it is reasonably simple to understand.

There are two breaks, and both are conditional: the caller of the function to break on must be ksbabs.
In the dbktFlush break, the (convenience) variable $status is set to 0, indicating to the other break that the first invocation is coming up.

In the second break, there are three if statements. The first one picks up the variable $status set to 0, indicating the first invocation, and sets it to 1 for the second invocation. The second if picks up the second invocation, indicated by $status set to 1, and sets it to 2, so that if I didn’t look carefully and there are more invocations of sltrgatime64, I will get notified.

If the breaks were activated at a point where the second call to sltrgatime64 was the next one to be executed, it would not have come across dbktFlush, and thus the $status variable would (still) be set to -1, and the break would skip all the ifs.

Of course this is a very specific case; you should tailor it to your own needs. However, this was the actual investigation I did to find out the purpose of the above sltrgatime64 calls.

In fact, I went a little deeper and wanted to understand what the timing of these two reads of the wall clock time did. For that, I used the python interface in gdb, support for which seems to be compiled into the rpm-based version.

You can use python, and python variables keep their state between gdb interactions. I used a dead simple approach: calling time.time() to time the invocation of sltrgatime64 that is executed right after dbktFlush is called:

python import time
python var_time=time.time()

break dbktFlush if $caller_is("ksbabs")
commands
 silent
 set $status = 0
 c
end

break sltrgatime64 if $caller_is("ksbabs")
commands
 silent
 if $status == 0
   set $status = 1
   python print time.time()-var_time
   python var_time=time.time()
   c
  else
   c
 end
end

If you set single logwriter to true and execute this against the logwriter, you will see that the logwriter tries to keep a 3 second pace by adjusting the sleep time on semtimedop().
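The logwriter posts further down in this aggregation mention the hidden parameter involved; a minimal sketch for checking whether an instance currently runs in single logwriter mode (assuming SYS access, since it queries x$ views, and treating the parameter name purely as an illustration) could be:

-- show the current value of the hidden single logwriter parameter (connect as SYS)
select i.ksppinm  as parameter,
       v.ksppstvl as value
from   x$ksppi  i,
       x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_use_single_log_writer';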

My actual goal was to try to find out what is exactly timed for the message ‘Warning: log write elapsed time ????ms, size ?KB’ in the logwriter trace file, which Nenad Noveljic proved is not only about the actual IO time.

To be honest, I did it clumsily by going backward and measuring all the calls that measured wall clock time; I should have put a break on write and looked at the backtrace. When I executed a backtrace of the write call that wrote the trace file entry, I could see this line was written in kcrfw_postprocess_write().

This is the gdb script with the timing:

python import time

break skgstmGetEpochTs if $caller_is("kcrfw_redo_write_driver")
commands
 silent
 python var_time=time.time()
 c
end

break kcrfw_do_write
 commands
 silent
 print $rip
 c
end

break skgstmGetEpochTs if $caller_is("kcrfw_postprocess_write")
commands
 silent
 shell sleep 1
 python print time.time()-var_time
 c
end

What this gdb script does is always set the python variable var_time, which therefore includes the idle cycles of the logwriter. If it goes into the function kcrfw_do_write, it is actually going to write. That is also when, potentially, the line “Warning: log write elapsed time 1024ms, size 0KB” could be produced.
When the function skgstmGetEpochTs is called in the function kcrfw_postprocess_write, the script sleeps for 1 second, and then prints the current time minus the time obtained when kcrfw_redo_write_driver was entered.

What this proves is that the timing for the log write elapsed time warning starts at the start of the kcrfw_redo_write_driver function. That means that everything the logwriter needs to do is timed, excluding only the last bit the logwriter performs, posting the foregrounds; that includes dealing with the internal structures that keep track of the redo strands and the latching of them.

Get toast chunk_id from the user table tuples or from the toast index thanks to pageinspect

Introduction

TOAST stands for “The Oversized-Attribute Storage Technique” and allows large field values to be broken up into multiple physical rows (see the PostgreSQL documentation for more details).

The goal of this post is to provide a way to retrieve TOAST information from the user table tuples or from the toast index without querying the toast directly.

We will be able to link the user table tuples to the toast pages by using user table tuples and toast index data (not querying the toast at all).

Build the playground

Create a table with a TOAST-able field:

toasted=# CREATE TABLE bdttab (id int,message text,age int);
CREATE TABLE

toasted=# \d+ bdttab
                                   Table "public.bdttab"
 Column  |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
---------+---------+-----------+----------+---------+----------+--------------+-------------
 id      | integer |           |          |         | plain    |              |
 message | text    |           |          |         | extended |              |
 age     | integer |           |          |         | plain    |              |
Access method: heap

Insert 10 rows:

toasted=# INSERT INTO bdttab
SELECT 1,(SELECT
string_agg(chr(floor(random() * 26)::int + 65), '')
FROM generate_series(1,10000)),6
FROM generate_series(1,10);
INSERT 0 10

check the toast and toast index names:

toasted=# select r.relname,t.relname as toast,i.relname as toast_index from pg_class r, pg_class i, pg_index d, pg_class t where r.relname = 'bdttab' and d.indrelid = r.reltoastrelid and i.oid = d.indexrelid and t.oid = r.reltoastrelid;
 relname |      toast      |      toast_index
---------+-----------------+-----------------------
 bdttab  | pg_toast_192881 | pg_toast_192881_index
(1 row)

Retrieve the chunk_id for each tuple directly from the user table

The chunk_id is part of the tuple’s data as explained in this slide (coming from this presentation):

[Slide image: https://bdrouvot.files.wordpress.com/2020/01/toasted_attributes.png]

Note that the toast relation id is also part of the tuple data.

As you can see from the slide we can get the information with pageinspect, so let’s use it to build a query to retrieve the chunk_id and the toast relation id from the tuples:

create the extension:

toasted=# create extension pageinspect;
CREATE EXTENSION

and query the user table to get the information:

toasted=# select
page_item_attrs.t_ctid,
page_item_attrs.t_attrs[2],
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3) as substr_for_chunk_id,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3) as substr_for_toast_relid,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as toast_relid
FROM
generate_series(0, pg_relation_size('bdttab'::regclass::text) / 8192 - 1) blkno ,
heap_page_item_attrs(get_raw_page('bdttab', blkno::int), 'bdttab'::regclass) as page_item_attrs
where
substr(page_item_attrs.t_attrs[2]::text,3,2)='01';
 t_ctid |                t_attrs                 | substr_for_chunk_id | chunk_id | substr_for_toast_relid | toast_relid
--------+----------------------------------------+---------------------+----------+------------------------+-------------
 (0,1)  | \x0112142700001027000077f1020074f10200 | 77f10200            |   192887 | 74f10200               |      192884
 (0,2)  | \x0112142700001027000078f1020074f10200 | 78f10200            |   192888 | 74f10200               |      192884
 (0,3)  | \x0112142700001027000079f1020074f10200 | 79f10200            |   192889 | 74f10200               |      192884
 (0,4)  | \x011214270000102700007af1020074f10200 | 7af10200            |   192890 | 74f10200               |      192884
 (0,5)  | \x011214270000102700007bf1020074f10200 | 7bf10200            |   192891 | 74f10200               |      192884
 (0,6)  | \x011214270000102700007cf1020074f10200 | 7cf10200            |   192892 | 74f10200               |      192884
 (0,7)  | \x011214270000102700007df1020074f10200 | 7df10200            |   192893 | 74f10200               |      192884
 (0,8)  | \x011214270000102700007ef1020074f10200 | 7ef10200            |   192894 | 74f10200               |      192884
 (0,9)  | \x011214270000102700007ff1020074f10200 | 7ff10200            |   192895 | 74f10200               |      192884
 (0,10) | \x0112142700001027000080f1020074f10200 | 80f10200            |   192896 | 74f10200               |      192884
(10 rows)

As you can see, we are able to get the chunk_id and the toast relation id from the tuples.

Let’s verify that those values make sense, first checking the toast relation id:

toasted=# select relname from pg_class where oid=192884;
     relname
-----------------
 pg_toast_192881
(1 row)

and then checking the chunk_ids from the toast itself:

toasted=# select distinct(chunk_id) from pg_toast.pg_toast_192881;
 chunk_id
----------
   192894
   192895
   192889
   192888
   192893
   192892
   192896
   192891
   192887
   192890
(10 rows)

The values match the ones we got directly from the user table tuples.

Retrieve the chunk_id and chunk_seq directly from the toast index

The toast index contains this information, so let’s query it too:

toasted=# select
page_items.ctid,
page_items.data,
('x'||regexp_replace(substr(page_items.data,1,11),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
('x'||regexp_replace(substr(page_items.data,13,23),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_seq
FROM
generate_series(1, pg_relation_size('pg_toast.pg_toast_192881_index'::regclass::text) / 8192 - 1) blkno ,
bt_page_items('pg_toast.pg_toast_192881_index', blkno::int) as page_items;

  ctid  |          data           | chunk_id | chunk_seq
--------+-------------------------+----------+-----------
 (0,1)  | 77 f1 02 00 00 00 00 00 |   192887 |         0
 (0,2)  | 77 f1 02 00 01 00 00 00 |   192887 |         1
 (0,3)  | 77 f1 02 00 02 00 00 00 |   192887 |         2
 (0,4)  | 77 f1 02 00 03 00 00 00 |   192887 |         3
 (1,1)  | 77 f1 02 00 04 00 00 00 |   192887 |         4
 (1,2)  | 77 f1 02 00 05 00 00 00 |   192887 |         5
 (1,3)  | 78 f1 02 00 00 00 00 00 |   192888 |         0
 (1,4)  | 78 f1 02 00 01 00 00 00 |   192888 |         1
 (2,1)  | 78 f1 02 00 02 00 00 00 |   192888 |         2
 (2,2)  | 78 f1 02 00 03 00 00 00 |   192888 |         3
 (2,3)  | 78 f1 02 00 04 00 00 00 |   192888 |         4
 (2,4)  | 78 f1 02 00 05 00 00 00 |   192888 |         5
 (3,1)  | 79 f1 02 00 00 00 00 00 |   192889 |         0
 (3,2)  | 79 f1 02 00 01 00 00 00 |   192889 |         1
 (3,3)  | 79 f1 02 00 02 00 00 00 |   192889 |         2
 (3,4)  | 79 f1 02 00 03 00 00 00 |   192889 |         3
 (4,1)  | 79 f1 02 00 04 00 00 00 |   192889 |         4
 (4,2)  | 79 f1 02 00 05 00 00 00 |   192889 |         5
 (4,3)  | 7a f1 02 00 00 00 00 00 |   192890 |         0
 (4,4)  | 7a f1 02 00 01 00 00 00 |   192890 |         1
 (5,1)  | 7a f1 02 00 02 00 00 00 |   192890 |         2
 (5,2)  | 7a f1 02 00 03 00 00 00 |   192890 |         3
.
.
.
 (12,3) | 7f f1 02 00 02 00 00 00 |   192895 |         2
 (12,4) | 7f f1 02 00 03 00 00 00 |   192895 |         3
 (13,1) | 7f f1 02 00 04 00 00 00 |   192895 |         4
 (13,2) | 7f f1 02 00 05 00 00 00 |   192895 |         5
 (13,3) | 80 f1 02 00 00 00 00 00 |   192896 |         0
 (13,4) | 80 f1 02 00 01 00 00 00 |   192896 |         1
 (14,1) | 80 f1 02 00 02 00 00 00 |   192896 |         2
 (14,2) | 80 f1 02 00 03 00 00 00 |   192896 |         3
 (14,3) | 80 f1 02 00 04 00 00 00 |   192896 |         4
 (14,4) | 80 f1 02 00 05 00 00 00 |   192896 |         5
(60 rows)

Note that the chunk_ids coming from the index, the user table tuples and the toast itself match.

Use case example: a toast’s page is corrupted and I want to know which tuples from the user table are affected

Say page 12 of the toast is corrupted; then we could get the affected chunk_id and chunk_seq from the toast index by filtering on the ctid:

toasted=# select
page_items.ctid,
page_items.data,
('x'||regexp_replace(substr(page_items.data,1,11),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
('x'||regexp_replace(substr(page_items.data,13,23),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_seq
FROM
generate_series(1, pg_relation_size('pg_toast.pg_toast_192881_index'::regclass::text) / 8192 - 1) blkno ,
bt_page_items('pg_toast.pg_toast_192881_index', blkno::int) as page_items where ctid::text like '(12,%';

  ctid  |          data           | chunk_id | chunk_seq
--------+-------------------------+----------+-----------
 (12,1) | 7f f1 02 00 00 00 00 00 |   192895 |         0
 (12,2) | 7f f1 02 00 01 00 00 00 |   192895 |         1
 (12,3) | 7f f1 02 00 02 00 00 00 |   192895 |         2
 (12,4) | 7f f1 02 00 03 00 00 00 |   192895 |         3
(4 rows)

And then look back at the user table tuples by filtering on the chunk_id:

toasted=# select
page_item_attrs.t_ctid,
page_item_attrs.t_attrs[2],
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3) as substr_for_chunk_id,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3) as substr_for_toast_relid,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as toast_relid
FROM
generate_series(0, pg_relation_size('bdttab'::regclass::text) / 8192 - 1) blkno ,
heap_page_item_attrs(get_raw_page('bdttab', blkno::int), 'bdttab'::regclass) as page_item_attrs
where
substr(page_item_attrs.t_attrs[2]::text,3,2)='01' and ('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int = 192895;

 t_ctid |                t_attrs                 | substr_for_chunk_id | chunk_id | substr_for_toast_relid | toast_relid
--------+----------------------------------------+---------------------+----------+------------------------+-------------
 (0,9)  | \x011214270000102700007ff1020074f10200 | 7ff10200            |   192895 | 74f10200               |      192884
(1 row)

So we know that the tuple with ctid (0,9) is linked to the corrupted toast page.
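If you then want to fetch the affected row itself, a minimal sketch (assumption: select only the non-TOASTed columns, because reading the message column would try to detoast the corrupted chunks) is to filter on the ctid pseudo-column:

-- fetch the heap tuple identified above, avoiding the TOASTed column
select id, age
from   bdttab
where  ctid = '(0,9)';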

Remarks

  • t_attrs[2] is used in the query to retrieve the information from the user table. This is because the TOAST-able field (message) is the 2nd attribute of the relation (see the pg_attribute sketch after these remarks).
  • substr(page_item_attrs.t_attrs[2]::text,3,2)='01' is used to filter the user table tuples information. This is because 0x01 is used as the TOASTed string marker (see the slide).
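To double-check the attribute number and storage mode of the TOAST-able column mentioned in the first remark, a small pg_attribute query (a sketch, nothing specific to pageinspect) can be used:

-- list the user columns of bdttab with their attribute number and storage mode
select attnum, attname, atttypid::regtype as type, attstorage
from   pg_attribute
where  attrelid = 'bdttab'::regclass
and    attnum > 0
and    not attisdropped;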

Conclusion

Thanks to pageinspect we have been able to retrieve the chunk_id directly from the user table tuples. We also have been able to link the user table tuples to the toast pages by using user table tuples and toast index data (not querying the toast at all).

push_having_to_gby() – 2

The problem with finding something new and fiddling with it and checking to see how you can best use it to advantage is that you sometimes manage to “break” it very quickly. In yesterday’s blog note I introduced the /*+ push_having_to_gby(@qbname) */ hint and explained why it was a useful little enhancement. I also showed a funny little glitch with a missing predicate in the execution plan.

Today I thought I’d do something a little more complex with the example I produced yesterday, and I’ve ended up with a little note that’s not actually about the hint, it’s about something that appeared in my initial testing of the hint, and then broke when I pushed it a little further. Here’s a script to create data for the new test:

rem
rem     Script:         push_having_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',50,'x')                padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

insert into t1 values (2, lpad(2,10,'0'), lpad('x',50,'x'));
commit;

alter table t1 modify id not null;
create index t1_i1 on t1(id) nologging;

create table t2 as select * from t1;
create index t2_i1 on t2(id) nologging;

I’ve created two tables here, one a clone of the other, with one id value out of 1 million having two rows. As we saw yesterday it’s quite simple to write some SQL that uses an index full scan on the t1_i1 index to check for duplicate id values without doing a massive sort or hash aggregation:


set serveroutput off
alter session set statistics_level = all;

select
        /*+
                qb_name(driver)
                index(@driver t1@driver)
        */
        id 
from
        t1
where   id is not null
group by 
        id 
having  
        count(1) > 1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));


-------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |      1 |00:00:00.87 |    2229 |   2228 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |  50000 |      1 |00:00:00.87 |    2229 |   2228 |
|   2 |   INDEX FULL SCAN    | T1_I1 |      1 |   1000K|   1000K|00:00:00.40 |    2229 |   2228 |
-------------------------------------------------------------------------------------------------

As we saw yesterday this plan simply walks the index in order keeping track of a “running count” and doesn’t allocate a large PGA to sort a million rows of data, but there’s no asterisk by any operation telling us that there’s a predicate being checked, and no Predicate Information section to report the “count(1) > 1” predicate that we know exists (and is used, since the query produces the right answer).

Having ascertained that there is one duplicated id in the table, let’s join to the (clone) t2 table to list the rows for that id – and let’s use the initial query as an inline view:

select
        /*+ 
                qb_name(main)
        */
        t2.v1
from    (
        select
                /*+
                        qb_name(driver)
                        index(@driver t1@driver)
                        no_use_hash_aggregation(@driver)
                */
                id 
        from
                t1
        where   id is not null
        group by 
                id 
        having  
                count(1) > 1
        )                       v1,
        t2
where
        t2.id = v1.id
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |      2 |00:00:00.76 |    2234 |     87 |       |       |          |
|   1 |  NESTED LOOPS                |       |      1 |  50000 |      2 |00:00:00.76 |    2234 |     87 |       |       |          |
|   2 |   NESTED LOOPS               |       |      1 |        |      2 |00:00:00.75 |    2232 |     28 |       |       |          |
|   3 |    VIEW                      |       |      1 |  50000 |      1 |00:00:00.75 |    2228 |      0 |       |       |          |
|*  4 |     SORT GROUP BY            |       |      1 |  50000 |      1 |00:00:00.75 |    2228 |      0 |    53M|  2539K|   47M (0)|
|   5 |      INDEX FULL SCAN         | T1_I1 |      1 |   1000K|   1000K|00:00:00.26 |    2228 |      0 |       |       |          |
|*  6 |    INDEX RANGE SCAN          | T2_I1 |      1 |        |      2 |00:00:00.01 |       4 |     28 |       |       |          |
|   7 |   TABLE ACCESS BY INDEX ROWID| T2    |      2 |      1 |      2 |00:00:00.01 |       2 |     59 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(COUNT(*)>1)
   6 - access("T2"."ID"="V1"."ID")

As you can see from this plan, I didn’t get the “sort group by nosort” that I wanted – even though the inline view was not merged. In fact, you’ll notice the /*+ no_use_hash_aggregation() */ hint I had to include to get a sort group by rather than a hash group by. The logic behind resolving this query block changed significantly when it went into a more complex query.

Having tried adding several other hints (blocking nlj_prefetch, nlj_batching, batched index access, setting cardinality to 1, first_rows(1) optimisation) I finally came down to using a materialized CTE (common table expression / “with” subquery):

with v1 as (
        select
                /*+
                        qb_name(driver)
                        index(@driver t1@driver)
                        materialize
                */
                id 
        from
                t1
        where
                id is not null
        group by 
                id 
        having  
                count(1) > 1
)
select
        /*+ 
                qb_name(main)
        */
        t2.v1
from    
        v1,
        t2
where
        t2.id = v1.id
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |      1 |        |      2 |00:00:00.86 |    2236 |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |      1 |        |      2 |00:00:00.86 |    2236 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D66F8_E3B235A |      1 |        |      0 |00:00:00.86 |    2229 |
|   3 |    SORT GROUP BY NOSORT                  |                            |      1 |  50000 |      1 |00:00:00.86 |    2228 |
|   4 |     INDEX FULL SCAN                      | T1_I1                      |      1 |   1000K|   1000K|00:00:00.39 |    2228 |
|   5 |   NESTED LOOPS                           |                            |      1 |  50000 |      2 |00:00:00.01 |       6 |
|   6 |    NESTED LOOPS                          |                            |      1 |        |      2 |00:00:00.01 |       4 |
|   7 |     VIEW                                 |                            |      1 |  50000 |      1 |00:00:00.01 |       0 |
|   8 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D66F8_E3B235A |      1 |  50000 |      1 |00:00:00.01 |       0 |
|*  9 |     INDEX RANGE SCAN                     | T2_I1                      |      1 |        |      2 |00:00:00.01 |       4 |
|  10 |    TABLE ACCESS BY INDEX ROWID           | T2                         |      2 |      1 |      2 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("T2"."ID"="V1"."ID")

You’ll notice that the hinting is back to the bare minimum – with only the addition of the /*+ materialize */ hint in the CTE. You’ll also notice that the “count(1) > 1” predicate is still missing. But critically we do have the index full scan leading into a sort group by nosort and no huge memory allocation.

The price we have to pay is that we do direct path writes to the temporary tablespace to materialize the CTE and db file scattered reads to read the data back. But since this example is aimed at a large data set returning a small result set this may be a highly appropriate trade off.

It’s possible that a detailed examination of the 10053 trace file would give us a clue about why Oracle can find the sort group by nosort when the query block is a materialized CTE but not when it’s an inline view – but I’m happy to leave that investigation to someone else and just leave this here as a warning that sometimes (even in 19c) there’s a difference between a non-merged view path and a materialized subquery path.
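For anyone who does want to chase it up, a minimal sketch for capturing the optimizer trace of the two variants (standard 10053 event syntax; the tracefile identifier is just a convenience for finding the file) would be:

alter session set tracefile_identifier = 'push_having_cte';
alter session set events '10053 trace name context forever, level 1';

-- run the inline view version and the materialized CTE version here

alter session set events '10053 trace name context off';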

 

Where does the log writer spend its time on?

The Oracle database log writer is the process that fundamentally influences database change performance. Under normal circumstances the log writer must persist the changes made to the blocks before the actual change is committed. Therefore, it’s vitally important to understand what the log writer is exactly doing. This is widely known by the Oracle database community.

The traditional method for looking at log writer performance is looking at the wait event ‘log file parallel write’ and the CPU time, and comparing that to the ‘log file sync’ wait, alias “commit wait time”. If ‘log file parallel write’ and ‘log file sync’ roughly match, a commit is waiting on the logwriter IO latency; if they don’t, it’s unclear, and things get vague.
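As a rough starting point, a minimal sketch of that traditional comparison at the instance level (instance-wide averages from v$system_event, so interpret with the usual caution about averages) could be:

-- average wait time per event, in microseconds
select event,
       total_waits,
       round(time_waited_micro / nullif(total_waits, 0)) as avg_wait_us
from   v$system_event
where  event in ('log file sync', 'log file parallel write');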

Now we get to the gist of this blogpost: since Oracle 12, there are additional statistics in V$SESSTAT and V$SYSSTAT that describe where the log writer spends its time. However, and sadly, these are not documented by Oracle. The statistics are:

– redo write gather time
– redo write schedule time
– redo write finish time
– redo write time (usec)
– redo write total time

Probably ‘redo write time (usec)’ doesn’t belong with these statistics, because its timing is practically the same as that of ‘redo write finish time’. Also, the statistic ‘redo write time (usec)’ is updated by kcrfw_postprocess_write right after the main writing function (kcrfw_do_write), whilst the other redo write timing statistics are updated/made visible by ksu_dispatch_tac outside of the actual writing functions, at the approximate rate of once every 3 seconds.

All these statistics contain a cumulative timing figure in microseconds (usec).
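A minimal sketch for looking at these figures instance-wide (the statistic names are exactly the ones listed above; for a per-process view you would join v$sesstat to v$statname for the logwriter session instead):

-- cumulative redo write timing statistics, in microseconds
select name, value
from   v$sysstat
where  name in ('redo write gather time',
                'redo write schedule time',
                'redo write finish time',
                'redo write time (usec)',
                'redo write total time');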

All these timings start at the start of the function kcrfw_redo_write_driver. This is the main function that handles all the redo writing, including investigating the fixed area structure (kcrfs_) and the shared pool structures (made visible using X$KCRFSTRAND) to see if there are changes.

1. redo write gather time.
This times reading the structures that control allocation of the redo structures, validating that foreground processes are done writing the change vectors and applying them to the blocks, and updating the LWN SCN. Part of this work also happens when there is nothing to write, but then the time is not added to this statistic; it is only added if something was found to write.

2. redo write schedule time.
This includes the time of 1, but now it found something to write, entered the writing function (kcrfw_do_write), and updated the ksfd IO related structures for writing.

3. redo write finish time.
This includes the time of 1 and 2, but this additionally times all the IO related functions for writing the changes it found in the public redo strands, as indicated by the shared pool and fixed area structures.

4. redo write time (usec).
This probably is not part of this group of statistics, but it works in exactly the same way: it’s a cumulative figure, it counts time from the start of kcrfw_redo_write_driver, and it includes a tiny bit more code: it times until it has left the kcrfw_do_write function and entered the kcrfw_postprocess_write function.

5. redo write total time.
This includes the time of 1, 2 and 3, but additionally times the kcrfw_post function, which performs the signalling of the foreground processes that have committed and wait for a semop() signal.

This means that, since Oracle 12, the database provides the timing information to see where the log writer spends its time. This can prevent a lot of guesswork.

push_having_to_gby()

I came across an interesting new hint recently when checking the Outline Data for an execution plan: /*+ push_having_to_gby() */  It’s an example of a “small” change designed to reduce CPU usage by reducing the volume of data that passes through the layers of calls that an execution plan represents. The hint appeared in 18.3 but I’ve run the following on 19.3 as a demonstration of what it does and why it’s a good thing:
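If you want to see the Outline Data where hints like this show up, a minimal sketch is to add the outline format option to the dbms_xplan call used later in this post (the format keyword is accepted by recent versions of display_cursor):

-- same call as below, with the outline section added to the output
select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));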

rem
rem     Script:         push_having.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem
rem     Notes:
rem     New (18c) push_having_to_gby() hint
rem     Avoids one pipeline (group by to filter) in
rem     execution plans.
rem

create table t1
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',50,'x')                padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

alter table t1 modify id not null;
create index t1_i1 on t1(id) nologging;

set serveroutput off
alter session set statistics_level = all;

select
        /*+
                qb_name(driver)
        */
        id 
from
        t1
where   id is not null
group by 
        id 
having  
        count(1) > 1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

There aren’t very many options for the execution path for this query, and the default path taken on my database was an index fast full scan with hash aggregation:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |      1 |        |      0 |00:00:00.68 |    2238 |   2230 |       |       |          |
|*  1 |  HASH GROUP BY        |       |      1 |  50000 |      0 |00:00:00.68 |    2238 |   2230 |    55M|  7913K|   57M (0)|
|   2 |   INDEX FAST FULL SCAN| T1_I1 |      1 |   1000K|   1000K|00:00:00.20 |    2238 |   2230 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT(*)>1)

You’ll notice that the query should return no rows – the way I’ve generated the id means it’s unique even though I haven’t declared a unique constraint/index. Did you also notice the common guess (5%) that the optimizer has used for the selectivity of the having clause? But have you spotted the 18c enhancement yet? If not, we’ll get to it in a moment.

It just so happens that I know there is a better execution path than this for this specific query with my specific data set, so I’m going to put in a minimalist hint to tell the optimizer about it, just to see what happens. The data is very well organized, so using an index scan with running total will be significantly more efficient than a big hash group by:


select
        /*+
                qb_name(driver)
                index(@driver t1@driver)
        */
        id 
from
        t1
where   id is not null
group by 
        id 
having  
        count(1) > 1
;

----------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |      0 |00:00:00.80 |    2228 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |  50000 |      0 |00:00:00.80 |    2228 |
|   2 |   INDEX FULL SCAN    | T1_I1 |      1 |   1000K|   1000K|00:00:00.40 |    2228 |
----------------------------------------------------------------------------------------

Notice how the optimizer has obeyed my /*+ index(t1) */ hint and used an index full scan to walk through the t1_i1 index in order, doing a “sort group by” which doesn’t need to do any sorting, so it’s effectively using a simple running total to count repetitions. The timing (A-time) difference isn’t really something to trust closely when dealing with brief time periods and rowsource_execution_statistics, but eliminating 57M of PGA allocation for the hash aggregation SQL workarea might be a significant benefit. But there’s something else to be seen in this plan – if you can manage to see the things that aren’t there.

So let’s push this query back to its 12c plan:

select
        /*+
                qb_name(driver)
                index(@driver t1@driver)
                optimizer_features_enable('12.2.0.1')
        */
        id 
from
        t1
where   id is not null
group by 
        id 
having  
        count(1) > 1
;

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |      1 |        |      0 |00:00:01.02 |    2228 |
|*  1 |  FILTER               |       |      1 |        |      0 |00:00:01.02 |    2228 |
|   2 |   SORT GROUP BY NOSORT|       |      1 |  50000 |   1000K|00:00:00.93 |    2228 |
|   3 |    INDEX FULL SCAN    | T1_I1 |      1 |   1000K|   1000K|00:00:00.45 |    2228 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT(*)>1)

Notice the FILTER that appears as operation 1. Oracle generates the aggregate data (which happens to total 1M rows) at the Sort Group By (whether it’s the Nosort option from the index full scan, or the “proper” hash group by from the index fast full scan) and passes the 1M row result set (estimated at 50K rows) up to the parent operation where the filter takes place. In 18c onwards the separate filter operation disappears and the filtering takes place as part of the aggregation. This is probably a good thing but if you ever want to disable it without switching everything back to the 12c optimizer features then there’s a dedicated hint: (no_)push_having_to_gby():


select
        /*+
                qb_name(driver)
                index(@driver t1@driver)
                no_push_having_to_gby(@driver)
        */
        id 
from
        t1
where   id is not null
group by 
        id 
having  
        count(1) > 1
;

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |      1 |        |      0 |00:00:01.05 |    2228 |
|*  1 |  FILTER               |       |      1 |        |      0 |00:00:01.05 |    2228 |
|   2 |   SORT GROUP BY NOSORT|       |      1 |  50000 |   1000K|00:00:00.91 |    2228 |
|   3 |    INDEX FULL SCAN    | T1_I1 |      1 |   1000K|   1000K|00:00:00.36 |    2228 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT(*)>1)

If you were looking very carefully (especially after my comment about “seeing things that aren’t there”) you may have noticed that there’s an odd detail in the example where I’ve hinted the index without blocking the push. There’s no Predicate Information section in that execution plan – and that wasn’t a mistake on my part – it simply doesn’t exist – and you’ll note that there’s no asterisk by any of the Operation lines to remind you that there should be some Predicate Information! The “count(1) > 1” just doesn’t get reported (even though it does get reported if you use dbms_xplan.display() after a call to explain plan). Fortunately, however, when I modified the model to duplicate one of the rows I did get the correct result – so even though the predicate is not reported it is still applied.  [Side Note: did you notice that my original count(1) changes to count(*) when it gets to the Predicate Information? People still ask which is faster, count(1) or count(*) – the argument should have died back in Oracle 7 days.]

Summary

18c introduced a new optimisation that pushes a “having” predicate down into a group by operation. This reduces CPU usage by eliminating the need to pass a potentially large result from a child operation up to a parent filter operation. Unfortunately you may find that the predicate becomes invisible when you pull the execution plan from memory.

In the unlikely event that you manage to find a case where this optimisation is used when it would have been better to bypass it then there is a hint /*+ no_push_having_to_gby(@qbname) */ to block it, and if it doesn’t appear when you think it should then the opposite hint /*+ push_having_to_gby(@qbname) */ is available.

 

 

Oracle REST Data Services (ORDS) 19.4 : A quick life update…

[Image: https://oracle-base.com/blog/wp-content/uploads/2019/12/ords-2-258x300.png]

Almost 2 weeks ago I wrote about the release of Oracle REST Data Services (ORDS), SQLcl, SQL Developer and SQL Developer Data Modeler 19.4.

I spent the holidays playing around with ORDS quite a bit, so I came back to work today and pushed it out across all Dev and Test installations.

As I’ve mentioned before, at work we run ORDS on Tomcat inside Docker containers. The build we use is very similar to this one I put on GitHub, but with some extra work-related bits added.

What did I have to do for this update?

Two things:

  • Build a new version of our ORDS Docker image with version 19.4 of the ORDS and SQLcl software.
  • Remove all the containers based on this image and fire up new containers.

How long did it take to deploy this to all Dev and Test instances?

The build of the new Docker image took about 5 minutes. It’s mostly just unzipping the software. This can be done before we touch any running containers, so there is no downtime associated with this.

The removal and creation of all the containers took about 5 minutes as well. Each container is created in a second, but the first run with a new version of ORDS has to do the ORDS upgrade in the database, which can take a few minutes. If there is no ORDS upgrade, the containers start really quickly.

So effectively, in 5 minutes we replaced all the “kit” and ran the ORDS upgrade across everything. I could have done production in that same 5 minute span too, but I’m not allowed to yet.

VirtualBox 6.1 : No compatible version of Vagrant yet! (or is there?)

VirtualBox 6.1 was released on the 11th of December and I totally missed it.

The downloads and changelog are in the usual places.

I spotted it this morning, downloaded it and installed it straight away. I had no installation dramas on Windows 10, macOS Catalina and Oracle Linux 7 hosts.

The problem *for me* was that the current version of Vagrant (2.2.6) doesn’t support VirtualBox 6.1 yet. I can’t live without Vagrant these days, so I installed VirtualBox 6.0.14 again and normal life resumed. See Update.

I’m sure there will be a new release of Vagrant soon that supports VirtualBox 6.1, but for now if you use Vagrant, don’t upgrade to VirtualBox 6.1 yet. I’m sure you won’t have to wait long… See Update.

Cheers

Tim…

Update 1 : A couple of people, Peter Wahl and Andrea Cremonesi, pointed me at this post by Simon Coter, which contains config changes to allow Vagrant 2.2.6 to run with VirtualBox 6.1.

Update 2 : I’ve followed Simon’s post and it worked fine. If you are using Windows 10 as the host and have done a default installation of Vagrant, the files he’s discussing are in these directories.

C:\HashiCorp\Vagrant\embedded\gems\2.2.6\gems\vagrant-2.2.6\plugins\providers\virtualbox\driver\

C:\HashiCorp\Vagrant\embedded\gems\2.2.6\gems\vagrant-2.2.6\plugins\providers\virtualbox\

Update 3 : I updated my work PC also. It required a couple of reboots to get things working. I think it may be something to do with the way we do security here. It’s working fine now.



2019 : A Year in Review

[Image: https://oracle-base.com/blog/wp-content/uploads/2020/01/2019-2020-rollov...]

Well, it seems 2019 was another slightly bizarre year for me.

I just looked back on last year’s review (here) and I’m guessing I had “resting bitch face” while I was reading it…

Conferences

One of the things I mentioned last year was that I was taking a year off presenting. I had two events I had already committed to at the start of the year, then I dropped off the face of the earth for the rest of the year. I came out of retirement for OpenWorld. I was originally going to pull out of that also, but the wife persuaded me I should go. I’m glad I did.

The reason for not presenting was really to give myself a break. I had been having a lot of trouble travelling, which I suspect is mostly down to being so fat. It was good not to have the hassle of travelling, but I did miss seeing folks. I’ve done some presentations at work during the year, so it’s not like I’ve done nothing…

The problem with not doing conferences was that work was very busy and I didn’t take many holidays, so I was basically swapping one set of stress for another.

I’m thinking I might do a few things this year. I’m not going to go mad and try and speak at loads of events, but I’ve got to get back on the horse. Of course, the first thing is to think of something to present…

If you’ve followed the blog, you’ll know I’ve been going to a few local meetups for Docker, DevOps and Azure. I’m a tourist, rather than a speaker. It’s good to do something different!

Videos

Last year I said I wanted to start doing some videos on my YouTube Channel again. I was “on a break” for a while before that. Well I started at the end of January and I think I’ve done about 44 videos this year. There are a few new playlists, and some additions to existing ones listed here.

I like doing the videos, and it alleviates my guilt at not presenting, but they do take up a lot of time. I know some people have asked for longer, more in-depth videos, but I would estimate that for every minute of the final video, it takes me about 60 minutes of work, so a 5 minute video takes me about 5 hours. With that in mind, the chances of me doing a long-form video are remote.

I know some people can just record themselves talking, but presenting takes a lot of prep for me. For a live presentation I work for days/weeks so I can look casual. For videos it’s a little easier as you can edit stuff and trim it down, but it still takes time…

If you’ve watched my videos, you’ll know I put in little cameos of people saying “.com” to finish my introduction line. I went to put out the latest collection of them (Volume 3) and noticed I had never uploaded Volume 2, so today I posted two short videos. Thanks to everyone who has helped.

I know it’s stupid, but I like it.

The 2019 Year in Review

It’s that time of year where everyone is doing a “Year in Review” post and why not?

[Image: https://dbakevlar.com/wp-content/uploads/2019/12/2019_2.png]

It’s good to gain perspective, plan for 2020 and celebrate all that was wonderful about 2019, (outside of politics, that was an utter failure on all fronts, pretty much most countries… :))

The DBAKevlar blog:

  • 46 published posts for 2019
  • Over 800 posts on the site
  • Over 600K in visitors for 2019
  • Daily views have increased steadily, with over 2500 visitors yesterday.
  • Somehow, still awarded top 60 blog sites for Oracle database blogs with all that Microsoft content…:)

Microsoft Employment for 2019:

  • My mobile office worked in 25 states, not even going to count how many locations.
  • Traveled to 37 customer locations
  • Working regularly with over 70 higherEd institutions and a number of K-12 customers on analytics and AI
  • Presented at 3 EdTech conferences
  • Contributed 4 blogs posts to the Microsoft SQL Server/Azure blog
  •  2 role redesigns to make the Oracle work I was doing “official”, (you rock, Denny Ramsey- Thank you for being an awesome boss!)
  • Over 70 projects either migrating Oracle from on-prem to Azure IaaS or part of the Oracle Cloud/Azure partnership.
  • Accepted a new engineering role in Customer Success starting on January 1st, 2020.

Webinars:

Conferences:

I love me a great SQL Saturday and this year I presented at 19 of them!  It’s a few down from 2018, when I was an evangelist, but the number is still considerable.  I love getting to present to different audiences, hanging out with my #SQLFamily, and having the opportunity to learn new things.  One of my goals for 2019 at Microsoft was to see if there was interest from the Oracle community in Power BI, so I presented at two Oracle events, RMOUG and KSCOPE, to gauge response.  I really enjoyed these sessions, as I got to do something very new with the technology and also had the opportunity to visit with my Oracle peeps.

I also presented three sessions at PASS Summit this year.  I love this conference and really appreciate that they’ve let me continue to be an official blogger.  I gave up this year on keeping up with the social events – there are just too many, and no one has that amount of energy at my age.  I did what I could, seriously.

Publications:

I started my WIT book, “Crushing the IT Gender Bias”, in June 2018.  I had just started at Microsoft, so it was an awesome time to start a book, too, (not.)  As much as I put in it, of course I can now think of so much more that should have been added.  I was really surprised by the difference in audience and by how often I see it in book stores and libraries.  It’s just very different from my technical books in where it will end up, which results in you feeling proud and vulnerable all at the same time.

I still owe Simple Talk a few more submissions to finish up the Linux series.  The book for Packt came in and stole much of my writing time, and at some point I’ll need to get my content finished, but luckily each piece of the series stands alone, so everything I add is just gravy on the mashed potatoes already posted.

Oracle wait event ‘log file parallel write’ change

This post is about a change in how the time is measured for the event ‘log file parallel write’. This is important for the performance tuning of any change activity in an Oracle database, because with the default commit settings, a foreground session that commits changes waits in the wait event ‘log file sync’, which is a wait on logwriter activity, for which the wait event ‘log file parallel write’ has always been the indicator of the time spent on IO.

Log file sync
First things first: a foreground session normally waits on the wait event ‘log file sync’ when it commits, waiting for its change vectors to be written to the online redologfile(s) by the logwriter. It is wrong to always assume a ‘log file sync’ wait will be present. If, somehow, the logwriter manages to increase the ON DISK SCN to or beyond the foreground session’s commit SCN, there will be no ‘log file sync’ wait event.

I found the absence of the ‘log file sync’ wait event by artificially slowing the committing session down. However, with the improvements in parallelism in the Oracle code, and even more with the improvements on the hardware layer for lowering write latency (especially with persistent memory), this might be a situation that is actually visible in the near future. This is true for at least version 11.2.0.4 and higher (probably earlier too, but these are versions that I verified).

Log file parallel write
However, this post is about the event that in a lot of cases is the culprit for the ‘log file sync’ time: it’s about the ‘log file parallel write’ wait event, which takes place in the logwriter (when the logwriter itself writes, it could delegate that to the logwriter worker processes, LGnn), which normally would be measured by Oracle database tuners to validate logwriter IO latency in case of foreground sessions showing high ‘log file sync’ waits indicating waiting on the logwriter.
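As a quick check of what the logwriter has actually been waiting on, a minimal sketch (the join on program is an assumption about how LGWR shows up in v$session; adjust it for the LGnn worker processes if they are active) could be:

-- wait events of the logwriter session, with average wait in microseconds
select s.program,
       e.event,
       e.total_waits,
       round(e.time_waited_micro / nullif(e.total_waits, 0)) as avg_wait_us
from   v$session s,
       v$session_event e
where  s.sid = e.sid
and    s.program like '%LGWR%'
and    e.event like 'log file%';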

I was looking at the logwriter code path using the ‘debug trace’ tool from intel pin tools and used Tanel Poder’s snapper using begin and end snapshots on the logwriter process on Linux, non ASM, Oracle 19.5. I put the logwriter into single logwriter mode (“_use_single_log_writer”=true) and performed redo, which is bound to be picked up by the logwriter.

There I noticed that the statistics indicated a write had taken place, whilst there was no ‘log file parallel write’ wait event. Weird… So I dug deeper and looked at the functions being executed. These are the functions that I found: https://gitlab.com/snippets/1926207

The important bits are:
– In the function ksfdaio() the IO request is submitted:
ksfdaio>ksfdgo>ksfd_skgfqio>skgfqio>skgfr_lio_listio64>io_submit
– And in the function ksfdblock() the IO request result is read (“reaped”):
ksfdblock>ksfdpoll>ksfdwtio>skgfrwat>skgfrliopo>io_getevents

But no wait event is called *at all!*

This is a clear change from earlier versions, where the logwriter write would always yield a ‘log file parallel write’ wait event. But when did this change happen? To answer that question, I installed version 12.1.0.2 (no patches) to look at the code path and found: https://gitlab.com/snippets/1926211

The important bits are:
– The wait interface is invoked: kslwtbctx()
– In the function ksfdgo() the IO request is submitted:
ksfd_sbio>ksfdgo>ksfd_skgfqio>skgfqio>skgfr_lio_listio64>io_submit
– And in the function ksfdwtio() the IO request result is read (“reaped”):
ksfd_sbio>ksfdwtio>skgfrwat>skgfospo>skgfrliopo>io_getevents
– And the wait interface is ended: kslwtectx()

Ah! So a clear change! So when did this change happen? I installed 12.2.0.1 and found the same ‘waiting’ codepath, then installed 18.3, and found the waiting codepath. So the change probably happened in Oracle 19! So I installed 19.3, and found the waiting codepath once again!! So was it a change in PSU 19.4 or 19.5? I installed 19.5, and found the waiting codepath too.

O.M.G. Why am I seeing different behaviour between two databases on the exact same version? Well, I set some undocumented parameters, so I replicated these in my newly installed database…and it didn’t change.

What else is different? Well…one is a multi-tenant database, and the other isn’t…but surely that doesn’t change the IO codepath?? Of course this is easy to check, so I dropped the non-CDB database and created a CDB one, and…it now showed the same “non waiting” codepath.

So, apparently, using the oracle database in multi-tenant mode changes the IO codepath behaviour. I checked the codepath going down in versions, and it turns out this change appeared in 12.2.0.1.0, so essentially in Oracle 12.2 (Oracle 18 and 19 are actually Oracle 12.2.0.2 and Oracle 12.2.0.3). To be clear on this: I didn’t find this change in 12.1.0.2.

Does this mean the wait event ‘log file parallel write’ does not show up at all? No: if the io_getevents call after io_submit does not return all submitted IOs, the process must wait for them, and at that point it invokes the wait interface and then goes into a blocking io_getevents call, so the actual waiting is timed. This is shown in this snippet: https://gitlab.com/snippets/1926212

This shows ksfdblock>ksfdpoll>ksfdwtio>skgfrwat>skgfrliopo>io_getevents, which returns 0 (which means it didn’t find any IOs in the completion queue), after which it returns up to the ksfdpoll function, invokes the wait interface (kslwtbctx), and then dives into io_getevents again, now in blocking mode, to wait for the IO.

I also replayed the tests on ASM, which shows the exact same (non-)waiting behaviour.

Conclusion
For Oracle versions 12.1.0.2 up to 19.5 in a non-multitenant (non-CDB) database, the wait event ‘log file parallel write’ includes both the submission of the IO request (the io_submit call) and the waiting for the submitted IOs to finish (the io_getevents call). This means the waiting time of the event in this case is the total latency of the IO request, including operating system processing.

Starting from version 12.2, and only if the multi-tenant option is used, this changes: the wait event ‘log file parallel write’ is only shown if the submitted IOs are not yet complete right after they are submitted, and thus the logwriter process has to wait for them. If the wait is shown, the time in the event is the actual waiting time for the IOs to finish from the perspective of the logwriter process, not the full IO latency.

Of course if you have multiple log groups on disks with different latencies, this wait event waits for all IOs to finish, which means it times the slowest IO of all submitted IOs.