How the log writer and foreground processes work together on commit.

(warning: this is a rather detailed technical post on the internal working of the Oracle database’s commit interactions between the committing foreground processes and the log writer)

After the Trivadis Performance days I was chatting to Jonathan Lewis. I presented my Profiling the log writer and database writer presentation, in which I state the foreground (user/server) process looks at the commit SCN in order to determine if its logbuffer contents are written to disk by the logwriter(s). Jonathan suggested looking deeper into this matter, because looking at the commit SCN might not the way it truly works.

The reasoning is the foreground process flushes its log information from its private strand into the public logbuffer, and as such only needs to keep track of the log blocks it allocated in the public logbuffer to see if these are written. Different processes can allocate different blocks in the public log buffer, which potentially do not have to be written all in SCN order. In other words: it could be hard to figure out on write time what the exact on-disk SCN of the redo is.

In order to verify this, let’s first recap what I tell about how a foreground process behaves on commit:

This is a schematic drawing of a foreground process committing.
There are two timelines, one of the foreground process, and one of the logwriter.
The foreground process commits. The commit copies the changes (redo vectors) from the process’ private strand into the public logbuffer (not shown, all indicated with “commit;”), after which it signals (“posts”) the logwriter via the semctl() system call.

When idle, the logwriter sleeps in a systemcall called semtimedop() for 3 seconds after which it performs some “household” tasks (measure time and resource usage for example) then calls semtimedop() again. When signalled, or when it finds contents in the public log buffer, the log writer writes the public log buffer contents to the online redo log file(s) via AIO if available. This is indicated by the io_submit() and io_getevents() calls.

In the meantime, after executing semctl(), the foreground process enters the function kcrf_commit_force(), or the kcrf_commit_force_int() function with Oracle 12. Inside this function, it calls the function kcscur3() three times. If the log writer was able to write the public log buffer contents before kcscur3() has been executed 3 times, the foreground will not enter the wait event ‘log file sync’, and not sleep on semtimedop() in order to wait for being signalled by the log writer, which is why I put it in a gray bar (indicating “optional”). Also, in that case, because the foreground process is not awaiting to be signalled by the logwriter, the log writer does not execute semctl(). The log writer knows what process(es) it needs to signal via a list that administers posters and waiters, which is called the “post-wait queue”, which is not externalised. If the log writer did not write the public log buffer contents fast enough, the foreground process registers it started waiting (in the aforementioned post-wait queue), and starts sleeping in semtimedop() for 100ms, after which it executes kcscur3() two times, do some household actions like measuring time and resource usage, and then calls semtimedop() again. This repeats until the process is receiving a message from the log writer via semctl().

Starting from Oracle version 11.2.0.3, the underscore parameter “_use_adaptive_log_file_sync” is set to TRUE, which enables a feature called “adaptive log file sync”. This settings means the database can and will (!) switch between post-wait mode, which has just been described, and a mode called ‘polling’. Polling means the processes that requested the log writer to write will not wait until they are posted by the log writer, but look at the log writer write progress, and continue after the log writer has written their specific log blocks from the public log buffer.

This is how that schematically looks like:

As you can see, it is quite alike the post wait mode, only the sleeping is done using nanosleep() instead of semtimedop() and there is no semctl() on the log writer timeline. The foreground process issues the same kcscur3() functions, but because these measure SCN progress, the foreground can determine if its public log buffer contents have been written to disk, and stop waiting and continue processing.

I made the conclusion that the kcscur3() function is used to determine the commit SCN based on profiling the function call sequence and logic reasoning. Encouraged by Jonathan’s reasoning, let’s try to see if we can dig deeper.

The first thing we need to do, is see if we can obtain more information on the kcscur3() function. One way of doing that, is investigating the function’s arguments. This can be done without source code or debug symbols, because the arguments of a function are passed via CPU registers on X86_64 Linux: first to fourth arguments are in registers RDI, RSI, RDX, RCX.

A way to do this, is insert data into a table, then attach to this process using gdb, and execute the following gdb macro:

break kcscur3
commands
silent
printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
continue
end

After continuing the foreground via gdb, gdb shows kcscur3 is executed 6 times:

kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Interesting! But what are these numbers? Values? Addresses?
In order to understand if these arguments mean anything, let’s first get the addresses for the general shared memory area’s. This can be done using ipc command with oradebug:

SYS@v12102 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@v12102 AS SYSDBA> oradebug ipc
IPC information written to the trace file
SYS@v12102 AS SYSDBA> @lt

@lt is a script to look at the current active trace file. The relevant contents of this trace file are the shared memory area’s:

Handle:            0x113c2940 `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102'
 Dump of unix-generic realm handle `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102', flags = 00000000
  key 604726764 actual_key 604726764 num_areas 4 num_subareas 4
  primary shmid: 112590854 primary sanum 3 version 3
  deferred alloc: FALSE (0) def_post_create: FALSE (0) exp_memlock: 2050M
 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002ca788 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2 112492547 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cb000 0000000000400000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000077000000 Minimum Subarea size 01000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0 112525316 0x00000061000000 0x00000061000000 0x00000061000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000077000000 0000000077000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008d35000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1 112558085 0x000000d8000000 0x000000d8000000 0x000000d8000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008d35000 0000000008e00000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000003000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3 112590854 0x000000e1000000 0x000000e1000000 0x000000e1000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000003000 0000000000003000 default       readwrite

We see the 4 shared area’s and their memory address:
– Fixed size, start address 0x60000000, size 0x2cb000
– Variable size, start address 0x61000000, size 0x77000000
– Redo buffers, start address 0xd8000000, size 0x8d35000
– Skgm overhead, start 0x1e000000, size 0x3000

If we combine this information with the kcscur3() arguments, we see that the first argument points to the ‘fixed size’ area, in other words: the fixed SGA. The fixed SGA variables metadata is listed in x$ksmfsv. Please mind this view does not list all the contents of the fixed SGA, for example the latches are in the fixed SGA too.

We got 3 addresses which are inside the fixed SGA on calling commit: 0x6001fbb0, 0x60027c68 and 0x60027c98. Let’s see if we can find them in x$ksmfsv:

SYS@v12102 AS SYSDBA> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
2  where to_number('6001fbb0','XXXXXXXX')
3  between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX') and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcsgscn_                                                         000000006001FBB0         48

SYS@v12102 AS SYSDBA> c/6001fbb0/60027c68/
  2* where to_number('60027c68','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

SYS@v12102 AS SYSDBA> c/60027c68/60027c98/
  2* where to_number('60027c98','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

The first fixed SGA address, 0x6001fbb0, points to a variable called kcsgscn_. Some sources on the internet report this is the current SCN. (Kernel Cache System Global SCN?)
Let’s test this! First I set a watchpoint on 0x6001fbb0 in gdb, then query v$database.current_scn:

(gdb) awatch *0x6001fbb0
SYS@v12102 AS SYSDBA> select current_scn from v$database;

This triggers the watchpoint!

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 11278094
New value = 11278163
0x000000000cc77983 in kcscur3 ()

Ah! So when querying the current_scn, it uses the kcscur3() function too, and apparently, kcscur3() can change a value too (we see an old value, and the new value). I press ‘c’ and enter to let the debugger continue the program it debugs. In fact, I have to do this multiple times in the functions kcscur3(), kcsgssn(), kcsgcsn() and kcsgbsn. The watchpoint shows memory address 0x6001fbb0 is left with a number exactly one higher than is shown in the query ‘select current_scn from v$database’.

The second and third fixed SGA addresses, 0x60027c68 and 0x60027c98, both point to a variable called kcrfsg_. Actually, they do not point to the starting address of the variable, but rather to “somewhere” in this variable. First let’s see what value is stored at these addresses in the variable kcrfsg_ using gdb:

(gdb) x/dw 0x60027c68
0x60027c68:	11278915
(gdb) x/dw 0x60027c98
0x60027c98:	11278917

These are large numbers, which are highly likely to be SCNs of some sort.

The variable kcrfsg_, which is quite probably a c “struct” (a variable composed of multiple variables, alike records in a table) is linked with the x$ view x$kcrfws:

SYS@v12102 AS SYSDBA> select addr from x$kcrfws;

ADDR
----------------
0000000060027C38

The address reported is 8 bits into kcrfsg_. The x$kcrfws view is only used by the v$ view v$xstream_capture, and because of that it is reported to have something to do with replication. That is incorrect. My current assumption is x$kcrfws means Kernel Cache Redo Write Status.

In order to figure out which field in x$kcrfws is linked to which memory address (0x60027c68 and 0x60027c98), I use gdb once again, and use a watchpoint on a memory address. Oradebug also provides this functionality, but it doesn’t work in my setup (mprotect error 22). I attach gdb to a SYSDBA session, and execute:

(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 1: *0x60027c68

Then query the fields in x$kcrfws one by one. It turns out, memory address 0x60027c68 is accessed for the fields lwn_scn_bas and lwn_scn_wrp, and memory address 0x60027c98 is accessed for the fields on_disk_scn_bas and on_disk_scn_wrp.

So, what do we know now? It becomes apparent Oracle uses the kcscur3() function for reading SCN values. The function seems to be dynamic and can be used for multiple locations holding different types of SCNs. We witnessed reading the instance current SCN, the on disk SCN and the LWN SCN.

The on disk SCN (x$kcrfws.on_disk_scn_(bas|wrp)) seems to be a registration of up to which SCN is written by the logwriter, and the LWN SCN (x$kcrfws.lwn_scn_(bas|wrp)) is a registration of up to which SCN is in the current log write number (LWN). The log write number (LWN) seems to be a number appointed to groups of redo blocks for writing them in batch.

This information is needed to make more sense of how my foreground works. In order to make the tracing of the foreground more meaningful, we need to add a break on semctl() to understand when all redo vectors are copied into the public log buffer, and the foreground actually starts waiting on the log writer, and peeks at its progress. It is also handy to add breaks to semtimedop() and nanosleep(), so we know what logwrite mode is in use:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000cc77970 
	breakpoint already hit 6 times
        silent
        printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
        continue
(gdb) break semctl
Breakpoint 2 at 0x3bfdaeb030
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "semctl\n"
>continue
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3bfdaeb060
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>printf "semtimedop\n"
>continue
>end
(gdb) break nanosleep
Breakpoint 4 at 0x3bfde0ef90
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>printf "nanosleep\n"
>continue
>end
(gdb) disable
(gdb)

I disabled all breakpoints (disable command), and do the insert. After the insert, the enable command enables all breakpoints, and watch what the function call sequence:

(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

So, after a foreground process has written its change vectors into the public log buffer and requests the log writer to write using semctl(), it doesn’t poll a single location for a SCN, but requests the on disk SCN and the current SCN! It also seemed my log writer was speedy enough to write, because there is no nanosleep() nor semtimedop() call.

It seems we need to manipulate the logwriter too, in order to get the call sequences I showed in my slides. In order to do that, I open another terminal session, and attached a debugger session to the lgwr too. At this point I should point out I made sure my Oracle (12.1.0.2) database was using the single lgwr process and not the log writer workers by setting the “_use_single_log_writer” undocumented parameter to true. (do NOT do this on a production database; both setting an undocumented parameter without the blessing of oracle support, nor attach gdb to a log writer process!!)

I added a break to the io_getevents_0_4 function, which is the function for reaping asynchronous submitted IO, and then disable the breakpoint:

break io_getevents_0_4
disable
continue

I also disabled the breakpoints in the other debugger session attached to the foreground session, and insert another record. After the insertion, enable the breakpoints in both gdb sessions, and enter commit. Now the log writer debugger session will break on io_getevents_0_4, and the foreground debugger session should show a repeating pattern:

(gdb) enable
(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
...

The nanosleep and kcscur3 calls for address 60027c98 and 60027c68 continue to be executed. After continuing the log writer debugger session, the foreground debugger session shows:

kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Bingo! We see the nanosleep() call, so we are in polling mode, not in post-wait mode. And because we artificially made the log writer stop from progressing , the foreground process is reading the on-disk SCN and LWN SCN, then calls nano sleep, then scans the on-disk and LWN SCNs again, etcetera, until the on-disk SCN gets higher than the foreground process commit SCN. Interestingly, it seems that once kcscur3() on address 0x60027c98 was able to identify the log writer progressed beyond it’s commit SCN, the next kcscur3() call does not read address 0x60027c68, but instead reads address 0x6001fbb0, alias the current SCN.

Okay, so this points to the on-disk SCN being responsible for commit progress. If this is actually true, we should now take a look at the log writer and see if the log writer indeed changes the on-disk SCN at memory location 0x60027c98 after the write. For this I can use the memory watch facility of the debugger again:

(gdb) awatch *0x60027c98
Hardware access (read/write) watchpoint 2: *0x60027c98
(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 3: *0x60027c68
(gdb) c
Continuing.

This will quite probably immediately break on access of these memory areas. The log writer is looking at these memory area’s a lot. Here’s a snippet of the logwriter approximately during the commit of the foreground:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 3: *0x60027c68

Old value = 11174890
New value = 11174892
0x000000000cc77ac4 in kcsnew3 ()
(gdb) c
Continuing.

Breakpoint 1, io_getevents_0_4 (ctx=0x7f165de70000, min_nr=2, nr=128, events=0x7ffc696611b8, timeout=0x7ffc696621b8)
    at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x000000000cc77983 in kcscur3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x0000000002d70bb9 in kcsadj3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Old value = 11174890
New value = 11174892
0x0000000002d70c13 in kcsadj3 ()

First we see a function called kcsnew3() change the value at 0x60027c68. Then we see the breakpoint at io_getevents() indicating it looks for submitted IOs to return, in other words: the log writer was asked to write something or found something to write. After the write we see the log writer reading the memory location 0x60027c98 using the kcscur3() function, just as our foreground process does. Then we find a function called kcsadj3() which first reads the on-disk SCN location 0x60027c98, and then changing it. This indicates the log writer is increasing the value of the on-disk SCN in 0x60027c98 after it has written using the function kcsadj3(), a safe bet is Kernel Cache Service Adjust, which the foreground in polling mode is reading to understand if the log writer has written the contents the process put in the public log buffer.

Of course this still isn’t the ultimate proof, I break at a few, arbitrary functions, there can be all kinds of other things going on. This means that still the foreground process can just keep the list of log buffer blocks and really use that, and only use the SCN values for verification, because I am looking at only a small subset of the functions it is executing.

However, there is a way to test this!
=> Another word of caution, the next manipulation can render your database into a smoking pile of bits and bytes! <=
The log writer mode needs to be polling. Also, for simplicity, in Oracle 12 “_use_single_log_writer” must be set to true, to be sure the log writer is performing all the tasks, and not log writer slaves.

In order to proof that a foreground session is truly looking at 0x60027c98 for the on-disk SCN to understand if the log writer has written its log buffer contents, we let the log writer stop before it has adjusted the on-disk SCN. That can be accomplished by attaching to the log writer with gdb, and break on io_getevents_0_4:

break io_getevents_0_4
disable
continue

Now the breakpoint is set, disabled, and execution is resumed.

The next thing to do, is go to the sqlplus foreground session and insert a row of data. Once the insert is done, go to the debugger session, stop execution, enter “enable”, and continue. The debugger resumed execution again, but now will break execution once it encounter io_getevents. Now go back to the foreground sqlplus session and enter commit.

This will trigger the log writer to write, and break execution because of the breakpoint. Because the log writer is stopped the foreground session will appear to be “hanging”. In fact what the foreground session is doing is sleeping in nanosleep(), then executing kcscur3() to look at the on-disk and LWN SCNs, then sleeping in nanosleep() again, etc.

This next part is experimental!
Because the log writer is stopped at the io_getevents() call, it didn’t execute kcsadj3() yet to update the on-disk SCN in 0x60027c98. What we will do, is read the current value in 0x60027c98, and increase the value using gdb. If the foreground is truly only waiting for the value at 0x60027c98, increasing the value at this memory address should give back the prompt of the sqlplus session, because it thinks the commit fully succeeded because the log writer performed all its functions.
First query the on-disk SCN:

(gdb) x/dw 0x60027c98
0x60027c98:	11175042

So, the on-disk SCN value is 11175042. Mind you the sqlplus session appears hanging, but really is “polling” the on-disk SCN, and the log writer has stopped executing, because of the breakpoint.

Let’s see if increasing the on-disk SCN makes the sqlplus session think the commit succeeded. Please mind I randomly increased the value from above by changing the 7 to 8 on the 5th position.

(gdb) set {int}0x60027c98 = 11185042

Voila! the foreground process continues and returns the prompt, indicating it is thinking the log writer has finished!

This means that Oracle uses the log writer on-disk SCN for processes to determine if their log buffer contents have been written to disk.

When the log writer is in post-wait mode, this mechanism is in place too, but the foreground process, once waiting, needs to wait for the log writer to post it via semctl(). There is one exception to this, which is true for both the polling and the post-wait mechanisms: right after a foreground process signalled the log writer to write, it will call kcscur3() looking for the on-disk SCN and if it finds the on-disk SCN beyond its commit SCN, there will not be a wait triggered. In all other cases the process will register a wait “log file sync”.

Conclusion
The kcscur3() function is a function to read and update SCNs. There are multiple SCNs, like current SCN, on-disk SCN and LWN SCN, which all can be read with kcscur3(). Also kcscur3() is not exclusive to commit, when querying the current_scn field in v$database, this function is used too.

When a process commits, right after it signalled the log writer, it checks for the log writer progress via the on-disk SCN. If the log writer was fast enough to have the contents flushed to disk and the on-disk SCN updated, the process will not wait for ‘log file sync’ at all.

Both log writer modi, post-wait and polling, look for the log writer write progress via the on-disk SCN. Both modi will show the wait ‘log file sync’ if the above “fast log writer exception” didn’t happen.

When the log writer is in post-wait mode, a process without the “fast log writer exception” will register itself in the post-wait queue and sleep on a semaphore waiting to be posted by the log writer, and register it is waiting for event ‘log file sync’. Still, it does keep track the on-disk SCN (twice) with a check of the LWN SCN in between, via calls to kcscur3(). After being signalled by the log writer, it again verifies the on-disk SCN and looks for the current SCN, after which it continues.

When the log writer is in polling mode, a process without the “fast log writer exception” will register a waiting for event ‘log file sync’. Now it sleeps in nanosleep() and only executes two checks: the on-disk SCN and the LWN SCN, via calls to kcscur3(). Once the log writer progressed the on-disk SCN beyond the process’ commit SCN, it will continue.

Tagged: AIO, commit, debugger, gdb, linux, oracle, performance