Top 60 Oracle Blogs

Recent comments


Oracle database features wish list – 2 (V$SQL_UNSHARED_CURSOR)

It’s been too long from a previous wish for Oracle. While reading a MOS updated articles (in a new HTML interface which looks nice, much better than flash-based introduction) I’ve seen this document that made me write this blog post.

RuOUG in Saint Petersburg

On February 10th I was in Saint Petersburg on a seminar organized by Russian Oracle User Group. Actually, it was mostly prepared by company called Devexperts. Seminar took place in their office, with most presentations done by their people.

11.2 New views

It looks like Laimutis Nedzinskas has spent some time producing  a list of new views, and changes to existing views, for trouble-shooting specialists. The lists cover the dynamic performance views and the AWR history views, and the differences between 10.2 and 11.2. The lists include, where relevant, convenient hyperlinks to the view definitions in the online copy of  the 11g Server Reference manual.

I haven’t checked the lists in detail, but I can see myself  taking advantage of his efforts from time to time:

Visualizing Active Session History (ASH) Data With R

One of the easiest ways to understand something is to see a visualization. Looking at Active Session History (ASH) data is no exception and I’ll dive into how to do so with R and how I used R plots to visually present a problem and confirm a hypothesis. But first some background…


Frequently DBAs use the Automatic Workload Repository (AWR) as an entry point for troubleshooting performance problems and in this case the adventure started the same way. In the AWR report Top 5 Timed Foreground Events, the log file sync event was showing up as the #3 event. This needed deeper investigation as often times the cause for longer log file sync times is related to longer log file parallel write times.

When to Fix & When to Find Out – Friday Philosophy

{warning, this is one of my long, rambling Friday Philosophy blogs. Technical blogs are on the way – though there are some nice AWR screen shots in this one :-) }

As a DBA (or System Administrator or Network admin or any other role that involves fixing things that are broken before the shouting starts) there is often a tension between two contending “best practices”:

- getting the system working again as soon as possible.
- understanding exactly what the problem is.

Some experts point out {and I generally agree} that unless you actually find out exactly what the problem was, what you did to fix it and via a test case demonstrate why the fix worked, you are not really solving the problem – You are just making it go away. You (hopefully!) know what you changed or the action you took so you have something that you can repeat which might fix it again next time. (NB this does not apply to the classic “turn it off, turn it back on again”, that nearly always is just an aversion therapy).

But it might not fix it next time.

Or you might be going to way more trouble to fix it than is needed {I’m remembering how flushing the shared pool used to get touted as a way to resolve performance issues, but it is a pretty brutal thing to do to a production database}.

You might even be making other problems worse {like slowing down everything on the production database as the caches warm up again, having flushed out all that data in the SGA, or that good-old-standard of rebuilding indexes that simply do not benefit from the rebuild}.

There is another issue with “just fixing it by trying things” in that you are not really learning about the cause of the issue or about how the technology you are looking after works. A big part of what makes an “expert” an expert is the drive, desire and opportunity to take the time to work this out. It’s that last part I sometimes get grumpy about, the opportunity.
For many of us, we do not have the luxury of investigating the root cause. Because we are under so much pressure to “get it fixed right now and worry about the detail later”. But we do not get to the detail as there is then the next “fix this problem right now” and so it goes on. {Kellyn Pot’Vin does a nice post about what she calls the Superman Conundrum on this topic}.

I’ve had exactly this dilema just a couple of months ago. Below are the details, it’s a bit long so skip to the end of the post if you like…

I was creating test data and I decided to use parallel processing to speed it up. I created a month’s worth of data with PL/SQL and then decided to copy it with a simple “insert into …select” statement, updating dates and a couple of other incrementing columns as part of the insert, using parallel. The creation of the second month’s data took longer than the PL/SQL code for the first month took. What the!!!??? I pulled up the AWR information and I could see that the problem was (possibly) with inter process communication between the parallel processes, as shown by the PX DEQ CREDIT:send blkd wait event.

The below screenshot shows the overall instance workload, the green is CPU and the Pink is “Other”. Only one SQL statement is responsible for all of this effort, via 5 sessions (four of which are parallel threads) You can see that the issue had been going on for over an hour {oh, and to a certain extent these pretty pictures are pointless – I am not looking for the exact solution now, but having loaded the pictures up to the blog site I am damn well going to pretty-up my blog with them}:

Drilling into that one session shows that the bulk of the waits by far is for PX DEq Credit: Send blkd:

By selecting that wait event, I got the histogram of wait times since the system started {I love those little histograms of wait times}:

Note that these waits are for long periods of time, around 1/10 of a second on average and some are for up to 2 or 4 seconds.

The thing is, I had anticipated this problem and increased my PARALLEL_EXECUTION_MESSAGE_SIZE to 16K from the default of 2K already, as I knew from experience that the default is way to small and has slowed down parallel execution for me before. So why was I seeing poorer performane now than I anticipated? I’m not understanding stuff. So I needed to change one thing and see what impact it has and repeat until I got to the bottom of it.

Except I could not – the next team in line was waiting for this data and they already had massive pressure to deliver. My job, what my employer was giving me money to do, was to fix this problem and move on. So, in this instance I just pragmatically got on with getting the task done as quickly as I could.

I did what we all do {don’t we?} under times of accute time pressure. I made a handful of changes, using the knowledge I already have and guessing a little, hoping that one of them would do the trick. This included reducing the degree of parallelism, adding the /*+ append */ hint (I simply forgot the first time around), pre-allocating the required space to the tablespace, muttering “pleaseopleaseoplease” under my breath….

It worked:

The job ran in less than 20 minutes and used less resource during that time. Well, it waited less anyway.
The wait histograms show lots and lots of shorter duration waits:

The duplication took 20 minutes when the previous attempt had been terminated after 2 hours when other factors forced it to be curtailed. Job done.

But the thing is, the problem was not fixed. I got the task done in a timescale that was required, I satisfied the need of the client, but I was and am not sure exactly why.

If I was a permanent employee I would consider pressing for being allowed to spend some time working this out, as my employer benefits from me extending my knowledge and skills. This is not always a successful strategy :-) {but it does indicate to me that my employer has a Problem and I need to address that}. In any case, as I was a consultant on this job, I was being paid to already know stuff. So it is now down to me, some personal time, the internet and people more knowledgeble than me who I can ask for help to sort this out.

And that was my main point. Often at work you have to get over the issue. Later on, you work out what the issue was. I wish I could get paid for the latter as well as the former. The real blow for me is that I no longer have access to that site and the information. My job was complete and, whether they have really shut down my access or not, I am not allowed to go back into the systems to dig around. I think I now know the issue, but I can’t prove it.

Why is my SYSAUX Tablespace so Big? Statistics_level=ALL

One of my most popular postings is about why your SYSTEM tablespace could be rather large. Recently I’ve had issues with a SYSAUX tablespace being considerably larger than I expected, so I thought I would do a sister posting on the reason.

The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the User Interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low and so I asked that the STATISTICS_LEVEL be set to ALL, so that bind variables (amongst other things) would be gathered. This is on version, btw, enterprise edition and 4-node RAC.

We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in and when everything was still very “exciting” from a problem resolution perspective, I got an odd email from the DBA team. Would they like us to add another datafile to the SYSAUX tablespace. Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled it’s 5th datafile. Why? I checked to see what was taking up the space in the tablespace:

 select * from
 (select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
 where rownum < 40

OWNER              SEGMENT_NAME                                             SIZE_M
------------------ -------------------------------------------------- ------------
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911            27,648
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911         26,491
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3537            23,798
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3537         22,122
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296            17,378
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296         16,818
SYS                WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3          136
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                       96
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3537                       72
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_4296                       47
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_           45
SYS                I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST~                              41
SYS                WRH$_SYSMETRIC_SUMMARY~                                      40
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_           37

As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions. They are massive compared to other objects. A quick goggle did not come up with much other than many hits just listing what is in SYSAUX and the odd person also seeing SYSAUX being filled up with these objects and suggested ways to clear down space, nothing about the cause.

I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR given the name of objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of 15 minute intervals and 35 day retention was too aggressive. I knew this was not the case, I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point so I asked the DBAs to raise a ticket, which they duly did.

Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now corroborated with a friend from a previous site with 5 minute snapshot intervals and two months retention period and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know it would fix the issue as we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.

As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but the STATISTICS_LEVEL=ALL. The one and only hit you get in metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7 setting this parameter to ALL is known to create a lot of data about Latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked out Metalink directly to confirm all this}.

So with a known issue we felt confident that altering the initialisation parameter would solve the issue. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This allowed us to see some other impacts of this issue.

The mmon process which looks after AWR data was becoming a top session in our OEM performance screens. In particular, a statement with SQL id 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many oracle V10 systems as it is internal code}:

The internal SQL inserting into wrh$_latch_children was becoming demanding

This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.

The physical IO requirements and 15-20 second elapsed time made this out most demanding statement on the system

We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided in a period of “intense management interest” in the daily load…

What was happening to the size of the SYSAUX tablespace?

Enter the tablespace (or leave null)> sys

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,187,904  465,503,232  33,553,408       14
                     free       10,728       85,824      21,504       20
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,360      546,880     546,752        3

4 rows selected.

select * from
(select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
where rownum < 40

OWNER    SEGMENT_NAME                                                     SIZE_M
-------- ------------------------------------------------------------ ----------
SYS      WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6201                     30262
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5817                     29948
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5435                     28597
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4675                     28198
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911                     27648
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5817                  27144
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6585                     26965
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6201                  26832
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4675                  26741
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911                  26491
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296                     26307
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5435                  26248
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296                  25430
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6585                  25064
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5058                     24611
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5058                  23161
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6966                      9209
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6966                   8462
         WRH$_SYSMETRIC_SUMMARY~                                             152
         WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911               136
         WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                               96


OWNER                          OBJ_PART_NAME                                SIZE_M
------------------------------ ---------------------------------------- ----------
SYS                            WRH$_LATCH_CHILDREN-WRH                  231745.063
SYS                            WRH$_LATCH_CHILDREN_PK-WRH               215573.063
SYS                            WRH$_SQLSTAT-WRH                           711.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY_PK-WRH           439.0625
SYS                            WRH$_ACTIVE_SESSION_HISTORY-WRH            437.0625
SYS                            WRH$_LATCH_PARENT-WRH                      292.0625
SYS                            WRH$_LATCH-WRH                             276.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY-WRH              273.0625
SYS                            WRH$_SEG_STAT-WRH                          268.0625
SYS                            WRH$_LATCH_PARENT_PK-WRH                   239.0625
SYS                            WRH$_SYSSTAT_PK-WRH                        237.0625

Yes, that is close to half a terabyte of SYSAUX and it is all used, more partitions have appeared and the total size of the largest segments in SYSAUX show how WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.

Shortly after, we finally got permission to change the live system. The impact was immediate, mmon dropped from being the most demanding session, that SQL code dropped down the rankings and the issues with running out of undo ceased.

I was anxious to see if the old data got purged, as the Metalink note had suggested the data would not be purged. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place and after just over a month, the SYSAUX tablespace contained a lot less information and was mostly free space:

OWNER      OBJ_PART_NAME                                SIZE_M
---------- ---------------------------------------- ----------
SYS        WRH$_LATCH_MISSES_SUMMARY_PK-WRH           512.0625
SYS        WRH$_LATCH_MISSES_SUMMARY-WRH              350.0625
SYS        WRH$_LATCH-WRH                             304.0625
SYS        WRH$_SQLSTAT-WRH                           280.0625
SYS        WRH$_LATCH_PK-WRH                          259.0625
SYS        WRH$_SYSSTAT_PK-WRH                        247.0625
SYS        WRH$_SERVICE_STAT_PK-WRH                   228.0625
SYS        WRH$_PARAMETER_PK-WRH                      201.0625
SYS        WRH$_PARAMETER-WRH                         169.0625
SYS        WRH$_SYSSTAT-WRH                           169.0625
SYS        WRH$_SEG_STAT-WRH                          161.0625
SYS        WRH$_SYSTEM_EVENT_PK-WRH                   156.0625
SYS        WRH$_SYSMETRIC_SUMMARY-                         152
SYS        WRH$_SYSTEM_EVENT-WRH                      133.0625
SYS        WRH$_SERVICE_STAT-WRH                      123.0625
SYS        WRH$_ACTIVE_SESSION_HISTORY-WRH            115.0625

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,251,904  466,015,232  33,553,408       15
                     free   57,479,400  459,835,200   4,063,232    1,208
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,048      544,384     544,320        2

Now, how do we get that space back? I left that with the DBA team to resolve.

Oh, one last thing. I mentioned the above to a couple of the Oaktable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”

{Update, I’ve just spotted this posting by Colbran which is related. Last time I googled this I just got a stub with no information}

I am Neo off the Matrix (apparently)

I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.

As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:

This will not be an interesting picture to you, but to me it tells me a lot about my system

“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”

“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.

That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going.

The screen shot above is not even actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that it scales to show the greatest of the largest peak or a number of sessions to match the number of CPUs (real or fake) that are available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope in 12 months and the system is new, so activity is scuttling along the bottom of the graph.

However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.

That is why I love these screens – I recognise “my” activity patterns from the graph, I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issues, started investigating and found out the cause before the Operations desk has even noticed an issue.

Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.

Network Events in ASH

Note - using ASH and the Top Activity screen require the use of the Diagnostics Pack License.

This post was prompted by yet another performance problem identified using
pretty pictures and Active Session History data. Although, as you'll see, some
pretty old-fashioned tools played their part too!

ASH entries only exist for certain SQL*Net events. As usual, I think the
design is very successful as long as you have a basic understanding of how ASH

The events all fall into one of three wait classes - Application, Network
and Idle.

SQL> select wait_class, name from v$event_name where name like 'SQL*Net%' order by 1, 2;

-------------------- ----------------------------------------
Application          SQL*Net break/reset to client
Application          SQL*Net break/reset to dblink
Idle                 SQL*Net message from client
Idle                 SQL*Net vector message from client
Idle                 SQL*Net vector message from dblink
Network              SQL*Net message from dblink
Network              SQL*Net message to client
Network              SQL*Net message to dblink
Network              SQL*Net more data from client
Network              SQL*Net more data from dblink
Network              SQL*Net more data to client
Network              SQL*Net more data to dblink
Network              SQL*Net vector data from client
Network              SQL*Net vector data from dblink
Network              SQL*Net vector data to client
Network              SQL*Net vector data to dblink

16 rows selected.

Hopefully that immediately dismisses the suggestion I've heard a few times that ASH/Top Activity ignores SQL*Net events. However, there are some events that are classed as Idle here that will not be captured in ASH data and will therefore not appear in the Top Activity screen. (In fact, even if you were to set the _ash_sample_all parameter you would not see these events in the Top Activity screen, even though ASH would contain entries for them. There is an additional filter applied to the data before it's displayed. In any case, I would *not* recommend setting _ash_sample_all except for fun.)

The most contentious of these is probably
SQL*Net message from client. This event indicates that the Oracle server is
waiting for the client to  do something. If the client is a user session, Oracle
is waiting for the user to do something - perhaps someone has a sql*plus session open that they're doing nothing with or they are filling in an application screen with data? From an Oracle perspective that session
is Idle and there isn't much we can do to tune it, including the waits would heavily skew the data we're using for our analysis and, because events in the Idle
class are specifically excluded from both ASH data and the OEM performance
pages,  such activity won't appear. That seems sensible to me.

However, if the client is actually an application server,
those supposedly idle events can tell us something about end-to-end system
performance. This blog post describes one real life situation where that event
helped me identify the performance problem.

Back to the problem at hand. Sadly, in the midst of a a very chaotic work situation, I didn't manage to grab any of the graphs in question but we had a persistent problem with our critical batch processes spending time waiting on 'SQL*Net more data from client', which showed up very clearly as network class waits.

Now there are a number of reasons why this might be happening. Just a couple off the top of my head

1) There is a performance problem with the application server so Oracle is expecting more data to arrive more quickly than the app server can deliver it.

2) A network problem is affecting the delivery speed.

Although the issue here turned out to be number 2, I suppose the true underlying issue is that this application is insanely 'chatty' and uses very frequent round trips to deliver data. Maybe the application design could be improved first?

Ultimately we enabled Extended Tracing on the relevant sessions and confirmed that, yes, the sessions were spending most of their time waiting for the next batch of data and also how long these waits were. In practise, I'm not sure how much value the trace files added in this particular case. In fact, I think this is an excellent example of the underlying beauty of the sampling approach. As an event became *slower* it appeared *more often* in the samples and we had plenty of examples of how long some of the waits were. (But only some - probably the longest ones.)

How did we solve this mysterious problem? With two very useful tools

- ping/tracert. I simply went on to the app server, pinged the db server and confirmed that the response time (from my memory) was in the range of 70-80 milliseconds - far higher than I would expect. One of the Java developers then ran a trace route to the server which highlighted that network performance was great until we hit the db server. It turned out that one of the network interface cards had auto-negotiated down to 100M/s and this was the root cause - easily fixed. Sometimes old-fashioned and simple tools are all that you need, once you know where you should focus your attention (and ASH is great at focussing attention in the right place).

- The next tool is one of the reasons I wanted to write this post. What made me try ping? Whilst I would have got there eventually, I thought I would try my performance optimisation secret weapon! One just for the database wizards! Something I've been keeping an eye on is Oracle Support's attempt to give people a methodical approach to solving performance issues via the Oracle Performance Diagnostic Guide on My Oracle Support - Note ID 390374.1. There are several guides on that landing page but here is the link to the Slow Performance PDF. (You'll need an account to be able to access this.) This is an example of the kind of advice there. (Note that this is for SQL*Net message from client, but you would tend to get most events together anyway and there's similar advice for more data from client.)

"Cause Identified: Slow network limiting the response time between client and database.
The network is saturated and this is limiting the ability of the client and database to communicate with each other.
Cause Justification
1. SQL*Net message from client waits are a large part of the overall time (see the overall summary
2. Array operations are used. This is seen when there are more than 5 rows per execution on average (divide total rows by total execution calls for both recursive and non-recursive calls)
3. The average time for a ping is about equal to twice the average time for a SQL*Net message from client wait and this time is more than a few milliseconds. This indicates that most of the client time is spent in the network."

I've been loathe to mention this document until now because it looks like a work in progress (albeit not updated since the start of 2009), isn't perfect, has gaps and I can imagine a number of my peers taking issue with some of the content. But for something to aid learning if you're not naturally great at solving performance problems, I can think of much worse starting places.

In the end, what could have been a really tricky performance problem (and might have taken a while to notice) was apparent with a brief glance at OEM and we could then follow up by applying the right analysis tools to the problem.

To finish off, when I searched Google for any examples of 'SQL*Net more data from client', one of the first results that cropped up was a chapter from one of my favourite books - 'Optimising Oracle Performance' by Cary Millsap and Jeff Holt. Well worth a read ....

Alternative Pictures Demo

Note - features in this post require the Diagnostics Pack license

Not long after I'd finished the last post, I realised I could reinforce the points I was making with a quick post showing another one of the example tests supplied with Swingbench - the Calling Circle (CC) application. Like the Sales Order Entry application, CC is a mixed read/write test consisting of small transactions. As always, there's more information at Dominic's website.

One of the main differences to the SOE test is that the CC test consumes data so you need to generate a new set of data before each test using the supplied ccwizard utility. I won't show you the entire workflow here but enough to give you a flavour of the process. The utility is the same one used to create the necessary CC schema in the first place but the option I'm looking for here is "Generate Data for Benchmark Run".

I'd already decided that my CC schema is populated with data for 1 million customers when I created it so I just need to specify the number of transactions the next test will be for. I happen to know that on my particular configuration, a 1000 transaction test will take around 5 minutes to run.

I ran the test twice. I've highlighted the first run here in the Top Activity page.

It should be clear that CC suffers significant log file sync waits on my particular test platform, just like the SOE test. Therefore I'll regenerate the test data set, enable asynchronous commits and re-run the test. Here I've highlighted the second test run.

As well as seeing a similar change in the activity profile according to the ASH samples (the log file sync activity has disappeared as has the LGWR System I/O), there's a significant difference to the SOE test. Because this test run is based on a specific workload volume, as defined by the size of the test data, rather than a fixed time period, the second test run completed more quickly than the first run. The activity only fills the 5 minute activity bar partially, rather than the first test which filled the whole bar.

If you test a specific and limited workload volume it is much clearer from the Top Activity page which test is processing transactions more quickly, based on the Time axis. That's why I didn't pick this example the first time - it's too obvious what's going on!

That Pictures demo in full

Note - features in this post require the Diagnostics Pack license

With so many potential technical posts in my pile, it was initially difficult to decide where to start again but I figured I should avoid the stats series until I'm back into the swing of things ;-) Instead I decided to fulfill a commitment I made to myself (and others, whether they knew about it or not) almost three months ago.

When I gave the evening demo session in the Amis offices I think the 2 hours went pretty well but, as usual with the OEM presentations, I got a little carried away and didn't conclude the demo properly. (This is also the demo I *would* have done at Hotsos last year if the damn thing had worked first time ;-)) It was a shame because as well as showing the neat and useful side of OEM Performance Pages, it also illustrates one of the common pitfalls in interpreting what the graphs are showing you.

I began by running a 4 concurrent user Sales Order Entry (SOE) test using Dominic Giles' Swingbench utility. I won't got into the details of the SOE test because I don't think it's particularly relevant here but you can always download and/or read about Swingbench for yourself at Dominics website.

I ran the test for a fixed period of 5 minutes using no think-time delay.

Using the capability to look at ASH data in the recent past, the OEM Top Activity page looks like this.

- There was a fairly consistent average of 4-5 active sessions over the 5 minutes period and, looking at the Top Sessions panel in the bottom right of the screen, these were four SOE sessions of similar activity levels and the LGWR process.

- The majority of time was spent on User I/O, System I/O and Commit Wait Class activity, with a little CPU.

- Three PL/SQL blocks were responsible for most of the Commit activity.

- The LGWR process was responsible for most of the System I/O activity.

I'll leave it there for now and won't drill down into any more detail.

In terms of optimising the performance of this test, what might I consider doing?

The most important aspect is to optimise the application to reduce the resource consumption to the minimum required to achieve our objectives. There's a whole bunch of User I/O activity that could perhaps be eliminated? But I'm going to ask you to accept the big assumption here that this application has been optimised and that I'm just using a Swingbench test as an illustration of the type of system-wide problem you could see. In that case, my eye is drawn to the Commit activity.

When I'm teaching this stuff, I'm usually deliberately simplistic (at least at the end of the process) and highlight that what I'm interested in 'tuning' is whatever most sessions are waiting on according to the ASH samples this screen uses. I used to explain how I'd look for the biggest areas of colour, drill down into those and identify what's going on. Sadly, I later heard that someone (I think it was JB at Oracle*) had already come up with a nifty acronym for this - COBS. Click on the Big Stuff! One day I will come up with a nifty acronym for something too, but you shouldn't hold your breath waiting.

So, if I click on the big stuff here, I can see that the Commit Class waits are log file sync.

How might I reduce the time that sessions are waiting for log file sync? Here are a few reasons why the test sessions might be waiting on log file sync more often or for longer than I'd like.

- Application design - committing too frequently
- CPU starvation
- Slow I/O to online redo log files

Whether waits are predominantly the result of CPU overload or slow I/O can be determined by looking at the underlying log file parallel write wait times on the LGWR process but that's a bigger subject for another time.

You can look into all of these in more depth - and should - but as this is designed to be a fun demo of the pretty pictures (it used to be 'the USB stick demo'), I'll simply try to eliminate that activity and re-run the same test. Here's how Top Activity looks now.

Oh. Maybe that wasn't what you expected? OK, the LGWR activity has disappeared, but it seems the system is almost as busy as it was before but that the main bottleneck is now User I/O activity. That's often the way, though - you eliminate one bottleneck in a system and it just shows up somewhere else. It must be good to get rid of log file sync waits though, right? User I/O seems like more productive work and I've managed to make the LGWR activity disappear completely.

But then if you were to look at this graph in terms of Average Active Sessions or DB Time or (as it's more likely to be expressed) how big that spike looks, the two tests would look similarly busy from a system-wide perspective. They were but the real question is - busy doing *what*? There's some important information missing here and Swingbench is able to provide it.

TotalCompletedTransactions 22,868

Mmmm, so I wonder what that value was for the first run?

TotalCompletedTransactions 12,232

Woo-hoo! *That's* what I call tuning a benchmark - processing almost twice the number of transactions in the same 5 minute period.

So it turns out that the sessions in the database *were* just as busy during the second run (not too surprising seeing as the test has no user think time so keeps hammering the database with as many requests as it can handle) but that they were busy doing the more productive work of reading and processing data rather than just waiting for COMMITs to complete.

I raised this issue of DB Time not showing activity details with Graham Wood* at Oracle in relation to a previous blog post. I think he made the point to me that that's why the OEM Performance Home Page is *not* the Top Activity page. If I take a look at that home page, it shows me the same information as the Swingbench results output did, albeit not as clearly

Looking at the Throughput graph, I can see that the second test processesd around double the number of Transactions per second for the same test running on the same system.

To wrap up (and be a little defensive) ...

- Yes, I could have traced one or more sessions and generated a complete and detailed response time profile that should have lead me to the same conclusion.

- Yes, as this is a controlled test environment and I'm the only 'user', AWR/Statspack would have been an even more powerful analysis tool in the right hands.

- The Top Activity page is not the most appropriate tool for this job but it is handy for illustrating concepts.

- Lest I seem a slavish pictures fan, I'm showing how people might misuse or misundersand ASH/Top Activity. In this case, the Home Performance Page is a much better tool because we're looking at system-wide data and not drilling into session or SQL details.

Oh, and what is my Top Secret Magic Silver Bullet Tuning Tip for OLTP-type applications? (only to be used by Advanced Oracle Performance Wizards)

alter system set commit_write='BATCH, NOWAIT';

Done! In fact, why not just use this on all of your systems, just in case people are waiting on log file sync?

(Leaves space below for angry responses and my withering humorous retorts)

* This is not name-dropping, this is giving due credit to the people who really know what they're talking about