Oakies Blog Aggregator

_cursor_obsolete_threshold

At the recent Trivadis Performance Days in Zurich, Chris Antognini answered a question that had been bugging me for some time. Why would Oracle want to set the default value of _cursor_obsolete_threshold to a value like 8192 in 12.2?

In 11.2.0.3 the parameter was introduced with the default value 100; then in 11.2.0.4, continuing into 12.1, the default value increased to 1,024 – what possible reason could anyone have for thinking that 8192 was a good idea?

The answer is PDBs – specifically the much larger number of PDBs a single CDB can (theoretically) support in 12.2.

In fact a few comments, and the following specific explanation, are available on MOS in Doc ID 2431353.1 “High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance”:

The default value of _cursor_obsolete_threshold is increased heavily (8192 from 1024) from 12.2 onwards in order to support 4096 PDBs which was only 252 PDBs till 12.1. This parameter value is the maximum limit for obsoleting the parent cursors in an multitenant environment and cannot be increased beyond 8192.

Having said, this is NOT applicable for non-CDB environment and hence for those databases, this parameter should be set to 12.1 default value manually i.e. 1024. The default value of 1024 holds good for non-CDB environment and the same parameter can be adjusted case-to-case basis should there be a problem.

It’s all about PDBs – more precisely, it’s all about CDBs running a huge number of PDBs, which is not necessarily the way that many companies are likely to use PDBs. So if you’re a fairly typical company running a handful of PDBs in a single CDB then it’s probably a good idea to set the parameter down to the 12.1 value of 1024 (and for badly behaved applications I’d consider going even lower) – and this MOS note actually makes that an official recommendation.
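
For reference, the change itself is trivial – though this is only a sketch, and underscore parameters should never be changed without checking with Oracle Support first. The parameter is hidden, so the usual place to check its current value is x$ksppi / x$ksppcv (as SYS), and the change goes into the spfile so that it survives a restart:

alter system set "_cursor_obsolete_threshold" = 1024 scope = spfile;

-- as SYS: check the current value of the hidden parameter
select  i.ksppinm   parameter_name,
        v.ksppstvl  current_value
from    x$ksppi  i,
        x$ksppcv v
where   i.indx    = v.indx
and     i.ksppinm = '_cursor_obsolete_threshold'
;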

Impact analysis

What’s the worst that could happen if you actually have many PDBs all executing the same application and that application has a few very popular and frequently executed statements? Chris Antognini described a model he’d constructed and some tests he’d done to show the effects. The following code is a variation on his work. It addresses the following question:

If you have an application that repeatedly issues parse calls (explicitly or implicitly) but doesn’t take advantage of the session cursor cache, each call has to search the library cache by hash_value / sql_id for the parent cursor and then walk the chain of child cursors looking for the right child. What’s the difference in the work done if this “soft parse” has to walk the list to child number 8,191 instead of finding the right cursor at child number 0?

Here’s the complete code for the test:


create table t1 as
select 1 id from dual
/

alter table t1 add constraint t1_pk primary key (id)
/

spool cursor_obsolete.lst

alter system flush shared_pool;
alter system flush shared_pool;

set serveroutput off
select /*+ index(t1) */ id from t1 where id > 0;
select * from table(dbms_xplan.display_cursor);

execute snap_my_stats.start_snap
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+8192 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

column sql_text format a60
select sql_id, child_number, loaded_versions, executions, sql_text from v$sql where sql_text like 'SELECT%T1%' order by child_number;

prompt  ===============
prompt  Low child reuse
prompt  ===============

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 100+1..100+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

prompt  ================
prompt  High child reuse
prompt  ================

set serveroutput off
execute snap_my_stats.start_snap

declare
        m_id number;
begin
        for i in 7168+1..7168+1024 loop
                execute immediate 'alter session set optimizer_index_cost_adj = ' || i ;
                select /*+ index(t1) */ id into m_id from t1 where id > 0;
        end loop;
end;
/

set serveroutput on
execute snap_my_stats.end_snap

spool off

I’ve created a table with just one row and given it a primary key. My testing query is going to be very short and simple: a query hinted to return that one row by a primary key index range scan.

I’ve flushed the shared pool (twice) to minimise fringe contention from pre-existing information, then executed the statement to populate the dictionary cache and some library cache information and to check the execution plan.

The call to the package snap_my_stats is my standard method for reporting changes in v$mystat across the test. I’ve called the start_snap procedure twice in a row to make sure that its first load doesn’t add some noise to the statistics that we’re trying to capture.
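
If you don’t have the package, here’s a minimal sketch of the kind of thing it does – this is not the snap_my_stats code itself, just an illustration of capturing a baseline of v$mystat and reporting the deltas afterwards:

-- not the real snap_my_stats package: a minimal before/after capture of v$mystat
create table my_stats_baseline as
select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
;

-- ... run the test workload here ...

select  sn.name, ms.value - bl.value    delta
from    v$statname sn, v$mystat ms, my_stats_baseline bl
where   ms.statistic# = sn.statistic#
and     bl.name       = sn.name
and     ms.value - bl.value != 0
order by
        sn.name
;

drop table my_stats_baseline;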

The test runs in three parts.

  • First I loop 8192 times executing the same statement, but with a different value for the optimizer_index_cost_adj for each execution – this gives me the limit of 8192 child cursors, each reporting “Optimizer Mismatch” as the reason for not sharing. I’ve run a query against v$sql after this to check that I have 8192 child cursors – you’ll need to make sure your shared pool is a few hundred megabytes if you want to be sure of keeping them all in memory.
  • The second part of the test simply repeats the loop, but only for the first 1,024 child cursors. At this point the child cursors exist, so the optimizer should be doing “soft” parses rather than hard parses.
  • The final part of the test repeats the loop again, but only for the last 1,024 child cursors. Again they should exist and be usable, so the optimizer should again be doing “soft” parses rather than hard parses.

What I’m looking for is the extra work it takes for Oracle to find the right child cursor when there’s a very long chain of child cursors. From my memory of dumping the library cache in older versions of Oracle, the parent will point to a “segmented array” of pointers to child cursors, and each segment of the array will consist of 16 pointers, plus a pointer to the next segment. So if you have to find child cursor 8191 you will have to follow 512 segment pointers plus 16 child pointers per segment (totalling 8,704 pointers) before you find the child you want – and you’re probably holding a mutex (or latch) while doing so.

One peripheral question to ask, of course, is whether Oracle keeps appending to the segmented array, or whether it uses a “pushdown” approach when allocating a new segment so that newer child cursors are near the start of the array (i.e. will searching for child cursor 0 be the cheapest search or the most expensive one?).

And the results, limited to just the second and third parts, with just a couple of small edits are as follows:


host sdiff -w 120 -s temp1.txt temp2.txt >temp.txt

===============                                            |    ================
Low child reuse                                            |    High child reuse
===============                                            |    ================

Interval:-  0 seconds                                      |    Interval:-  6 seconds

opened cursors cumulative                      2,084       |    opened cursors cumulative                      2,054
recursive calls                                6,263       |    recursive calls                                6,151
recursive cpu usage                               33       |    recursive cpu usage                              570
session logical reads                          1,069       |    session logical reads                          1,027
CPU used when call started                        33       |    CPU used when call started                       579
CPU used by this session                          37       |    CPU used by this session                         579
DB time                                           34       |    DB time                                          580
non-idle wait count                               16       |    non-idle wait count                                5
process last non-idle time                         1       |    process last non-idle time                         6
session pga memory                           524,288       |    session pga memory                            65,536
enqueue requests                                  10       |    enqueue requests                                   3
enqueue releases                                  10       |    enqueue releases                                   3
consistent gets                                1,069       |    consistent gets                                1,027
consistent gets from cache                     1,069       |    consistent gets from cache                     1,027
consistent gets pin                            1,039       |    consistent gets pin                            1,024
consistent gets pin (fastpath)                 1,039       |    consistent gets pin (fastpath)                 1,024
consistent gets examination                       30       |    consistent gets examination                        3
consistent gets examination (fastpath)            30       |    consistent gets examination (fastpath)             3
logical read bytes from cache              8,757,248       |    logical read bytes from cache              8,413,184
calls to kcmgcs                                    5       |    calls to kcmgcs                                    3
calls to get snapshot scn: kcmgss              1,056       |    calls to get snapshot scn: kcmgss              1,026
table fetch by rowid                              13       |    table fetch by rowid                               1
rows fetched via callback                          6       |    rows fetched via callback                          1
index fetch by key                                 9       |    index fetch by key                                 1
index scans kdiixs1                            1,032       |    index scans kdiixs1                            1,024
session cursor cache hits                         14       |    session cursor cache hits                          0
cursor authentications                         1,030       |    cursor authentications                         1,025
buffer is not pinned count                     1,066       |    buffer is not pinned count                     1,026
parse time cpu                                    23       |    parse time cpu                                   558
parse time elapsed                                29       |    parse time elapsed                               556
parse count (total)                            2,076       |    parse count (total)                            2,052
parse count (hard)                                11       |    parse count (hard)                                 3
execute count                                  1,050       |    execute count                                  1,028
bytes received via SQL*Net from client         1,484       |    bytes received via SQL*Net from client         1,486

Two important points to note:

  • the CPU utilisation goes up from 0.33 seconds to 5.7 seconds.
  • the number of hard parses is virtually zero – this is all about the cost of searching for the right child cursor on each soft parse.

You might question the 2,048-ish “parse count (total)” figures – don’t forget that we do an “execute immediate” to change the optimizer_index_cost_adj on each pass through the loop. That’s probably why the parse count doubles, although the “alter session” doesn’t then report as an “execute count”.

The third call to a statement is often an important one – it’s often the first one that doesn’t need “cursor authentication” – so I ran a similar test executing the last two loops a second time; there was no significant change in the CPU or parse activity between the 2nd and 3rd executions of each cursor. For completeness I also ran a test in which the loop for the last 1,024 child cursors ran before the loop for the first 1,024. Again this made no significant difference to the results – the low-numbered child cursors take less CPU to find than the high-numbered ones.

Bottom line

The longer the chain of child cursors the more time (elapsed and CPU) you spend searching for the correct child; and when a parent is allowed 8,192 child cursors the extra time can become significant. I would claim that the roughly 5 seconds difference in CPU time appearing in this test corresponds purely to an extra 5 milliseconds per execution spent walking an extra 7,000 or so steps down the chain.

If you have a well-behaved application that uses the session cursor cache effectively, or uses “held cursors”, then you may not be worried by very long chains of child cursors. But I have seen many applications where cursor caching is not used and every statement execution from the client turns into a parse call (usually implicit) followed by a hunt through the library cache and walk along the child chain. These applications will not scale well if they are cloned to multiple PDBs sharing the same CDB.
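
If you want a rough indication of whether an application is benefiting from the session cursor cache, comparing a few session statistics is a reasonable first check. This is only a sketch, and &target_sid is simply a placeholder for whichever session you happen to be investigating:

select  sn.name, ss.value
from    v$statname sn, v$sesstat ss
where   ss.sid        = &target_sid
and     ss.statistic# = sn.statistic#
and     sn.name in (
                'session cursor cache hits',
                'session cursor cache count',
                'parse count (total)',
                'parse count (hard)'
        )
order by
        sn.name
;

If “session cursor cache hits” stays close to zero while “parse count (total)” keeps climbing, the application is paying for a library cache search (and a walk along the child chain) on every execution.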

Footnote 1

The odd thing about this “cursor obsolete” feature is that I have a distinct memory that when PDBs were introduced at an ACE Director’s meeting a few years ago the first thought that crossed my mind was about the potential for someone running multiple copies of the same application as separate PDBs seeing a lot of library cache latch contention or cursor mutex contention because any popular statement would now be hitting the same parent cursor from multiple PDBs. I think the casual (i.e. neither formal, nor official) response I got when I raised the point was that the calculation of the sql_id in future releases would take the con_id into consideration. It seems that that idea fell by the wayside.

Footnote 2

If you do see a large number of child cursors for a single parent then you will probably end up looking at v$sql_shared_cursor for the sql_id to see if that gives you some good ideas about why a particular statement has generated so many child cursors. For a list of explanations of the different reasons captured in this view, MOS Doc ID 296377.1 “Troubleshooting: High Version Count Issues” is a useful reference.
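
As a starting point – and this is only a sketch, since the exact set of columns in v$sql_shared_cursor varies with version – a query like the following lists some of the mismatch flags for the children of one parent (&m_sql_id is just a placeholder for the sql_id you are interested in; OPTIMIZER_MISMATCH is the flag that appears for the test above, and the REASON column, where populated, gives a little more detail):

select  child_number, optimizer_mismatch, bind_mismatch, reason
from    v$sql_shared_cursor
where   sql_id = '&m_sql_id'
order by
        child_number
;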

A refreshing look at PIVOT

We had an AskTOM question come in recently where our customer was very excited about the PIVOT operator in SQL, which lets you transpose rows to columns. This is a very common requirement in applications that want to take data that has been modelled in “pure” relational form and present it in a more “human-digestible” form. There are plenty of posts and examples out there about PIVOT, but if you haven’t seen one, here’s a trivial example just so that you are up to speed for the rest of this post.


SQL> select deptno, sum(sal) tot_sal
  2  from   scott.emp
  3  group by deptno
  4  order by 1;

    DEPTNO    TOT_SAL
---------- ----------
        10       8750
        20      10875
        30       9400

3 rows selected.

--
-- which we can flip into a single row with three columns
--

SQL>
SQL> select *
  2  from   (select deptno, sal
  3          from   scott.emp)
  4  pivot  (sum(sal) as tot_sal for (deptno) in (10 as dept10, 20 as dept20, 30 as dept30));

DEPT10_TOT_SAL DEPT20_TOT_SAL DEPT30_TOT_SAL
-------------- -------------- --------------
          8750          10875           9400

More on PIVOT here in the docs.

Like any form of data aggregation, there is a cost in doing a PIVOT, and our customer wanted to offset that cost by putting the results into a materialized view. And as the saying goes… “That is when the fight started.”

For the purpose of demonstration, let’s assume we run a medical practice and we capture information about doctors’ offices and their patients. Here is my rudimentary data model with some sample data:


SQL> create table patient
  2   ( region      int,
  3     office       int,
  4     patient      int,
  5     some_date date );

Table created.

SQL>
SQL> alter table patient add primary key ( region, office, patient );

Table altered.

SQL>
SQL> insert into patient values (1,1,1,sysdate);

1 row created.

SQL> insert into patient values (1,1,2,sysdate);

1 row created.

SQL> insert into patient values (1,1,3,sysdate);

1 row created.

SQL>
SQL> create table patient_attrib
  2   ( region      int,
  3     office       int,
  4     patient      int,
  5     property      varchar2(10),
  6     val           number);

Table created.

SQL>
SQL>
SQL> alter table patient_attrib add primary key ( region, office, patient, property );

Table altered.

SQL> alter table patient_attrib add constraint patient_attrib_fk
  2  foreign key ( region,office,patient) references patient (region,office,patient);

Table altered.

SQL>
SQL> insert into patient_attrib values (1,1,2,'weight',60);

1 row created.

SQL> insert into patient_attrib values (1,1,2,'height',1);

1 row created.

SQL> insert into patient_attrib values (1,1,2,'bp',2);

1 row created.

SQL> insert into patient_attrib values (1,1,2,'heart',3);

1 row created.

SQL> insert into patient_attrib values (1,1,2,'chol',4);

1 row created.

SQL> insert into patient_attrib values (1,1,2,'fatpct',5);

1 row created.

SQL>
SQL> insert into patient_attrib values (1,1,3,'weight',61);

1 row created.

SQL> insert into patient_attrib values (1,1,3,'height',1.1);

1 row created.

SQL> insert into patient_attrib values (1,1,3,'bp',2.1);

1 row created.

SQL> insert into patient_attrib values (1,1,3,'heart',3.1);

1 row created.

SQL> insert into patient_attrib values (1,1,3,'chol',4.1);

1 row created.

SQL> insert into patient_attrib values (1,1,3,'fatpct',5.1);

1 row created.

We have patients and various measurements about those patients. For reporting purposes, we want to output the patient records in a pivoted style, and hence the PIVOT operator is a natural fit:


SQL> select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_val, fatpct_val
  2  from
  3  (
  4  select h.*, hs.property, hs.val
  5  from   patient h,
  6         patient_attrib hs
  7  where  h.region = hs.region
  8  and    h.office  = hs.office
  9  and    h.patient = hs.patient
 10  )
 11  pivot ( sum(val) as val for ( property ) in ('weight' as weight, 'height' as height, 
 12       'bp' as bp, 'heart' as heart, 'chol' as chol, 'fatpct' as fatpct));

    REGION     OFFICE    PATIENT SOME_DATE WEIGHT_VAL HEIGHT_VAL     BP_VAL  HEART_VAL   CHOL_VAL FATPCT_VAL
---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1          2 02-OCT-19         60          1          2          3          4          5
         1          1          3 02-OCT-19         61        1.1        2.1        3.1        4.1        5.1

2 rows selected.

As mentioned, as the patient volume increases, we’ll seek to make that output come from a fast refreshable materialized view. To support that, we’ll throw in some materialized view logs in the normal way:


SQL> create materialized view log on patient
  2    with sequence, rowid (region,office,patient,some_date) including new values
  3  /

Materialized view log created.

SQL>
SQL> create materialized view log on patient_attrib
  2    with sequence, rowid (region,office,patient,property, val) including new values
  3  /

Materialized view log created.

Now I’ll take my existing PIVOT query and use that as the source for my materialized view:


SQL> create materialized view MV
  2  refresh fast
  3  -- on commit
  4  as
  5  select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val, fatpct_val
  6  from
  7  (
  8  select h.*, hs.property, hs.val
  9  from   patient h,
 10         patient_attrib hs
 11  where  h.region = hs.region
 12  and    h.office  = hs.office
 13  and    h.patient = hs.patient
 14  )
 15  pivot ( sum(val) as val for ( property ) in ('weight' as weight, 'height' as height, 
 16          'bp' as bp, 'heart' as heart, 'chol' as chol, 'fatpct' as fatpct));
create materialized view MV
*
ERROR at line 1:
ORA-12015: cannot create a fast refresh materialized view from a complex query

Unfortunately for us, PIVOT is a no-go for a fast refresh materialized view. Even if I try to utilize the precreated table “trick” which sometimes can work around this issue, we’re still stuck.


SQL> create table MV as
  2  select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val, fatpct_val
  3  from
  4  (
  5  select h.*, hs.property, hs.val
  6  from   patient h,
  7         patient_attrib hs
  8  where  h.region = hs.region
  9  and    h.office  = hs.office
 10  and    h.patient = hs.patient
 11  )
 12  pivot ( sum(val) as val for ( property ) in ('weight' as weight, 'height' as height, 
 13          'bp' as bp, 'heart' as heart, 'chol' as chol, 'fatpct' as fatpct));

Table created.

SQL> create materialized view MV
  2  on prebuilt table
  3  refresh fast on commit
  4  as
  5  select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val, fatpct_val
  6  from
  7  (
  8  select h.*, hs.property, hs.val
  9  from   patient h,
 10         patient_attrib hs
 11  where  h.region = hs.region
 12  and    h.office  = hs.office
 13  and    h.patient = hs.patient
 14  )
 15  pivot ( sum(val) as val for ( property ) in ('weight' as weight, 'height' as height, 
 16     'bp' as bp, 'heart' as heart, 'chol' as chol, 'fatpct' as fatpct));
create materialized view MV
          *
ERROR at line 1:
ORA-12054: cannot set the ON COMMIT refresh attribute for the materialized view

But all is not lost. Before PIVOT arrived in 11g, SQL practitioners had other mechanisms for achieving the same result, albeit with a more verbose and unwieldy syntax, via DECODE:



SQL> select h.region, h.office, h.patient, h.some_date,
  2         sum(decode(hs.property, 'weight', hs.val, 0)) weight_val,
  3         sum(decode(hs.property, 'height', hs.val, 0)) height_val,
  4         sum(decode(hs.property, 'bp', hs.val, 0)) bp_val,
  5         sum(decode(hs.property, 'heart', hs.val, 0)) heart_val,
  6         sum(decode(hs.property, 'chol', hs.val, 0)) chol_Val,
  7         sum(decode(hs.property, 'fatpct', hs.val, 0)) fatpct_val
  8  from   patient h,
  9         patient_attrib hs
 10  where  h.region = hs.region
 11  and    h.office  = hs.office
 12  and    h.patient = hs.patient
 13  group by h.region, h.office, h.patient, h.some_date;

    REGION     OFFICE    PATIENT SOME_DATE WEIGHT_VAL HEIGHT_VAL     BP_VAL  HEART_VAL   CHOL_VAL FATPCT_VAL
---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1          2 02-OCT-19         60          1          2          3          4          5
         1          1          3 02-OCT-19         61        1.1        2.1        3.1        4.1        5.1

2 rows selected.

Armed with this, and knowing that we need some additional aggregates for fast refresh on commit materialized views, we can achieve our desired result.



SQL> create materialized view MV
  2  refresh fast
  3  on commit
  4  as
  5  select h.region, h.office, h.patient, h.some_date,
  6         count(*) c,
  7         count(decode(hs.property, 'weight', hs.val, 0)) weight_cnt,
  8         count(decode(hs.property, 'height', hs.val, 0)) height_cnt,
  9         count(decode(hs.property, 'bp', hs.val, 0)) bp_cnt,
 10         count(decode(hs.property, 'heart', hs.val, 0)) heart_cnt,
 11         count(decode(hs.property, 'chol', hs.val, 0)) chol_cnt,
 12         count(decode(hs.property, 'fatpct', hs.val, 0)) fatpct_cnt,
 13         sum(decode(hs.property, 'weight', hs.val, 0)) weight_val,
 14         sum(decode(hs.property, 'height', hs.val, 0)) height_val,
 15         sum(decode(hs.property, 'bp', hs.val, 0)) bp_val,
 16         sum(decode(hs.property, 'heart', hs.val, 0)) heart_val,
 17         sum(decode(hs.property, 'chol', hs.val, 0)) chol_Val,
 18         sum(decode(hs.property, 'fatpct', hs.val, 0)) fatpct_val
 19  from   patient h,
 20         patient_attrib hs
 21  where  h.region = hs.region
 22  and    h.office  = hs.office
 23  and    h.patient = hs.patient
 24  group by h.region, h.office, h.patient, h.some_date;

Materialized view created.

Now we can test out the refresh capabilities of the view with some standard DML:



SQL> select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val,fatpct_val
  2  from mv order by 1,2,3;

    REGION     OFFICE    PATIENT SOME_DATE WEIGHT_VAL HEIGHT_VAL     BP_VAL  HEART_VAL   CHOL_VAL FATPCT_VAL
---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1          2 02-OCT-19         60          1          2          3          4          5
         1          1          3 02-OCT-19         61        1.1        2.1        3.1        4.1        5.1

2 rows selected.

SQL> insert into patient values (1,1,4,sysdate);

1 row created.

SQL>
SQL> insert into patient_attrib values (1,1,4,'weight',62);

1 row created.

SQL> insert into patient_attrib values (1,1,4,'height',1.2);

1 row created.

SQL> insert into patient_attrib values (1,1,4,'bp',2.2);

1 row created.

SQL> insert into patient_attrib values (1,1,4,'heart',3.2);

1 row created.

SQL> insert into patient_attrib values (1,1,4,'chol',4.2);

1 row created.

SQL> insert into patient_attrib values (1,1,4,'fatpct',5.2);

1 row created.

SQL>
SQL> commit;

Commit complete.

SQL>
SQL> select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val,fatpct_val
  2  from mv order by 1,2,3;

    REGION     OFFICE    PATIENT SOME_DATE WEIGHT_VAL HEIGHT_VAL     BP_VAL  HEART_VAL   CHOL_VAL FATPCT_VAL
---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1          2 02-OCT-19         60          1          2          3          4          5
         1          1          3 02-OCT-19         61        1.1        2.1        3.1        4.1        5.1
         1          1          4 02-OCT-19         62        1.2        2.2        3.2        4.2        5.2

3 rows selected.

SQL>
SQL> update patient_attrib
  2  set val = 65
  3  where patient = 3
  4  and property = 'weight';

1 row updated.

SQL>
SQL> commit;

Commit complete.

SQL>
SQL> select region, office, patient, some_date, weight_val, height_val, bp_val, heart_val, chol_Val,fatpct_val
  2  from mv order by 1,2,3;

    REGION     OFFICE    PATIENT SOME_DATE WEIGHT_VAL HEIGHT_VAL     BP_VAL  HEART_VAL   CHOL_VAL FATPCT_VAL
---------- ---------- ---------- --------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1          2 02-OCT-19         60          1          2          3          4          5
         1          1          3 02-OCT-19         65        1.1        2.1        3.1        4.1        5.1
         1          1          4 02-OCT-19         62        1.2        2.2        3.2        4.2        5.2

3 rows selected.

And like any materialized view, we can get a report on all of the capabilities available to us via DBMS_MVIEW:



SQL> @?/rdbms/admin/utlxmv.sql

Table created.

SQL> delete mv_capabilities_table;

0 rows deleted.

SQL> EXEC dbms_mview.explain_mview('MV');

PL/SQL procedure successfully completed.

SQL> select * from mv_capabilities_table
  2  @pr
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : PCT
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 1
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_COMPLETE
POSSIBLE                      : Y
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 1002
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_FAST
POSSIBLE                      : Y
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 2003
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REWRITE
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 3004
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : PCT_TABLE
POSSIBLE                      : N
RELATED_TEXT                  : PATIENT
RELATED_NUM                   : 840
MSGNO                         : 2068
MSGTXT                        : relation is not a partitioned table
SEQ                           : 4005
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : PCT_TABLE
POSSIBLE                      : N
RELATED_TEXT                  : PATIENT_ATTRIB
RELATED_NUM                   : 858
MSGNO                         : 2068
MSGTXT                        : relation is not a partitioned table
SEQ                           : 4006
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_FAST_AFTER_INSERT
POSSIBLE                      : Y
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 5007
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_FAST_AFTER_ONETAB_DML
POSSIBLE                      : Y
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 6008
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_FAST_AFTER_ANY_DML
POSSIBLE                      : Y
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         :
MSGTXT                        :
SEQ                           : 7009
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REFRESH_FAST_PCT
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         : 2157
MSGTXT                        : PCT is not possible on any of the detail tables in the materialized view
SEQ                           : 8010
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REWRITE_FULL_TEXT_MATCH
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         : 2159
MSGTXT                        : query rewrite is disabled on the materialized view
SEQ                           : 9011
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REWRITE_PARTIAL_TEXT_MATCH
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         : 2159
MSGTXT                        : query rewrite is disabled on the materialized view
SEQ                           : 10012
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REWRITE_GENERAL
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         : 2159
MSGTXT                        : query rewrite is disabled on the materialized view
SEQ                           : 11013
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : REWRITE_PCT
POSSIBLE                      : N
RELATED_TEXT                  :
RELATED_NUM                   :
MSGNO                         : 2158
MSGTXT                        : general rewrite is not possible or PCT is not possible on any of the detail tables
SEQ                           : 12014
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : PCT_TABLE_REWRITE
POSSIBLE                      : N
RELATED_TEXT                  : PATIENT
RELATED_NUM                   : 840
MSGNO                         : 2068
MSGTXT                        : relation is not a partitioned table
SEQ                           : 13015
==============================
STATEMENT_ID                  :
MVOWNER                       : MCDONAC
MVNAME                        : MV
CAPABILITY_NAME               : PCT_TABLE_REWRITE
POSSIBLE                      : N
RELATED_TEXT                  : PATIENT_ATTRIB
RELATED_NUM                   : 858
MSGNO                         : 2068
MSGTXT                        : relation is not a partitioned table
SEQ                           : 13016

PL/SQL procedure successfully completed.

SQL>
SQL>

ORA-01950 Error on a Sequence - Error on Primary Key Index

I posted yesterday a blog about an error on a sequence of ORA-01950 on tablespace USERS - ORA-01950 Error on a Sequence . This was attributed to the sequence by me because that's where the error in Oracle was pointing....[Read More]

Posted by Pete On 01/10/19 At 01:12 PM

How to enlarge an #Exasol database by adding a node

Adding a cluster node will not only increase the available storage capacity but also the total compute power of your cluster. This kind of scale-out is quite a common operation for Exasol customers.

My example shows how to change an existing 2+1 cluster into a 3+0 cluster. Before you can enlarge the database with an active node, this node has to be a reserve node first. See here how to add a reserve to a 2+0 cluster. Of course you can add another reserve node to change from 3+0 to 3+1 afterwards. See here if you wonder why you may want to have a reserve node at all.

Initial state – reserve node is present

I start with a 2+1 cluster – 2 active nodes and 1 reserve node:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge1.png]

For later comparison, let’s look at the distribution of rows of one of my tables:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge2.png]

The rows are roughly even distributed across the two active nodes.
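
If you want to check the distribution yourself, the screenshot above comes from a query of this kind – a sketch only, assuming a table called MY_TABLE; IPROC() is the Exasol function that returns the number of the node a row is stored on:

select iproc()  as node_id,
       count(*) as row_count
from   my_table
group  by iproc()
order  by node_id;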

Before you continue, it would be a good idea to take a backup on a remote archive volume now – just in case.

Shutdown database before volume modification

A data volume used by a database cannot be modified while that database is up, so shut it down first:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge3.png]

After going to the Storage branch in EXAoperation, click on the data volume:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge4.png]

Then click on Edit:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge5.png]

Decrease volume redundancy to 1

Change the redundancy from 2 to 1, then click Apply:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge6.png]

Why is the redundancy reduced from 2 to 1 here? Let’s try to explain that. Initially, I had 2 active nodes with a volume using redundancy 2:

[Image: https://uhesse.files.wordpress.com/2019/09/cluster.png]

A and B are master segments while A’ and B’ are mirrored segments. If I could add a node to this volume keeping the existing segments, it would look like this:

[Image: https://uhesse.files.wordpress.com/2019/09/badidea.png]

Of course this would be a bad idea. The redundancy is reduced to 1 before the new node is added to the volume:

[Image: https://uhesse.files.wordpress.com/2019/09/redundancy1.png]

Only distributed master segments with no mirrors at first. Then the redundancy is again increased to 2:

[Image: https://uhesse.files.wordpress.com/2019/09/redundancy2.png]

This way, every master segment can be mirrored on a neighbor node. That’s why the redundancy needs to be reduced to 1.

Add new node to volume

After having decreased the volume redundancy to 1, click Edit on the volume detail page again, add n13 as a new master node to the volume, and click Apply:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge7.png]

Increase redundancy to 2

Now click Edit again and increase the redundancy to 2:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge8.png]

The state of the volume now shows as RECOVERING – don’t worry, it just means that the mirrored segments are being created.

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge9.png]

Enlarge the database

Now click on the database link on the EXASolution screen:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge10.png]

Select the Action Enlarge and click Submit:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge11.png]

Enter 1 and click Apply:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge12.png]

The database detail page looks like this now:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge13.png]

Reorganize

Technically, this is a 3+0 cluster now – but the third node doesn’t contain any data yet. If we look at the same table as before, we see that no rows are on the new node:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge2.png]

To change that, a REORGANIZE needs to be done – either at the database layer, the schema layer, or the table layer. The easiest to perform is REORGANIZE DATABASE:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge14.png]

It took me about 10 minutes on my tiny database. That command re-distributes every table across all cluster nodes and can be time consuming with a high data volume. While a table is being reorganized, that table is locked against DML. You can monitor the ongoing reorganization by selecting from EXA_DBA_PROFILE_RUNNING in another session, as sketched below.
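
In SQL the step looks like this – a sketch; you can restrict the scope (and the locking) by reorganizing individual schemas or tables instead:

-- redistribute all tables across the enlarged set of active nodes
REORGANIZE DATABASE;

-- in a second session: watch what is currently being worked on
SELECT * FROM EXA_DBA_PROFILE_RUNNING;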

Final state

Let’s check the distribution of the previous table again:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge15.png]

As you can see above, now there are rows on the added node. Also EXAoperation confirms that the new node is not empty any more:

[Image: https://uhesse.files.wordpress.com/2019/09/enlarge16.png]

On a larger database, you would see that the volume usage of the nodes is less than before per node and every node is holding roughly the same amount of data. For failsafety, you could add another reserve node now.

Summary of steps

  1. Add a reserve node (if not yet existing)
  2. Take a backup on a remote archive volume
  3. Shutdown database
  4. Decrease volume redundancy to 1
  5. Add former reserve node as new master node to the volume
  6. Increase redundancy to 2
  7. Enlarge database by 1 active node
  8. Reorganize
  9. Add another reserve node (optionally)

ORA-01950 Error on a Sequence

UPDATE: I have updated information for this post and rather than make this one much longer i created a new post - please see ORA-01950 Error on a Sequence - Error on Primary Key Index Wow, its been a while....[Read More]

Posted by Pete On 30/09/19 At 01:42 PM

Negative Offload

At the Trivadis Performance Days 2019 I did a presentation on using execution plans to understand what a query was doing. One of the examples I showed was a plan from an Exadata system (using 11.2.0.4) that needed to go faster. The plan was from the SQL Monitor report and all I want to show you is one line that’s reporting a tablescan. To fit the screen comfortably I’ve removed a number of columns from the output.

The report had been generated while the statement was still running (hence the “->” at the left hand edge) and the query had scanned 166 segments (with no partition elimination) of a table with 4,500 data segments (450 range partitions and 10 hash sub-partitions – note the design error, by the way: the number of hash (sub)partitions in Oracle should always be a power of 2).


SQL Plan Monitoring Details (Plan Hash Value=3764612084)  
============================================================================================================================================
| Id   |           Operation            | Name  | Read  | Read  | Write | Write |   Cell   | Mem  | Activity |       Activity Detail       |  
|      |                                |       | Reqs  | Bytes | Reqs  | Bytes | Offload  |      |   (%)    |         (# samples)         |   
============================================================================================================================================
| -> 5 |      TABLE ACCESS STORAGE FULL | TXN   |  972K | 235GB |       |       | -203.03% |   7M |    63.43 | Cpu (1303)                  | 
|      |                                |       |       |       |       |       |          |      |          | cell smart table scan (175) | 
============================================================================================================================================

In the presentation I pointed out that for a “cell smart table scan” (note the Activity Detail column) this line was using a surprisingly large amount of CPU.

We had been told that the table was using hybrid columnar compression (HCC) and had been given some figures that showed the compression factor was slightly better than 4. I had also pointed out that the typical size of a read request was 256KB. (Compare Read Reqs with Read Bytes)

To explain the excessive CPU I claimed that we were seeing “double decompression” – the cell was decompressing (uncompressing) compression units (CUs), finding that the resulting decompressed data was larger than the 1MB unit that Exadata allows and sending the original compressed CU to the database server where it was decompressed again – and the server side decompression was burning up the CPU.

This claim is (almost certainly) true – but the justification I gave for the claim was at best incomplete (though, to be brutally honest, I have to admit that I’d made a mistake): I pointed out that the Cell Offload was negative 200% and that this was what told us about the double decompression. While double decompression was probably happening, the implication I had made was that a negative offload automatically indicated double decompression – and that was an incorrect assumption on my part. Fortunately Maurice Müller caught up with me after the session was over and pointed out the error, then emailed me a link to a relevant article by Ahmed Aangour.

The Cell Offload is a measure of the difference between the volume of data read and the volume of data returned to the server. If the cell reads 256KB from disc, but the column and row selection means the cell returns 128KB the Cell Offload would be 50%; if the cell returns 64KB the Cell Offload would be 75% (100 * (1 – 64KB/256KB)). But what if you select all the rows and columns from a compressed table – the volume of data after decompression would be larger than the compressed volume the cell had read from disc – and in this case we knew that we were reading 256KB at a time and the compression factor was slightly greater than 4, so the uncompressed data would probably be around 1MB, giving us a Cell Offload of 100 * (1 – 1024KB / 256KB) = negative 300%

Key Point: Any time that decompression, combined with the row and column selection, produces more data than the volume of data read from disc the Cell Offload will go negative. A negative Cell Offload is not inherently a problem (though it might hint at a suboptimal use of compression).

Follow-up Analysis

Despite the error in my initial understanding, the claim that we were seeing double decompression was still (almost certainly) true – but we need to be a little more sophisticated in the analysis. The clue is in the arithmetic a few lines further up the page. We can see that we are basically reading 256KB chunks of the table, and we know that 256KB will expand to roughly 1MB, so we ought to see a Cell Offload of about -300%; but the Cell Offload is -200%. This suggests fairly strongly that on some of the reads the decompressed data is slightly less than 1MB, which allows the cell to return the decompressed data to the database server, while some of the time the decompressed data is greater than 1MB, forcing the cell to send the original (compressed) CU to the database server.

We may even be able to work the arithmetic backwards to estimate the number of times double decompression appeared. Assume that two-thirds of the time the cell decompressed the data and successfully sent (just less than) 1MB back to the database server, and one-third of the time the cell decompressed the data, found that the result was too large, and sent 256KB of compressed data back to the server; let’s work with the 972,000 read requests reported and see what drops out of the arithmetic:

  • Total data read: 972,000 * 256KB = 243,000 MB
  • Data sent to db server:  648,000 * 1MB + 324,000 * 256KB = 729,000 MB
  • Cell Offload = 100 * (1 – 729/243) = -200%   Q.E.D.
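
If you prefer not to guess the two-thirds / one-third split, the arithmetic can be run backwards from the reported figures. With R = 256KB read per request, U ≈ 1MB returned when decompression succeeds at the cell, and a Cell Offload of -200% (i.e. the cell returned three times the volume it read), the fraction of requests returned uncompressed works out to (3R – R) / (U – R) = 2/3. Here’s the same calculation as a quick query – purely illustrative, with the numbers taken from the plan line above:

with params as (
        select
                972000  total_reads,            -- "Read Reqs" from the plan line
                0.25    read_mb,                -- 256KB per read request
                1       uncompressed_mb,        -- assumed size of a decompressed CU that just fits
                -200    cell_offload_pct        -- "Cell Offload" from the plan line
        from    dual
)
select
        round( ((1 - cell_offload_pct/100) * read_mb - read_mb)
                / (uncompressed_mb - read_mb), 3)               fraction_uncompressed,
        round( total_reads *
               ((1 - cell_offload_pct/100) * read_mb - read_mb)
                / (uncompressed_mb - read_mb))                  requests_returned_uncompressed
from    params
;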

Of course it would be nice to avoid guessing – and if we were able to check the session activity stats (v$sesstat) while the query was running (or after it had completed) we could pick up several numbers that confirmed our suspicion. For 11.2.0.4, for example, we would keep an eye on:

	cell CUs sent uncompressed
	cell CUs processed for uncompressed
	EHCC {class} CUs Decompressed

Differences between these stats allow you to work out the number of compression units that failed the 1MB test on the cell server and were sent to the database server to be decompressed. There is actually another statistic named “cell CUs sent compressed” which would make life easy for us, but I’ve not seen it populated in my tests – so maybe it doesn’t mean what it seems to say.
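
A sketch of the kind of query involved – run it from the session doing the tablescan (or switch to v$sesstat for someone else’s session); the statistic names here are as they appear in 11.2.0.4:

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     (   sn.name in (
                'cell CUs sent uncompressed',
                'cell CUs processed for uncompressed'
            )
         or sn.name like 'EHCC%CUs Decompressed'
        )
order by
        sn.name
;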

Here’s an example from an 11.2.0.4 system that I presented a few years ago showing some sample numbers.

cell CUs sent uncompressed              5,601
cell CUs processed for uncompressed     5,601

EHCC CUs Decompressed                  17,903
EHCC Query High CUs Decompressed       12,302 

This reveals an annoying feature of 11g (continued in 12.1) that results in double counting of the statistics, confusing the issue when you’re trying to analyze what’s going on. In this case the table consisted of 12,302 compression units, and the query was engineered to cause the performance problem to appear. The first two statistics show us how many CUs were decompressed successfully (we’ll see a change appearing there in 12.1). We then see that all 12,302 of the table’s “query high” compression units were decompressed – but the “total” of all CUs decompressed was 17,903.

It’s not a coincidence that 12,302 + 5,601 = 17,903; there’s some double counting going on. I don’t know how many of the statistics are affected in this way, but Oracle has counted the CUs that passed decompression once as they were processed at the cell server and again as they arrived at the database server. In this example we can infer that 12,302 – 5,601 = 6,701 compression units failed decompression at the cell server and were sent to the database server in compressed form to be decompressed again.

Here are a couple of sets of figures from some similar tests run on 12.1.0.2 – one with a table compressed to query high, the other compressed to query low. There is one critical difference from the 11g figures, but the same double-counting seems to have happened. In both cases the “EHCC Query [Low|High] CUs Decompressed” figures show the correct number of CUs in each table. Note, though, that the “cell CUs processed for uncompressed” statistic in 12.1 appears to report the number of attempted decompressions rather than 11g’s number of successful decompressions.


=========================

cell CUs sent uncompressed                     19,561	-- successful decompressions at cell server
cell CUs processed for uncompressed            19,564	=> 3 failures

EHCC CUs Decompressed                          39,125	=  2 * 19,561 successes + 3 db server decompression
EHCC Query High CUs Decompressed               19,564

=========================

cell CUs sent uncompressed                     80,037	-- successful decompressions at cell server
cell CUs processed for uncompressed            82,178	=> 2,141 failures

EHCC CUs Decompressed                         162,215	=  2 * 80,037 successes + 2,141 db server decompressions
EHCC Query Low CUs Decompressed                82,178

=========================

I’ve annotated the figures to explain the arithmetic.

There has been some significant renaming and separation of statistics in 12.2, as described in this post by Roger MacNicol, and the problems of double-counting should have disappeared. I haven’t yet tested my old models in the latest versions of Oracle, though, so can’t show you any figures to demonstrate the change.

Takeaways

There are 4 key points to note in this posting.

  • Hash (sub)partitioning should be based on powers of 2, otherwise some partitions will be twice the size of others.
  • There is a 1MB limit on the “data packet” sent between the cell server and database server in Exadata.
  • If you select a large fraction of the rows and columns from an HCC compressed table you may end up decompressing a lot of your data twice if the decompressed data for a read request is larger than the 1MB unit (and the cost will be highly visible at the database server as CPU usage).
  • The Cell Offload figure for a tablescan (in particular) will go negative if the volume of data sent from the cell server to the database server is larger than the volume of data read from disk – even if double decompression hasn’t been happening.

A little corollary to the third point: if you are writing to a staging table with the expectation of doing an unfiltered tablescan (or a select *), then you probably don’t want to use hybrid columnar compression on the table as you will probably end up using a lot of CPU at the database server to compress it, then do double-decompression using even more CPU on the database server.  It’s only if you really need to minimise disk usage and have lots of CPU capacity to spare that you have a case for using hybrid columnar compression for the table (and Oracle In-Memory features may also change the degree of desirability).
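
If you want to review which of your tables are declared with HCC before deciding, the data dictionary is an obvious (if approximate) starting point – a sketch only; remember that for partitioned tables the setting is reported at the (sub)partition level rather than in dba_tables:

select  owner, table_name, compression, compress_for
from    dba_tables
where   compress_for in ('QUERY LOW', 'QUERY HIGH', 'ARCHIVE LOW', 'ARCHIVE HIGH')
order by
        owner, table_name
;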

Footnote

I haven’t said anything about accessing table data by index when the table is subject to HCC compression. I haven’t tested the mechanism in recent versions of Oracle but it used to be the case that the cell server would supply the whole compression unit (CU) to the database server which would decompress it to construct the relevant row. One side effect of this was that the same CU could be decompressed (with a high CPU load) many times in the course of a single query.

 

So pgio Does Not Accurately Report Physical I/O In Test Results? Buffering Buffers, and Baffles.

A new user to pgio (The SLOB Method for PostgreSQL) reached out to me with the following comment:

 I’ve been testing with pgio but when I compare I/O monitored in iostat output it does not match the pgio output for physical reads. 

The user is correct–but that’s not the fault of pgio. Please allow me to explain.

Buffering Buffers, and Baffles

PostgreSQL does not open files with the O_DIRECT flag, which means I/O performed by PostgreSQL is buffered I/O. The buffering uses physical memory in the Linux page cache. For this reason, the pgio runit.sh script produces output that accounts for read IOPS (RIOPS) as opposed to RPIOPS (Read Physical IOPS). The following is an example of what the user reported and how to change the behavior.
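
Before getting to that example: if you want to confirm the buffered-I/O behaviour on your own system, one quick and dirty check (this is only a sketch, not part of pgio, and <backend_pid> is a placeholder for the pid of a busy postgres backend) is to watch the flags on the open calls that backend makes:

# Attach strace to a PostgreSQL backend and watch how it opens data files.
# The flags show O_RDWR and friends but no O_DIRECT, which is why every read
# can be satisfied from the Linux page cache.
# Run a query that touches relations the backend hasn't opened yet to trigger
# fresh open calls.
strace -f -e trace=open,openat -p <backend_pid> 2>&1 | head -20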

The output in Figure 1 shows how I set up a 48-schema (32GB each) test with a single pgio thread accessing each schema. The output of runit.sh shows that the PostgreSQL primary cache (shared memory) is 4GB and that at the end of the 120-second test the internal PostgreSQL counters tallied up 436,964 reads per second (RIOPS).

#000000;" src="https://kevinclosson.files.wordpress.com/2019/09/pic1.png?w=500&h=275" alt="" width="500" height="275" srcset="https://kevinclosson.files.wordpress.com/2019/09/pic1.png?w=500&h=275 500w, https://kevinclosson.files.wordpress.com/2019/09/pic1.png?w=150&h=82 150w, https://kevinclosson.files.wordpress.com/2019/09/pic1.png?w=300&h=165 300w, https://kevinclosson.files.wordpress.com/2019/09/pic1.png?w=768&h=422 768w, https://kevinclosson.files.wordpress.com/2019/09/pic1.png 996w" sizes="(max-width: 500px) 100vw, 500px" />

Figure 1

The test configuration for this has NVMe drives assembled with mdadm(8) into a single md device called /dev/md0. The pgio runit.sh driver script saves iostat(1) output. Figure 2 shows the iostat report for physical read requests issued to the device. A simple glance reveals the values cannot average up to 436,964 as reported by runit.sh. That’s because runit.sh isn’t reporting physical reads from storage.

#000000;" src="https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=500&h=313" alt="" width="500" height="313" srcset="https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=500&h=313 500w, https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=1000&h=626 1000w, https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=150&h=94 150w, https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=300&h=188 300w, https://kevinclosson.files.wordpress.com/2019/09/pic2.png?w=768&h=481 768w" sizes="(max-width: 500px) 100vw, 500px" />

Figure 2

As shown in Figure 3, pgio saves the output of /proc/diskstats. The most accurate way to calculate physical accesses to storage by pgio is to calculate the delta between the before and after data in the pgio_diskstats.out file. The diskstats data shows that the true physical read IOPS rate was 301,599, not the 436,964 figure reported by runit.sh. Again, this is because runit.sh reports reads, not physical reads, and that is because PostgreSQL doesn’t really know whether a read operation is satisfied by a page cache buffer hit or by a physical access to the storage device. Also shown in Figure 3, the iostat.out and diskstats data agree to within 0.1%.

#000000;" src="https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=500&h=158" alt="" width="500" height="158" srcset="https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=500&h=158 500w, https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=997&h=316 997w, https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=150&h=48 150w, https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=300&h=95 300w, https://kevinclosson.files.wordpress.com/2019/09/pic3-1.png?w=768&h=243 768w" sizes="(max-width: 500px) 100vw, 500px" />

Figure 3
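
For anyone who wants to reproduce that arithmetic by hand, here’s a minimal sketch. It assumes you have kept two verbatim copies of /proc/diskstats taken before and after the run (the exact layout of pgio_diskstats.out may differ, so treat the file names and the 120-second duration as placeholders):

#!/bin/bash
# Sketch: physical read IOPS for one md device from two /proc/diskstats snapshots.
# In /proc/diskstats, field 3 is the device name and field 4 is "reads completed
# successfully" (see Documentation/admin-guide/iostats.rst in the kernel source).
DEV=md0
DURATION=120      # seconds - match the runit.sh run time

before=$(awk -v d="$DEV" '$3 == d { print $4 }' diskstats_before.out)
after=$(awk  -v d="$DEV" '$3 == d { print $4 }' diskstats_after.out)

echo "physical read IOPS for $DEV: $(( (after - before) / DURATION ))"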

So, let’s see how we can change this behavior.

Figure 4 shows that the 436,964 read results were achieved on a system that can buffer 526GB of the pgio active data set.

#000000;" src="https://kevinclosson.files.wordpress.com/2019/09/pic4.png?w=500&h=96" alt="" width="500" height="96" srcset="https://kevinclosson.files.wordpress.com/2019/09/pic4.png?w=500&h=96 500w, https://kevinclosson.files.wordpress.com/2019/09/pic4.png?w=150&h=29 150w, https://kevinclosson.files.wordpress.com/2019/09/pic4.png?w=300&h=58 300w, https://kevinclosson.files.wordpress.com/2019/09/pic4.png?w=768&h=148 768w, https://kevinclosson.files.wordpress.com/2019/09/pic4.png 863w" sizes="(max-width: 500px) 100vw, 500px" />

Figure 4

If You Don’t Want To Test Buffered I/O, Don’t

Figure 5 shows an example of using the pgio_reduce_free_memory.sh script to limit the amount of memory available to the page cache. For no particular reason, I chose to limit the page cache to less than 16GB and then executed the pgio test again. As Figure 5 shows, neutering page cache buffering brought the RIOPS figure to within 1% of both the iostat and diskstats data.

#000000;" src="https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=500&h=372" alt="" width="500" height="372" srcset="https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=500&h=372 500w, https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=1000&h=744 1000w, https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=150&h=112 150w, https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=300&h=223 300w, https://kevinclosson.files.wordpress.com/2019/09/pic5.png?w=768&h=571 768w" sizes="(max-width: 500px) 100vw, 500px" />

Figure 5
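
As an aside: if for some reason you can’t use the helper script, one generic way to keep a chunk of memory away from the page cache is to reserve it as huge pages, since reserved huge pages cannot be used for page cache buffering. The sketch below shows the general idea only; the page count is an arbitrary example and says nothing about how pgio_reduce_free_memory.sh works internally.

# Reserve 2MiB huge pages so that memory can no longer be used by the page cache.
# 240000 x 2MiB is roughly 480GB - pick a number that leaves only as much free
# memory as you want the page cache to have. Requires root.
sudo sysctl -w vm.nr_hugepages=240000
# The kernel may reserve fewer pages than requested if memory is fragmented,
# so always check the result.
grep HugePages_Total /proc/meminfo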

Summary

This post shows that PostgreSQL performs buffered I/O and that the pgio runit.sh driver script reports read IOPS (RIOPS) as per PostgreSQL’s internal statistics. Since pgio includes a helper script to eliminate page cache buffering from your testing, you too can test physical I/O with pgio and get accurate accounting of that physical I/O by analyzing the /proc/diskstats and iostat(1) data saved by the runit.sh script.

 

 

How to install Oracle XE 18c in Oracle Cloud Free Tier

You have probably heard by now about Oracle Cloud Free Tier.
While it sounds good, the only databases that you can use are in fact autonomous databases (which are not that interesting for a tech geeky guy like me).

Of course, you also get "2 virtual machines with 1/8 OCPU and 1 GB memory each." I thought: that’s cool, because I could install an Oracle XE 18c database in there and play a bit… But no. The machines actually have 971MB of RAM, which is too low for Oracle XE 18c, and you will get an error while trying to install the RPM file.

"[SEVERE] Oracle Database 18c Express Edition requires a minimum of 1GB of physical
memory (RAM).  This system has 971 MB of RAM and does not meet minimum
requirements."

This is due to a check in one of the scripts embedded in the RPM file:

# Check and disallow install, if RAM is less than 1 GB
space=`cat /proc/meminfo | grep '^MemTotal' | awk '{print $2}'`
PhyMem=`expr $space / 1024`
swapspace=`free -m | grep Swap | awk '{print $4}'`

if [ $PhyMem -lt 1024 ]
then
        echo "[SEVERE] Oracle Database 18c Express Edition requires a minimum of 1GB of physical
memory (RAM).  This system has $PhyMem MB of RAM and does not meet minimum
requirements."
        echo
        exit 1
fi

You can edit this script with a tool called rpmrebuild, which can be installed from the yum repository:

rpmrebuild -e -p /home/opc/oracle-database-xe-18c-1.0-1.x86_64.rpm

I changed the if statement to look like this:

if [ $PhyMem -lt 102 ]
then
        echo "[SEVERE] Oracle Database 18c Express Edition requires a minimum of 1GB of physical
memory (RAM).  This system has $PhyMem MB of RAM and does not meet minimum
requirements."
        echo
        exit 1
fi

After this modification rpmrebuild will create a new RPM file (you should do it in screen because it will take quite a long time).
This change will let you install the RPM. Of course, the configure script will still fail with an error, but you can either create a database manually or use a backup of your XE database created previously on a normal virtual machine (this is what I did).
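
Before running the install again it is worth checking that the rebuilt package really does contain the modified check. A quick way (just a sketch; the path is an example, rpmrebuild prints the actual location of the rebuilt package when it finishes) is to dump the scriptlets of the new RPM:

# List the embedded scriptlets of the rebuilt package and make sure the
# memory check now compares against 102 instead of 1024.
rpm -qp --scripts ~/rpmbuild/RPMS/x86_64/oracle-database-xe-18c-1.0-1.x86_64.rpm | grep -A3 'PhyMem -lt'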

And that’s how you can have an XE database in your Oracle Cloud Free Tier… It will be as slow as Oracle Support… but at least you will have a normal database to play with.

Ansible tips’n’tricks: executing a loop conditionally

When writing playbooks, I occasionally add optional tasks. These tasks are only executed if a corresponding configuration variable is defined. I usually define configuration variables either in group_vars/* or in the role’s roleName/defaults/ directory.

The “when” keyword can be used to test for the presence of a variable and execute a task only if the condition evaluates to “true”. However, this isn’t always as straightforward as it sounds, and recently I stumbled across some interesting behaviour that I found worth mentioning. I would like to point out that I’m merely an Ansible enthusiast, and by no means a pro. In case there is a better way to do this, please let me know and I’ll update the post :)

Before showing you my code, I’d like to add a little bit of detail here in case someone finds this post via a search engine:

  • Ansible version: ansible 2.8.2
  • Operating system: Fedora 29 on Linux x86-64

The code

This is the initial code I started with:

$ tree
.
├── inventory.ini
├── roles
│   └── example
│       ├── defaults
│       │   └── main.yml
│       └── tasks
│           └── main.yml
└── variables.yml

4 directories, 4 files

$ nl variables.yml 
      1  ---
      2  - hosts: blogpost
      3    become: yes
      4    roles:
      5    - example

$ nl roles/example/defaults/main.yml 
     1  #
     2  # some variables
     3  #

     4  oracle_disks: ''

$ nl roles/example/tasks/main.yml
     1  ---
     2  - name: print lenght of oracle_disks variable
     3    debug: 
     4      msg: "The variable has a length of {{ oracle_disks | length }}"

     5  - name: format disk devices
     6    parted:
     7      device: "{{ item }}"
     8      number: 1
     9      state: present
    10      align: optimal
    11      label: gpt
    12    loop: "{{ oracle_disks }}"
    13    when: oracle_disks | length > 0

This will not work, as you can see in a minute.

The error

And indeed, the execution of my playbook (variables.yml) failed:

$ ansible-playbook -vi inventory.ini variables.yml 
Using /etc/ansible/ansible.cfg as config file

PLAY [blogpost] ******************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************
ok: [server6]

TASK [example : print lenght of oracle_disks variable] ***************************************************************
ok: [server6] => {}

MSG:

The variable has a length of 0


TASK [example : format disk devices] *********************************************************************************
fatal: [server6]: FAILED! => {}

MSG:

Invalid data passed to 'loop', it requires a list, got this instead: . 
Hint: If you passed a list/dict of just one element, try adding wantlist=True 
to your lookup invocation or use q/query instead of lookup.


PLAY RECAP ***********************************************************************************************************
server6                    : ok=2    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

The intention was not to execute the task named “format disk devices” if oracle_disks has a length of 0. The condition seems to be evaluated too late though, and it turned out to be the wrong check anyway. I tried various permutations of the scheme, but none were successful while oracle_disks was set to the empty string. Which is wrong, but please bear with me …
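
(As an aside, before moving on: if you do have to live with a variable that may arrive as an empty string, the Jinja2 default filter with its second argument set to true treats any “falsy” value, including '', as missing. The following is only a sketch of that variation, not the approach I ended up using below:)

- name: format disk devices
  parted:
    device: "{{ item }}"
    number: 1
    state: present
    align: optimal
    label: gpt
  # default([], true) substitutes the empty list for undefined *and* empty/falsy
  # values, so the loop simply has nothing to iterate over.
  loop: "{{ oracle_disks | default([], true) }}"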

No errors with meaningful values

The loop syntax in the role’s tasks/main.yml file is correct though: once I set the variable to a list, it worked:

$ nl roles/example/defaults/main.yml
      1  #
      2  # some variables
      3  #
        
      4  oracle_disks: 
      5  - /dev/vdc
      6  - /dev/vdd

$ ansible-playbook -vi inventory.ini variables.yml
Using /etc/ansible/ansible.cfg as config file

PLAY [blogpost] ******************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************
ok: [server6]

TASK [example : print lenght of oracle_disks variable] ***************************************************************
ok: [server6] => {}

MSG:

The variable has a length of 2

TASK [example : format disk devices] *********************************************************************************
changed: [server6] => (item=/dev/vdc) => {
    "ansible_loop_var": "item",
    "changed": true,
    "disk": {
        "dev": "/dev/vdc",
        "logical_block": 512,
        "model": "Virtio Block Device",
        "physical_block": 512,
        "size": 10485760.0,
        "table": "gpt",
        "unit": "kib"
    },
    "item": "/dev/vdc",
    "partitions": [
        {
            "begin": 1024.0,
            "end": 10484736.0,
            "flags": [],
            "fstype": "",
            "name": "primary",
            "num": 1,
            "size": 10483712.0,
            "unit": "kib"
        }
    ],
    "script": "unit KiB mklabel gpt mkpart primary 0% 100%"
}
changed: [server6] => (item=/dev/vdd) => {
    "ansible_loop_var": "item",
    "changed": true,
    "disk": {
        "dev": "/dev/vdd",
        "logical_block": 512,
        "model": "Virtio Block Device",
        "physical_block": 512,
        "size": 10485760.0,
        "table": "gpt",
        "unit": "kib"
    },
    "item": "/dev/vdd",
    "partitions": [
        {
            "begin": 1024.0,
            "end": 10484736.0,
            "flags": [],
            "fstype": "",
            "name": "primary",
            "num": 1,
            "size": 10483712.0,
            "unit": "kib"
        }
    ],
    "script": "unit KiB mklabel gpt mkpart primary 0% 100%"
}

PLAY RECAP ***********************************************************************************************************
server6                    : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

So what gives? It once more goes to show that as soon as you do things right, they start working.

Checking if a variable is defined

How can I prevent the task from being executed? There are probably a great many ways of achieving this goal; I learned that not defining oracle_disks at all works for me. Here I’m commenting out all references to the variable before trying again:

$ cat roles/example/defaults/main.yml 
#
# some variables
#

#oracle_disks: 
#- /dev/vdc
#- /dev/vdd

$ cat roles/example/tasks/main.yml 
---
- name: print lenght of oracle_disks variable
  debug: 
    msg: "The variable has a length of {{ oracle_disks | length }}"
  when: oracle_disks is defined

- name: format disk devices
  parted:
    device: "{{ item }}"
    number: 1
    state: present
    align: optimal
    label: gpt
  loop: "{{ oracle_disks }}" 
  when: oracle_disks is defined

$ ansible-playbook -vi inventory.ini variables.yml 
Using /etc/ansible/ansible.cfg as config file

PLAY [blogpost] ******************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************
ok: [server6]

TASK [example : print lenght of oracle_disks variable] ***************************************************************
skipping: [server6] => {}

TASK [example : format disk devices] *********************************************************************************
skipping: [server6] => {
    "changed": false,
    "skip_reason": "Conditional result was False"
}

PLAY RECAP ***********************************************************************************************************
server6                    : ok=1    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0 

With the variable not defined, the task is skipped as intended.

As you read earlier, using the empty string ('') isn’t the right way to mark a variable as “empty”. I guess this is where my other programming languages influenced me a bit (cough * perl * cough). The proper way to indicate there are no items in the list (as per the documentation) is this:

$ nl roles/example/defaults/main.yml 
     1  #
     2  # some variables
     3  #

     4  oracle_disks: []

$ nl roles/example/tasks/main.yml 
     1  ---
     2  - name: print lenght of oracle_disks variable
     3    debug: 
     4      msg: "The variable has a length of {{ oracle_disks | length }}"
     5    when: oracle_disks is defined

     6  - name: format disk devices
     7    parted:
     8      device: "{{ item }}"
     9      number: 1
    10      state: present
    11      align: optimal
    12      label: gpt
    13    loop: "{{ oracle_disks | default([]) }}" 

The default() assignment in tasks/main.yml line 13 shouldn’t be necessary with the assignment completed in defaults/main.yml line 4. It doesn’t seem to hurt either. Instead of the conditional check message you will see the task executed, but since there is nothing to loop over, it finishes straight away:

$ ansible-playbook -vi inventory.ini variables.yml 
Using /etc/ansible/ansible.cfg as config file

PLAY [blogpost] ***********************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************
ok: [server6]

TASK [example : print lenght of oracle_disks variable] ********************************************************************************************************************************
ok: [server6] => {}

MSG:

The variable has a length of 0


TASK [example : format disk devices] **************************************************************************************************************************************************

PLAY RECAP ****************************************************************************************************************************************************************************
server6                    : ok=2    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0

Happy coding!

#24HOP and PASS Summit 2019

You can tell how busy I am by the amount of time I DON’T blog, and yes, it’s been over a month. I think that’s the longest I haven’t blogged in YEARS.

24 HOP and Facebook Live

I just finished my 24HOP (24 Hours of PASS) session back on September 9th and the PASS Facebook Live session this last Monday, and I’m thrilled with these virtual events each year I get the chance to participate. If you’re not familiar with 24HOP, it’s a preview of some of the sessions at Summit, the annual tech conference from PASS. Twenty-four presenters are chosen to offer a preview of their talk. As mine is on Linux scripting, I decided to do a tips-and-tricks session that had three goals:

  1. Offer a session that would harmonize well with the one that I will give at Summit, but still be different.
  2. Have a session that those who weren’t learning BASH could still benefit from with high level best practices for scripting in any language.
  3. Provide some great tips and tricks that really make the session stand out, along with technical tidbits that you can’t easily find elsewhere.

I’ve already uploaded my slides to SlideShare to ensure they’re available to everyone, and I thank everyone for the great interaction, questions and kudos on the session. As always, I’m thrilled to be part of the Microsoft community and look forward to the Summit conference.

Facebook Live is different: experts are chosen for a half-hour live session where attendees can ask them any question, hopefully in their area of expertise. For me, this is an additional challenge as I’m still straddling multiple areas of technology, including Oracle migrations to Azure, Power BI, databases, Linux and DevOps. The session was a lot of fun and the recording will be available going forward on the PASS Facebook page.

PASS Summit

At this year’s PASS Summit I’ll have two sessions, both of which are part of learning pathways. The first is the Linux Learning Pathway, presented by Hamish Watson, Randolph West and me. Hamish will cover Linux fundamentals in the first session, I’ll cover scripting in the second, and Randolph will cover advanced Linux in the third. Together these presentations should provide a solid focus on Linux itself, as opposed to the many sessions that just focus on how to manage SQL Server on Linux.

The other learning pathway I’ll be part of is professional leadership. I’ll be part of the first step, focused on becoming a technical leader, with Denise McInerney. I’m looking forward to both of these sessions, and along with blogging duties, panel sessions and the social gatherings, I should be in a coma for approximately a week following the event.

Oh, did I mention I was speaking at SQL Saturday Oregon the weekend before and taking the SQL Train down?  Oh, yeah, I’m going to be dead afterwards, forget the coma…:)

 


