
Direct IOT

A recent (automatic ?) tweet from Connor McDonald highlighted an article he’d written a couple of years ago about an enhancement introduced in 12c that allowed for direct path inserts to index organized tables (IOTs). The article included a demonstration that seemed to suggest that direct path loads to IOTs were of no benefit, and ended with the comment (which could be applied to any Oracle feature): “Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.”

Clearly it’s necessary to pose the question – “so when would direct mode insert be a good option for IOTs?” – because if it’s never a good option you have to wonder why it has been implemented. This naturally leads on to thinking about which tests have not yet been done – what aspects of IOTs did Connor not get round to examining in his article. (That’s a standard principle of trouble-shooting, or testing, or investigation: when someone shows you a test case (or when you think you’ve finished testing) one thing you should do before taking the results as gospel is to ask yourself what possible scenarios have not been covered by the test.)

So if you say IOT, what are the obvious tests once you’ve got past the initial step of loading the IOT and seeing what happens? First, I think, would be “What if the IOT weren’t empty before the test started?”; second would be “IOTs can have overflow segments, what impact might one have?”; third would be “Do secondary indexes have any effects?”; finally “What happens with bitmap indexes and the requirement for a mapping table?” (Then, of course, you can worry about mixing all the different possibilities together – but for the purposes of this note I’m just going to play with two simple examples: non-empty starting tables, and overflow segments.)

Here’s some code to define a suitable table:


create table t2 
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
	3 * rownum			id,
	lpad(rownum,10,'0')		v1,
	lpad('x',50,'x')		padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
order by
	dbms_random.value
;

begin
	dbms_stats.gather_table_stats(
		ownname     => null,
		tabname     => 'T2',
		method_opt  => 'for all columns size 1'
	);
end;
/

create table t1(
	id,
	v1,
	padding,
	constraint t1_pk primary key(id)
)
organization index
-- including v1
-- overflow
nologging
as
select * from t2
;

begin
	dbms_stats.gather_table_stats(
		ownname     => null,
		tabname     => 'T1',
		method_opt  => 'for all columns size 1'
	);
end;
/

I’ve created a heap table t2 that holds 100,000 rows with an id column that is randomly ordered; then I’ve used this table as a source to create an IOT, with the option to have an overflow segment that contains just the 50 character padding column. I’ve used 3 * rownum to define the id column for t2 so that when I insert another copy of t2 into t1 I can add 1 (or 2) to the id and interleave the new data with the old data. (That’s another thought about IOT testing – are you loading your data in a pre-existing order that suits the IOTs or is it arriving in a way that’s badly out of order with respect to the IOT ordering; and does your data go in above the current high value, or spread across the whole range, or have a partial overlap with the top end of the range and then run on above it?)
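For the overflow runs the two commented lines in the script above are simply reinstated, so the IOT definition becomes:

create table t1(
	id,
	v1,
	padding,
	constraint t1_pk primary key(id)
)
organization index
including v1
overflow
nologging
as
select * from t2
;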

Having created the starting data set, here’s the test:


execute snap_my_stats.start_snap
execute snap_events.start_snap

insert 
	/*  append */
into t1
select
	id + 1, v1, padding
from
	t2
;


execute snap_events.end_snap
execute snap_my_stats.end_snap

All I’m doing is using a couple of my snapshot packages to check the work done and time spent while inserting 100,000 interleaved rows – which are supplied out of order – into the existing table. As shown, the “append” is a comment, not a hint, so I’ll be running the test case a total of 4 times: no overflow, with and without the hint – then with the overflow, with and without the hint. (Then, of course, I could run the test without the overflow but with an index on v1.)
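If you don’t have the snapshot packages, a rough substitute (just a sketch using the standard v$ views, not a replacement for proper before/after snapshots) is to run something like the following on either side of the insert and take the differences:

select
	sn.name, ms.value
from
	v$mystat	ms,
	v$statname	sn
where
	sn.statistic# = ms.statistic#
and	sn.name in (
		'CPU used by this session',
		'redo entries',
		'redo size',
		'sorts (rows)'
	)
;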

Here are some summary figures from the tests – first from the test without an overflow segment:

                                      Unhinted       With Append
                                  ============      ============
CPU used when call started                 153               102
CPU used by this session                   153               102
DB time                                    166               139

redo entries                           130,603            42,209
redo size                           78,315,064        65,055,376

sorts (rows)                                30           100,031

You’ll notice that with the /*+ append */ hint in place there’s a noticeable reduction in redo entries and CPU time, but this has been achieved at a cost of sorting the incoming data into order. The reduction in redo (entries and size) is due to an “array insert” effect that Oracle can take advantage of with the delayed index maintenance that takes place when the append hint is legal (See the section labelled Option 4 in this note). So even with an IOT with no overflow there’s a potential benefit to gain from direct path loading that depends on how much the new data overlaps the old data, and there’s a penalty that depends on the amount of sorting you’d have to do.
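A scripting detail worth remembering when repeating the hinted tests (standard direct path behaviour, not specific to IOTs): after a successful /*+ append */ insert the session can’t query the table again until it commits, which is also a quick way of confirming that the hint was obeyed.

insert /*+ append */ into t1
select id + 1, v1, padding from t2;

select count(*) from t1;	-- raises ORA-12838 until you commit
commit;

select count(*) from t1;	-- fine after the commit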

What happens in my case when I move the big padding column out to an overflow segment – here are the equivalent results:


Headline figures                      Unhinted       With Append
================                  ============      ============
CPU used when call started                 158                52
CPU used by this session                   158                52
DB time                                    163                94
redo entries                           116,669            16,690
redo size                           51,392,748        26,741,868
sorts (memory)                               4                 5
sorts (rows)                                33           100,032

Interestingly, comparing the unhinted results with the previous unhinted results, there’s little difference in the CPU usage between having the padding column in the “TOP” section of the IOT and having it in the overflow segment, though there is a significant reduction in redo (the index entries are still going all over the place one by one, but the overflow blocks are being pinned and packed much more efficiently). The difference between having the append hint or not, though, is dramatic: one third of the CPU time (despite still having 100,000 rows to sort), and half the redo. One of the side effects of the overflow, of course, is that the things being sorted are much shorter (only the id and v1 columns that go into the TOP section, and not the whole IOT row).

So, if you already have an overflow segment that caters for a significant percentage of the row, it looks as if the benefit you could get from using the /*+ append */ hint would far outweigh the penalty of the sorting you have to pay for. Of course, an IOT with a large overflow doesn’t look much different from a heap table with an index – so perhaps that result isn’t very surprising.

I’ll close by re-iterating Connor’s closing comment:

Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.

Before you dive in and embrace it, or ruthlessly push it to one side, make sure you do some testing that reflects the situations you have to handle.

Drilling down the pgSentinel Active Session History

In pgSentinel: the sampling approach for PostgreSQL I mentioned that one of the advantages of the ASH approach is the ability to drill down from an overview of the database activity, down to the details where we can do some tuning. The idea is to always focus on the components which are relevant to our tuning goal:

  • Filter/Group by the relevant dimension to focus on what you want to tune (a program, session, query, time window,…)
  • Sort by the most active samples, to spend time only where you know you can improve significantly

The idea is to start at a high level. Here is a GROUP BY BACKEND_TYPE to show the activity of the ‘client backend’ and the ‘autovacuum worker’:

select count(*), backend_type
from pg_active_session_history
where ash_time>=current_timestamp - interval '5 minutes'
group by backend_type
order by 1 desc
;
count | backend_type
-------+-------------------
1183 | client backend
89 | autovacuum worker

I selected only the last 5 minutes (the total retention is defined by pgsentinel_ash.max_entries and the sampling frequency by pgsentinel_ash.pull_frequency).

I ordered by the number of samples for each one, which gives a good idea of the proportions: most of the activity here is from the ‘client backend’. It may be more interesting to show a percentage, such as 93% of the activity coming from the client and 7% from the vacuum. However, that would lose an interesting measure of the overall activity. The fact that we have 1183 samples within 5 minutes is an indication of the total load. In 5 minutes we have 300 seconds, which means that each session can contribute up to 300 samples when it is 100% active in the database during that time. 1183 samples during 5 minutes means that we have on average 1183/300 ≈ 4 sessions active. This measure, calculated as the number of samples divided by the number of seconds and known as Average Active Sessions (AAS), gives two different pieces of information (a quick sketch of the calculation follows the list):

  • The overall activity in the database, similar to the load average at OS level
  • The relative activity of an aggregate (per session, program, event, time…)
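As a back-of-envelope version of that arithmetic (a sketch that assumes one sample per second and a full 300-second window; the window-function query in the next section handles gaps properly):

select round(count(*)::numeric / 300, 2) as "AAS"
from pg_active_session_history
where ash_time >= current_timestamp - interval '5 minutes';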

AAS (Average Active Sessions)

In the previous post I counted the number of samples with count(distinct ash_time) because I knew that I had several sessions active during the whole time. But if there are periods of inactivity during those 5 minutes, there are no samples at all for those periods. And when drilling down to more detail, there will be some samples with no activity for a specific group. Here I calculate the number of seconds covered by the samples, using a window function:

with ash as (
select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",
backend_type
from ash
group by samples,
backend_type
order by 1 desc fetch first 20 rows only
;
AAS | backend_type
-------+-------------------
3.95 | client backend
0.29 | autovacuum worker
(2 rows)

From this output, I know that I have about 4 client sessions running. This is what I want to tune.

Drill down on wait events

Adding the WAIT_EVENT_TYPE to the GROUP BY, I can have more detail about the resources used by those sessions:

with ash as (
select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",
backend_type,wait_event_type
from ash
group by samples,
backend_type,wait_event_type
order by 1 desc fetch first 20 rows only
;
AAS | backend_type | wait_event_type
-------+-------------------+-----------------
2.57 | client backend | IO
0.94 | client backend | CPU
0.45 | client backend | LWLock
0.16 | autovacuum worker | CPU
0.12 | autovacuum worker | IO
0.00 | autovacuum worker | LWLock
(6 rows)

This gives a better idea about which system component may be tuned to reduce the response time or the throughput. IO is the major component here with 2.57 AAS being on an I/O call. Let’s get more information about which kind of I/O.

Drilling down to the wait event:

with ash as (
select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",
backend_type,wait_event_type,wait_event
from ash
group by samples,
backend_type,wait_event_type,wait_event
order by 1 desc fetch first 20 rows only
;
AAS | backend_type | wait_event_type | wait_event
-------+-------------------+-----------------+------------------
1.52 | client backend | IO | DataFileWrite
0.94 | client backend | CPU | CPU
0.46 | client backend | IO | DataFileExtend
0.41 | client backend | IO | DataFileRead
0.33 | client backend | LWLock | WALWriteLock
0.15 | autovacuum worker | CPU | CPU
0.12 | client backend | LWLock | buffer_mapping
0.10 | autovacuum worker | IO | DataFileRead
0.08 | client backend | IO | WALInitWrite
0.08 | client backend | IO | BufFileWrite
0.02 | client backend | IO | WALWrite
0.01 | autovacuum worker | IO | DataFileWrite
0.01 | client backend | IO | DataFilePrefetch
0.00 | client backend | LWLock | buffer_content
0.00 | autovacuum worker | LWLock | buffer_mapping
(15 rows)

This gives more information. Of the average 2.57 sessions active on IO, 1.52 are actually writing, 0.46 are waiting for a datafile to be extended, and 0.41 are reading. That helps to focus on the areas where we might improve the performance, without wasting time on the events which are only a small part of the session activity.

Drill-down on queries

This was a drill-down on the system axis (wait events are system call instrumentation). This is useful when we think something is wrong with the system or the storage. But performance tuning must also drive the investigation along the application axis. The highest level is the user call, the TOP_LEVEL_QUERY:

with ash as (
select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",
backend_type,top_level_query
from ash
group by samples,
backend_type,top_level_query
order by 1 desc fetch first 20 rows only
;
AAS | backend_type | top_level_query
-------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0.95 | client backend | SELECT * FROM mypgio('pgio3', 50, 3000, 131072, 255, 8);
0.95 | client backend | SELECT * FROM mypgio('pgio2', 50, 3000, 131072, 255, 8);
0.95 | client backend | SELECT * FROM mypgio('pgio4', 50, 3000, 131072, 255, 8);
0.95 | client backend | SELECT * FROM mypgio('pgio1', 50, 3000, 131072, 255, 8);
0.25 | autovacuum worker | autovacuum: VACUUM ANALYZE public.pgio2
0.02 | client backend | commit;
0.01 | client backend | select * from pg_active_session_history where pid=21837 order by ash_time desc fetch first 1 rows only;
0.01 | client backend | with ash as ( +
| | select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples +
| | from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes' +
...

Here I see 4 user calls responsible for most of the 4 active sessions related to the ‘client backend’, each one with AAS=0.95, and this is actually what is running: the PGIO benchmark (see https://kevinclosson.net/) with 4 sessions calling the mypgio function.

The function we see in TOP_LEVEL_QUERY is itself running some queries, and the big advantage of the pgSentinel extension, over pg_stat_activity, is the capture of the actual statement running, with the actual values of the parameters:

with ash as (
select *,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",
backend_type,substr(query,1,100)
from ash
group by samples,
backend_type,substr(query,1,100)
order by 1 desc fetch first 20 rows only
;
AAS | backend_type | substr
-------+-------------------+----------------------------------------------------------------------------------------
0.26 | autovacuum worker |
0.02 | client backend | commit
0.02 | client backend | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 3567 AND 3822
0.01 | client backend | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN 5729 AND 5984
0.01 | client backend | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN 5245 AND 5500
0.01 | client backend | truncate table l_ash.ps
0.01 | client backend | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 3249 AND 3504
0.01 | client backend | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 57 AND 312
0.01 | client backend | UPDATE pgio4 SET scratch = scratch + 1 WHERE mykey BETWEEN 3712 AND 3720
0.01 | client backend | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 1267 AND 1522
0.01 | client backend | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 703 AND 958
0.01 | client backend | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 2025 AND 2280
0.01 | client backend | insert into l_ash.ps_diff +
| | select ps1.pid,ps1.uname,ps1.pr,ps1.ni,ps1.virt,ps1.res,ps1.shr,ps1.s,ps1.
0.01 | client backend | UPDATE pgio4 SET scratch = scratch + 1 WHERE mykey BETWEEN 2690 AND 2698
0.01 | client backend | SELECT sum(scratch) FROM pgio3 WHERE mykey BETWEEN 5463 AND 5718
0.01 | client backend | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN 1467 AND 1722
0.01 | client backend | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 4653 AND 4908
(20 rows)

Here, no single query stands out at the top. We have only a few samples for each execution. This is because each execution is different (different values for the parameters) and they have balanced execution times. If one query with one specific set of parameter values ran for longer, it would show up at the top here.

Finally, we can also aggregate at a higher level than QUERY with QUERYID, which is per prepared statement and does not change when executing with different parameter values. If we want to get the text, we can join with PG_STAT_STATEMENTS:

with ash as (
select *,datid dbid,ceil(extract(epoch from max(ash_time)over()-min(ash_time)over()))::numeric samples
from pg_active_session_history where ash_time>=current_timestamp - interval '5 minutes'
) select round(count(*)::numeric/samples,2) as "AAS",dbid,
backend_type,queryid,pg_stat_statements.query
from ash left outer join pg_stat_statements using(dbid,queryid)
group by samples,dbid,
backend_type,queryid,pg_stat_statements.query
order by 1 desc fetch first 15 rows only
;
AAS | dbid | backend_type | queryid | query
-------+-------+----------------+------------+------------------------------------------------------------------------------------------------------
0.89 | 17487 | client backend | 837728477 | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 100926 AND 101181
0.70 | 17487 | client backend | 3411884874 | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN $1 AND $2
0.68 | 17487 | client backend | 1046864277 | SELECT sum(scratch) FROM pgio3 WHERE mykey BETWEEN 1591 AND 1846
0.67 | 17487 | client backend | 2994234299 | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN $1 AND $2
0.33 | 17487 | client backend | 1648177216 | UPDATE pgio1 SET scratch = scratch + 1 WHERE mykey BETWEEN 2582 AND 2590
0.32 | 17487 | client backend | 3381000939 | UPDATE pgio3 SET scratch = scratch + $1 WHERE mykey BETWEEN $2 AND $3
0.30 | 17487 | client backend | 1109524376 | UPDATE pgio4 SET scratch = scratch + 1 WHERE mykey BETWEEN 5462 AND 5470
0.11 | 17487 | client backend | 3355133240 | UPDATE pgio2 SET scratch = scratch + $1 WHERE mykey BETWEEN $2 AND $3
0.05 | 17547 | client backend | 2771355107 | update l_ash.parameters set value=now(),timestamp=now() where name=$1
0.05 | 17547 | client backend | 1235869898 | update l_ash.parameters set value=$1,timestamp=now() where name=$2
0.02 | 13806 | client backend | 935474258 | select * from pg_active_session_history where pid=$1 order by ash_time desc fetch first $2 rows only
0.01 | 13806 | client backend | 164740364 | with ash as ( +

This shows the main queries running: SELECT and UPDATE on PGIO1, PGIO2, PGIO3 and PGIO4. They run with different parameter values but have the same QUERYID. It seems that PG_STAT_STATEMENTS is not very consistent when capturing the query text: some entries show the parameter placeholders, others show actual values. But you must keep in mind that those are prepared statements. We do not have 0.89 average sessions running ‘SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 100926 AND 101181’. This is ‘SELECT sum(scratch) FROM pgio2’ running with different parameter values and, for whatever reason, PG_STAT_STATEMENTS displays one set of values rather than ‘BETWEEN $1 AND $2’.
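Once a queryid stands out, you can turn it around and use it as a filter to profile the wait events of that one prepared statement (a sketch; the literal queryid below is taken from the output above):

select count(*) as samples, wait_event_type, wait_event
from pg_active_session_history
where ash_time >= current_timestamp - interval '5 minutes'
  and queryid = 837728477
group by wait_event_type, wait_event
order by 1 desc;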

Time dimension

Of course we can also query all samples and drill down with a graphical tool. For the time axis, this is a better visualization. Here is a quick Excel PivotChart from those 5 minutes of samples:
[Image: Excel PivotChart of the pg_active_session_history samples]
I always have 4 sessions running, as we have seen in the average, but the wait event detail is not uniform across the timeline. This is where you will drill down on the time axis. This can be helpful to investigate a short duration issue, or to try to understand non-uniform response time. For example, coming from Oracle, I’m not used to this pattern where, from one second to the next, the wait profile is completely different. Probably because of all the background activity such as vacuum, WAL, syncing buffers to disk, garbage collection,… The workload here, PGIO, the SLOB method for PostgreSQL, consists of short uniform queries. It would be interesting to have some statistics about the response time variation.
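To build this kind of chart, a per-second aggregation along these lines (a sketch; the output pivots easily with wait_event as the series) can be exported to a graphing tool:

select date_trunc('second', ash_time) as second, wait_event, count(*) as sessions
from pg_active_session_history
where ash_time >= current_timestamp - interval '5 minutes'
group by 1, 2
order by 1, 2;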

Note that in this database cluster, in addition to the PGIO workload, I have a small application running and committing very small changes occasionally, and this is why you see the peaks with 1 session on WALWrite and 4 sessions waiting on WALWriteLock. This adds to the chaos of waits.

This extension providing active session sampling is only the first component of pgSentinel, so do not spend too much time building queries, reports and graphs on this: let’s see what else will come with pgSentinel.

The article Drilling down the pgSentinel Active Session History appeared first on Blog dbi services.

Quiz Night

Because it’s been a long time since the last quiz night. Here’s a question prompted by a recent thread on the ODevCom database forum – how many rows will Oracle sort (assuming you have enough rows to start with in all_objects) for the final query, and how many sort operations will that take ?


drop table t1 purge;

create table t1 nologging as select * from all_objects where rownum < 50000;

select owner, count(distinct object_type), count(distinct object_name) from t1 group by owner;

Try to resist the temptation of doing a cut-n-paste and running the code until after you’ve thought about the answer.

And the answer is:

It was nice to see a few ideas being volunteered in response to this question; I think that getting a diverse set of comments makes a nice point about how it’s always worth spending a little time to think along the lines of: “If I do X how might Oracle handle it”. Having the ideas before trying to check the effects can make it a lot easier to understand what’s happening, and sometimes how to take advantage of what Oracle does do to improve the way you design a query.

The first point to make, as Michael D O’Shea  pointed out in comment #2, is that computer systems don’t usually “sort data” – they tend to create pointers to data and shuffle the pointers in some way. In Oracle’s case “sorting” used to mean inserting pointers into a balanced binary tree, and aggregating used to be a case of accumulating values at the leaf nodes of the insertion tree. In 10g Oracle then introduced a new sorting algorithm that often works more efficiently than the binary insertion tree. I’m still going to refer to the binary tree method as “sorting”, though.

Looking at the query, we can see that there is no “order by” clause so it’s possible that Oracle will do whatever it does using hash aggregation throughout and no sorting, but that leaves open the question of how a hash table on owner can also record a distinct count of both object_type and object_name because every single owner hash bucket would have to link to its own hash tables for object_type and object_name and do a sort of “recursive hash aggregation” which starts to sound a little complicated. Maybe the alternative suggested by Kaley in comment #1 is closer to the truth – maybe Oracle just “buckets” all the data by owner and then sorts within each owner twice to do the count distincts, but then we’re still going to be hanging on to a lot of data, doing a two-level open-ended process.

Having waved hands for a little bit to try and head in the direction of possible solutions we need to look for clues that tell us whether we ought to eliminate or refine some of our guesses. There are several bits of information we could look at and running the query (although I asked you not to) is the next step we have to take. But when we run the query we want to see the session statistics, pick up the actual execution plan with rowsource execution statistics, and enable the 10032 and 10033 (sort) traces. So let’s fold the query into a longer script, something like:


set linesize 255
set trimspool on
set pagesize 60

set serveroutput off
alter session set statistics_level = all;

execute snap_my_stats.start_snap

alter session set events '10032 trace name context forever';
alter session set events '10033 trace name context forever';

select owner ... etc.

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set events '10033 trace name context off';
alter session set events '10032 trace name context off';

execute snap_my_stats.end_snap

Of course, to avoid blurring around the edges, we may have to isolate the three different tests – the query against dbms_xplan.display_cursor(), for example, is obviously going to have some impact on the session stats – and we may then want to run each test twice in succession so that any warm-up or parsing activities don’t confuse the issue. It would also be a good idea to run the tests after creating a new session in case there are some distracting side effects from creating the data set. But with these details addressed, here are a few results:

First the execution plan (I got these results from 12.2.0.1, all recent versions of Oracle behave similarly):


----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     16 |00:00:00.34 |    1018 |       |       |          |
|   1 |  SORT GROUP BY     |      |      1 |     16 |     16 |00:00:00.34 |    1018 |  5014K|  1445K| 4456K (0)|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  50000 |  50000 |00:00:00.07 |    1018 |       |       |          |
----------------------------------------------------------------------------------------------------------------

Oracle uses a SORT GROUP BY, not a HASH GROUP BY, and the indications are that we used about 4.4MB of memory to sort 50,000 rows. Then there’s the session statistics:


Name                                                                         Value
----                                                                         -----

session uga memory max                                                   3,255,648
session pga memory max                                                   3,211,264

table scan rows gotten                                                      50,000

sorts (memory)                                                                   1
sorts (rows)                                                               150,000

This says we did only one sort operation, and sorted 150,000 rows using an excess over starting pga/uga of 3.2MB (rather than the 4.4MB suggested by the plan); possibly the variation in apparent memory usage could be explained by the way that Oracle allocates reasonably large chunks to grow the PGA when you grow a workarea, but since I’m taking deltas there’s some scope for being misled by the change in maximum pga/uga memory.

Finally the 10032 trace file shows the following (we didn’t spill to disc, so the 10033 trace wasn’t triggered):


---- Sort Parameters ------------------------------
sort_area_size                    4562944
sort_area_retained_size           4562944
sort_multiblock_read_count        7
max intermediate merge width      38

---- Sort Statistics ------------------------------
Input records                             150000
Output records                            49907
Total number of comparisons performed     1945863
  Comparisons performed by in-memory sort 1945863
Total amount of memory used               4562944
Uses version 1 sort
---- End of Sort Statistics -----------------------

These figures are ones we have to trust – it seems we really did do one sort operation of 150,000 rows, and we did allocate 4.5MB of memory. There’s an obvious guess for the 150,000 input rows – Oracle has turned every row into three rows: the original row, one to count the object_type, and one to count the object_name – and with that in mind maybe we will be able to make sense of getting an output of 49,907 rows using 4.5MB of memory. Let’s create a query that could produce the right numbers, though with a differently arranged output:


select distinct owner, 0, null        from t1
union all
select distinct owner, 1, object_type from t1
union all
select distinct owner, 2, object_name from t1
order by 1, 2, 3
;

For my data set this query produced 49,907 rows of output (which is a number we wanted to see) and here are the first 9 rows of output – followed by the first row of output from the original query:


OWNER                    0 NULL
--------------- ---------- ---------------------
APPQOSSYS                0

APPQOSSYS                1 SYNONYM
APPQOSSYS                1 TABLE

APPQOSSYS                2 DBMS_WLM
APPQOSSYS                2 WLM_CLASSIFIER_PLAN
APPQOSSYS                2 WLM_FEATURE_USAGE
APPQOSSYS                2 WLM_METRICS_STREAM
APPQOSSYS                2 WLM_MPA_STREAM
APPQOSSYS                2 WLM_VIOLATION_STREAM


OWNER           COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOBJECT_NAME)
--------------- -------------------------- --------------------------
APPQOSSYS                                2                          6

Spot the pattern ? All I have to do after this “union all with simple sort” is walk the result set in order accumulating distinct counts as I do so.
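To make that walk concrete, here’s a sketch that reproduces the count(distinct) results from the tagged union all (not Oracle’s internal code, just the same logic expressed in SQL):

select
	owner,
	count(case when tag = 1 then 1 end)	ct_distinct_type,
	count(case when tag = 2 then 1 end)	ct_distinct_name
from	(
	select distinct owner, 1 tag, object_type val from t1
	union all
	select distinct owner, 2 tag, object_name val from t1
	)
group by
	owner
order by
	owner
;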

But what about the memory requirements ? Check back to the original 10032 trace file: it reported 4562944 bytes as the total memory needed, but it also reported using a “Version 1” sort – so before I ran my query I set “_newsort_enabled”=false, to get the following 10032 trace details:


---- Sort Statistics ------------------------------
Input records                             49907
Output records                            49907
Total number of comparisons performed     730667
  Comparisons performed by in-memory sort 730667
Total amount of memory used               4562944
Uses version 1 sort
---- End of Sort Statistics -----------------------

The memory used is exactly the number I wanted to see. (The new version 2 sort got exactly the same result using 3.5MB of memory, so I don’t know why it’s not used at this point – but maybe that’s because the implementation isn’t quite what I think.)
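For reference, switching between the two sort implementations is a session-level setting of that hidden parameter (the usual caveat applies: underscore parameters are for investigation on test systems only):

alter session set "_newsort_enabled" = false;	-- revert to the version 1 sort

-- ... run the test ...

alter session set "_newsort_enabled" = true;	-- back to the default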

So, conclusion (to date, until a reader shows me that my answer is incomplete – or wrong): As the “SORT GROUP BY” operation accepts each row from the “TABLE ACCESS FULL” operation it converts each row into N rows (one base row and one for each column for which there is a count(distinct)). The base row holds just the set of “group by” columns and is tagged with a zero; each subsequent row holds the “group by” columns, a tag value to identify which column it carries, and one of the “count(distinct)” columns. Oracle then operates the normal “group by” aggregation mechanism but is actually aggregating on the “group by” columns plus the “tag” column. So each leaf node in the binary tree ends up holding {owner_value, tag_value, count}, and once all the data has been aggregated into the binary tree Oracle can walk the tree and perform a pivot to turn the three rows for each owner value into a single row.

If you start thinking about nasty scenarios you will realise that the upshot of this implementation (if my hypothesis is correct) is that if you have a query with a long “group by” list, and several columns where there are lots of distinct values for each combination of the “group by” list then the volume of the B-tree could actually be much larger than the volume of the original table, and the amount of memory and CPU needed to build that tree (before collapsing it) could be huge.

Footnote:

There is one special case with this count(distinct …) query. If you have only ONE distinct operation in the query Oracle can use the “distinct aggregation” transformation with “view merging” to produce a completely different plan.

PostgreSQL Active Session History extension is now publicly available

Publicly available

A quick one to let you know that the pgsentinel extension providing active session history sampling is now publicly available.

You can find more details about it in this previous blog post and also from some early beta testers:

Thank you Franck Pachot and Daniel Westermann for beta testing and sharing.

Where to find it?

The extension is available from the pgsentinel github repository. Please keep in mind that at the time of this writing the extension is still in beta, so it may contain some bugs: don’t hesitate to raise issues on github with your bug reports.

Where to follow pgsentinel stuff?

On the website, twitter or github. More stuff to come, stay tuned.

Please do contribute

If you’re missing some functionality, then you’re welcome to make pull requests.

pushing predicates

I came across this odd limitation (maybe defect) with pushing predicates (join predicate push down) a few years ago. Fixing it made a dramatic difference to a client query, but the limitation managed to hide itself rather cunningly until you looked closely at what was going on. Searching my library for something completely different I’ve just rediscovered the model I built to demonstrate the issue, so I’ve tested it against a couple of newer versions of Oracle (including 18.1) and found that the anomaly still exists. It’s an interesting little detail about checking execution plans properly so I’ve written up the details. The critical feature of the problem is a union all view:


rem
rem	Script:		push_pred_limitation.sql
rem	Author:		Jonathan Lewis
rem	Dated:		Jan 2015
rem
rem	Last tested 
rem		18.1.0.0	via LiveSQL
rem		12.2.0.1
rem		12.1.0.2
rem		11.2.0.4
rem

create table t1
as
select	* 
from	all_objects
where	rownum <= 10000 -- > comment to avoid WordPress format issue
;

create table t2
as
select	* 
from	all_objects
where	rownum <= 10000 -- > comment to avoid WordPress format issue
;

create table t3
as
select	* 
from	all_objects
where	rownum <= 10000 -- > comment to avoid WordPress format issue
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1 for columns owner size 254'
	);
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt	 => 'for all columns size 1'
	);
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T3',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create index t2_id on t2(object_id);
-- create index t2_id_ot on t2(object_id, object_type);

create index t3_name_type on t3(object_name, object_type);

create or replace view v1
as
select 
	/*+ qb_name(part1) */
	t2.object_id,
	t2.object_type	object_type_2,
	t3.object_type	object_type_3,
	t2.created	date_2,
	t3.created	date_3
from
	t2, t3
where
	t3.object_name = t2.object_name
union all
select
	/*+ qb_name(part2) */
	t2.object_id,
	t2.object_type	object_type_2,
	t3.object_type	object_type_3,
	t2.last_ddl_time	date_2,
	t3.last_ddl_time	date_3
from
	t2, t3
where
	t3.object_name = t2.object_name
;

Two points to note so far: first, the view is basically joining the same two tables in the same way twice but selecting different columns. It’s a close model of what the client was doing, but so much simpler that it wouldn’t be hard to find a different way of getting the same result: the client’s version would have been far harder to rewrite. Secondly, I’ve listed two possible indexes for table t2 but commented one of them out. The indexing will make a difference that I’ll describe later.

So here’s the query with execution plan (from explain plan – but pulling the plan from memory gives the same result):


select
	/*+ qb_name(main) */
	t1.object_name, t1.object_type,
	v1.object_id, v1.date_2, v1.date_3
from
	t1,
	v1
where
	v1.object_id = t1.object_id
and	v1.object_type_2 = t1.object_type
and	v1.object_type_3 = t1.object_type
and	t1.owner = 'OUTLN'
;

Plan hash value: 4123301926

---------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |              |     7 |   588 |    82   (2)| 00:00:01 |
|   1 |  NESTED LOOPS                            |              |     7 |   588 |    82   (2)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                      | T1           |     7 |   280 |    26   (4)| 00:00:01 |
|*  3 |   VIEW                                   | V1           |     1 |    44 |     8   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE            |              |       |       |            |          |
|   5 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|   6 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN                  | T2_ID        |     1 |       |     1   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  10 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
|  11 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|  12 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|  13 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|* 14 |        INDEX RANGE SCAN                  | T2_ID        |     1 |       |     1   (0)| 00:00:01 |
|* 15 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  16 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."OWNER"='OUTLN')
   3 - filter("V1"."OBJECT_TYPE_2"="T1"."OBJECT_TYPE")
   8 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   9 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
  14 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
  15 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")

The execution plan appears to be fine – we can see at operation 4 that the union all view has been accessed with the pushed predicate option, and that the subsequent sub-plan has used index driven nested loop joins in both branches – until we look a little more closely and examine the Predicate section of the plan. What, exactly, has been pushed ?

Look at the predicate for operation 3: "V1"."OBJECT_TYPE_2"="T1"."OBJECT_TYPE". It’s a join predicate that hasn’t been pushed into the view. On the other hand the original, and similar, join predicate v1.object_type_3 = t1.object_type has been pushed into the view, appearing at operations 9 and 15. There is a difference, of course: the object_type_3 column appears as the second column of the index on table t3.

Two questions then: (a) will the object_type_2 predicate be pushed if we add it to the relevant index on table t2, (b) is there a way to get the predicate pushed without adding it to the index. The answer to both questions is yes. First the index – re-run the test but create the alternative index on t2 and the plan changes to:

Plan hash value: 497545587

---------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |              |     7 |   553 |    82   (2)| 00:00:01 |
|   1 |  NESTED LOOPS                            |              |     7 |   553 |    82   (2)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                      | T1           |     7 |   280 |    26   (4)| 00:00:01 |
|   3 |   VIEW                                   | V1           |     1 |    39 |     8   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE            |              |       |       |            |          |
|   5 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|   6 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN                  | T2_ID_OT     |     1 |       |     1   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  10 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
|  11 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|  12 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|  13 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|* 14 |        INDEX RANGE SCAN                  | T2_ID_OT     |     1 |       |     1   (0)| 00:00:01 |
|* 15 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  16 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."OWNER"='OUTLN')
   8 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID" AND "T2"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
   9 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
  14 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID" AND "T2"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
  15 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")

Notice how the predicate at operation 3 has disappeared, and the access predicate at operation 8 now includes the predicate "T2"."OBJECT_TYPE"="T1"."OBJECT_TYPE".

Alternatively, don’t mess about with the indexes – just tell Oracle to push the predicate. Normally I would just try /*+ push_pred(v1) */ as the hint to do this, but the Outline section of the original execution plan already included a push_pred() hint that looked like this: PUSH_PRED(@"MAIN" "V1"@"MAIN" 3 1), so I first copied exactly that into the SQL to see if it would make any difference. It did – I got the following plan (and the hint in the outline changed to PUSH_PRED(@"MAIN" "V1"@"MAIN" 3 2 1), so this may be a case where the plan produced by a baseline will perform better than the plan that produced the baseline!):

Plan hash value: 4123301926

---------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |              |     7 |   553 |    82   (2)| 00:00:01 |
|   1 |  NESTED LOOPS                            |              |     7 |   553 |    82   (2)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                      | T1           |     7 |   280 |    26   (4)| 00:00:01 |
|   3 |   VIEW                                   | V1           |     1 |    39 |     8   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE            |              |       |       |            |          |
|   5 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|   6 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|*  7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN                  | T2_ID        |     1 |       |     1   (0)| 00:00:01 |
|*  9 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  10 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
|  11 |     NESTED LOOPS                         |              |     1 |    77 |     4   (0)| 00:00:01 |
|  12 |      NESTED LOOPS                        |              |     1 |    77 |     4   (0)| 00:00:01 |
|* 13 |       TABLE ACCESS BY INDEX ROWID BATCHED| T2           |     1 |    41 |     2   (0)| 00:00:01 |
|* 14 |        INDEX RANGE SCAN                  | T2_ID        |     1 |       |     1   (0)| 00:00:01 |
|* 15 |       INDEX RANGE SCAN                   | T3_NAME_TYPE |     1 |       |     1   (0)| 00:00:01 |
|  16 |      TABLE ACCESS BY INDEX ROWID         | T3           |     1 |    36 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."OWNER"='TEST_USER')
   7 - filter("T2"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
   8 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   9 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
  13 - filter("T2"."OBJECT_TYPE"="T1"."OBJECT_TYPE")
  14 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
  15 - access("T3"."OBJECT_NAME"="T2"."OBJECT_NAME" AND "T3"."OBJECT_TYPE"="T1"."OBJECT_TYPE")

In this case we see that the critical late-joining predicate has disappeared from operation 3 and re-appeared as a filter predicate at operation 7. In many cases you may find that the change in predicate use makes little difference to the performance – in my example the variation in run time over several executions of each query was larger than the average run time of the query; nevertheless it’s worth noting that the delayed use of the predicate could have increased the number of probes into table t3 for both branches of the union all, resulting in redundant data passing up through several layers of the call stack before being eliminated … and “eliminate early” is one of the major commandments of optimisation.

You might notice that the Plan Hash Value for the hinted execution plan is the same as for the original execution plan: the hashing algorithm doesn’t take the predicates into account (just one of many points that Randolf Geist raised in a blog post several years ago). This is one of the little details that makes it easy to miss the little changes in a plan that can make a big difference in performance.
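So when comparing two runs don’t rely on the plan hash value alone; pull the predicate section explicitly, for example:

select * from table(dbms_xplan.display_cursor(null, null, 'basic +predicate'));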

Summary

If you have SQL that joins simple tables to set-based (union all, etc.) views and you see the pushed predicate option appearing, take a little time to examine the predicate section of the execution plan to see if the optimizer is pushing all the join predicates that it should and, if it isn’t, test the effects of pushing more predicates.

In many cases adding the hint /*+ push_pred(your_view_name) */ at the top of the query may be sufficient to get the predicate pushing you need, but you may need to look at the outline section of the execution plan and add a series of more complicated push_pred() and no_push_pred() hints because the push_pred hint has evolved over time to deal with increasingly complicated transformations.
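Applied to the example above, the simple starting point would look like this (a sketch; as noted, you may need the more detailed form from the outline instead):

select
	/*+ qb_name(main) push_pred(v1@main) */
	t1.object_name, t1.object_type,
	v1.object_id, v1.date_2, v1.date_3
from
	t1,
	v1
where
	v1.object_id = t1.object_id
and	v1.object_type_2 = t1.object_type
and	v1.object_type_3 = t1.object_type
and	t1.owner = 'OUTLN'
;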

 

Month Over Month and Newer Schtuff- Power BI

Today’s post is brought to you by Patrick LeBlanc of Guy in a Cube. I learn best by doing, so I was working with different features while watching along on Quick Measures:

As a newbie, yes, I had problems with my quick measures just as Patrick said I would, but with a twist: it wasn’t that I didn’t want to learn DAX, quite the opposite. I could get the expression to work just fine with DAX, but couldn’t seem to get the hang of the quick measure. Leave it to me to have challenges with the *simpler* method…

pgSentinel: the sampling approach for PostgreSQL

Here is the first test I did with the beta of pgSentinel. This Active Session History sampling is a new approach to Postgres tuning. For people coming from Oracle, this is something that has made our life a lot easier when optimizing database applications. Here is a quick example showing how it links together some information that is missing without this extension.

The installation of the extension is really easy (more details in Daniel’s post):

cp pgsentinel.control /usr/pgsql-10/share/extension
cp pgsentinel--1.0.sql /usr/pgsql-10/share/extension
cp pgsentinel.so /usr/pgsql-10/lib

and declare it in postgresql.conf

grep -i pgSentinel $PGDATA/postgresql.conf
 
shared_preload_libraries = 'pg_stat_statements,pgsentinel'
#pgsentinel_ash.pull_frequency = 1
#pgsentinel_ash.max_entries = 1000000

and restart:

/usr/pgsql-10/bin/pg_ctl restart

Then create the views in psql:

CREATE EXTENSION pgsentinel;
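A quick sanity check that the extension is in place (a sketch from psql; pg_active_session_history is the view queried throughout this post):

\dx pgsentinel

select count(*) from pg_active_session_history;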

I was running PGIO (the SLOB method for PostgreSQL from Kevin Closson https://kevinclosson.net/)

Without the extension, here is what I can see about the current activity from the OS point of view, with ‘top -c’:

top - 21:57:23 up 1 day, 11:22, 4 users, load average: 4.35, 4.24, 4.16
Tasks: 201 total, 2 running, 199 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.6 us, 19.0 sy, 0.0 ni, 31.0 id, 19.0 wa, 0.0 hi, 3.4 si, 0.0 st
KiB Mem : 4044424 total, 54240 free, 282220 used, 3707964 buff/cache
KiB Swap: 421884 total, 386844 free, 35040 used. 3625000 avail Mem
 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9766 postgres 20 0 440280 160036 150328 D 50.0 4.0 10:56.63 postgres: postgres pgio [local] SELECT
9762 postgres 20 0 439940 160140 150412 D 43.8 4.0 10:55.95 postgres: postgres pgio [local] SELECT
9761 postgres 20 0 440392 160088 150312 D 37.5 4.0 10:52.29 postgres: postgres pgio [local] SELECT
9763 postgres 20 0 440280 160080 150432 R 37.5 4.0 10:41.94 postgres: postgres pgio [local] SELECT
9538 postgres 20 0 424860 144464 142956 D 6.2 3.6 0:30.79 postgres: writer process

As I described in a previous post, PostgreSQL changes the title of the process to display the current operation. This looks interesting, but it is not very detailed (only ‘SELECT’ here) and very misleading, because here I’m running PGIO with 50% updates. The ‘SELECT’ here is the user call, not the actual SQL statement running.

We have more information from PG_STAT_ACTIVITY, but again only the top-level call is displayed, as I mentioned in a previous post:

select * from pg_stat_activity where pid=9766;
-[ RECORD 1 ]----+---------------------------------------------------------
datid | 17487
datname | pgio
pid | 9766
usesysid | 10
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2018-07-12 21:28:46.539052+02
xact_start | 2018-07-12 21:28:46.542203+02
query_start | 2018-07-12 21:28:46.542203+02
state_change | 2018-07-12 21:28:46.542209+02
wait_event_type | IO
wait_event | DataFileWrite
state | active
backend_xid | 37554
backend_xmin | 37553
query | SELECT * FROM mypgio('pgio4', 50, 3000, 131072, 255, 8);
backend_type | client backend

Here, I know what the user is doing: a call to mypgio() started at 21:28:46. And I know which resources are involved on the system: DataFileWrite. But again the most important thing is missing: the link between the user call and the system resources. And you can only guess it here because you know that a SELECT does not write to datafiles. There’s something hidden in the middle, which is actually an UPDATE. Of course, we can see this UPDATE in PG_STAT_STATEMENTS. But there, it will not be linked with the current activity, the mypgio() call, nor the DataFileWrite wait event. And we also need some timing information to be able to see the database load over time.

Here is where the pgSentinel extension fills the gap, providing:

  • The actual query running, with the queryid which links to PG_STAT_STATEMENTS, but also the full text with all parameter values
  • multiple samples of the activity, with their timestamp information


select ash_time,pid,wait_event_type,wait_event,state,queryid,backend_type,top_level_query,query from pg_active_session_history order by ash_time desc,pid fetch first 10 rows only;
 
ash_time | pid | wait_event_type | wait_event | state | queryid | backend_type | top_level_query | query
-------------------------------+------+-----------------+---------------+--------+------------+----------------+----------------------------------------------------------+--------------------------------------------------------------------------
2018-07-12 21:57:22.991558+02 | 9761 | IO | DataFileWrite | active | 837728477 | client backend | SELECT * FROM mypgio('pgio2', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 1065 AND 1320
2018-07-12 21:57:22.991558+02 | 9762 | IO | DataFileWrite | active | 1046864277 | client backend | SELECT * FROM mypgio('pgio3', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio3 WHERE mykey BETWEEN 267 AND 522
2018-07-12 21:57:22.991558+02 | 9763 | IO | DataFileRead | active | 1648177216 | client backend | SELECT * FROM mypgio('pgio1', 50, 3000, 131072, 255, 8); | UPDATE pgio1 SET scratch = scratch + 1 WHERE mykey BETWEEN 1586 AND 1594
2018-07-12 21:57:22.991558+02 | 9766 | IO | DataFileWrite | active | 3411884874 | client backend | SELECT * FROM mypgio('pgio4', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN 3870 AND 4125
2018-07-12 21:57:21.990178+02 | 9761 | CPU | CPU | active | 837728477 | client backend | SELECT * FROM mypgio('pgio2', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 13733 AND 13988
2018-07-12 21:57:21.990178+02 | 9762 | IO | DataFileRead | active | 1046864277 | client backend | SELECT * FROM mypgio('pgio3', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio3 WHERE mykey BETWEEN 4135 AND 4390
2018-07-12 21:57:21.990178+02 | 9763 | IO | DataFileWrite | active | 2994234299 | client backend | SELECT * FROM mypgio('pgio1', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio1 WHERE mykey BETWEEN 4347 AND 4602
2018-07-12 21:57:21.990178+02 | 9766 | CPU | CPU | active | 3411884874 | client backend | SELECT * FROM mypgio('pgio4', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio4 WHERE mykey BETWEEN 14423 AND 14678
2018-07-12 21:57:20.985253+02 | 9761 | IO | DataFileWrite | active | 837728477 | client backend | SELECT * FROM mypgio('pgio2', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio2 WHERE mykey BETWEEN 129 AND 384
2018-07-12 21:57:20.985253+02 | 9762 | IO | DataFileWrite | active | 1046864277 | client backend | SELECT * FROM mypgio('pgio3', 50, 3000, 131072, 255, 8); | SELECT sum(scratch) FROM pgio3 WHERE mykey BETWEEN 3313 AND 3568
(10 rows)

Everything is there: a timeline where each sample links together the user call (top_level_query), the running query (queryid and query – the text with parameter values), and the wait event (wait_event_type and wait_event).
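
Because every sample exposes the queryid, a sampled statement can also be tied back to its cumulative statistics. A minimal sketch, assuming the pg_stat_statements extension is installed (column names as of PostgreSQL 10/11; 837728477 is a queryid taken from the samples above):

select queryid, calls, total_time, rows, query
from pg_stat_statements
where queryid = 837728477;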

Here is, on one sample, what is currently available in the beta version:

select * from pg_active_session_history where pid=9766 order by ash_time desc fetch first 1 rows only;
-[ RECORD 1 ]----+-----------------------------------------------------------------------
ash_time | 2018-07-12 21:57:23.992798+02
datid | 17487
datname | pgio
pid | 9766
usesysid | 10
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2018-07-12 21:28:46.539052+02
xact_start | 2018-07-12 21:28:46.542203+02
query_start | 2018-07-12 21:28:46.542203+02
state_change | 2018-07-12 21:28:46.542209+02
wait_event_type | IO
wait_event | DataFileExtend
state | active
backend_xid | 37554
backend_xmin | 37553
top_level_query | SELECT * FROM mypgio('pgio4', 50, 3000, 131072, 255, 8);
query | UPDATE pgio4 SET scratch = scratch + 1 WHERE mykey BETWEEN 700 AND 708
queryid | 1109524376
backend_type | client backend

So, what do we do with this? It is a fact table with many dimensions, and we can drill down into the database activity.

A quick overview of the load shows that I have, on average, 4 foreground sessions running for my user calls, and very low vacuuming activity:

postgres=# select backend_type
postgres-# ,count(*)/(select count(distinct ash_time)::float from pg_active_session_history) as load
postgres-# from pg_active_session_history
postgres-# group by backend_type
postgres-# ;
backend_type | load
-------------------+--------------------
client backend | 4.09720483938256
autovacuum worker | 0.07467667918231
(2 rows)
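
Going one step further, the same samples can approximate the database time consumed per statement, since each one-second sample counts as roughly one second of active time. A hypothetical drill-down (top 5 statements by sampled activity):

select queryid,
       count(*) as seconds_active,
       round(100.0 * count(*) / sum(count(*)) over (), 1) as pct_load
from pg_active_session_history
where backend_type = 'client backend'
group by queryid
order by seconds_active desc
fetch first 5 rows only;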

I’ll show more ways to query this view, and how to drill down into the details, in a future post. For the moment, here is a short explanation of the reasons for going to a sampling approach.

Here is an abstract sequence diagram showing some typical user calls to the database. Several components are involved: CPU for the backend process or for background processes, the OS, the storage… Our tuning goal is to reduce the user call duration, and then to reduce or optimize the work done in the different layers. With the statistics currently available in PostgreSQL, such as PG_STAT_ACTIVITY or PG_STAT_STATEMENTS, or available from the OS (strace to measure system call durations), we have a vertical view of the load. We can look at each component individually:
[Diagram PGASHWHY001: the vertical view – statistics accumulated per component]

This is basically what we did on Oracle before ASH (Active Session History) was introduced in 10g, 12 years ago. The activity sampling approach takes an orthogonal point of view. Rather than accumulating statistics for each component, it looks at what happens on the system at specific points in time, across all components. We don’t have all the measures (such as the number of executions of a query) but only samples. However, each sample gives a complete view from the user call down to the system calls. And one-second samples are sufficient to capture any relevant activity, without taking too much space for a short retention. For each sample, we cover all layers end-to-end:
[Diagram PGASHWHY002: the horizontal view – each sample spans all layers end-to-end]
This horizontal approach makes the link between the user calls (the user’s perception of database performance) and the system resources, where we can analyze and optimize. With this, we can ensure that our tuning activity always focuses on the problem (the user response time) by addressing the root cause on the right component.

 

This article, pgSentinel: the sampling approach for PostgreSQL, appeared first on Blog dbi services.

Cardinality Puzzle

One of the difficulties of being a DBA required to solve performance problems is that you probably never have enough time to think about how you got to a solution and why the solution works; and if you don’t learn about the process itself, you just don’t get better at it. That’s why I try (at least some of the time) to write articles and books (as I did with CBO Fundamentals) that

  1. explain simple details that can be used as background facts
  2. walk through the steps of solving a problem

So here’s an example from a question on the ODC database forum asking about the cause and workaround for a bad cardinality estimate that is producing a poorly performing execution plan. It’s actually a type of problem that comes up quite frequently on large data sets and explains why a simple “gather stats” is almost guaranteed to leave you with a few headaches (regardless of whether or not you choose to include histograms as part of the process). I’m not going to offer “the answer” – I’m just going to talk about the inferences we can make from the facts supplied and where we have to go from there.

The DBA has a table holding 80,000,000,000 rows. It is list/hash partitioned with 2 partitions and 1,024 sub-partitions (per partition), but neither of the partitioning key columns appears in the query. The query runs parallel, and the optimizer (presumably thanks to the specific settings of various parameters related to parallel execution) uses dynamic sampling at level 3.

There is an inline view defined in the query and the DBA has isolated this as a key component of the problem and supplied a query and plan (from “explain plan”) against that view.


select * from TAB2 T
WHERE T.DT = to_date(:b1,'MM/DD/YYYY HH24:MI:SS');
 
------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                          |   479M|    76G|  1756K (14)| 05:51:14 |       |       |        |      |            |
|   1 |  PX COORDINATOR              |                          |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10000                 |   479M|    76G|  1756K (14)| 05:51:14 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL     |                          |   479M|    76G|  1756K (14)| 05:51:14 |     1 |  1024 |  Q1,00 | PCWC |            |
|*  4 |     TABLE ACCESS STORAGE FULL| TAB1                     |   479M|    76G|  1756K (14)| 05:51:14 |     1 |  2048 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - storage(COALESCE("TB1"."DFG",'N')='N' AND TRUNC(INTERNAL_FUNCTION("TB1"."DT_TM"))=TO_DATE(:B1,'MM/DD/YYYY
              HH24:MI:SS'))
       filter(COALESCE("TB1"."DFG",'N')='N' AND TRUNC(INTERNAL_FUNCTION("TB1"."DT_TM"))=TO_DATE(:B1,'MM/DD/YYYY
              HH24:MI:SS'))

Note
-----
   - dynamic sampling used for this statement (level=3)

The DBA’s problem is that if the estimated cardinality of this extract goes over roughly 500M the optimizer chooses a bad plan for the overall query – and on occasion this extract has given an estimate of 5 billion rows. Moreover, the actual number of rows returned by this extract is typically in the order of 40M, so the estimate is a long way off even when it’s “good enough”.

So where do we start looking to work out what’s going wrong? You’ll note, of course, that after text expansion the user’s single predicate has changed, and an extra predicate (previously hidden inside the view) has appeared; instead of just T.DT = to_date(:b1,’MM/DD/YYYY HH24:MI:SS’) we now have (cosmetically adjusted):

        COALESCE(DFG,'N')='N' 
AND     TRUNC(DT_TM)=TO_DATE(:B1,'MM/DD/YYYY HH24:MI:SS')

There are two immediately obvious threats here – first, that the combination of predicates means Oracle is likely to make a mistake because it will check the individual selectivities and multiply them together to get the combined selectivity; second, that the appearance of predicates of the form “function(column) = constant” means that Oracle will guess 1% as the individual selectivity for each.
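
As a reminder of how those guesses multiply up, here’s a minimal sketch on a scratch table (t3 is purely illustrative, not the OP’s data); with no virtual columns or extended stats in place, each “function(column) = constant” predicate gets the fixed 1% guess:

create table t3 as
select
	sysdate - mod(rownum,438)	dt_tm,
	'N'				dfg,
	rownum				n1
from
	dual
connect by
	level <= 1e5
;

begin
	dbms_stats.gather_table_stats(
		ownname    => user,
		tabname    => 'T3',
		method_opt => 'for all columns size 1'
	);
end;
/

explain plan for
select	*
from	t3
where	trunc(dt_tm) = trunc(sysdate)
and	coalesce(dfg,'N') = 'N'
;

-- expect an estimate of roughly 10 rows, i.e. 100,000 * 1% * 1%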

Without checking more details we might assume that a possible quick fix (one that would require no changes to existing code) would be to create a couple of virtual columns (or extended stats) to represent the two expressions and gather stats on the resulting columns (sketched below) – though it is a restriction of extended stats that you can’t “double up” and create a column group on the two column expressions, so there’s still some scope for a sufficiently bad cardinality estimate even with this approach. We also note that if we could change the coalesce(DFG,’N’) that must have been hidden in the view to nvl(DFG,’N’) then Oracle would be able to “or expand” the nvl() and use a more appropriate selectivity for that part of the predicate.
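
Here’s what that quick fix might look like – a sketch only, using the table and column names from the plan above:

select dbms_stats.create_extended_stats(user, 'TAB1', '(trunc(dt_tm))') from dual;
select dbms_stats.create_extended_stats(user, 'TAB1', '(coalesce(dfg,''N''))') from dual;

begin
	dbms_stats.gather_table_stats(
		ownname    => user,
		tabname    => 'TAB1',
		method_opt => 'for all hidden columns size 1'
	);
end;
/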

However, the points I’ve covered so far tend to produce estimates that are too small – often much too small. So maybe the key to the problem is in the Note section, which tells us that Oracle has (successfully) used dynamic sampling for this statement. In other words, all the theory of how the optimizer calculates selectivity may be irrelevant – the estimate will be based on the luck of the sample.

So let’s take a look at the (slightly edited) table stats we’ve been given:

column_name data_type num_distinct low_value      high_value     density   num_null  histogram
DT_TM       DATE           6179571 78740B1E0A383C 7876020B01262B 1.6182E-7 0         NONE
DFG         VARCHAR2             1 4E             4E             1         0         NONE

Notice that DFG (apparently) holds the value ‘N’ for every row in the table (low_value = high_value = 0x4E, num_nulls = 0). The date range is 30-Nov-2016 to 11-Feb-2018, with 6.18M distinct values for 80 billion rows; neither column has a histogram.

A little arithmetic tells us that (on average) there ought to be about 182M (= 80B / 438 days) rows for any one day – and that’s worth thinking about for three separate reasons.

First, an estimate of 479M against an average of 182M isn’t too surprising if it’s based on a fairly small sample – it’s only out by a factor of 2.6. On the other hand, getting an estimate of 5 billion – which can happen on bad days – is extremely unlikely if the data is uniformly distributed across dates.

Secondly, the DBA supplied us with some data from the recent past with an aggregate query for “trunc(dt_tm)”, with the following results:

TRUNC(DT_TM)   COUNT(*)
------------ ----------
01-FEB-18    44,254,425
02-FEB-18    46,585,349
03-FEB-18    43,383,099
04-FEB-18    32,748,364
05-FEB-18    37,993,126
06-FEB-18    39,708,994
07-FEB-18    38,696,777
08-FEB-18    41,871,780
09-FEB-18    46,702,852
10-FEB-18    42,744,870
11-FEB-18    34,971,845
12-FEB-18    37,165,983

Recent data seems to follow an average of around 40M rows per day, so the estimate of 182M that we can derive from the stored statistics is a long way off: the present is behaving very differently from the past, and that’s a relatively common problem with very large data sets – though it’s more usual for rolling averages to increase from the past to the present because the data is often representing the growth of a business over time. Can we create a hypothesis to explain the discrepancy, and could that hypothesis also account for the sample producing some very strange estimates?

Finally – slightly more subtle and only included for completeness – if this column is supposed to hold date and time to the nearest second (which is what you might expect from an Oracle date type) there are 38 million possible values (438 x 86,400) it could be holding, which is more than the actual number of distinct values by a factor of 6. We can also work out that 80 billion rows over 438 days is 2,000 rows per second (on average). Averages are often misleading, of course: many systems have a pattern where a working day shows most of the data created in a 12 – 16 hour window with a couple of hours of more intense activity. For reference, though: the average rate for the recent data is roughly 40M/86,400 = 460 rows per second, while the average we derive from the stored statistics is 80B/6M = 13,000 rows for each recorded second; this unlikely pattern needs a “non-uniform” explanation.

How do these three thoughts help us to understand – or, to be more accurate, to make a sensible guess about – why the optimizer can use dynamic sampling and get a wildly variable estimate that can be one or two orders of magnitude wrong (varying between 479M and 5,000M compared to the recent actual 40M)?

Here’s one simple idea: extrapolate the 40M rows per day over 80B rows: that’s 2,000 days (possibly rather more since businesses tend to grow). What if the dt_tm is the timestamp for the moment the row was loaded into the database, and a couple of years ago (maybe around “30th Nov 2016”) the data was restructured and the existing five years of data was loaded over a very short period of time – let’s say one week. This would leave you with 17B rows of “new” data with a dt_tm spread at 40M rows per day for most of 438 days, and 63B rows of “historic” data packed into 7 days (at 9B rows per day).

I don’t know how Oracle would have randomly selected its sample from an extremely large table with 2,048 physical data segments but it’s totally believable that a small, widely scattered sample could end up with an extremely unrepresentative subset of the data. A completely random sample of the data would produce an estimate of around 500M rows for the predicate; but it would only take a fairly small variation in the sample (taking a few too many “historic” rows) to produce a large enough change in the estimate to change the execution plan, and a rare, but not extreme variation could easily take the estimate up to 5B.

Next Steps

It would be at this point in a performance assignment that I’d be asking around to find out if my guess about a massive data load operation was correct. If I couldn’t get the answer by talking to people I’d run a query against the whole data set to check the hypothesis (a sketch of such a check follows below), because there’s clearly some sort of skew in the data that’s causing a problem. I’d also run the critical part of the query a couple of times with events 10046 (level 4) and 10053 set (but only fetching the first few rows) to find out from the trace file how large a sample Oracle was using, and then run the sampling query a few times to see what the sampled results looked like. Depending on the results I’d either find a way to stop Oracle from sampling for this query, or I might create a virtual column (or just extended stats since it’s 11g) on just the trunc(dt_tm), possibly with a histogram in place (maybe coded by hand) if that could isolate the special dates and leave Oracle with a better estimate of the typical date. I might find I had to change the coalesce() to an nvl() as well – or create a virtual column – to stop the sampling.
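
A hypothesis-checking query could be as simple as the following sketch (table and column names from the plan above; on 80 billion rows this is a significant, run-it-parallel-and-off-peak exercise):

select
	trunc(dt_tm,'MM')	mth,
	count(*)		row_count
from
	tab1
group by
	trunc(dt_tm,'MM')
order by
	trunc(dt_tm,'MM')
;

-- a huge spike in the month(s) around 30-Nov-2016 would support the
-- "historic bulk reload" hypothesis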

Finally, it’s worth noting that in 11g it’s possible to create pending (table preference “PUBLISH” = FALSE) stats for testing purposes; it’s also worth noting that the default histogram on trunc(dt_tm) would be a height-balanced histogram while we could create a frequency histogram in 12c since 12c allows us to specify up to 2,048 buckets.
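
As a sketch of the pending-stats idea (the mechanism as documented from 11g onwards; names taken from the plan above):

begin
	dbms_stats.set_table_prefs(user, 'TAB1', 'PUBLISH', 'FALSE');
	dbms_stats.gather_table_stats(
		ownname    => user,
		tabname    => 'TAB1',
		method_opt => 'for columns (trunc(dt_tm)) size 254'
	);
end;
/

alter session set optimizer_use_pending_statistics = true;

-- test the critical query here, then either
-- dbms_stats.publish_pending_stats(user, 'TAB1') or
-- dbms_stats.delete_pending_stats(user, 'TAB1')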

Footnote

If you check the ODC thread you’ll see that the OP has marked as correct a suggestion to change:

    TRUNC (TB1.DT_TM)  = to_date(:b1,'MM/DD/YYYY HH24:MI:SS');  

to

    dt_tm >= trunc(to_date(:b1,'MM/DD/YYYY HH24:MI:SS'))
and dt_tm <  trunc(to_date(:b1,'MM/DD/YYYY HH24:MI:SS'))+1

Note that that’s “greater than or equal to” at one end and “strictly less than” at the other when using “date + 1”.

This has the effect of giving the optimizer a chance of using the low/high values of the column to produce a better (though perhaps still overlarge) and consistent estimate of the rows in the date range; and it may also stop the optimizer from doing dynamic sampling at level 3 (the “I’m guessing, let’s check” level) though it’s possible that the sampling would be disabled only if the coalesce() were changed to an nvl() as well.

Of course, from the information supplied, it looks as if the OP would have to change a view definition and the run-time code to achieve the result. But in an ideal world, doing things that avoid confusing the optimizer is usually the sensible strategy, provided it doesn’t take an extreme amount of coding and testing.

 

DBMS_CLOUD Package – A Reference Guide

Appendix A of the Using Oracle Autonomous Data Warehouse Cloud guide describes the DBMS_CLOUD package. Unfortunately, it documents only a subset of the subroutines and, for some of them, the description could be enhanced. Therefore, while testing all the subroutines the DBMS_CLOUD package provides, I took a number of notes. By the end of my tests, I had what I can call my personal reference guide to the package. Since it might help others, here it is…

Introduction

To interact with other cloud services, the Oracle Autonomous Data Warehouse Cloud service provides the DBMS_CLOUD package. With it, the database engine can store/retrieve objects into/from the following cloud services:

  • Oracle’s Object Storage
  • Amazon Simple Storage Service (S3)
  • Microsoft’s Azure Blob Storage

Those services store data as objects within buckets (containers). Therefore, when in this post I use the terms “object” and “bucket”, I mean the concepts provided by those services.

Objects are identified with a URI having the following format (placeholders in angle brackets; for more details refer to the documentation):

  • Oracle Cloud: https://swiftobjectstorage.<region>.oraclecloud.com/v1/<namespace>/<bucket>/<object>
  • AWS: https://s3-<region>.amazonaws.com/<bucket>/<object>
  • Azure: https://<account>.blob.core.windows.net/<container>/<object>

Access to the objects stored within buckets is protected. To access an object, the Oracle Autonomous Data Warehouse Cloud service has to provide identification information. That information is stored in a credential object.

Managing Credentials

This section describes subroutines to handle credential objects that give access to cloud services.

CREATE_CREDENTIAL

The CREATE_CREDENTIAL procedure creates a credential into the current schema. Its specification is the following:

PROCEDURE create_credential(
  credential_name IN VARCHAR2,
  username        IN VARCHAR2,
  password        IN VARCHAR2 DEFAULT NULL,
  tenancy_ocid    IN VARCHAR2 DEFAULT NULL,
  user_ocid       IN VARCHAR2 DEFAULT NULL,
  private_key     IN VARCHAR2 DEFAULT NULL,
  public_key      IN VARCHAR2 DEFAULT NULL,
  region          IN VARCHAR2 DEFAULT NULL
);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)

The meaning of the other parameters depends on the accessed cloud services. Refer to the documentation for information about them.

Notes:

  • A credential is a regular database object belonging to a schema. Its name must comply with the usual naming rules of Oracle Database.

Example:

BEGIN
  dbms_cloud.create_credential(
    credential_name => 'CHRIS',
    username => 'chris',
    password => 'mysecret'
  );
END;

DROP_CREDENTIAL

The DROP_CREDENTIAL procedure drops a credential object from the current schema. Its specification is the following:

PROCEDURE drop_credential(credential_name IN VARCHAR2);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)

Notes:

  • The credential is dropped even if another object relies on it.

Example:

BEGIN
  dbms_cloud.drop_credential(credential_name => 'CHRIS');
END;

ENABLE_CREDENTIAL

The ENABLE_CREDENTIAL procedure enables a credential stored in the current schema. Its specification is the following:

PROCEDURE enable_credential(credential_name IN VARCHAR2);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)

Example:

BEGIN
  dbms_cloud.enable_credential(credential_name => 'CHRIS');
END;

DISABLE_CREDENTIAL

The DISABLE_CREDENTIAL procedure disables a credential stored in the current schema. Its specification is the following:

PROCEDURE disable_credential(credential_name IN VARCHAR2);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)

Example:

BEGIN
  dbms_cloud.disable_credential(credential_name => 'CHRIS');
END;

UPDATE_CREDENTIAL

The UPDATE_CREDENTIAL procedure updates an attribute associated to a credential stored in the current schema. Its specification is the following:

PROCEDURE update_credential(
  credential_name IN VARCHAR2,
  attribute       IN VARCHAR2,
  value           IN VARCHAR2
);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • ATTRIBUTE: identifies the attribute to update (valid values are USERNAME, PASSWORD, TENANCY_OCID, USER_OCID, PRIVATE_KEY, PUBLIC_KEY and REGION)
  • VALUE: the new value to assign to the attribute

Example:

BEGIN
  dbms_cloud.update_credential(
    credential_name => 'CHRIS',
    attribute => 'PASSWORD',
    value => 'anothersecret'
  );
END;

Managing Tables

This section describes subroutines that interact with database tables.

CREATE_EXTERNAL_TABLE

The CREATE_EXTERNAL_TABLE procedure creates an external table that can be used to query data stored into a cloud service from the database engine. Its specification is the following:

PROCEDURE create_external_table(
  table_name      IN VARCHAR2,
  credential_name IN VARCHAR2,
  file_uri_list   IN CLOB,
  column_list     IN CLOB,
  field_list      IN CLOB DEFAULT NULL,
  format          IN CLOB DEFAULT NULL
);

Parameters:

  • TABLE_NAME: the name of the external table
  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • FILE_URI_LIST: comma-delimited list of one or several objects stored in a cloud service
  • COLUMN_LIST: comma-delimited list of column names and data types
  • FIELD_LIST: describes the fields in the object; it has the same syntax as the FIELD_LIST clause of regular external tables; refer to the documentation for further information
  • FORMAT: data formatting options in JSON format; refer to the documentation for further information.

Notes:

  • The external table is created with the NOLOGFILE and NOBADFILE options. So, in case of trouble, there is no easy way to immediately understand what went wrong. To validate the data and find out what is wrong, the VALIDATE_EXTERNAL_TABLE procedure is available (see next section).
  • The FILE_URI_LIST parameter supports the wildcards “*” and “?”.
  • The external table is created in the current schema.

Example:

BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name => 'CHANNELS',
    credential_name => 'CHRIS',
    file_uri_list => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/channels.txt',
    column_list => 'channel_id CHAR(1), channel_desc VARCHAR2(20), channel_class VARCHAR2(20)',
    format => json_object('type' VALUE 'CSV')
  );
END;
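
Once created, the external table can be queried like any other table; for example, using the CHANNELS table defined just above:

SELECT channel_id, channel_desc, channel_class
FROM channels
WHERE channel_class = 'Direct';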

VALIDATE_EXTERNAL_TABLE

The VALIDATE_EXTERNAL_TABLE procedure validates an external table; specifically, it checks whether data can be loaded through it. It is overloaded to also provide the ability to return an id associated with the operation; this id gives access to details about the operation itself. Its specification is the following:

PROCEDURE validate_external_table(
  table_name    IN VARCHAR2,
  schema_name   IN VARCHAR2 DEFAULT  NULL,
  rowcount      IN NUMBER   DEFAULT  0,
  stop_on_error IN BOOLEAN  DEFAULT  TRUE
);
PROCEDURE validate_external_table(
  table_name    IN         VARCHAR2,
  operation_id  OUT NOCOPY NUMBER,
  schema_name   IN         VARCHAR2 DEFAULT  NULL,
  rowcount      IN         NUMBER   DEFAULT  0,
  stop_on_error IN         BOOLEAN  DEFAULT  TRUE
);

Parameters:

  • TABLE_NAME: the name of the external table
  • SCHEMA_NAME: the owner of the external table (NULL = current schema)
  • ROWCOUNT: how many rows to read during the validation (0 = all rows)
  • STOP_ON_ERROR: whether to stop immediately in case bad records are found
  • OPERATION_ID: output parameter that provides the id of the load operation; it can for example be used to identify the corresponding information in USER_LOAD_OPERATIONS

Notes:

  • For two days, the log and bad files associated with the operation can be accessed through external tables; their names are visible in USER_LOAD_OPERATIONS.LOGFILE_TABLE and USER_LOAD_OPERATIONS.BADFILE_TABLE (see the lookup sketch below).
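
For example, a quick lookup of those table names, given the OPERATION_ID returned by the second overload (a sketch, with a bind variable standing in for the id):

SELECT logfile_table, badfile_table
FROM user_load_operations
WHERE id = :operation_id;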

Example:

BEGIN
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE (
    table_name => 'CHANNELS'
  );
END;

COPY_DATA

The COPY_DATA procedure reads an object stored in a cloud service, through an external table it creates, and loads the contents into a heap table. In other words, it can be used to move data from a cloud service to a heap table. It is overloaded to also provide the ability to return an id associated with the load operation; this id gives access to details about the operation itself. Its specification is the following:

PROCEDURE copy_data(
  table_name      IN VARCHAR2,
  credential_name IN VARCHAR2,
  file_uri_list   IN CLOB,
  schema_name     IN VARCHAR2 DEFAULT NULL,
  field_list      IN CLOB     DEFAULT NULL,
  format          IN CLOB     DEFAULT NULL
);
PROCEDURE copy_data(
  table_name      IN         VARCHAR2,
  credential_name IN         VARCHAR2,
  file_uri_list   IN         CLOB,
  operation_id    OUT NOCOPY NUMBER,
  schema_name     IN         VARCHAR2 DEFAULT NULL,
  field_list      IN         CLOB     DEFAULT NULL,
  format          IN         CLOB     DEFAULT NULL
);

Parameters:

  • TABLE_NAME: the name of the heap table
  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • FILE_URI_LIST: comma-delimited list of one or several objects stored in a cloud service; refer to the documentation for further information
  • SCHEMA_NAME: the owner of the heap table (NULL = current schema)
  • FIELD_LIST: describes the fields in the object; it has the same syntax as the FIELD_LIST clause of external table; refer to the documentation for further information
  • FORMAT: data formatting options in JSON format; refer to the documentation for further information
  • OPERATION_ID: output parameter that provides the id of the load operation; it can for example be used to identify the corresponding information in USER_LOAD_OPERATIONS

Notes:

  • The FILE_URI_LIST parameter supports the wildcards “*” and “?”.
  • To access the object stored in the cloud service, an external table is created and, when the load operation is over, immediately dropped.
  • For two days, the log and bad files associated with the operation can be accessed through external tables; their names are visible in USER_LOAD_OPERATIONS.LOGFILE_TABLE and USER_LOAD_OPERATIONS.BADFILE_TABLE.

Example:

BEGIN
  dbms_cloud.copy_data(
    table_name => 'CHANNELS',
    credential_name => 'CHRIS',
    file_uri_list => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/channels.txt',
    format => json_object('type' VALUE 'CSV')
  );
END;

DELETE_OPERATION

The DELETE_OPERATION procedure cleans up the objects and information created during the execution of an operation carried out by the VALIDATE_EXTERNAL_TABLE and COPY_DATA procedures. Its specification is the following:

PROCEDURE delete_operation(id IN NUMBER);

Parameters:

  • ID: the id of the operation to be cleaned up (USER_LOAD_OPERATIONS.ID)

Notes:

  • Only operations carried out by the current user are considered.

Example:

BEGIN
  dbms_cloud.delete_operation(id => 42);
END;

DELETE_ALL_OPERATIONS

The DELETE_ALL_OPERATIONS procedure cleans up the objects and information related to all operations carried out by the VALIDATE_EXTERNAL_TABLE and COPY_DATA procedures. Its specification is the following:

PROCEDURE delete_all_operations(type IN VARCHAR DEFAULT NULL);

Parameters:

  • TYPE: the type of the operations to delete (USER_LOAD_OPERATIONS.TYPE; NULL = all types)

Notes:

  • Only operations carried out by the current user are considered.

Example:

BEGIN
  dbms_cloud.delete_all_operations(type => 'VALIDATE');
END;

Managing Objects

This section describes subroutines to handle objects stored in a cloud service.

PUT_OBJECT

The PUT_OBJECT procedure stores an object in a bucket. It is overloaded to provide the ability to read the data either from a BLOB or from a file in a directory object. In other words, it can be used to move data from the database server to a cloud service. Its specification is the following:

PROCEDURE put_object(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2,
  contents        IN BLOB,
  compression     IN VARCHAR2 DEFAULT NULL
);
PROCEDURE put_object(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2,
  directory_name  IN VARCHAR2,
  file_name       IN VARCHAR2,
  compression     IN VARCHAR2 DEFAULT NULL
);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • OBJECT_URI: the URI of the object or, when the data is read from a directory object, the URI of the bucket
  • DIRECTORY_NAME: the name of the directory object (ALL_DIRECTORIES.DIRECTORY_NAME)
  • FILE_NAME: the name of the file
  • COMPRESSION: the compression algorithm used to store the object; valid values are DBMS_CLOUD.COMPRESS_NONE (NULL), DBMS_CLOUD.COMPRESS_AUTO (‘AUTO’) and DBMS_CLOUD.COMPRESS_GZIP (‘GZIP’).

Notes:

  • If the object already exists, it is overwritten.
  • In case a directory object is used and only the destination bucket is specified, the file name is used as the object name.
  • Even though the compression algorithms ZLIB and BZIP2 are defined by the package, they are not supported by the PUT_OBJECT procedures.

Example:

  • Get data from a BLOB stored in the database
DECLARE
  l_contents BLOB;
BEGIN
  SELECT report_compressed INTO l_contents FROM dba_hist_reports_details WHERE rownum = 1;
  dbms_cloud.put_object(
    credential_name => 'CHRIS',
    object_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/report.dat',
    contents => l_contents
  );
END;
  • Get data from the DATA_PUMP_DIR directory object
BEGIN
  dbms_cloud.put_object(
    credential_name => 'CHRIS',
    object_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/',
    directory_name => 'DATA_PUMP_DIR',
    file_name => 'channels.txt',
    compression => dbms_cloud.compress_auto
  );
END;

LIST_OBJECTS

The LIST_OBJECTS pipelined function returns the list of objects stored in a bucket. Its specification as well as the specification of the type of the return value are the following:

FUNCTION list_objects(
  credential_name IN VARCHAR2,
  location_uri    IN VARCHAR2
) RETURN dbms_cloud_types.list_object_ret_tab PIPELINED;
TYPE list_object_ret_t IS RECORD (object_name VARCHAR2(4000), bytes NUMBER);
TYPE list_object_ret_tab IS TABLE OF list_object_ret_t;

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • LOCATION_URI: the address of the bucket

Return value:

  • The table contains one record for each object. Every record provides the name of the object as well as its size in bytes. If the object is compressed, the size reported is that of the compressed object, not of the original data.

Example:

SELECT * 
FROM table(dbms_cloud.list_objects(
             credential_name => 'CHRIS', 
             location_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/'
           ))
WHERE object_name LIKE '%.txt'

OBJECT_NAME    BYTES
------------- ------
channels.txt     104

GET_METADATA

The GET_METADATA function returns the metadata associated with an object. Its specification is the following:

FUNCTION get_metadata(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2
) RETURN CLOB;

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • OBJECT_URI: the URI of the object

Return value:

  • The metadata associated with the object, in JSON format.

Example:

SELECT dbms_cloud.get_metadata(
         credential_name => 'CHRIS', 
         object_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/channels.txt'
       ) AS metadata
FROM dual

METADATA 
-----------------------
{"Content-Length":104}

GET_OBJECT

The GET_OBJECT function reads an object stored in a bucket and returns it as a BLOB. It is overloaded to also provide the ability to write the data into a file stored in a directory object. In other words, it can be used to move data from a cloud service to the database server. Its specification is the following:

FUNCTION get_object(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2,
  startOffset     IN NUMBER   DEFAULT 0,
  endOffset       IN NUMBER   DEFAULT 0,
  compression     IN VARCHAR2 DEFAULT NULL
) RETURN BLOB;
FUNCTION get_object(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2,
  directory_name  IN VARCHAR2,
  file_name       IN VARCHAR2 DEFAULT NULL,
  startOffset     IN NUMBER   DEFAULT 0,
  endOffset       IN NUMBER   DEFAULT 0,
  compression     IN VARCHAR2 DEFAULT NULL
) RETURN BLOB;

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • OBJECT_URI: the URI of the object
  • STARTOFFSET: the offset (in bytes) from where data is read
  • ENDOFFSET: the offset (in bytes) until where data is read
  • DIRECTORY_NAME: the name of the directory object (ALL_DIRECTORIES.DIRECTORY_NAME)
  • FILE_NAME: the name of the file
  • COMPRESSION: the compression algorithm used to store the object; valid values are DBMS_CLOUD.COMPRESS_NONE (NULL), DBMS_CLOUD.COMPRESS_AUTO (‘AUTO’) and DBMS_CLOUD.COMPRESS_GZIP (‘GZIP’).

Return value:

  • A LOB locator giving access to the data read from the object.

Notes:

  • In case a directory object is used and the file name is not specified, the object name is used as file name.
  • The offsets start from 0. For example, to read the first 1000 bytes, STARTOFFSET has to be set to 0 and ENDOFFSET has to be set to 999.
  • If the compression algorithm matches the one used to store the object, data is uncompressed.

Example:

SELECT to_clob(dbms_cloud.get_object(
         credential_name => 'CHRIS',
         object_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/channels.txt',
         startOffset => 0,
         endOffset => 20
       )) AS data
FROM dual

DATA 
----------------------
S,Direct Sales,Direct

DELETE_OBJECT

The DELETE_OBJECT procedure removes an object from a bucket. Its specification is the following:

PROCEDURE delete_object(
  credential_name IN VARCHAR2,
  object_uri      IN VARCHAR2
);

Parameters:

  • CREDENTIAL_NAME: the name of the credential object (USER_CREDENTIALS.CREDENTIAL_NAME)
  • OBJECT_URI: the URI of the object

Example:

BEGIN
  dbms_cloud.delete_object(
    credential_name => 'CHRIS',
    object_uri => 'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/chris/tmp/channels.txt'
  );
END;

Managing Files

This section describes subroutines to handle files stored in the file system of the database server. Since no directory object can be created, only the files stored in the directory referenced by the DATA_PUMP_DIR directory object are visible.

LIST_FILES

The LIST_FILES pipelined function lists the files stored in a directory. Its specification as well as the specification of the type of the return value are the following:

FUNCTION list_files(
  directory_name IN VARCHAR2
) RETURN dbms_cloud_types.list_object_ret_tab PIPELINED;
TYPE list_object_ret_t IS RECORD (object_name VARCHAR2(4000), bytes NUMBER);
TYPE list_object_ret_tab IS TABLE OF list_object_ret_t;

Parameters:

  • DIRECTORY_NAME: the name of the directory object (ALL_DIRECTORIES.DIRECTORY_NAME)

Return value:

  • The table contains one record for each file. Every record provides the name of the file as well as its size in bytes.

Notes:

  • To call the function successfully, the caller requires the READ privilege on the directory object.

Example:

SELECT * 
FROM table(dbms_cloud.list_files('DATA_PUMP_DIR'))
WHERE object_name LIKE '%.txt'

OBJECT_NAME    BYTES
------------- ------
channels.txt      99

DELETE_FILE

The DELETE_FILE procedure deletes a single file from a directory. Its specification is the following:

PROCEDURE delete_file(
  directory_name IN VARCHAR2,
  file_name      IN VARCHAR2
);

Parameters:

  • DIRECTORY_NAME: the name of the directory object (ALL_DIRECTORIES.DIRECTORY_NAME)
  • FILE_NAME: the name of the file

Notes:

  • To call the procedure successfully, the caller requires the WRITE privilege on the directory object.
  • If the file does not exist, no exception is raised.
  • Wildcards are not supported.

Example:

BEGIN
  dbms_cloud.delete_file(
    directory_name => 'DATA_PUMP_DIR',
    file_name      => 'channels.txt'
  );
END;

Dumping SGA to read encrypted blocks

After my last article, AMM vs security, Martin Berger wrote to me:

well,
even without AMM you can do it:
write your own process which attaches to the same shm segments – and use its memory mapping (?)

My response was that it is also possible with ASMM, but AMM makes it extremely easy, because with AMM you can treat memory as regular binary files.

Today I want to show you how to dump blocks from an SGA configured with ASMM, to get at encrypted data that is also protected by Oracle Database Vault. To set up the environment I will use examples from a previous blog post.

The code for sgadump is written in Go, so please use these instructions to set up your environment: https://golang.org/doc/install

Please make sure that your ENV is set up properly:

[oracle@rokoko ~]$ cat .bash_profile | grep go
GOPATH=/home/oracle/go
PATH=$PATH:/usr/local/go/bin:$GOPATH/bin

After installing GoLang you can do the following to compile my tool:

[oracle@rokoko ~]$ mkdir -p ~/go/src/github.com/ghetzel
[oracle@rokoko ~]$ mkdir -p ~/go/src/github.com/ora600pl
[oracle@rokoko ~]$ cd ~/go/src/github.com/ghetzel
[oracle@rokoko ghetzel]$ git clone https://github.com/ghetzel/shmtool
Cloning into 'shmtool'...
remote: Counting objects: 92, done.
remote: Total 92 (delta 0), reused 0 (delta 0), pack-reused 92
Unpacking objects: 100% (92/92), done.
[oracle@rokoko ghetzel]$ cd ~/go/src/github.com/ora600pl
[oracle@rokoko ora600pl]$ git clone https://github.com/ora600pl/sgadump
Cloning into 'sgadump'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
[oracle@rokoko ora600pl]$ cd sgadump/
[oracle@rokoko sgadump]$ ls
README.md  sgadump.go
[oracle@rokoko sgadump]$ go install

If your environment was set up properly, you’ll find a binary under ~/go/bin/sgadump and can use it like this:

[oracle@rokoko sgadump]$ sgadump
sgadump by Kamil Stawiarski (@ora600pl) - dumps database blocks from SGA.
Usage: sgadump -b block_size -d data_object_id -s shmid -o output_file_name

Cool. Now let’s make sure that we are using ASMM and not AMM:

SQL> sho parameters memory_target

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
memory_target			     big integer 0
SQL> sho parameters sga_target

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
sga_target			     big integer 1120M

Now we will check that the tool works – first of all we have to find the data_object_id of HR.EMPLOYEES, load the encrypted data into memory, get the shmid of the SGA segment, and dump the blocks. Easy, right?
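
As a hedged sketch of those steps (HR.EMPLOYEES as above; the 8192 block size is an assumption, and the shmid has to be taken from the output of ipcs -m at the OS level):

-- find the data_object_id of the segment whose blocks we want to dump
select data_object_id
from   dba_objects
where  owner = 'HR'
and    object_name = 'EMPLOYEES';

-- load the encrypted blocks into the buffer cache
select /*+ full(e) */ count(*) from hr.employees e;

-- then, at the OS level, as the oracle user:
-- sgadump -b 8192 -d <data_object_id> -s <shmid> -o employees.dmp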