# cardinality

## Temp Table Transformation Cardinality Estimates - 2

Continuing from the previous part - which was about the Temp Table Transformation and join cardinality estimates - using the same simple table setup here is a slight variation of the previously used query to demonstrate the potential impact on single table cardinality estimates:

`explain plan forwithcte as (select /* inline */ id from t1 twhere 1 = 1)select /*+           no_merge(a) no_merge(b)       */ * from cte a, cte bwhere a.id = b.idand a.id > 990 and b.id > 990;-- 11.2.0.x Plan with TEMP transformation--------------------------------------------------------------------------------`

## Temp Table Transformation Cardinality Estimates - 1

Having published recently two notes about the Temp Table Transformation highlighting the heuristics based decision and other weaknesses, for example regarding the projection of columns, it's time to publish some more notes about it.The transformation can also have significant impact on cardinality estimates, both join and single table cardinality.Looking at the difference in the join cardinality estimates of following simple example:

`create table t1asselect        rownum as id      , mod(rownum, 10) + 1 as id2      , rpad('x', 100) as fillerfrom        dualconnect by        level <= 1000;`

## HAVING Cardinality

When performing aggregate GROUP BY operations an additional filter on the aggregates can be applied using the HAVING clause.Usually aggregates are one of the last steps executed before the final result set is returned to the client.However there are various reasons, why a GROUP BY operation might be somewhere in the middle of the execution plan operation, for example it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), or in the more recent releases (11g+) the optimizer decided to use the GROUP BY PLACEMENT transformation that deliberately can move the GROUP BY operation to a different execution step of the plan.In such cases, when the GROUP BY operation will be input to some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as it will influe

## Forced Cursor Sharing And Virtual Columns

So you have that application that cannot be changed but makes use of some weird expressions that screw up the cardinality estimates of the optimizer.

Consider this simple example:

create table t as select rownum as id , case when mod(rownum, 100000) = 0 then 'y' else 'n' end as flag , rpad('x', 100) as filler from dual connect by level <= 1000000 ; exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns flag size 2', estimate_percent => 30) select flag, count(*) from t group by flag; explain plan for select count(*) from t where flag = 'y' ; set linesize 200 tab off pagesize 0 select * from table(dbms_xplan.display);

## Column Groups - Edge Cases

Oracle 11g added Extended Statistics support for column groups in order to detect correlated columns for filter predicates using an equal comparison.

Note that Oracle 11g also added the ability to use the number of distinct keys of a composite index as an upper limit for the cardinality estimates for matching column predicates, which means that the optimizer is now capable of detecting correlated columns without the explicit addition of Extended Statistics / Column Groups.

## Dynamic Sampling On Multiple Partitions - Bugs

In a recent OTN thread I've been reminded of two facts about Dynamic Sampling that I already knew but had forgotten in the meantime:

1. The table level dynamic sampling hint uses a different number of blocks for sampling than the session / cursor level dynamic sampling. So even if for both for example level 5 gets used the number of sampled blocks will be different for most of the 10 levels available (obviously level 0 and 10 are exceptions)

2. The Dynamic Sampling code uses a different approach for partitioned objects if it is faced with the situation that there are more partitions than blocks to sample according to the level (and type table/cursor/session) of Dynamic Sampling

Note that all this here applies to the case where no statistics have been gathered for the table - I don't cover the case when Dynamic Sampling gets used on top of existing statistics.

## Table Functions And Join Cardinality Estimates

If you consider the usage of Table Functions then you should be aware of some limitations to the optimizer calculations, in particular when considering a join between a Table Function and other row sources.

As outlined in one of my previous posts you can and should help the optimizer to arrive at a reasonable cardinality estimate when dealing with table functions, however doing so doesn't provide all necessary inputs to the join cardinality calculation that are useful and available from the statistics when dealing with regular tables.

Therefore even when following the recommended practice regarding the cardinality estimates it is possible to end up with some inaccuracies. This post will explain why.

Join Cardinality Basics

## Star Transformation And Cardinality Estimates

If you want to make use of Oracle's cunning Star Transformation feature then you need to be aware of the fact that the star transformation logic - as the name implies - assumes that you are using a proper star schema.

Here is a nice example of what can happen if you attempt to use star transformation but your model obviously doesn't really correspond to what Oracle expects:

drop table d;

purge table d;

drop table t;

purge table t;

create table t
as
select
rownum as id
, mod(rownum, 100) + 1 as fk1
, 1000 + mod(rownum, 10) + 1 as fk2
, 2000 + mod(rownum, 100) + 1 as fk3
from
dual
connect by
level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't')

create bitmap index t_fk1 on t (fk1);

## Multi-Column Joins, Expressions and 11g

Introduction

I've already outlined in one of my previous posts that getting a reasonable cardinality estimate for multi-column joins can be tricky, in particular when dealing with correlated column values in the join columns.

Since Oracle 10g several "Multi-Column Join Cardinality" sanity checks have been introduced that prevent a multi-column join from producing too low join cardinalities - this is controlled via the "_optimizer_join_sel_sanity_check" internal parameter that defaults to true from 10g on.

## Implicit Datatype Conversion + Histograms = Bad Execution Plan?

Earlier today I exchanged some tweets with @martinberx about some optimizer questions and after posting more information on the ORACLE-L list, I was able to reproduce what he was observing.

### The issue:

DB: 11.2.0.2.0 – 64bit /> I have a small query with a little error, which causes big troubles. /> The relevant part of the query is /> WHERE …. /> AND inst_prod_type=003 /> AND setid=’COM01′

but INST_PROD_TYPE is VARCHAR2.

this leads to filter[ (TO_NUMBER("INST_PROD_TYPE")=3 AND "SETID"='COM01') ]

based on this TO_NUMBER ( I guess!) the optimiser takes a fix selectivity of 1%.

Can someone tell me if this 1% is right? Jonathan Lewis “CBO Fundamentals” on page 133 is only talking about character expressions.

Unfortunately there are only 2 distinct values of INST_PROD_TYPE so this artificial [low] selectivity leads to my problem: /> An INDEX SKIP SCAN on PS0RF_INST_PROD is choosen. (columns of PS0RF_INST_PROD: INST_PROD_TYPE, SETID, INST_PROD_ID )

After fixing the statement to /> AND inst_prod_type=’003′ /> another index is used and the statement performs as expected.

Now I have no problem, but want to find the optimizers decisions in my 10053 traces.

### The Important Bits of Information

From Martin’s email we need to pay close attention to:

• Predicate of “inst_prod_type=003″ where INST_PROD_TYPE is VARCHAR2 (noting no single quotes around 003)
• Implicite datatype conversion in predicate section of explain plan – TO_NUMBER(“INST_PROD_TYPE”)=3
• only 2 distinct values of INST_PROD_TYPE

From this information I’ll construct the following test case:

```create table foo (c1 varchar2(8));
insert into foo select '003' from dual connect by level <= 1000000;
insert into foo select '100' from dual connect by level <= 1000000;
commit;
exec dbms_stats.gather_table_stats(user,'foo');
```

And using the display_raw function we’ll look at the column stats.

```col low_val     for a8
col high_val    for a8
col data_type   for a9
col column_name for a11

select
a.column_name,
display_raw(a.low_value,b.data_type) as low_val,
display_raw(a.high_value,b.data_type) as high_val,
b.data_type,
a.density,
a.histogram,
a.num_buckets
from
user_tab_col_statistics a, user_tab_cols b
where
a.table_name='FOO' and
a.table_name=b.table_name and
a.column_name=b.column_name
/

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2          .5 NONE                      1
```

Take note of the lack of a histogram.

Now let’s see what the CBO estimates for a simple query with and without quotes (explicit cast and implicit cast).

```SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(TO_NUMBER("C1")=003)

14 rows selected.

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("C1"='003')

14 rows selected.
```

In this case the estimated number of rows is spot on – 1 million rows. Now lets regather stats and because of our queries using C1 predicates, it will become a candidate for a histogram. We can see this from sys.col_usage\$.

```select  oo.name owner,
o.name table_name,
c.name column_name,
u.equality_preds,
u.equijoin_preds,
u.nonequijoin_preds,
u.range_preds,
u.like_preds,
u.null_preds,
u.timestamp
from    sys.col_usage\$ u,
sys.obj\$ o,
sys.user\$ oo,
sys.col\$ c
where   o.obj#   = u.obj#
and     oo.user# = o.owner#
and     c.obj#   = u.obj#
and     c.col#   = u.intcol#
and     oo.name  = 'GRAHN'
and     o.name   = 'FOO'
/

OWNER TABLE_NAME COLUMN_NAME EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
----- ---------- ----------- -------------- -------------- ----------------- ----------- ---------- ---------- -------------------
GRAHN FOO        C1                       1              0                 0           0          0          0 2011-06-08 22:29:59
```

Regather stats and re-check the column stats:

```SQL> exec dbms_stats.gather_table_stats(user,'foo');

PL/SQL procedure successfully completed.

SQL> select
2     a.column_name,
3     display_raw(a.low_value,b.data_type) as low_val,
4     display_raw(a.high_value,b.data_type) as high_val,
5     b.data_type,
6     a.density,
7     a.histogram,
8     a.num_buckets
9  from
10     user_tab_col_statistics a, user_tab_cols b
11  where
12     a.table_name='FOO' and
13     a.table_name=b.table_name and
14     a.column_name=b.column_name
15  /

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2  2.5192E-07 FREQUENCY                 2
```

Note the presence of a frequency histogram. Now let’s re-explain:

```SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |     1 |     4 |   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(TO_NUMBER("C1")=003)

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1025K|  4006K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("C1"='003')
```

And whammy! Note that the implicit cast [ filter(TO_NUMBER("C1")=003) ] now has an estimate of 1 row (when we know there is 1 million). /> So what is going on here? Let’s dig into the optimizer trace for some insight:

```SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for FOO[FOO]
Column (#1):
NewDensity:0.243587, OldDensity:0.000000 BktCnt:5458, PopBktCnt:5458, PopValCnt:2, NDV:2
Column (#1): C1(
AvgLen: 4 NDV: 2 Nulls: 0 Density: 0.243587
Histogram: Freq  #Bkts: 2  UncompBkts: 5458  EndPtVals: 2
Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred
Table: FOO  Alias: FOO
Card: Original: 2000000.000000  Rounded: 1  Computed: 0.50  Non Adjusted: 0.50
Access Path: TableScan
Cost:  875.41  Resp: 875.41  Degree: 0
Cost_io: 853.00  Cost_cpu: 622375564
Resp_io: 853.00  Resp_cpu: 622375564
Best:: AccessPath: TableScan
Cost: 875.41  Degree: 1  Resp: 875.41  Card: 0.50  Bytes: 0
```

As you can see from the line

`Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred`

The presence of the histogram and the implicit conversion of TO_NUMBER(“C1″)=003 causes the CBO to use a density of 0 because it thinks it’s a non-existent value. The reason for this is that TO_NUMBER(“C1″)=003 is the same as TO_NUMBER(“C1″)=3 and for the histogram the CBO uses TO_CHAR(C1)=’3′ and 3 is not present in the histogram only ’003′ and ’100′.

### Dumb Luck?

So, what if the predicate contained a number that was not left padded with zeros, say 100, the other value we put in the table?

```SQL> explain plan for select count(*) from foo where c1=100;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1009K|  3944K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(TO_NUMBER("C1")=100)
```

While not exact, the CBO estimate is quite close to the 1 million rows with C1=’100′.

### Summary

It’s quite clear that Martin’s issue came down to the following:

• implicit casting
• presences of histogram