Join Cardinality – 5

So far in this series I’ve written about the way that the optimizer estimates cardinality for an equijoin where one end of the join has a frequency histogram and the other end has a histogram of type:

It’s now time to look at a join where the other end has a height-balanced histogram. Arguably it’s not sensible to spend time writing about this since you shouldn’t be creating them in 12c (depending, instead, on the hybrid histogram that goes with the auto_sample_size), and the arithmetic is different in 11g. However, there still seem to be plenty of people running 12c but not using the auto_sample_size and that means they could be generating some height-balanced histograms – so let’s generate some data and see what happens.

```
rem
rem     Script:         freq_hist_join_04a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem     Purpose:
rem
rem     Last tested
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4        Different results
rem

drop table t2 purge;
drop table t1 purge;

set linesize 156
set trimspool on
set pagesize 60

set feedback off

execute dbms_random.seed(0)

create table t1(
id              number(6),
n04             number(6),
n05             number(6),
n20             number(6),
j1              number(6)
)
;

create table t2(
id              number(8,0),
n20             number(6,0),
n30             number(6,0),
n50             number(6,0),
j2              number(6,0)
)
;

insert into t1
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
rownum                                  id,
mod(rownum,   4) + 1                    n04,
mod(rownum,   5) + 1                    n05,
mod(rownum,  20) + 1                    n20,
trunc(2.5 * trunc(sqrt(v1.id*v2.id)))   j1
from
generator       v1,
generator       v2
where
v1.id <= 10 -- > comment to avoid WordPress format issue
and     v2.id <= 10 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
rownum                                  id,
mod(rownum,   20) + 1                   n20,
mod(rownum,   30) + 1                   n30,
mod(rownum,   50) + 1                   n50,
28 - round(abs(7*dbms_random.normal))   j2
from
generator       v1
where
rownum <= 800 -- > comment to avoid WordPress format issue
;

commit;

prompt  ==========================================================
prompt  Using estimate_percent => 100 to get height-balanced in t2
prompt  ==========================================================

begin
dbms_stats.gather_table_stats(
ownname          => null,
tabname          => 'T1',
method_opt       => 'for all columns size 1 for columns j1 size 254'
);
dbms_stats.gather_table_stats(
ownname          => null,
tabname          => 'T2',
estimate_percent => 100,
method_opt       => 'for all columns size 1 for columns j2 size 20'
);
end;
/

```

As in earlier examples I’ve created some empty tables, then inserted randomly generated data (after calling the dbms_random.seed(0) function to make the data reproducible). Then I’ve gathered stats, knowing that there will be 22 distinct values in t2 so forcing a height-balanced histogram of 20 buckets to appear.

When we try to calculate the join cardinality we’re going to need various details from the histogram information, such as bucket sizes, number of distinct values, and so on, so in the next few queries to display the histogram information I’ve captured a few values into SQL*Plus variables. Here’s the basic information about the histograms on the join columns t1.j1 and t2.j2:

```
column num_distinct new_value m_t2_distinct
column num_rows     new_value m_t2_rows
column num_buckets  new_value m_t2_buckets
column bucket_size  new_value m_t2_bucket_size

select  table_name, column_name, histogram, num_distinct, num_buckets, density
from    user_tab_cols
where   table_name in ('T1','T2')
and     column_name in ('J1','J2')
order by
table_name
;

select  table_name, num_rows, decode(table_name, 'T2', num_rows/&m_t2_buckets, null) bucket_size
from    user_tables
where   table_name in ('T1','T2')
order by
table_name
;

column table_name format a3 heading "Tab"
break on table_name skip 1 on report skip 1

with f1 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T1'
and     column_name = 'J1'
),
f2 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T2'
and     column_name = 'J2'
)
select f1.* from f1
union all
select f2.* from f2
order by 1,2
;

Tab                  COLUMN_NAME          HISTOGRAM       NUM_DISTINCT NUM_BUCKETS    DENSITY
-------------------- -------------------- --------------- ------------ ----------- ----------
T1                   J1                   FREQUENCY                 10          10       .005
T2                   J2                   HEIGHT BALANCED           22          20 .052652266

Tab                    NUM_ROWS BUCKET_SIZE
-------------------- ---------- -----------
T1                          100
T2                          800          40

Tab      VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER
--- ---------- ------------------- ---------------
T1           2                   5               5
5                  15              20
7                  15              35
10                  17              52
12                  13              65
15                  13              78
17                  11              89
20                   7              96
22                   3              99
25                   1             100

T2           1                   0               0
14                   1               1
17                   1               2
18                   1               3
19                   1               4
20                   1               5
21                   2               7
22                   1               8
23                   1               9
24                   2              11
25                   2              13
26                   3              16
27                   2              18
28                   2              20

```

As you can see, there is a frequency histogram on t1 reporting a cumulative total of 100 rows; and the histogram on t2 is a height-balanced histogram of 20 buckets, showing 21, 24, 25, 26, 27 and 28 as popular values with 2,2,2,2,3 and 2 endpoints (i.e. buckets) respectively. You’ll also note that the t2 histogram has 21 rows with row/bucket 0 showing us the minimum value in the column and letting us know that bucket 1 is not exclusively full of the value 14. (If 14 had been the minimum value for the column as well as an end point Oracle would not have created a bucket 0 – that may be a little detail that isn’t well-known – and will be the subject of a little follow-up blog note.)

Let’s modify the code to join the two sets of hisogram data on data value – using a full outer join so we don’t lose any data but restricting ourselves to values where the histograms overlap. We’re going to follow the idea we’ve developed in earlier postings and multiply frequencies together to derive a join frequency, so we’ll start with a simple full outer join and assume that when we find a real match value we should behave as if the height-balanced buckets (t2) where the bucket count is 2 or greater represent completely full buckets and are popular values..

I’ve also included in this query (because it had a convenient full outer join) a column selection that counts how many rows there are in t1 with values that fall inside the range of the t2 histogram but don’t match a popular value in t2.

```
column unmatch_ct   new_value m_unmatch_ct
column product format 999,999.99

break on report skip 1
compute sum of product on report

with f1 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T1'
and     column_name = 'J1'
),
f2 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T2'
and     column_name = 'J2'
),
join1 as (
select
f1.value t1_value,
f2.value t2_value,
f1.frequency t1_frequency,
f2.frequency t2_frequency,
sum(
case
when f2.frequency > 1 and f1.frequency is not null
then 0
else f1.frequency
end
) over()        unmatch_ct,
f2.frequency * &m_t2_bucket_size *
case
when f2.frequency > 1 and f1.frequency is not null
then f1.frequency
end     product
from
f1
full outer join
f2
on
f2.value = f1.value
where
coalesce(f1.value, f2.value) between 2 and 25
--      coalesce(f1.value, f2.value) between &m_low and &m_high
order by
coalesce(f1.value, f2.value)
)
select  *
from    join1
;

T1_VALUE   T2_VALUE T1_FREQUENCY T2_FREQUENCY UNMATCH_CT     PRODUCT
---------- ---------- ------------ ------------ ---------- -----------
2			 5			99
5			15			99
7			15			99
10			17			99
12			13			99
14			      1 	99
15			13			99
17	   17		11	      1 	99
18			      1 	99
19			      1 	99
20	   20		 7	      1 	99
21			      2 	99
22	   22		 3	      1 	99
23			      1 	99
24			      2 	99
25	   25		 1	      2 	99	 80.00
-----------
sum								 80.00

```

We captured the bucket size (&m_bucket_size) for the t2 histogram as 40 in the earlier SQL, and we can see now that in the overlap range (which I’ve hard coded as 2 – 25) we have three buckets that identify popular values, but only one of them corresponds to a value in the frequency histogram on t1, so the Product column shows a value of 1 * 2 * 40 = 80. Unfortunately this is a long way off the prediction that the optimizer is going to make for the simple join. (Eventually we’ll see it’s 1,893 so we have a lot more rows to estimate for).

Our code so far only acounts for items that are popular in both tables. Previous experience tells us that when a popular value exists only at one end of the join predicate we need to derive a contribution to the total prediction through an “average selectivity” calculated for the other end of the join predicate. For frequency histograms we’ve seen that “half the number of the least frequently occuring value” seems to be the appropriate frequency estimate, and if we add that in we’ll get two more contributions to the total from the values 21 and 24 which appear in the height-balanced (t2) histogram as popular but don’t appear in the frequency (t1) histogram. Since the lowest frequency in t1 is 1 this would give us two contributions of 0.5 * 2 (buckets) * 40 (bucket size) viz: two contributions of 40 bringing our total to 160 – still a serious shortfall from Oracle’s prediction. So we need to work out how Oracle generates an “average frequency” for the non-popular values of t2 and then apply it to the 99 rows in t1 that haven’t yet been accounted for in the output above.

To calculate the “average selectivity” of a non-popular row in t2 I need a few numbers (some of which I’ve already acquired above). The total number of rows in the table (NR), the number of distinct values (NDV), and the number of popular values (NPV), from which we can derive the the number of distinct non-popular values and the number of rows for the non-popular values. The model that Oracle uses to derive these numbers is simply to assume that a value is popular if its frequency in the histogram is greater than one and the number of rows for that value is “frequency * bucket size”.

The first query we ran against the t2 histogram showed 6 popular values, accounting for 13 buckets of 40 rows each. We reported 22 distinct values for the column and 800 rows for the table so the optimizer assumes the non-popular values account for (22 – 6) = 16 distinct values and (800 – 13 * 40) = 280 rows. So the selectivity of non-popular values is (280/800) * (1/16) = 0.021875. This needs to be multiplied by the 99 rows in t1 which don’t match a popular value in t2 – so we now need to write some SQL to derive that number.

We could enhance our earlier full outer join and slot 0.5, 99, and 0.021875 into it as “magic” constants. Rather than do that though I’m going to write a couple of messy queries to derive the values (and the low/high range we’re interested in) so that I will be able to tweak the data later on and see if the formula still produces the right answer.

```
column range_low    new_value m_low
column range_high   new_value m_high
column avg_t1_freq  new_value m_avg_t1_freq
column new_density  new_value m_avg_t2_dens

with f1 as (
select  endpoint_value ep_val,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from    user_tab_histograms
where   table_name  = 'T1'
and     column_name = 'J1'
),
f2 as (
select  endpoint_value ep_val,
endpoint_number ep_num,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from    user_tab_histograms
where   table_name  = 'T2'
and     column_name = 'J2'
)
select
max(min_v) range_low, min(max_v) range_high, min(min_f)/2 avg_t1_freq, max(new_density) new_density
from    (
select  min(ep_val) min_v, max(ep_val) max_v, min(frequency) min_f, to_number(null) new_density
from f1
union all
select  min(ep_val) min_v, max(ep_val) max_v, null           min_f,
(max(ep_num) - sum(case when frequency > 1 then frequency end)) /
(
max(ep_num) *
(&m_t2_distinct - count(case when frequency > 1 then 1 end))
)       new_density
from    f2
)
;

RANGE_LOW RANGE_HIGH AVG_T1_FREQ NEW_DENSITY
---------- ---------- ----------- -----------
2         25          .5     .021875

```

This query finds the overlap by querying the two histograms and reporting the lower high value and higher low value. It also reports the minimum frequency from the frequency histogram and divides by 2, and calculates the number of non-popular rows divided by the total number of rows and the number of distinct non-popular values. (Note that I’ve picked up the number of distinct values in t2.j2 as a substituion variable generated by one of my earlier queries.) In my full script this messy piece of code runs before the query that showed I showed earlier on that told us how well (or badly) the two histograms matched.

Finally we can use the various values we’ve picked up in a slightly more complex version of the full outer join – with a special row added through a union all to give us our the estimate:

```
break on report skip 1
compute sum of product on report

with f1 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T1'
and     column_name = 'J1'
),
f2 as (
select
table_name,
endpoint_value                                                            value,
endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency,
endpoint_number
from
user_tab_histograms
where
table_name  = 'T2'
and     column_name = 'J2'
),
join1 as (
select
f1.value t1_value, f2.value t2_value,
f1.frequency t1_frequency, f2.frequency t2_frequency,
f2.frequency *
case
when f2.frequency > 1 and f1.frequency is not null
then f1.frequency
when f2.frequency > 1 and f1.frequency is null
then &m_avg_t1_freq
end *
&m_t2_bucket_size        product
from
f1
full outer join
f2
on
f2.value = f1.value
where
coalesce(f1.value, f2.value) between &m_low and &m_high
order by
coalesce(f1.value, f2.value)
)
select  *
from    join1
union all
select
null,
&m_avg_t2_dens,
&m_unmatch_ct,
&m_t2_rows * &m_avg_t2_dens,
&m_t2_rows * &m_avg_t2_dens * &m_unmatch_ct
from
dual
;

T1_VALUE   T2_VALUE T1_FREQUENCY T2_FREQUENCY     PRODUCT
---------- ---------- ------------ ------------ -----------
2                       5
5                      15
7                      15
10                      17
12                      13
14                         1
15                      13
17         17           11            1
18                         1
19                         1
20         20            7            1
21                         2       40.00
22         22            3            1
23                         1
24                         2       40.00
25         25            1            2       80.00
.021875           99         17.5    1,732.50
-----------
sum                                                1,892.50

```

It remains only to check what the optimizer thinks the cardinality will be on a simple join, and then modify the data slightly to see if the string of queries continues to produce the right result. Here’s a starting test:

```
set serveroutput off

alter session set statistics_level = all;
alter session set events '10053 trace name context forever';
alter session set tracefile_identifier='BASELINE';

select
count(*)
from
t1, t2
where
t1.j1 = t2.j2
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

alter session set statistics_level = typical;
alter session set events '10053 trace name context off';

COUNT(*)
----------
1327

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  f8wj7karu0hhs, child number 0
-------------------------------------
select         count(*) from         t1, t2 where         t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |      41 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |      41 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   1893 |   1327 |00:00:00.01 |      41 |  2545K|  2545K| 1367K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |       7 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |       7 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

```

The E-rows for the hash join operation reports 1893 – and a quick check of the 10053 trace file shows that this is 1892.500000 rounded – a perfect match for the result from my query. I’ve modified the data in various ways (notably updating the t1 table to change the value 25 (i.e. the current maximum value of j1) to other, lower, values) and the algorithm in the script seems to be sound – for 12c and 18c. I won’t be surprised, however, if someone comes up with a data pattern where the wrong estimate appears.

Don’t look back

Upgrades are a pain. With the same data set and same statistics on 11.2.0.4, running the same join query between t1 and t2, here’s the execution plan I got:

```
SQL_ID  f8wj7karu0hhs, child number 0
-------------------------------------
select         count(*) from         t1, t2 where         t1.j1 = t2.j2

Plan hash value: 906334482

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |      12 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |      12 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |   1855 |   1327 |00:00:00.01 |      12 |  2440K|  2440K| 1357K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |       6 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    800 |    800 |00:00:00.01 |       6 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"."J1"="T2"."J2")

```

Notice that the E-rows value is different. The join cardinality algorithm seems to have changed in the upgrade from 11.2.0.4 to 12c. I haven’t quite figured out how to get to the 11g result, but I seem to get quite close most of the time by making a simple change to the final query that I used to predict the optimizer’s estimate. In the case expression that chooses between the actual t1.j1 frequency and the “average frequency” don’t choose, just use the latter:

```
case
when f2.frequency > 1 and f1.frequency is not null
-- then f1.frequency    -- 12c
then &m_avg_t1_freq     -- 11g
when f2.frequency > 1 and f1.frequency is null
then &m_avg_t1_freq
end *

```

As I modified the t1 row with the value 25 to hold other values this change kept producing results that were exactly 2, 2.5, or 3.0 different from the execution plan E-Rows – except in one case where the error was exactly 15.5 (which looks suspiciously like 17.5: the “average frequency in t2” minus 2). I’m not keen to spend time trying to work out exactly what’s going on but the takeaway from this change is that anyone upgrading from 11g to 12c may find that some of their queries change plans because they happen to match the type of example I’ve been working with in this post.

In some email I exchanged with Chinar Aliyev, he suggested three fix-controls that might be relevant. I’ve added these to an earlier posting I did when I first hit the anomaly a few days ago but I’ll repeat them here. I will be testing their effects at some point in the not too distant future:

```14033181 1 QKSFM_CARDINALITY_14033181   correct ndv for non-popular values in join cardinality comp.         (12.1.0.1)
19230097 1 QKSFM_CARDINALITY_19230097   correct join card when popular value compared to non popular         (12.2.0.1)
22159570 1 QKSFM_CARDINALITY_22159570   correct non-popular region cardinality for hybrid histogram          (12.2.0.1)
```