# Histogram Limit

A surprising question came up on OTN a couple of days ago:

Why does a query for “column = 999999999999999999” run slower than a query for “column > 999999999999999998” (that’s 18 digit numbers, if you don’t want to count them). With the equality predicate the query is very slow, with the range-based predicate perfomance is good.

In the absence of further information there are various reasons why this is possible – but the example in question was about a “versioning” table where the single very large value was used as the “not yet ended” value for the history of an id so, at a minimum, the table held columns (id, nstart, nend, other), and each id could appear many times with pairs of start and end values that supplied non-overlapping, covering ranges and one row that had the very large number as the end value.

Let’s jot down a few ideas about what the data (and stats) might look like.

Assuming every id appears “a few” time and every id has to have a “still valid” row this means that a very large fraction (say 10% to 25%, if “a few” means 4 to 9) of the rows hold the value 999999999999999999.

If you gather stats without a histogram then you should get the low and high, finding that the high is 999999999999999999 and that the range is enormous, and so the predicates “column = {high value}” and “column > {high value} – 1” should give very similar cardinalities.

If you collect stats with a histogram you should find the very popular high value even in a very small sample set (which is what happens with histogram collection in 11g, and even in 12c for hybrid histograms). In this case the histogram should spot the significance of the high value and again the two predicates should have very similar cardinalities.

At first sight there doesn’t seem to be a feasible way that the two cardinalities could be sufficiently different to cause a problem – so maybe there’s something about character conversion or maybe bind variable usage that hasn’t been mentioned. So to test a couple of the less likely ideas I built a data set using 11.2.0.4 – and found a bug:

```
create table t1 (
id              number(18),
nend            number(18),
n1              number(18),
small_vc        varchar2(10)
)
nologging
;

insert /*+ append */ into t1
with generator as (
select  --+ materialize
rownum id
from dual
connect by
level <= 1e4
)
select
rownum -1               id,
case mod(rownum - 1,4)
when 0  then 999999999999999999
--              when 0  then 999999999999999
else mod(rownum - 1, 250000)
end                     nend,
rownum - 1              n1,
from
generator       v1,
generator       v2
where
rownum <= 1e6 ; commit; select * from t1 where nend = 0; begin dbms_stats.gather_table_stats( ownname => user,
tabname          =>'T1',
method_opt       => 'for all columns size auto'
);
end;
/

```

I have an nend column that is set to 999999999999999999 every 4th row in the table and otherwise has 4 rows per value for 187,500 other (relatively low) values. It’s probably a reasonable initial model of the original data. I’ve run a query with a predicate referencing nend before gathering stats so that the (default) auto option will build a histogram for nend. Then I’ve checked the execution plans for two critical queries:

```
explain plan for
select  *
from    t1
where   nend = 999999999999999999
;

select * from table(dbms_xplan.display);

explain plan for
select  *
from    t1
where   nend > 999999999999999998
;

select * from table(dbms_xplan.display);

```

And here’s the surprise – the two plans, in order:

```--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     4 |   108 |   625   (9)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| T1   |     4 |   108 |   625   (9)| 00:00:04 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("NEND"=999999999999999999)

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   250K|  6591K|   627   (9)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| T1   |   250K|  6591K|   627   (9)| 00:00:04 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("NEND">999999999999999998)

```

Clearly this should not happen – the equality test is way off, the inequality test is correct. The obvious first guess is that something funny has happened with the statistics so let’s see what they look like – the column stats (user_tab_cols) and the histogram stats (user_tab_histograms) seem like a good starting point:

```
select
sample_size, num_distinct, histogram, num_buckets,
substr(low_value,1,26) low_value, substr(high_value,1,26) high_value
from
user_tab_cols
where
table_name = 'T1'
and     column_name = 'NEND'
;

column endpoint_value format 999,999,999,999,999,999,999

select
endpoint_number, endpoint_value -- , endpoint_repeat_count
from
user_tab_histograms
where
table_name = 'T1'
and     column_name = 'NEND'
order by
endpoint_number
;

```

And here are the results (with a couple of hundred uninteresting rows eliminated from the histogram):

```
Sample     Distinct HISTOGRAM          Buckets LOW_VALUE                  HIGH_VALUE
------------ ------------ --------------- ---------- -------------------------- --------------------------
5,541      190,176 HEIGHT BALANCED        254 C102                       C9646464646464646464

ENDPOINT_NUMBER               ENDPOINT_VALUE
--------------- ----------------------------
0                           19
1                        1,225
2                        2,503
3                        3,911
4                        4,806
...
188                      247,479
189                      248,754
190                      249,862
254    1,000,000,000,000,000,000

```

Oracle will have started with a 100% sample to collect stats on all the columns, but taken a small sample to test the need for a histogram on the nend column – and that’s why the sample size of 5,541 has appeared, but that’s not relevant to the problem in hand. The big question comes from endpoint_number 254 – why is the highest value in the histogram 1e19 when we know (and the column stats show) that the highest value is actually 999999999999999999 ?!

It’s a question to which I don’t have an answer – but I do know that

• if your high value is 15 digits long (all 9s) then the histogram shows the right high value
• if your high value is more that 15 9s then the histogram shows the high value plus 1
• the value collected by the query that Oracle runs is the actual value (i.e. 18 9s)
• if you use set_column stats to set 18 9s as the high value you still get 1e19 in the histogram

Once you see the stats you can understand why the OP sees the odd performance problem. If the histogram identifies 1e19 as a (very) popular value, leaving 999999999999999 as an “average” value with only 4 rows; on the other hand the query for greater than 999999999999999998 can see that there really are 250K rows with higher values.

### Footnote:

Interestingly 12c does the same with the stats – introducing the 1e19 in the histogram – but still manages, somehow, to calculate the correct cardinality in the equality case. (There is one slight difference in 12c, the histogram is a hybrid histogram, not a height-balanced histogram).

### Update 24th Oct 2015:

I realised late last night that I had written about this behaviour before – though possible just as a response on OTN or in some unpublished notes. According to the notes in a script called histogram_numeric_bug.sql that I found on my laptop this is “Bug 18514507 : WRONG CARDINALITY ESTIMATES WHEN NUMERIC VALUE IS LONGER THAN 15 BYTES”. (except that the base bugs that that one is linked to are probably not the same bug).

I suspect the problem is related to the way that character histograms are built based on a numeric representation of the string that takes the first few characters of the string, treats that resulting N bytes as a hex number, converts to decimal and then applies round(N,-14) to restrict the precision stored. The effect with strings is that (broadly speaking) you get fifteen digits precision – which is exactly what I seem to be seeing with numbers.

The reason that 12c can get the right answers despite storing the wrong endpoint_value is that it’s also storing to_char() of the right value as the endpoint_actual_value – possibly doing this any time it has had to round the endpoint_value (for character strings the endpoint_actual_value was only populated if two entries in the endpoint_value were the same).