Search

Top 60 Oracle Blogs

Recent comments

Troubleshooting

An anecdote with a moral.

Many years ago – in the days of Oracle 7.2.3, when parallel query and partition views were in their infancy and when RAC was still OPS (Oracle Parallel Server), I discovered a bug that caused parallel queries against partition views to crash (Maxim – mixing two new features is a good way to find bugs). I no longer recall the details of exact syntax but the problem revolved around the way Oracle used to rewrite queries for parallel execution. In outline it was something like the following:


create or replace view v1
as
select * from t1
union all
select * from t2
;

select  /*+ parallel(v1 2) */ * 
from    v1 
where   pv_col between 1 and 10
and     date_col = to_date('1-Apr-1999','dd-mm-yyyy')

If you had followed the rules about partition views then Oracle would generate some code that managed to combine the partitioning definitions with the view definition and query predicates and come up with rewritten code for the parallel query slaves that looked something like (e.g.)

select  {list of columns}
from    t1
where   pv_col between 1 and 10 
and     pv_col >= 0 and pv_col < 3000 -- > comment to avoid wordpress format issue 
and     date_col = to_date(:SYS_B1,'dd-mm-yyyy') 
union all 
select  {list of columns} 
from    t2 
where   pv_col between 1 and 10 
and     pv_col >= 3000 and pv_col < 6000 -- > comment to avoid wordpress format issue
and     date_col = to_date(:SYS_B2,'dd-mm-yyyy')

In this case I’ve defined my partitions (using constraints on the underlying tables) so that t1 will hold rows where pv_col >= 0 and pv_col < 3000, and t2 will hold rows where pv_col >= 3000 and pv_col < 6000. As you can see the optimizer has expanded the query with the view text and  pulled the constraints into the query and will be able to bypass all access to t2 because it can reduce the pv_col predicates on t2 into the contradiction “1 = 0”. Here’s the basic form of the execution plan we’d be looking for with partition elimination:


-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |    15 |  1500 |     2   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |    15 |  1500 |     2   (0)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    VIEW                 | V1       |    15 |  1500 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   4 |     UNION-ALL           |          |       |       |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR  |          |    12 |  1308 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS FULL | T1       |    12 |  1308 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  7 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   8 |       PX BLOCK ITERATOR |          |     1 |   109 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|*  9 |        TABLE ACCESS FULL| T2       |     1 |   109 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------


Note the FILTER at operation 7 – the predicate there is actually “NULL IS NOT NULL” for new versions of Oracle, but would have been “1 = 0” for older versions, so operations 8 and 9 would not get executed.

Notice, also, that my date predicate changes from a literal to a bind variable. This just happened to be the way things were done for parallel query in that version of Oracle. So now we can get to the bug. Our data was actually partitioned by day using a date(-only) column, and the initial target was to keep two years of data. The trouble was that certain queries kept crashing with Oracle error “ORA-12801: error signaled in parallel query server”.

Naturally I created a simplified model (smaller tables, fewer partitions in the views) to try and track down the problem – and the problem disappeared. So I took a crashing query from production, and started creating partition views with fewer and fewer table until the query stopped crashing, and what I discovered was the following:

  • If you had 2 dates in the query it crashed if the view held 128 or more tables
  • If you had 3 dates in the query it crashed if the view held 86 or more tables

Think about the arithmetic for a moment: 2 * 128 = 256, 3 * 86 = 258. Does this give you a clue ?

What it suggested to me was that someone in Oracle Development had used a single byte to index the array of bind variables that they defined for use with the generated parallel query text, so when you had 2 dates and needed 256 bind variable the counter overflowed, when you had 3 dates and needed 258 bind variables the counter overflowed. Having made the hypothesis I predicted that a query would crash if there were 4 dates and 64 partitions, but survive if there were only 63 partitions. (And I was right.)

When I called this in to Oracle support (who remembers those early days when you were supposed to “phone” Oracle support on a “land-line”) and suggested the source of the problem I was told that there was “no limit on the number of bind variables a query could handle in Oracle”. Notice how this is essentially a true statement – but has nothing to do with my suggestion.

Several months (maybe even a couple of years) later – long after the client had upgraded to 7.3.2 then 7.3.4 and I was no long on site – I got a call from Oracle support who wanted to close the TAR (as SR’s used to be known) because they’d discovered the problem and it was … see above. I got the feeling that no-one had considered my suggestion for a long time because they “knew” it had to be  wrong.

The moral(s) of the story

  • Listen to the question carefully – you may not  have heard what you were assuming you would hear.
  • Listen to the answer carefully – it may sound like a convincing response to your question while not being relevant to the actual question.
  • “It’s not supposed to do that” isn’t the same as “That didn’t happen” or (to mis-quote a well-known philosophical problem): “you can’t turn an ‘ought not’ into a ‘did not”

One thing that’s worth emphasising is that everyone (and that does include me) will occasionally hear what they’re expecting to hear and completely misunderstand the point of the question. So when someone says something which is clearly very silly pause for thought then ask them, with care and precision, if they just said what you thought they said – maybe what they said and what you heard were very different. (The same is true for twitter, list servers, forums etc. – it’s very easy to misinterpret a short piece of text, and it may be the way it’s been written but it may be the way it’s been read.)