Oakies Blog Aggregator

Friday Philosophy – Robots Rising & Tech Taking Over?

Today I saw some cracking photographs of a lighthouse. How many of us have at some point wondered if it might be nice to be a lighthouse keeper? The solitude, the scenery, the stoic fortitude in the face of the storm – quite literally. (Thank you Brendan Tierney for the photo I’ve stolen from him).

It’s an odd job, lighthouse keeper; it holds a special place in Western culture and literature. A job for those a little apart from society and yet dedicated to the betterment of mankind. I suspect a lot of people in I.T. (and the wider community) find a resonance in that, as so many of us are a little bit apart and yet intelligent & caring.

Well, you can’t be a lighthouse keeper anymore. At least, in the UK you can’t. Check out this web site about UK lighthouses and lighthouse keeping. That job, that vocation, was handed over to automated I.T. systems a few years ago – effectively handed to robots & technology. You might think you know where I am going with this, and initially you will be right, but bear with me.

I’ve been thinking a lot over the last 2 or 3 years about the increasing use of technology and robotics to do tasks that we humans have been doing. An obvious one is autonomous driving vehicles, where the I.T. smarts and sensors are leaping along incredibly fast. I am in a long-running “argument” with a friend about when fully autonomous vehicles will be a reality on public roads. He says under 5 years; I think it is more (I started saying more than 5 years to him in 2016 so, giving some leeway, I say not before December 2021 will we see fully autonomous vehicles driving from one town centre to another, sharing lanes with human drivers – specific enough, Neil?). But self-driving vehicles will be safer than humans “soon”, and cheaper than employing humans, so companies will swap to them. That will end a lot of employment.

I know others have pointed this out, and it is not as if history isn’t almost a continuous tale of technology assisting or replacing human effort. Tolpuddle martyrs, dark satanic mills and all that. Industrialisation of farming has put a lot of farm labourers out of work, but we could not feed the current mass of humanity without it. People move on to new tasks.


But the difference now is not that we are handing jobs to a slightly better automated system where we still need some human control, we are removing the human element. And we are doing this in a lot of areas in a very short space of time. Factories are becoming far more automated, we order our goods online and huge conveyor robotic systems are being built to do the packing, with fewer people involved and lower long-term costs.

But it’s not just physical tasks that are being automated. Genetic algorithms, neural nets, deep data and machine learning are starting to replace tasks that have needed human interaction. Chatbots are getting smarter, to the point where they are used by companies as first-line support (often laughably poorly at present, but it is getting better – and Oracle do have an interest, as was covered in Oracle Scene recently; sorry, that link might not work for long). Expert systems have been developed that can judge simple court cases such as parking fines and beat humans at spotting pre-cancerous cells in tissue samples.


Oracle and the Bots

We now see expert computer systems breaking a lot of barriers and doing things that until now have been deemed uniquely human cerebral tasks. Computers won at playing chess 10+ years ago, they triumphed in “Go” last year and now they can win at versions of Poker where you are not sure of the data and have to read the play of your opponent – in effect second guess a human. Currently all these systems are very expensive, highly focused and specific to a task, built on huge data sets and using fine-tuned sets of algorithms, to do one task. We have nothing as generally capable as a 5 year old child or even a dog.

Only, we keep building systems that are better and better at specific tasks.

So why do I say this bothers me but not in the way you would expect? It’s because I keep seeing “thought leaders” present the same denial of the impact on us in I.T. of the systems we as an industry are developing – platitudes that we are a special case and will be OK. Several times over the last couple of years I have seen some utter pillock in a suit from upper management telling a crowd of I.T. experts that we will be just fine, as we are smart: we can stop doing the easy tasks and concentrate on the harder ones, use our brains more.

This is balls for two reasons. Firstly:

What about everyone who is below smart?

Most of us in I.T. are not only of above-average intelligence (probably IQs of 125 and upwards), we are surrounded by similarly smart people. Our life partners are generally of above-normal intelligence, we work in teams who are above-average smart, and we probably mostly socialise with generally intelligent people (as a raft of psychological studies show, we gravitate to those at a similar IQ to ourselves, irrespective of where we are on the scale). Even the end users we abuse tend to be of above-average intelligence. I suspect that most of us somehow don’t “get” that well over 60% of people are not only less intelligent than we are but have few options if our society passes the jobs they can do to computers and robots. And they are not that likely to be philosophical about having no point to their lives and being poorer. They’re probably going to be very angry, very poor and pretty pissed off with smart-arses who say that “we are OK” – and there are a lot, lot more of them than us.

And that leads to the second reason it is balls.

The smart work will also be doable by Tech

As I’ve said already, we can already create technological systems that beat us at specific cerebral tasks, and there is going to be a smaller and smaller pool of work for highly intelligent workers. Let’s face it, a lot of what we do now in I.T. is drudge and boring; there is not really that much smart work that needs doing, even in this industry stacked with us smart people. And doing work that really needs you to be smart is tiring (well, I find it tiring!). Our work in I.T. tends to be logic-based – and what are computers good at? We will just have a breathing space before our work is also more cheaply done by computers.

I’m annoyed, as I think those of us who are involved in this revolution are being told a deluded lie that we will be OK if it pans out like I have just said. Those extra 25+ IQ points are not going to keep us special for very long.

So if computers can drive the taxis & lorries, manage the steel works, build the cars, derive the best drug treatment and give the perfect injection (yep, theoretically a robot already wins on that), what do we as humans do?

Only a few people can be “utterly human” – artists, poets and philosophers. And we do not need 7 billion of them anyway.

We could try and legislate against it, tax robots hard. But those who make a lot of money already run the “free market economy” and will continue to do so. If robots and computer programs do tasks more cheaply, companies that use robots will rise to the top of any monetary-based society, i.e. a capitalist society. What will change what has been in place for 100+ years? I can’t see the currently rich and powerful objecting to working methods that increase their wealth, even if it means more and more poorer people.

Some argue for a basic living wage to keep us all alive – fed, warm and basic healthcare whilst machines do the work. That would give us that often cited nirvana of being free to do “what we want”. But if you have no job, what do you do? Again, for those of us with high IQ we can maybe come up with things to do. Maybe. I seem to be relatively happy being semi-retired, but I’ve done a lot of stuff and had my time of striving to achieve. And still do. But how about those who are IQ 100 and below? I suspect entertaining yourself is not as easy. I think anger, resentment and the feeling of being disenfranchised is just going to continue increasing. I think it’s why the UK is leaving Europe and why the US has an egotistical man-child as president. More and more normal people are unhappy with their lot and see no good future – so they vote for a change. ANY change. Even if it is crazy change.

I know, not a very happy Friday Philosophy. Will someone please tell me it will all be OK? And I mean OK for everyone, not just us “smart” people.

12.2 New Feature: the FLEX ASM disk group part 1

I knew about the 12.2 FLEX ASM disk group type from other presenters, but until now – when researching the feature for the upcoming DOAG HA Day – I hadn’t been able to appreciate how cool this is. And I really think it is pretty cool and worth sharing! There is a lot to be said about the feature and these tests, which is why I am splitting it into multiple parts.

Please be aware that this post is about my lab experiments, I have no production experience with FLEX ASM disk groups. As with all new features it might take a while to mature, so test, test, test…

Background

In previous versions of Oracle Grid Infrastructure/Automatic Storage Management (ASM), especially in consolidation environments, certain operations I would have liked to perform were not easily possible. Most properties of a disk group – such as redundancy levels – are valid for all files that reside within it. For example, if you wanted normal redundancy for some databases, and high redundancy for others, you typically ended up with two “data” disk groups. The same goes for +RECO. I can think of scenarios where “filesystem-like” attributes within a disk group would have been nice. In the build-up to ASM 12.2, Oracle steadily increased ASM limits to allow for more customisation, and in 12.1 you really could go a little over the top: you could have 63 ASM disk groups in 11.2 and up to 511 in 12.1. Although this should allow for plenty of customisation, it adds a little maintenance overhead.

A more granular option to manage storage came with 12.1 and the introduction of Container Databases (CDBs). As part of the Pluggable Database’s creation, the administrator could specify a pdb_storage_clause as in this very basic example:

SQL> create pluggable database pdb1 
  2   admin user someone identified by somepassword
  3    ...
  4   storage (maxsize 200G);

Pluggable database created.
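As a side note, the limit set by the pdb_storage_clause should then be visible from within the PDB. A sketch of how one might check it – the MAX_PDB_STORAGE property name is as I recall it from the documentation, and I have not verified this exact query here:

```sql
SQL> alter session set container = pdb1;

Session altered.

SQL> SELECT property_value
  2    FROM database_properties
  3   WHERE property_name = 'MAX_PDB_STORAGE';
```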

However, if the database was created within a disk group with high redundancy, all files residing in that disk group inherited that property. I couldn’t define a PDB with normal redundancy in a high redundancy disk group, at least I wasn’t aware of a way to do so in 12.1.

Flex Disk Group

A Flex Disk Group promises fine-grained management of data within the disk group. You can also enforce quotas in a disk group (probably most useful at database level), and you can define properties such as redundancy settings per file type (also per database or Pluggable Database). In other words, you can now have a disk group containing, for example, 2 databases, of which database 1 uses normal redundancy and database 2 uses high redundancy. If database 2 is a Container Database (CDB), you can even manage settings as far down as the PDB level. A few new concepts need introducing before that can happen. Let’s begin with the most essential part: the new Flex ASM disk group. There is a variation on the theme for extended distance clusters, which is not in the scope of this article series.
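To give a flavour of where this is heading: the per-database settings are expressed via the new file group and quota group objects. A minimal sketch based on my reading of the 12.2 SQL reference – the object names qg1 and db1 are made up, and I haven’t run these exact commands in the lab:

```sql
-- Cap everything assigned to quota group QG1 at 20 GB (name made up)
SQL> ALTER DISKGROUP flex ADD QUOTAGROUP qg1 SET 'quota' = '20G';

-- Put database DB1's file group into that quota group
SQL> ALTER DISKGROUP flex MODIFY FILEGROUP db1 SET 'quota_group' = 'qg1';

-- High redundancy for DB1's datafiles only, other file types unaffected
SQL> ALTER DISKGROUP flex MODIFY FILEGROUP db1 SET 'datafile.redundancy' = 'high';
```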

In my lab system, I have created a new ASM disk group of flex redundancy. The system I am using is a two-node RAC running 12.2.0.1.170620 on Oracle Linux 7.3 with UEK4. I called the disk group FLEX, here is the command used for its creation:

CREATE DISKGROUP FLEX FLEX REDUNDANCY  
 DISK 'AFD:FLEX1' 
 DISK 'AFD:FLEX2'   
 DISK 'AFD:FLEX3' 
 DISK 'AFD:FLEX4'  
 DISK 'AFD:FLEX5'  
ATTRIBUTE 
 'compatible.asm'='12.2.0.1',
 'compatible.rdbms'='12.2.0.1',
 'compatible.advm'='12.2.0.1',
 'au_size'='4M';

Note the use of the ASM Filter Driver, which I am testing as part of my lab setup. It’s also enabled by default when you install ASM 12.2. Looking at the code example I realise now that the disk group name is maybe not ideal… The important bits in the example are the use of “FLEX REDUNDANCY”, the 5 implicit failure groups, and the compatibility settings that need to be 12.2.

The documentation (Automatic Storage Management Administrator’s Guide, chapter “Managing Oracle ASM Flex Disk Groups”) states that a Flex disk group generally tolerates the loss of 2 failure groups (FGs). The same bullet point then elaborates that at least 5 failure groups are needed to absorb the loss of 2 FGs. The minimum number of FGs within a Flex disk group is 3.

If you now get all excited about this new feature, there is one Big Caveat: you need a 12.2 RDBMS instance to use the feature.

The command results in a flurry of activity in the ASM instance. I have captured the output from initiation to completion because it’s quite interesting to see what happens in 12.2 ASM when you create a new disk group. Feel free to scroll past the listing if you aren’t interested in the finer details.

SQL> CREATE DISKGROUP FLEX FLEX REDUNDANCY  DISK 'AFD:FLEX1' SIZE 10239M
 DISK 'AFD:FLEX2' SIZE 10239M
 DISK 'AFD:FLEX3' SIZE 10239M
 DISK 'AFD:FLEX4' SIZE 10239M
 DISK 'AFD:FLEX5' SIZE 10239M
 ATTRIBUTE 'compatible.asm'='12.2.0.1','compatible.rdbms'='12.2.0.1','compatible.advm'='12.2.0.1','au_size'='4M'
NOTE: Assigning number (5,0) to disk (AFD:FLEX1)
NOTE: Assigning number (5,1) to disk (AFD:FLEX2)
NOTE: Assigning number (5,2) to disk (AFD:FLEX3)
NOTE: Assigning number (5,3) to disk (AFD:FLEX4)
NOTE: Assigning number (5,4) to disk (AFD:FLEX5)
2017-07-03 10:38:53.811000 +01:00
NOTE: initializing header (replicated) on grp 5 disk FLEX1
NOTE: initializing header (replicated) on grp 5 disk FLEX2
NOTE: initializing header (replicated) on grp 5 disk FLEX3
NOTE: initializing header (replicated) on grp 5 disk FLEX4
NOTE: initializing header (replicated) on grp 5 disk FLEX5
NOTE: initializing header on grp 5 disk FLEX1
NOTE: initializing header on grp 5 disk FLEX2
NOTE: initializing header on grp 5 disk FLEX3
NOTE: initializing header on grp 5 disk FLEX4
NOTE: initializing header on grp 5 disk FLEX5
NOTE: Disk 0 in group 5 is assigned fgnum=1
NOTE: Disk 1 in group 5 is assigned fgnum=2
NOTE: Disk 2 in group 5 is assigned fgnum=3
NOTE: Disk 3 in group 5 is assigned fgnum=4
NOTE: Disk 4 in group 5 is assigned fgnum=5
GMON updating for reconfiguration, group 5 at 657 for pid 45, osid 25857
NOTE: group 5 PST updated.
NOTE: initiating PST update: grp = 5
GMON updating group 5 at 658 for pid 45, osid 25857
NOTE: set version 0 for asmCompat 12.2.0.1.0 for group 5
NOTE: group FLEX: initial PST location: disks 0000 0001 0002 0003 0004
NOTE: PST update grp = 5 completed successfully
NOTE: cache registered group FLEX 5/0x0A58F009
NOTE: cache began mount (first) of group FLEX 5/0x0A58F009
NOTE: cache is mounting group FLEX created on 2017/07/03 10:38:52
NOTE: cache opening disk 0 of grp 5: FLEX1 label:FLEX1
NOTE: cache opening disk 1 of grp 5: FLEX2 label:FLEX2
NOTE: cache opening disk 2 of grp 5: FLEX3 label:FLEX3
NOTE: cache opening disk 3 of grp 5: FLEX4 label:FLEX4
NOTE: cache opening disk 4 of grp 5: FLEX5 label:FLEX5
* allocate domain 5, valid ? 0
kjbdomatt send to inst 2
NOTE: attached to recovery domain 5
NOTE: cache creating group 5/0x0A58F009 (FLEX)
NOTE: cache mounting group 5/0x0A58F009 (FLEX) succeeded
NOTE: allocating F1X0 (replicated) on grp 5 disk FLEX1
NOTE: allocating F1X0 (replicated) on grp 5 disk FLEX2
NOTE: allocating F1X0 (replicated) on grp 5 disk FLEX3
NOTE: allocating F1X0 on grp 5 disk FLEX1
NOTE: allocating F1X0 on grp 5 disk FLEX2
NOTE: allocating F1X0 on grp 5 disk FLEX3
2017-07-03 10:38:56.621000 +01:00
NOTE: Created Used Space Directory for 1 threads
NOTE: Created Virtual Allocation Locator (1 extents) and Table (5 extents) directories for group 5/0x0A58F009 (FLEX)
2017-07-03 10:39:00.153000 +01:00
NOTE: VAM migration has completed for group 5/0x0A58F009 (FLEX)
NOTE: diskgroup must now be re-mounted prior to first use
NOTE: cache dismounting (clean) group 5/0x0A58F009 (FLEX)
NOTE: messaging CKPT to quiesce pins Unix process pid: 25857, image: oracle@rac122pri1 (TNS V1-V3)
2017-07-03 10:39:01.805000 +01:00
NOTE: LGWR not being messaged to dismount
kjbdomdet send to inst 2
detach from dom 5, sending detach message to inst 2
freeing rdom 5
NOTE: detached from domain 5
NOTE: cache dismounted group 5/0x0A58F009 (FLEX)
GMON dismounting group 5 at 659 for pid 45, osid 25857
GMON dismounting group 5 at 660 for pid 45, osid 25857
NOTE: Disk FLEX1 in mode 0x7f marked for de-assignment
NOTE: Disk FLEX2 in mode 0x7f marked for de-assignment
NOTE: Disk FLEX3 in mode 0x7f marked for de-assignment
NOTE: Disk FLEX4 in mode 0x7f marked for de-assignment
NOTE: Disk FLEX5 in mode 0x7f marked for de-assignment
SUCCESS: diskgroup FLEX was created
NOTE: cache deleting context for group FLEX 5/0x0a58f009
NOTE: cache registered group FLEX 5/0x4718F00C
NOTE: cache began mount (first) of group FLEX 5/0x4718F00C
NOTE: Assigning number (5,0) to disk (AFD:FLEX1)
NOTE: Assigning number (5,1) to disk (AFD:FLEX2)
NOTE: Assigning number (5,2) to disk (AFD:FLEX3)
NOTE: Assigning number (5,3) to disk (AFD:FLEX4)
NOTE: Assigning number (5,4) to disk (AFD:FLEX5)
2017-07-03 10:39:08.161000 +01:00
NOTE: GMON heartbeating for grp 5 (FLEX)
GMON querying group 5 at 663 for pid 45, osid 25857
NOTE: cache is mounting group FLEX created on 2017/07/03 10:38:52
NOTE: cache opening disk 0 of grp 5: FLEX1 label:FLEX1
NOTE: 07/03/17 10:39:07 FLEX.F1X0 found on disk 0 au 10 fcn 0.0 datfmt 1
NOTE: cache opening disk 1 of grp 5: FLEX2 label:FLEX2
NOTE: 07/03/17 10:39:07 FLEX.F1X0 found on disk 1 au 10 fcn 0.0 datfmt 1
NOTE: cache opening disk 2 of grp 5: FLEX3 label:FLEX3
NOTE: 07/03/17 10:39:07 FLEX.F1X0 found on disk 2 au 10 fcn 0.0 datfmt 1
NOTE: cache opening disk 3 of grp 5: FLEX4 label:FLEX4
NOTE: cache opening disk 4 of grp 5: FLEX5 label:FLEX5
NOTE: cache mounting (first) flex redundancy group 5/0x4718F00C (FLEX)
* allocate domain 5, valid ? 0
kjbdomatt send to inst 2
NOTE: attached to recovery domain 5
start recovery: pdb 5, passed in flags x4 (domain enable 0)
validate pdb 5, flags x4, valid 0, pdb flags x204
* validated domain 5, flags = 0x200
NOTE: cache recovered group 5 to fcn 0.0
NOTE: redo buffer size is 512 blocks (2105344 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 5 (FLEX)
NOTE: LGWR found thread 1 closed at ABA 0.11262 lock domain=0 inc#=0 instnum=0
NOTE: LGWR mounted thread 1 for diskgroup 5 (FLEX)
NOTE: setting 11.2 start ABA for group FLEX thread 1 to 2.0
NOTE: LGWR opened thread 1 (FLEX) at fcn 0.0 ABA 2.0 lock domain=5 inc#=12 instnum=1 gx.incarn=1192816652 mntstmp=2017/07/03 10:39:08.437000
NOTE: cache mounting group 5/0x4718F00C (FLEX) succeeded
NOTE: cache ending mount (success) of group FLEX number=5 incarn=0x4718f00c
NOTE: Instance updated compatible.asm to 12.2.0.1.0 for grp 5 (FLEX).
NOTE: Instance updated compatible.asm to 12.2.0.1.0 for grp 5 (FLEX).
NOTE: Instance updated compatible.rdbms to 12.2.0.1.0 for grp 5 (FLEX).
NOTE: Instance updated compatible.rdbms to 12.2.0.1.0 for grp 5 (FLEX).
SUCCESS: diskgroup FLEX was mounted
NOTE: diskgroup resource ora.FLEX.dg is online
SUCCESS: CREATE DISKGROUP FLEX FLEX REDUNDANCY  DISK 'AFD:FLEX1' SIZE 10239M
 DISK 'AFD:FLEX2' SIZE 10239M
 DISK 'AFD:FLEX3' SIZE 10239M
 DISK 'AFD:FLEX4' SIZE 10239M
 DISK 'AFD:FLEX5' SIZE 10239M
 ATTRIBUTE 'compatible.asm'='12.2.0.1','compatible.rdbms'='12.2.0.1','compatible.advm'='12.2.0.1','au_size'='4M'
2017-07-03 10:39:09.429000 +01:00
NOTE: enlarging ACD to 2 threads for group 5/0x4718f00c (FLEX)
2017-07-03 10:39:11.438000 +01:00
SUCCESS: ACD enlarged for group 5/0x4718f00c (FLEX)
NOTE: Physical metadata for diskgroup 5 (FLEX) was replicated.
adrci> 

Quite a bit of activity for just one command… I checked the attributes of the disk group; they don’t seem too alien to me:

SQL> select name, value from v$asm_attribute 
  2   where group_number = 5 
  3   and name not like 'template%';

NAME                           VALUE
------------------------------ ------------------------------
idp.type                       dynamic
idp.boundary                   auto
disk_repair_time               3.6h
phys_meta_replicated           true
failgroup_repair_time          24.0h
thin_provisioned               FALSE
preferred_read.enabled         FALSE
sector_size                    512
logical_sector_size            512
content.type                   data
content.check                  FALSE
au_size                        4194304
appliance._partnering_type     GENERIC
compatible.asm                 12.2.0.1.0
compatible.rdbms               12.2.0.1.0
compatible.advm                12.2.0.1.0
cell.smart_scan_capable        FALSE
cell.sparse_dg                 allnonsparse
access_control.enabled         FALSE
access_control.umask           066
scrub_async_limit              1
scrub_metadata.enabled         FALSE

22 rows selected.

I didn’t specify anything about failure groups in the create disk group command, hence I am getting 5 of them:

SQL> select name, os_mb, failgroup, path from v$asm_disk where group_number = 5;

NAME            OS_MB FAILGROUP                      PATH
---------- ---------- ------------------------------ --------------------
FLEX1           10239 FLEX1                          AFD:FLEX1
FLEX2           10239 FLEX2                          AFD:FLEX2
FLEX3           10239 FLEX3                          AFD:FLEX3
FLEX4           10239 FLEX4                          AFD:FLEX4
FLEX5           10239 FLEX5                          AFD:FLEX5

The result is the new disk group, a new part of my existing lab setup. As you can see in the following output, I went with a separate disk group for the Grid Infrastructure Management Repository (GIMR), named +MGMT. In addition I have a disk group named +OCR which (surprise!) I use for OCR and voting files, plus the usual suspects, +DATA and +RECO. Except for +FLEX, all of these are disk group types that have been available forever.

[oracle@rac122pri1 ~]$ asmcmd lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512             512   4096  4194304     20476    17068                0           17068              0             N  DATA/
MOUNTED  FLEX    N         512             512   4096  4194304     51180    50676                0               0              0             N  FLEX/
MOUNTED  EXTERN  N         512             512   4096  4194304     40956     6560                0            6560              0             N  MGMT/
MOUNTED  NORMAL  N         512             512   4096  4194304     15348    14480             5116            4682              0             Y  OCR/
MOUNTED  EXTERN  N         512             512   4096  4194304     15356    15224                0           15224              0             N  RECO/
[oracle@rac122pri1 ~]$ 

The values of 0 in req_mir_free_mb and usable_file_mb for +FLEX aren’t bugs; they are documented to be empty for this type of disk group. You need to consult other entities – to be covered in the next posts – to check the space usage for your databases.
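The entities in question are the new file group and quota group views. A sketch of where I expect to look – view and column names are taken from the 12.2 reference and not yet verified in this lab:

```sql
-- Quota usage per quota group in the FLEX disk group (group number 5 here)
SQL> SELECT name, quota_limit_mb, used_quota_mb
  2    FROM v$asm_quotagroup
  3   WHERE group_number = 5;

-- Space consumed per file group, i.e. per database/PDB
SQL> SELECT name, client_type, used_quota_mb
  2    FROM v$asm_filegroup
  3   WHERE group_number = 5;
```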

Wrap Up Part 1

Flex ASM disk groups are hugely interesting and worth keeping an eye on when you are testing Oracle 12.2. I admit that 12.2 is still quite new and, cautious person that I am, I won’t use it for production until the first major patch set comes out and I have tested it to death. I am also curious how the use of the ASM Filter Driver will change the way I work with ASM. It might be similar to ASMLib, but I have yet to work that out.

In the next part(s) I will cover additional concepts you need to understand in the context of Flex ASM disk groups.

Offline Analysis of ASH Data with ASHDUMP

From time to time I carry out offline analyses of ASH data. By that, I mean analyzing the ASH data without having access to the database instance that generated it. For that purpose, Oracle Database provides the ability to dump the content of the ASH buffer to a trace file, along with information on how to load it through SQL*Loader. The typical steps to move the data from the source to the destination database (ideally a destination database of exactly the same version as the source) are the following:

On the source database…

  • Check the number of ASH samples and the oldest entry (to make sure that the period you are looking for is in there)
SQL> SELECT sample_count, oldest_sample_time FROM v$ash_info;

SAMPLE_COUNT OLDEST_SAMPLE_TIME
------------ -------------------------------
       34997 06-JUL-17 08.31.23.294000000 AM
  • Dump the ASH buffer for the last n minutes (60 minutes in the example)
SQL> ALTER SYSTEM SET events 'immediate trace name ashdump level 60';
  • Get the name of the trace file containing the ASH data
SQL> SELECT value FROM v$diag_info WHERE name = 'Default Trace File';

VALUE
--------------------------------------------------------------------------
/u00/oracle/diag/rdbms/dba121/DBA121/trace/DBA121_ora_22501.trc

On the destination database…

  • Create a table named ASHDUMP
SQL> CREATE TABLE ashdump AS SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0;
  • Extract the SQL*Loader control file from the trace file
$ cd /u00/oracle/diag/rdbms/dba121/DBA121/trace/
$ sed -n '1,/^Step 2:/d;/^Step 3:/,$d;p' DBA121_ora_22501.trc > ashldr.ctl
  • Run SQL*Loader to insert ASH data into the ASHDUMP table
$ sqlldr control=ashldr.ctl data=DBA121_ora_22501.trc errors=1000000

Once the data is available in the ASHDUMP table, you can start the analysis. But, in general, you cannot execute against the ASHDUMP table exactly the same queries as the ones you would use with the V$ACTIVE_SESSION_HISTORY view. In fact, the structure and the information it provides are quite different.

To make the analysis easier I create a view (ASHDUMP_V) on top of the ASHDUMP table that is much closer to the V$ACTIVE_SESSION_HISTORY view. It allows me to run the same queries that I would use against V$ACTIVE_SESSION_HISTORY.
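For example, a top-events breakdown that would normally run against V$ACTIVE_SESSION_HISTORY should work unchanged against the view – assuming ASHDUMP_V exposes the usual column names; adjust the time window to your case:

```sql
-- Count wait-event samples in the period of interest
SQL> SELECT event, count(*) AS samples
  2    FROM ashdump_v
  3   WHERE session_state = 'WAITING'
  4     AND sample_time BETWEEN timestamp '2017-07-06 09:00:00'
  5                         AND timestamp '2017-07-06 10:00:00'
  6   GROUP BY event
  7   ORDER BY samples DESC;
```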

What does the view look like? It is not that simple. Hence, I provide it as a script that contains not only the CREATE VIEW statement but also the grants of the necessary privileges to the owner of the view. Note that the script has to be executed as SYS.

Here you can download the script for the last four versions (in general, every new version adds new columns):

Feel free to test them. In case you find a mismatch between my view and the V$ACTIVE_SESSION_HISTORY view, please let me know. There might be cases that I did not handle. In any case, note that some data is missing from the dump; therefore, not all information can be reconstructed.

In Memoriam

My mother died a few weeks ago after a couple of months in terminal care. One of my tasks while she was in care was to go through all her paperwork and while doing so I discovered a couple of stories from her past that she had typed out (remember type-writers?) about 30 years ago. I typed them up on my laptop and printed several copies to hand out for people to read at the tea-party we held for her old friends – of all ages, ranging from 15 to 99 – after the funeral; this seemed to give them a lot of pleasure and they found them so interesting that I decided to share them with a larger audience. So here’s the story, written by my mother in 1983, of her evacuation experience at the start of the 2nd world war when she was a little over 14 years old.

The Summer of 1939.

Reminiscences of Dorothy Kathleen Lewis (1925 – 2017)

There had been a lot of talk about a war coming. Adolf Hitler and his armies had marched into Austria and were threatening Poland. We had all been issued with gas masks – just in case we would need them – and emergency plans had been made to evacuate all children from the big cities.

During the school holidays I was taken by my parents, with my sister, to my mother’s home village of Llangeitho in Cardiganshire. My mother had a cousin who was a retired school teacher and it was arranged with Auntie Jane that if war broke out Peggy and I would be sent to stay with her. I don’t think we were very pleased with the arrangement because to us she was very old-fashioned, not a bit like our mother. We ended our holiday and went back to London to wait for the school term to begin.

On the 1st September we heard that all children from our school whose parents wanted them to be evacuated should assemble at the school gates with a small suitcase and their gas masks. As we had already been told we were going to Llangeitho if the war broke out we stood and watched all our friends walking in crocodile fashion down the street and mothers and fathers crying watching them go. It was a very sad day, but I wished I was going with them. I didn’t like the idea of staying with Auntie Jane. None of these children knew where they were going, just somewhere in the countryside for safety, and they didn’t know who would be looking after them.

Well, on the morning of 3rd September Neville Chamberlain, our prime minister, spoke on the wireless (we now call it a radio) to say that we were at war with Germany. Immediately the sirens went and everyone ran to the shelters. My parents, Peggy, and I went to Baker Street Station, which has very deep platforms. There were hundreds of people with the same thing on their minds. We all took our gas masks with us. After a short time the all-clear went. My father sent a telegram to Auntie Jane to say Peggy and I would be leaving London on the train at 9:25pm that night. Trains did not travel as fast as they do today and we were due to arrive at Pont Llanio Station at 7:30am on Monday morning. Peggy’s friend and her mother (an Italian lady who did not speak very good English) were coming too, as was one of the young people from the village who was working in London.

Paddington Station had very dim lights and when we got on the train there were no lights at all. After a little while we children began to feel a bit less afraid and started to tell ghost stories and play memory games. It was fun going to the toilet on the train because there were people sitting in the corridor and so was their luggage. We could not see them and I don’t think we really tried – it was all a game. We were supposed to be sleeping, but we were too excited for that. When it came time to eat our sandwiches we had to taste them before we knew what we were eating. Can you imagine being in a train without any lights, and there were no lights in the streets or houses or on the station platforms that we passed. Names of stations had already been removed in case the country was invaded by the enemy. The belief was that the enemy would not know where he was if there were no road signs etc. No-one thought about them using maps and compasses as they would now. [ed: 1983]

We eventually arrived in a town called Carmarthen where we had to change trains and take a slow train to Pont Llanio where a car would meet us. Our train from Paddington was very late arriving and the slow train had gone. Someone telephoned Pont Llanio station to say we would be late and to send the car back. The train from Carmarthen was a very slow one and my father used to say “you could get out of the train and pick flowers on the bank and get back into the train again”. It really was very slow and chugged its way along the line. We arrived at last in Pont Llanio and then in Llangeitho after a journey of 16 hours. [ed: 4:30 to 5:00 hours driving time, now; 6 hours by public transport] I am sure we must have looked very dirty and untidy. The trains in those days were steam and there would be plenty of coal smuts flying around.

I did not think Auntie Jane would be very pleased to see us and I was soon to find out that I had thought rightly. The first thing she did was to take up the stair carpet in case we wore it out. I don’t know how she thought we would do that because once we came down in the morning we were not allowed to go back upstairs again until we went to bed. [ed: if you’ve read “Carrie’s War” you may recognise the behaviour]  She also did not know that children eat quite a lot too. For breakfast Auntie Jane would boil an egg and cut it in half, so Peggy and I had half each. And the same for our dinner, we would have two small potatoes – and this was before rationing and shortage of food. We had a lot of friends in the village and if it was not for them asking us out to tea and/or supper we would have been very hungry. Peggy went to school in the village, but I was too old [ed: at 14 yrs 4 months] and had nothing to do all day, though I soon found a baby I could take out in the pram and that meant I would be asked if I would like a piece of cake and a drink. After a few weeks and a number of letters home things got a little better because my mother was sending parcels of food to Auntie Jane. I don’t know what arrangements were made money-wise; because we were not Government evacuees Auntie Jane would not have been paid by the authorities to keep us.

One of the things we used to do together with two of our friends was to help the local butcher clean out his slaughter-house after he had killed a beast. This meant he then asked us to supper in his old farm-house with a huge Inglenook fireplace. Another of my mother’s friends used to have us in early for a meal and say “don’t tell Auntie Jane or she will not give you anything else to eat”. I often think back on those days and wonder why she was so mean. She had never married or had children, but being a teacher I would have expected her to be more tolerant.

In December of 1939 Peggy wrote a letter home which was full of complaint and left it somewhere where Auntie Jane found it and this letter was sent to my parents with a letter from Auntie Jane asking that we be sent back to London. A lot of the people in the village were very surprised to think that she should think to send us back to London when there were air-raids (these had not started at that time). People were saying we would be going home to be killed, but as for me I would rather take that chance than be left in Llangeitho.

Going back to London wasn’t much fun – the school was closed so once again we were at a loose end. We stayed in London over Christmas and again the government started evacuating school children and in February we joined a group who were leaving London – this time as London School Evacuees. We were sent to Buckingham to a family with a little girl of 2 years. This seemed to be alright and we went to school in the afternoons whilst the local children went to school in the mornings. It got rather uncomfortable there after a while because the man of the house, aged 24, lost his job (I don’t know why) and there were a lot of arguments in the house. His wife did not make herself look smart and he started to pay too much attention to me. Again a letter home that it was time we left there and one morning my father arrived and said: “pack your bags, you’re coming home”. What joy!

I don’t have much memory about this part of being an evacuee except to say I was relieved to be out of that house and back in the safety of my family. Whilst we were in Buckingham there had been bombing in London and Peggy and I were taken to see some of the damage that had been done. I think this was to frighten us so that we would be willing to go away again. I certainly did not like the sirens going and having to stop what we were doing and go to the shelter[1]. Once again we were on the move and this time I have a very detailed memory of the events.

We were assembled at the school playground all with our cases and gas masks – worried children and even more worried parents and teachers. No one knew where we were going except that we all piled into a double-decker bus. Lots of tears this time because we knew that people were being killed and injured. Would we see our parents again? What was going to happen to us if they were killed? Where would we go, who would look after us? Questions, questions!

We were taken to Marylebone station and put on an underground train. Where was it going? What were mum and dad doing now; were they still blowing their noses? We were not so bothered because we still knew where we were. Next stop Paddington Station and hundreds of children milling about. I remember I was in the Junior Red Cross at that time and a Red Cross nurse saw my badge and came to speak to me. Such a little thing but it meant such a lot and I have never forgotten her kind words, saying I was going somewhere safe and would be alright. Maybe I was crying at the time, I don’t know.

As the train pulled out of Paddington Station we were all trying to get to a window to wave, although we didn’t know anybody and we didn’t know where we were going. Well of all places – we arrived in Banbury. Now my Auntie Kit, my father’s sister, only lived about 2 miles out of Banbury in a village called Bodicote. I knew Banbury well because we had often visited Bodicote. If only she knew I was here. I know we could not stay with her because she had a very small house and she had 4 children already.

Again on another bus and somehow Peggy and I and two other girls from my class got separated from the rest of our school and we were taken to a village called Great Rollright. Peggy and I went to stay with a lady called Mrs. Robinson who kept the village shop, and my two class friends went to a farm.

Mrs. Robinson was a kind lady – she had asked for two boys but somewhere along the line she had two girls instead. It was very strange arriving in Great Rollright. We were all taken into the Church Hall and there were village people there who, to us, had a funny way of speaking. And one after the other they were saying how many children they would take. Mrs. Robinson wasn’t there, so maybe that is why she didn’t get two boys. I thought it was very embarrassing to be standing there with these quaint country people whilst they were deciding whether they wanted one, two, or three children.

Our time with Mrs. Robinson was very happy. Peggy went to the village school and I went to the county school in Chipping Norton, again on a part-time basis. Mrs. Robinson had a pet cockerel which was allowed to roam round the house – I hated that bird and I think it knew it. Its name was Cocky. Every time I wanted to go down the garden to the toilet this bird would follow me and stay outside until I came out again and peck my legs as I raced back up the garden.

There was certainly plenty to eat in this house and we really had an enjoyable time there. We were always sent to bed with a small glass of cider. I never knew then that cider was an alcoholic drink and I thought it was lovely. We didn’t seem any the worse for it anyway.

We got involved with the village church and would have been happy to have stayed there. But doom. One day I came home from school to find my head mistress from London sitting at the table. Unbeknown to us she had been frantically looking for these four girls she had lost at Banbury Station. I don’t know how she had found us, whether she had contacted our parents or hunted through the schools in the area. With a surname like Walklett I don’t think we would have been difficult to find. I don’t think she had been in touch with our parents – what an awful thing to say to them: “I’m very sorry but I may have lost your children”. No, she must have hunted through the school registers.

The upshot of this visit was that she had found somewhere more suitable for us, and would we please pack our things. More tears because we liked Mrs. Robinson, and the village people were all lovely to us and we were invited to their homes. Off we went with Miss Attride in the car to another village called Duns Tew. The strange thing was that none of our school were there, so why were we moved yet again?

This time we stayed with Mr. and Mrs. Beck, his name was Harry and hers was Daisy, but they were Mr. and Mrs. Beck to us. Mr. Beck was a farm hand and he worked with horses. He used to plough the fields of the farm across the road. He must have walked miles in the days he was ploughing. Although I had had many holidays in Wales and Shropshire at haymaking time I knew nothing about ploughing.

Mr. and Mrs. Beck had a young man living with them. He was like their son; although his family lived in the village he had lived with the Becks since he was a baby and they called him their son. His name was Walter. The village was a beautiful place and we lived in No. 73. There were no street names, every house had a name and a number so we were at No. 73 Duns Tew, the last house in the village, a lovely old thatched cottage. There was always a big wood fire in the grate and plenty on the table. Mr. and Mrs. Beck were the nicest people in the village.

Peggy now had to go to Steeple Aston School (since moving to Banbury in 1975 I have met the widow of her headmaster there), and I went to a Continuation College which had been evacuated from the East End of London. This was very odd to me – we were taught shorthand, typing, arithmetic, English grammar, French. This was obviously training us for the commercial world. I was much younger than the other girls there but my education was more advanced than theirs so I soon became top of the class. My English was always being complimented. What they didn’t know was that I had a Welsh mother and the Welsh language uses every letter in the word. My French was well in advance and my Maths took a flying leap.

I made friends in the class. The class was held in The Hall, North Aston – a country seat. The Hall was so large that there were 9 girls living there and they had servants. The school room was in the Grand Hall and it was so beautiful it seemed a pity to me that there were desks etc. on the polished floor.

In Duns Tew we had one of the masters of the school staying in The Nurseries (which is still there) and every Friday evening the family he stayed with invited those of us in the village to spend the evening in their house and they had a piano so all the war songs were being sung: “Roll out the Barrel”, “We’re going to hang out the washing on the Siegfried line” and many more.

Because the school at North Aston was a long walk I bought a bike, something I had always wanted, and I joined the cycling group. This meant on one day a week we would go for an outing to places like Blenheim Palace [ed: 10 miles away] etc. I became a good cyclist and had plenty of energy when others flagged behind. I certainly made use of my bike.

One particularly happy time was when it snowed in the winter. Yes, we did get snow in London, but not like this. It was white[2] and where the wind blew it was as high as the hedgerows; I couldn’t believe what I saw. Walter the Beck’s son had a sledge and he showed us where it was good to use it. It was a fantastic time.

 

[Banbury, 1983]

 

[photo: https://jonathanlewis.files.wordpress.com/2017/07/sisters1.jpg]

 

[1] One of the stories about my mother that I first heard at her funeral was about the time she persuaded her parents to let her stay at home overnight. At the time the family used to head for the air-raid shelter (i.e. the local underground station) at the end of the day and stay there all night long. My mother hated this and persuaded her parents to let her stay at home in her own bed provided she promised to join them at the air-raid shelter as soon as the air-raid sirens sounded. She was only allowed to do this once – because she managed to sleep through two bombing runs and without being woken by the sirens or the explosions.

[2]If you’re wondering why white snow is worth mentioning you probably don’t know about the density of London smog at that time.

POUG voucher – contest

Contest time!
I had a cool function in my database:

SQL> ;
  1  select object_name, object_type
  2  from user_objects
  3* where object_name like '%POUG%'
SQL> /

OBJECT_NAME		       OBJECT_TYPE
------------------------------ -----------------------
F_GET_POUG_PROMO_CODE	       FUNCTION

When executed, it returned a -15% voucher for the POUG conference. And POUG is a REALLY COOL conference.

Postgresql block internals, part 2

This is the second part of a blogpost about Postgresql database block internals. If you found this blogpost and are interested, please read the first part, and then continue with this post.
I am doing the investigations on Oracle Linux 7u3 with postgres 9.6 (both the latest versions when this blogpost was written).

In the first part I talked about the pageinspect extension, and investigated the page header and line pointer array. This blogpost looks at the actual tuples, including the index, and how these are stored in the pages.

The block structures are explained in the main documentation at: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
We can see the tuple metadata and raw data using the heap_page_items function:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

As has been described in the first blogpost, the ‘lp’ entries fetch their data from the line pointer array in the page header, and the ‘t’ entries fetch the data from the actual tuple.
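The t_data column can also be decoded by hand. As a sketch (assuming bash on a little-endian platform; the varlena short-header rule, an odd first byte meaning total size = header byte >> 1 including the header, comes from src/include/postgres.h), this decodes the first tuple's \x010000001761616161616161616161:

```shell
t_data=010000001761616161616161616161
# column 1: int4, little-endian, so reverse the first four byte pairs
id=$(( 0x${t_data:6:2}${t_data:4:2}${t_data:2:2}${t_data:0:2} ))
# column 2: 1-byte ("short") varlena header 0x17; odd value -> short form,
# total length = header >> 1 (which includes the header byte itself)
vhdr=$(( 0x${t_data:8:2} ))
paylen=$(( (vhdr >> 1) - 1 ))
# convert the remaining hex pairs back to characters
payload=""
hex=${t_data:10}
while [ -n "$hex" ]; do
  payload="$payload$(printf "\\x${hex:0:2}")"
  hex=${hex:2}
done
echo "$id $paylen $payload"
```

This prints the id (1), the varchar payload length (10) and the payload itself (‘aaaaaaaaaa’), matching what we inserted.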

This is the raw block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

In the previous blog post I described how the offset of the rows can be found. This is a description of the tuple header data in the documentation:

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
All the details can be found in src/include/access/htup_details.h.

Let’s extract the fields from the raw block:

$ # xmin 8152+0
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8152 -N 2
   1760
$ # xmax 8152+4
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8156 -N 2
      0
$ # ctid 8152+12 (current tuple id), first 4 bytes is the block number, second 2 bytes tuple id
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x4 -j 8164 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8168 -N 2
     1
$ # t_infomask2 8152+18
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8170 -N 2
     2

What does ‘2’ mean? Once again, the source code to the rescue! t_infomask2 is described at line 260 of https://doxygen.postgresql.org/htup__details_8h_source.html. It is a bit mask. Here is how you can look it up: I took the mask definitions from the source code and printed the binary value of each in front of its define (using, for example, ‘echo “ibase=16;obase=2;7FF” | bc’):

     11111111111 #define HEAP_NATTS_MASK         0x07FF  /* 11 bits for number of attributes */
  10000000000000 #define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols
 100000000000000 #define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
1000000000000000 #define HEAP_ONLY_TUPLE         0x8000  /* this is heap-only tuple */
1110000000000000 #define HEAP2_XACT_MASK         0xE000  /* visibility-related bits */
1111111111111110 #define SpecTokenOffsetNumber       0xfffe

So, anything up to the number 2047 (0x7FF) in the t_infomask2 field describes the number of fields/attributes. In our case ‘2’ (binary 10) means there are two fields. Bear in mind that additional bits can be set (for a HOT update, for example) while the number of fields is still described: with a bitmask, every position functions independently of the others.
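As a quick sanity check, the extraction can be scripted with shell arithmetic (mask value copied from htup_details.h):

```shell
# low 11 bits of t_infomask2 (HEAP_NATTS_MASK, 0x07FF) = attribute count
t_infomask2=2                       # the value we read from the block
natts=$(( t_infomask2 & 0x07FF ))
echo "$natts"
# flag bits do not disturb the count, e.g. with HEAP_HOT_UPDATED (0x4000) set:
natts_hot=$(( (0x4000 | 2) & 0x07FF ))
echo "$natts_hot"
```

Both extractions print 2, demonstrating that the attribute count survives regardless of which flag bits are set.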

Now let’s look at the next field, t_infomask:

$ # t_infomask 8152+20
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8172 -N 2
  2050

This too is a bitmask. The description is in https://doxygen.postgresql.org/htup__details_8h_source.html starting from line 173:

               1 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */
              10 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
             100 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
            1000 #define HEAP_HASOID             0x0008  /* has an object-id field */
           10000 #define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
          100000 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
         1000000 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
        10000000 #define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */
                    /* xmax is a shared locker */
                 #define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
                 #define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                          HEAP_XMAX_KEYSHR_LOCK)
       100000000 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
      1000000000 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
                 #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
     10000000000 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
    100000000000 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
   1000000000000 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
  10000000000000 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
1000000000000000 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
                 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
1111111111110000 #define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */

The value for t_infomask is 2050, this is binary (echo “obase=2;2050” | bc):

    100000000010

Which means: 10: has variable-width attributes and 100000000000: xmax is invalid (this means xmax is not set).
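Testing the individual flag bits can also be scripted; a small sketch using the mask values listed above:

```shell
t_infomask=2050   # hex 0x0802, the value read from the block
flags=""
if (( t_infomask & 0x0002 )); then flags="$flags HEAP_HASVARWIDTH"; fi
if (( t_infomask & 0x0800 )); then flags="$flags HEAP_XMAX_INVALID"; fi
if (( t_infomask & 0x0001 )); then flags="$flags HEAP_HASNULL"; fi
echo "${flags# }"
```

Only HEAP_HASVARWIDTH and HEAP_XMAX_INVALID are reported; HEAP_HASNULL is not set because neither column contains a NULL.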

I created an index on this table too, called ‘pk_mytable’; the text below looks at this ordinary index. An index uses the same block size and the same block header, but different contents, quite obviously, because an index needs to maintain a balanced tree (btree) structure with ordered entries.

The first page of an index is a ‘metapage’ (a page that essentially points to the index root block). This is how the metapage can be inspected:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0

As described earlier, heap pages, index metapages and index pages are all normal postgres pages, which means they all contain a header which can be inspected using the ‘page_header’ function:

test=# select * from page_header(get_raw_page('pk_mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576A10 |        0 |     0 |    48 |  8176 |    8176 |     8192 |       4 |         0

The bt_metap function tells us the root block is in block 1. Index entries can be listed using the bt_page_items function:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
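The data column holds the raw key bytes. For this int4 key the value is stored little-endian in the first four bytes; the remaining four bytes are alignment padding (itemlen is 16: an 8-byte header plus 4 data bytes rounded up to MAXALIGN). A small decoding sketch for the first row:

```shell
# data column of the first bt_page_items row
data="01 00 00 00 00 00 00 00"
set -- $data                 # split into eight byte strings
key=$(( 0x$4$3$2$1 ))        # reverse the first four bytes (little-endian int4)
echo "$key"
```

This prints 1, the indexed id of the first row.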

The index structure (like root/leaf blocks, next and previous block) can be investigated using the bt_page_stats function:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3

This tells us that this is a leaf block (type: l), that there are four indexed tuples (live_items) in this leaf block, that there is no previous index leaf page/left sibling (btpo_prev: 0) and no next index leaf page/right sibling (btpo_next: 0). btpo is a union that contains either the tree level (zero for leaf pages) or the next transaction ID if the page has been deleted, and btpo_flags is a bitmap. The explanation of the bitmap is available in the source code; I added the actual bits in front of the defines (this is how that is done on the linux prompt: ‘echo “obase=2;$((1 << 0 ))” | bc’):

             1 #define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
            10 #define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
           100 #define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
          1000 #define BTP_META        (1 << 3)    /* meta-page */
         10000 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
        100000 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
       1000000 #define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
      10000000 #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */
10011011111011 #define MAX_BT_CYCLE_ID     0xFF7F

The btpo_flags field contains the value 3 (binary 11), which means this is a leaf block and this is the root block. This is what the BTPageOpaqueData struct looks like, with offsets and byte sizes added by me:

typedef struct BTPageOpaqueData
Bytes   {
4         BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
4         BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
          union
          {
4             uint32      level;      /* tree level --- zero for leaf pages */
              TransactionId xact;     /* next transaction ID, if deleted */
          }           btpo;
2         uint16      btpo_flags;     /* flag bits, see below */
2         BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
       } BTPageOpaqueData;

This struct is placed at the end of the block at the offset for ‘special’ when using the page_header function on the index block.
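The btpo_flags value can be decoded the same way as the earlier bitmasks; a small sketch using the BTP_* values listed above:

```shell
btpo_flags=3   # the value reported by bt_page_stats
desc=""
if (( btpo_flags & 1 )); then desc="$desc BTP_LEAF"; fi      # 1 << 0
if (( btpo_flags & 2 )); then desc="$desc BTP_ROOT"; fi      # 1 << 1
if (( btpo_flags & 4 )); then desc="$desc BTP_DELETED"; fi   # 1 << 2
echo "${desc# }"
```

This reports BTP_LEAF and BTP_ROOT: with only four entries the whole index fits in one page, so the root is also a leaf.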

This is what the block contents look like for an index:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 e8 6b 57 01 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look closely, you see the header is the same, which is no surprise: as I said, all pages contain the same page header. Index entries are allocated from the bottom up too, to allow the header (the line pointer array, actually) to grow when additional entries are inserted into this page of the index. This is exactly the same line pointer array as we saw with a heap table, so we can extract the entries in the same way; first entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fe0

Now flip the 16th bit to get the offset number:

$ echo $((0x9fe0 & ~$((1<<15))))
8160

Get the length of the index entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
0020

Right-shift:

$ echo $((0x0020 >> 1))
16
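Putting the two lookups together: a line pointer (ItemIdData) is one 32-bit value with lp_off in bits 0-14, lp_flags in bits 15-16 and lp_len in bits 17-31 (layout assumed from src/include/storage/itemid.h). A sketch decoding the two 16-bit words read above in one go:

```shell
# combine the low word (0x9fe0) and high word (0x0020) into one item id
itemid=$(( 0x9fe0 | (0x0020 << 16) ))
lp_off=$((   itemid         & 0x7FFF ))   # bits 0-14: offset in page
lp_flags=$(( (itemid >> 15) & 0x3 ))      # bits 15-16: 1 = normal, in-use
lp_len=$((   (itemid >> 17) & 0x7FFF ))   # bits 17-31: length in bytes
echo "$lp_off $lp_flags $lp_len"
```

This yields offset 8160, flag 1 and length 16, matching the two separate extractions.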

The offset number from the line pointer array points to the index tuple data header, which contains two fields, described in the annotated source code:

typedef struct IndexTupleData
{
   ItemPointerData t_tid;      /* reference TID to heap tuple */
   /* ---------------
    * t_info is laid out in the following fashion:
    *
    * 15th (high) bit: has nulls
    * 14th bit: has var-width attributes
    * 13th bit: unused
    * 12-0 bit: size of tuple
    * ---------------
    */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */

So the header contains, in ItemPointerData, the block and tuple id of the row the index entry describes; ItemPointerData is 6 bytes long. The header also contains a bitmap called t_info, 2 bytes in size, which means an index tuple header is 8 bytes. The index tuple header is not described in the postgres online manual. Did you notice there is no xmin/xmax in the index structure? An index does not do multi-versioning like a table does!

Let’s fetch the index tuple header data directly. First the tuple id of the heap (block number and tuple offset number):

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x4 -j 8160 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
     1
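The 4-byte block number inside ItemPointerData is itself a BlockIdData struct of two 16-bit halves, bi_hi and bi_lo (see src/include/storage/block.h). A sketch assembling all the header fields from the eight raw bytes at offset 8160 in the dump above:

```shell
# the 8 index tuple header bytes at offset 8160
bytes=(00 00 00 00 01 00 10 00)
bi_hi=$(( 0x${bytes[1]}${bytes[0]} ))    # each uint16 is stored little-endian
bi_lo=$(( 0x${bytes[3]}${bytes[2]} ))
blkno=$(( (bi_hi << 16) | bi_lo ))       # BlockIdGetBlockNumber
offnum=$(( 0x${bytes[5]}${bytes[4]} ))   # OffsetNumber
t_info=$(( 0x${bytes[7]}${bytes[6]} ))
echo "ctid=($blkno,$offnum) t_info=$t_info"
```

This reconstructs ctid (0,1) and t_info 16 in one pass.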

The next field in the header is t_info:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8166 -N 2
    16
$ echo "obase=2;16" | bc
10000

The highest bit set in t_info is bit 5 (binary 10000), while the description says bits 13/14/15 are the flag bits, so the only information in this t_info is the size of the tuple: 16 bytes.
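As a sketch, the same conclusion via shell arithmetic (bit positions taken from the t_info comment in IndexTupleData above):

```shell
t_info=16
size=$((    t_info & 0x1FFF ))      # bits 0-12: size of the index tuple
hasnull=$(( (t_info >> 15) & 1 ))   # bit 15: has nulls
hasvar=$((  (t_info >> 14) & 1 ))   # bit 14: has var-width attributes
echo "$size $hasnull $hasvar"
```

This prints a size of 16 with both flag bits clear.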

It is widely known that a regular btree index stores its entries ordered, which is why you can range scan the index leaf blocks, an optimised way of searching through data. But how is that organised in the index if I enter data unordered? In the table the data is added unordered, bottom up in the block, because the table has no requirement to store the data ordered. But the index must present the indexed value in an ordered way. I see two options: either the indexed value is added bottom-up, unordered, and the pointer to the entry is placed at the correct position in the line pointer array; or the indexed value is placed at the correct position in the data area AND the pointer is added at the correct position in the line pointer array. Let’s test this! We currently have the values 1-4 stored in the index; let’s add 6, and checkpoint:

test=# insert into mytable (id, f1) values (6, 'ffffffffff');
INSERT 0 1
test=# checkpoint;
CHECKPOINT

Now let’s look at the index block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 c8 81 64 01 00 00 00 00 2c 00 a0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

Here we see that the new tuple with the value 6 is added at offset 8096, and the corresponding line pointer entry is added (echo $((0x9fa0 & ~$((1<<15)))) = 8096). Now let's add 5 to mytable, and checkpoint:

test=# insert into mytable (id, f1) values (5, 'eeeeeeeeee');
INSERT 0 1
test=# checkpoint;

And dump the block contents again:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 00 85 64 01 00 00 00 00 30 00 90 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008080 00 00 00 00 06 00 10 00 05 00 00 00 00 00 00 00
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look in the index field data first (offset numbers 8080-8176), you see that the indexed field is added unordered. The number (field data) is at offset 8088, value 5, and at offset 8104, value 6 for the two rows we just inserted. If you look at the last two array members of the line pointer, you see that the ordering is happening there:

$ # 5th array member: 24 + (4 * 4)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 40 -N 2
 9f90
$ echo $((0x9f90 & ~$((1<<15))))
8080
$ # 6th array member: 24 + (4 * 5)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 44 -N 2
 9fa0
$ echo $((0x9fa0 & ~$((1<<15))))
8096

Now let’s fill up the table to 1000 rows to see what happens when a heap and an index block fill up:

test=# do $$
test$# begin
test$# for nr in 7..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO
test=# checkpoint;
CHECKPOINT

This is the first block of mytable:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/164E6A8 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

If you subtract lower from upper (792-764), you get the amount of free space in the first block: 28 bytes. I deliberately created all the rows with the same amount of data (an int and a varchar with 10 characters in it); let’s extract the first row:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp = 1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1779 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |       | \x010000001761616161616161616161

This shows in lp_len that the length of a single row is 39 bytes. Another row would need 39 bytes of data plus a 4 byte line pointer, which does not fit in the remaining 28 bytes, so this block is as full as it can get. This means that by default (when no fillfactor is set), a table block will be filled up to 100%.
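As a cross-check (assuming each 39 byte tuple is padded to 40 bytes for alignment and needs a 4 byte line pointer next to the 24 byte page header), the arithmetic reproduces the header values seen above exactly:

```shell
echo $(( (8192 - 24) / (40 + 4) ))   # 185 rows fit in one block
echo $(( 24 + 185 * 4 ))             # lower: header plus line pointer array -> 764
echo $(( 8192 - 185 * 40 ))          # upper: tuples allocated from the end -> 792
```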

Let’s look at the index. First scan the metapage of the index ‘pk_mytable’:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1

The root of the index now is block 3 (visible in the ‘root’ field). Let’s examine that page:

test=# select * from bt_page_stats('pk_mytable',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2

The old page (block 1) changed from being a combined root and leaf page to being a dedicated leaf page, and a new root page was created in page number 3 of pk_mytable! This also explains why there are only 3 items in the page (live_items): this is a page that guides searches to the leaf pages. This is also visible in btpo_flags, which now is 2: only the bit for a root page is set.

Let’s look at the contents of the root page:

test=# select * from bt_page_items('pk_mytable',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00

A root/tree index page (called a ‘branch’ block in Oracle) contains pointers to other index pages, which can be tree pages or leaf pages. Non-root tree pages are called ‘internal’ pages. The data field contains the highest value of the left/lower-value branch or leaf page. This way, with a tree level of one (bt_page_stats.btpo = 1), each item contains the highest value of the leaf page to the left of the page its ctid points to. A leaf page always points to heap (table) pages. If this single tree page fills up, it is split, and a new root is created to point to the two tree pages that result from the split of the old root page. Because a tree page item carries the highest value of the left/lower-value branch or leaf page, there is no value for the left-most item, because there is no block to its left. This is also visible in the ‘itemlen’ field above when examining the index page using bt_page_items: for the first item only the 8 byte header is stored.

To get an idea of the id values stored in the leaf pages, take the hexadecimal value and convert it to decimal:

$ echo "ibase=16;016F" | bc
367
$ echo "ibase=16;02DD" | bc
733

This means pk_mytable leaf page 1 contains the values 1-367, leaf page 2 contains 368-733 and leaf page 4 contains 734-1000.
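A quick check that these three ranges together cover all 1000 rows:

```shell
echo $(( 367 + (733 - 368 + 1) + (1000 - 734 + 1) ))   # 367 + 366 + 267 = 1000
```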

Let’s descend into the first leaf page, page 1 of pk_mytable, which was our old combined root and leaf page, and read the index metadata using bt_page_stats:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1

This shows it is a leaf page (type: l), and this time only a leaf page, which is visible in btpo_flags: only bit 1 is set, indicating a leaf page. btpo_prev is set to 0, which indicates this is the left-most page of the index; there is no block to its left. There is a page with higher values: btpo_next is set to 2, indicating block 2 of pk_mytable is the page to the right.

There are 808 bytes of free space available in the page. An index entry for this index is 16 bytes, which means there is space available. The way this is handled is summarised in the documentation in the source code of the function _bt_findsplitloc. Essentially, if an index page fills up, the page contents are split between two pages, each of which gets 50% of the page’s payload. The case of the page being both a leaf page and the right-most page (btpo_next=0) is treated specially: the page is split leaving fillfactor% (default 90%) in the original page and the remainder in the new right-most page, optimising space usage while still leaving room for updates.
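A rough estimate shows where the 367 live items of this page come from: page 1 was the rightmost page when it first split, so it kept roughly fillfactor% of the entries. Assuming roughly 8152 usable bytes per leaf page (8192 minus the 24 byte header and a 16 byte btree special area) and 20 bytes per entry (16 byte index tuple plus a 4 byte line pointer), a full page holds about 407 entries, and a 90% split leaves about 366 of them behind, close to the observed 367:

```shell
echo $(( (8192 - 24 - 16) / (16 + 4) ))   # ~407 entries in a full leaf page
echo $(( 407 * 90 / 100 ))                # ~366 entries kept after a 90% split
```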

Now let’s look at the index tuples for the leftmost page:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,6)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,5)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
          9 | (0,8)   |      16 | f     | f    | 08 00 00 00 00 00 00 00
...etc...

What is visible is that the data is ordered; however, the highest value in the index page is stored as the first item, and the lowest value starts at the second item. The README in the nbtree source explains this:

On a page that is not rightmost in its tree level, the "high key" is 
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.
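The first item of this non-rightmost leaf page is exactly that high key: its data field (6f 01 00 00 00 00 00 00, little-endian) decodes to 367, the highest id covered by this page and the same value that appears in the root page at itemoffset 2:

```shell
echo $(( 0x016f ))   # 367
```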

Conclusion
In this post I moved beyond the page headers of a heap and looked at what tuple metadata looks like, notably xmin and xmax, the ctid, and the infomasks, which are bitmap fields indicating specific status indicators per row, and at how to extract these manually.

Next I moved to the (simple primary key) index on the id column. An index page partly looks the same as a heap block, because it contains a common block header. However, an index contains a metapage as its first page and, if all the index tuples fit in a single page, a combined root and leaf page. Once the index grows out of that single page, a non-leaf root page is created which stores pointers to the index leaf pages containing the index tuple values, along with the maximum values of those leaf pages, except for the leftmost leaf page.

Heap (table) pages are filled up to 100% by default, which can be changed with the fillfactor storage attribute. An index has a default fillfactor of 90%. If an index leaf page fills up, it will split. However, if the right-most leaf page fills up (which happens with an increasing value, like a sequence or a timestamp), the split will be done 90%-10% (fillfactor% actually) between the old and new page instead of 50%-50%.

Every index leaf page other than the rightmost one stores the highest index field value of the page as its first index tuple, and starts with the lowest index field value as its second tuple, to avoid having to move the high key as new values are added.

The next post will look at what happens when row data gets changed. This is particularly exciting for me, because this is where postgres truly is different from my base reference, which is the Oracle database.

Tagged: internals, page, pg, postgres, Postgresql

DevOps is Ruining the DBA?

Database Administrators (DBAs), through their own self-promotion, will tell you they’re the smartest people in the room, and being such, will avoid buzzwords that create cataclysmic shifts in technology, as DevOps has.  One of our main roles is to maintain consistent availability, which is always threatened by change, and DevOps opposes this with its focus on methodologies like agile, continuous delivery and lean development.

Residing a step or more behind the bleeding edge has never fazed the DBA.  We were the cool kids by being retro, the ones refusing to fall for the latest trend or the coolest new feature, knowing that with the bleeding edge comes risk, and that a DBA who takes risks is a DBA out of work.  So we put up the barricades and refused the radical claims and cultural shift of DevOps.

As I travel to multiple events focused on the numerous platforms the database is crucial to, I’m faced with peers frustrated with DevOps, and considerable conversation dedicated to how it’s the end of the database administrator.  It may be my imagination, but I’ve been hearing this same story, with the blame assigned elsewhere-  either it’s Agile, DevOps, the Cloud or even the latest release of the actual database platform.  The story’s the same-  the end of the Database Administrator.

The most alarming, and obvious, point is that in each of these scenarios, the Database Administrator ended up more of a focal point than they were when it began.  When it comes to DevOps, the specific challenges of the goal needed the DBA more than any of these other storylines did.  As development hurtled at top speed to deliver what the business required, the DBA, and operations as a whole, delivered the security, the stability and the methodologies to build automation at a level that the other groups simply never needed previously.


Powerful DBAs, with skills not just in scripting but in efficiency and logic, were able to take complicated, multi-tier environments and break them down into strategies that could be easily adopted.  Having overcome the challenges of the database being central to, and blamed for, everything in the IT environment, they were able to dissect and build out complex management and monitoring of end-to-end DevOps.  As essential as system, network and server administration was to the operations group, the Database Administrator possessed the advanced skills required: a hybrid of developer and operations personnel that makes them a natural fit for DevOps.

The truth is, the DBA is not ruined by DevOps, but the role is revolutionized.

Thanks to this awesome 2012 post from Alex Tatiyants, which resonated so well with the DBAs I speak to every day, even in 2017.



Copyright © DBA Kevlar [DevOps is Ruining the DBA?], All Right Reserved. 2017.

The post DevOps is Ruining the DBA? appeared first on DBA Kevlar.

AskTOM TV episode 8

On AskTOM TV episode 8, I’ve taken a look at locating the SQL Plan Directives used for a particular query.  Here is the script output from the video, if you want to use it for your own exploration.


SQL>
SQL> create table t as
  2  select *
  3  from dba_objects
  4  where owner = 'SYS' and rownum <= 20
  5  union all
  6  select *
  7  from dba_objects
  8  where owner = 'SYSTEM'
  9  and rownum <= 200;

Table created.

SQL>
SQL> create index ix on t ( owner);

Index created.

SQL>
SQL> select COLUMN_NAME,NUM_DISTINCT,AVG_COL_LEN, num_nulls, density
  2  from   user_tab_cols
  3  where  table_name = 'T'
  4  order by COLUMN_ID;

COLUMN_NAME                    NUM_DISTINCT AVG_COL_LEN  NUM_NULLS    DENSITY
------------------------------ ------------ ----------- ---------- ----------
OWNER                                     2           7          0         .5
OBJECT_NAME                             199          18          0 .005025126
SUBOBJECT_NAME                            1           3        199          1
OBJECT_ID                               220           4          0 .004545455
DATA_OBJECT_ID                          167           4         49 .005988024
OBJECT_TYPE                               9           8          0 .111111111
CREATED                                  12           8          0 .083333333
LAST_DDL_TIME                            16           8          0      .0625
TIMESTAMP                                13          20          0 .076923077
STATUS                                    1           6          0          1
TEMPORARY                                 2           2          0         .5
GENERATED                                 2           2          0         .5
SECONDARY                                 1           2          0          1
NAMESPACE                                 4           3          0        .25
EDITION_NAME                              0           0        220          0
SHARING                                   2          10          0         .5
EDITIONABLE                               1           2        206          1
ORACLE_MAINTAINED                         1           2          0          1
APPLICATION                               1           2          0          1
DEFAULT_COLLATION                         1           7        136          1
DUPLICATED                                1           2          0          1
SHARDED                                   1           2          0          1
CREATED_APPID                             0           0        220          0
CREATED_VSNID                             0           0        220          0
MODIFIED_APPID                            0           0        220          0
MODIFIED_VSNID                            0           0        220          0

26 rows selected.

SQL>
SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> insert into t  select * from dba_objects;

78329 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> select /*+ gather_plan_statistics */ count(created)
  2  from t
  3  where owner = 'SYS'
  4  and object_type = 'JAVA CLASS';

COUNT(CREATED)
--------------
        138424

1 row selected.

SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3qyuxjtjy92m5, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(created) from t where owner
= 'SYS' and object_type = 'JAVA CLASS'

Plan hash value: 2143077847

-------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |      1 |        |      1 |00:00:00.08 |    5991 |
|   1 |  SORT AGGREGATE                      |      |      1 |      1 |      1 |00:00:00.08 |    5991 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |     12 |    138K|00:00:00.07 |    5991 |
|*  3 |    INDEX RANGE SCAN                  | IX   |      1 |    110 |    207K|00:00:00.03 |    1218 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')


22 rows selected.

SQL>
SQL> select sql_id, child_number,is_reoptimizable  from v$sql where sql_id = '3qyuxjtjy92m5';

SQL_ID        CHILD_NUMBER I
------------- ------------ -
3qyuxjtjy92m5            0 Y

1 row selected.

SQL>
SQL> exec dbms_spd.FLUSH_SQL_PLAN_DIRECTIVE

PL/SQL procedure successfully completed.

SQL>
SQL> select /*+ gather_plan_statistics */ count(created)
  2  from t
  3  where owner = 'SYS'
  4  and object_type = 'JAVA CLASS';

COUNT(CREATED)
--------------
        138424

1 row selected.

SQL>
SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3qyuxjtjy92m5, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ count(created) from t where owner
= 'SYS' and object_type = 'JAVA CLASS'

Plan hash value: 2143077847

-------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |      1 |        |      1 |00:00:00.06 |    5383 |
|   1 |  SORT AGGREGATE                      |      |      1 |      1 |      1 |00:00:00.06 |    5383 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |    138K|    138K|00:00:00.05 |    5383 |
|*  3 |    INDEX RANGE SCAN                  | IX   |      1 |    220 |    207K|00:00:00.02 |     610 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - statistics feedback used for this statement
   - performance feedback used for this statement
   - 1 Sql Plan Directive used for this statement


29 rows selected.

SQL>
SQL> select count(*) from dba_sql_plan_directives;

  COUNT(*)
----------
       354

1 row selected.

SQL>
SQL> set lines 60
SQL>
SQL> desc v$sql_plan
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ADDRESS                                RAW(8)
 HASH_VALUE                             NUMBER
 SQL_ID                                 VARCHAR2(13)
 PLAN_HASH_VALUE                        NUMBER
 FULL_PLAN_HASH_VALUE                   NUMBER
 CHILD_ADDRESS                          RAW(8)
 CHILD_NUMBER                           NUMBER
 TIMESTAMP                              DATE
 OPERATION                              VARCHAR2(30)
 OPTIONS                                VARCHAR2(30)
 OBJECT_NODE                            VARCHAR2(40)
 OBJECT#                                NUMBER
 OBJECT_OWNER                           VARCHAR2(128)
 OBJECT_NAME                            VARCHAR2(128)
 OBJECT_ALIAS                           VARCHAR2(261)
 OBJECT_TYPE                            VARCHAR2(20)
 OPTIMIZER                              VARCHAR2(20)
 ID                                     NUMBER
 PARENT_ID                              NUMBER
 DEPTH                                  NUMBER
 POSITION                               NUMBER
 SEARCH_COLUMNS                         NUMBER
 COST                                   NUMBER
 CARDINALITY                            NUMBER
 BYTES                                  NUMBER
 OTHER_TAG                              VARCHAR2(35)
 PARTITION_START                        VARCHAR2(64)
 PARTITION_STOP                         VARCHAR2(64)
 PARTITION_ID                           NUMBER
 OTHER                                  VARCHAR2(4000)
 DISTRIBUTION                           VARCHAR2(20)
 CPU_COST                               NUMBER
 IO_COST                                NUMBER
 TEMP_SPACE                             NUMBER
 ACCESS_PREDICATES                      VARCHAR2(4000)
 FILTER_PREDICATES                      VARCHAR2(4000)
 PROJECTION                             VARCHAR2(4000)
 TIME                                   NUMBER
 QBLOCK_NAME                            VARCHAR2(128)
 REMARKS                                VARCHAR2(4000)
 OTHER_XML                              CLOB
 CON_ID                                 NUMBER

SQL> set lines 200
SQL>
SQL>
SQL> select other from v$sql_plan
  2  where other is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

no rows selected

SQL>
SQL> select remarks from v$sql_plan
  2  where remarks is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

no rows selected

SQL>
SQL> select other_xml from v$sql_plan
  2  where other_xml is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

OTHER_XML
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
yes
SQL> select xmltype(other_xml) from v$sql_plan
  2  where other_xml is not null
  3  and sql_id = '3qyuxjtjy92m5'
  4  and child_number > 0;

XMLTYPE(OTHER_XML)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  yes
  yes
  12.2.0.1
  
  2
  1068910003
  2143077847
  1068910003
  
    0
    1
  
  
    
    
    
    
    
    
    
  



1 row selected.

SQL>
SQL> explain plan for
  2  select count(created)
  3  from t
  4  where owner = 'SYS'
  5  and object_type = 'JAVA CLASS';

Explained.

SQL>
SQL> select * from table(dbms_xplan.display(format=>'all'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2143077847

---------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |     1 |    23 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |      |     1 |    23 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |   143K|  3226K|     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX   |   220 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   2 - SEL$1 / T@SEL$1
   3 - SEL$1 / T@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT("CREATED")[22]
   2 - "CREATED"[DATE,7]
   3 - "T".ROWID[ROWID,10]

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

35 rows selected.

SQL>
SQL> select * from table(dbms_xplan.display(format=>'all +metrics'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2143077847

---------------------------------------------------------------------------------------------
| Id  | Operation                            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |     1 |    23 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |      |     1 |    23 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T    |   143K|  3226K|     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX   |   220 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   2 - SEL$1 / T@SEL$1
   3 - SEL$1 / T@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_TYPE"='JAVA CLASS')
   3 - access("OWNER"='SYS')

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT("CREATED")[22]
   2 - "CREATED"[DATE,7]
   3 - "T".ROWID[ROWID,10]

Sql Plan Directive information:
-------------------------------

  Used directive ids:
    14906410523430420431

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

41 rows selected.

SQL>
SQL>

Postgresql block internals

This blogpost is the result of me looking into how postgres works, and specifically the database blocks. The inspiration and essence of this blogpost comes from two blogs from Jeremiah Peschka: https://facility9.com/2011/03/postgresql-row-storage-fundamentals/ and https://facility9.com/2011/04/postgresql-update-internals/
I am using Oracle Linux 7u3 and postgres 9.6 (current versions when this blogpost was written).

Postgres is already installed, and a database cluster is already running. Let’s create a database ‘test’ for the sake of our tests:

$ createdb test

Once the database is created, logging on is done with ‘psql’:

$ psql -d test
psql (9.6.3)
Type "help" for help.

test=#

Once logged on, we need a table and index to investigate. Let’s create a very simple table with a primary key and insert some data:

test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

In order to look at the block structures there is a postgres extension called ‘pageinspect’. The extension is part of the ‘postgresql96-contrib’ RPM package. You can check if it’s installed on your machine by looking if the file ‘/usr/pgsql-9.6/share/extension/pageinspect.control’ exists.

Once you have made sure the pageinspect operating system dependencies are met, you need to install the extension in the database. To verify whether it is installed, look in ‘pg_extension’. This is how it looks on a freshly installed database cluster:

test=# select * from pg_extension;
 extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
---------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql |       10 |           11 | f              | 1.0        |           |

This means the pageinspect extension is not installed; install it in the following way:

test=# create extension pageinspect;
CREATE EXTENSION
test=# select * from pg_extension;
   extname   | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
-------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql     |       10 |           11 | f              | 1.0        |           |
 pageinspect |       10 |         2200 | t              | 1.5        |           |

Now with the pageinspect extension installed, let’s look at the heap (table) block. In order to understand what a block looks like, go to the documentation: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html
The essence is that a heap block contains a header, immediately followed by the ItemIdData entries (also known as line pointers, which are pointers to rows in the same block) growing top to bottom, and then the tuples, allocated bottom to top, allowing the line pointer array to grow.

The table mytable in this case consists of one block, because the table tuples (rows) fit in one. We can look at the page header by using the page_header function together with the get_raw_page function. Mind the name of the table and the block number in the function get_raw_page:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576BA8 |        0 |     0 |    40 |  8032 |    8192 |     8192 |       4 |         0

Next we want to look at the ItemIdData/line pointer array and rows, which are combined in the heap_page_items function. The ItemIdData array fields are indicated by ‘lp’, the fields in the tuple are indicated by ‘t’:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

In case you wondered how to decipher the data:

test=# select chr(x'61'::integer);
 chr
-----
 a

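To see how an entire t_data value decomposes, here is a minimal Python sketch. It decodes the first row shown by heap_page_items above, assuming the layout that follows from the table definition (an int32 ‘id’ followed by a varchar ‘f1’), with the varchar stored as a varlena with a 1-byte header (low bit set; length = header >> 1, a length which includes the header byte itself):

```python
import struct

# t_data of the first row, as reported by heap_page_items
t_data = bytes.fromhex("010000001761616161616161616161")

# first 4 bytes: little-endian int32 "id"
(row_id,) = struct.unpack_from("<i", t_data, 0)

# next byte: 1-byte varlena header (0x17); low bit set means short header
varlena_header = t_data[4]
assert varlena_header & 0x01 == 1
text_len = (varlena_header >> 1) - 1   # total length includes the header byte

f1 = t_data[5:5 + text_len].decode()

print(row_id, f1)                      # 1 aaaaaaaaaa
```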
In order to get a better understanding, it often helps to physically see the block contents (at least, that is how my brain works). This is how that can be done:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

(if xxd is not installed on your system, it’s in the vim-common package)
If you look at the header and page items results, you see the header and line pointer items on the top, then a star after offset 48, which indicates identical lines (all zero, indicating non-touched free space), and the row data allocated from the bottom (remember 0x61 is ‘a’, indicating the first added row is at the bottom).
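The first 24 bytes of the dump are the page header. As a sketch, they can be decoded in Python; the field layout used here follows the PageHeaderData struct from the storage page layout documentation, and the hex string is copied verbatim from the od output above:

```python
import struct

# First 24 bytes of the od dump above: the page header.
header = bytes.fromhex("00000000a86b5701000000002800601f0020042000000000")

# pd_lsn (2x uint32), pd_checksum, pd_flags, pd_lower, pd_upper,
# pd_special, pd_pagesize_version (uint16 each), pd_prune_xid (uint32)
(xlogid, xrecoff, checksum, flags, lower, upper,
 special, pagesize_version, prune_xid) = struct.unpack("<IIHHHHHHI", header)

print(f"lsn={xlogid:X}/{xrecoff:X}")   # lsn=0/1576BA8
print(lower, upper, special)           # 40 8032 8192
print(pagesize_version & 0xFF00)       # page size: 8192
print(pagesize_version & 0x00FF)       # layout version: 4
```

The values match the page_header output shown earlier.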

Let’s work a bit on the raw data, to see how it works. What I found EXTREMELY helpful (for me, coming from investigating a closed-source product like the Oracle database) is that the source code is available and fully searchable at http://doxygen.postgresql.org (thanks Jan and Devrim!!).

Let’s try to find the first record directly using the block contents.

First we need to find the line pointer array. In order to do that, I looked at the PageHeaderData struct in the source code:

Page header in the source code: https://doxygen.postgresql.org/structPageHeaderData.html
typedef struct PageHeaderData
{
    /* XXX LSN is member of *any* block, not only page-organized ones */		size	/ offset
    PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog		8 bytes / 0
                                 * record for last change to this page */
    uint16      pd_checksum;    /* checksum */						2 bytes	/ 8
    uint16      pd_flags;       /* flag bits, see below */				2 bytes / 10
    LocationIndex pd_lower;     /* offset to start of free space */			2 bytes / 12
    LocationIndex pd_upper;     /* offset to end of free space */			2 bytes / 14
    LocationIndex pd_special;   /* offset to start of special space */			2 bytes / 16
    uint16      pd_pagesize_version;							2 bytes / 18
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */		4 bytes / 20
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */		4 bytes array / 24
} PageHeaderData;

I added the size and offset numbers myself, to make it easier to follow. From the above we see the first array member should be at offset 24 from the start of the header. However, we also need to understand what ItemIdData looks like: https://doxygen.postgresql.org/structItemIdData.html

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of item pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;
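
Since an ItemIdData entry is a 32-bit bitfield (lp_off in the low 15 bits, lp_flags in the next 2, lp_len in the high 15), a small Python sketch can decode one directly. The 4 bytes used here are the first line pointer array entry from the dump above, at offset 24:

```python
import struct

# Bytes at offset 24 in the dump: the first ItemIdData entry, little-endian.
raw = bytes.fromhex("d89f4e00")
(v,) = struct.unpack("<I", raw)

lp_off   = v & 0x7FFF          # low 15 bits
lp_flags = (v >> 15) & 0x3     # next 2 bits
lp_len   = v >> 17             # high 15 bits

print(lp_off, lp_flags, lp_len)   # 8152 1 39
```

This matches the lp_off, lp_flags and lp_len columns of heap_page_items for lp 1; lp_flags value 1 is LP_NORMAL (used, pointing to a tuple).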

Now let’s get the offset number of the first row straight from the line pointer array in the block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fd8

However, this number is too high, because it’s decimal 40920, binary 1001 1111 1101 1000; in other words, the 16th bit is set. That bit is the first bit of lp_flags, since lp_off is only 15 bits wide. So we need to set the 16th bit to zero:

$ echo $((0x9fd8 & ~$((1<<15))))
8152

This is how to get the tuple length from the line pointer array:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
 004e

This number is too high too: it’s decimal 78, while we know the length is 39 (see above). This is because the two bytes read also include the topmost bit of the lp_flags field.
We need to right-shift this number by one to get the actual length:

$ echo $((0x004e >> 1))
39

So there you have it: by reading the block directly, I fetched the line pointer values of the first tuple (offset number and length), which are 8152 and 39 bytes. Using the above commands you can easily read the next line pointer entry by setting the correct offset at ‘-j’ and doing the same bit manipulation.
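
Rather than repeating the od command per entry, the whole line pointer array can be walked in one pass. A sketch over the four entries visible in the dump (the 16 bytes at offsets 24 through 39; pd_lower = 40 marks the end of the array):

```python
import struct

# Bytes at offsets 24-39 in the dump: four ItemIdData entries.
linp = bytes.fromhex("d89f4e00b09f4e00889f4e00609f4e00")

for i in range(0, len(linp), 4):
    (v,) = struct.unpack_from("<I", linp, i)
    # lp number, lp_off (low 15 bits), lp_len (high 15 bits)
    print(i // 4 + 1, v & 0x7FFF, v >> 17)
```

This prints offsets 8152, 8112, 8072 and 8032, each with length 39, matching the heap_page_items output above.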

So, to conclude this blogpost: I’ve touched on the pageinspect extension, the searchable postgres source at http://doxygen.postgresql.org, and the documentation at https://www.postgresql.org/docs/9.6/static/storage-page-layout.html, and showed the page header, the line pointer array in the page header, and the rows in the block. These can all be seen using the functions page_header, heap_page_items and get_raw_page. Then I showed how to fetch the raw block contents, to truly see the block, and how to read data directly using the encode and get_raw_page functions in postgres together with the xxd and od utilities on the Linux command line.

NB. Updated July 1st, 17:19 CET after comments of Jan Karremans.

Tagged: blocks, internals, performance, pg, postgres, Postgresql