#eeeeee; border: 0px solid #000000; font-family: courier new; font-size: 85%; overflow: auto; padding-left: 4px; padding-right: 4px; width: 95%;">ORA-01652 unable to extend temporary segment...
#eeeeee; border: 0px solid #000000; font-family: courier new; font-size: 85%; overflow: auto; padding-left: 4px; padding-right: 4px; width: 95%;">ORA-01652 unable to extend temporary segment...
Recently I encountered a situation that demonstrates how, in an Oracle database, an error may stem from a very unrelated cause. A DBA was building a physical standby database for an upcominmg training. The two servers he was using were part of a RAC cluster; so the Oracle binaries were already there. He decided to use the same ORACLE_HOME for the new databases as well - a quite logical decision to savbe on space and administration issues. He created a primary database on server n1 and a standby database on the server n2. Follooing the usual manual procedure in building the standby database, he copied the pfile from primary database, modified the parameters and brought up the standby instance in nomount mode on the server n2.
SQL> startup nomount pfile=initSTBY.ora
But it refused to come up, with the following error:
ORA-00304: requested INSTANCE_NUMBER is busy
Alert log showed:
USER (ospid: 14210): terminating the instance due to error 304
Instance terminated by USER, pid = 14210
This was highly unusual. The primary and standby both were non-RAC; there was no instance number concept in a non-RAC database. By the way, the RAC instance on that server (or on the server n1) was not running; so there was no question of any conflict with the RAC instances either. The primary database was called PRIM while the standby was called STBY - eliminating the possibility of an instance name clash as well. And this error came while merely trying to start the instance, not even while mounting - eliminating the standby controlfile as a cause.
The error 304 showed:
00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause: An instance tried to start by using a value of the
// initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
// a) specify another INSTANCE_NUMBER,
// b) shut down the running instance with this number
// c) wait for instance recovery to complete on the instance with
// this number.
Needless to say, being for a non-RAC database there was no "instance_number" parameter in the initialization parameter file of primary or the standby. So, the suggestions for the resolution seemed odd. MetaLink provided no help. All the ORA-304 errors were related to RAC with the instance_number mismatch.
As it always happens, it fell on my lap at this time. With just days to go live, I had to find a solution quickly. Long hours of troubleshooting, tracing the processes and examination of occasional trace files did not yield any clue. All the clue seemed to point to RAC, which this database was not. The Oracle Home was a RAC home, which meant the oracle binary was linked with the "rac" option.
So, the next logical step was to install a new Oracle Home without the rac option. After doing so, I tried to bring up the instance, using the new ORACLE_HOME and LD_LIBRARY_PATH variable; but, alas, the same error popped up.
Just to eliminate the possibility of some unknown bug, I decided to put an instance_number parameter, setting it to "1", from the default "0". The same error. I changed it to "2", again, the result was the same error.
Although this didn't help, at least it gave a clue that the error was not related to instance_number. The error message was clearly wrong. With this in mind, I went back to the basics. I went through the alert log with a fine toothed comb, scanning and analyzing each line.
The following line drew my attention:
DB_UNQIUE_NAME STBY is not in the Data Guard configuration
This was odd; the db_unique_name STBY was not defined in DG configuration. [BTW, note the spelling of "unique" in the message above. That is not what I typed; it was a copy and paste from the actual message in the alert log. Someone in Oracle Development should really pay atytention to typos in messages. This is clearly more than a nuisance; what if some process scans for db_unique_name related errors? It will not find the message at all!]
Checking the dg configuration, I found that the DBA has correctly defined the primary and standby names. In any case, Data Guard has not been started yet; this is merely at the instance startup - why is it complaining for data guard configuration at this time.
Perplexed, I resorted to a different approach. I renamed the pfile and all other relevant files. Then I built the standby myself, from scratch - using the same names - PRIM and STBY. And this time, everything worked fine. The instance STBY did come up.
While this solvbed the urgency problem, everyone, inclduing myself, wanted to know what the issue was in the earlier case where the DBA had failed to bring up the instance. To get the answer, I compared the files I created with the DBA created when tried and failed. Voila! The cause was immediately clear - the DBA forgot to put a vital parameter in the pfile of the standby instance:
db_unique_name = 'STBY'
This parameter was absent; so it took the default value as the db_name, which was "PRIM". This caused the instance to fail with a seemingly unrelated message - "ORA-304 Instance_number is busy"!
Learning Points
Hope this was helpful. Happy New Year, everybody.
Recently, we had a glitch on a Data Guard (physical standby database) on infrastructure. This is not a critical database; so the monitoring was relatively lax. And that being done by an outsourcer does not help it either. In any case, the laxness resulted in a failure remaining undetected for quite some time and it was eventually discovered only when the customer complained. This standby database is usually opened for read only access from time to time.This time, however, the customer saw that the data was significantly out of sync with primary and raised a red flag. Unfortunately, at this time it had become a rather political issue.
Since the DBA in charge couldn’t resolve the problem, I was called in. In this post, I will describe the issue and how it was resolved. In summary, there are two parts of the problem:
(1) What happened
(2) How to fix it
What Happened
Let’s look at the first question – what caused the standby to lag behind. First, I looked for the current SCN numbers of the primary and standby databases. On the primary:
SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447102
On the standby:
SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1301571
Clearly there is a difference. But this by itself does not indicate a problem; since the standby is expected to lag behind the primary (this is an asynchronous non-real time apply setup). The real question is how much it is lagging in the terms of wall clock. To know that I used the scn_to_timestamp function to translate the SCN to a timestamp:
SQL> select scn_to_timestamp(1447102) from dual;
SCN_TO_TIMESTAMP(1447102)
-------------------------------
18-DEC-09 08.54.28.000000000 AM
I ran the same query to know the timestamp associated with the SCN of the standby database as well (note, I ran it on the primary database, though; since it will fail in the standby in a mounted mode):
SQL> select scn_to_timestamp(1301571) from dual;
SCN_TO_TIMESTAMP(1301571)
-------------------------------
15-DEC-09 07.19.27.000000000 PM
This shows that the standby is two and half days lagging! The data at this point is not just stale; it must be rotten.
The next question is why it would be lagging so far back in the past. This is a 10.2 database where FAL server should automatically resolved any gaps in archived logs. Something must have happened that caused the FAL (fetch archived log) process to fail. To get that answer, first, I checked the alert log of the standby instance. I found these lines that showed the issue clearly:
…
Fri Dec 18 06:12:26 2009
Waiting for all non-current ORLs to be archived...
Media Recovery Waiting for thread 1 sequence 700
Fetching gap sequence in thread 1, gap sequence 700-700
…
Fri Dec 18 06:13:27 2009
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 700-700
DBID 846390698 branch 697108460
FAL[client]: All defined FAL servers have been attempted.
…
Going back in the alert log, I found these lines:
Tue Dec 15 17:16:15 2009
Fetching gap sequence in thread 1, gap sequence 700-700
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:15 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
Tue Dec 15 17:16:45 2009
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:45 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
This clearly showed the issue. On December 15th at 17:16:15, the Managed Recovery Process encountered an error while receiving the log information from the primary. The error was ORA-12514 “TNS:listener does not currently know of service requested in connect descriptor”. This is usually the case when the TNS connect string is incorrectly specified. The primary is called DEL1 and there is a connect string called DEL1 in the standby server.
The connect string works well. Actually, right now there is no issue with the standby getting the archived logs; so there connect string is fine - now. The standby is receiving log information from the primary. There must have been some temporary hiccups causing that specific archived log not to travel to the standby. If that log was somehow skipped (could be an intermittent problem), then it should have been picked by the FAL process later on; but that never happened. Since the sequence# 700 was not applied, none of the logs received later – 701, 702 and so on – were applied either. This has caused the standby to lag behind since that time.
So, the fundamental question was why FAL did not fetch the archived log sequence# 700 from the primary. To get to that, I looked into the alert log of the primary instance. The following lines were of interest:
…
Tue Dec 15 19:19:58 2009
Thread 1 advanced to log sequence 701 (LGWR switch)
Current log# 2 seq# 701 mem# 0: /u01/oradata/DEL1/onlinelog/o1_mf_2_5bhbkg92_.log
Tue Dec 15 19:20:29 2009Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Tue Dec 15 19:20:29 2009
FAL[server, ARC1]: FAL archive failed, see trace file.
Tue Dec 15 19:20:29 2009
Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed.
Archiver continuing
Tue Dec 15 19:20:29 2009
ORACLE Instance DEL1 - Archival Error. Archiver continuing.
…
These lines showed everything clearly. The issue was:
ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
The archived log simply was not available. The process could not see the file and couldn’t get it across to the standby site.
Upon further investigation I found that the DBA actually removed the archived logs to make some room in the filesystem without realizing that his action has removed the most current one which was yet to be transmitted to the remote site. The mystery surrounding why the FAL did not get that log was finally cleared.
Solution
Now that I know the cause, the focus was now on the resolution. If the archived log sequence# 700 was available on the primary, I could have easily copied it over to the standby, registered the log file and let the managed recovery process pick it up. But unfortunately, the file was gone and I couldn’t just recreate the file. Until that logfile was applied, the recovery will not move forward. So, what are my options?
One option is of course to recreate the standby - possible one but not technically feasible considering the time required. The other option is to apply the incremental backup of primary from that SCN number. That’s the key – the backup must be from a specific SCN number. I have described the process since it is not very obvious. The following shows the step by step approach for resolving this problem. I have shown where the actions must be performed – [Standby] or [Primary].
1. [Standby] Stop the managed standby apply process:
SQL> alter database recover managed standby database cancel;
Database altered.
2. [Standby] Shutdown the standby database
3. [Primary] On the primary, take an incremental backup from the SCN number where the standby has been stuck:
RMAN> run {
2> allocate channel c1 type disk format '/u01/oraback/%U.rmb';
3> backup incremental from scn 1301571 database;
4> }
using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=139 devtype=DISK
Starting backup at 18-DEC-09
channel c1: starting full datafile backupset
channel c1: specifying datafile(s) in backupset
input datafile fno=00001 name=/u01/oradata/DEL1/datafile/o1_mf_system_5bhbh59c_.dbf
…
piece handle=/u01/oraback/06l16u1q_1_1.rmb tag=TAG20091218T083619 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:06
Finished backup at 18-DEC-09
released channel: c1
4. [Primary] On the primary, create a new standby controlfile:
SQL> alter database create standby controlfile as '/u01/oraback/DEL1_standby.ctl';
Database altered.
5. [Primary] Copy these files to standby host:
oracle@oradba1 /u01/oraback# scp *.rmb *.ctl oracle@oradba2:/u01/oraback
oracle@oradba2's password:
06l16u1q_1_1.rmb 100% 43MB 10.7MB/s 00:04
DEL1_standby.ctl 100% 43MB 10.7MB/s 00:04
6. [Standby] Bring up the instance in nomount mode:
SQL> startup nomount
7. [Standby] Check the location of the controlfile:
SQL> show parameter control_files
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
control_files string /u01/oradata/standby_cntfile.ctl
8. [Standby] Replace the controlfile with the one you just created in primary.
9. $ cp /u01/oraback/DEL1_standby.ctl /u01/oradata/standby_cntfile.ctl
10.[Standby] Mount the standby database:
SQL> alter database mount standby database;
11.[Standby] RMAN does not know about these files yet; so you must let it know – by a process called cataloging. Catalog these files:
$ rman target=/
Recovery Manager: Release 10.2.0.4.0 - Production on Fri Dec 18 06:44:25 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
connected to target database: DEL1 (DBID=846390698, not open)
RMAN> catalog start with '/u01/oraback';
using target database control file instead of recovery catalog
searching for all files that match the pattern /u01/oraback
List of Files Unknown to the Database
=====================================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb
Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done
List of Cataloged Files
=======================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb
12.Recover these files:
RMAN> recover database;
Starting recover at 18-DEC-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /u01/oradata/DEL2/datafile/o1_mf_system_5lptww3f_.dbf
...
channel ORA_DISK_1: reading from backup piece /u01/oraback/05l16u03_1_1.rmb
channel ORA_DISK_1: restored backup piece 1
piece handle=/u01/oraback/05l16u03_1_1.rmb tag=TAG20091218T083619
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
starting media recovery
archive log thread 1 sequence 8012 is already on disk as file /u01/oradata/1_8012_697108460.dbf
archive log thread 1 sequence 8013 is already on disk as file /u01/oradata/1_8013_697108460.dbf
…
13. After some time, the recovery fails with the message:
archive log filename=/u01/oradata/1_8008_697108460.dbf thread=1 sequence=8009
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 12/18/2009 06:53:02
RMAN-11003: failure during parse/execution of SQL statement: alter database recover logfile '/u01/oradata/1_8008_697108460.dbf'
ORA-00310: archived log contains sequence 8008; sequence 8009 required
ORA-00334: archived log: '/u01/oradata/1_8008_697108460.dbf'
This happens because we have come to the last of the archived logs. The expected archived log with sequence# 8008 has not been generated yet.
14.At this point exit RMAN and start managed recovery process:
SQL> alter database recover managed standby database disconnect from session;
Database altered.
15.Check the SCN’s in primary and standby:
[Standby] SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447474
[Primary] SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447478
Now they are very close to each other. The standby has now caught up.
Recent comments
1 year 45 weeks ago
2 years 5 weeks ago
2 years 9 weeks ago
2 years 10 weeks ago
2 years 15 weeks ago
2 years 36 weeks ago
3 years 4 weeks ago
3 years 34 weeks ago
4 years 18 weeks ago
4 years 19 weeks ago