DBMS Partners blog: Oracle ASM disk failure

Friday, November 25, 2011

Oracle ASM disk failure - Part 2

Introduction

In Part 1, I wrote about a scenario in which ASM detects read or write errors on a disk and records them in the READ_ERRS/WRITE_ERRS columns of v$asm_disk. In that case the DBA has to drop the disk from ASM explicitly. This article covers a different scenario, in which the ASM instance performs the 'drop disk' operation itself.

This post assumes that you are using ASM redundancy (Normal or High) and that you are not using ASMLib. If you are using ASMLib, the commands and syntax may differ.

Scenario

In this scenario, ASM drops the disk automatically. In addition, READ_ERRS/WRITE_ERRS in v$asm_disk may show NULL instead of an actual count of read or write errors.

How to identify the failed disk

Unlike the scenario discussed in Part 1 of this ASM series, in some situations the ASM instance initiates the 'drop disk' by itself. In the examples below, the failed disk is '/dev/sds1'.

select path
from   v$asm_disk
where  read_errs is NULL;

/dev/sds1

select path
from   v$asm_disk
where  write_errs is NULL;

/dev/sds1

In addition, HEADER_STATUS in v$asm_disk shows UNKNOWN:

select  mount_status,header_status,mode_status,state
from    v$asm_disk
where   path = '/dev/sds1';

CLOSED  UNKNOWN      ONLINE  NORMAL

Compare this with the scenario in Part 1, where HEADER_STATUS still shows MEMBER and READ_ERRS/WRITE_ERRS show a value greater than 0.
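The two patterns can be told apart mechanically. Below is a minimal Python sketch that assumes you have already fetched rows from v$asm_disk (for example with a DB-API cursor) and turned each row into a dict keyed by lowercase column name, with SQL NULL as None. `classify_disk` and its "part1"/"part2" labels are hypothetical names chosen for illustration. The rules encode only the two patterns described in this series and are not a complete diagnosis.

```python
from typing import Any, Mapping, Optional


def classify_disk(row: Mapping[str, Any]) -> Optional[str]:
    """Classify one v$asm_disk row (hypothetical helper).

    Returns "part1" for the Part 1 pattern, "part2" for the pattern in
    this post, or None if the row matches neither.
    """
    header = row.get("header_status")
    read_errs = row.get("read_errs")
    write_errs = row.get("write_errs")

    # Part 1: disk is still a MEMBER but ASM counted I/O errors;
    # the DBA must drop the disk explicitly.
    if header == "MEMBER" and ((read_errs or 0) > 0 or (write_errs or 0) > 0):
        return "part1"

    # Part 2: header is UNKNOWN and both error counters are NULL;
    # ASM has already dropped the disk itself.
    if header == "UNKNOWN" and read_errs is None and write_errs is None:
        return "part2"

    # Anything else is outside the two scenarios described in this series.
    return None


if __name__ == "__main__":
    rows = [
        {"path": "/dev/sds1", "header_status": "UNKNOWN",
         "read_errs": None, "write_errs": None},
        {"path": "/dev/sdb1", "header_status": "MEMBER",
         "read_errs": 3, "write_errs": 0},
    ]
    for r in rows:
        print(r["path"], classify_disk(r))  # /dev/sds1 part2, then /dev/sdb1 part1
```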

The following errors appeared in the +ASM alert log when the failure was first noticed:

WARNING: initiating offline of disk
NOTE: cache closing disk
WARNING: PST-initiated drop disk
ORA-27061: waiting for async I/Os failed
WARNING: IO Failed. subsys:System dg:0, diskname:/dev/sds1
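To catch this failure without reading the whole alert log by eye, a short script can scan the log for the messages quoted above and extract the disk path from the "IO Failed" line. Below is a minimal Python sketch. The helper names are hypothetical, and the alert log path is an assumption because its location depends on your configuration. The message strings are copied from the excerpt above, and other Oracle versions may word them differently.

```python
import re
from typing import Iterable, List, Tuple

# Message fragments copied from the +ASM alert log excerpt in this post.
WARNING_MARKERS = (
    "WARNING: initiating offline of disk",
    "WARNING: PST-initiated drop disk",
    "ORA-27061: waiting for async I/Os failed",
)

# Captures the OS path after "diskname:" (stops at whitespace or a comma).
IO_FAILED = re.compile(r"WARNING: IO Failed\..*diskname:([^\s,]+)")


def find_failed_disks(lines: Iterable[str]) -> List[str]:
    """Return disk paths named in "IO Failed" warnings, in first-seen order."""
    disks: List[str] = []
    for line in lines:
        match = IO_FAILED.search(line)
        if match and match.group(1) not in disks:
            disks.append(match.group(1))
    return disks


def has_drop_warnings(lines: Iterable[str]) -> bool:
    """True if any of the offline/drop warning messages appears."""
    return any(marker in line for line in lines for marker in WARNING_MARKERS)


def scan_alert_log(path: str) -> Tuple[bool, List[str]]:
    """Read the whole log once, then run both checks on the same lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = f.readlines()
    return has_drop_warnings(lines), find_failed_disks(lines)


if __name__ == "__main__":
    sample = [
        "WARNING: initiating offline of disk",
        "NOTE: cache closing disk",
        "WARNING: IO Failed. subsys:System dg:0, diskname:/dev/sds1",
    ]
    print(has_drop_warnings(sample))  # True
    print(find_failed_disks(sample))  # ['/dev/sds1']
```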

No "drop disk" command required by DBA

The disk has already been dropped by the ASM instance, so there is no need to run an "alter diskgroup ... drop disk" command. Instead, the DBA has to work with the system administrator to physically locate the failed disk in the disk enclosure and remove it.

Add the replacement disk

1) Get the name of the replacement (new) device, partition it and change its ownership to the database owner. For example, let the disk path after partitioning be /dev/sdk1.

2) select distinct header_status from v$asm_disk where path = '/dev/sdk1'; (must show CANDIDATE)

3) alter diskgroup #name# add disk '/dev/sdk1';

4) The add disk command starts a rebalance operation. You can monitor its progress by querying v$asm_operation:

select state,power,group_number,EST_MINUTES
from v$asm_operation;

After a few minutes or hours, depending on how much data has to be moved, the rebalance finishes and the query above returns no rows.

5) The disk add operation is now complete.
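Steps 2 and 4 are checks a script can automate. Below is a minimal Python sketch that assumes a DB-API 2.0 connection to the ASM instance, for example one opened with the cx_Oracle driver (not shown). `ensure_candidate`, `wait_for_rebalance` and the fake connection classes are hypothetical names for illustration, and the polling interval and timeout are arbitrary defaults. Any object whose `cursor()` returns something with `execute` and `fetchall` works, which is how the fake connection in the usage section exercises the logic.

```python
import time
from typing import Any, Callable


def ensure_candidate(conn: Any, path: str) -> None:
    """Step 2: refuse to continue unless the new disk shows CANDIDATE."""
    cur = conn.cursor()
    cur.execute(
        "select distinct header_status from v$asm_disk where path = :path",
        {"path": path},
    )
    statuses = [row[0] for row in cur.fetchall()]
    if statuses != ["CANDIDATE"]:
        raise RuntimeError(f"{path}: header_status is {statuses}, expected CANDIDATE")


def wait_for_rebalance(conn: Any, poll_seconds: float = 60.0,
                       timeout_seconds: float = 24 * 3600,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Step 4: poll v$asm_operation until it returns no rows.

    Returns how many polls still showed a running operation.
    Raises TimeoutError if the rebalance outlasts timeout_seconds.
    """
    waited = 0.0
    polls = 0
    while True:
        cur = conn.cursor()
        cur.execute("select state, power, group_number, est_minutes from v$asm_operation")
        rows = cur.fetchall()
        if not rows:
            return polls  # no rows: the rebalance has finished
        polls += 1
        for state, power, group_number, est_minutes in rows:
            print(f"group {group_number}: {state} power={power} est_minutes={est_minutes}")
        if waited >= timeout_seconds:
            raise TimeoutError("rebalance still running")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeCursor:
    """Hypothetical stand-in: each execute() consumes the next canned result."""

    def __init__(self, results):
        self._results = results  # shared list, consumed in order
        self._rows = []

    def execute(self, sql, params=None):
        self._rows = self._results.pop(0)

    def fetchall(self):
        return self._rows


class FakeConnection:
    def __init__(self, results):
        self._results = results

    def cursor(self):
        return FakeCursor(self._results)


if __name__ == "__main__":
    # The check query returns CANDIDATE, then the rebalance is seen running
    # for two polls before v$asm_operation comes back empty.
    conn = FakeConnection([
        [("CANDIDATE",)],
        [("RUN", 8, 1, 12)],
        [("RUN", 8, 1, 3)],
        [],
    ])
    ensure_candidate(conn, "/dev/sdk1")
    print(wait_for_rebalance(conn, sleep=lambda s: None))  # prints 2 after the progress lines
```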

How to reduce the ASM rebalance operation time

While the rebalance is in progress, the DBA can make it finish faster by raising the ASM rebalance power, for example:

alter diskgroup #name# rebalance power 8;

The default power is 1, set by the ASM_POWER_LIMIT initialization parameter. This means ASM starts one rebalance background process, called an ARB process, to do the rebalancing work. The command above dynamically starts 8 ARB processes (ARB0 to ARB7), which can dramatically reduce rebalance time. In 11g R1 the maximum power is 11, so up to 11 ARB processes can be started.
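If you script this step, check the power value before sending it. Below is a minimal Python sketch that builds the statement shown above. `rebalance_power_sql` is a hypothetical helper name. The default upper bound of 11 reflects the 11g R1 limit stated in this post; later releases raised the limit, so the bound is a parameter. The diskgroup name check is a conservative guard against malformed SQL, not a full implementation of Oracle's identifier rules. The output carries a trailing semicolon for pasting into SQL*Plus; drop it if you execute the text through a database driver.

```python
import re

# Unquoted name: a letter, then letters, digits, _, $ or #.
# Conservative guard for this sketch, not Oracle's complete naming rules.
_DG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def rebalance_power_sql(diskgroup: str, power: int, max_power: int = 11) -> str:
    """Build 'alter diskgroup <dg> rebalance power <n>;' for SQL*Plus.

    max_power defaults to 11, the 11g R1 limit mentioned in the post.
    Power 0 is accepted because it halts a running rebalance.
    """
    if not _DG_NAME.fullmatch(diskgroup):
        raise ValueError(f"unexpected diskgroup name: {diskgroup!r}")
    if not 0 <= power <= max_power:
        raise ValueError(f"power must be between 0 and {max_power}, got {power}")
    return f"alter diskgroup {diskgroup} rebalance power {power};"


if __name__ == "__main__":
    print(rebalance_power_sql("DATA", 8))  # alter diskgroup DATA rebalance power 8;
```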

Conclusion

I am not sure exactly why ASM reports the status of a failed disk in different ways, but these are the two scenarios I am aware of so far.

None of the above maintenance operations (removing the failed disk from the disk enclosure, adding the new disk) causes downtime for end users, so they can be performed during normal business hours. The rebalance operation can cause a slight degradation in performance, so increase the power limit to let it complete quickly.

1 comment:

Anonymous said...: This is quite helpful !; December 27, 2011 at 12:49 AM
