HOW-TO: Replace Failed Drives Under SDS

To protect the operating system, avoid serious downtime, and most of all save precious man-hours, implementing RAID 1 (mirroring) with SDS (Solstice DiskSuite) is a very viable solution. It is also quick to execute and has a considerable success rate.

There are many reasons disks and partitions fail. When a failure happens, it should be attended to promptly, because the risk of catastrophic downtime mounts the longer it is left alone. Before replacing a disk suspected of failure, the failure has to be verified. What are the symptoms? How are they diagnosed? The answers to these two questions are equally important.

Symptoms of Disk Failure. There are several indications of a failing or failed disk. More often than not, a combination of the following logs, errors, and outputs confirms the disk failure.

» Utilities such as iostat can confirm the failure.

root@host # iostat -En cxtxdx

The parameter to watch is "Hard Errors". On a failing disk, this count is greater than 0 (and increasing).

» Log files should also show the failure symptoms. Messages such as the ones below should appear in /var/adm/messages.
md_stripe: [ID 641072 kern.warning] WARNING: md: d5: read error on /dev/dsk/c0t1d0s5
md_mirror: [ID 104909 kern.warning] WARNING: md: d5: /dev/dsk/c0t1d0s5 needs maintenance

» The format output can confirm the failure. As seen below, the disk could not be recognized by the system.
AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@1e,600000/pci@0/pci@a/pci@0/pci@8/scsi@1/sd@0,0
       1. c0t1d0 <drive type unknown>
          /pci@1e,600000/pci@0/pci@a/pci@0/pci@8/scsi@1/sd@1,0

» The SDS binary metastat can show the symptom as well.
root@host # metastat d5
d5: Mirror
    Submirror 0: d15
      State: Okay
    Submirror 1: d25
      State: Needs maintenance
...
d25: Submirror of d5
    State: Needs maintenance
    Invoke: metareplace -e d25 c0t1d0s5
...
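
The four checks above can be turned into small filters. The following is only a sketch: the function names are made up for illustration, each function reads previously captured command output on stdin, and the parsing assumes the output layouts shown above (plus the usual "Soft Errors / Hard Errors / Transport Errors" line of iostat -En). It also assumes a POSIX-style awk; on Solaris, nawk or /usr/xpg4/bin/awk may be needed in place of awk.

```sh
# Each function is a filter: it reads command output on stdin and prints
# only the lines of interest. Nothing here touches the disks.
# All function names are hypothetical helpers, not SDS commands.

# iostat -En: print "<device> <count>" for devices with Hard Errors > 0.
hard_error_disks() {
    awk '/Hard Errors:/ {
        for (i = 1; i < NF; i++)
            if ($i == "Hard" && $(i + 1) == "Errors:" && $(i + 2) + 0 > 0)
                print $1, $(i + 2)
    }'
}

# /var/adm/messages: print "<metadevice> <slice>" for each md warning
# that reports a component needing maintenance.
md_maintenance_warnings() {
    awk '/WARNING: md:/ && /needs maintenance/ {
        md = $(NF - 3)
        sub(/:$/, "", md)
        print md, $(NF - 2)
    }' | sort -u
}

# format: print the device name of every disk whose type is unknown.
unknown_drives() {
    awk '/<drive type unknown>/ { print $2 }'
}

# metastat: print the repair command that metastat itself suggests.
suggested_repairs() {
    awk '/Invoke:/ { sub(/^.*Invoke:[ \t]*/, ""); print }'
}
```

Typical use would be along the lines of "iostat -En | hard_error_disks", "md_maintenance_warnings < /var/adm/messages" and "metastat | suggested_repairs". Since format is interactive, its disk list is usually captured with something like "echo | format | unknown_drives"; treat that invocation as an assumption and check it on the host first.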

Several scenarios may arise. Assuming the worst case, the failed disk has to be replaced. The following procedures tackle just that.

Partial Disk Failure. Partial disk failure is characterized by a combination of Hard Errors, logged messages, and the "Needs maintenance" state in metastat.

[1] Delete any state database replicas from the failed disk. A "W" in metadb output indicates replica device write errors.
root@host # metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7
      W   p  l          16              8192            /dev/dsk/c0t1d0s7
      W   p  l          8208            8192            /dev/dsk/c0t1d0s7

root@host # metadb -d /dev/dsk/c0t1d0s7
root@host # metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7

[2] Any submirror on the failing disk that is still in "Okay" state should be detached from its mirror. (This example assumes submirror d20 of mirror d0 is still in "Okay" state.)
root@host # metadetach d0 d20
d0: submirror d20 is detached

[3] Replace the failed disk. In this example, that is the secondary disk at c0t1d0.

[4] Replicate the partition table of the primary disk to the secondary disk.
root@host # prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

[5] Re-create the state databases. The state database replicas have to exist on the new secondary disk as well.
root@host # metadb -a -c2 /dev/dsk/c0t1d0s7

[6] Run metareplace on the partitions that have the "Needs maintenance" flag (as shown above).
root@host # metareplace -e d21 c0t1d0s1
root@host # metareplace -e d24 c0t1d0s4
...

[7] Re-attach the detached submirror.
root@host # metattach d0 d20
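
Two of the steps above lend themselves to a quick sanity check. The sketch below is illustrative only: failed_replica_cmds and same_vtoc are hypothetical helper names, the first assumes the metadb layout shown in step [1] (flags, two numeric columns, then the slice path), and the second assumes that prtvtoc comment lines start with "*". A POSIX shell such as ksh or bash is assumed.

```sh
# Step [1]: read `metadb` output on stdin and print one "metadb -d" command
# for every slice that carries a replica with the W (write error) flag.
# The commands are only printed, never executed.
failed_replica_cmds() {
    awk '$NF ~ /^\/dev\/dsk\// {
        # The last three fields are first blk, block count and the slice
        # path; everything before them is a flag column.
        for (i = 1; i <= NF - 3; i++)
            if ($i ~ /W/) { print "metadb -d " $NF; break }
    }' | sort -u
}

# Step [4]: compare two saved `prtvtoc` outputs, ignoring the comment
# lines (which name the device and therefore always differ).
# Returns 0 when the partition tables match.
same_vtoc() {
    a=$(grep -v '^\*' "$1")
    b=$(grep -v '^\*' "$2")
    [ "$a" = "$b" ]
}
```

For example, "metadb | failed_replica_cmds" prints the deletion command to review before running it. After step [4], saving "prtvtoc /dev/rdsk/c0t0d0s2" and "prtvtoc /dev/rdsk/c0t1d0s2" to two files and passing them to same_vtoc confirms that the copy took effect.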

Full Disk Failure. Full disk failure is confirmed by the output of format. The other characteristics would also be present.

[1] Delete the state databases of the failed disk.

[2] Replace the failed disk.

[3] Duplicate the partition table to the new disk.

[4] Re-create the state databases.

[5] Synchronize the mirrors using metareplace.
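
The five steps map onto the same commands used in the partial-failure procedure. As a rough sketch, the hypothetical helper below only prints that command sequence for review; it does not run anything. The function name, its argument order, and the METADEVICE:SLICE pair notation are all assumptions made for this example, and a POSIX shell such as ksh or bash is assumed.

```sh
# Print (do not run) the full-disk replacement commands.
# usage: full_disk_plan GOOD_DISK NEW_DISK REPLICA_SLICE [METADEVICE:SLICE ...]
# Hypothetical helper; arguments are passed through to the commands as given.
full_disk_plan() {
    if [ $# -lt 3 ]; then
        echo "usage: full_disk_plan GOOD_DISK NEW_DISK REPLICA_SLICE [MD:SLICE ...]" >&2
        return 1
    fi
    good=$1
    new=$2
    replica=$3
    shift 3

    echo "metadb -d /dev/dsk/${new}s${replica}"                          # [1]
    echo "# physically replace ${new} now"                               # [2]
    echo "prtvtoc /dev/rdsk/${good}s2 | fmthard -s - /dev/rdsk/${new}s2" # [3]
    echo "metadb -a -c2 /dev/dsk/${new}s${replica}"                      # [4]
    for pair in "$@"; do                                                 # [5]
        md=${pair%%:*}      # text before the colon
        slice=${pair#*:}    # text after the colon
        echo "metareplace -e ${md} ${new}s${slice}"
    done
}
```

With the device names used in this article, "full_disk_plan c0t0d0 c0t1d0 7 d21:1 d24:4" prints six lines that can be checked against metastat and then run by hand, one at a time.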

Monitor the Re-Synchronization. Re-synchronization of the mirrors can be monitored via metastat.
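
To avoid reading the whole metastat output each time, the progress lines can be filtered out. This is a minimal sketch: resync_progress is a hypothetical name, and it assumes that metastat reports progress on a line of the form "Resync in progress: 15 % done" beneath the mirror's "dN: Mirror" line. Check the wording on the host before relying on it.

```sh
# Filter for metastat output: print "<metadevice> <percent>%" for every
# mirror that reports a resync in progress.
# ASSUMPTION: the progress line looks like "Resync in progress: 15 % done".
resync_progress() {
    awk '
        # Remember the most recent metadevice header, e.g. "d5: Mirror".
        /^[ \t]*d[0-9]+:/ { dev = $1; sub(/:$/, "", dev) }
        /Resync in progress:/ {
            for (i = 1; i < NF; i++)
                if ($i == "progress:") print dev, $(i + 1) "%"
        }'
}
```

Running "metastat | resync_progress" every minute or so shows one line per mirror that is still resyncing; empty output means no resync is reported.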

For Solaris 10, an extra step needs to be executed. The OS needs to be notified of the disk change.

root@host # metadevadm -u c0t1d0
Updating SLVM device relocation information for c0t1d0.

A variation of the preceding command can be done using the full pathname. It achieves the same purpose.
root@host # metadevadm -u /dev/dsk/c0t1d0
Updating SLVM device relocation information for c0t1d0.

