How to troubleshoot a RAID 5 failure on a server

Because server technology keeps advancing, different server models require different handling after a RAID 5 failure.


Nowadays, large-scale applications generally adopt a client/server (C/S) or browser/server (B/S) architecture, and at least one server hosting a large database has to be placed in the central computer room. For security and reliability, the server's disks are usually protected by a Redundant Array of Inexpensive Disks (RAID). RAID 5 is a parity array level without a dedicated parity disk: the parity is distributed across all member disks. It uses data striping and independent access to process multiple access requests in parallel on the same array, while tolerating the failure of any one hard disk in the array.
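To make the parity idea concrete, here is a minimal Python sketch of how a single lost block can be recomputed from the surviving blocks of a stripe. It assumes plain XOR parity and, for simplicity, keeps the parity block in the last position of one stripe; a real RAID 5 controller rotates the parity position from stripe to stripe. The function names are illustrative and do not come from any controller firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_missing(stripe, missing_index):
    """Recompute one missing block from all the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Three data blocks plus parity, as on a four-disk RAID 5 array.
stripe = build_stripe([b"AAAA", b"BBBB", b"CCCC"])
assert rebuild_missing(stripe, 1) == b"BBBB"
```

The same XOR that produces the parity block also recovers any one missing block, which is why the array survives exactly one failed disk and no more.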


In practice, array failures can occur for reasons that cannot be avoided. The most common case is a hard disk going offline: its status is displayed as DDD (Defunct Disk Drive), and the disk has either a physical or a logical fault. A physical fault can only be fixed by replacing the hard disk. A logical fault can be repaired with targeted technical measures that bring the hard disk back online, so that its data keeps its striped layout in the original array and the storage system stays consistent.


However, data recovery on some of HP's older servers (such as the HP LH6000) differs from data recovery on newer servers such as the HP ProLiant series, so different servers handle RAID 5 failures differently. I have dealt with RAID 5 array card data failures caused by accidental power loss on two servers, and solved each one with a different strategy.


Fault repair


One is an HP LH6000 server with four 18GB hard disks in a RAID 5 array on a NetRaid array card. The other is an HP ProLiant ML370 server with four 146GB hard disks in a RAID 5 array on a Smart Array 642 card, with a hot spare (Hot Spare). Both run Windows 2000 as the operating system and SQL Server 2000 as the database.


The fault on the HP LH6000 was as follows: the red light on one hard disk flashed while the machine kept running normally, but before long the system stopped working properly. Only then did the red light on a second hard disk come on.

The solution is as follows:


1. Start the server and press Ctrl+M to enter the NetRaid management program when the self-test reaches the array. Check the array information; the hard disk status is shown as Failed. Use the configuration menu to force one of the Failed hard disks OnLine, then restart the server. In this case the hardware self-test before the operating system loads did not pass, and startup failed.


2. Start the server again and press Ctrl+M to enter the NetRaid management program when the self-test reaches the array. Select the disk array, manually set the hard disk that was just forced OnLine back to Failed, and then manually set the other Failed hard disk to OnLine. Restart the server and boot into the system.


3. Once the system and the database are confirmed to be running normally, enter the array configuration tool and manually set the Failed hard disk to Rebuild. After the rebuild reaches 100%, restart the server. The array and the system are then restored to their original state.
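The article does not say why forcing the first disk OnLine failed while forcing the other one worked. One plausible explanation, offered here as an assumption rather than a finding from the case, is that the disk that dropped out first holds stale data: the array kept accepting writes in degraded mode after it left. The Python sketch below simulates a single stripe under that assumption; the block contents and function names are purely illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_degraded(on_disk, failed_index):
    """Read a stripe with one member marked Failed, reconstructing its block."""
    survivors = [b for i, b in enumerate(on_disk) if i != failed_index]
    blocks = list(on_disk)
    blocks[failed_index] = xor_blocks(survivors)
    return blocks


def simulate():
    # Members 0..2 hold data, member 3 holds parity for this stripe.
    on_disk = [b"old0", b"dat1", b"dat2"]
    on_disk.append(xor_blocks(on_disk))

    # Member 0 drops out first. A later write to its logical block can
    # only be recorded in the parity; member 0 itself still holds "old0".
    expected = [b"new0", b"dat1", b"dat2"]
    on_disk[3] = xor_blocks(expected)

    # Member 1 then drops out as well and the array goes offline.
    # Option A: force the stale member 0 OnLine, leave member 1 Failed.
    option_a = read_degraded(on_disk, failed_index=1)[:3]
    # Option B: force member 1 OnLine, leave the stale member 0 Failed.
    option_b = read_degraded(on_disk, failed_index=0)[:3]
    return expected, option_a, option_b


expected, option_a, option_b = simulate()
assert option_a != expected  # stale member corrupts the reconstructed data
assert option_b == expected  # last-failed member yields consistent data
```

Under this assumption, the disk to force OnLine is the one that failed last, and the disk that failed first is the one to rebuild afterwards, which matches the order that worked in steps 2 and 3.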


The other server, an HP ProLiant ML370 running the ERP system, has four 146GB hot-swappable hard disks configured as a RAID 5 array through a Smart Array card. One of the hard disks suddenly failed during operation. The RAID 5 array automatically activated the hot spare (Hot Spare) to logically replace the damaged drive. Reads and writes continued in their original sequence, and the application and the database were not affected.


Check the hard disk status with HP's own ACU tool to confirm that the disk with the red alarm light is offline. If two hard disks in the RAID 5 array of an HP ProLiant server show red, the system has crashed and the database can no longer be accessed, although the server does not shut down automatically. Once the second hard disk shows red, the data cannot be recovered by conventional means; the only option is to pay a professional third-party data recovery company to recover it.


The array design of the older HP LH6000 series therefore differs considerably from that of the HP ProLiant series. In terms of operation, the LH6000 offers many options: after an array failure the array can even be deleted and re-created, and initialization is selected manually. On the ProLiant series, by contrast, initialization runs automatically in the background once the array is configured, so the array cannot be reconfigured after an array error.


On the HP LH6000, a disk may drop out of the array for unexpected reasons other than disk damage, and maintenance personnel can manually choose Online, Offline, Rebuild, and so on to recover the data. Current HP ProLiant servers no longer drop disks the way the old server did, so a red light generally means the hard disk is damaged and needs to be replaced. You can still try reseating the hard disk to trigger a rebuild (Rebuild) and see whether the disk remains usable for a while.


Keep good backups


As the two examples show, servers of the same brand but different series handle RAID 5 disk failures differently because their technology differs. In both cases the data was preserved after a rebuild, and the following lessons can be drawn:


No technology, however advanced, is foolproof. To keep data safe you must back it up, ideally with a remote backup of the database once a day. Keep at least one new hard disk in reserve. Note that the capacity of a hard disk added to the array must be greater than or equal to that of the failed hard disk.
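The capacity rule can be written down as a simple check to run against the spare inventory before a replacement is attempted. This is a minimal Python sketch; the function name and the inventory figures are hypothetical, and capacities are compared as nominal gigabytes, whereas a real controller compares exact block counts.

```python
def eligible_replacements(failed_gb, spares_gb):
    """Return the spare capacities that are at least as large as the failed disk."""
    return [gb for gb in spares_gb if gb >= failed_gb]


# Hypothetical spare shelf checked against a failed 146GB member.
assert eligible_replacements(146, [72, 146, 300]) == [146, 300]
```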

If possible, a RAID 5 plus hot spare configuration is recommended. This gives two chances to replace a hard disk before data is lost. For general applications, RAID 5 on its own provides a good balance of data access performance, reliability, and usable disk space.
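The trade-off between a hot spare and disk space can be worked out with the usual RAID 5 capacity rule: one member's worth of space goes to parity, and a hot spare contributes no usable space until it takes over for a failed member. The Python sketch below applies that rule to the disk sizes mentioned in this article; the function name is illustrative, and the two 146GB layouts are a comparison of options, not a statement of how the ML370 was actually configured.

```python
def raid5_usable_gb(member_count, disk_gb):
    """Usable capacity of a RAID 5 array: one member's worth is used for parity."""
    if member_count < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    return (member_count - 1) * disk_gb


# Four 18GB disks, all array members (the LH6000 layout).
assert raid5_usable_gb(4, 18) == 54

# Four 146GB disks, two possible layouts:
assert raid5_usable_gb(4, 146) == 438  # all four as members, no hot spare
assert raid5_usable_gb(3, 146) == 292  # three members plus one hot spare
```

Reserving one of four disks as a hot spare costs one disk of usable space, in exchange for an automatic rebuild when the first member fails.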


Administrators must regularly check the state of the array, including the yellow warning lights on the disk array and the drive status in the management software, so that any failure is found and dealt with promptly. Regardless of the array level, back up the data before troubleshooting.
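As a rough aid for such routine checks, the following Python sketch classifies a RAID 5 array from the status of its member drives. The status strings ("online", "failed", "rebuilding") and the state names are assumptions made for illustration; they are not the output format of NetRaid, ACU, or any other management tool, so real tool output would have to be mapped onto them first.

```python
def raid5_state(member_statuses):
    """Classify a RAID 5 array by how many members are not fully online."""
    not_online = sum(1 for status in member_statuses if status != "online")
    if not_online == 0:
        return "optimal"
    if not_online == 1:
        return "degraded"  # still readable, but with no redundancy left
    return "offline"       # two or more members lost: data is inaccessible


assert raid5_state(["online"] * 4) == "optimal"
assert raid5_state(["online", "failed", "online", "online"]) == "degraded"
assert raid5_state(["failed", "failed", "online", "online"]) == "offline"
```

A "degraded" result is the point at which to act: replace or rebuild the affected disk before a second member fails.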