We are using Dell 720 and 730xd servers for our Ceph OSD servers. Here is the process that we use in order to replace a disk and/or remove the faulty OSD from service.
In this example we will attempt to replace OSD #45 (slot #9 of this particular server):
Stop the OSD and unmount the directory:
stop ceph-osd id=45
umount /var/lib/ceph/osd/ceph-45
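Set the OSD's CRUSH weight to 0 so that its data migrates to other OSDs:
ceph osd crush reweight osd.45 0.0
Wait for the cluster to rebalance, then mark the OSD out and make sure the daemon is stopped:
ceph osd out osd.45
service ceph stop osd.45
Remove the OSD from the CRUSH map, delete its authentication key, and remove it from the cluster:
ceph osd crush remove osd.45
ceph auth del osd.45
ceph osd rm osd.45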
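The wait for rebalancing in the sequence above can be scripted rather than watched by hand. The following is a minimal sketch, assuming a Luminous or later cluster where "ceph osd safe-to-destroy" is available. The function name wait_until_safe, the CEPH_BIN override and the 60-second default interval are illustrative assumptions, not part of our procedure.

```shell
# Hypothetical helper: block until Ceph reports that the OSD can be
# removed without reducing data durability.
# Assumes "ceph osd safe-to-destroy" exists (Luminous or later).
# CEPH_BIN is an assumed override so the loop can be dry-run tested.
wait_until_safe() {
    osd_id="$1"            # numeric OSD id, e.g. 45
    interval="${2:-60}"    # seconds between checks (assumed default)
    until "${CEPH_BIN:-ceph}" osd safe-to-destroy "osd.${osd_id}"; do
        sleep "$interval"
    done
}

# Example (not executed here):
#   wait_until_safe 45
```

Run it after the reweight and before the crush remove, auth del and rm commands.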
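List the physical drives on the controller to confirm the enclosure and slot of the failed drive (enclosure 32, slot 9 in this example):
megacli -PDList -a0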
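The drive listing is long, so it can help to reduce it to one line per drive. This is a rough sketch that assumes the listing contains the usual "Enclosure Device ID", "Slot Number" and "Firmware state" fields. The function name pd_summary is hypothetical.

```shell
# Hypothetical helper: condense a megacli physical-drive listing
# (read from stdin) into one "[enclosure:slot] state" line per drive.
# Assumes the field labels shown below appear at the start of a line.
pd_summary() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/      { printf "[%s:%s] %s\n", enc, slot, $2 }
    '
}

# Example (not executed here):
#   megacli -PDList -a0 | pd_summary
```

The bracketed pair it prints is in the same form that the -physdrv argument expects in the commands that follow.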
If the drive is not already offline, take it offline:
megacli -pdoffline -physdrv[32:9] -a0
Mark disk as missing:
megacli -pdmarkmissing -physdrv[32:9] -a0
Permanently remove drive from array:
megacli -pdprprmv -physdrv[32:9] -a0
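Because the three removal commands differ only in the action flag, one way to avoid typing the wrong slot is to generate them from a single enclosure and slot pair and review them before running anything. The sketch below only prints the commands. The function name retire_drive_cmds and the default adapter of 0 are assumptions for illustration.

```shell
# Hypothetical helper: print (do not run) the offline, mark-missing and
# remove commands for one drive, so they can be reviewed first.
retire_drive_cmds() {
    pd="[$1:$2]"          # enclosure:slot, e.g. 32 9 -> [32:9]
    adapter="${3:-0}"     # adapter number (assumed default 0)
    for action in -pdoffline -pdmarkmissing -pdprprmv; do
        printf 'megacli %s -physdrv%s -a%s\n' "$action" "$pd" "$adapter"
    done
}

# Example (prints three lines for review):
#   retire_drive_cmds 32 9
```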
NOW PHYSICALLY REPLACE THE BAD DRIVE WITH A NEW ONE.
Set the drive state to online if it is not already:
megacli -PDOnline -PhysDrv[32:9] -a0
Create a RAID-0 array on the new drive:
megacli -CfgLdAdd -r0[32:9] -a0
You may need to discard the preserved cache before creating the array in the previous step.
First get the preserved cache list:
megacli -GetPreservedCacheList -a0
Clear whichever entry you need to (logical drive 2 in this example):
megacli -DiscardPreservedCache -L2 -a0
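Recreate the OSD, which will use BlueStore as the new default. Replace hqosdNUM with the OSD server's host name and /dev/sdX with the device name of the new drive, using the same device in both commands:
ceph-deploy disk zap hqosdNUM /dev/sdX
ceph-deploy osd create --data /dev/sdX hqosdNUM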
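The zap and create commands must target the same device, and the two subcommands take host and device in a different order. A small wrapper that prints both commands from one host and device pair makes that harder to get wrong. This is only a sketch: the function name recreate_osd_cmds is hypothetical, and it prints the commands for review rather than running them.

```shell
# Hypothetical helper: print (do not run) the zap and create commands
# for one host/device pair so both lines reference the same device.
recreate_osd_cmds() {
    host="$1"    # OSD server host name
    dev="$2"     # block device of the new drive, e.g. /dev/sdX
    case "$dev" in
        /dev/*) ;;   # looks like a device path, continue
        *) echo "device must be a /dev path: $dev" >&2; return 1 ;;
    esac
    printf 'ceph-deploy disk zap %s %s\n' "$host" "$dev"
    printf 'ceph-deploy osd create --data %s %s\n' "$dev" "$host"
}

# Example (prints two lines for review):
#   recreate_osd_cmds hqosdNUM /dev/sdX
```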