
Broken Drives

Draining an OSD

To drain an OSD before removal, set its CRUSH weight to zero instead of marking it out: ceph osd crush reweight osd.<id> 0.0. Marking the OSD out (ceph osd out) shifts its weight onto the surrounding OSDs, so removing it from the CRUSH map later triggers a second rebalance.
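As a concrete sketch (the OSD id is a placeholder, and the leading echo keeps it a dry run):

```shell
# Drain a hypothetical OSD by zeroing its CRUSH weight. Drop the
# leading `echo` to execute for real, then watch `ceph -s` until the
# resulting backfill settles.
OSD_ID=42
echo ceph osd crush reweight "osd.${OSD_ID}" 0.0
```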

Deleting a node from ceph cluster

  1. Drain all of the node’s OSDs and wait for the data to migrate (expect at least a week). If needed, adjust the recovery speed by changing the mClock profile:

    ceph config get osd osd_mclock_profile
    ceph config set osd osd_mclock_profile high_recovery_ops|balanced|high_client_ops
  2. Drain the Kubernetes node (cordon it and evict its pods).

  3. Get the OSD numbers for the node:

    ceph osd tree

    For example, suppose the resulting list of OSD ids is 183 184 185 186 202 203 204 205 206 207 208.

  4. Purge those OSDs in tools pod:

    for i in 183 184 185 186 202 203 204 205 206 207 208; do ceph osd purge $i --force && sleep 5; done

    The sleep between purges avoids failures while each OSD’s auth entry is being removed.

  5. If the node will not return, edit the corresponding CephCluster object and remove the node from it.

  6. Delete the Kubernetes deployments for those OSDs (k is shorthand for kubectl):

    for i in 183 184 185 186 202 203 204 205 206 207 208; do k delete deployment -n rook rook-ceph-osd-$i; done
  7. Replace the broken drive(s). Boot the cordoned node.

  8. On the node, delete all of Ceph’s LVs, VGs, and PVs, then zap the drives:

    for i in ...; do sgdisk --zap-all --clear --mbrtogpt -g /dev/$i && wipefs -a /dev/$i; done

    Afterwards, blkid should not show any Ceph-related entries for those drives.

  9. Delete the OSD directories from /var/lib/rook (leave the MON directory in place if a MON runs on the node).

  10. Uncordon the node.

  11. Either revert the CephCluster object or wait for the operator to reconcile it; changing any parameter in the CephCluster spec will also trigger a reconcile. Verify that the node has been added back.

  12. Revert the osd_mclock_profile setting from step 1.
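The cluster-side part of the steps above can be sketched as a dry-run script; every command is prefixed with echo so nothing executes until you remove it, and the node name and OSD ids are placeholders:

```shell
# Dry-run sketch of the cluster-side steps (1-6). Remove the leading
# `echo`s to execute; NODE and the OSD ids below are placeholders.
NODE=node-12

# Step 1: speed up recovery while the node's OSDs drain.
echo ceph config set osd osd_mclock_profile high_recovery_ops

# Step 2: drain the Kubernetes node.
echo kubectl cordon "$NODE"
echo kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Step 3: list the OSD ids under the node's CRUSH bucket.
echo ceph osd ls-tree "$NODE"

# Step 4: purge the drained OSDs, pausing between removals.
for i in 183 184 185; do
  echo ceph osd purge "$i" --force
  echo sleep 5
done

# Step 6: delete the corresponding OSD deployments.
for i in 183 184 185; do
  echo kubectl delete deployment -n rook "rook-ceph-osd-$i"
done
```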

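On the node itself, the cleanup in steps 8 and 9 can be sketched the same way; the VG names, device names, and directory layout are placeholders (list the real VGs and PVs with vgs and pvs first), and the echos keep it a dry run:

```shell
# Dry-run sketch of the node-side cleanup (steps 8-9); drop the
# `echo`s to execute. VG and device names are placeholders.
for vg in ceph-0f3a ceph-9c21; do   # hypothetical ceph-volume VGs
  echo vgremove -f "$vg"            # removes the VG and its LVs
done
for dev in sdb sdc; do              # hypothetical data drives
  echo pvremove -ff "/dev/$dev"
  echo sgdisk --zap-all --clear --mbrtogpt -g "/dev/$dev"
  echo wipefs -a "/dev/$dev"
done

# Step 9: remove the OSD directories under /var/lib/rook, keeping any
# mon-* directory in case a MON lives on this node.
ROOK_DIR=/var/lib/rook
for d in "$ROOK_DIR"/*; do
  case "$(basename "$d")" in
    mon-*) ;;                       # keep monitor data
    *) echo rm -rf "$d" ;;
  esac
done
```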
This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.