# Broken Drives
## Draining an OSD
To remove an OSD, drain it first with `ceph osd crush reweight osd.<id> 0.0` rather than `ceph osd out`. Marking the OSD out shifts its weight onto the surrounding OSDs, so the final removal from the CRUSH map triggers a second rebalance; reweighting to 0 moves the data only once.
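As a minimal sketch (the id `12` is a placeholder, not from this cluster), the drain plus the safety check before removal looks like:

```shell
# Zero the CRUSH weight so the data migrates off in a single rebalance.
ceph osd crush reweight osd.12 0.0

# Watch recovery until the cluster returns to HEALTH_OK.
ceph -s

# Optional safety check before destroying/purging the OSD.
ceph osd safe-to-destroy osd.12
```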
## Deleting a node from the Ceph cluster
Drain all of the node's OSDs and wait for the data to migrate (expect at least a week). If needed, adjust the recovery speed:

```shell
ceph config get osd osd_mclock_profile
ceph config set osd osd_mclock_profile <high_recovery_ops|balanced|high_client_ops>
```

Drain the node.
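Drain progress can be followed with the usual status commands; a sketch:

```shell
# Overall recovery/backfill progress.
ceph -s

# Per-OSD utilization; the draining node's OSDs should trend toward 0 used.
ceph osd df tree
```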
Get the OSD numbers for the node:

```shell
ceph osd tree
```

For example, the list is `183 184 185 186 202 203 204 205 206 207 208`. Purge those OSDs in the tools pod:
```shell
for i in 183 184 185 186 202 203 204 205 206 207 208; do ceph osd purge $i --force && sleep 5; done
```

The `sleep` pause prevents problems removing the OSDs' auth entries.
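After the purge it is worth confirming that neither CRUSH entries nor cephx keys remain for the removed ids; a sketch using one of the example ids:

```shell
# The purged id should no longer appear in the CRUSH tree...
ceph osd tree | grep 'osd.183' || echo "osd.183 gone from CRUSH"

# ...and its cephx key should be gone as well.
ceph auth ls | grep 'osd.183' || echo "auth entry for osd.183 removed"
```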
If the node will not return, edit the corresponding CephCluster object and delete the node from it.
Delete the deployments for the OSDs:

```shell
for i in 183 184 185 186 202 203 204 205 206 207 208; do k delete deployment -n rook rook-ceph-osd-$i; done
```

Replace the broken drive(s) and boot the cordoned node.
On the node, delete all LVs, VGs, and PVs, then zap the drives:

```shell
for i in ...; do sgdisk --zap-all --clear --mbrtogpt -g /dev/$i && wipefs -a /dev/$i; done
```

After zapping, no Ceph drives should appear in the output of `blkid`.

Delete the OSD directories from `/var/lib/rook` (but not the MON directory, if one exists on the node). Uncordon the node.
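The LV/VG/PV cleanup above can be sketched with plain LVM tools. OSD volume groups created by `ceph-volume` are typically named `ceph-<uuid>`; the filter below assumes that convention, and the device name is a placeholder:

```shell
# Inspect what is there first.
lvs; vgs; pvs

# Remove every Ceph volume group together with its logical volumes.
for vg in $(vgs --noheadings -o vg_name | grep '^ *ceph-'); do
  lvremove -y "$vg"      # removes all LVs in the VG
  vgremove -y "$vg"
done

# Wipe the PV labels on the freed devices (replace sdX with the real device).
pvremove /dev/sdX
```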
Either revert the CephCluster object change or wait for the operator to reconcile it; changing any parameter in the CephCluster also kicks off a reconcile. Verify that the node is added back.
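If the operator does not pick the change up on its own, restarting it is one way to force a reconcile (assuming the `rook` namespace and the standard operator deployment name); a sketch:

```shell
# Restart the rook operator to trigger a fresh reconcile.
kubectl -n rook rollout restart deploy/rook-ceph-operator

# The node and its OSDs should reappear here once reconciled.
ceph osd tree
```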
Revert the `osd_mclock_profile` setting.
