Our company recently noticed GKE machines with particularly low resource utilization. Idle resources still cost money, so we created a new Node Pool with smaller, cheaper machine types, then used cordon + drain (or manually deleted Pods) to reschedule the Pods onto the new Node Pool. On GKE, simply deleting the old Node Pool would achieve the same result, but we chose the safer approach.
The core idea is to mark a Node as unschedulable, then evict its Pods so the scheduler places them on the desired hosts. Let's walk through how to safely handle node failures or upgrades.
Scenario
If a Worker Node needs a system upgrade or is about to fail, how do we handle it safely?
Procedure
1. Stop Scheduling on the Node
Use cordon to mark the target node as unschedulable. This only changes the Node's status to SchedulingDisabled: no new Pods will be scheduled onto it, while existing Pods keep serving traffic normally.
kubectl cordon <node-name>
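To confirm the cordon took effect, check the node list. A sketch (the node name below is a placeholder for your actual node):

```shell
# Cordon the node (placeholder name)
kubectl cordon gke-pool-old-node-1

# The cordoned node's STATUS column should now include "SchedulingDisabled",
# e.g. "Ready,SchedulingDisabled"; other nodes remain plain "Ready"
kubectl get nodes
```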
2. Evict Pods from the Node
Two approaches:
First: drain the node, evicting all its Pods at once. If every replica of a Deployment happens to run on this node, your service may experience downtime. Note that --force also evicts Pods not managed by a controller, and those Pods will not be recreated.
kubectl drain <node-name> --force --ignore-daemonsets
Second: delete Pods one at a time, waiting for each replacement to become Ready on another node before deleting the next. This is slower but safer.
kubectl delete pod <pod-name>
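The one-by-one approach can be sketched as follows. It assumes the node is already cordoned and the Pods are managed by Deployments/ReplicaSets, so deleted Pods are recreated elsewhere (the node name is a placeholder):

```shell
NODE=gke-pool-old-node-1   # placeholder node name

# List the Pods still running on the node. DaemonSet Pods will simply be
# recreated on the same node, so skip those.
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE"

# Delete one Pod; its controller creates a replacement, and the scheduler
# places it on a schedulable node because this node is cordoned.
kubectl delete pod <pod-name> -n <namespace>

# Before deleting the next Pod, confirm the replacement is Running and
# Ready on another node.
kubectl get pods -n <namespace> -o wide
```

Repeating this loop keeps at least the remaining replicas serving traffic throughout the migration.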
3. Restore Scheduling or Delete the Node
After upgrade, mark the node as schedulable again:
kubectl uncordon <node-name>
If the node has confirmed hardware issues, delete it:
kubectl delete node <node-name>
Important Notes
- Each workload should run more than one replica; otherwise deleting its only Pod causes downtime.
- Check for special scheduling constraints (nodeSelector, taints/tolerations, affinity rules); otherwise evicted Pods might fail to be rescheduled.
- Configure PodAntiAffinity so replicas spread across different nodes and a single node drain cannot take them all down.
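As a sketch of the anti-affinity point, here is a hypothetical Deployment (names and image are illustrative) whose replicas prefer to land on different nodes, using the standard `preferredDuringSchedulingIgnoredDuringExecution` form:

```shell
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Prefer (not require) scheduling replicas onto different nodes,
          # keyed on the node hostname.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.25
EOF
```

With this in place, draining one node evicts at most one replica of the Deployment in the common case, so the service keeps serving.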
Conclusion
In production, if you're not fully confident, migrate by deleting Pods one at a time rather than draining the whole node.
Feel free to leave a comment on my blog. Your feedback motivates me to keep writing. Thank you for reading, and let’s grow together to become better versions of ourselves.