1. CrashLoopBackOff:
- Description: A pod repeatedly crashes and restarts.
- Troubleshooting:
- Check pod logs:
kubectl logs <pod-name>
- Describe the pod for more details:
kubectl describe pod <pod-name>
- Investigate the application's start-up and initialization code.
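Kubernetes restarts a crash-looping container, so plain kubectl logs often shows the fresh instance rather than the one that crashed. The checks above can be sketched as a small POSIX sh helper. The name crashloop_debug is hypothetical, and falling back to the "default" namespace is an assumption.

```shell
# Hypothetical helper: collect CrashLoopBackOff diagnostics for one pod.
# Usage: crashloop_debug <pod-name> [namespace]   (namespace defaults to "default")
crashloop_debug() {
  pod="$1"
  ns="${2:-default}"
  # Output of the previous, crashed container instance
  # (add -c <container> for multi-container pods).
  kubectl logs "$pod" -n "$ns" --previous
  # Restart count plus reason and exit code of each container's last termination.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # The Events section often shows failing probes or back-off messages.
  kubectl describe pod "$pod" -n "$ns"
}
```

A last-termination reason of OOMKilled points at memory limits rather than application code.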
2. ImagePullBackOff:
- Description: Kubernetes cannot pull the container image from the registry.
- Troubleshooting:
- Verify the image name and tag.
- Check the image registry credentials (for private registries, the pod's imagePullSecrets).
- Ensure the cluster nodes have network access to the registry.
- Ensure the image exists in the specified registry.
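A minimal sketch of these checks follows, assuming a POSIX shell. imagepull_debug is a hypothetical helper name. It prints the image reference exactly as the pod requests it, the pull secrets the pod uses, and the pod's events, where the registry's error message usually appears.

```shell
# Hypothetical helper: inspect why an image cannot be pulled.
# Usage: imagepull_debug <pod-name> [namespace]
imagepull_debug() {
  pod="$1"
  ns="${2:-default}"
  # Image references as requested by the pod spec (registry/repository:tag).
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].image}{"\n"}'
  # Pull secrets configured on the pod; empty output means none.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # Events for this pod; pull failures include the registry's error text.
  kubectl get events -n "$ns" --field-selector="involvedObject.name=$pod"
}
```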
3. Pending Pods:
- Description: Pods remain in the "Pending" state and are not scheduled.
- Troubleshooting:
- Check node resources (CPU, memory) to ensure there is enough capacity.
- Ensure the nodes are labeled correctly if using node selectors or affinities.
- Verify the nodes have no taints that the pod does not tolerate, since these prevent scheduling.
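The scheduler records why it could not place a pod in a FailedScheduling event, which usually names the cause directly. A minimal sketch, assuming a POSIX shell. pending_debug is a hypothetical helper name.

```shell
# Hypothetical helper: show pending pods and the scheduler's reasons.
# Usage: pending_debug [namespace]
pending_debug() {
  ns="${1:-default}"
  # Pods the scheduler has not placed yet.
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  # Scheduler messages such as insufficient CPU/memory or untolerated taints.
  kubectl get events -n "$ns" --field-selector=reason=FailedScheduling
  # Node labels, to compare with nodeSelector / affinity rules.
  kubectl get nodes --show-labels
  # Taints per node, to compare with the pods' tolerations.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
}
```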
4. Node Not Ready:
- Description: One or more nodes are in a "NotReady" state.
- Troubleshooting:
- Check node status:
kubectl describe node <node-name>
- Review kubelet logs on the affected node.
- Ensure the node has network connectivity.
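The node's status conditions usually say why it is NotReady. A minimal sketch, assuming a POSIX shell and, for the commented commands, that the kubelet runs as a systemd unit on the node. node_debug is a hypothetical helper name.

```shell
# Hypothetical helper: print a node's status conditions one per line.
# Usage: node_debug <node-name>
node_debug() {
  node="$1"
  # Conditions such as Ready, MemoryPressure, DiskPressure and PIDPressure.
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
# On the affected node itself (assumes a systemd-managed kubelet):
#   systemctl status kubelet
#   journalctl -u kubelet --since "1 hour ago"
```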
5. Service Not Working:
- Description: Services are not accessible or are not routing traffic correctly.
- Troubleshooting:
- Check the service and endpoints:
kubectl get svc
kubectl get endpoints
- Verify network policies and firewall rules.
- Ensure the pods backing the service are healthy and running.
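A Service only forwards to Ready pods whose labels match its selector, so comparing the selector, the endpoints and the pod labels side by side usually locates the break. A minimal sketch, assuming a POSIX shell. svc_debug is a hypothetical helper name.

```shell
# Hypothetical helper: check a Service, its endpoints and candidate pods.
# Usage: svc_debug <service-name> [namespace]
svc_debug() {
  svc="$1"
  ns="${2:-default}"
  # Type, ports and SELECTOR column of the Service.
  kubectl get svc "$svc" -n "$ns" -o wide
  # An empty ENDPOINTS column usually means no Ready pod matches the selector.
  kubectl get endpoints "$svc" -n "$ns"
  # Pod labels and readiness, to compare with the selector.
  kubectl get pods -n "$ns" --show-labels
  # NetworkPolicies in the namespace that may block the traffic.
  kubectl get networkpolicy -n "$ns"
}
```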
6. Insufficient Resources:
- Description: Pods cannot be scheduled due to insufficient resources.
- Troubleshooting:
- Review resource requests and limits in pod specifications.
- Scale the cluster by adding more nodes.
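A minimal sketch for comparing what pods request with what nodes can offer, assuming a POSIX shell with a grep that supports -A. resources_debug is a hypothetical helper name, and the eight lines of context kept after "Allocated resources" are an arbitrary choice.

```shell
# Hypothetical helper: compare node capacity with pod resource requests.
# Usage: resources_debug [namespace]
resources_debug() {
  ns="${1:-default}"
  # Requested vs. allocatable CPU/memory per node.
  kubectl describe nodes | grep -A 8 "Allocated resources"
  # Requests and limits declared by each pod's containers.
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
}
# Live usage, only if metrics-server is installed:
#   kubectl top nodes
```

The scheduler compares requests, not actual usage, against each node's allocatable capacity, so a node can look idle and still reject a pod.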
7. PersistentVolumeClaims Pending:
- Description: PVCs remain in a "Pending" state.
- Troubleshooting:
- Check if there are available PVs that match the PVC specifications.
- Ensure the storage class exists and is configured correctly.
- Verify that the underlying storage backend is healthy.
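A minimal sketch of these checks, assuming a POSIX shell. pvc_debug is a hypothetical helper name. It prints the claim's events and requested class, access modes and size, then the available PVs and StorageClasses to compare them against.

```shell
# Hypothetical helper: explain why a PVC is not bound.
# Usage: pvc_debug <pvc-name> [namespace]
pvc_debug() {
  pvc="$1"
  ns="${2:-default}"
  # Events explain why binding or provisioning has not happened.
  kubectl describe pvc "$pvc" -n "$ns"
  # Requested StorageClass, access modes and size.
  kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.spec.storageClassName}{"\t"}{.spec.accessModes}{"\t"}{.spec.resources.requests.storage}{"\n"}'
  # PVs that could match, and StorageClasses (the default is marked).
  kubectl get pv
  kubectl get storageclass
}
```

With a StorageClass whose volumeBindingMode is WaitForFirstConsumer, a PVC stays Pending until a pod using it is scheduled, so Pending alone is not always an error.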
8. Pod Stuck Terminating:
- Description: Pods get stuck in a "Terminating" state.
- Troubleshooting:
- Check for finalizers that might be preventing pod deletion.
- Review the logs for shutdown hooks or long-running processes.
- Force delete the pod if necessary:
kubectl delete pod <pod-name> --force --grace-period=0
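Before force deleting, it helps to see what is holding the pod. A minimal sketch, assuming a POSIX shell. terminating_debug is a hypothetical helper name.

```shell
# Hypothetical helper: show what may be blocking pod deletion.
# Usage: terminating_debug <pod-name> [namespace]
terminating_debug() {
  pod="$1"
  ns="${2:-default}"
  # Finalizers must be cleared by their controller before the object is removed.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.metadata.finalizers}{"\n"}'
  # Node the pod runs on; an unreachable node cannot confirm the container stopped.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}{"\n"}'
  # Seconds the kubelet waits after SIGTERM before sending SIGKILL.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'
}
# Last resort, only if the controller owning a finalizer is gone:
#   kubectl patch pod <pod-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
```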
9. DNS Resolution Issues:
- Description: DNS lookups within the cluster fail.
- Troubleshooting:
- Check the DNS pod logs (e.g., CoreDNS):
kubectl logs -n kube-system <coredns-pod>
- Ensure the DNS service is running:
kubectl get svc -n kube-system
- Verify network policies and firewall rules do not block DNS traffic.
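A minimal sketch, assuming a POSIX shell and a cluster where CoreDNS pods carry the label k8s-app=kube-dns and the DNS Service is named kube-dns (the kubeadm convention). dns_debug, the pod name dns-test and the busybox:1.28 image are illustrative choices, not requirements.

```shell
# Hypothetical helper: check cluster DNS end to end.
# Usage: dns_debug
dns_debug() {
  # CoreDNS pods and their recent logs (label is an assumption, see above).
  kubectl get pods -n kube-system -l k8s-app=kube-dns
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
  # The DNS Service, usually named kube-dns even when CoreDNS serves it.
  kubectl get svc kube-dns -n kube-system
  # Resolve a well-known Service name from a throwaway pod.
  kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
}
```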
10. Error from server (Forbidden):
- Description: The user does not have permission to perform the requested operation.
- Troubleshooting:
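- Confirm which identity (user or service account) the request is actually made as. A Forbidden error means the API server authenticated the request but the authorizer (typically RBAC) did not allow it.
- Ensure that identity has the required verbs on the resource through a Role or ClusterRole bound to it by a RoleBinding or ClusterRoleBinding.
- Check the namespace of the resource: a RoleBinding only grants permissions within its own namespace.
- Validate service account permissions as well as namespace-level permissions.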
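kubectl auth can-i asks the API server whether a given identity may perform an action, which tests the RBAC rules directly. A minimal sketch, assuming a POSIX shell. rbac_debug and its argument order are hypothetical.

```shell
# Hypothetical helper: test RBAC permissions for yourself and, optionally,
# for a service account in the same namespace.
# Usage: rbac_debug <verb> <resource> [namespace] [serviceaccount]
rbac_debug() {
  verb="$1"
  resource="$2"
  ns="${3:-default}"
  sa="${4:-}"
  # Prints "yes" or "no" for the current identity.
  kubectl auth can-i "$verb" "$resource" -n "$ns"
  if [ -n "$sa" ]; then
    # Impersonate the service account to test its permissions.
    kubectl auth can-i "$verb" "$resource" -n "$ns" --as="system:serviceaccount:$ns:$sa"
  fi
  # Everything the current identity may do in the namespace.
  kubectl auth can-i --list -n "$ns"
}
```

Recent kubectl versions also provide kubectl auth whoami, which shows the identity the API server sees.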
11. Pod Timeout:
- Description: The pod has not started successfully within the specified timeout period.
- Troubleshooting:
- Check if the pod has readiness and liveness probes defined and if they are correctly configured.
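- If the application is simply slow to start, increase the probe timing (initialDelaySeconds, timeoutSeconds, failureThreshold) or add a startupProbe; otherwise check the pod logs for errors or warnings.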
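A minimal sketch for reviewing probe settings and failures, assuming a POSIX shell. probe_debug is a hypothetical helper name. Probe failures are recorded as events with reason Unhealthy.

```shell
# Hypothetical helper: print each container's probes and recent probe failures.
# Usage: probe_debug <pod-name> [namespace]
probe_debug() {
  pod="$1"
  ns="${2:-default}"
  # Probe definitions per container; an empty value means the probe is not set.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}  startup: {.startupProbe}{"\n"}  readiness: {.readinessProbe}{"\n"}  liveness: {.livenessProbe}{"\n"}{end}'
  # Events such as "Readiness probe failed: ..." for this pod.
  kubectl get events -n "$ns" --field-selector="involvedObject.name=$pod,reason=Unhealthy"
}
```

While a startupProbe is configured and has not yet succeeded, liveness and readiness probes are not run, which keeps a slow-starting container from being killed early.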
More Kubernetes Error Codes we might encounter:
- ImagePullFailed: This error occurs when Kubernetes is unable to pull an image from a registry. Common causes are an image that does not exist, a registry that is unavailable, or missing permission to access the image.
- PodCrashExitCode: This error occurs when a pod crashes with a non-zero exit code. Common causes are a failing container process, a container exceeding its resource limits, or a runtime error in the application.
- ContainerCannotRun: This error occurs when Kubernetes is unable to start a container. Common causes are a missing or corrupted container image, resources the container requires that are not available on the node, or an image that is not compatible with the node’s operating system.
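For crashes with a non-zero exit code, the code itself is a useful hint. By the usual Unix convention, a code above 128 means the process was killed by signal (code minus 128), and 126/127 mean the command could not be executed or was not found. A minimal sketch, assuming a POSIX shell; explain_exit_code and last_exit_codes are hypothetical helpers, and the wording of each explanation is illustrative.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited successfully" ;;
    1)   echo "general application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application-specific exit code"
      fi
      ;;
  esac
}

# Hypothetical helper: last termination exit code of each container in a pod.
last_exit_codes() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.lastState.terminated.exitCode}{"\n"}{end}'
}
# Example:
#   last_exit_codes web-1 prod | while read -r c; do [ -n "$c" ] && explain_exit_code "$c"; done
```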