Kubernetes kubelet monitoring

The Kubernetes kubelet is the primary node agent responsible for ensuring that the containers on each node are running and healthy. Monitoring the kubelet helps you troubleshoot pod startup failures, track container runtime errors, detect Pod Lifecycle Event Generator (PLEG) staleness, and ensure consistent pod life cycle management across your cluster. Site24x7 automatically discovers and monitors the kubelet running on each Kubernetes node.

How it works

The Site24x7 agent collects the following kubelet metrics from each node:

Runtime and process-level metrics for the kubelet process.
Pod life cycle and container state metrics.
Container runtime operation counts, errors, and durations.
Pod startup and worker operation timing.
Storage operation success and failure tracking.
Cgroup manager operation durations.
PLEG (Pod Lifecycle Event Generator) relist intervals and durations.

These metrics help you detect scheduling delays, startup failures, container runtime errors, and degradation in pod life cycle events.
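
As a rough illustration of what collecting such metrics can involve, the sketch below parses a small sample in the Prometheus text exposition format, which is the format the kubelet uses on its metrics endpoint. The sample values, the `parse_metrics` helper, and the selection of metric names are assumptions made for this example; it is not the Site24x7 agent's implementation.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text exposition format (values are invented).
SAMPLE = """\
# HELP kubelet_running_pods Number of pods that have a running pod sandbox
# TYPE kubelet_running_pods gauge
kubelet_running_pods 12
# TYPE kubelet_runtime_operations_total counter
kubelet_runtime_operations_total{operation_type="list_containers"} 340
kubelet_runtime_operations_total{operation_type="create_container"} 15
"""

# Simplified pattern: metric name, optional {labels}, then the sample value.
# It does not handle label values that contain a closing brace.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(3))
    return dict(totals)


metrics = parse_metrics(SAMPLE)
print(metrics["kubelet_running_pods"])
print(metrics["kubelet_runtime_operations_total"])
```

Summing across labels gives a single total per metric, which is roughly the shape of a row such as Kubelet Runtime Operations in the tables below. A real collector would usually keep the labels so that operations can be broken down by type.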

Prerequisites

Install Site24x7 Kubernetes agent version 22.1.00 or above.
For existing customers, upgrade your Kubernetes agent to version 22.1.00 or above.

Note

The Site24x7 Kubernetes agent must be installed and running on the cluster before enabling kubelet monitoring.

View your Kubernetes Kubelet monitor

Once the agent is upgraded, the Site24x7 Kubernetes monitoring agent starts collecting the kubelet metrics automatically.

To view your Kubernetes Kubelet monitor:

Log in to your Site24x7 account.
Navigate to K8s. Then, select the cluster name.
Click Kubelet.

This opens the list of kubelet monitors in the selected cluster. Select a monitor to view its detailed metrics.

Supported metrics

The following tables list all kubelet metrics collected at every poll interval.

Utilization

Metric name	Description	Units
Running Pods	The total number of pods that have a running pod sandbox at the last poll time.	Count
Pod Start Count	The number of pod starts during the last poll period. A pod start spans from the kubelet first seeing the pod to the pod starting to run.	Count
Total Rest Client Requests	The total number of HTTP requests made by the kubelet's REST client during the last poll period.	Count
Container Log File System Usage	The total bytes used by container logs on the filesystem at the last poll time.	Bytes
Process CPU Time	The CPU time consumed by the kubelet process during the last poll period.	Seconds
Process Open File Descriptors	The number of file descriptors held open by the kubelet process at the last poll time.	Count
Container States
Containers in Created State	The number of containers in the created state at the last poll time.	Count
Containers in Exited State	The number of containers in the exited state at the last poll time.	Count
Containers in Running State	The number of containers in the running state at the last poll time.	Count
Pod Starts Duration
Average Pod Start Duration	The average time per pod, from the kubelet first seeing the pod to the pod starting to run, during the last poll period.	Seconds
Total Pod Starts Duration	The total time, across all pods, from the kubelet first seeing a pod to the pod starting to run, during the last poll period.	Seconds
Runtime Operations
Kubelet Runtime Operations	The total number of runtime operations during the last poll period.	Count
Kubelet Runtime Operations Errors	The total number of runtime operation errors during the last poll period.	Count
Runtime Operations Duration
Average Runtime Operation Duration	The average time taken per runtime operation during the last poll period.	Seconds
Total Runtime Operations Duration	The total time taken by all runtime operations during the last poll period.	Seconds
Process Memory Usage
Process Resident Memory	The resident memory used by the kubelet process at the last poll time.	Bytes
Process Virtual Memory	The virtual memory used by the kubelet process at the last poll time.	Bytes
Go Usage
Go Threads	The number of OS threads created by the Go runtime in the kubelet process during the last poll period.	Count
Goroutines	The number of goroutines that currently exist for the kubelet process during the last poll period.	Count
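
The Average and Total duration rows above come in pairs, which is the pattern you get when a per-poll value is derived from a cumulative sum and a cumulative count. The sketch below shows one common way to do that arithmetic. It assumes the underlying values are cumulative counters that only grow between polls; the `poll_average` helper and the sample numbers are hypothetical, and this is not a description of how Site24x7 computes these metrics.

```python
def poll_average(prev_sum, prev_count, curr_sum, curr_count):
    """Return (total, average) for one poll period from cumulative values.

    prev_* are the cumulative sum and count read at the previous poll,
    curr_* are the values read at the current poll.
    """
    delta_sum = curr_sum - prev_sum
    delta_count = curr_count - prev_count
    if delta_count <= 0 or delta_sum < 0:
        # No new observations, or the counters were reset (for example after
        # a kubelet restart); report zero rather than a misleading value.
        return 0.0, 0.0
    return delta_sum, delta_sum / delta_count


# Hypothetical readings: 4 pods started in this period, taking 6 seconds in total.
total, average = poll_average(prev_sum=40.0, prev_count=20,
                              curr_sum=46.0, curr_count=24)
print(total, average)
```

With these sample readings, the total corresponds to a row such as Total Pod Starts Duration and the average to Average Pod Start Duration.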

Operations

Metric name	Description	Units
Pod Worker Starts	The total count of pods that started a worker during the last poll period.	Count
Cgroup Manager Operations	The total count of cgroup manager operations during the last poll period.	Count
Pod Worker Operations
Pod Worker Operations - Create	The total number of pod sync operations of type Create performed by the kubelet during the last poll period.	Count
Pod Worker Operations - Sync	The total number of pod sync operations of type Sync performed by the kubelet during the last poll period.	Count
Pod Worker Operations - Update	The total number of pod sync operations of type Update performed by the kubelet during the last poll period.	Count
Pod Worker Operations Duration
Average Pod Worker Operation Duration - Create	The average time the kubelet took to sync a pod for Create operations during the last poll period.	Seconds
Average Pod Worker Operation Duration - Sync	The average time the kubelet took to sync a pod for Sync operations during the last poll period.	Seconds
Average Pod Worker Operation Duration - Update	The average time the kubelet took to sync a pod for Update operations during the last poll period.	Seconds
Pod Worker Start Duration
Average Pod Worker Start Duration	The average time from the kubelet seeing a pod to starting a worker for it during the last poll period.	Seconds
Total Pod Worker Starts Duration	The total time from the kubelet seeing a pod to starting a worker for it, across all pods, during the last poll period.	Seconds
Storage Operations
Storage Operations - Success	The total count of storage operations with status success during the last poll period.	Count
Storage Operations - Failure	The total count of storage operations with status fail-unknown during the last poll period.	Count
Storage Operations Duration
Average Storage Operation Duration - Success	The average time taken per storage operation with status success during the last poll period.	Seconds
Average Storage Operation Duration - Failure	The average time taken per storage operation with status fail-unknown during the last poll period.	Seconds
Cgroup Manager Operation Duration
Average Cgroup Manager Operation Duration	The average time taken per cgroup manager operation during the last poll period.	Seconds
Total Cgroup Manager Operations Duration	The total time taken by all cgroup manager operations during the last poll period.	Seconds

Pod Lifecycle Event Generator (PLEG)

Metric name	Description	Units
PLEG Relists	The total count of relisting in PLEG during the last poll period.	Count
PLEG Last Seen	The timestamp in seconds when PLEG was last seen active.	Timestamp
Time Since PLEG Last Seen	The elapsed time in seconds since PLEG last reported an event, indicating potential PLEG staleness.	Seconds
PLEG Discard Events	The number of discard events in PLEG during the last poll period.	Count
PLEG Relist Interval
Average PLEG Relist Interval	The average interval between consecutive PLEG relists during the last poll period.	Seconds
Total PLEG Relists Interval	The sum of the intervals between consecutive PLEG relists during the last poll period.	Seconds
PLEG Relist Duration
Average PLEG Relist Duration	The average time taken for relisting pods in PLEG during the last poll period.	Seconds
Total PLEG Relists Duration	The total time taken for relisting pods in PLEG during the last poll period.	Seconds
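
Time Since PLEG Last Seen is the difference between the current time and the PLEG Last Seen timestamp, and a large value suggests that PLEG has gone stale. The sketch below shows that calculation with a simple threshold check. The `pleg_staleness` helper, the sample timestamps, and the 180-second threshold are assumptions for illustration; choose a threshold that suits your own alerting policy.

```python
import time


def pleg_staleness(last_seen_ts, now=None, threshold_seconds=180.0):
    """Return (elapsed_seconds, is_stale) for a PLEG last-seen timestamp.

    last_seen_ts is a Unix timestamp in seconds. threshold_seconds is an
    assumed alerting threshold, not a value taken from Site24x7.
    """
    if now is None:
        now = time.time()
    # Clamp at zero so small clock skew does not produce a negative value.
    elapsed = max(0.0, now - last_seen_ts)
    return elapsed, elapsed > threshold_seconds


# Hypothetical readings: PLEG was last seen 200 seconds before "now".
elapsed, stale = pleg_staleness(last_seen_ts=1_700_000_000.0,
                                now=1_700_000_200.0)
print(elapsed, stale)
```

Passing `now` explicitly keeps the check deterministic in tests; when it is omitted, the helper falls back to the current system time.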
