# Observability

Llumnix provides built-in observability at four levels: request performance, component diagnostics, fine-grained instance state, and engine-native metrics. Each component exposes a `/metrics` HTTP endpoint. Prometheus Operator CRDs (`ServiceMonitor` / `PodMonitor`) scrape these endpoints, and pre-built Grafana dashboards in `deploy/observability/` visualize the collected data.
## Monitoring Setup

`deploy/base/monitoring.yaml` defines Prometheus Operator resources to scrape metrics from Llumnix components:

- **ServiceMonitor** (`llumnix-control-plane`): scrapes Gateway (port 8089) and Scheduler (port 8088) via `/metrics` every 10s.
- **PodMonitor** (`llumnix-engine-neutral` / `llumnix-engine-prefill` / `llumnix-engine-decode`): scrapes engine pods managed by LeaderWorkerSet (neutral, prefill, decode) via `/metrics` every 10s, with relabeling to extract `infer_type` and `model` labels.

`monitoring.yaml` is included in `deploy/base/kustomization.yaml`. All deployment configurations under `deploy/` reference this base and inherit monitoring automatically. Note that the engine PodMonitors match only LeaderWorkerSet-based pods; deployment examples using plain Deployments (e.g., `traffic-mirror/`, `traffic-splitting/`) are not covered by them, so only Gateway and Scheduler metrics are collected for those examples.
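As a sketch of the shape these resources take, a ServiceMonitor for the control plane might look like the following. The ports, path, and 10s interval come from the description above; the selector labels and port names are illustrative assumptions, not the actual contents of `monitoring.yaml`:

```yaml
# Illustrative sketch of a control-plane ServiceMonitor.
# Ports (8089/8088), the /metrics path, and the 10s interval are from the
# docs above; selector labels and port names below are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llumnix-control-plane
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: llumnix   # assumed label
  endpoints:
    - port: gateway        # assumed name of the 8089 service port
      path: /metrics
      interval: 10s
    - port: scheduler      # assumed name of the 8088 service port
      path: /metrics
      interval: 10s
```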
## Grafana Dashboards

Pre-built Grafana dashboard JSON files are located in `deploy/observability/`.
### Llumnix Request Dashboard

`llumnix-request-dashboard.json` — end-user request-level metrics.

| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| Request Rate | | Counter | Total inference requests processed, partitioned by status code |
| Request Rate by Status | | Counter | Request rate broken down by HTTP status code |
| Request Retry & Fallback Rate | | Counter | Retry count, fallback count, and successful retries after fallback rate-limit (429) |
| Input / Output Token Throughput | | Counter | Cumulative input (prompt) and output (completion) token counts |
| Input / Output Token Distribution | | Histogram | Per-request token count distribution |
| E2E Latency | | Histogram | End-to-end request latency in seconds |
| TTFT | | Histogram | Time to first token in milliseconds |
| TPOT | | Histogram | Time per output token: (E2E − TTFT) / (output_tokens − 1), in milliseconds |
| ITL | | Histogram | Inter-token latency between consecutive output tokens in milliseconds |
| Prefix Cache Hit Ratio | | Histogram | Prefix cache hit ratio on the selected instance per request (0–100%) |
| Max Prefix Cache Hit Ratio | | Histogram | Maximum prefix cache hit ratio across all instances per request (0–100%) |
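The latency panels above are related by simple arithmetic. As a sketch of those definitions (the function and variable names here are illustrative, not Llumnix APIs), given a request's arrival time and one timestamp per generated token:

```python
# Illustrative restatement of the latency metrics defined in the table above.
# Inputs: request arrival time and one timestamp per generated output token,
# all in seconds. These helpers are not Llumnix APIs.

def request_latency_metrics(arrival_s, token_times_s):
    """Return E2E (s), TTFT (ms), TPOT (ms), and per-gap ITLs (ms)."""
    e2e_s = token_times_s[-1] - arrival_s
    ttft_ms = (token_times_s[0] - arrival_s) * 1000.0
    n_out = len(token_times_s)
    # TPOT = (E2E - TTFT) / (output_tokens - 1), in milliseconds
    tpot_ms = (e2e_s * 1000.0 - ttft_ms) / (n_out - 1) if n_out > 1 else 0.0
    # ITL: latency between consecutive output tokens
    itl_ms = [(b - a) * 1000.0 for a, b in zip(token_times_s, token_times_s[1:])]
    return e2e_s, ttft_ms, tpot_ms, itl_ms
```

For example, a request arriving at t=0 with tokens at 0.2s, 0.25s, 0.35s, and 0.4s has E2E = 0.4s, TTFT = 200ms, TPOT = (400 − 200) / 3 ≈ 66.7ms, and ITLs of 50, 100, and 50ms.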
### Llumnix Component Dashboard

`llumnix-component-dashboard.json` — internal component-level metrics for Gateway, Scheduler, and system runtime.

| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| Queue Duration | | Histogram | Request queue waiting duration in milliseconds |
| Preprocess Duration | | Histogram | Request preprocessing duration in milliseconds |
| Schedule Duration | | Histogram | Request scheduling phase duration in milliseconds |
| Postprocess Duration | | Histogram | Response postprocessing duration in milliseconds |
| Gateway Requests | | Gauge | Pending and total in-flight requests in the Gateway |
| Scheduling Events | | Counter | Total scheduling attempts and failures |
| Rescheduling Events | | Counter | Total rescheduling operations and failures |
| CMS Refresh Metadata Duration | | Histogram | CMS instance metadata refresh duration |
| CMS Refresh Status Duration | | Histogram | CMS instance status refresh duration |
| Full-Mode Schedule Duration | | Histogram | Full-mode scheduling decision duration per request |
| Query Prefix Cache Hit Duration | | Histogram | Duration of querying the KVS for prefix cache hits |
| Calc Prefix Cache Hit Duration | | Histogram | Duration of calculating prefix cache hit length |
| Uptime / Goroutines / Go Memory | | Gauge | System runtime diagnostics |
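Most duration panels in this dashboard are Prometheus histograms, so the Grafana panels read quantiles out of cumulative buckets. As a sketch of how that works (this mirrors the linear interpolation PromQL's `histogram_quantile` performs; the bucket data is made up):

```python
# Estimate a quantile from cumulative Prometheus-style histogram buckets,
# using linear interpolation within the bucket containing the target rank,
# as PromQL's histogram_quantile() does. Bucket values are illustrative.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket that contains the rank.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

For example, with buckets `[(0.1, 10), (0.5, 60), (1.0, 100)]` the median rank 50 falls 80% of the way through the second bucket, giving an estimate of 0.42.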
### Llumnix CMS Dashboard

`llumnix-cms-dashboard.json` — per-instance CMS status, split into Prefill and Decode sections. Each section includes:

| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| CMS Requests | | Gauge | Per-instance request counts by state |
| CMS Used Tokens | | Gauge | GPU tokens currently used per instance |
| CMS Prefill Tokens | | Gauge | Uncomputed/unallocated tokens for prefill requests |
| CMS Decode Tokens | | Gauge | Tokens for decode and loading requests |
| CMS Inflight Dispatch Requests | | Gauge | In-flight dispatch request counts |
| CMS Inflight Dispatch Tokens | | Gauge | Tokens for in-flight dispatch requests |
| CMS KV Cache Usage Ratio | | Gauge | Projected KV cache usage ratio per instance |
| CMS Decode Batch Size | | Gauge | Decode batch size per instance |
| CMS All Prefill/Decode Tokens | | Gauge | Total tokens for all prefill/decode requests per instance |
The dashboard also includes a Selected Instance Scheduling Metrics section showing the scheduling decision context for the chosen instance:
| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| Selected Instance KV Cache Usage Ratio | | Histogram | Projected KV cache usage ratio on the selected instance |
| Selected Instance Decode Batch Size | | Histogram | Decode batch size on the selected instance |
| Selected Instance Prefill/Decode Tokens | | Histogram | Total prefill/decode tokens on the selected instance |
| Selected Instance Predicted TTFT | | Histogram | Predicted TTFT for the selected instance in milliseconds |
| Selected Instance Predicted TPOT | | Histogram | Predicted TPOT for the selected instance in milliseconds |
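Several panels above track a *projected* KV cache usage ratio, i.e. usage anticipated once in-flight work lands on the instance, not just current occupancy. Llumnix's actual formula and field names are internal; purely as an illustrative sketch of the idea:

```python
# Illustrative sketch only: one way a "projected" KV cache usage ratio could
# combine tokens already resident on an instance with tokens still in flight.
# Llumnix's real projection formula and field names may differ.

def projected_kv_usage(used_tokens, inflight_dispatch_tokens, cache_capacity_tokens):
    """Fraction of KV cache expected to be in use once in-flight work lands."""
    projected = used_tokens + inflight_dispatch_tokens
    # Clamp at 1.0: the cache cannot be more than fully occupied.
    return min(projected / cache_capacity_tokens, 1.0)
```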
### Llumnix LRS Dashboard

`llumnix-lrs-dashboard.json` — per-instance Local Real-time State (LRS), split into Prefill and Decode sections.

| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| LRS Requests | | Gauge | Running, waiting, and total requests on a backend endpoint |
| LRS Tokens | | Gauge | Running, waiting, and total token counts on a backend endpoint |
### vLLM Dashboard

`vllm-dashboard.json` — engine-native vLLM metrics.

| Panel | Key Metrics | Type | Description |
|---|---|---|---|
| E2E Request Latency | | Histogram | End-to-end request latency from vLLM |
| Time To First Token Latency | | Histogram | TTFT from vLLM |
| Inter-Token Latency | | Histogram | ITL from vLLM |
| Scheduler State | | Gauge | vLLM internal scheduler state (running/waiting) |
| Cache Utilization | | Gauge | KV cache utilization ratio |
| Token Throughput | | Counter | Token generation throughput |
| Finish Reason | | Counter | Request completion reason distribution |
| Queue Time | | Histogram | Time spent in the vLLM queue |
| Prefill and Decode Time | | Histogram | Per-request prefill and decode time |
| Request Prompt/Generation Length | | Histogram | Prompt and generation length distributions (heatmap) |