KubeBench
KubeBench: A Domain-Specific LLM Benchmark for Kubernetes Code Generation
University of California at Berkeley, School of Information | Masters Data & Information Science Capstone Project Fall 2025
Team Members: Matt K. Robinson, Catherine Weiss, Anni Yao, Nick Cai, Tahlee Stone
Originally presented at the C++ Conference (CppCon) in September 2025 in Aurora, Colorado, in the talk titled “You Only Ever Look Once: Fine Tuning Domain-Specific Code Generating LLMs” (Matt Robinson).
Original text published November 2025, University of California at Berkeley, School of Information.
Full technical report is available at KubeBench.dev
Introduction
2026 Context
Kubernetes is an open-source platform that automates the deployment, scaling, and management of containerized applications across clusters of computers. It is the software that "runs the cloud". Kubernetes (or K8s) ensures that applications remain available, can handle increased demand, and recover from failures automatically (though not yet autonomously; more on that later). The global Kubernetes market is currently valued at $2.5B and is projected to reach $10.4B by 2033. K8s is central to operating modern cloud infrastructure. Despite this ubiquity and economic significance, there are currently no LLMs trained specifically for Kubernetes, nor is there a comprehensive way to evaluate how general-purpose LLMs perform when working with such an important and widespread technology.
KubeBench is a novel Kubernetes-specific benchmark that extends prior work by KubeIntellect and CloudEval-YAML. It is a three-dimensional evaluation framework that determines the quality of LLM-generated YAML across three vectors:
Runtime validity
Operational validity
Evaluation of complete reasoning pathways by an LLM-as-judge
The benchmark evaluates more than 16 base, supervised fine-tuned, and agentic open-source code generation LLMs on 810 Kubernetes-specific tasks spanning the foundational resources within a Kubernetes cluster.
The Problem: AI Agents and the Kubernetes Challenge
Kubernetes Has Won the Cloud War
Kubernetes has decisively won the cloud orchestration wars, with enterprises across industries investing billions of dollars in infrastructure and employing hundreds of thousands of human engineers responsible for either working with or directly managing extraordinarily diverse deployments. These deployments span financial services, healthcare, e-commerce, machine learning platforms, IoT systems, and countless other domains with unique compliance requirements, performance constraints, security postures, and operational patterns.
A banking institution's Kubernetes deployment bears little resemblance to a genomics research cluster or a real-time gaming platform's infrastructure. This diversity presents a fundamental challenge to the autonomous agent approach: it is highly unlikely that any single LLM agent, regardless of how well-trained, can achieve expert-level proficiency across all possible Kubernetes deployment patterns, industry-specific requirements, and organizational constraints. The knowledge required to navigate HIPAA-compliant healthcare clusters differs fundamentally from that needed to optimize high-frequency trading infrastructure or manage edge computing deployments across thousands of retail locations.
Where AI Agents Are Breaking
We lack a general evaluation benchmark and framework for Kubernetes YAML generation that aligns with a goal of immediate productivity increase from AI adoption. Recent research has primarily targeted autonomous, self-healing cluster infrastructure. While this is a worthy long-term vision, it nonetheless remains distant from production deployment and fails to address the day-to-day challenges faced by teams managing diverse Kubernetes environments today.
The fundamental challenge with developing Kubernetes expert systems and benchmarking their real-world effectiveness lies in how we measure effectiveness, not in how soon we can remove the human domain expert.
The Infrastructure Engineering Domain Faces Unique Challenges
1. Documentation Velocity
Kubernetes releases occur roughly three times per year, with each release introducing new features, deprecations, and API changes
Cloud providers (GCP, AWS, Azure) continuously extend Kubernetes with proprietary services and custom resource definitions
Keeping current with best practices across this moving target is a substantial cognitive burden for human engineers
2. Context Switching Costs
Infrastructure engineers frequently move between different cluster environments
Each cluster may have distinct configurations, networking policies, storage backends, and service mesh implementations
Recalling the specific syntax, available resources, and operational patterns for each context imposes significant overhead
3. High Stakes of Errors
Unlike application code that can be quickly rolled back, infrastructure misconfigurations can cause cascading failures, security vulnerabilities, or compliance violations with severe business consequences
Engineers need high-confidence, verifiable answers; probabilistic approximations are markedly less useful, and errors are hard to spot by reading YAML configurations alone
4. Complexity of Troubleshooting
Diagnosing failures in distributed systems requires correlating evidence across logs, metrics, network policies, resource quotas, RBAC configurations, and application state
Given all of this, why wouldn't traditional LLM evaluation methods apply to an LLM that generates Kubernetes configurations?
Why Traditional Evaluation Metrics Fail
Traditional text-level similarity metrics, such as BLEU and Edit Distance, are impractical when evaluating LLMs specializing in generating Kubernetes configuration code. This inadequacy stems directly from the nature of the configuration language.
Key Issues:
Declarative and Order-Insensitive: Kubernetes YAML is declarative, and the order of keys within a mapping does not affect functional correctness
Functionally Equivalent Configurations: Two YAML files with identical field values but different key orderings are functionally equivalent and are interpreted identically by the Kubernetes API server
Traditional Metrics Penalize Correct Code: BLEU and edit distance compare token or character sequences, so they penalize outputs that differ textually from the reference even when they are functionally identical. This makes them unsuitable for meaningful evaluation of generated Kubernetes YAML.
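To make the ordering problem concrete, here is a minimal sketch in Python. It uses JSON as a stand-in for YAML (Kubernetes also accepts JSON manifests) so that only the standard library is needed. The ConfigMap contents and the use of difflib as a rough proxy for edit-distance-style similarity are illustrative assumptions, not part of the KubeBench harness.

```python
import difflib
import json

# Two renderings of the same hypothetical ConfigMap; only key order differs.
manifest_a = json.dumps({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "app-config", "namespace": "team-a"},
    "data": {"LOG_LEVEL": "info", "TIMEOUT": "30"},
}, indent=2)

manifest_b = json.dumps({
    "data": {"TIMEOUT": "30", "LOG_LEVEL": "info"},
    "metadata": {"namespace": "team-a", "name": "app-config"},
    "kind": "ConfigMap",
    "apiVersion": "v1",
}, indent=2)


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]; a rough proxy for text metrics."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def semantically_equal(a: str, b: str) -> bool:
    """Compare parsed objects; dict equality ignores mapping key order."""
    return json.loads(a) == json.loads(b)


print(f"text similarity: {text_similarity(manifest_a, manifest_b):.2f}")
print(f"semantically equal: {semantically_equal(manifest_a, manifest_b)}")
```

The text similarity comes out below 1.0 even though the parsed objects are equal, which is the mismatch that motivates evaluating against a cluster rather than against a reference string.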
What KubeBench Measures
Three-Dimensional Evaluation Framework
Instead of grading YAML by string similarity, KubeBench evaluates how models behave against a live Kubernetes API:
1. Runtime Validity
Does the generated YAML validate successfully against the Kubernetes cluster specifications without error?
2. Operational Validity
If the configuration deployed successfully to a Kubernetes cluster, do the declared resources exist?
3. LLM-as-Judge Reasoning Assessment
An additional LLM agent analyzes the model's reasoning pathways and comments on whether they appear to have been appropriate for the task
The LLM judge evaluates generated YAMLs for all cases, even in cases of failed execution
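The first two dimensions could be checked along the following lines. This is a sketch under stated assumptions, not the KubeBench harness itself: it assumes kubectl is installed and pointed at a test cluster, uses a server-side dry run as the runtime-validity check, and uses a follow-up kubectl get as the existence check. The function names and the injectable runner are hypothetical, and the LLM-as-judge step is left out.

```python
import json
import subprocess
from typing import Callable, List, Tuple

# A runner takes (argv, stdin_text) and returns the process exit code.
Runner = Callable[[List[str], str], int]


def kubectl_runner(argv: List[str], stdin_text: str) -> int:
    """Run a command for real; needs kubectl and a reachable cluster."""
    result = subprocess.run(argv, input=stdin_text, text=True, capture_output=True)
    return result.returncode


def dry_run_cmd() -> List[str]:
    # Server-side dry run: the API server validates the object without persisting it.
    return ["kubectl", "apply", "--dry-run=server", "-f", "-"]


def apply_cmd() -> List[str]:
    return ["kubectl", "apply", "-f", "-"]


def get_cmd(manifest: dict) -> List[str]:
    # Look up the declared resource by kind and name (namespace only if declared).
    meta = manifest["metadata"]
    cmd = ["kubectl", "get", manifest["kind"].lower(), meta["name"]]
    if "namespace" in meta:
        cmd += ["-n", meta["namespace"]]
    return cmd


def evaluate(manifest: dict, run: Runner = kubectl_runner) -> Tuple[bool, bool]:
    """Return (runtime_valid, operationally_valid) for one generated manifest."""
    text = json.dumps(manifest)  # kubectl accepts JSON manifests on stdin
    if run(dry_run_cmd(), text) != 0:
        return False, False
    applied = run(apply_cmd(), text) == 0
    exists = applied and run(get_cmd(manifest), "") == 0
    return True, exists


# Demonstration with a stub runner that pretends every command succeeds.
namespace = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": "team-a"}}
print(evaluate(namespace, run=lambda argv, stdin_text: 0))
```

Passing the runner as a parameter keeps the command construction testable without a cluster; swapping in the default kubectl_runner would execute the same commands against whatever cluster the current kubeconfig context selects.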
Real Production Tasks
Each prompt reflects a concrete cluster scenario, such as configuring quotas for a team namespace or tightening RBAC for a service account.
Difficulty Tiers
Tasks span easy, intermediate, and advanced levels to mirror day-to-day infrastructure engineering work and reflect real-world Kubernetes scenarios.
The 810 Kubernetes Tasks
Eight Foundational Resource Categories
The current iteration of KubeBench comprises 810 Kubernetes-specific CREATE tasks spanning eight foundational Kubernetes resource categories:
clusterrole
clusterrolebinding
configmap
namespace
role
rolebinding
secret
serviceaccount
These resources represent the bedrock of any production-level Kubernetes cluster. A Kubernetes cluster cannot realistically function in a production environment without them properly defined and configured.
Why These Resources Matter (Detail)
Namespace: Provides logical isolation between different applications or teams
ConfigMap and Secret: Store application configuration data and sensitive credentials respectively
RBAC Resources (role, rolebinding, clusterrole, clusterrolebinding, serviceaccount): Collectively implement Role-Based Access Control (RBAC), which determines which users, applications, and processes have permission to perform which operations within the cluster
Critical for Security: Without correctly and securely configured RBAC, a cluster ends up either overly permissive and insecure or locked down to the point of being non-functional
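To illustrate how the RBAC resources reference one another, the following sketch checks that a RoleBinding points at a declared Role and a declared ServiceAccount. The resource names (config-reader, ci-bot, team-a) and the helper binding_is_consistent are hypothetical; the field names follow the rbac.authorization.k8s.io/v1 API.

```python
from typing import Dict, List


def binding_is_consistent(binding: Dict, resources: List[Dict]) -> bool:
    """True if the RoleBinding's roleRef and ServiceAccount subjects are declared."""
    ns = binding["metadata"]["namespace"]
    declared = {
        (r["kind"], r["metadata"].get("namespace"), r["metadata"]["name"])
        for r in resources
    }
    ref = binding["roleRef"]
    # A RoleBinding may reference a Role in its own namespace or a ClusterRole.
    ref_ns = ns if ref["kind"] == "Role" else None
    if (ref["kind"], ref_ns, ref["name"]) not in declared:
        return False
    for subject in binding.get("subjects", []):
        if subject["kind"] == "ServiceAccount":
            key = ("ServiceAccount", subject.get("namespace"), subject["name"])
            if key not in declared:
                return False
    return True


role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "config-reader", "namespace": "team-a"},
    "rules": [{"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]}],
}
service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "ci-bot", "namespace": "team-a"},
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "ci-bot-reads-config", "namespace": "team-a"},
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "config-reader"},
    "subjects": [{"kind": "ServiceAccount", "name": "ci-bot", "namespace": "team-a"}],
}

print(binding_is_consistent(binding, [role, service_account]))
```

A generated RoleBinding that names a Role or ServiceAccount missing from the rest of the output is the kind of cross-resource slip this check is meant to surface.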
Why These Resources Are Challenging for LLMs
What makes these resources particularly challenging for LLM code generation is their extreme variability and customization potential. Unlike deploying a simple web application that follows relatively standard patterns, these foundational resources must be tailored to each organization's specific:
Security policies
Compliance requirements
Multi-tenancy needs
Operational workflows
The combinatorial space of valid, secure, and operationally sound configurations for these resources is effectively infinite, making them ideal test cases for evaluating whether LLMs can generate contextually appropriate infrastructure code rather than merely reproducing common templates.
Why This Matters for Infrastructure Engineers
Kubernetes Adoption
Kubernetes has become ubiquitous in modern infrastructure, with 96% of enterprises using it to orchestrate containerized applications. Over 50,000 businesses worldwide, from Fortune 500 companies to healthcare systems to AI platforms, rely on Kubernetes to manage production workloads.
Kubernetes adoption continues to grow across finance, healthcare, telecom, retail, and manufacturing, and thousands of engineers now depend on LLMs to generate, review, and troubleshoot manifests safely.
The Human-Augmentation Approach
Prior work in this area has recently focused on creating fully autonomous, self-healing Kubernetes clusters. KubeBench is instead intentionally building for human infrastructure engineers and the AI agents that exist in their systems.
Fine-Tuned Kubernetes Models
KubeBench measures progress toward creating better tools for human engineers who want to deploy intelligences that hold specific domain expertise. The Kubernetes models are instruction fine-tuned on a large repository of Kubernetes documentation and open source code bases.
Available KubeBench-Trained Models
Qwen K8s Assistant 0.5B (qwen_0.5b_k8s-qlora_v4)
Ultra-lightweight model optimized for manifest generation and quick scaffolding in resource-constrained environments
Ideal for CI/CD pipelines, edge deployments, and rapid iteration on small clusters
Qwen K8s Specialist 1.5B (qwen_1.5b_k8s-qlora_v3)
Balanced model with enhanced multi-step reasoning for production workflows and troubleshooting
Strong performance on rollout planning, CI/CD configurations, and operational tasks
Gemma 2 K8s Deep Reasoner 2B (gemma_2b_k8s-qlora_v4)
Specialized for complex reasoning tasks including migration planning, debugging sessions, and architecture reviews
Excels at providing detailed explanations and "what & why" narrative alongside configurations
Llama 3.1 K8s Expert 8B (llama_v4)
Production-grade model deeply fine-tuned for RBAC, controllers, operators, and enterprise deployment patterns
Provides opinionated best-practice defaults with both code generation and architectural guidance
Fine-Tuning Approach: QLoRA and Domain-Specific Training
The model development strategy employs Quantized Low-Rank Adaptation (QLoRA), a parameter-efficient fine-tuning technique that combines quantized base models with Low-Rank Adaptation (LoRA) modules.
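As a rough illustration of why this is parameter-efficient: LoRA freezes a weight matrix of shape d x k and trains two low-rank factors of shapes d x r and r x k, so the trainable parameter count for that matrix drops from d*k to r*(d + k). The sketch below works through the arithmetic. The layer dimensions and rank are illustrative assumptions, not the settings used for the KubeBench models.

```python
def full_params(d: int, k: int) -> int:
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in the LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 16  # hypothetical layer size and adapter rank
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  fraction trainable: {lora / full:.2%}")
```

In QLoRA the frozen base weights are additionally stored in quantized form, which reduces memory for the part of the model that is not being trained.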
Training Data Strategy
Dataset Source: The Stack
The fine-tuning dataset is derived from The Stack, a comprehensive corpus containing over 6TB of permissively-licensed source code files covering 358 programming languages. The Stack was created as part of the BigCode Project, an open scientific collaboration focused on the responsible development of Large Language Models for Code.
From The Stack's extensive collection, the team extracted and focused specifically on YAML configurations from Kubernetes-related repositories, implementing rigorous data quality controls including:
Filtering for correctness (removing buggy or outdated examples)
Validating against current Kubernetes API schemas
Ensuring structural consistency
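One way such filters could look in practice is sketched below. This is not the team's pipeline: the function keep_sample, the required-field check, and the short list of removed API versions are assumptions chosen for illustration, and the input is assumed to be a manifest already parsed into a Python dict.

```python
from typing import Any

# API versions that are no longer served by current Kubernetes releases
# (illustrative subset, not an exhaustive list).
REMOVED_API_VERSIONS = {
    "extensions/v1beta1",                 # e.g. Deployments now live in apps/v1
    "rbac.authorization.k8s.io/v1beta1",  # RBAC now lives in rbac.authorization.k8s.io/v1
}


def keep_sample(manifest: Any) -> bool:
    """Return True if a parsed manifest passes basic structural and version checks."""
    if not isinstance(manifest, dict):
        return False
    # Structural consistency: this sketch requires apiVersion, kind, and metadata.name.
    for field in ("apiVersion", "kind"):
        value = manifest.get(field)
        if not isinstance(value, str) or not value:
            return False
    metadata = manifest.get("metadata")
    if not isinstance(metadata, dict) or not metadata.get("name"):
        return False
    # Outdated examples: drop manifests that use removed API versions.
    return manifest["apiVersion"] not in REMOVED_API_VERSIONS


samples = [
    {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "ok"}},
    {"apiVersion": "rbac.authorization.k8s.io/v1beta1", "kind": "Role", "metadata": {"name": "old"}},
    {"kind": "Secret", "metadata": {"name": "no-version"}},
]
print([keep_sample(s) for s in samples])
```

A fuller validation step would check each manifest against the API schema of the target Kubernetes version; the sketch only covers the cheapest checks.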
Dataset Source: Kubernetes Documentation
An additional dataset was curated from open source documentation on official Kubernetes websites, covered by the Creative Commons Attribution 4.0 (CC BY 4.0) and Apache 2.0 licenses, including:
Kubernetes official documentation (kubernetes.io)
Kubernetes Developer Documentation
kubectl official documentation
kubectl Developer Documentation
The combined datasets contained hundreds of thousands of valid Kubernetes samples for implementing fine-tuning workflows across the selected models.
Results: Performance Improvements
The results demonstrate clear performance improvements over baseline models:
+18-25% execution validity: Generated YAML successfully validates and applies to Kubernetes clusters
+21-27% YAML quality: Configurations align with best practices and operational requirements
+18-23% overall benchmark score: Combined improvement across runtime validity, operational correctness, and reasoning assessment
Notably, smaller models (0.5B-2B parameters) benefited most from domain-specific fine-tuning. This finding has significant implications for edge deployments and resource-constrained environments, where developers still need fast and accurate assistance but lack internet access or substantial compute.
Results from Fine-Tuned Llama 3.1 8B
The fine-tuned Llama 3.1 8B model achieved a 70.8% overall success rate across 720 tasks. This model represents the best-performing configuration and demonstrates the potential of QLoRA fine-tuning for domain-specific code generation.
Performance by Resource Type and Complexity:
| Resource Type | Complexity | Success Rate |
|---|---|---|
| Clusterrole | basic | 96.7% |
| Clusterrole | intermediate | 73.3% |
| Clusterrole | advanced | 36.7% |
| Configmap | basic | 93.3% |
| Configmap | intermediate | 80.0% |
| Configmap | advanced | 100.0% |
| Namespace | basic | 90.0% |
| Role | basic | 96.7% |
| Rolebinding | basic | 100.0% |
| Rolebinding | intermediate | 80.0% |
| Rolebinding | advanced | 90.0% |
| Serviceaccount | basic | 90.0% |
| Serviceaccount | intermediate | 80.0% |
| Serviceaccount | advanced | 93.3% |
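A per-cell breakdown like the one above can be computed from per-task pass/fail records. The following is a minimal sketch, assuming each record carries a resource type, a complexity tier, and a boolean outcome. The record layout and the toy data are hypothetical and are not KubeBench results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (resource_type, complexity, passed)
Cell = Tuple[str, str]          # (resource_type, complexity)


def success_rates(records: Iterable[Record]) -> Dict[Cell, float]:
    """Percentage of passing tasks for each (resource_type, complexity) cell."""
    passed: Dict[Cell, int] = defaultdict(int)
    total: Dict[Cell, int] = defaultdict(int)
    for resource, tier, ok in records:
        total[(resource, tier)] += 1
        passed[(resource, tier)] += int(ok)
    return {cell: 100.0 * passed[cell] / total[cell] for cell in total}


# Toy records for illustration only.
toy = [
    ("configmap", "basic", True),
    ("configmap", "basic", True),
    ("configmap", "basic", False),
    ("role", "advanced", False),
]
for (resource, tier), rate in sorted(success_rates(toy).items()):
    print(f"{resource} | {tier} | {rate:.1f}% |")
```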
Variable Effects of Supervised Fine-Tuning
The effects of supervised fine-tuning were highly variable: while certain models showed marked improvement on specific task categories, others demonstrated marginal or even negative performance changes. Subsequent analysis underscores the non-uniform nature of domain adaptation across different model architectures and task types.
Platform & System Architecture
Production Deployment
The platform is fully deployed on cloud infrastructure with FastAPI inference endpoints powering an interactive web interface. Users can engage with working chat functionality, explore model capabilities, and download any of the four fine-tuned model files hosted for community use and research.
Evaluation Clusters
Local Development Environment: Minikube
Kubernetes Version: v1.32.1 (server), v1.32.2 (client)
Minikube Version: v1.35.0
Driver: Docker
Host: MacBook Pro (2020) with Apple M1, 8 CPU cores, 16 GB memory
Production Environment: Google Kubernetes Engine (GKE) Autopilot
Name: REDACTED
Type: GKE Autopilot (fully managed)
Current Nodes: 4 (autoscaling)
Region: us-central1 (multi-zone: a, b, c, f)
Kubernetes Version: v1.33.5-gke
Key Learnings
Early user feedback and evaluation results revealed important insights about fine-tuning for infrastructure code generation:
Fine-tuned models demonstrate substantially deeper Kubernetes knowledge and greater specificity than their base model counterparts, particularly in generating correct RBAC configurations and understanding resource relationships
Fine-tuning alone may not produce the desired effect on already instruction-tuned models at larger sizes; achieving strong performance required substantial additional work in prompt engineering, dataset curation, and evaluation methodology refinement
User testimonials confirm that perfection is neither a desired nor an expected outcome. Infrastructure engineers value models that reduce cognitive overhead and accelerate initial scaffolding, even if manual review and adjustment remain required
Future Directions
Benchmark Expansion
The immediate priority is extending coverage beyond CREATE operations to include READ, UPDATE, and DELETE tasks, as well as CLI generation scenarios. The goal is to expand from the current 810 tasks to a 10,000-task research-grade benchmark covering the full spectrum of Kubernetes resource types and operational workflows.
Fine-Tuning Improvements
Improving dataset curation through enhanced filtering, deduplication, and validation against current Kubernetes API schemas
Experimenting with hybrid training approaches that combine QLoRA with targeted instruction tuning
Adding automated loss-curve analysis and error-pattern detection to accelerate training iteration cycles
Front-End & Agent Capabilities
Planned upgrades include:
Context window management for session memory
Unique chat experiences for each fine-tuned model
File and document upload features for context-aware generation
Multi-file editing with live schema validation
Explain mode that shows why the model generated each field
Architectural Evolution
Long-term development includes:
RAG integration with official Kubernetes documentation for real-time reference
Domain-specific context templates tailored to common infrastructure workflows
Agent orchestration capabilities for handling larger multi-step infrastructure tasks that span multiple resource types and deployment stages
Conclusion
KubeBench demonstrates that domain-specific fine-tuning of coding LLMs can address the inaccuracy and inefficiency of general-purpose models when generating infrastructure code. The result is more reliable YAML output, faster developer workflows, and measurable productivity gains for engineers managing production Kubernetes environments.
The KubeBench benchmark is designed specifically to measure progress toward this human-augmentation goal. The evaluation criteria directly correspond to what a human engineer needs from an LLM assistant: syntactically correct configurations, functionally valid deployments, and transparent reasoning that can be audited and understood.
Learn More
Visit KubeBench.dev to:
Explore the interactive benchmark
Download fine-tuned models
Try the chat interface with Kubernetes-specialized LLMs
Access the full technical report and research findings