Weekly GitHub Report for Kubernetes: April 07, 2025 - April 14, 2025 (14:15:00)
Weekly GitHub Report for Kubernetes
Thank you for subscribing to our weekly newsletter! Each week, we deliver a comprehensive summary of your GitHub project's latest activity right to your inbox, including an overview of your project's issues, pull requests, contributors, and commit activity.
Table of Contents
I. News
1.1 Recent Version Releases:
The current version of this repository is v1.32.3
1.2 Version Information:
This version, released on March 11, 2025, introduces key updates and changes to Kubernetes, as detailed in the linked changelog, with additional binary downloads available for users. Notable highlights and trends from this release can be found in the Kubernetes announcement forum and the comprehensive changelog documentation.
II. Issues
2.1 Top 5 Active Issues:
We consider active issues to be issues that have been commented on most frequently within the last week. Bot comments are omitted.
- Extensible Node Readiness/Schedulability Conditions: This issue proposes a new mechanism for defining custom, extensible readiness conditions for Kubernetes Nodes to ensure that critical node-level components are operational before application workloads are scheduled. The goal is to improve upon current workarounds using taints, which can be complex and introduce potential race conditions, by allowing nodes to signal true readiness only after essential components like monitoring agents and security scanners are confirmed operational.
- The comments discuss the need for a Kubernetes Enhancement Proposal (KEP) for this feature, highlight related issues, and explore various use cases and potential benefits. Concerns are raised about scheduling DaemonSets on nodes not fully ready, the necessity of controllers to monitor updates, and the potential race conditions with the current taint mechanism. Some commenters express skepticism about the benefits, suggesting that the existing taint mechanism might be sufficient.
- Number of comments this week: 23
- Probes do not honour the protocol of the port: This issue highlights a problem where Kubernetes probes do not respect the protocol specified for a port, leading to failures when using protocols like HTTP/3 that rely on QUIC. The user expected the probes to utilize the protocol defined in the port configuration, but instead the probes default to TCP, causing connection issues.
- The comments discuss the historical behavior of Kubernetes probes defaulting to TCP, regardless of the specified protocol, and the potential need for a Kubernetes Enhancement Proposal (KEP) to address this. There is a consensus that the current behavior is incorrect, especially for HTTP/3, and suggestions include using an exec probe with a custom client as a workaround (a rough sketch of that workaround appears after this list). The discussion also touches on the need for future API changes to support HTTP/3 once the Go language's net/http package implements it, and the possibility of adding validation to prevent confusion in the interim.
- Number of comments this week: 11
- DRA Prioritized List: allow alternative with "no devices": This issue discusses the possibility of allowing a "no devices" option in the Kubernetes Dynamic Resource Allocation (DRA) prioritized list, specifically when using the `firstAvailable` allocation mode. The proposal suggests that this option would enable workloads to run on a regular CPU without any device allocation, which could be useful in scenarios where device allocation is not necessary.
- The comments explore the implications of allowing a `count: 0` in the middle of a request list, with some contributors suggesting it could lead to unintuitive behavior. There is a discussion about whether this option should only be allowed as the last entry, and a pull request has been created to address the issue. Some contributors propose adding an `allocationMode: Nothing` to clarify the intent, while others express concerns about the complexity and potential for unintended side effects.
- Number of comments this week: 10
- plugin execution metric buckets are not useful for debugging high latency plugins: This issue highlights the difficulty of debugging high-latency plugins in Kubernetes because the `scheduler_plugin_execution_duration_seconds_bucket` metric does not provide sufficient granularity: it lacks buckets beyond 0.022 seconds, so most observations fall into the +Inf bucket. The user expects a more detailed metric that can help identify which plugin and extension point are causing the highest latency, especially when scheduling latency for pods with CSI-PVC is high.
- The comments discuss potential solutions, including increasing the number of buckets to improve resolution and considering an alpha-level metric with more granular buckets. There is a concern about the performance impact of exporting additional metrics, and discussions are ongoing to find a balanced solution that addresses the granularity issue without significantly affecting performance. (A sketch of how exponential buckets produce this ceiling appears after this list.)
- Number of comments this week: 10
- Kubelet failed to start on reboot with memory manager's `Static` policy: This issue addresses a problem with the Kubernetes kubelet failing to start on reboot when using the memory manager's `Static` policy, due to changes in the total memory of each NUMA node, even though the overall memory remains consistent. The issue persists despite improvements in memory manager reliability in Kubernetes v1.32, as the current implementation still checks for unchanged total memory per NUMA node, leading to errors when there are slight variations.
- The comments discuss potential solutions, including modifying the memory state comparison to focus on the total memory across all NUMA nodes rather than individual nodes, and ensuring that memory variations do not affect allocated pods. There is also a request for clarification on the conditions causing memory changes on reboot, and a volunteer offers to propose code changes and add unit tests to address the problem.
- Number of comments this week: 9
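For the probe-protocol issue above, the workaround mentioned in the comments is an exec probe that runs an HTTP/3-capable client inside the container. Below is a minimal sketch of such a probe built with the client-go API types; the curl flags, port, and path are hypothetical and assume the container image ships a curl build compiled with HTTP/3 support.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// httpGet and tcpSocket probe handlers always dial TCP, so an HTTP/3
	// (QUIC over UDP) endpoint has to be checked from inside the container
	// with an exec probe and an HTTP/3-capable client.
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				// Hypothetical command; requires a curl build with HTTP/3
				// support inside the container image.
				Command: []string{"curl", "--http3", "-sf", "https://localhost:8443/healthz"},
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       10,
	}

	out, err := yaml.Marshal(probe)
	if err != nil {
		panic(err)
	}
	// Prints the YAML fragment to paste under a container's livenessProbe.
	fmt.Print(string(out))
}
```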
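To see why the plugin execution metric discussed above pushes most samples into +Inf, the sketch below prints an exponential bucket series whose largest finite boundary lands near the 0.022 second ceiling mentioned in the issue. The parameters (start 1e-5, factor 1.5, 20 buckets) are an assumption chosen to reproduce that ceiling, not a statement of the scheduler's actual configuration.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Assumed parameters: 20 buckets starting at 10µs and growing by 1.5x.
	buckets := prometheus.ExponentialBuckets(0.00001, 1.5, 20)
	for i, le := range buckets {
		fmt.Printf("bucket %2d: le=%.6fs\n", i, le)
	}
	// The largest finite bucket is roughly 0.022s, so any plugin execution
	// slower than that is only counted in +Inf, and high-latency plugins
	// cannot be told apart from one another.
	fmt.Printf("largest finite bucket: %.6fs\n", buckets[len(buckets)-1])
}
```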
2.2 Top 5 Stale Issues:
We consider stale issues to be issues that have had no activity within the last 30 days. The team should work together to get these issues resolved and closed as soon as possible.
- Ensure Secret Image Pulls for service accounts: This issue addresses the need to ensure that image pulls using service account tokens are restricted and secure, aligning with the new feature introduced to secure kubelet-pulled images. The enhancement is necessary to maintain consistent and opaque access control across all types of credentials used for image pulling, ensuring that the security measures are uniformly applied.
Since there were fewer than 5 stale issues, all of the stale issues have been listed above.
2.3 Open Issues
This section lists, groups, and then summarizes issues that were created within the last week in the repository.
Issues Opened This Week: 25
Summarized Issues:
- Dynamic Resource Allocation (DRA) Validation and Testing: The modification of Kubernetes' DRA validation logic is being considered to allow configurations where no devices are allocated, which would require changes to the validation code. Additionally, scale testing for the GA of structured parameters in the DRA feature is needed to ensure scalability and performance before the release of version 1.34.
- ValidatingAdmissionPolicy and CEL Type Checker: A problem with the ValidatingAdmissionPolicy in Kubernetes involves the CEL type checker failing when a complex expression uses the `all` macro, resulting in an error because `object.spec` is not recognized as a valid range for a comprehension. (A minimal illustration of the `all` macro appears after this list.)
- TLS Verification and `kubectl` Command: The `kubectl` command, when executed within a pod in "in-cluster" mode, does not respect the `--insecure-skip-tls-verify` flag, leading to certificate validation errors, even though the same command works as expected when a kubeconfig file is used.
- Custom Readiness Conditions for Nodes: A new mechanism is proposed for Kubernetes to define custom, extensible readiness conditions for nodes. This would allow nodes to signal true readiness for application workloads only after essential node-level components are operational.
- Kubernetes Scheduling and Latency Debugging: Debugging high latency in Kubernetes scheduling for pods with CSI-PVC is difficult due to insufficient granularity in the `scheduler_plugin_execution_duration_seconds_bucket` metric. Most observations are grouped into a single large bucket, obscuring the identification of specific plugins causing delays.
- CPU Usage Metrics and Cgroup Conversion: In Kubernetes version 1.30.0, CPU usage metrics for Windows nodes in AKS clusters are inaccurately lower than the sum of their pods' CPU usage. Additionally, Kubernetes workloads experience low CPU priority when cgroup v1 CPU shares are converted to cgroup v2 CPU weight (see the conversion sketch after this list).
- Flakiness in Kubernetes Jobs and Tests: The "pull-kubernetes-e2e-gce" job is experiencing random failures potentially due to apiserver timeouts in etcd. Similarly, a flaking test in the Kubernetes project involves the e2e suite's metrics gathering from the kubelet's /metrics/resource endpoint failing due to an "Invalid Kubelet port 0" error.
- Jenkins Pipelines and Windows Node Pool: Jenkins pipelines fail randomly when executing multiple stages on a Windows node pool in an AKS environment. This is potentially due to inconsistencies in the file system of the Windows pods.
- Kube-proxy and Kernel Subsystems Check: An enhancement to the kube-proxy component of Kubernetes proposes a preliminary check for the existence of the "/proc/sys/net/ipv4" and "/proc/sys/net/ipv6" directories. This ensures that both IPv4 and IPv6 kernel subsystems are enabled before proceeding with IPtables and nftables checks.
- GCE Windows Bootstrap and CSI Proxy: Refactoring the GCE Windows bootstrap process in Kubernetes is suggested to utilize a community-maintained release of the CSI Proxy. This would streamline the integration and ensure that upstream Kubernetes uses binaries from upstream build pipelines.
- Pod Status and Service Traffic Routing: A Kubernetes Pod with two containers experiences a status change to NotReady after one container exits successfully. This prevents the associated Service from routing traffic to it, despite the other container continuing to run.
- Node Taints and Resource Versioning: A custom taint added to a Kubernetes node gets overwritten by the system's automatic addition of "not ready" taints. This may be due to a caching issue that does not reflect the latest node state.
- CSI Volume Unmounting and `vol_data.json` File: The Kubernetes CSI unmounter process is unable to detach a volume due to the absence of the `vol_data.json` file, which is necessary for the `NewUnmounter` method to function correctly.
- Scheduler Preemption Process and `nominatedNodeName`: The unnecessary clearing of the `nominatedNodeName` in the Kubernetes scheduler's preemption process is addressed. This action is redundant since the `nominatedNodeName` is already cleared later in the scheduling cycle.
- `kubectl drain` Command and Node Credentials: In Kubernetes 1.32, the `kubectl drain` command fails to complete successfully when using node credentials. This is due to the default activation of the AuthorizeNodeWithSelectors feature gate, which restricts nodes to only query pods associated with them.
- Pod Termination and Node Shutdown: A mechanism to delay the termination of Pods during a node shutdown is proposed, making the process dependent on external conditions. This would allow administrators time to manage critical Pods.
- Kubelet Start and Memory Manager Policy: The Kubernetes kubelet fails to start on reboot when using the memory manager's `Static` policy. This is due to changes in the total memory of individual NUMA nodes, highlighting the need for an improved memory state comparison.
- CPU and Memory Requests for Kubernetes Components: Guidance is sought on setting appropriate CPU and memory requests and limits for Kubernetes components in a cluster with fewer than 50 nodes. This cluster was installed using kubeadm.
- Pause Image Version and Build Process: The unavailability of the pause image version 3.10.1 in the staging and production environments is due to a failed build process for the Windows pause image, caused by an invalid reference format related to the `tag@digest`.
- ServiceCIDR Implementation and Cluster Upgrade: A breaking change in the Kubernetes ServiceCIDR implementation requires manual intervention when upgrading a cluster from single-stack to dual-stack. This highlights the need for documentation and resolution to ensure consistent and automated migration processes.
- CI Job Failures and Network Services Tests: A failure cluster identified in the `ci-containerd-e2e-ubuntu-gce` CI job involves three specific Kubernetes end-to-end tests related to network services intermittently failing, with errors indicating a "context deadline exceeded" when expecting a "200 OK" response.
- CDI Specification and Device Usage: Within a single claim, the same device should not be used twice when admin access is requested. This can lead to an invalid CDI specification and subsequent errors, prompting a discussion on whether to enforce a rule to prevent this scenario.
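Regarding the ValidatingAdmissionPolicy item above: the CEL `all` macro comprehends over lists and maps, which is why ranging over a struct-like field such as `object.spec` trips the type checker. The sketch below evaluates an `all` expression over a list field using the cel-go library; the dyn-typed variable declaration and the pod-like test object are simplified stand-ins for what the admission plugin actually wires up.

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare "object" as a dynamically typed variable, loosely mirroring the
	// request object available to a ValidatingAdmissionPolicy expression.
	env, err := cel.NewEnv(cel.Variable("object", cel.DynType))
	if err != nil {
		panic(err)
	}

	// all() ranges over a list or map. In a ValidatingAdmissionPolicy, where
	// "object" carries a schema-derived type, comprehending over a plain
	// object field such as `object.spec` is rejected by the type checker as
	// not being a valid range for a comprehension.
	ast, iss := env.Compile(`object.spec.containers.all(c, has(c.securityContext))`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}

	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}

	// Hypothetical pod-like object used purely for illustration.
	object := map[string]any{
		"spec": map[string]any{
			"containers": []any{
				map[string]any{"name": "app", "securityContext": map[string]any{"runAsNonRoot": true}},
				map[string]any{"name": "sidecar"},
			},
		},
	}

	out, _, err := prg.Eval(map[string]any{"object": object})
	if err != nil {
		panic(err)
	}
	fmt.Println("all containers set a securityContext:", out.Value()) // false
}
```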
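For the cgroup conversion item above, the mapping commonly implemented by OCI runtimes such as runc is a linear projection from the cpu.shares range (2-262144) onto the cpu.weight range (1-10000); treat the exact formula below as an assumption about the runtime in use. It shows why a pod requesting one CPU (1024 shares under cgroup v1) ends up with a weight of roughly 39, well under the cgroup v2 default weight of 100.

```go
package main

import "fmt"

// sharesToWeight converts a cgroup v1 cpu.shares value (2..262144) to a
// cgroup v2 cpu.weight value (1..10000) using the linear mapping commonly
// implemented by OCI runtimes; the exact formula may differ per runtime.
func sharesToWeight(shares uint64) uint64 {
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	// 102 shares roughly corresponds to a 100m CPU request,
	// 1024 shares to a 1 CPU request.
	for _, shares := range []uint64{2, 102, 1024, 2048, 262144} {
		fmt.Printf("cpu.shares=%6d -> cpu.weight=%5d\n", shares, sharesToWeight(shares))
	}
}
```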
2.4 Closed Issues
This section lists, groups, and then summarizes issues that were closed within the last week in the repository. This section also links the associated pull requests if applicable.
Issues Closed This Week: 9
Summarized Issues:
- CPU Manager Policy Issues: The Kubernetes project is experiencing failures with the CPU Manager tests under the static CPU Manager policy. The tests are not enforcing the CFS quota for containers as expected, with errors indicating missing log entries in container outputs.
- Test Failures and Panics: A panic occurs in the `TestRegistrationHandler` unit test for Kubernetes, manifesting in a goroutine after test completion. This issue is reproducible only when an artificial delay is introduced into the test code.
- Backoff Implementation Concerns: The `wait.Backoff` implementation in Kubernetes does not meet user needs for executing operations at exponentially increasing intervals up to a cap, because the current implementation terminates once the cap is reached, prompting discussions on maintaining the duration at the cap value. (A brief sketch of the `wait.Backoff` fields appears after this list.)
- Pod Capacity and Resource Management: The `pull-kubernetes-e2e-kind-ipv6` job fails due to the test cluster reaching its pod capacity during the `[sig-node] Mount propagation` test. The node `kind-worker2` was out of pod resources, having reached its maximum capacity of 110 pods.
- Import and Code Elimination Issues: The `go-cmp` import in the Kubernetes client-go library causes dead code elimination to be disabled for users. This issue was closed as a duplicate of another issue.
- Service Port Synchronization Problems: Updating a Kubernetes service manifest to have the same NodePort for both UDP and TCP protocols does not synchronize the ports correctly. The issue suggests that deleting and recreating the service resolves the problem, highlighting that client-side patching is fundamentally broken for service ports.
- Pod PID Limit Discrepancies: The podPidsLimit setting in kubelet is not correctly applied to containers within a Pod, resulting in a discrepancy between the expected PID limit and the observed limit. This causes runtime errors related to thread creation, and guidance is sought on configuration adjustments or known limitations.
- Go Module Tagging Issues: The absence of a v0.33.0-rc.0 tag for Go modules in the Kubernetes API repository was reported and resolved. This tag was expected to be visible on the Go package documentation site.
- Headless Service Reachability: A headless service with a selector in a Kubernetes cluster is not reachable from within the cluster without specifying the explicit port in the URI. This is due to the lack of port translation typically handled by kube-proxy in non-headless services.
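As a companion to the Backoff item above, here is a small sketch of the `wait.Backoff` struct from `k8s.io/apimachinery/pkg/util/wait` with purely illustrative values; the point of contention in the issue is that helpers driven by this struct stop retrying once the step budget runs out rather than continuing indefinitely at the cap.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Illustrative values only; the issue is about the semantics once Cap is
	// reached, not about these particular numbers.
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // initial delay
		Factor:   2.0,                    // exponential growth per step
		Steps:    8,                      // retry budget used by the wait helpers
		Cap:      time.Second,            // upper bound on a single delay
	}

	for attempt := 0; attempt < 10; attempt++ {
		fmt.Printf("attempt %2d: delay %v\n", attempt, backoff.Step())
	}
	// Helpers such as wait.ExponentialBackoff give up after Steps attempts;
	// the discussion is about keeping the delay at Cap indefinitely instead.
}
```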
2.5 Issue Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed issues that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed issues from the past week.
III. Pull Requests
3.1 Open Pull Requests
This section provides a summary of pull requests that were opened in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Opened This Week: 43
Key Open Pull Requests
1. Fix several goroutine leaks on controllers: This pull request addresses several goroutine leaks in various Kubernetes controllers by ensuring that handlers registered with informers are properly unregistered upon controller shutdown, thereby preventing resource leaks and improving error handling by returning errors from handler registration processes that were previously ignored.
- URL: pull/131199
- Merged: No
2. WIP: feat(ccm): watch-based route controller reconciliation: This pull request introduces a new watch-based route controller for the cloud-controller-manager in Kubernetes, designed to trigger reconciliation only when specific events occur (such as node additions, deletions, or updates to the `Status.Addresses` or `PodCIDR` fields), thereby preventing the exhaustion of API rate limits. It also includes a new feature gate, `CloudControllerManagerWatchBasedRoutesReconciliation`, which is disabled by default.
- URL: pull/131220
- Merged: No
3. Add SharedInformer.AddContextEventHandler and AddContextEventHandlerW…: This pull request introduces the `AddContextEventHandler` and `AddContextEventHandlerWithOptions` methods to the SharedInformer in Kubernetes, providing a context-aware mechanism for adding event handlers that ensures they are properly stopped with their controllers, thereby addressing the need for a more efficient way to manage goroutines associated with registered handlers.
- URL: pull/131225
- Merged: No
Other Open Pull Requests
- Enhancements to kube-proxy and resource slice publishing: The pull requests focus on improving the `kube-proxy --cleanup` functionality by reducing unnecessary error logging and addressing missing cleanup tasks. Additionally, enhancements to resource slice publishing include fixing support for dropped fields and improving error reporting.
- Bug fixes in Kubernetes codebase: Several pull requests address bugs such as replacing `os.Exit` calls with return statements to prevent abrupt terminations, suppressing unnecessary error logs for long-running connections, and enabling mutex profiling to diagnose performance bottlenecks. These fixes enhance error handling and improve system robustness.
- Test improvements and debugging: Pull requests aim to address flaky and failing tests by marking tests as Serial, eliminating dependencies on uninitialized flags, and debugging test jobs. These efforts ensure more reliable test execution and help identify underlying issues.
- Workqueue enhancements and bug fixes: The pull requests introduce an "interruptible queue" feature and fix a bug in the workqueue's shutdown behavior. These changes improve the workqueue's functionality by allowing for more efficient worker management and aligning shutdown behavior with expected functionality.
- Websocket and ResourceClaim improvements: Enhancements include ensuring all websocket streams are created before initiating loops to prevent race conditions and allowing zero count in ResourceClaims for flexible workload scenarios. These changes address specific issues and enhance feature flexibility.
- API and client-go library updates: Updates include adding a QosClass Compare function to facilitate Pod sorting and fixing a bug where the `--insecure-skip-tls-verify` flag was not honored. These updates enhance API functionality and ensure proper flag handling in the client-go library.
- PersistentVolumeClaim and code cleanup: Bug fixes include correcting field name mismatches in PVC status validation and cleaning up unnecessary code related to unsupported features. These changes ensure consistency with documentation and remove redundant code.
- FlowSchema and directory name adjustments: Pull requests focus on removing outdated FlowSchema objects and shortening directory names for Pod logs to prevent test failures. These adjustments prevent potential issues and ensure compliance with system limits.
- Kubelet and test image updates: Bug fixes address issues with CSI volume unmounting and flaky test metrics collection, while updates to the busybox test image resolve memory leak issues. These changes improve kubelet reliability and test stability.
- Documentation and feature updates: Updates include adding a link to the releases badge in documentation and replacing `filepath-securejoin` with Go 1.24's `os.Root()` feature. These updates enhance documentation clarity and consolidate feature implementation.
3.2 Closed Pull Requests
This section provides a summary of pull requests that were closed in the repository over the past week. The top three pull requests with the highest number of commits are highlighted as 'key' pull requests. Other pull requests are grouped based on similar characteristics for easier analysis. Up to 25 pull requests are displayed in this section, while any remaining pull requests beyond this limit are omitted for brevity.
Pull Requests Closed This Week: 12
Key Closed Pull Requests
1. print Env and copy runc to /bin: This pull request involves printing environment variables and copying the 'runc' binary to the '/bin' directory, and it is associated with testing a specific Kubernetes infrastructure change, although it was not merged.
- URL: pull/130883
- Merged: No
- Associated Commits: b71f1
2. DRA kubelet: fix potential flake in unit test: This pull request addresses a bug in the Kubernetes project by fixing a potential flake in a unit test related to the DRA kubelet, where background activities were not being stopped before a test returned, leading to outdated state usage and an invalid testing.T pointer that caused a panic, and it resolves issue #131056 without introducing any user-facing changes.
- URL: pull/131065
- Merged: 2025-04-09T14:06:48Z
- Associated Commits: 52298
3. wip: sustain cap value for wait.backoff: This pull request aims to address issue #131122 by sustaining the cap value for the wait.backoff mechanism in the Kubernetes project, although it was not merged.
- URL: pull/131123
- Merged: No
- Associated Commits: e34e7
Other Closed Pull Requests
- CSI Proxy Update: This pull request updates the CSI Proxy to version v1.2.1-gke.2 to potentially fix a flaky volume resize test. The updated binary is available for download, and testing was conducted through CI jobs in the PD CSI Driver.
- Bug Fixes and Improvements: Several pull requests address bug fixes and improvements in the Kubernetes project. One updates the method used to determine version stability in emulation forward compatibility, while another ensures the name of the `ClusterRole` resource is randomized during e2e testing to prevent conflicts. Additionally, a pull request adds rules for release-1.33 in the staging/publishing section to facilitate the publication of v1.33 tags.
- Library and Code Refactoring: The Kubernetes project is updated to use the final released version v1.22.0 of the `prometheus/client_golang` library, replacing the previous release candidate version. Additionally, the `validateNodeIP` function is refactored by replacing cascading if-else statements with a switch statement to enhance code readability and maintainability.
- SELinux Test Tagging: A pull request involves tagging SELinux tests that require the SELinux warning controller, which is only available when the `SELinuxChangePolicy` feature gate is enabled. This assists downstream test runners in identifying necessary feature gates and includes a modification to apply the `SELinuxMountReadWriteOncePod` tag at a common level.
- Unmerged Pull Requests: Two pull requests were not merged into the main codebase. One was created by mistake and duplicates an existing pull request, while the other involves the addition of files to the Kubernetes project via upload.
3.3 Pull Request Discussion Insights
This section will analyze the tone and sentiment of discussions within this project's open and closed pull requests that occurred within the past week. It aims to identify potentially heated exchanges and to maintain a constructive project environment.
Based on our analysis, there are no instances of toxic discussions in the project's open or closed pull requests from the past week.
IV. Contributors
4.1 Contributors
Active Contributors:
We consider an active contributor in this project to be any contributor who has made at least 1 commit, opened at least 1 issue, created at least 1 pull request, or made more than 2 comments in the last month.
If there are more than 10 active contributors, the list is truncated to the top 10 based on contribution metrics for better clarity.
| Contributor | Commits | Pull Requests | Issues | Comments |
| --- | --- | --- | --- | --- |
| BenTheElder | 14 | 4 | 4 | 94 |
| pohly | 30 | 5 | 6 | 55 |
| liggitt | 17 | 5 | 0 | 50 |
| aojea | 5 | 3 | 6 | 53 |
| danwinship | 26 | 1 | 0 | 39 |
| bart0sh | 19 | 1 | 1 | 17 |
| dims | 6 | 1 | 7 | 24 |
| tabbysable | 0 | 0 | 5 | 28 |
| serathius | 18 | 0 | 2 | 11 |
| thisisharrsh | 0 | 0 | 0 | 30 |
Access Last Week's Newsletter: