Over the past few years, Kubex has evolved from a cloud optimization product into a Kubernetes-centric solution, shifting its focus from cost and waste visibility to fully automated resource optimization. As that evolution unfolded, one of our earliest design decisions began to show its limits: how the product was configured.
Our original automation component, responsible for making optimization decisions across customer clusters, was configured either through Helm chart values or by manually editing a ConfigMap. This worked well early on, but as the product grew in scope and sophistication, configuration through unstructured data became harder to reason about, validate, and scale.
As part of our broader effort to become a Kubernetes-first product, we decided to rethink configuration entirely and move from ConfigMaps to domain-specific Custom Resources.
Why We Started With ConfigMaps
When we first built Kubex’s automation framework, the problem space was relatively small. We were optimizing CPU and memory at pod admission time using historical usage data, and the configuration reflected that simplicity.
At that stage, ConfigMaps were an obvious choice. They were easy to use, required no additional controllers or schema definitions, and let us iterate quickly while the product was still evolving rapidly. They also required no add-ons on top of Kubernetes, which made deployment and updates straightforward.
At the time, this tradeoff made sense. We wanted to iterate rapidly, the structure wasn’t completely defined, and the configuration was simple enough that the lack of strong typing or validation was manageable.
Where ConfigMaps Started to Break Down
As Kubex expanded into real-time resizing, GPU optimization, and more expressive policy-driven automation, the limitations of ConfigMaps became increasingly visible.
The lack of structure meant that configuration was essentially free-form. Every new feature required additional parsing logic in our codebase, which steadily increased complexity and technical debt. There was no authoritative definition of what a valid configuration looked like, and correctness depended entirely on application logic executed at runtime.
Validation was another major pain point. Because ConfigMaps are not validated at admission time, configuration errors were only discovered at runtime. Users could successfully apply a configuration that was syntactically valid YAML but semantically wrong, only to discover issues later by digging through logs. This was especially problematic for teams using GitOps workflows, where feedback is expected to surface directly in tools like kubectl, Argo CD, or Flux.
From a scalability perspective, our reconciliation model also had its limits. Although ConfigMaps were hot-reloaded, any change triggered a full re-evaluation of all configuration. A small update, such as changing the labels targeted by a single policy, resulted in all policies being recomputed. This increased time to optimization and made the system harder to reason about.
Finally, the global nature of the Helm values and the ConfigMap made delegation difficult. Platform teams could not easily hand off configuration responsibility to application teams on a per-namespace basis, which limited flexibility in larger, multi-tenant environments.
Treating Configuration as a Kubernetes API
At some point, it became clear that we were treating Kubernetes as a storage layer rather than as an API. Configuration lived in Kubernetes, but it was not truly part of the Kubernetes API model.
Custom Resource Definitions (a.k.a. CRDs) gave us a way to change that. By modeling configuration as first-class Kubernetes resources, we could define explicit structure, validate inputs early, and react to changes in an event-driven way. Instead of polling or reprocessing global state, we could reconcile only the objects that changed and only the logic affected by those changes.
This shift fundamentally changed how both users and the system interacted with configuration.
Implementing CRDs With controller-runtime
We used kubebuilder and controller-runtime to define our configuration APIs directly as Go structs. From those structures, CRDs and OpenAPI schemas were generated automatically. This allowed us to keep the API definition and implementation tightly aligned, without maintaining separate YAML definitions by hand.
Using kubebuilder validation tags, we pushed validation into the CRD schema itself by defining constraints on valid fields and values (for example, enums, required fields, and numeric bounds). Invalid configuration is now rejected immediately when applied, whether via kubectl or through a GitOps controller. Errors that previously appeared in logs are now surfaced where users expect them: at the API boundary. On top of this, we can also add sane defaults at the schema layer instead of hardcoding them in the application or forcing users to define all the fields every time.
Resulting objects
This journey has led us to create the following objects:
- ClusterWideProactivePolicy: Optimizes resources for pods cluster-wide using namespace and label selectors. Best used by Platform teams to enforce a set of standards around resource optimization.
- ProactivePolicy: Optimizes resources for pods in a namespace using label selectors. Best used by Application teams to define more precisely how resource optimization should be done for their own application.
- AutomationStrategy: An object referenced by the policy objects that defines how automation events should be handled: for example, whether in-place resizing should be used, whether requests and/or limits can be optimized, and many more granular settings. A limited number of strategies can then be referenced across many policies.
Here is an example of how a central Platform team previously had to configure our product, through our Helm chart, to enable resource optimization for a Java app owned by an application team. The intent here is to optimize requests and limits, using eviction only and never in-place resizing.
```yaml
scope:
  - name: base-optimization-scope # Unique scope name
    policy: base-optimization # Policy name (optional - uses defaultPolicy if omitted)
    namespaces:
      operator: In
      values:
        - my-java-app-xyz
    podLabels:
      - key: app.kubernetes.io/name
        operator: In
        values:
          - my-java-app
policy:
  policies:
    base-optimization:
      allowedPodOwners: "Deployment,CronJob"
      enablement:
        cpu:
          request:
            downsize: true
            upsize: true
            setFromUnspecified: true
          limit:
            downsize: true
            upsize: true
            setFromUnspecified: true
            unsetFromSpecified: false
        memory:
          request:
            downsize: true
            upsize: true
            setFromUnspecified: true
          limit:
            downsize: true
            upsize: true
            setFromUnspecified: true
      inPlaceResize:
        enabled: false
      podEviction:
        enabled: true
      safetyChecks:
        maxAnalysisAgeDays: 5
```
With our new Kubernetes-native design, the same use case can be addressed directly by the Application team: they create a targeted resource optimization policy, in self-service, for their Java-based application in a namespace they own.
```yaml
---
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: AutomationStrategy
metadata:
  name: no-in-place-automation
  namespace: app-xyz
spec:
  inPlaceResize:
    enabled: false
---
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: ProactivePolicy
metadata:
  name: my-java-app-policy
  namespace: app-xyz
spec:
  scope:
    workloadTypes:
      - Deployment
      - CronJob
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: my-java-app
  automationStrategyRef:
    name: no-in-place-automation
```
The result is more concise, thanks to sane defaults defined at the CRD layer; more flexible in how resources can be composed; and safer, thanks to improved validation at admission time. Our product's feature set also improves: Platform teams retain the ability to manage these policies centrally, while gaining the flexibility to delegate this type of configuration to Application teams when needed.
5 Lessons Learned and Best Practices
Lesson 1: Push validation to the schema
One of the biggest lessons we learned was to push as much validation as possible into the CRD schema itself. Structural constraints, required fields, and basic value validation belong at the schema level, where feedback is fast and deterministic. Validating webhooks are best reserved for cases that truly require context, such as cross-object references or checks that depend on cluster state.
Lesson 2: Finalizers are your friends
Finalizers on the Custom Resource objects also became critical as our configuration model evolved. Moving from a single global configuration to multiple interrelated resources makes cleanup logic non-trivial. We created finalizer code that is idempotent, resilient to partial failure, and only removes itself once all dependent resources have been properly handled.
Lesson 3: Prefer inline specs to cross-references
Another important design consideration was balancing reuse with usability. While shared configuration through references can reduce duplication, it also introduces indirection. Inlining configuration by default often leads to a better user experience, especially when the configuration is tightly scoped to a single object. Our AutomationStrategy objects do use a reference, but only because we found they were frequently reused in practice. When starting a new project, we feel it's usually better to begin with inline definitions and introduce reuse only when the need becomes clear.
Lesson 4: Follow proven patterns within the ecosystem
Designing APIs is subjective, but many open-source cloud-native projects have already solved similar problems. Our solution was to lean heavily on existing CRD designs in the ecosystem. In our case, Cilium Network Policies were a strong source of inspiration: their approach to namespaced policies, label-based targeting, and immediate, event-driven enforcement closely matched the behavior we wanted for Kubex.
Lesson 5: Create a migration strategy
Redesigning our configuration would not have been successful without a careful migration strategy. Since directly editing ConfigMaps was already a high-friction workflow, most users configured Kubex through Helm values. We used that as our compatibility layer, mapping existing values to the new Custom Resources through templating.
This allowed us to preserve existing workflows while giving us the freedom to design better APIs without being constrained by the legacy format. Over time, as users adopt Custom Resources directly, we expect to deprecate these Helm values and rely entirely on Kubernetes-native configuration.
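In Kubex this mapping lives in Helm templates, but the underlying translation is easy to picture as a function from legacy values to new objects. The following Go sketch is purely illustrative: the field names on both sides are assumptions standing in for the real chart values and API types.

```go
package main

import "fmt"

// LegacyScope stands in for one entry of the old Helm "scope" values.
type LegacyScope struct {
	Name       string
	Policy     string
	Namespaces []string
	PodLabels  map[string]string
}

// ProactivePolicy stands in for the new namespaced Custom Resource.
type ProactivePolicy struct {
	Name        string
	Namespace   string
	MatchLabels map[string]string
	StrategyRef string
}

// toProactivePolicies emits one namespaced policy per targeted namespace,
// preserving the old targeting semantics while producing objects in the
// new, per-namespace model.
func toProactivePolicies(s LegacyScope) []ProactivePolicy {
	var out []ProactivePolicy
	for _, ns := range s.Namespaces {
		out = append(out, ProactivePolicy{
			Name:        s.Name,
			Namespace:   ns,
			MatchLabels: s.PodLabels,
			StrategyRef: s.Policy,
		})
	}
	return out
}

func main() {
	scope := LegacyScope{
		Name:       "base-optimization-scope",
		Policy:     "base-optimization",
		Namespaces: []string{"my-java-app-xyz"},
		PodLabels:  map[string]string{"app.kubernetes.io/name": "my-java-app"},
	}
	fmt.Println(toProactivePolicies(scope))
}
```

Keeping the translation mechanical like this is what makes the compatibility layer safe: users keep writing the values they know, while the cluster only ever sees the new API objects.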
Conclusion
Configuration is an API, whether you treat it as one or not. Early on, ConfigMaps allowed us to move quickly, but as Kubex grew, their lack of structure and validation became a constraint for both users and engineers.
By embracing Custom Resources and Kubernetes-first design principles, we were able to improve usability, correctness, and scalability, while also simplifying our internal logic. For teams building Kubernetes-native software, investing in well-designed configuration APIs early can pay off significantly as the product evolves.
