The platform engineer's job is to delete toil, not build portals

2026-03-27 Rico Twesten-Weber Principal DevOps Engineer

platform-engineering, devops, automation

Every platform engineering conference talk I’ve sat through in the past year starts the same way. Someone shows a portal. It has a service catalog. Developers can click a button to spin up a new microservice from a template. The audience nods. The speaker says “developer experience” fourteen times. Everyone goes home and builds a portal.

Nobody asks whether the developers actually needed to spin up that microservice in the first place.

The portal trap

Here’s what I’ve watched happen at two different organizations. The platform team spends six months building an internal developer portal. It has a service catalog, environment provisioning, deployment dashboards, and a plugin system. It looks impressive in demos.

Developers don’t use it.

Not because it’s bad. Because it solves the wrong problem. The portal makes it easier to do the manual work. But the developers didn’t want the work to be easier. They wanted the work to not exist.

When a developer needs to deploy a service, they don’t want a nicer UI for the deployment process. They want to merge a PR and have the service deploy itself. When they need a new environment, they don’t want a form to fill out. They want the environment to appear automatically when they open a PR. When they need a TLS certificate, they don’t want a self-service portal for cert requests. They want cert-manager to handle it without them knowing.

The portal sits there, meticulously maintained, while developers route around it and keep doing things the old way because the old way has fewer steps than learning the portal.

What toil actually looks like

Google’s SRE book defines toil as work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows. That definition is precise and useful, but I think most teams don’t recognize their own toil because it’s woven into their daily routines.

Here are examples from teams I’ve worked with. A developer needs a new Kubernetes namespace. They open a ticket. A platform engineer creates the namespace, sets up RBAC, configures resource quotas, and closes the ticket. Time: 20 minutes of human work. Frequency: twice a week per team.

A service needs a new database. Ticket. Someone provisions it, creates the credentials, stores them in the secret manager, updates the Helm values. Time: 45 minutes. Frequency: every new service.

A certificate expires. Someone notices because monitoring fired an alert. They generate a new cert, update the secret, restart the pods. Time: 30 minutes. Frequency: hopefully not often, but always at the worst possible time.

Each individual instance feels manageable. Twenty minutes here, forty-five minutes there. But multiply it across teams and services, and you’ve got a platform engineer spending 60% of their week on work that follows the exact same pattern every single time.
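To make the multiplication concrete, here is a rough sketch of the arithmetic. The per-task minutes come from the examples above; the team count, the number of new services per week, and the cert incident rate are invented for illustration, so plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilTask:
    name: str
    minutes_per_occurrence: int
    occurrences_per_week: float


def weekly_toil_hours(tasks):
    """Total human hours per week spent on the given recurring tasks."""
    total_minutes = sum(t.minutes_per_occurrence * t.occurrences_per_week for t in tasks)
    return total_minutes / 60


# Hypothetical organization: 12 teams, 3 new services and 1 expired cert per week.
TEAMS = 12
tasks = [
    ToilTask("namespace request", 20, 2 * TEAMS),  # 20 min, twice a week per team
    ToilTask("database provisioning", 45, 3),      # 45 min per new service
    ToilTask("certificate renewal", 30, 1),        # 30 min per expired cert
]

hours = weekly_toil_hours(tasks)
share = hours / 40  # assuming a 40-hour week
print(f"{hours:.2f} h/week, {share:.0%} of a 40-hour week")
# prints "10.75 h/week, 27% of a 40-hour week"
```

Under those made-up assumptions, three ticket types alone eat about a quarter of one engineer’s week, before counting any other recurring request.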

Deleting toil vs automating toil

This is the distinction that changed how I think about platform work. There’s a difference between automating a process and eliminating the need for it.

Automating the namespace creation ticket means writing a script that provisions the namespace when the ticket is filed. The ticket still exists. Someone still files it. Someone still approves it. The script just does the kubectl commands faster than a human would.

Deleting the namespace creation ticket means a developer adds a team entry to a config file in Git, FluxCD reconciles it, and the namespace with all its RBAC and quotas appears automatically. No ticket. No approval workflow. No human in the loop. The config file is the request, and the reconciliation loop is the approval.
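As a sketch of what "the config file is the request" can mean, the function below turns one team entry into the three objects a platform engineer used to create by hand. The entry fields (name, group, cpu, memory) and the object names are hypothetical, and in a real Flux setup this mapping would more likely live in a Kustomize overlay or a Helm chart than in Python. The point is only that the mapping is mechanical.

```python
import json


def render_team_manifests(team: dict) -> list[dict]:
    """Render Namespace, ResourceQuota, and RoleBinding for one team entry."""
    name = team["name"]
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"team": name}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{name}-quota", "namespace": name},
        "spec": {"hard": {"requests.cpu": team["cpu"],
                          "requests.memory": team["memory"]}},
    }
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{name}-edit", "namespace": name},
        "subjects": [{"kind": "Group", "name": team["group"],
                      "apiGroup": "rbac.authorization.k8s.io"}],
        # "edit" is one of the built-in user-facing ClusterRoles.
        "roleRef": {"kind": "ClusterRole", "name": "edit",
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
    return [namespace, quota, role_binding]


# Hypothetical team entry, as it might appear in the Git config file.
entry = {"name": "payments", "group": "payments-devs", "cpu": "8", "memory": "16Gi"}
for manifest in render_team_manifests(entry):
    print(json.dumps(manifest, indent=2))
```

Once the mapping from entry to objects is a pure function, there is nothing left for a ticket to ask for.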

Same outcome, completely different approach.

Cert-manager is a perfect example of toil deletion. Before cert-manager, certificate management was a recurring manual task. Someone had to track expiration dates, generate new certificates, update secrets, and verify the rollout. Cert-manager doesn’t automate that process. It eliminates it. Certificates get issued, renewed, and rotated without anyone doing anything. The toil doesn’t get faster. It stops existing.

FluxCD’s reconciliation loop is another example. Before GitOps with reconciliation, deploying meant running a pipeline, verifying it worked, and fixing drift manually when someone changed something in the cluster. Reconciliation doesn’t automate the deployment process. It removes the concept of “deploying” as a human action. You merge, the system converges. The work disappears.
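The reconciliation pattern itself fits in a few lines. This is a toy model, not how FluxCD is implemented: desired state and actual state are plain dictionaries mapping a resource name to a version string, and the names are invented. It shows why drift stops being a human problem.

```python
def plan(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compute the actions needed to move actual state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # drift: exists but was never declared
    return actions


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, str]:
    """Apply the plan in place; afterwards actual matches desired."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = desired[name]
    return actual


# Git says api should be v2 and a worker should exist; someone also
# hand-created a debug pod in the cluster.
desired = {"api": "v2", "worker": "v1"}
actual = {"api": "v1", "debug-pod": "v0"}

print(plan(desired, actual))
# prints [('update', 'api'), ('create', 'worker'), ('delete', 'debug-pod')]
reconcile(desired, actual)
print(plan(desired, actual))
# prints []
```

A real controller runs this loop continuously, so the manual change to the cluster is reverted on the next pass without anyone being paged to fix it.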

The right order of operations

After screwing this up at least once, I’ve settled on an order that works.

First, ask whether the work needs to happen at all. Can you redesign the system so the task doesn’t exist? Namespaces created by convention from a Git config. Certificates managed by a controller. Environments spun up automatically per branch. If you can delete the work entirely, do that.

Second, for work that genuinely can’t be eliminated, automate it end-to-end with no human in the loop. Not “automate with approval gates.” Actually automated. If the process is safe enough that you’d rubber-stamp every approval anyway, remove the approval.

Third, and only third, build a UI if there’s a remaining human decision that benefits from visual context. If a developer needs to choose between deployment targets or review a security scan before promoting to production, a UI helps. But the UI should be the exception, not the starting point.

Most platform teams invert this. They start with the portal, add automation behind the buttons, and never get around to asking whether the buttons needed to exist.
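The three steps can be written down as a decision function. The two yes/no inputs are my own simplification of the questions above; real tasks are messier, but the order of the checks is the part that matters.

```python
from enum import Enum


class Treatment(Enum):
    DELETE = "redesign the system so the task does not exist"
    AUTOMATE = "automate end-to-end with no approval gate"
    UI = "build a UI around the remaining human decision"


def triage(can_be_designed_away: bool, needs_human_judgment: bool) -> Treatment:
    """Pick a treatment, checking the highest-leverage option first."""
    if can_be_designed_away:
        return Treatment.DELETE
    if not needs_human_judgment:
        return Treatment.AUTOMATE
    return Treatment.UI


# Hypothetical classifications of three tasks.
print(triage(can_be_designed_away=True, needs_human_judgment=False).name)   # DELETE
print(triage(can_be_designed_away=False, needs_human_judgment=False).name)  # AUTOMATE
print(triage(can_be_designed_away=False, needs_human_judgment=True).name)   # UI
```

Starting with the portal amounts to evaluating the last branch first.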

A practical test

Look at your platform team’s last month of work. List every task that was done more than once. For each one, ask: could this have been designed away?

Not “could we write a script for this?” but “could we change the system so this task doesn’t happen?” The answer is yes more often than you’d expect.

When I ran this exercise on my own work, I found that roughly half the repeating tasks could be eliminated through better reconciliation, convention-based configuration, or controller-driven automation. Another quarter could be fully automated without a human decision point. The remaining quarter genuinely needed human judgment.

That last quarter is where a portal or dashboard actually makes sense. For everything else, the platform engineer’s job is to make the work disappear, not to build a prettier way to do it.
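If your tickets or work log can be exported as a list of task names, the first pass of this exercise is a few lines of code. The log entries and labels below are invented for illustration and are not my actual data. The labelling step is the part that needs a human.

```python
from collections import Counter


def repeated_tasks(log: list[str]) -> dict[str, int]:
    """Return tasks that appear more than once, with their counts."""
    counts = Counter(log)
    return {task: n for task, n in counts.items() if n > 1}


def shares(labels: dict[str, str]) -> dict[str, float]:
    """Fraction of labelled tasks that fall under each label."""
    totals = Counter(labels.values())
    return {label: n / len(labels) for label, n in totals.items()}


# Hypothetical month of platform work, one entry per completed task.
log = [
    "create namespace", "provision database", "create namespace",
    "renew certificate", "approve prod promotion", "provision database",
    "create namespace", "renew certificate", "approve prod promotion",
    "debug one-off outage",
]

repeats = repeated_tasks(log)
print(repeats)

# Labels assigned by hand after asking "could this have been designed away?"
labels = {
    "create namespace": "delete",
    "renew certificate": "delete",
    "provision database": "automate",
    "approve prod promotion": "needs human",
}
print(shares(labels))
```

The one-off outage drops out of the list because it happened once, which is what you want: this exercise is about patterns, not incidents.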

If your team builds dashboards while devs wait

Platform engineering is about leverage. Every hour you spend should save your developers more hours than it cost. Portals can do that, but only after you’ve exhausted the higher-leverage options.

If your platform team is building dashboards while developers wait two days for an environment, your priorities are wrong. If your self-service catalog has 200 features but developers still file tickets for basic tasks, you’re optimizing the wrong layer.

Delete the toil first. Automate what’s left. Build the portal last, if at all. The best platform work is the work that makes other work stop existing.