ML Environment Engineering: Building Machines That Build Machines
For many companies, applied ML means building and maintaining dozens or hundreds of specialized models:
a ranking model here, a fraud classifier there, a recommendation system for this surface, a churn predictor
for that one. Most of us in applied ML, myself included, still default to hand-driving each of these:
picking features, choosing architectures, tuning hyperparameters, running experiments, interpreting results.
It is satisfying work. But increasingly, it feels like the wrong level of abstraction.
We are at a point where it probably makes sense to think less about building models and more
about building environments in which models get built. Instead of being the scientist who runs
the experiment, you become the person who designs the whole lab. This post is about that shift, why I think
it's coming, and what the early version of it looks like in practice.
From model building to environment building
The shift I’m describing is analogous to what happened with software deployment. Engineers
used to manually build, test, and deploy code. Then CI/CD pipelines automated most of that. The
engineers didn’t disappear — they moved up a level. They started designing the pipeline,
writing the tests, defining what “good” looks like. The system did the repetitive execution.
Applied ML is ready for a similar move. Instead of building a model, you build the
environment in which models get built: the evaluation harness, the data contracts,
the experiment tracking, the guardrails that prevent overfitting to your benchmarks. Then you
let an agent — or multiple agents — explore the space of candidate approaches within
that environment.
This isn’t traditional AutoML, which typically means hyperparameter search over a fixed
model class. This is broader: the agent can try different feature engineering strategies, different
model families, different ensembling approaches, different preprocessing pipelines. It operates
at the level of “what would a junior applied scientist try next?” rather than
“what’s the optimal learning rate?”
What makes automation possible
Offline-online metric alignment
This approach doesn’t work everywhere. It works in domains where offline evaluation metrics
have decent directional alignment with online outcomes. Not perfect alignment, but enough that improving your offline
metric reliably moves the online metric in the right direction.
Ranking is a good example: NDCG / MRR / MAP improvements offline tend to translate, at least
directionally, to better conversion rates in online A/B experiments.
Recommendation systems, search relevance, churn prediction - in these domains, the
offline-online gap is often manageable enough that automated iteration on the offline metric
is productive. This isn’t universally true — it varies by product, company,
and how well the eval was designed, but I believe it holds in a meaningful number of
real-world settings.
Where it breaks down is domains where there’s no clear offline proxy for what matters
online. Consumer-facing agents are a good example: did a particular tone, a specific tool
call, or a combination of steps in the agent trace lead to conversion and retention? There’s
no straightforward offline metric for that. Automated iteration in these settings can optimize
the wrong thing very efficiently, which is probably worse than not automating at all.
This is why good offline evals are critical for automation.
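One way to sanity-check whether a domain qualifies is to look at past launches and ask how often the offline metric moved in the same direction as the online one. The sketch below does exactly that; the paired deltas are hypothetical stand-ins for what you would pull from your own experiment history (e.g. NDCG delta vs. conversion-rate delta per launch).

```python
# Sketch: estimate directional offline-online alignment from past launches.
# The numbers below are hypothetical; in practice they come from your own
# history of (offline metric delta, online A/B metric delta) pairs.

def sign_agreement(offline_deltas, online_deltas):
    """Fraction of past launches where the offline metric moved in the
    same direction as the online metric."""
    pairs = list(zip(offline_deltas, online_deltas))
    if not pairs:
        raise ValueError("need at least one paired launch")
    agree = sum(1 for off, on in pairs if (off > 0) == (on > 0))
    return agree / len(pairs)

# Hypothetical history: offline NDCG delta vs. online conversion delta.
offline = [+0.012, +0.004, -0.003, +0.020, -0.001, +0.007]
online = [+0.300, +0.100, -0.200, +0.500, +0.050, +0.150]

alignment = sign_agreement(offline, online)
print(f"directional agreement: {alignment:.2f}")
```

If this number is high across a meaningful sample of launches, automated iteration on the offline metric is likely productive; if it hovers near chance, automation will optimize the wrong thing.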
The infrastructure you need
Assuming decent metric alignment, the environment needs a few specific components:
Strong evaluation with guardrails. This is the most important piece.
Models under optimization pressure tend to cheat — gaming benchmarks in ways
that resemble human shortcutting. One defense is to never let the agent run evaluation
directly: expose it to an evaluation skill you wrote, so the scoring logic is immutable.
But even that isn’t enough. Over many experiments, the agent will overfit to any
validation set it can see. The solution is a public/private holdout split, exactly like
Kaggle competitions: a public set for fast iteration, a private set the agent never
accesses directly, evaluated only once per experiment after the agent selects its best
candidate. Without this split, the agent hill-climbs on whatever metric you give it.
With it, benchmark gaming becomes much harder.
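The split can be enforced mechanically rather than by convention. Below is a minimal sketch of such a harness; all names are illustrative, and the metric here is a toy accuracy function standing in for whatever the human fixes. The key properties are that the agent can score on the public set freely, while the private score is one-shot per experiment and the scoring logic lives outside the agent's control.

```python
# Sketch of a Kaggle-style public/private evaluation harness.
# Illustrative names; the scoring logic is fixed by the human, not the agent.

class EvalHarness:
    def __init__(self, metric, y_public, y_private):
        self._metric = metric            # chosen by the human, immutable to the agent
        self._y_public = y_public
        self._y_private = y_private
        self._private_used = False

    def evaluate_public(self, preds):
        # Fast iteration signal: the agent may hill-climb on this freely.
        return self._metric(self._y_public, preds)

    def evaluate_private(self, preds):
        # One-shot: callable once, after the agent selects its best candidate.
        if self._private_used:
            raise RuntimeError("private evaluation already consumed")
        self._private_used = True
        return self._metric(self._y_private, preds)

def accuracy(y_true, preds):
    # Toy metric standing in for AUC/NDCG/etc.
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)

harness = EvalHarness(accuracy, y_public=[1, 0, 1], y_private=[0, 1, 1])
print(harness.evaluate_public([1, 0, 0]))   # agent iterates freely on this
print(harness.evaluate_private([0, 1, 0]))  # scored exactly once per experiment
```

In a real setup the harness would live behind an evaluation skill or script the agent can invoke but not read or modify.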
Experiment tracking and logging. Every model the agent trains, every evaluation
it runs, every approach it tries needs to be logged in a structured way. This serves two purposes:
it gives the human reviewer a clear picture of what was explored - it is essentially the interface
through which the agent communicates its findings and decisions - and it gives the agent itself
a history to reason about (what has been tried, what worked, where progress has plateaued,
and which directions remain unexplored).
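At its simplest, that history can be an append-only log that both the human and the next agent read. The sketch below uses a JSONL file with illustrative field names, not a fixed schema:

```python
# Sketch: an append-only experiment log readable by both the human reviewer
# and the next agent. Field names are illustrative, not a fixed schema.

import json
import tempfile
from pathlib import Path

# Fresh temp dir so the sketch is self-contained; in practice this would
# live on main next to the experiment code.
LOG_PATH = Path(tempfile.mkdtemp()) / "experiment_history.jsonl"

def log_experiment(direction, config, public_score, learnings):
    record = {
        "direction": direction,        # what was tried
        "config": config,              # how it was configured
        "public_score": public_score,  # how it did on the public split
        "learnings": learnings,        # what the agent concluded
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

def read_history():
    # The agent reads this before choosing its next direction.
    if not LOG_PATH.exists():
        return []
    return [json.loads(line) for line in LOG_PATH.read_text().splitlines()]

log_experiment("deeper-trees", {"max_depth": 12}, 0.761,
               "deeper trees overfit past depth 10")
best = max(read_history(), key=lambda r: r["public_score"])
print(best["direction"])
```

A real tracking server (MLflow, in the demo later in this post) adds artifacts, per-step metrics, and a UI on top of the same idea.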
Full-stack access. In many real-world settings, much of the practical
work isn’t the model itself - it’s the data pipelines, feature stores,
and serving infrastructure around it. Giving the agent access to these through skills
or dedicated toolkits (like
Databricks AI Dev Kit)
lets it iterate on the full stack - data, features, training, serving -
not just the model. Where this naturally leads is an environment where candidate models
can be deployed, load-tested, and evaluated under realistic inference constraints
(latency budgets, memory limits, traffic patterns) before any human reviews them.
This closes the gap between being good on a benchmark and being a full-stack ML agent that
can go from research ideas to production-ready models.
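A concrete piece of that gap-closing is gating candidates on inference constraints before a human ever looks at them. The sketch below checks a latency budget; the budget, percentile, and the trivial stand-in model are all illustrative.

```python
# Sketch: gate a candidate model on a latency budget before human review.
# The budget value and the stand-in predict function are illustrative.

import time

def p95_latency_ms(predict, batches, warmup=2):
    """Measure per-batch inference latency and return the 95th percentile."""
    for batch in batches[:warmup]:   # warm up caches etc., untimed
        predict(batch)
    samples = []
    for batch in batches:
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

def within_budget(predict, batches, budget_ms):
    return p95_latency_ms(predict, batches) <= budget_ms

# Hypothetical candidate: a trivial function standing in for a real model.
fast_model = lambda batch: [0.5 for _ in batch]
print(within_budget(fast_model, [[1, 2, 3]] * 20, budget_ms=50.0))
```

The same pattern extends to memory limits and traffic-shaped load tests: each constraint becomes one more automated check between "good on the benchmark" and "ready for review."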
The automated ML agent
To make this concrete, I’ve been building a demo template
for autonomous ML experimentation using Cursor AI agents. A small agent swarm
explores different research directions concurrently, logs everything to MLflow,
and communicates findings through a shared knowledge base on main.
[Architecture diagram: a human steers direction, reviews, and stops. An orchestrator manages the pool and never trains; it launches up to three subagents (exp/deeper-trees, exp/feature-interactions, exp/neural-net). Each subagent reads history on main, does research and EDA, announces its direction, trains and logs, and picks its best candidate. Candidates are scored freely on the public eval and once on the private eval (score hidden); results and learnings merge to main, whose knowledge base holds research_directions/, mlruns/, and src/.]
The orchestrator launches research agents
The orchestrator manages a rolling pool of up to 3 concurrent
subagents. It never trains models itself - it launches agents, monitors their
progress, and decides whether to continue or stop. Between completions it
checks in with the human: keep going, change direction, or stop.
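The rolling-pool mechanic is simple enough to sketch directly. In the snippet below, the subagent body and the human check-in are stand-ins for the real lifecycle (read history, announce direction, EDA, train, submit); only the pool logic is the point.

```python
# Sketch of a rolling pool: up to 3 subagents run concurrently, and as each
# finishes the orchestrator decides whether to launch another. The subagent
# and the human check-in are stand-ins for the real thing.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

MAX_CONCURRENT = 3

def run_subagent(direction):
    # Stand-in for: read history on main, announce direction, EDA,
    # train + log to MLflow, pick best candidate.
    return {"direction": direction, "public_score": 0.70 + len(direction) / 1000}

def human_says_continue(completed):
    # Stand-in for the human check-in between completions.
    return len(completed) < 5

def orchestrate(directions):
    pending = list(directions)
    completed = []
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        running = {pool.submit(run_subagent, d) for d in pending[:MAX_CONCURRENT]}
        pending = pending[MAX_CONCURRENT:]
        while running:
            done, running = wait(running, return_when=FIRST_COMPLETED)
            completed.extend(f.result() for f in done)
            if pending and human_says_continue(completed):
                running.add(pool.submit(run_subagent, pending.pop(0)))
    return completed

results = orchestrate(["deeper-trees", "feature-interactions", "neural-net",
                       "stacking", "target-encoding"])
print(len(results))
```

The orchestrator never trains anything itself: it only submits work, waits on completions, and consults the human before launching the next direction.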
Each subagent starts by reading past experiments and learnings
on main, then announces its research direction by writing a separate file to
research_directions/. From there it works freely on an experiment
branch - EDA, feature engineering, model training, whatever it decides. When
done, it merges results and learnings back to main; the full
experimentation code stays on the branch for reference.
Both the orchestrator and the subagents are guided by skills:
the orchestration skill controls launch/monitor/stop decisions,
while the experiment-running and experiment-submission skills
govern how subagents explore, merge, and run evaluation with guardrails
that prevent the model from cheating, logging everything through the MLflow observability layer.
MLflow tracks all experiment data
Every run logs hyperparameters, per-step training/validation metrics,
and the trained model as an artifact.
Each run carries structured metadata: direction_rationale
(why this approach), run_rationale (why this config), and
run_analysis (what happened and why).
Private evaluation scores are logged to the best run from each direction.
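A metadata contract like this is only useful if it is enforced. A minimal sketch, assuming the three keys above are tags the agent must set before a run is accepted (the validator itself is illustrative, not part of MLflow):

```python
# Sketch: enforce the structured-metadata contract before a run is accepted,
# so every run on main explains why it exists and what was learned.
# The validator is illustrative; the three keys mirror the contract above.

REQUIRED_METADATA = ("direction_rationale", "run_rationale", "run_analysis")

def validate_run_metadata(tags):
    """Reject runs whose required metadata is missing or empty."""
    missing = [k for k in REQUIRED_METADATA if not tags.get(k, "").strip()]
    if missing:
        raise ValueError(f"run metadata incomplete: missing {missing}")
    return True

tags = {
    "direction_rationale": "tree ensembles handle these tabular interactions well",
    "run_rationale": "halve learning rate after plateau at depth 8",
    "run_analysis": "no gain over depth 8; direction likely exhausted",
}
print(validate_run_metadata(tags))
```

In the demo this check would run at submission time, before a subagent's results are allowed to merge to main.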
Only results merge to main
Training code, predictions, and model files stay on the branch —
committed there as a complete archive.
Only mlruns/ and research_directions/ merge
to main.
Branches are never deleted — they serve as archives of everything
that produced the results.
Evaluation: public/private split
Agents have file system access. If targets lived in the same CSV as features,
nothing would stop an agent from reading the answer column directly - and under
optimization pressure, this does happen (especially with less powerful models).
Isolating target files (*_y.csv) and gating access through
evaluation scripts creates a hard boundary.
Public validation — agents use freely during research.
evaluate_public() drives all experiment decisions.
Private validation — scored once per direction at
submission time via submit.py. The agent never reads the private
targets directly.
Public-private gap is the signal — if the gap is large,
the agent has overfit the public split. This is the primary failure mode when an
autonomous agent optimizes iteratively, and the split is designed to catch it.
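Turning the gap into an automated alarm is a one-liner worth making explicit. The threshold below is illustrative and should be calibrated per problem; the scores are just example values.

```python
# Sketch: use the public-private gap as an overfitting alarm.
# The max_gap threshold is illustrative; calibrate it per problem.

def overfit_alarm(public_score, private_score, max_gap=0.02):
    """Flag a submission whose public score exceeds its private score by
    more than the allowed gap."""
    gap = public_score - private_score
    return gap > max_gap, gap

flagged, gap = overfit_alarm(public_score=0.783, private_score=0.774)
print(flagged, round(gap, 3))
```

A flagged submission is a signal to discard the candidate (or shrink the agent's iteration budget), not to peek at the private set.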
What we got
To test this, we generated a synthetic binary-classification dataset where, according to Codex,
the Bayes-optimal AUC should be around 0.784.
We pointed the agent swarm at it and let it run. In a few hours it explored
9 research directions, logged 110 MLflow runs, and tried everything from gradient-boosted
trees to neural nets to stacked ensembles - all autonomously.
The best results (0.783 public, 0.774 private) came from a
tuned CatBoost model and an ensemble blend, both very close to the Bayes-optimal AUC.
The small public/private gap confirms the agent wasn't overfitting the
benchmark. The score progression chart below shows the typical pattern: fast
early gains followed by a long plateau where incremental directions yield
diminishing returns.
What can be improved
Concurrency at scale. Currently each agent merges its own
results with rebase-and-retry. This works for ~3 agents but breaks beyond ~5.
Serializing all writes through the orchestrator would fix it.
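The serialization fix is a standard single-writer pattern. A minimal sketch, with a queue and one writer thread standing in for the orchestrator's merge process (the dict payload stands in for a real `git merge` onto main):

```python
# Sketch: serialize all merges through one writer owned by the orchestrator,
# instead of per-agent rebase-and-retry. Names and payloads are illustrative.

import queue
import threading

merge_queue = queue.Queue()
merged = []

def writer_loop():
    # The only thread allowed to touch main: drains requests in order.
    while True:
        item = merge_queue.get()
        if item is None:        # sentinel: shut down
            break
        merged.append(item)     # stand-in for merging results onto main

writer = threading.Thread(target=writer_loop)
writer.start()

# Any number of agents can enqueue concurrently; writes never conflict.
for direction in ["deeper-trees", "feature-interactions", "neural-net"]:
    merge_queue.put({"direction": direction})

merge_queue.put(None)
writer.join()
print(len(merged))
```

Because only the writer touches main, conflicts disappear by construction, and the pool size stops being limited by merge contention.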
Shared code across agents. Agents work in isolation on their
branches. A src/shared/ directory on main where agents
promote battle-tested utilities would let the next agent reuse what the
previous one built.
Richer inter-agent communication. Currently agents share
findings through free-form markdown. Structured metadata per direction
(e.g., features tried, best AUC, failure modes) would let agents make
more informed choices instead of re-discovering what a previous agent
already learned.
What’s next
What we built is a first version (the
full code is on GitHub). There are natural next steps:
deeper integration with observability tooling, full-stack access
to data and serving infrastructure, and the agent improvements
outlined in the previous section.
The human role. In this setup, the applied ML scientist doesn’t
disappear. The role shifts. You spend less time running experiments and more time on the
things that matter most: designing evaluation metrics that actually correlate with business
outcomes, identifying new signal sources, reviewing what the agent tried and spotting the
creative leaps it missed, and deciding when the offline-online alignment is trustworthy
enough to automate and when it isn’t.
In domains where offline metrics have decent alignment with online outcomes, I believe a
significant fraction of model research and iteration can be automated with current LLMs.
As models improve, so does the case for automation.
The investment shifts from building models to building the environments in which models get built.