Most AI-powered automation has a frustrating blind spot. Every workflow run starts from scratch. The agent hits the same rate limit it hit yesterday. It rediscovers the same workaround for that flaky test suite. It burns tokens retrying something that failed the exact same way three runs ago. If a human engineer kept making the same mistakes after being told the fix, you’d wonder what was going on. But we accept this from automated workflows because statelessness is the default. Runs are isolated. Context dies when the process exits.

We didn’t want to accept that anymore. Over the past few months, we’ve built a system at Overcut that lets workflows learn from their own execution history. Not in a hand-wavy “AI learns” sense, but through a concrete mechanism: persistent memory, automated retrospectives, and a weight-based system that surfaces useful knowledge while letting stale observations fade away.

Here’s how we designed it, the tradeoffs we wrestled with, and what surprised us along the way.

The shape of the problem

The core issue isn’t that agents can’t learn. It’s that they have nowhere to put what they’ve learned. When an agent in a workflow run discovers that a particular repository needs --maxWorkers=2 to avoid Jest running out of memory, that knowledge evaporates the moment the run ends. The next run’s agent starts clean, hits the OOM error, spends a few minutes diagnosing it, finds the same fix, and moves on. Multiply that by dozens of runs and you’re looking at real wasted time and cost.

The obvious solution is “just give agents memory.” But memory systems are deceptively tricky to get right. Store too much and you drown the agent in irrelevant context. Store too little and you miss patterns that only emerge across multiple runs. And if you let agents write memories without any validation, you end up with a pile of observations that are half-true, outdated, or actively misleading. We needed something more structured. Not a scratchpad, but a knowledge store with opinions about what’s worth remembering.

Memories as weighted knowledge

Every memory in our system is a structured observation tied to a specific workflow and step. A memory might say: “This repository’s rate limiter returns 429s under batch writes, so use sequential requests for the write phase.” Each one carries a weight between 0 and 1 that represents how confident the system is that this memory is useful.

The scoping matters. Memories are linked to both a workflow and a specific step within that workflow. An observation about test failures won’t show up when the agent is working on the implementation step, even within the same workflow. This was a deliberate decision. Early prototypes without scoping led to agents getting confused by memories that were technically about the same codebase but irrelevant to what they were currently doing.

When an agent starts a step, the system pulls the top 10 memories by weight (filtering out anything below 0.1) and renders them as a compact list in the agent’s prompt. The agent sees short titles with weights and can choose to read the full content of any memory that looks relevant. We track both events separately: “presented” means it appeared in the prompt, “used” means the agent actually read it. That distinction turns out to be important later.

Why cap at 10? After a few dozen retrospectives, a workflow can accumulate 50 or more memories. Injecting all of them would bloat the prompt, increase cost, and dilute attention. The weight system creates a natural priority queue where the most validated knowledge floats to the top. The cap of 10 and the 0.1 weight threshold were tuned to keep the memory section under roughly 500 tokens, though we expect to adjust these as we collect more data.
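The selection step can be sketched in a few lines. This is a minimal illustration rather than our actual implementation; the `Memory` fields and the `select_memories` helper are hypothetical names, but the constants mirror the numbers described above (top 10 by weight, 0.1 injection threshold, scoping by workflow and step, and separate "presented" tracking):

```python
from dataclasses import dataclass

@dataclass
class Memory:
    title: str
    content: str
    weight: float              # 0.0-1.0: how confident the system is that this is useful
    workflow_id: str
    step_id: str
    presented_count: int = 0   # times it appeared in a prompt
    used_count: int = 0        # times an agent actually read it

INJECTION_THRESHOLD = 0.1   # tentative memories sit at 0.0, below this line
MAX_INJECTED = 10           # keeps the memory section to roughly 500 tokens

def select_memories(store, workflow_id, step_id):
    """Return the memories to render into the agent's prompt for one step."""
    candidates = [
        m for m in store
        if m.workflow_id == workflow_id
        and m.step_id == step_id                 # step-level scoping
        and m.weight >= INJECTION_THRESHOLD
    ]
    candidates.sort(key=lambda m: m.weight, reverse=True)  # natural priority queue
    selected = candidates[:MAX_INJECTED]
    for m in selected:
        m.presented_count += 1   # "presented" is tracked separately from "used"
    return selected
```

Note that the weight sort means a newly confirmed memory at 0.3 appears in the list but below established memories, which is exactly the behavior the lifecycle below relies on.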

The retrospective loop

Memory alone isn’t enough. Something has to decide what’s worth remembering, validate that memories are actually helping, and retire the ones that aren’t. That’s the retrospective.

After every workflow run completes, the system checks whether it’s time to run one. The logic is simple: count the runs that have completed since the last retrospective, and if that count hits the threshold (default is 10), trigger. There’s a flag to prevent concurrent retrospectives on the same workflow, and internal workflows like the retrospective itself are excluded so the system doesn’t try to introspect on its own introspection.

The retrospective doesn’t analyze every accumulated run. It samples a fraction of them, randomly. With a default sample size of 40%, a threshold of 10 runs yields 4 runs for analysis, hard-capped at 5. The random sampling was a conscious choice. An earlier design took the most recent N runs, but that created recency bias. If a bug appeared frequently in early runs but was fixed partway through the window, the sequential approach would miss the pattern entirely. Random selection gives more representative coverage of the whole window. And because analyzing each run involves spinning up an agent that explores logs, delegates to sub-agents, and writes structured findings, sampling also keeps the cost predictable.

Each sampled run gets its own investigation step. The investigating agent digs through execution logs, tool calls, and outcomes. It skips non-agentic steps like git clones or cache restores and focuses on the steps where an agent was actually making decisions. One detail that matters: investigation steps run sequentially rather than in parallel, so each one can see what the previous investigator found. This lets the system spot cross-run patterns as it goes rather than stitching them together after the fact.

There’s also a manual mode. Users can trigger a retrospective on specific runs, optionally with a focus question like “why did the test step keep failing?” Manual retrospectives skip the weight adjustment and tentative processing steps. This is important because it means users can do exploratory analysis without interfering with the automatic improvement loop. The analyzed runs still count as “unanalyzed” for the automatic trigger, so they’ll be picked up again in the next scheduled retrospective’s sampling pool.

After investigation comes the interesting part: deciding what to do with the findings.
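The trigger and sampling logic is simple enough to sketch directly. This is an illustrative reduction under the defaults named above (threshold 10, 40% sample, cap of 5); the function names are hypothetical, and the real system also excludes internal workflows and applies the in-progress flag per workflow:

```python
import random

RUN_THRESHOLD = 10      # unanalyzed runs needed before a retrospective fires
SAMPLE_FRACTION = 0.4   # fraction of the window to investigate
SAMPLE_CAP = 5          # hard cap on sampled runs

def should_trigger(unanalyzed_count: int, retrospective_in_progress: bool) -> bool:
    # The in-progress flag prevents concurrent retrospectives on one workflow.
    return not retrospective_in_progress and unanalyzed_count >= RUN_THRESHOLD

def sample_runs(unanalyzed_runs: list, rng: random.Random) -> list:
    # Random sampling across the whole window avoids recency bias;
    # taking the most recent N would miss patterns fixed mid-window.
    k = min(SAMPLE_CAP, max(1, round(len(unanalyzed_runs) * SAMPLE_FRACTION)))
    return rng.sample(unanalyzed_runs, k)
```

With the defaults, a window of 10 runs yields 4 sampled runs, and larger backlogs saturate at the cap of 5, which is what keeps per-retrospective cost predictable.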

Earning trust slowly

One of the hardest design questions was: when should a new observation become a memory that agents can see? Our first instinct was to create memories immediately whenever the retrospective found a recurring pattern. But “recurring” is relative. A test might fail twice because of a genuine systemic issue, or twice because of unrelated transient infrastructure blips. Single-run observations are even noisier.

So we built a two-stage lifecycle. When the retrospective finds a pattern in just one run, it creates a tentative memory. Tentative memories have a weight of 0.0, which means they’re invisible to regular workflow agents (the injection threshold is 0.1). They exist only as hypotheses, sitting quietly until the next retrospective either confirms or contradicts them. This is the system saying: “I noticed this once. Let me see if it happens again before I tell anyone.” It’s a surprisingly human instinct, and we found it works well in practice.

Confirmation is deliberately strict. A tentative memory about test failures in the “implement-changes” step won’t be confirmed by an observation from a different step, even if both mention similar errors. The same pattern has to reappear in the same step across a separate run. Only then does the tentative memory get promoted to active with a starting weight of 0.3.

Why 0.3? It’s high enough to be injected (above the 0.1 threshold) but low enough that the memory has to prove itself before it starts dominating the prompt. A memory at 0.3 will appear in the list but won’t outrank established memories at 0.5 or 0.7. It has to earn its way up through positive impact across future retrospectives.

If a tentative memory gets contradicted by new evidence, it’s deleted with a reason. And if there’s simply no evidence either way, it stays tentative. No rush. The system would rather miss a pattern for one more cycle than create an active memory that misleads agents.
The life of a memory: from observation through tentative and active stages to workflow instructions or archival
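The tentative lifecycle reduces to a small state transition. The sketch below is a simplified stand-in (the `TentativeMemory` and `Finding` types and `process_tentative` are invented for illustration), but it encodes the rules described: no evidence keeps it tentative, contradiction deletes it, and only the same pattern in the same step from a separate run promotes it to 0.3:

```python
from dataclasses import dataclass
from typing import Optional

TENTATIVE_WEIGHT = 0.0   # invisible: below the 0.1 injection threshold
PROMOTION_WEIGHT = 0.3   # visible, but must still outrank established memories

@dataclass
class TentativeMemory:
    pattern: str
    step_id: str
    origin_run_id: str
    weight: float = TENTATIVE_WEIGHT

@dataclass
class Finding:
    step_id: str
    run_id: str
    contradicts: bool = False

def process_tentative(memory: TentativeMemory, finding: Optional[Finding]) -> str:
    if finding is None:
        return "still_tentative"       # no evidence either way: no rush
    if finding.contradicts:
        return "deleted"               # contradicted: removed with a reason
    # Strict confirmation: same step, but a separate run.
    if finding.step_id == memory.step_id and finding.run_id != memory.origin_run_id:
        memory.weight = PROMOTION_WEIGHT
        return "promoted"
    return "still_tentative"
```

A similar error surfacing in a different step does not confirm the hypothesis, which is the scoping rule doing its job a second time.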

Weight dynamics: making feedback tangible

The weight system is where the self-improvement loop actually closes. After investigating runs and processing tentative memories, the retrospective adjusts the weights of existing active memories based on their observed impact. The adjustments are applied per run. If a memory had positive impact in 3 out of 4 analyzed runs, it gets +0.3 total (+0.1 per positive run). A single retrospective can meaningfully shift a memory’s trajectory.

We deliberately made the system asymmetric. A memory that leads agents in the wrong direction loses 0.15 per bad run, while one that helps gains only 0.1 per good run. False positives are more expensive than false negatives here. A bad memory that confidently tells an agent to do the wrong thing wastes far more time than a good memory that hasn’t been surfaced yet. The system is biased toward skepticism, and we think that’s the right call.

There’s also a slow bleed for memories that get ignored. If a memory appears in an agent’s prompt but the agent never reads it, it loses 0.02 per retrospective. This prevents memory hoarding. Even a technically correct observation will eventually archive itself if agents consistently decide it’s not worth reading. Memories that drop below 0.05 get auto-archived.

Rising memories can reach weights in the 0.7-0.8 range, where they’re nearly guaranteed to be injected and occupy prime position in the prompt. The trajectory tells a story: a memory that climbs from 0.3 to 0.7 is one the system has validated across many runs. One that rises briefly and then declines probably addressed a temporary situation that has since changed.
Memory weight over time: three trajectories showing a consistently helpful memory graduating, a temporary fix fading away, and a correct but ignored memory slowly decaying
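The adjustment rules above amount to one small function. The constants are the values from the text; the function signature is an illustrative simplification (in practice the impact counts come out of the per-run investigation findings):

```python
POSITIVE_DELTA = 0.10   # per analyzed run where the memory helped
NEGATIVE_DELTA = 0.15   # per run where it misled: asymmetry favors skepticism
IGNORED_DECAY = 0.02    # per retrospective where it was presented but never read
ARCHIVE_FLOOR = 0.05    # below this, the memory auto-archives

def adjust_weight(weight, positive_runs, negative_runs, was_ignored):
    """Return the memory's new weight and whether it should be archived."""
    w = weight + POSITIVE_DELTA * positive_runs - NEGATIVE_DELTA * negative_runs
    if was_ignored:
        w -= IGNORED_DECAY   # slow bleed for presented-but-unread memories
    w = max(0.0, min(1.0, w))   # clamp to the valid 0-1 range
    return w, w < ARCHIVE_FLOOR
```

A freshly promoted memory at 0.3 that helps in 3 of 4 sampled runs lands at 0.6 after a single retrospective, which is the trajectory shift described above.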

The art of forgetting

A memory system that only accumulates knowledge will eventually collapse under its own weight. We needed the system to forget, but in a controlled way. When a memory is deleted, whether by a user or by the retrospective agent, the system creates a tombstone. It records what was deleted, why, and when. Tombstones expire after 30 days.

Without tombstones, we had an annoying cycle during early testing. The retrospective would delete a memory because it was contradicted by recent evidence. Then the next retrospective would sample runs from before the contradiction, see the same pattern that originally created the memory, and recreate it. The system was fighting itself.

Tombstones break that cycle by creating a blackout window. For 30 days after deletion, the system won’t recreate anything semantically similar to the deleted memory. But we chose to expire tombstones rather than make them permanent, because codebases change. A memory that was wrong six months ago might become correct after a refactor. The 30-day window balances protection against immediate re-creation with the ability to eventually relearn.
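The blackout check is mechanically simple; the hard part is the similarity test. In this sketch the similarity function is a pluggable callback (a real system would likely use embeddings), and the `Tombstone` shape and helper names are hypothetical, but the 30-day TTL matches the text:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TOMBSTONE_TTL = timedelta(days=30)   # blackout window before relearning is allowed

@dataclass
class Tombstone:
    summary: str          # what was deleted
    reason: str           # why the retrospective (or a user) removed it
    deleted_at: datetime  # when, so the TTL can expire

def blocked_by_tombstone(candidate_summary, tombstones, now, similar):
    """similar(a, b) -> bool is a pluggable semantic-similarity check.
    A candidate memory is blocked if any live tombstone matches it."""
    return any(
        now - t.deleted_at <= TOMBSTONE_TTL and similar(candidate_summary, t.summary)
        for t in tombstones
    )
```

Once `now` moves past the TTL, the same pattern can be learned again, which is the deliberate "codebases change" escape hatch.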

What we can measure and what we can’t

The presented-vs-used tracking gives us a decent signal on memory utility. A memory with a high present count but low use count is one that looks relevant by weight but agents keep ignoring in practice. The slow decay mechanism will eventually archive these, but the metrics also help us understand the system’s behavior during development.

Each retrospective produces a structured report that surfaces what was created, confirmed, deleted, or adjusted. These reports live in the UI as normal run outputs, giving users a clear audit trail of how their workflow is evolving. We found this matters more than we expected. When the system makes a change to memory, people want to understand why. The reports make the system’s reasoning legible, which builds trust in ways that silent optimization never could.

What we can’t measure yet is causal impact. We know when a memory was present and the run succeeded, but we can’t isolate whether the memory caused the success. True measurement would require A/B testing: occasionally withholding a memory and comparing outcomes. This is architecturally possible, and it’s on our roadmap, but it needs careful design to avoid degrading production workflows.

We’re also limited by the per-workflow scope. Right now, memories don’t cross workflow boundaries. If two different workflows operate on the same repository, they can’t share lessons. A scoping hierarchy with workspace-level memories is something we’re actively exploring, but it introduces new challenges around relevance. A memory that’s broadly true (“this repo uses pnpm, not npm”) is different from one that’s step-specific (“the test step needs --maxWorkers=2”), and we want to get the abstraction right before shipping it.
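The "presented but ignored" signal is just a ratio over the two counters. The helper names and the thresholds in `looks_ignored` are illustrative, not values from the real system, which relies on the 0.02-per-retrospective decay rather than an explicit cutoff:

```python
def usage_ratio(presented: int, used: int):
    """Fraction of prompt appearances where an agent actually read the memory.
    None when it has never been surfaced, because there's no signal yet."""
    return used / presented if presented else None

def looks_ignored(presented: int, used: int,
                  min_presented: int = 5, max_ratio: float = 0.2) -> bool:
    """High present count, low use count: relevant by weight, ignored in practice.
    min_presented guards against judging a memory on too few appearances."""
    return presented >= min_presented and (used / presented) <= max_ratio
```

In the live system the decay mechanism handles these cases automatically; a metric like this is mostly useful for inspecting why a particular memory is bleeding weight.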

Where this is going

The system we’ve built today is a closed loop: workflows run, retrospectives analyze, memories form and evolve, future runs benefit. It works, and we’re seeing workflows genuinely improve over time without anyone manually tuning them. But there’s more to do.

One direction we’re particularly excited about is memory graduation. When a memory’s weight climbs past a certain threshold, it’s no longer really a memory. It’s a known fact about how this workflow should behave. At that point, the system should fold it directly into the workflow’s instructions, treating it as permanent guidance rather than a floating observation. This frees up a slot in the memory window for newer, less proven knowledge to get its chance. The memory system becomes a pipeline: observations enter as tentative hypotheses, get validated into active memories, and the best of them eventually graduate into the workflow itself. It’s a natural lifecycle that keeps the memory store focused on what’s still being learned rather than what’s already settled.

Cross-workflow learning is another obvious next step. Right now, two workflows operating on the same repository can’t share lessons, and that feels like leaving knowledge on the table. We’re also thinking about adaptive thresholds that adjust retrospective frequency based on how much the system is still learning, and memory summarization that condenses related observations into denser representations.

But the biggest open frontier is a deeper shift we’re working toward. Right now, the retrospective learns from the process: what went wrong during execution, which tools failed, where agents got stuck. But the process succeeding doesn’t always mean the outcome was good. An agent can run cleanly, hit no errors, and still produce a code review that misses the actual problem, or a root cause analysis that sounds plausible but falls apart when someone acts on it.

We want the system to eventually learn from the quality of what it produces, not just from whether the machinery ran without friction. That means closing the loop with real-world signals: how did reviewers respond to the code changes? Were the review comments substantive or shallow? Did the root cause analysis hold up when the team investigated further, or did they throw it out? Did the user accept the output or push back? Right now we learn that the workflow ran. We want to learn whether the work was actually good.

The conviction underneath all of this is pretty simple. Automation that can’t learn is automation that will always need a human standing behind it, ready to step in when things go sideways. We’re building toward the alternative.