Nicholas Clooney

AgentOS: The Agent Environment That Gets Smarter As You Build

Part of a series

DISCLAIMER: This blog is Human Driven, Agent Written like many things I am working on recently. The direction, the ideas, the taste, all from this fleshy human. I have also proofread and edited it many times. But agents have done most of the grunt work. You have been informed. Enjoy your reading. 😀

The agent I worked with yesterday does not exist anymore. 💀

Not in any dramatic sense. The model is still there. But the instance, the working context, the half-formed mental model it had of why we were renaming a file or why a particular abstraction looked wrong, all of that is gone. Tomorrow's agent starts from zero. So does next week's.

What I realized, while delegating real work to AI agents, is that the bottleneck stops being the model's capability and starts being the project's ability to explain itself. If the repo cannot tell a fresh agent what we are doing and why, every session begins with the same archaeology. The mechanical tax I have written about before, the quiet compounding cost that kills momentum, shows up here in a new form, in re-establishing context.

So in ProjectSpire (an AI-assisted, multilingual monorepo for exploring Slay the Spire 2 data, tooling, mods, and companion app experiments), I have started treating the surrounding documentation as part of the engineering environment itself. Not notes-for-later. Not changelogs. An actual system, designed so that the project gets better at being worked on the more it is worked on.

That is the claim I want to make in this post: the interesting move is not faster code generation. It is building environments that accumulate context, where the collaboration itself improves over time, even though the agent has no memory of last Tuesday.

The Three Layers, One Level Up

I have written before about a three-layer model for creative work: thinking, craft, and the mechanical. Thinking is judgment and intent. Craft is taste and standards. Mechanical is the boilerplate tax, the keystrokes between the idea and the result.

The same three layers apply to the engineering environment around an AI agent, just shifted up a level:

  • Thinking lives in plans. What are we building, and why this shape and not another?
  • Craft lives in collaboration logs. How do I want this done? What did I correct, and what should not need correcting again?
  • Mechanical lives in skills and workflow files. What is the exact command, the exact sequence, the exact failure mode?

Most "AI workflow" writing collapses these together, usually into one giant prompt, or one CLAUDE.md that tries to be everything. That works until the project gets big enough that the prompt cannot hold its own weight. Then you need to split the layers apart, because they have genuinely different jobs.

The Principles Behind the Practice

I am starting to call this cluster of ideas AgentOS, not a tool or a framework, but a layer of intentional structure that sits around the agent. The plan is to develop it into its own dedicated repo as the patterns solidify. But even at this stage, the underlying principles are reasonably clear:

Context should accumulate, not evaporate. Every session that ends without capturing what mattered is a session that future agents and future-you will partially re-do. The goal is a project that gets cheaper to work on, not harder.

Keep the layers separated. Thinking, craft, and the mechanical have different half-lives and different audiences. A plan and a skill and a Captain Log are not the same kind of document. Collapsing them makes all of them worse.

Doc hygiene is a first-class concern. Documentation that drifts out of sync with reality is worse than no documentation, because it actively misleads. The system needs mechanisms for noticing and correcting staleness, not just for producing documents in the first place.

The system should improve as it is used. This is the hardest one to operationalise, but it is the point. Every Captain Log should make the next session slightly better calibrated. Every encoded workflow should eliminate one category of repeated work. Every plan should mean one fewer archaeology session.

These principles are mostly implemented in practice but still have space for more evolution. Most of the mechanisms are in place. Others are still being designed. That is what active development of an idea looks like. And here is what it looks like in practice.

AgentOS

An operating system does not do the work, it creates the conditions for work to happen reliably. AgentOS is that idea applied to AI collaboration. In ProjectSpire, it looks like this:

Agent Instructions as Project Memory

The root of the system is the repo-level agent instruction file. In ProjectSpire, AGENTS.md is a symlink to CLAUDE.md, and CLAUDE.md acts as the operating manual for any agent working in the project.

		
  1. ## Plans
  2. Implementation-ready plans live in `Documentation/Plans/`.
  3. Use this folder for agreed plans that should be executable by another engineer or agent without rediscovering the intent, target commit, naming convention, verification steps, or important assumptions.
  4. ## Captain Logs
  5. Agent/user collaboration notes live in `Documentation/Captain Logs/`.
  6. Each conversation gets its own numbered markdown file in this folder, using the filename pattern `NNNN - Short Conversation Title.md`. For example, `0001 - The Beginning of Captain Logs.md`.
  7. If the current conversation does not already have a Captain Log file, create the next numbered file and add it to the top of `Documentation/Captain Logs/README.md` with a short summary.
  8. At the end of every meaningful user-agent interaction, add a concise entry to the top of the current conversation's log file, directly below the introductory text, so newest entries appear first and oldest entries remain at the bottom. Do not add entries for purely appreciative or acknowledgement-only replies, such as "nice", "good", or "well done", unless they include a decision, correction, new constraint, or requested change. This is not a transcript and it is not a replacement for full devlogs in `Documentation/Devlogs/`. Capture the collaboration shape: what the user asked for, how the agent responded, how the agent acted, how the user steered or corrected the work, the outcome, and what future agents should carry forward.
  9. Keep `Documentation/Captain Logs/README.md` as an index of conversation log files. Add new log links at the top of the index, with a short summary, so newest conversations appear first and oldest conversations remain at the bottom.
  10. After updating the file, explicitly report back to the user with the prefix `I have updated Captain Log` followed by a short summary of the latest topic, implementation, feedback, or outcome. For example: `I have updated Captain Log on the latest documentation-flow feedback: the confirmation now includes a concise summary.`

What that file deliberately does not try to do is contain every possible instruction. It is a router. It points at durable places: implementation plans, Captain Logs, release-tag workflows, snapshot workflows, build conventions, timeline-writing workflows.

That has become an important pattern for me. A good CLAUDE.md is not one giant prompt. It is the entrypoint into a system of project memory.

Plans: Where Thinking Gets Captured

Documentation/Plans/ is where agreed intent becomes executable.

Plans are not vague TODOs, and they are not something I write alone. A plan is the output of a planning session with an agent. I steer the direction, ask the questions, push back on the shape, and the agent writes it up. Human-driven, agent-written. That distinction matters because the plan ends up reflecting genuine architectural thinking rather than just whatever was convenient to type. I supply the judgment; the agent does the compression.

The result is something another engineer or agent can pick up without rediscovering the goal, the assumptions, the naming convention, or the verification path.

When I moved Neow's Cafe from mock cards toward a real card catalog, the plan captured the central decision: use a boring static catalog first, not a REST API.

		
  1. ## Summary
  2. Integrate `Lab/data/v0.103.2/cards`, `Lab/resources/images/packed`, and card localization data into `Apps/Apple/Neow's Cafe` through a generated, static, versioned catalog serving view.
  3. Use one compact card index for app-side search and filters, lazy-load portrait images, and avoid a REST API until there is a concrete need for server-side querying or sync.
  4. ## Catalog layout
  5. Generate a static catalog serving view under a versioned root:
  6. ```text
  7. Lab/catalog/v0.103.2/
  8. manifest.json
  9. cards.index.json
  10. cards -> ../../data/v0.103.2/cards
  11. images/card_portraits -> ../../../resources/images/packed/card_portraits
  12. ```
  13. `Lab/data/v0.103.2/cards` and `Lab/resources/images/packed/card_portraits` remain the source of truth. The catalog generator should create real app-facing files for `manifest.json` and `cards.index.json`, then use symlinks for existing raw card JSON and portrait assets to avoid duplicating data during local development.

That is the kind of context that is expensive to reconstruct from chat history and almost free to read from a file. The plan tells the agent where the catalog lives, what is the source of truth, what should be generated, what should stay a symlink, and why this is not an API yet. Same questions, every session, asked once.

The plan also records verification and assumptions, which matters because agent-generated work otherwise drifts into "it compiles on my machine" territory.

		
  1. ## Verification
  2. Add unit coverage for:
  3. - manifest decoding and version/checksum comparison
  4. - card index decoding for integer and X-cost cards
  5. - existing search, pool, type, and rarity filters against generated catalog card fixtures
  6. - missing portrait behavior using known cards without portrait assets
  7. - catalog symlink resolution for card JSON and portrait assets
  8. Run the Neow's Cafe test target.
  9. Build the app with compact output:
  10. ```sh
  11. xcodebuild ... | xcbeautify
  12. ```
  13. Manually verify the Cards tab loads all cards, filters still work, search still works, and portraits load lazily without blocking the grid.
  14. Verify the static server can fetch a sample card JSON and portrait through the catalog URL layout.
  15. ## Assumptions
  16. - `v0.103.2` is the first catalog version to expose to the app.
  17. - The card metadata payload is small enough to download as one index.
  18. - Images are the only payload that needs lazy loading.
  19. - English resolved card text from the existing card JSON is sufficient for v1.
  20. - A static catalog is preferred over REST until server-side search, sync, remote hosting policy, or multi-version browsing requires it.
  21. - `Lab/catalog/` is generated, rebuildable, and tracked.
  22. - Symlinks are for local development and repo hygiene; packaged/offline catalogs should contain real files.

One open question I am still working through: how plans and devlogs relate to each other. Plans capture intent before the work. Devlogs capture what actually happened after it lands. In theory they are cleanly separate. In practice the boundary blurs. A plan evolves mid-execution, a devlog ends up recording a decision that should have been in the plan, and sometimes it is not obvious which document you actually need. The right shape for these two is something I am still figuring out. It is one of the genuine growing pains of building this out, and I am not going to pretend otherwise.

Captain Logs: Where Craft Gets Captured

This is the part of the system I think is most underexplored, and the part I would build first if I were starting over.

A Captain Log is not a transcript and not a devlog. It records the shape of the collaboration: what I asked for, how the agent responded, where I corrected or steered it, and what future agents should carry forward.

		
  1. ## 2026-05-06 - Meaningful Interaction Threshold
  2. **Context:** A purely appreciative follow-up had been logged after the daily blog-summary workflow was accepted.
  3. **User Direction:** The user clarified that Captain Logs are only for meaningful interactions and that purely positive acknowledgements such as "nice" do not need log entries.
  4. **Agent Response:** The agent removed the acknowledgement-only entry and updated the root Captain Logs instruction to skip purely appreciative or acknowledgement-only replies unless they include a decision, correction, new constraint, or requested change.
  5. **User Feedback:** The user explicitly wanted the feedback itself recorded in the original Captain Logs conversation file.
  6. **Outcome:** Captain Logs now have a clearer threshold: record meaningful collaboration, steering, decisions, corrections, constraints, requested changes, and outcomes; do not record empty praise or acknowledgement-only replies.
  7. **Carry Forward:** Future agents should avoid treating every user message as log-worthy. Log only interactions that preserve useful collaboration context.

Most writing about AI workflows focuses on prompts, context windows, or tool use. Almost nobody talks about logging the steering itself as a first-class artifact. But that is where most of the actual signal lives.

If I tell an agent "this is too formal, make it feel more like a timeline post," that is not an implementation detail. It is a preference, a piece of taste, a workflow correction. If it stays only in chat, the next agent will repeat the same mistake. If it goes into a Captain Log, the system gets a little better, and crucially, it gets better in the craft dimension, not just the technical one.

This is the move I think matters most. Code quality is one kind of memory. Collaboration quality is another. Treating them as the same thing is how you end up with a 3,000-line prompt that still produces tonally-wrong output. 😜

Where I think Captain Logs are headed, at least in my own usage, is further than just corrections. I want to use the accumulated logs as raw material for something more distilled: a "user guide" that future agents can reference to understand how I tend to think, what I tend to push back on, what good output looks like to me. And eventually, proactive follow-ups grounded in that history: "based on previous Captain Logs, would you like me to make X, Y, or Z modifications to what we just did?" The steering history becomes a working model of the user, not just a record of the session. That is still experimental, but the direction feels right.

Devlogs: The Technical Historian's Journal

In comparison, Devlogs handle the technical-history side separately, and the local instructions are explicit about the split:

		
  1. # Devlog Documentation Instructions
  2. Devlogs in this folder are high-level historical records for humans and future agents.
  3. When adding a devlog:
  4. - Use the filename pattern `NNNN - Short Title.md`.
  5. - Start the title with the log number and topic, for example `# 0002 - Neow's Cafe Card Catalog Integration`.
  6. - Start with the date in this format `Date: YYYY-MM-DD`.
  7. - Record why the work happened, what changed at a system level, and which commits/files are the durable source of detail.
  8. - Prefer commit hashes, plan names, and important paths over long code snippets.
  9. - Include verification commands and notable outcomes.
  10. - Capture decisions, assumptions, and follow-up work that would be hard to infer from diffs alone.
  11. - Keep implementation minutiae out unless it explains a decision or a future maintenance risk.

Two records, two questions:

  • Devlog: what changed technically, why, and how was it verified?
  • Captain Log: how did the human-agent collaboration evolve, and what should future agents remember about the user's direction?

I like that split because it stops every document from collapsing into the same document. A project decision, an unresolved issue, a plan, an implementation history, and a collaboration note all have different jobs. They should not all use the same shape.

Skills and Workflows: Where the Mechanical Gets Absorbed

The third layer is the most operational, and the easiest to spot when it is missing.

In Lab/.claude/skills/decompile-sts2/SKILL.md, I have a small custom skill for decompiling the local Slay the Spire 2 DLL. It describes when the skill applies, the exact command, the options, prerequisites, and failure modes.

		
  1. ---
  2. name: decompile-sts2
  3. description: Decompile the local Slay the Spire 2 DLL into C# source. Use when the user wants to decompile the game, regenerate decompiled source, or update to a new game version.
  4. ---
  5. # Decompile STS2
  6. Run the decompile script:
  7. ```bash
  8. scripts/decompile-sts2.sh
  9. ```
  10. Requires `ilspycmd` on PATH (`dotnet tool install --global ilspycmd`). The script auto-detects the game version from `release_info.json` and outputs to `decompiled/<version>/`.
  11. ## Options
  12. - `--dll PATH` — override the DLL path (default: Steam macOS ARM64 install)
  13. - `--out DIR` — override output directory
  14. - `--clean` — wipe the output directory before decompiling
  15. - `STS2_DLL_PATH` / `STS2_DECOMPILED_DIR` — env var equivalents
  16. ## Known failure modes
  17. - **`ilspycmd` not found**: run `dotnet tool install --global ilspycmd` and ensure `~/.dotnet/tools` is on PATH
  18. - **DLL not found**: pass `--dll PATH` or set `STS2_DLL_PATH` if Steam library is in a non-default location
  19. - **Output directory not empty**: pass `--clean` to wipe and re-decompile
  20. - **Version shows `unknown`**: `release_info.json` not found — verify the DLL path points to the real Steam install, not a copy

That is not thinking. It is not craft. It is pure mechanical knowledge, the exact sequence, encoded so I never have to re-explain it. Snapshot tags get a document. Release tags get a document. The timeline-summary workflow has a document and an example: snapshot tags, release tags and pages, today's work timeline summaries.

Even writing about the work became a repeatable workflow: inspect today's commits, inspect changed documentation, separate committed from uncommitted work, produce a prose-led update with a few source references blended in.

The rule I have settled on: if I notice myself doing the same sequence twice, encode it. A script, a skill, an instruction file, a plan template, a log format. The point is not documentation for its own sake. It is reducing repeated cognitive work so the next session can spend its energy on something that actually requires thinking.

Where This Gets Uncomfortable

I want to be honest about the parts that are still evolving.

The real risk I see is not whether the documents are useful, they are. Agents can navigate plans effectively: keyword search with rg or grep, infer relevance from titles, read only what is needed. The discovery problem is mostly solved. The staleness problem is not completely solved yet, at least not in a systematic way built into AgentOS. But the hope is not completely lost on this either.

If we build something one way and then refactor significantly without updating the corresponding plan, the next agent picks up outdated context as ground truth. That is a small problem the first time it happens. Catch it, patch the docs, move on. But if it compounds quietly across enough documents, the context layer starts to work against you rather than for you.

The mitigation I am working toward is making doc hygiene a first-class concern rather than an afterthought: agent instructions that prompt a check on relevant docs after significant changes, and eventually a workflow where the agent flags stale documentation proactively. The system should notice its own drift. Ideally this becomes one of the encoded principles of AgentOS itself, not something that depends on the human remembering to do it.

Captain Logs have their own version of this: not every steering moment is worth preserving. Some corrections are just me being grumpy on a Tuesday (🤣 Is that how I seemed to you, Claude? I mean it did write it on a Tuesday. lol). Telling the difference between signal worth capturing and noise that would just pollute the log is a judgment call I am still developing.

So I would not pitch this as a solved methodology. It is a working hypothesis, that the cost of structured context is lower than the cost of re-establishing context every session, and that the structure is worth maintaining in layers rather than as one undifferentiated blob.

Makes me wonder. Is there a way that I can concretely test this hypothesis? Maybe I can by comparing the token cost of achieving the same outcome with vs. without AgentOS. But to be honest, I cannot be bothered. Ha.

Self-Enriching Engineering Systems

The pattern underneath all of this is what I find most interesting: the project is designed to accumulate rather than just record.

The agent does a piece of work. I steer it. The steering becomes a Captain Log. If the work becomes a repeatable process, it becomes an agent workflow. If the work is agreed but not done, it becomes a plan. If it lands, it becomes a devlog. If it exposes a bug, it becomes an issue note. If it involves repeated commands, it becomes a skill.

Over time the repo preserves more than code. It preserves the reasoning and the collaboration environment around the code. The thinking layer, the craft layer, and the mechanical layer all get a place to live, and the boundaries between them stay legible.

That changes what I want from AI agents. Not just faster code. Environments where they can operate effectively because the surrounding context has been intentionally designed, and where the design itself improves as the project is used.

The End?

None of this changes what the human is for. The agent still needs someone to supply the judgment, hold the taste, and decide what is worth preserving. AgentOS is not an attempt to automate that. It is an attempt to make it stick.

AgentOS is not finished either. It probably never will be finished finished, that is the point of a self-enriching system. The patterns will keep evolving as the project is used, and eventually I plan to give it its own dedicated repo.

If you are building something similar, or have strong opinions about where this breaks down, I would genuinely like to know. The collaboration log for this whole endeavour is still being written.