I’ve noticed that many people have started criticizing the new Cursor experiment, so I wanted to verify it myself.
Not just the autonomous agent swarm part, but the underlying question: can AI write something as complex as a browser?
So I wrote one in two days.
Here’s how it went.
Rules
Limited set of agents:
- For this experiment, I ran up to 5 Claude Code instances in parallel git worktrees, jumping between them like units in an RTS game.
No full autonomy:
- I manually verified features inside each worktree.
It’s okay to use libraries:
- I didn’t aim to write everything from scratch, just as in the Cursor experiment; I did ask it to implement its own JS engine and CSS parser, though.
- I explicitly asked it to use a library for HTML parsing, and it chose one from Servo (see the sketch below).
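Servo’s HTML parser is html5ever; assuming that’s the library it picked (with markup5ever_rcdom for a simple DOM), a minimal parsing sketch looks roughly like this:

```rust
// Minimal sketch, assuming the Servo library in question is html5ever.
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::{Handle, NodeData, RcDom};

// Print element names to show the parsed tree structure.
fn dump(node: &Handle, depth: usize) {
    if let NodeData::Element { ref name, .. } = node.data {
        println!("{}<{}>", "  ".repeat(depth), name.local);
    }
    for child in node.children.borrow().iter() {
        dump(child, depth + 1);
    }
}

fn main() {
    let html = "<html><body><p>Hello, browser!</p></body></html>";
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut html.as_bytes())
        .expect("HTML parsing failed");
    dump(&dom.document, 0);
}
```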
The Agentic RTS Coding Experience
I started the project by running multiple instances of Claude Code: first in plan mode, then implementation, with all edits accepted.
In my experience, Claude Code is notorious for asking for permissions at seemingly random moments, and sometimes it requires confirmation before editing a file, so I had to monitor each instance to keep it from getting stuck and to push it further.
Sometimes the agents need follow-ups: you have to answer their questions or point out an issue, so you’re never bored in this game.
Another interesting side effect of running multiple instances of Claude Code is merging git worktrees. Some features get implemented in other worktrees; some get stuck in long merge-conflict resolution loops.
This is the part the Cursor team was trying to explore: how to make multiple agents work efficiently. However, I’m starting to think that the ideal solution isn’t really organization-style specialization for the agents. Given that the agent operator (me) verifies results and chooses direction, it’s more about removing bottlenecks in the codebase itself (see the sketch after this list):
- Large classes
- Large integration points
- Better module separation.
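Concretely, “better module separation” here means narrow, trait-shaped integration points, so parallel agents rarely touch the same files. A minimal Rust sketch with hypothetical names, not the actual project layout:

```rust
// Hypothetical module boundaries: each agent owns one module,
// and only these small traits are shared between worktrees.
pub mod css {
    pub struct Stylesheet; // parsed CSS rules

    pub trait CssParser {
        fn parse(&self, source: &str) -> Stylesheet;
    }
}

pub mod layout {
    use crate::css::Stylesheet;

    pub struct LayoutTree; // boxes with computed positions

    pub trait LayoutEngine {
        fn layout(&self, styles: &Stylesheet, viewport: (u32, u32)) -> LayoutTree;
    }
}

pub mod paint {
    use crate::layout::LayoutTree;

    pub trait Painter {
        fn paint(&self, tree: &LayoutTree);
    }
}
```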
I think at some point I won’t be able to keep all the modules in my head, so I’m not sure how far this idea can scale.
Verification
I asked each agent to write a lot of unit tests for its component; this covers the non-interactive stuff almost perfectly. The rendering, on the other hand, required a lot of manual testing.
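For flavor, here’s a sketch of the kind of unit test I mean, written against a tiny stand-in declaration parser (hypothetical code, not the project’s real CSS module):

```rust
// Illustrative only: a stand-in "property: value" parser and the style
// of unit test the agents produced for each component.

/// Parse "property: value" into a (property, value) pair, trimming whitespace.
fn parse_declaration(input: &str) -> Option<(String, String)> {
    let (prop, value) = input.split_once(':')?;
    let (prop, value) = (prop.trim(), value.trim());
    if prop.is_empty() || value.is_empty() {
        return None;
    }
    Some((prop.to_string(), value.to_string()))
}

#[cfg(test)]
mod tests {
    use super::parse_declaration;

    #[test]
    fn parses_a_simple_declaration() {
        assert_eq!(
            parse_declaration("color: #ff0000"),
            Some(("color".to_string(), "#ff0000".to_string()))
        );
    }

    #[test]
    fn rejects_input_without_a_colon() {
        assert_eq!(parse_declaration("color #ff0000"), None);
    }
}
```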
Manual testing of the rendering turned up plenty of issues:
- Offsets were wrong everywhere.
- Scaling didn’t work until I told it what my system needed.
- The initial text-glyph renderer had O(n^2) complexity.
I had to guide the agent through the optimizations: adding logs, pasting output back, and bisecting the bottlenecks together. Eventually, it optimized the code by adding seemingly random caches across the stack and fixed the offsets. There is still at least one other O(n^2) part of the code somewhere, because some large pages load slowly.
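The obvious fix for the glyph renderer is a cache keyed by character and size, so each glyph is rasterized once instead of once per occurrence. A sketch with hypothetical types, not the project’s actual code:

```rust
use std::collections::HashMap;

// Hypothetical rasterized glyph: a small alpha bitmap plus advance width.
struct RasterizedGlyph {
    width: u32,
    height: u32,
    advance: f32,
    pixels: Vec<u8>,
}

// Cache keyed by (character, font size in tenths of a pixel) so repeated
// characters are rasterized once instead of once per occurrence.
struct GlyphCache {
    glyphs: HashMap<(char, u32), RasterizedGlyph>,
}

impl GlyphCache {
    fn new() -> Self {
        Self { glyphs: HashMap::new() }
    }

    // Return the cached glyph, rasterizing it on the first request only.
    fn get_or_rasterize<F>(&mut self, ch: char, size_tenths: u32, rasterize: F) -> &RasterizedGlyph
    where
        F: FnOnce(char, u32) -> RasterizedGlyph,
    {
        self.glyphs
            .entry((ch, size_tenths))
            .or_insert_with(|| rasterize(ch, size_tenths))
    }
}
```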
Could I have done better?
Later in development, I added a screenshot mode so the agent could read the output by itself, but that turned into another ordeal. The agent decided to duplicate the rendering stack just for the screenshot functionality, and I had to force it to deduplicate the code. I wish I had thought about this earlier.
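The deduplicated shape is straightforward: one render routine that draws into a plain pixel buffer, with the window and the screenshot mode as two callers of the same path. A sketch with hypothetical types, not the project’s actual code:

```rust
// Hypothetical shared render path: one draw routine, two outputs.

/// A plain RGBA pixel buffer that both the window and screenshot paths share.
struct Framebuffer {
    width: u32,
    height: u32,
    pixels: Vec<u8>, // RGBA, row-major
}

impl Framebuffer {
    fn new(width: u32, height: u32) -> Self {
        Self { width, height, pixels: vec![0; (width * height * 4) as usize] }
    }
}

/// The single rendering entry point: parse, lay out, and paint into a framebuffer.
fn render_page(page_html: &str, fb: &mut Framebuffer) {
    // ... real layout and painting would go here ...
    let _ = (page_html, &fb.pixels);
}

/// Screenshot mode simply renders into a framebuffer and returns it for saving,
/// instead of owning a second copy of the rendering stack.
fn screenshot(page_html: &str, width: u32, height: u32) -> Framebuffer {
    let mut fb = Framebuffer::new(width, height);
    render_page(page_html, &mut fb);
    fb
}
```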
Code Quality
I still haven’t looked at the code in full. I know it’s bad: I saw manual cache cleanups scattered all over the place. This can be fixed; I just need to think through the architecture myself, draw some diagrams, and then run the agents with specific plans based on my target designs.
Functionality
Did the agents manage to write a functional browser? Heck no: it barely supports CSS, not to mention JS. But it’s clear to me that if I set out to implement the rest of the specs, I could do it very quickly.
Authorship Problem
The resulting code does use a lot of libraries, and apparently it pulled in a lot of stuff from Servo. If I spent more time, I could have it re-implement each of those libraries from scratch. In that case, though, would it be writing new code or just regurgitating existing open-source code? I don’t know.
And what is my role in this process? I’m not writing code; I’m merely directing agents and trying to unblock them as fast as I can. This tweet sums up my experience pretty well.
Conclusion
This was a fun experiment; I didn’t think I’d get this far in such a short time, having spent only two evenings on it. LLMs are extremely powerful, but only in known, well-trained cases.
Once we figure out how to verify code at scale and parallelize agents efficiently, we’ll get full-blown software engineering factories, producing known code tweaked to the specific task. It’s the ultimate framework.