Kirill Chernakov's Blog

533 PRs, $379, and one rap battle: seven weeks of autonomous agents on an abandoned side project

Double exposure: developer in bed with a phone glowing on his face, overlaid on a Mac terminal showing Claude Code agent sessions.
5 am bedtimes for weeks.

AI agents only amplify what’s already there. The pool didn’t build a product on top of nothing — it compounded a few months of boring scraping I’d done by hand back in 2024, before abandoning the project for a year and a half.

YOLO night

On March 6th, I left Claude Code running overnight on a side project I hadn’t touched in 1.5 years.

The prompt was one sentence: “you have full freedom, add most creative features you can come up with”. The setup was one Docker container with --dangerously-skip-permissions, one branch called clowd, and one supervisor: me, checking Vercel deploys from bed. The project was Townleap, a website for people considering permanent relocation that I’d been building on weekends until my day job ate the weekends.

I woke up to 109 commits (and an alert from Vercel telling me I’d hit the limit of 100 deployments per day).

The first hour was slop. At 22:14, it shipped a Russian City Tinder. At 22:17, This or That (kind of Facesmash but for cities). At 22:19, City Roast. At 22:23, a “Surprise me” button. At 22:26, a satirical news generator (why do LLMs love fake newspapers so much?). At 22:54, City Rap Battle (wtf?).

After 23:00, it stopped messing around. Score Simulator with adjustable factor weights. Cost Breakdown page. Timezone Overlap tool for remote workers. Open Graph tags. A sitemap. The “/” keyboard shortcut to focus the search bar. A Heat Map color-coded by region. Twenty-something things, most of them already on my mental list of features to add at some point.

The bill was $90: $20 for the Claude sub I was already paying, $70 in overage from one night.

I kept the good half, reverted the rap battle, and spent the next seven weeks figuring out how to do that on purpose.

Russian-language city Tinder card asking about rent affordability. Buenos Aires vs Medellín head-to-head pick-one card. Tbilisi roast modal with rival takes from Lisbon and Belgrade. Timezone Overlap tool: cities ranked by working-hours overlap with a US East team. The City Dispatch satirical newspaper layout with fake Dubai headlines. Lisbon vs Dubai rap battle output bragging about homicide stats. Score Simulator with adjustable factor weights. City Heat Map color-coded by region. Cost Breakdown for Amsterdam, $3,368/month split by category.

Nine of the YOLO night’s twenty-something features. Five were rolled back; four still serve on master.

Half slop, half roadmap

Half the time, the agent (Opus 4.6 at the time) built something I wanted, or at least something close to it; the other half, it produced slop, funny or eerie. Per-commit deployments gave me a window into the process, but genuinely useful features buried in a pile of slop inside one giant PR weren’t workable. Meanwhile, extra usage was eating too much, too quickly. It needed love, but mostly structure and order. Then a sweaty sprint hit at my day job, and bye my dear friend: I didn’t touch the site for weeks.

Generated 'Safety Bubble' city passport for Haifa, sent to a friend
A friend, on the auto-generated “Safety Bubble” passport for Haifa: “Также самый безопасный город планеты.” (“Also, the safest city on the planet.”)

Plumbing, then synthetic users

Three weeks later, I returned. First I added a Telegram bot to guide the agent: two modes — /status to check tasks, and messages landing in INBOX.md, which the agent reads between rounds. It worked but wasn’t perfect: unformatted logs messed up parsing, containers crashed, and manual resets were needed. So I added /bug to send feedback to GitHub. Now there was a queue. Better.
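A minimal sketch of how the three message paths above (/status, /bug, and everything else landing in INBOX.md) could be routed. The `Action` type, the function names, and the one-bullet-per-note format are assumptions for illustration, not the bot's actual code, and the Telegram and GitHub calls are left out.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Action:
    kind: str     # "status", "bug", or "inbox"
    payload: str  # text that accompanies the action (empty for /status)


def route_message(text: str) -> Action:
    """Map one incoming bot message to one of the three paths."""
    parts = text.strip().split(maxsplit=1)
    command = parts[0] if parts else ""
    rest = parts[1].strip() if len(parts) > 1 else ""
    if command == "/status":
        return Action("status", "")
    if command == "/bug":
        # Everything after the command becomes the feedback body.
        return Action("bug", rest)
    # Anything else is a free-form note for the agent.
    return Action("inbox", text.strip())


def append_to_inbox(inbox: Path, note: str) -> None:
    """Append a note as one bullet so the agent can read it between rounds."""
    with inbox.open("a", encoding="utf-8") as fh:
        fh.write(f"- {note}\n")
```

Keeping the routing as a pure function means it can be checked without a live bot; only `append_to_inbox` touches the filesystem.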

To stop walking to the laptop, I added a Claude Code session on the host machine with remote control. Clunky: sessions stalled, permission prompts didn’t always come through, and I’d be staring at “Whackamolling” like a fool for ten minutes, but it gave me room. With the plumbing done, I became the bottleneck. Someone had to review all these half-slop, half-good features. I needed QA. I needed users. I was too embarrassed to invite friends. I’d told a few about the YOLO night and shown them the rap battle, and I saw the look they gave me. Besides, not everyone is vibeslopping at 4 am.

So I went with the most logical move a person in my state could make: synthetic users. The idea: spawn an LLM as a fake user, give it a persona and a goal, let it browse the site for ten minutes, and have it write up what was confusing or broken. Sure, LLMs are just predicting the next token, remember? Dress one as a distinct character and turn it loose. I wrote five: young post-2022 emigrants looking for an optimal city to settle, each drafting a shortlist. A few hours of scaffolding the agent-builder CLI and the skill description, and voilà, the first report came in. Reading… hm, makes sense:

Anya, on Phuket: “Nine percent sunny in summer? In PHUKET? That’s got to be a data bug.”

Dmitri, on the Safety First page: “Hungary? The same country that the blockquote implies they’d exclude? Orbán’s Hungary ranks #7 for ‘safety’ while the democracy index calls it a hybrid regime. This contradicts the promise at the top of the page.”

Max, on Gothenburg’s school data: “One international school. My older kid is 9 and speaks Russian and English. Swedish immersion at 9? That’s a big ask.”

Lena, on Chiang Mai: “I want to know if there are actually trails nearby. Map? Elevation? Not obvious.”

Persona tests on the left, dispatcher opening issues on the right.
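The persona setup described above (a character, a goal, ten minutes of browsing, then a write-up) might look roughly like this. The `Persona` fields and the prompt wording are assumptions; only the name, the emigrant framing, and the ten-minute budget come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    background: str  # who this person is
    goal: str        # what they are trying to get done on the site


def build_prompt(persona: Persona, site_url: str, minutes: int = 10) -> str:
    """Render the instructions handed to the LLM playing this persona."""
    return (
        f"You are {persona.name}. {persona.background}\n"
        f"Goal: {persona.goal}\n"
        f"Browse {site_url} for about {minutes} minutes as this person would.\n"
        "Then write a report listing everything that was confusing or broken, "
        "quoting the page where you saw it."
    )


# Illustrative persona: the name appears in the post, the wording is assumed.
anya = Persona(
    name="Anya",
    background="A young post-2022 emigrant choosing a city to settle in.",
    goal="Draft a shortlist of cities.",
)
prompt = build_prompt(anya, "https://townleap.com")
```

The resulting string would then be passed to whichever agent drives the browser; that part depends on the harness and is not shown.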

Nothing earth-shattering, but much meatier than another design review from gstack (a CLI toolkit of QA, design, and review skills I lean on heavily; more on it below). I asked the agent to turn the report into tickets for the yolo-agent. A few hours later, the second persona test ran: 2 of 5 personas finished a shortlist. One hyper-pedantic persona went sideways into a dark corner of the site with no links and never returned. Ten more issues, run the cycle again. It was taking shape. Glance at the clock. 5 am. Oops. One more persona test, ten more issues to keep the agent busy while I get my five hours.

Next day, I finally got fed up with the one-branch setup and extended it into a pool: a dispatcher sorting issues by priority, spawning Docker workers, each claiming an issue in its own container and branch. I told it to take screenshots and record before/after for each feature, and to upload the artifacts to my bucket, so I could sanity-check and merge smaller PRs at a glance.

The pool, wired up

Simple dispatcher, no LLM, a cron + bash script: get the sorted list of issues, for each, spawn a container with Claude Code, cap at 5 workers (I was still under the delusion I was counting money here). Claude itself figured out how to extract the OAuth token from Keychain on container start. A launchd process refreshed the token every 8 hours.
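The real dispatcher is cron plus bash; the sketch below only illustrates the selection step (sort by priority, skip issues that already have a worker, respect the cap of 5) in Python. The `Issue` shape and the "lower number is more urgent" ordering are assumptions, and spawning the container is out of scope.

```python
from dataclasses import dataclass

MAX_WORKERS = 5  # the cap mentioned above


@dataclass(frozen=True)
class Issue:
    number: int
    priority: int  # assumption: lower value = more urgent


def pick_issues_to_start(open_issues, running, max_workers=MAX_WORKERS):
    """Return the issues to spawn workers for on this tick.

    open_issues: issues waiting in the queue
    running: set of issue numbers that already have a worker container
    """
    free_slots = max(0, max_workers - len(running))
    candidates = [i for i in open_issues if i.number not in running]
    # Most urgent first; ties go to the older (lower-numbered) issue.
    candidates.sort(key=lambda i: (i.priority, i.number))
    return candidates[:free_slots]


queue = [Issue(12, 2), Issue(7, 0), Issue(9, 1), Issue(15, 0), Issue(3, 3)]
picked = pick_issues_to_start(queue, running={9, 20, 21})
print([i.number for i in picked])  # [7, 15]
```

With three workers already running, only two slots are free, so the two most urgent unclaimed issues are chosen and the rest wait for the next tick.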

OrbStack: the dispatcher and four claude-code workers, one container per issue.

At the start, a worker checked out master, ran pnpm install, read the issue, and started working. A skill told it to self-review with agent-browser, QA, and screenshot before opening a PR. My comments on PRs triggered a follow-up worker; the dispatcher marked PRs with wip / ready labels. Merge conflicts also triggered auto-resolving.

I’d worried that assigning each worker to a single issue would starve it of project context. The opposite happened: the context window became freer — each worker loaded only what its issue touched, and a denser CLAUDE.md with links to targeted docs covered the rest.

I’d had enough of the extra usage and subscribed to the $200 Max plan. (A year ago, I’d thought “what idiots would pay that much for an LLM?”) I ran persona tests in the morning, checked fixes at lunch, and evolved the Docker setup in the evenings and nights. Anthropic fixed remote control and shipped auto-mode just in time: now I could have a “Pool watcher” session on the laptop without babysitting permissions for the crazy bash commands Claude wrote. Ten more steady days, 5–7 features per day.

Townleap homepage evolving across 533 commits, desktop view. Townleap homepage evolving across 533 commits, mobile view.
Homepage, desktop and mobile, every ~30 commits over seven weeks.

What I gained, what I lost

I got agentic-pilled. I was a Cursor champion for over a year, tried it way back in 2023, and made fun of everyone who declared AGI and stopped looking at the code. Suddenly, the IDE-augmentation level felt one abstraction too low — especially on a side project where I don’t really care about the code.

Essentially, I became a manager of an agent pool. What I lost: the pixel-perfecting and design tweaking I actually enjoyed. Instead of a few delightful hours iterating on fonts, buttons, and layouts in code, I’d find myself writing a third prompt in a row trying to convey my taste. Irritated. Annoyed. Argh. The pet project, where I was supposed to play with the fonts, buttons, and layouts I never get to touch at my day job, turned into a FIFA-manager-like game. The designer was almost gone; the product manager prevailed.

What I gained: those polishing sessions had never left room to actually ship the things sitting in the roadmap in my ADHD mind. Now I was deploying so much more. A few weeks in, I accepted it: the tradeoff is worth it.

A little nuance: it didn’t detach me from the laptop or the phone. 5 am bedtimes for weeks. Two jobs: Elastic by day, pool by evening into morning. My girlfriend was not pleased.

Meme: hooded developer at a messy desk piled with Red Bull cans, SKILL.md and MCP Protocol books, telling his girlfriend in the doorway 'just 1 more skill babe'.
A friend sent this. Hit a little too close.

Eventually I shared the site, and friends noticed: it doesn’t feel like typical AI slop. Why? Because the foundation was hand-curated. A brief side story: I started Townleap in 2024, around when I discovered one of the first “agentic” coding tools, some solo-dev Cursor-like extension that I trusted, for the first time, with a 5-minute round trip scaffolding scrapers for my dream city-comparison spreadsheet. Long story short: I switched back to writing by hand after a few days, and for the next few months I added GDP, taxes, homicide rates, expenses and so on, until I lost motivation and free time grew scarce. Now, with a pool of workers producing slop at times, I could keep it in line: periodically steering toward reproducible facts, forcing it to reintegrate ad-hoc data into the proper dataset, and extending the existing architecture instead of band-aiding gaps.

Those months hand-typing GDP tables and tax brackets in late 2024 — before pausing for a year and a half — were essential. The pool didn’t build from nothing; it built on the groundwork I’d already done.

The other thing keeping the slop in check: gstack. The design-review skill helped me converge on a coherent visual style instead of the never-ending flow of generic gradients, and the audit pass surfaced real Web Vitals regressions I’d have missed. Caveats: it’s a token glutton — easily my single biggest line item per run — and the recommendations occasionally drift into generic “increase contrast, add whitespace” territory you have to push back on. Still net positive, with a finger on the budget.

The numbers

What it cost me out of pocket — seven weeks, March 2 to April 26:

| Line item | Charge |
| --- | --- |
| Claude Pro sub (Mar 2, pre-upgrade) | $20.00 |
| YOLO night overage (Mar 6) | $50.00 + $10.00 |
| Pre-Max overage (Mar 7) | $50.00 |
| Claude Max 20x upgrade (Mar 27) | $195.80 |
| Extra usage after Max ran out (Apr 21) | $53.10 |
| Total Anthropic, seven weeks | ~$379 |
| Vercel Pro (incl. YOLO-night overage) | $20/mo |
| Cloudflare R2 + Worker (artifact bucket) | $0 |
| SEO SaaS | $29/mo |

What it would have cost me without the subscription — Claude Code logs each session’s total_cost_usd as if I were paying API rates:

| Window | Tokens | API-equivalent |
| --- | --- | --- |
| Seven weeks (Mar 2 – Apr 26) | 3B (~98% cache) | ~$4,000 |

So: ~$379 out of pocket against ~$4,000 of metered tokens. ~10× subsidy. Most of that is cache reads, which the API meter still counts — your mileage will vary.
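For anyone checking the arithmetic, here is the sum of the Anthropic line items and the resulting ratio. The charges are copied from the table above; the $4,000 figure is the rounded API-equivalent estimate, so the ratio is approximate too.

```python
from decimal import Decimal

# Anthropic charges from the table above, in USD.
charges = {
    "Claude Pro sub (Mar 2)": Decimal("20.00"),
    "YOLO night overage (Mar 6)": Decimal("50.00") + Decimal("10.00"),
    "Pre-Max overage (Mar 7)": Decimal("50.00"),
    "Claude Max 20x upgrade (Mar 27)": Decimal("195.80"),
    "Extra usage after Max ran out (Apr 21)": Decimal("53.10"),
}

out_of_pocket = sum(charges.values(), Decimal("0"))
api_equivalent = Decimal("4000")  # rounded estimate from the logs
ratio = api_equivalent / out_of_pocket

print(out_of_pocket)    # 378.90
print(round(ratio, 1))  # 10.6
```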

The receipts.

Detour: a week with Codex

Between feedback threads and broadening scope, I burned through even the $200 Max weekly limits. I had a few quiet days. Touching grass, feeling spring in the air, playing padel, and going to sleep before 2 am.

Two days in, the urge came back. Still almost a week until Claude resets. I decided to try Codex: asked it to familiarize itself with the Docker pool setup, transfer the persona-test skill, swap the CLI in the image — all on a parallel branch, no time for a proper refactor. On the $20 ChatGPT sub, btw. So I let it run.

Same cockpit, different brain — Codex CLI in the worker pool.

At first, it was bliss: GPT-5.4 was faster and felt more thorough on SEO, correctness, and data gaps. Soon, I noticed PR descriptions becoming slop walls. Who cares? I had screenshots for a reason.

Around then, I’d been polishing a new game mechanic — Tinder, but for preferences: swipe yes/no on 15 lifestyle questions, get a ranked city shortlist at the end.

Preferences quiz evolving across PRs, desktop view. Preferences quiz evolving across PRs, mobile view.
The quiz, PR by PR — desktop and mobile.

I shared it with a few groups of friends; they shared it further. Friends — mostly — playing with something I’d built end-to-end as the sole product owner. A small thrill. A handful of them then deep-dived into the comparison page beyond the quiz. But every time someone passed it on, they shared it as screenshots — not the unique link with the carefully crafted OG card.

The OG card the link would have rendered if anyone had actually pasted the URL.

Worse: they were stitching together 2–3 captures, because the result list didn’t fit on one screen.

So I filed an issue for Codex: make the results page tighter and screenshot-worthy. The agent followed the instructions and reduced paddings and margins. Not ideal, but it worked. Then I looked carefully at the copy… it said “here’re your results, screenshot-worthy”. I was taken aback. “Are you for real? Do not leak instructions to the end user.” “Sorry, my bad, here is the corrected version:” and then the same slop again. In the end, the Codex-powered pool finished 5–6 issues per 5h quota on a cheap plan. Cool, but on the design side it was annoyingly horrible.

Codex shipped the design brief verbatim. "VERDICT ABOVE THE FOLD." Real users would have seen this.

Anthropic kept shipping the thing I needed

Finally, Friday night. But still 24 hours until the limit resets. I gave up and enabled extra usage. Luckily, Mr. Dario decided to send $200 in credits in another round of his spending spree. The pool burned through that $200 of extra usage in one evening. Ouch. It’s painful to spend, even when the credits are free.

On the bright side: Boris and the team were shipping like crazy and outshipped me on many fronts:

3 billion tokens

At week 5, I set up DuckDB and ingested a pile of JSONL logs to build a rough observability layer. One of the first aggregations — total tokens. 3 billion tokens across seven weeks. Come again? Mostly cache. Still, it’s wild to type. Sure, I’d read the rants about AI labs running on negative margins and the alarmists crying bubble. Seeing it for myself hit different.
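The aggregation itself was done in DuckDB; the sketch below swaps in plain Python so it runs without dependencies, and shows the same idea: sum token counts across JSONL records and see how much of the total is cache reads. The `usage` key and the four field names are assumptions about the log shape, as is the choice to measure the cache share by cache reads alone, so check them against real logs before trusting the numbers.

```python
import json
from collections import Counter
from typing import Iterable

# Assumed field names inside each record's "usage" object.
TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def summarize(lines: Iterable[str]) -> dict:
    """Sum token counts over JSONL records and report the cache-read share."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        usage = record.get("usage") if isinstance(record, dict) else None
        if not isinstance(usage, dict):
            continue  # record carries no token usage
        for field in TOKEN_FIELDS:
            totals[field] += int(usage.get(field) or 0)
    total = sum(totals.values())
    cache_share = totals["cache_read_input_tokens"] / total if total else 0.0
    return {"total_tokens": total, "cache_read_share": cache_share}


sample = [
    '{"usage": {"input_tokens": 10, "output_tokens": 20, "cache_read_input_tokens": 970}}',
    "not json",
    "",
    '{"type": "other"}',
]
result = summarize(sample)
```

In practice the lines would come from iterating over the log files, for example `summarize(open(path, encoding="utf-8"))` per file.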

Dashboard for the Mar 2 – Apr 26 window: 3B tokens, ~$4k API-equivalent.

The other thing the logs confirmed: persona tests were the critical piece. Continuity of the reports made progress visible, and they surfaced gaps that no unit test could.

Other tools I tried

Codex CLI. Interactively, it felt very close, but the pieces my autonomous workflow relied on (auto mode, /loop, monitors) were all missing.

Gemini CLI. Tried it in the first two weeks. Thought I’d arm it as a third titan in the pool to avoid limits. The model was fast, but the terminal flickered like a disco ball at the time, and again — no loops, no monitors, and a myriad of permission dialogs. (Disco-ball clip ↗)

Opencode / Pi / etc. Haven’t tried — I wanted to focus on my website, not set up a harness for a whole weekend. Besides, once I saw how wild the gap is between subs and pay-as-you-go pricing, I committed to the sub.

What it can’t do

The obvious failures are easy to spot and easy to laugh at: a mobile-only bug “fixed” and self-verified with a desktop screenshot. Plausible-sounding copy that, on a second read, contains the literal string “VERDICT ABOVE THE FOLD”. The rap battle. You QA, you push back, you move on.

The harder ceiling is the one you don’t see in any single PR. Three things the pool just can’t do:

That’s why I still review every PR. The pool compresses my time; it doesn’t replace my judgment — and the moment I pretend it does, the cracks show up in the product.

Where it’s going

No roadmap promises here, folks. The project ships again. The data layer keeps deepening. I go to bed before 2 am most nights now.

If you’re picking your next city, take the quiz at townleap.com, see if it can shortlist a place worth moving to. And if it does — please share the link, not a screenshot.


Postscript: two weeks later (May 4)

812 closed PRs total — 439 of them in the two weeks since I locked the draft. Second $200 cycle just rolled. A few things changed:

GitHub Projects kanban board for Townleap — backlog, in progress, in review, blocked, done.
The queue, with a face. p0–p3, status synced by launchd.

Same caveats. The product still ships, the data layer keeps deepening, and the second $200 still hurts a little.