How I Built a Skill That Makes All My Other Skills Better (Using Karpathy's Autoresearch)
I had 20+ skills running my newsletter. The Karpathy Loop showed me most of them were operating at half their potential.
I have over 20 skills running my newsletter right now. Each one is basically a compressed AI brain trained to do one specific job really well. Social notes, LinkedIn posts, SEO optimization, carousels, performance analysis. A skill takes something that used to cost you 30 minutes of context-setting and turns it into a single command.
Here's the problem nobody talks about.
When I run my skills, maybe 70 or 80% of the outputs are genuinely good. The rest miss the mark. And my "testing process" is basically: run it a few times, tweak what obviously breaks, call it done. If it looks "pretty good" on three test runs, I ship it and move on.
But I'm testing for confirmation, not for failure. I use similar inputs every time. I evaluate on gut feeling. I stop improving when it feels "good enough." Which means a skill could sit at 70% of its potential for months and I'd never know. You don't know what you don't know.
If you've ever built an app feature or shipped a workflow and thought "this works, but could it work better?"... same thing. Nobody has a systematic way to answer that question.
Then I saw what Andrej Karpathy released, and it clicked.
What Karpathy Actually Built
Karpathy, for those who don't know him, is one of the people who shaped how we think about AI today. He was a founding member of OpenAI, led Tesla's Autopilot AI, and taught Stanford's most popular deep learning course. When he builds something, people pay attention.
In early March, he open-sourced a project called autoresearch. And the idea behind it is one of those things that sounds obvious once you hear it but nobody was doing it.
Here's the setup. Karpathy had a piece of code that trains a small AI model. The training takes about 5 minutes to run and produces a score at the end that tells you how well the model learned. Lower score = better.
Instead of manually tweaking the code himself (change a setting, run the training, check the score, repeat), he gave the code to an AI agent and said:
"Your job is to make this score go down. You can change anything in the code. Run the training after each change. If the score improves, keep your change. If it doesn't, throw it away and try something else. Don't stop. Don't ask me. Just keep going."
Then he went to sleep.
He left it running for about two days. The agent worked through roughly 700 changes autonomously and found about 20 that actually improved things. Stacking them all together gave an 11% efficiency gain on a project Karpathy thought was already well-tuned after years of manual work.
In his own words:
"I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project."
And every single improvement transferred to larger models too. The agent found things that generalized.
This is a person who has been doing this exact kind of optimization manually for two decades. And the agent found improvements he'd missed on a project he'd already spent significant time tuning by hand.
He ended his thread with something that stuck with me:
"Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too."
I find Karpathy truly fascinating. In his recent podcast with Sarah Guo, he explained the idea behind autoresearch: he wants to maximize what an AI agent can do by letting it handle all of his work. He wants to take himself out of the loop between human and AI, because his presence as the human becomes the bottleneck.
This is what I've been so obsessed with lately. The idea that your presence as the human becomes the bottleneck. The highest-leverage move is not doing the work yourself, but setting up the right conditions and then stepping back.
Karpathy's tweet about it got over 8.6 million views in two days. Fortune magazine wrote about it and coined the term "The Karpathy Loop." His GitHub repo got starred by 50k+ users and forked by 7.9k people, including me.
Then Shopify's CEO Tobi Lutke tried the same pattern the night Karpathy released it. He pointed an agent at a query-expansion model, told it to optimize for quality score and speed, and went to bed.
He woke up to a 19% score improvement after 37 experiments. The 0.8B parameter model was now outperforming his previous 1.6B model. A smaller model, beating the bigger one, because the agent found better settings overnight.
Why The Karpathy Loop Works on Almost Anything
Here's the part that most people miss when they see Karpathy's results. They think:
"Cool, an AI that optimizes neural networks. That's neat but not relevant to me."
But the Karpathy Loop has nothing to do with neural networks specifically. In fact, it works anywhere, as long as you have three things:
Something you can edit: code, config, instructions, a template, a prompt
A way to measure if it got better: a score, a benchmark, a pass/fail test, any number that goes up or down
A time-boxed way to test: run the thing, get a result, decide keep or discard
That's it. That's the whole loop.
The Karpathy Loop cares about the loop, not your domain: change something, measure the result, keep or discard, repeat.
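To make "change something, measure the result, keep or discard, repeat" concrete, here is a minimal sketch of that loop in Python. It is not Karpathy's code. The function name `keep_or_discard` and the toy problem (nudging one number toward 3.0) are assumptions made purely for illustration, and lower scores count as better, matching the training setup described earlier.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def keep_or_discard(
    start: T,
    mutate: Callable[[T], T],
    score: Callable[[T], float],
    experiments: int,
) -> Tuple[T, float, int]:
    """Greedy loop: try one change, keep it only if the score improves.

    Lower scores are better. Returns the best candidate, its score,
    and how many changes were kept.
    """
    best, best_score, kept = start, score(start), 0
    for _ in range(experiments):
        trial = mutate(best)          # change something
        trial_score = score(trial)    # measure the result
        if trial_score < best_score:  # keep it...
            best, best_score = trial, trial_score
            kept += 1
        # ...or discard it: a trial that did not improve is simply dropped
    return best, best_score, kept


# Toy stand-in for "the thing you edit": a single number we want close to 3.0.
rng = random.Random(0)
best, best_score, kept = keep_or_discard(
    start=10.0,
    mutate=lambda x: x + rng.uniform(-1.0, 1.0),
    score=lambda x: (x - 3.0) ** 2,
    experiments=200,
)
print(f"kept {kept} of 200 changes, final score {best_score:.4f}")
```

Nothing in the loop knows what it is optimizing. Swap in a different `mutate` and `score` and the same few lines apply to a prompt, a config file, or a template.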
From Karpathy Loop to Skill Loop
When I saw all of this, I couldn't unsee it. It's like my third eye finally opened wide and started making connections to the things I do every day, in this case, skills.
My skills have all three primitives:
Something editable: The SKILL.md file. That's the instruction set that tells the AI what to do, how to do it, and what the output should look like.
A way to measure quality: I can build an eval rubric. Score the output on 6-10 dimensions, 1-5 each. Did the tone match? Did it follow the structure? Was the hook compelling? Did it miss any requirements?
A time-boxed test: Run the skill once. Get the output. Score it. Done.
Here's how it works:
The one thing I had to figure out: neural network metrics are clean numbers. You train the model, you get a score, done. Skill quality is fuzzy because you need to answer questions like, "Was this LinkedIn post good?" and most of the time, that isn't a yes/no question.
That's why the eval rubric became the critical piece. It turns "does this feel right?" into "does this score 4/5 on hook quality, 3/5 on tone consistency, 5/5 on structure compliance?" Suddenly you can compare experiment A to experiment B. You can see which dimensions improved and which regressed. You have data instead of vibes.
But here's where I had to go further than Karpathy's setup. A 1-5 rubric is great for humans to understand where quality stands. It's terrible for an unsupervised loop running at 2 AM. "Was the hook quality a 3 or a 4?" is a judgment call. An agent can't reliably make that call without you there.
So I added a conversion step: take the weakest dimensions from the rubric and turn them into binary yes/no checks. "Does the hook include a specific number or data point?" Yes or no. "Does every content slide have a concrete action?" Yes or no. No gray area. Two different agents scoring the same output should agree.
That's what makes the autonomous loop actually work. The 1-5 rubric tells you where the problems are. The binary evals let the agent fix them without you.
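Comparing experiment A to experiment B by dimension could look like the sketch below. The three dimension names and experiment A's scores come from the example above. Experiment B's scores and the helper `compare_scorecards` are made up for illustration.

```python
from typing import Dict, List


def compare_scorecards(
    before: Dict[str, int], after: Dict[str, int]
) -> Dict[str, List[str]]:
    """Group rubric dimensions by whether their 1-5 score went up, down, or stayed put."""
    if before.keys() != after.keys():
        raise ValueError("both scorecards must use the same dimensions")
    result: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for dimension, old in before.items():
        new = after[dimension]
        if new > old:
            result["improved"].append(dimension)
        elif new < old:
            result["regressed"].append(dimension)
        else:
            result["unchanged"].append(dimension)
    return result


experiment_a = {"hook quality": 4, "tone consistency": 3, "structure compliance": 5}
# Hypothetical second run, invented for this example.
experiment_b = {"hook quality": 5, "tone consistency": 3, "structure compliance": 4}

print(compare_scorecards(experiment_a, experiment_b))
# {'improved': ['hook quality'], 'regressed': ['structure compliance'], 'unchanged': ['tone consistency']}
```

A per-dimension view matters because a change can raise the total while quietly breaking one dimension, which a single summed score would hide.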
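Here is a rough sketch of what two binary evals might look like if written as plain string checks. In the system described here an agent answers these questions, so the code only illustrates the "no gray area" property. The carousel shape (a `hook` plus a list of `slides`) and the `ACTION_MARKER` convention are assumptions, not part of the actual skill.

```python
import re
from typing import Callable, Dict

# Assumed convention for this sketch only: a slide "has a concrete action"
# if it contains this marker. A real eval would be judged by an agent.
ACTION_MARKER = "Try this:"


def hook_has_number(carousel: dict) -> bool:
    """Does the hook include a specific number or data point?"""
    return re.search(r"\d", carousel["hook"]) is not None


def every_slide_has_action(carousel: dict) -> bool:
    """Does every content slide have a concrete action?"""
    slides = carousel["slides"]
    return bool(slides) and all(ACTION_MARKER in slide for slide in slides)


BINARY_EVALS: Dict[str, Callable[[dict], bool]] = {
    "hook_has_number": hook_has_number,
    "every_slide_has_action": every_slide_has_action,
}


def run_evals(carousel: dict) -> Dict[str, bool]:
    """Answer every yes/no check for one output."""
    return {name: check(carousel) for name, check in BINARY_EVALS.items()}


# Hypothetical skill output, invented for the example.
sample = {
    "hook": "7 prompts that cut my editing time",
    "slides": [
        "Try this: paste your draft and ask for the weakest sentence.",
        "Editing is mostly deleting.",
    ],
}
print(run_evals(sample))
# {'hook_has_number': True, 'every_slide_has_action': False}
```

Each check returns only True or False, so the same output always gets the same verdict regardless of who runs it.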
So I built a meta-skill with three phases:
Setup (you're involved): Analyze the skill, generate test cases, build the rubric, run a baseline, convert weaknesses to binary evals. You approve everything before the loop starts.
Autonomous loop (no human): Mutate one thing, run all test cases, score with binary evals, keep or discard, repeat. The agent doesn't stop. It doesn't ask permission. It just runs experiments until it hits the stopping criteria you set.
Debrief (you review): Re-score with the original 1-5 rubric so you can compare before and after in the same language. Full report on what changed, what worked, what didn't.
The human approves the plan. The machine runs the experiments.
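The middle phase can be sketched as follows. This is a simplified illustration, not the actual meta-skill. In the real loop the agent proposes each mutation and invokes the skill itself, whereas here a fixed list of `mutations` and a `run_skill` callable stand in for both. The two stopping criteria (every check passing, or several discards in a row) are assumed examples of "the stopping criteria you set."

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Experiment:
    change: str   # short description of the single mutation tried
    passes: int   # binary checks passed across all test cases
    kept: bool    # whether the mutation was kept


def run_loop(
    skill_text: str,
    mutations: List[Tuple[str, Callable[[str], str]]],
    run_skill: Callable[[str, str], str],
    evals: List[Callable[[str], bool]],
    test_inputs: List[str],
    max_discards_in_a_row: int = 3,
) -> Tuple[str, List[Experiment]]:
    """Mutate one thing, run all test cases, score, keep or discard, repeat."""

    def total_passes(text: str) -> int:
        return sum(
            check(run_skill(text, case)) for case in test_inputs for check in evals
        )

    best_text, best_passes = skill_text, total_passes(skill_text)
    perfect = len(test_inputs) * len(evals)
    log: List[Experiment] = []
    discards = 0

    for change, apply_change in mutations:
        # Stopping criteria (assumed): nothing left to fix, or a losing streak.
        if best_passes == perfect or discards >= max_discards_in_a_row:
            break
        trial = apply_change(best_text)
        passes = total_passes(trial)
        kept = passes > best_passes
        if kept:
            best_text, best_passes, discards = trial, passes, 0
        else:
            discards += 1
        log.append(Experiment(change, passes, kept))

    return best_text, log
```

The returned log is what feeds the debrief: every experiment is recorded with its result, including the ones that were thrown away.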
And honestly? The results surprised me.
What the Karpathy Loop Unlocks for Your Skills
By the end of this post, you'll have:
The complete autoresearch skill: The three-phase system I built that you can run on any skill today. Setup, autonomous loop, debrief. Point it at a skill, let it run, get a before/after scorecard.
Real results from my own skills: Before/after scorecards from running autoresearch on [2-3 specific skills], including the specific blind spots the loop found that I'd missed after months of manual testing.
How to design eval rubrics that actually work: The hardest part of this whole system isn't the loop. It's defining what "better" means. I'll show you how to build scoring criteria that catch real quality problems instead of just validating what you already believe.
The Karpathy Loop for app features and workflows: How to adapt this same loop for code performance, loading times, content templates, or any workflow with a measurable output.
One thing to know before we dive in: This entire system runs on Claude Code. I'd suggest using Opus 4.6 to run this because it requires long-running tasks and sub-agents. If you're not using Claude Code yet, start with my beginner's guide, then level up with the ultimate Claude Code guide. If you already are, you can run autoresearch on your own skills today.
Let's build it.
How the Autoresearch Skill Works
Before you run this on anything, ask one question: is this skill ready for optimization, or does it need a rewrite?
The Karpathy Loop has a sweet spot. If your skill works 60-80% of the time and fails in specific, repeatable ways, autoresearch will find those failure patterns. If your skill works but the output is generic or bland, the loop can target specificity. But if your skill fails completely or produces the wrong type of output, optimization won't help. Rewrite it first, then optimize. And if your skill already works 90%+ of the time, you're probably hitting diminishing returns. The remaining 10% is usually taste or edge cases that binary evals can't capture.
The biggest prerequisite: you need to know what "good" looks like. You can't build evals without a clear picture of quality. If you can't describe what a good output looks like for your skill, that's the work to do first. Not configuring the loop.
Once you've cleared that bar, the skill runs in three phases. The human is involved in the first and last. The middle runs completely on its own.
I'll share how I improved my LinkedIn Carousel Generator skill as well as my Infographic Generator skill using Nano Banana, which comes as a bonus in today's post.
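As a quick way to hold those bands in your head, here is a small sketch that encodes only the thresholds stated above. The function name `readiness` and its labels are made up. Success rates the paragraph does not cover (for example 85%, or 40% with the right output type) are deliberately returned as a judgment call rather than guessed at.

```python
def readiness(success_rate: float, wrong_output_type: bool = False) -> str:
    """Map a skill's rough success rate (0.0 to 1.0) to the advice above."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0.0 and 1.0")
    if wrong_output_type or success_rate == 0.0:
        return "rewrite first"          # optimization won't help
    if success_rate >= 0.9:
        return "diminishing returns"    # what's left is mostly taste and edge cases
    if 0.6 <= success_rate <= 0.8:
        return "sweet spot"             # repeatable failures the loop can find
    return "judgment call"              # not covered by the guidance above


print(readiness(0.7))                          # sweet spot
print(readiness(0.95))                         # diminishing returns
print(readiness(0.7, wrong_output_type=True))  # rewrite first
```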
Phase 1: Setup (Two Touchpoints)
This is where you and the agent align on what "better" means. The whole phase is designed around just two human decisions. Everything else runs automatically.
All you need to do is trigger the autoresearch skill and name the skill you want to improve. An example appears later in the case study.
The agent works first. It reads your entire skill directory: SKILL.md, reference files, examples, configs, everything. It documents what the skill does, what goes in, what comes out, and which files can be edited vs. which ones stay fixed. Then it scans your project for real test inputs and proposes what it found.










