The AI Maker

🧪 Maker Labs

How I Built a Skill That Makes All My Other Skills Better (Using Karpathy's Autoresearch)

I had 20+ skills running my newsletter. The Karpathy Loop showed me most of them were operating at half their potential.

Wyndo
Mar 26, 2026
∙ Paid
[Illustration: using Karpathy's autoresearch to improve Claude Skills]

I have over 20 skills running my newsletter right now. Each one is basically a compressed AI brain trained to do one specific job really well. Social notes, LinkedIn posts, SEO optimization, carousels, performance analysis. A skill takes something that used to cost you 30 minutes of context-setting and turns it into a single command.

Here’s the problem nobody talks about.

When I run my skills, maybe 70 or 80% of the outputs are genuinely good. The rest miss the mark. And my “testing process” is basically: run it a few times, tweak what obviously breaks, call it done. If it looks “pretty good” on three test runs, I ship it and move on.

But I’m testing for confirmation, not for failure. I use similar inputs every time. I evaluate on gut feeling. I stop improving when it feels “good enough.” Which means a skill could sit at 70% of its potential for months and I’d never know. You don’t know what you don’t know.

If you’ve ever built an app feature or shipped a workflow and thought “this works, but could it work better?”... same thing. Nobody has a systematic way to answer that question.

Then I saw what Andrej Karpathy released, and it clicked.

What Karpathy Actually Built

Karpathy, for those who don’t know him, is one of the people who shaped how we think about AI today. He was a founding member of OpenAI, led Tesla’s Autopilot AI, and taught Stanford’s most popular deep learning course. When he builds something, people pay attention.

Andrej Karpathy (@karpathy)
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the
7:53 PM · Mar 7, 2026 · 10.8M Views

1.04K Replies · 3.65K Reposts · 28.2K Likes

In early March, he open-sourced a project called autoresearch. And the idea behind it is one of those things that sounds obvious once you hear it but nobody was doing it.

Here’s the setup. Karpathy had a piece of code that trains a small AI model. The training takes about 5 minutes to run and produces a score at the end that tells you how well the model learned. Lower score = better.

Instead of manually tweaking the code himself (change a setting, run the training, check the score, repeat), he gave the code to an AI agent and said:

“Your job is to make this score go down. You can change anything in the code. Run the training after each change. If the score improves, keep your change. If it doesn’t, throw it away and try something else. Don’t stop. Don’t ask me. Just keep going.”

Then he went to sleep.

He left it running for about two days. The agent worked through roughly 700 changes autonomously and found about 20 that actually improved things. Stacking them all together gave an 11% efficiency gain on a project Karpathy thought was already well-tuned after years of manual work.

In his own words:

“I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.”

And every single improvement transferred to larger models too. The agent found things that generalized.

This is a person who has been doing this exact kind of optimization manually for two decades. And the agent found improvements he’d missed on a project he’d already spent significant time tuning by hand.

He ended his thread with something that stuck with me:

“Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm. It’s worth thinking about whether your problem falls into this bucket too.”

I find Karpathy truly fascinating. In his recent podcast with Sarah Guo, he explained the idea behind autoresearch: he wants to maximize what an AI agent can do by letting it handle all of his work. He wants to remove himself from the loop, because his presence as the human becomes the bottleneck. Check out this clip:

This is what I've been so obsessed with lately. The idea that your presence as the human becomes the bottleneck. The highest-leverage move is not doing the work yourself, but setting up the right conditions and then stepping back.

Karpathy's tweet about it got over 8.6 million views in two days. Fortune magazine wrote about it and coined the term "The Karpathy Loop." His GitHub repo got starred by 50k+ users and forked by 7.9k people, including me.

Then Shopify’s CEO Tobi Lutke tried the same pattern the night Karpathy released it. He pointed an agent at a query-expansion model, told it to optimize for quality score and speed, and went to bed.

tobi lutke (@tobi)
OK this thing is totally insane. Before going to bed I... * used try to make a new qmdresearcher directory * told my pi to read this github repo and make a version of that for the qmd query-expansion model with the goal of highest quality score and speed. Get training data from
10:25 PM · Mar 8, 2026 · 792K Views

122 Replies · 245 Reposts · 4.78K Likes

He woke up to a 19% score improvement after 37 experiments. The 0.8B parameter model was now outperforming his previous 1.6B model. A smaller model, beating the bigger one, because the agent found better settings overnight.

Why The Karpathy Loop Works on Almost Anything

Here’s the part that most people miss when they see Karpathy’s results. They think:

“Cool, an AI that optimizes neural networks. That’s neat but not relevant to me.”

But the Karpathy Loop has nothing to do with neural networks specifically. In fact, it works anywhere, as long as you have three things:

  1. Something you can edit: code, config, instructions, a template, a prompt

  2. A way to measure if it got better: a score, a benchmark, a pass/fail test, any number that goes up or down

  3. A time-boxed way to test: run the thing, get a result, decide keep or discard

That’s it. That’s the whole loop.

The Karpathy Loop cares about the loop, not your domain: change something, measure the result, keep or discard, repeat.
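Here's a minimal Python sketch of that loop. It is not Karpathy's code. The "thing to edit" is just a pair of numbers, and `toy_score` and `toy_mutate` are hypothetical stand-ins I made up so the example runs on its own. As in his setup, a lower score is better.

```python
import random
from typing import Callable

def keep_or_discard_loop(
    candidate: list[float],
    score: Callable[[list[float]], float],
    mutate: Callable[[list[float], random.Random], list[float]],
    budget: int,
    seed: int = 0,
) -> tuple[list[float], float, int]:
    """Run `budget` experiments; adopt a change only if the score drops."""
    rng = random.Random(seed)
    best, best_score, kept = candidate, score(candidate), 0
    for _ in range(budget):
        trial = mutate(best, rng)       # 1. change something
        trial_score = score(trial)      # 2. measure the result
        if trial_score < best_score:    # 3. keep it (lower is better) ...
            best, best_score, kept = trial, trial_score, kept + 1
        # ... or discard it by simply not adopting the trial
    return best, best_score, kept

# Hypothetical stand-ins: the metric is squared distance from a target
# that the loop itself knows nothing about.
def toy_score(x: list[float]) -> float:
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

def toy_mutate(x: list[float], rng: random.Random) -> list[float]:
    changed = list(x)                   # copy, so the current best is untouched
    i = rng.randrange(len(changed))
    changed[i] += rng.uniform(-0.5, 0.5)
    return changed

start = [0.0, 0.0]
best, best_score, kept = keep_or_discard_loop(start, toy_score, toy_mutate, budget=200)
print(f"start={toy_score(start):.2f} best={best_score:.4f} kept={kept}/200")
```

Swap in your own `score` and `mutate` and the loop itself doesn't change. That's the point: it only needs something editable, a number, and a budget.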

From Karpathy Loop to Skill Loop

When I saw all of this, I couldn’t unsee it. It’s like my third eye finally opened wide and started making connections to the things I do every day, in this case, skills.

My skills have all three primitives:

  1. Something editable: The SKILL.md file. That’s the instruction set that tells the AI what to do, how to do it, and what the output should look like.

  2. A way to measure quality: I can build an eval rubric. Score the output on 6-10 dimensions, 1-5 each. Did the tone match? Did it follow the structure? Was the hook compelling? Did it miss any requirements?

  3. A time-boxed test: Run the skill once. Get the output. Score it. Done.

Here’s how it works:

[Figure: how the Karpathy Loop works]

The one thing I had to figure out: neural network metrics are clean numbers. You train the model, you get a score, done. Skill quality is fuzzy because you need to answer questions like, “Was this LinkedIn post good?” and most of the time, that isn’t a yes/no question.

That’s why the eval rubric became the critical piece. It turns “does this feel right?” into “does this score 4/5 on hook quality, 3/5 on tone consistency, 5/5 on structure compliance?” Suddenly you can compare experiment A to experiment B. You can see which dimensions improved and which regressed. You have data instead of vibes.
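To make that comparison concrete, here's a small Python sketch. The scores are invented for illustration, and the three dimension names are just the ones from the example above, not my full rubric.

```python
from statistics import mean

# Hypothetical 1-5 rubric scores for two experiments on the same test input.
experiment_a = {"hook quality": 3, "tone consistency": 3, "structure compliance": 5}
experiment_b = {"hook quality": 4, "tone consistency": 3, "structure compliance": 4}

def compare(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-dimension change (after minus before) on a shared rubric."""
    if before.keys() != after.keys():
        raise ValueError("both experiments must use the same rubric dimensions")
    return {dim: after[dim] - before[dim] for dim in before}

delta = compare(experiment_a, experiment_b)
improved = [dim for dim, change in delta.items() if change > 0]
regressed = [dim for dim, change in delta.items() if change < 0]

print("improved:", improved)
print("regressed:", regressed)
print("mean A:", round(mean(experiment_a.values()), 2))
print("mean B:", round(mean(experiment_b.values()), 2))
```

In this made-up case both experiments have the same average, even though one dimension improved and another regressed. That's why I look at the per-dimension view and not just a single overall number.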

But here’s where I had to go further than Karpathy’s setup. A 1-5 rubric is great for humans to understand where quality stands. It’s terrible for an unsupervised loop running at 2 AM. “Was the hook quality a 3 or a 4?” is a judgment call. An agent can’t reliably make that call without you there.

So I added a conversion step: take the weakest dimensions from the rubric and turn them into binary yes/no checks. “Does the hook include a specific number or data point?” Yes or no. “Does every content slide have a concrete action?” Yes or no. No gray area. Two different agents scoring the same output should agree.

That’s what makes the autonomous loop actually work. The 1-5 rubric tells you where the problems are. The binary evals let the agent fix them without you.
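Here's roughly what binary evals can look like, sketched in Python. One caveat: in my setup the agent answers these questions, while this sketch uses plain code checks as a stand-in. The sample carousel, the `ACTION_VERBS` list, and the "opens with an imperative verb" rule are all assumptions made up for the example.

```python
import re
from typing import Callable

# Hypothetical carousel output: a hook plus content slides.
carousel = {
    "hook": "I cut my editing time by 40% with one skill",
    "slides": [
        "Write down the three failures you see most often.",
        "Turn each failure into a yes/no question.",
        "Quality matters more than speed.",
    ],
}

# Assumed proxy for "concrete action": the slide opens with a verb from this
# small, made-up list. A real check would need a better definition.
ACTION_VERBS = {"write", "turn", "run", "add", "remove", "score", "list"}

def hook_has_number(output: dict) -> bool:
    return re.search(r"\d", output["hook"]) is not None

def opens_with_action(slide: str) -> bool:
    words = slide.split()
    return bool(words) and words[0].lower() in ACTION_VERBS

def every_slide_has_action(output: dict) -> bool:
    return all(opens_with_action(slide) for slide in output["slides"])

BINARY_EVALS: dict[str, Callable[[dict], bool]] = {
    "hook includes a specific number or data point": hook_has_number,
    "every content slide has a concrete action": every_slide_has_action,
}

results = {name: check(carousel) for name, check in BINARY_EVALS.items()}
pass_rate = sum(results.values()) / len(results)

for name, passed in results.items():
    print("PASS" if passed else "FAIL", "-", name)
print("pass rate:", pass_rate)
```

Each check returns only True or False, so two scorers looking at the same output have nothing to disagree about. The third slide fails the action check here, which is exactly the kind of specific, fixable miss the loop can work on.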

So I built a meta-skill with three phases:

  1. Setup (you’re involved): Analyze the skill, generate test cases, build the rubric, run a baseline, convert weaknesses to binary evals. You approve everything before the loop starts.

  2. Autonomous loop (no human): Mutate one thing, run all test cases, score with binary evals, keep or discard, repeat. The agent doesn’t stop. It doesn’t ask permission. It just runs experiments until it hits the stopping criteria you set.

  3. Debrief (you review): Re-score with the original 1-5 rubric so you can compare before and after in the same language. Full report on what changed, what worked, what didn’t.

The human approves the plan. The machine runs the experiments.
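As a rough picture of phase 2, here's a Python sketch of the autonomous loop with a stopping criterion and an experiment log. This is not the actual skill. `fake_run_skill`, the fixed list of mutations, and the two evals are hypothetical stand-ins so the example runs without an agent. In the real loop the agent proposes the mutations and runs the skill.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    number: int
    mutation: str
    pass_rate: float
    kept: bool

def autonomous_loop(
    skill_text: str,
    mutations: list[tuple[str, Callable[[str], str]]],
    run_skill: Callable[[str, str], str],
    evals: list[Callable[[str], bool]],
    test_cases: list[str],
    target_pass_rate: float = 1.0,
) -> tuple[str, list[Experiment]]:
    def pass_rate(text: str) -> float:
        # Run every test case and score each output with every binary eval.
        checks = [ev(run_skill(text, case)) for case in test_cases for ev in evals]
        return sum(checks) / len(checks)

    best_text, best_rate = skill_text, pass_rate(skill_text)  # baseline
    log: list[Experiment] = []
    for n, (label, mutate) in enumerate(mutations, start=1):
        if best_rate >= target_pass_rate:    # stopping criterion from setup
            break
        trial = mutate(best_text)            # mutate one thing
        rate = pass_rate(trial)
        kept = rate > best_rate              # keep only strict improvements
        if kept:
            best_text, best_rate = trial, rate
        log.append(Experiment(n, label, rate, kept))
    return best_text, log

# Hypothetical stand-in for running a skill: the output reacts to instructions.
def fake_run_skill(skill_text: str, case: str) -> str:
    out = f"Post about {case}"
    if "include a number" in skill_text:
        out = "3 lessons: " + out
    if "end with a question" in skill_text:
        out += ". What would you try?"
    return out

evals = [lambda o: any(ch.isdigit() for ch in o), lambda o: o.endswith("?")]
mutations = [
    ("ask for a number", lambda t: t + "\nAlways include a number."),
    ("ask for emojis", lambda t: t + "\nUse emojis."),
    ("ask for a closing question", lambda t: t + "\nAlways end with a question."),
]

best, log = autonomous_loop("Write a LinkedIn post.", mutations, fake_run_skill,
                            evals, test_cases=["pricing", "onboarding"])
for e in log:
    print(e.number, e.mutation, e.pass_rate, "kept" if e.kept else "discarded")
```

The log is what feeds the debrief: every experiment is recorded, including the ones that got thrown away.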

And honestly? The results surprised me.

What the Karpathy Loop Unlocks for Your Skills

By the end of this post, you’ll have:

  1. The complete autoresearch skill: The three-phase system I built that you can run on any skill today. Setup, autonomous loop, debrief. Point it at a skill, let it run, get a before/after scorecard.

  2. Real results from my own skills: Before/after scorecards from running autoresearch on [2-3 specific skills], including the specific blind spots the loop found that I’d missed after months of manual testing.

  3. How to design eval rubrics that actually work: The hardest part of this whole system isn’t the loop. It’s defining what “better” means. I’ll show you how to build scoring criteria that catch real quality problems instead of just validating what you already believe.

  4. The Karpathy Loop for app features and workflows: How to adapt this same loop for code performance, loading times, content templates, or any workflow with a measurable output.


🚨 One thing to know before we dive in: This entire system runs on Claude Code. I’d suggest using Opus 4.6 to run this because it requires long‑running tasks and sub‑agents. If you’re not using Claude Code yet, start with my beginner’s guide, then level up with the ultimate Claude Code guide. If you already are, you can run autoresearch on your own skills today.


Let’s build it.

How the Autoresearch Skill Works

[Figure: how the autoresearch skill works]

Before you run this on anything, ask one question: is this skill ready for optimization, or does it need a rewrite?

The Karpathy Loop has a sweet spot. If your skill works 60-80% of the time and fails in specific, repeatable ways, autoresearch will find those failure patterns. If your skill works but the output is generic or bland, the loop can target specificity. But if your skill fails completely or produces the wrong type of output, optimization won’t help. Rewrite it first, then optimize. And if your skill already works 90%+ of the time, you’re probably hitting diminishing returns. The remaining 10% is usually taste or edge cases that binary evals can’t capture.
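If it helps, that triage can be written down as a tiny Python function. The thresholds come straight from the paragraph above. The `triage` name, the `wrong_output_type` flag, and the "judgment call" label for the ranges I didn't mention (below 60%, and between 80% and 90%) are assumptions for this sketch.

```python
def triage(pass_rate: float, wrong_output_type: bool = False) -> str:
    """Rough readiness check before pointing the loop at a skill."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if wrong_output_type or pass_rate == 0.0:
        return "rewrite first"          # fails completely or wrong kind of output
    if pass_rate >= 0.9:
        return "diminishing returns"    # what's left is mostly taste or edge cases
    if 0.6 <= pass_rate <= 0.8:
        return "sweet spot"             # specific, repeatable failures to find
    return "judgment call"              # assumed label: the text gives no rule here

for rate in (0.0, 0.7, 0.85, 0.95):
    print(rate, "->", triage(rate))
```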

The biggest prerequisite: you need to know what “good” looks like. You can’t build evals without a clear picture of quality. If you can’t describe what a good output looks like for your skill, that’s the work to do first. Not configuring the loop.

Once you’ve cleared that bar, the skill runs in three phases. The human is involved in the first and last. The middle runs completely on its own.

I’ll share how I improved my LinkedIn Carousel Generator skill as well as my Infographic Generator skill using Nano Banana, which comes as a bonus in today’s post 🎁.

Phase 1: Setup (Two Touchpoints)

This is where you and the agent align on what “better” means. The whole phase is designed around just two human decisions. Everything else runs automatically.

All you need to do is trigger the autoresearch skill and name the skill you want to improve. I'll show an example in the case study later.

The agent works first. It reads your entire skill directory: SKILL.md, reference files, examples, configs, everything. It documents what the skill does, what goes in, what comes out, and which files can be edited vs. which ones stay fixed. Then it scans your project for real test inputs and proposes what it found.
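To give a feel for the inventory step, here's a Python sketch that walks a skill directory and sorts files into editable and fixed. The agent does this by reading and reasoning, not with a script like this one. The rule used here (only `SKILL.md` is editable) and the `inventory` function are assumptions for illustration, and the directory is a throwaway temp folder.

```python
import tempfile
from pathlib import Path

def inventory(
    skill_dir: Path,
    editable_names: frozenset[str] = frozenset({"SKILL.md"}),  # assumed rule
) -> dict[str, list[str]]:
    """List every file under skill_dir, split into editable vs. fixed."""
    report: dict[str, list[str]] = {"editable": [], "fixed": []}
    for path in sorted(p for p in skill_dir.rglob("*") if p.is_file()):
        bucket = "editable" if path.name in editable_names else "fixed"
        report[bucket].append(path.relative_to(skill_dir).as_posix())
    return report

# Build a tiny, hypothetical skill directory to run the inventory against.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SKILL.md").write_text("# Carousel skill\n")
    (root / "examples").mkdir()
    (root / "examples" / "sample.md").write_text("example output\n")
    report = inventory(root)

print(report)
```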

This post is for paid subscribers
