Ahmed Arat - 23 Feb, 2026
# Stop Shoving Shit Into Your AGENTS.md
The tech industry currently has a fetish for "agent context." Prevailing wisdom dictates that if you want your AI coding agent (like Claude Code or Gemini CLI) to actually fix your repo without breaking it (or constantly trying to run a dev server when you already have one running), you need to write a massive AGENTS.md or .claude/skills/SKILL.md file. You pack it with your repo's bespoke rules, architectural guidelines, testing philosophies, and the frontend "skill" you stole from someone's GitHub repo that, if we're being honest, sounds like the agent is reciting affirmations. This particular one got a chuckle out of me:

> Remember: Claude is capable of extraordinary creative work. Don't hold back, show what can truly be created when thinking outside the box and committing fully to a distinctive vision.

Anyways, a couple of papers just dropped that actually bothered to test this empirically, and for once, I feel rather vindicated. Your bloated context files are making your coding agents perform worse, and they are charging you 20% more in API costs for the privilege of failing.

## The Mistake We're All Making

In a paper that dropped on the 12th of February (Gloaguen et al.), researchers tested how well agents perform on real-world GitHub issues (SWE-bench tasks) when provided with either LLM-generated context files or human-written ones pulled straight from actual developer repositories. In both cases, the success rate dropped compared to giving the agent no context file at all.

Why? LLMs are fundamentally obedient idiots. If you write an AGENTS.md file that says something like:

```markdown
# General Guidelines
- Always ensure comprehensive test coverage.
- Check all related files for side effects before committing.
- Adhere strictly to SOLID principles.
```

the agent reads this and takes it literally. It starts traversing every file in the directory. It writes exhaustive, unnecessary unit tests for a one-line bug fix.
It gets entirely distracted by the philosophical weight of your instructions, runs out of its execution loop, and ultimately fails the actual task you asked it to do. It broadens the search space so much that the agent essentially gets lost in the sauce, burning through tokens and jacking up your inference costs by over 20%.

Here's where I like to talk about Attention is All You Need (Vaswani et al., 2017). Again.

You see, the core of that paper was self-attention: the ability for a model to look at every token in the input sequence and decide which ones are most relevant to the current token it's trying to generate. But in a Transformer model, attention is a finite resource. When a model generates a token, it assigns a "weight" to each of the preceding tokens in its context window, and these weights have to sum to 1. If your context is 50 tokens of pure, concentrated instruction, the attention weight on the relevant bit of code is massive. But if you have 10,000 tokens of architectural manifestos, the attention is spread thin across a vast sea of noise. Even though the paper says "Attention is All You Need," if you give the model too much to attend to, the signal-to-noise ratio becomes total dogshit. The model essentially "forgets" the actual bug because it's too busy "attending" to your ramblings about SOLID principles.

Sure, at this point we have kinda made up for this with models that can handle large context windows without speaking in tongues, but that doesn't mean the issue is solved. There's a reason why even Anthropic's latest SOTA model, Opus 4.6, comes in two variants: a 200k-context variant and a 1-million-context variant. Guess which one performs worse on the SWE-bench benchmark? Yeahp. It's the one with the larger context window. What's especially funny is that the 1M variant performs worse even when it's not utilising its extra context. It's the same shit all over again. Just because you can, does not mean you should.
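You can see the dilution effect with nothing more than a softmax over toy relevance scores. This is a deliberately simplified sketch, not real Transformer attention (no query/key dot products, no multiple heads); the scores 4.0 and 1.0 are made-up numbers standing in for "relevant code" and "manifesto filler":

```python
import math

def softmax(scores):
    """Turn raw relevance scores into attention-style weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_on_signal(n_noise_tokens):
    """Weight landing on one highly relevant token (score 4.0)
    surrounded by n_noise_tokens of low-relevance filler (score 1.0)."""
    scores = [4.0] + [1.0] * n_noise_tokens
    return softmax(scores)[0]

lean = weight_on_signal(50)        # short, concentrated context
bloated = weight_on_signal(10_000)  # manifesto-sized context

print(f"weight on the relevant token with 50 noise tokens:     {lean:.3f}")
print(f"weight on the relevant token with 10,000 noise tokens: {bloated:.5f}")
```

The relevant token keeps roughly 29% of the attention budget in the lean context and about 0.2% in the bloated one, even though its own score never changed. That's the whole argument in two numbers.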
Anyways, now that I'm done glazing Vaswani et al., let's get back to those SKILL.md files.

## The SkillsBench Corroboration

Literally a day later, Li et al. published a massive 34-page beast of a benchmark called SkillsBench, which looked at "Agent Skills" (structured procedural packages like SKILL.md files) across 11 different domains, from Software Engineering to Healthcare and Finance. Their data perfectly corroborates why our current approach to context is totally backwards.

When they looked at the effect of providing skills to an agent, they found a fascinating non-monotonic relationship:

- 1 skill provided: +17.8 percentage points (pp) improvement
- 2–3 skills provided: +18.6pp improvement (the sweet spot)
- 4+ skills provided: +5.9pp improvement

When they tested the complexity of the skills, "comprehensive" documentation actually hurt performance by 2.9 percentage points compared to baseline. Exhaustive documentation creates cognitive overhead.

Worse yet, when they asked the agents to self-generate their own procedural skills before solving a task, the agents performed worse (-1.3pp on average) than if they just raw-dogged the problem. The agents know they need a tool (e.g., "I should use pandas"), but they generate vague, useless instructions that end up actively confusing them later in the pipeline. If the model does not intrinsically know from its pretraining data that it should use pandas to solve a data analysis task, it's sure as shit not going to generate a SKILL.md file that includes those instructions.

## Okay, What Now?

We are confusing explanation with procedural focus. When I built that doomed happiness model, I assumed that throwing every valid socio-economic variable into the mix would naturally yield a better prediction. It didn't. The noise swallowed the signal. We are doing the exact same thing with coding agents.
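As a quick aside, the SkillsBench deltas above make the non-monotonic point on their own. The figures below are copied from the numbers quoted in this post; the snippet itself is just an illustrative sanity check, not anything from the paper:

```python
# Percentage-point deltas vs. the no-skills baseline, as quoted above.
improvement_pp = {
    "1 skill": 17.8,
    "2-3 skills": 18.6,
    "4+ skills": 5.9,
    "self-generated skills": -1.3,
}

# The peak sits in the middle of the range, not at the end:
best = max(improvement_pp, key=improvement_pp.get)

# And the deltas are not monotonically increasing with skill count.
deltas = list(improvement_pp.values())
monotonic = all(a <= b for a, b in zip(deltas, deltas[1:]))

print(best, monotonic)
```

More skills past the sweet spot doesn't just plateau; it gives most of the gains back.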
We think that if we dump our entire engineering handbook into an AGENTS.md file, the AI will magically absorb our team's ten years of hard-learnt architectural wisdom. In reality, it just introduces conflicting guidance and context bloat.

Here's what you gotta do:

- Keep it minimal. If you are writing an AGENTS.md, only describe the absolute bare-minimum requirements for the repo to build or run. Stop adding aspirational bullshit about code quality.
- Focus on procedures rather than philosophy. The SkillsBench paper showed that the only time context files actually result in massive gains (sometimes +85 points) is when they provide exact, step-by-step procedural workflows or API patterns that aren't common in the LLM's pretraining data.
- Don't let the LLM write its own rules. Auto-generating context files sounds like a neat automation trick, but models cannot reliably author the procedural knowledge they need to consume. They just write vague, bloated garbage that leads them astray.

If you give an AI a targeted, 20-line instruction on how your specific testing framework compiles, it will do great. If you give it a 5-page manifesto on how to be a "10x Developer in this repository" or a "Senior React Developer," it will spend your entire API budget arseing about in the file tree and accomplish absolutely nothing of value.

Less is more. Stop over-engineering your prompts, and for fuck's sake, keep your AGENTS.md short. Here's my own AGENTS.md from one of my repos, for example. Notice how it's concentrated on procedures and not philosophy?

```markdown
# General Guidelines
- Do not run pnpm dev (assume one is already running).
- Do not run pnpm build (CI only).
- To run tests, use `pnpm test` NOT `pnpm run test`. The two use different reporters in this repo.
- Tests should indeed test real functionality. Never inline logic.
- Do not use npm.
- Use Vaul drawers for popups.
- Check for `shadcn` components before writing custom ones.
```

Until next time :)
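If you want to keep yourself honest about this, the "keep it minimal" rule is easy to mechanise. Here's a hypothetical guardrail (not from either paper): a tiny lint you could run in CI to flag an AGENTS.md drifting from procedure into philosophy. The 40-line budget and the list of "aspirational" phrases are my own made-up thresholds; tune them to taste:

```python
# Hypothetical AGENTS.md lint. The line budget and phrase list below are
# arbitrary thresholds of mine, not anything prescribed by the research.
ASPIRATIONAL_PHRASES = ("comprehensive", "best practices", "solid principles", "10x")
MAX_LINES = 40

def lint_agents_md(text: str) -> list[str]:
    """Return a list of complaints about an AGENTS.md body; empty means pass."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"too long: {len(lines)} lines (budget: {MAX_LINES})")
    lowered = text.lower()
    for phrase in ASPIRATIONAL_PHRASES:
        if phrase in lowered:
            problems.append(f"aspirational phrase detected: {phrase!r}")
    return problems

print(lint_agents_md("- Do not run pnpm dev.\n- Use `pnpm test`."))  # passes
print(lint_agents_md("- Adhere strictly to SOLID principles."))      # complains
```

Crude, sure, but a 15-line script that shames you for writing "comprehensive" is doing more for your agent's success rate than the manifesto it would have replaced.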