Stop Shoving Shit Into Your AGENTS.md
  • 23 Feb, 2026
  • 5 min read

The tech industry currently has a fetish for "agent context." Prevailing wisdom dictates that if you want your AI coding agent (like Claude Code or Gemini CLI) to actually fix your repo without breaking it (or constantly trying to run a dev server when you already have one running), you need to write a massive AGENTS.md or .claude/skills/SKILL.md file. You pack it with your repo's bespoke rules, architectural guidelines, testing philosophies, and the frontend "skill" you stole from someone's GitHub repo that, if we're being honest, sounds like the agent is reciting affirmations. This particular one got a chuckle out of me:

> Remember: Claude is capable of extraordinary creative work. Don't hold back, show what can truly be created when thinking outside the box and committing fully to a distinctive vision.

Anyways, a couple of papers just dropped that actually bothered to test this empirically, and for once, I feel rather vindicated. Your bloated context files are making your coding agents perform worse, and they are charging you 20% more in API costs for the privilege of failing.

The Mistake We're All Making

In a paper that dropped on the 12th of February (Gloaguen et al.), researchers tested how well agents perform on real-world GitHub issues (SWE-bench tasks) when provided with either LLM-generated context files or human-written ones pulled straight from actual developer repositories. The success rate dropped compared to giving the agent no context file at all.

Why? LLMs are fundamentally obedient idiots. If you write an AGENTS.md file that says something like:

```markdown
# General Guidelines
- Always ensure comprehensive test coverage.
- Check all related files for side effects before committing.
- Adhere strictly to SOLID principles.
```

The agent reads this and takes it literally. It starts traversing every file in the directory. It writes exhaustive, unnecessary unit tests for a one-line bug fix. It gets entirely distracted by the philosophical weight of your instructions, exhausts its execution loop, and ultimately fails the actual task you asked it to do. It broadens the search space so much that the agent essentially gets lost in the sauce, burning through tokens and jacking up your inference costs by over 20%.

Here's where I like to talk about Attention is All You Need (Vaswani et al., 2017). Again. You see, the core of that paper was self-attention: the ability of a model to look at every token in the input sequence and decide which ones are most relevant to the current token it's trying to generate. But in a Transformer, attention is a finite resource. When a model generates a token, it assigns a "weight" to each of the preceding tokens in its context window, and those weights have to sum to 1. If your context is 50 tokens of pure, concentrated instruction, the attention weight on the relevant bit of code is massive. If it's 10,000 tokens of architectural manifestos, the attention is spread thin across a vast sea of noise. Even though the paper says "Attention is All You Need," if you give the model too much to attend to, the signal-to-noise ratio becomes total dogshit. The model essentially "forgets" the actual bug because it's too busy "attending" to your ramblings about SOLID principles.
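To make that dilution concrete, here's a toy sketch of my own (not from either paper; the relevance scores and token counts are invented purely for illustration). It computes how much of a softmax attention distribution lands on the single relevant token as the noise grows:

```r
softmax <- function(x) exp(x) / sum(exp(x))

# Share of attention landing on the one relevant token, given
# n_noise distractor tokens that each score lower on relevance
attention_share <- function(n_noise, signal_score = 5, noise_score = 1) {
  scores <- c(signal_score, rep(noise_score, n_noise))
  softmax(scores)[1]
}

attention_share(50)     # ~0.52: lean context, the signal dominates
attention_share(10000)  # ~0.005: manifesto-sized context, the signal drowns
```

The relevant token's score never changed; its share of the budget collapsed purely because the weights must sum to 1.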
Sure, at this point we have kinda made up for this with models that can handle large context windows without speaking in tongues, but that doesn't mean the issue is solved. There's a reason why even Anthropic's latest SOTA model, Opus 4.6, comes in two variants: a 200k-context variant and a 1-million-context variant. Guess which one performs worse on the SWE-bench benchmark? Yeahp. It's the one with the larger context window. What's especially funny is that the 1M variant also performs worse even when it isn't using its extra context. It's the same shit all over again. Just because you can, does not mean you should.

Anyways, now that I'm done glazing Vaswani et al., let's get back to those SKILL.md files.

The SkillsBench Corroboration

Literally a day later, Li et al. published a massive 34-page beast of a benchmark called SkillsBench, which looked at "Agent Skills" (structured procedural packages like SKILL.md files) across 11 different domains, from Software Engineering to Healthcare and Finance. Their data perfectly corroborates why our current approach to context is totally backwards. When they looked at the effect of providing skills to an agent, they found a fascinating non-monotonic relationship:

- 1 skill provided: +17.8 percentage points (pp) improvement
- 2–3 skills provided: +18.6pp improvement (the sweet spot)
- 4+ skills provided: +5.9pp improvement

When they tested the complexity of the skills, "comprehensive" documentation actually hurt performance by 2.9 percentage points compared to baseline. Exhaustive documentation creates cognitive overhead. Worse yet, when they asked the agents to self-generate their own procedural skills before solving a task, the agents performed worse (-1.3pp on average) than if they just raw-dogged the problem. The agents know they need a tool (e.g., "I should use pandas"), but they generate vague, useless instructions that end up actively confusing them later in the pipeline. If the model does not intrinsically know from its pretraining data that it should use pandas to solve a data analysis task, it's sure as shit not going to generate a SKILL.md file that includes those instructions.

Okay, What Now?

We are confusing explanation with procedure. When I built that doomed happiness model, I assumed that throwing every valid socio-economic variable into the mix would naturally yield a better prediction. It didn't. The noise swallowed the signal. We are doing the exact same thing with coding agents. We think that if we dump our entire engineering handbook into an AGENTS.md file, the AI will magically absorb our team's ten years of hard-learnt architectural wisdom. In reality, it just introduces conflicting guidance and context bloat. Here's what you gotta do:

- Keep it minimal. If you are writing an AGENTS.md, only describe the absolute bare-minimum requirements for the repo to build or run. Stop adding aspirational bullshit about code quality.
- Focus on procedures rather than philosophy. The SkillsBench paper showed that the only time context files actually produce massive gains (sometimes +85 points) is when they provide exact, step-by-step procedural workflows or API patterns that aren't common in the LLM's pretraining data.
- Don't let the LLM write its own rules. Auto-generating context files sounds like a neat automation trick, but models cannot reliably author the procedural knowledge they need to consume. They just write vague, bloated garbage that leads them astray.

If you give an AI a targeted, 20-line instruction on how your specific testing framework compiles, it will do great.
If you give it a 5-page manifesto on how to be a "10x Developer in this repository" or a "Senior React Developer," it will spend your entire API budget arseing about in the file tree and accomplish absolutely nothing of value.

Less is more. Stop over-engineering your prompts, and for fuck's sake, keep your AGENTS.md short. Here's my own AGENTS.md from one of my repos as an example. Notice how it's concentrated on procedures and not philosophy?

```markdown
# General Guidelines
- Do not run pnpm dev (assume one is already running).
- Do not run pnpm build (CI only).
- To run tests, use `pnpm test` NOT `pnpm run test`. The two use different reporters in this repo.
- Tests should indeed test real functionality. Never inline logic.
- Do not use npm.
- Use Vaul drawers for popups.
- Check for `shadcn` components before writing custom ones.
```

Until next time :)

I Built a Complex Model to Predict Happiness. A Dumb Guess Beat It.

Here's the result up front, because you deserve to know what you're getting into. I trained a multiple linear regression model on a decade of World Happiness Report data merged with World Bank indicators. Six predictors: GDP per capita, life expectancy, social support, freedom, trust, and generosity. Proper temporal train-test split: trained on 2010–2021, tested on 2022 onwards. The works. Then I tested a model that just says "next year's happiness score equals this year's." No features, no coefficients, no thinking at all.

| Model | RMSE |
| --- | --- |
| Naive Baseline ("same as last year") | 0.1872 |
| Full Structural Model (6 predictors) | 0.5306 |
| Pure Economic Model (GDP + Health) | 0.6336 |

The dumb guess won by a factor of three. This isn't a story about a shit model. The model is actually brilliant... at the wrong task. It is in fact a story about the difference between explanation and prediction, about why those are fundamentally different things, and about why conflating them led me to nearly publish something so embarrassingly wrong that my blogging career would have been over before it began.

The Mistake That Started Everything

A few months ago, I wrote a glowing blog post about predicting national happiness. R² of 0.78! RMSE of 0.44 on test data! I was rather chuffed with myself. I thought I'd cracked human well-being with a linear model and some World Bank data. As one does. That post never got published, thank fuck. The problem wasn't even the statistics, you see. The problem was that I never asked the most basic question in all of forecasting: can a braindead baseline beat me? I just looked at the RMSE, thought "ooh, 0.44, that's small," and declared victory. I never checked what "small" meant in context.

There was also a deeper methodological rot, which I've written about separately in [companion post link]. Short version: my original predictors were derived from the happiness scores themselves, creating a tautology dressed up as science. I rebuilt the entire dataset from scratch using raw World Bank indicators and independent WHR survey responses. That story's worth reading on its own, but for now, just know that the dataset underpinning everything below is clean: 1,358 country-year observations, 2010–2024, no circular reasoning.

The Warning Sign I Should Have Heeded

Before building anything, I checked whether happiness scores exhibit temporal inertia. That is, whether this year's score predicts next year's.

```r
df_lagged <- df %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(happiness_prev = lag(happiness_score)) %>%
  drop_na(happiness_prev)

inertia_cor <- cor(df_lagged$happiness_score, df_lagged$happiness_prev)
```

r = 0.9848. Last year's happiness alone explains about 97% of the variance in this year's (r² ≈ 0.97), and plotting the two against each other puts every point tightly along the line y = x. That should have been the moment I put down the keyboard and went to bed. If inertia explains 97% of the variance, my model would need to find a meaningful signal in the remaining 3%. I don't know how else to describe that plan other than stupid at best.

National happiness is structurally stable. Denmark doesn't suddenly become poor. Afghanistan doesn't suddenly develop Nordic-level institutions. The factors that determine happiness change glacially. A 2% GDP growth rate doesn't shift a country's happiness score from 7.6 to 7.8. Inertia is far more likely to dictate next year's score than a few percentage points of growth in wealth, health, freedom, or trust. But I pressed on, because I'm stubborn and hadn't yet learned my lesson.

I also checked multicollinearity across the predictors. GDP and life expectancy correlate at r ≈ 0.7–0.8, which makes sense; after all, rich countries are healthy countries. VIF for log GDP hit 5.0, right at the conventional threshold for concern. This doesn't invalidate the model for inference (the combined effect is stable), but it means you can't confidently rank individual predictors' importance. Keep that in mind.
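If you want to reproduce that check, this is roughly what it looks like; a minimal sketch using `vif()` from the `car` package (already part of this project's toolchain), refitting the structural model you'll meet in the next section:

```r
library(car)  # provides vif() for lm objects

collin_check <- lm(happiness_score ~ log(raw_gdp) + raw_health + social_support +
                     freedom + trust + generosity, data = df)
vif(collin_check)
# log(raw_gdp) sits around 5.0, right at the conventional threshold
```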
Where the Model Actually Shines

Here's the thing that makes this story interesting rather than just depressing: the model is actually pretty good at explaining why some countries are happier than others.

```r
model <- lm(happiness_score ~ log(raw_gdp) + raw_health + social_support +
              freedom + trust + generosity, data = df)
```

R² = 0.765. F-statistic of 733 on 6 and 1,349 degrees of freedom. Every predictor significant. This model explains three-quarters of the cross-country, cross-year variance in happiness.

Freedom leads at 1.35 points per unit increase. Trust and social support follow. GDP does matter but isn't the be-all and end-all. The social and political environment, i.e. freedom to make life choices, trust in institutions, having people you can rely on; those matter more than raw wealth. Health has the smallest coefficient (0.02), but it's measured in years of life expectancy, so a decade of extra lifespan adds 0.2 points. Not actually trivial.

So far, this aligns with decades of well-being research. Money buys happiness logarithmically: going from poverty to security is transformative, whilst going from comfortable to luxurious barely registers. But the institutional and social fabric of a country matters even more. The model tells you why Finland is an 8 and Afghanistan is a 3. It answers that question beautifully. It just can't tell you what either of them will score next year.

The Humbling

I set up a three-way fight, with proper temporal validation: train on everything before 2022, test on 2022 onwards. The Pure Model uses only GDP and health; the economist's bet, if you will, that money and medicine are all that matter. The Hybrid Model uses all six predictors: the full structural story. The Naive Baseline predicts that next year's score equals this year's score. The dumb guess.

```r
train_time <- df %>% filter(year < 2022)
test_time  <- df %>% filter(year >= 2022)

model_pure   <- lm(happiness_future ~ log_gdp + raw_health, data = train_time)
model_hybrid <- lm(happiness_future ~ log_gdp + raw_health + social_support +
                     freedom + trust + generosity, data = train_time)

pred_pure   <- predict(model_pure, newdata = test_time)
pred_hybrid <- predict(model_hybrid, newdata = test_time)
pred_naive  <- test_time$happiness_score

# rmse() comes from the Metrics package
rmse_pure   <- rmse(test_time$happiness_future, pred_pure)
rmse_hybrid <- rmse(test_time$happiness_future, pred_hybrid)
rmse_naive  <- rmse(test_time$happiness_future, pred_naive)
```

You've already seen the results. The naive baseline's RMSE of 0.19 annihilates the hybrid model's 0.53 and the pure model's 0.63. The model with no features, no training, no thinking whatsoever is 2.8× more accurate than my carefully constructed structural model.

Why? Because year-to-year changes in happiness are tiny compared to the cross-sectional differences the model was built to explain. The model learns "Denmark is happy because it's rich, free, and trusting." That's true. But it doesn't help you forecast whether Denmark moves from 7.58 to 7.61 or 7.55 next year. That movement is smaller than the model's error margin. The naive baseline succeeds precisely because it assumes nothing changes, and almost nothing does.

I also tried spatial generalisation: hold out 20 random countries, train on the rest, predict the unseen ones. RMSE: 0.59. Better than temporal forecasting but still mediocre, and for a different reason: individual countries have unique cultural, historical, and institutional contexts that no six-variable regression can fully capture.
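The holdout split looked roughly like this (a sketch: the seed is arbitrary, and the exact sampling in the published pipeline may differ):

```r
set.seed(42)  # arbitrary seed, for illustration only
holdout_countries <- sample(unique(df$country), 20)

train_space <- df %>% filter(!country %in% holdout_countries)
test_space  <- df %>% filter(country %in% holdout_countries)

model_space <- lm(happiness_future ~ log_gdp + raw_health + social_support +
                    freedom + trust + generosity, data = train_space)
rmse(test_space$happiness_future, predict(model_space, newdata = test_space))
# ~0.59 on the 20 unseen countries
```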
The model handles countries near the global mean reasonably well but struggles with the outliers: the exceptionally happy Nordics, the war-torn states. These are the countries defined most strongly by factors no regression coefficient can reach.

Lessons Learnt (Write These Down, FFS)

The distinction between explanation and prediction isn't semantic. They require different validation, different interpretation, and different humility.

Inference asks: "What factors explain the current state of the world?" You test it with goodness-of-fit measures: R², F-tests, p-values. My model scores brilliantly here. It tells you that freedom, trust, and social support drive happiness more than raw GDP. That's actually useful for policymakers.

Prediction asks: "What will the world look like next year?" You test it with out-of-sample error against a naive baseline. My model gets humiliated here.

Both results are valuable. But they answer completely different questions, and conflating them the way I did in my original post inevitably leads to overconfident claims that collapse under scrutiny.

There's a deeper philosophical point here, and I ask you to forgive the stench of pretentiousness it bears: just because you can't predict something doesn't mean you don't understand it. We understand why Finland is happy. We understand what structural factors drive national well-being. We just can't forecast the year-to-year fluctuations, because those fluctuations are smaller than the measurement noise and dominated by inertia. It's not really a failure of understanding so much as a property of the phenomenon.

If you're a policymaker, this means: focus on structural reforms. Invest in institutions, healthcare, social safety nets, democratic freedoms. These are the levers that determine long-run well-being. Don't chase quarterly happiness indicators; they're mostly inertia plus short-term noise.

If you're a data scientist, this means: always, always test against a naive baseline. And never confuse a high R² with predictive power. They're not the same thing, and learning that the hard way is how you end up writing a 3,000-word mea culpa instead of publishing something wrong.

Until next time :)

This analysis uses a custom dataset built from raw World Bank indicators and independent WHR survey responses, specifically to avoid the circular reasoning problem inherent in the WHR's published contribution scores. The dataset construction process is detailed in Part 2: The Frankenstein Dataset. Full R pipeline available on GitHub. Built with tidyverse, wbstats, countrycode, car, and Metrics.