Your LLM treats the style guide as a suggestion. Make CI the contract.

17 June 2026

Most of StayHawk's interface is written with a language model in the loop. That is great for speed and quietly terrible for consistency, unless you build for it.

The problem isn't that LLMs can't write Tailwind. They write it fluently. The problem is that they write it without knowing the decisions behind it. Ask for a badge and you get text-amber-600, bg-slate-100, text-[10px], and a hand-rolled shadow-[0_8px_24px_rgba(0,0,0,0.08)]. Every value is reasonable. None of them is necessarily ours. Do that a few hundred times across a component tree and the interface slides into a thousand slightly different greys.

We hit exactly this. So we changed the rule: an off-brand colour or font size now fails CI, the same way a type error does. This post walks through what was drifting, why we couldn't catch it by reading diffs, and the small custom ESLint rule that catches it now.

The drift you can't see in a diff

Here is the audit that kicked it off. We scanned frontend/src/components/atoms — 48 components, the lowest layer of the UI — and found raw Tailwind palette colours (slate-500, amber-100, red-600, yellow-800) scattered across most of them, plus a couple of dozen arbitrary one-offs like text-[10px] and inline shadows.

None of that fails a build. None of it looks wrong in isolation. A reviewer reading a 12-file PR will not notice that one badge reaches for amber-600 while another uses our coral, because the two values never sit next to each other in the diff. The drift only shows up in the running app, weeks later, when a designer squints at a card and says "why are there three different greys here."

There was a sharper version of this bug hiding underneath. Our brand greys are deliberately warm — a brown cast at roughly 30° hue, so they sit naturally on the cream background instead of fighting it. We define them in Tailwind v4's @theme block:

@theme {
  --color-slate-900: #292524;
  --color-slate-600: #57534e;
  --color-slate-400: #706a65;
  /* ...and that was it. 900, 600, 400. */
}

Two things combine here, and neither is obvious. @import "tailwindcss" ships Tailwind's complete default palette, and @theme only overrides the specific keys you name — it doesn't replace the set. So when we redefined three steps, every grey we didn't override — slate-200, slate-500, slate-700 — silently fell through to Tailwind's stock slate, which is cool and blue. Warm and cool greys were mixing inside the same component, and nobody had written a single "wrong" line to make it happen. You used a normal Tailwind class; it just resolved to a colour from a different family than the one beside it.

The demo below is that bug as it reaches you in review. Two of the four lines silently resolve to a cool grey, and the class names give no hint which:

+<span class="text-slate-900">Hotel Le Germain</span>
+<p class="text-slate-500">Montreal · Apr 18–21</p>
+<hr class="border-slate-200" />
+<span class="text-slate-600">Cancel free</span>

Four ordinary lines, all using slate-* classes, all green in the diff, so there is nothing to flag. Resolve what each class actually renders and two of them turn out cool: slate-500 and slate-200, the steps we never redefined, silently falling through to Tailwind's stock blue-tinted slate.

It is subtle. That is the whole point. Subtle is what survives code review.
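To resolve the four demo classes by hand, a minimal sketch along these lines would do. It assumes sRGB hex approximations of Tailwind's stock slate (v4 ships the defaults in oklch), a crude hue threshold for "warm", and helper names (`hueOf`, `temperature`) that are purely illustrative.

```typescript
// Sketch only: stock values are hex approximations of Tailwind's default
// slate, and every identifier here is hypothetical.
const stockSlate: Record<string, string> = {
  "slate-200": "#e2e8f0",
  "slate-400": "#94a3b8",
  "slate-500": "#64748b",
  "slate-600": "#475569",
  "slate-900": "#0f172a",
};

// The three overrides from the @theme block above.
const brandOverrides: Record<string, string> = {
  "slate-900": "#292524",
  "slate-600": "#57534e",
  "slate-400": "#706a65",
};

// @theme overrides key by key, so the effective palette is a merge,
// not a replacement.
const resolved: Record<string, string> = { ...stockSlate, ...brandOverrides };

// Hue in degrees (0 to 360) from a #rrggbb string, using the standard
// RGB-to-HSL hue formula.
function hueOf(hex: string): number {
  const r = parseInt(hex.slice(1, 3), 16);
  const g = parseInt(hex.slice(3, 5), 16);
  const b = parseInt(hex.slice(5, 7), 16);
  const max = Math.max(r, g, b);
  const delta = max - Math.min(r, g, b);
  if (delta === 0) return 0; // a pure grey has no hue
  let h: number;
  if (max === r) h = ((g - b) / delta) % 6;
  else if (max === g) h = (b - r) / delta + 2;
  else h = (r - g) / delta + 4;
  return (h * 60 + 360) % 360;
}

// Crude split (an assumption): reds through yellows count as warm.
function temperature(token: string): "warm" | "cool" {
  const hue = hueOf(resolved[token]);
  return hue < 90 || hue >= 330 ? "warm" : "cool";
}

// The four classes from the demo diff, in order.
const demoTokens = ["slate-900", "slate-500", "slate-200", "slate-600"];
const demoReport = demoTokens.map((token) => `${token}: ${temperature(token)}`);
```

Under those assumptions the overridden steps land around a 10° to 35° hue and the untouched ones around 215°, which is the warm/cool split the class names hide.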

Docs are a suggestion, CI is a contract

The brand decisions already existed — they were sitting right there in @theme. Nothing required anyone, human or model, to use them. We had a design system in the old sense: a set of decisions we hoped people would follow.

Hoping is not a strategy when most of your code is written by a model that has never read your design docs and weighs your CLAUDE.md against everything else in its context window. A style guide is one more input the model can quietly outvote. A failing CI check is not.

So we drew a hard line. Docs are a suggestion. CI is a contract. If a PR is green, it is safe to merge. If something is off-brand and no rule caught it, that is a gap in our rules — not a failure of whoever wrote the code, model or human. That reframing matters: it turns "please remember the design tokens" (an instruction that decays) into "the off-brand class does not exist as far as the build is concerned" (a property of the system).

[Diagram: @theme (the vocabulary) → ESLint rule (the contract) → CI (the gate). On-brand → merges; off-brand → blocked.]

Docs are a suggestion; CI is a contract. The brand vocabulary lives in @theme, the rule makes it the only expressible vocabulary, and the build is where that rule actually holds.

The honest version: we can't delete the palette yet

There is a clean way to do this in Tailwind v4, and it is one line:

@theme {
  --color-*: initial; /* then define only the brand tokens */
}

That one line wipes the default palette. Setting --color-* to initial clears every colour variable Tailwind defined, so the utilities built on them — bg-slate-500, text-amber-600 — have nothing left to generate and stop existing. The off-brand class is unexpressible, which is the strongest guarantee you can get.

We can't flip it yet. It is global, so it can't go in until every component in the app has stopped relying on a stock palette colour, and we are not there. There is also a gap it doesn't close: --color-*: initial kills off-brand named colours, but arbitrary values like bg-[#888] and text-[13px] are a Tailwind language feature. No theme edit removes them. You need a lint layer regardless.

So the plan is: tighten with lint now, scoped to the layer we've cleaned (atoms first), and keep "delete the palette" as the endgame for when the whole app is clean.

How we made one badge safe, step by step

Theory is easy. Here is the actual path, walked on one atom — the small "needs review" pill we use across booking rows. It starts as an LLM's off-brand draft and ends as a typed, intent-named API, getting safer at every step while the rendered pill converges on-brand. Step through it:

Step 1 · An LLM writes a badge

Ask for a “needs review” pill and you get fluent, reasonable Tailwind. Every value is plausible — none of it is necessarily ours.

<span className="bg-amber-100 text-amber-900 text-[10px] px-2 py-0.5 rounded-full font-semibold">
  Needs review
</span>
Renders: Needs review

amber-100/900 are Tailwind's stock palette, not our warm one. text-[10px] is off our type scale.

One atom, six steps: an LLM's off-brand draft becomes a typed, intent-named API. The code gets safer at every step while the pill converges on-brand. Step through it.

A few of those steps deserve a closer look.

The rule reads more than className. The obvious way to lint Tailwind is to walk JSX and check className attributes — which is what the off-the-shelf plugins do, and it would have missed most of our violations. Our atoms don't keep their classes in className; they keep them in Record<string, string> variant maps and small template-string builders:

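// Classes live in a variant map, so they never appear as a className literal.
const variantStyles: Record<string, string> = {
  default: "bg-white/[0.08] border border-white/[0.1] text-white",
  coral:   "bg-[var(--color-coral)] border border-[var(--color-coral)] text-white",
};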

By the time that coral line reaches a className it has been computed from a variable, so a className-only linter never sees it. Ours does something blunter and more thorough: it scans every string literal and template chunk in the file, then matches class-shaped fragments against the palette. Variant maps, buttonClasses() helpers, a ternary that picks a colour — all of it gets read. The enforcement itself is small: a flat-config ESLint plugin of about 150 lines, scoped to src/components/atoms/** at error, with one rule for colour and one for font size.
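A minimal sketch of that "scan every string" approach might look like the following. It is not the plugin described here: the rule name, the family list, and the message are assumptions, and the local `RuleContext` type stands in for ESLint's own. Only the visitor shape (`Literal`, `TemplateElement`, `context.report`) follows ESLint's rule API.

```typescript
// Hypothetical sketch: names, families and message wording are assumptions.
type RuleReport = { node: unknown; message: string };
type RuleContext = { report(descriptor: RuleReport): void };

// Stock colour families to reject. slate is left out on the assumption that
// the full warm ramp is defined in @theme.
const STOCK_FAMILIES = ["amber", "red", "yellow", "gray", "zinc", "blue", "green"];
const COLOUR_PREFIXES = ["bg", "text", "border", "ring", "fill", "stroke"];

const STOCK_CLASS = new RegExp(
  `^(?:${COLOUR_PREFIXES.join("|")})-(?:${STOCK_FAMILIES.join("|")})-\\d{2,3}(?:/.+)?$`
);

// Split any string into whitespace-separated fragments, naively drop variant
// prefixes such as hover: or md:, and keep the stock-colour-shaped ones.
function findStockColourClasses(text: string): string[] {
  return text
    .split(/\s+/)
    .filter((fragment) => STOCK_CLASS.test(fragment.slice(fragment.lastIndexOf(":") + 1)));
}

const noStockPaletteColours = {
  meta: { type: "problem", schema: [] },
  create(context: RuleContext) {
    const check = (node: unknown, text: string): void => {
      for (const fragment of findStockColourClasses(text)) {
        context.report({
          node,
          message: `"${fragment}" is a stock palette colour; use a brand token from @theme.`,
        });
      }
    };
    return {
      // Every string literal in the file, not only className attributes.
      Literal(node: { value: unknown }) {
        if (typeof node.value === "string") check(node, node.value);
      },
      // Every static chunk of a template string.
      TemplateElement(node: { value: { raw: string } }) {
        check(node, node.value.raw);
      },
    };
  },
};
```

Because the visitors fire on string nodes rather than on JSX attributes, a variant map and a className literal are treated the same way. The cost is the occasional false positive on a string that merely looks like a class.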

The errors teach. When the rule fires (step 3 above) it names the offending fragment, says why default-palette colours drift, and points at the exact token to use instead. That matters more than usual here, because half the time the thing reading the error and fixing the code is the model, inside its edit loop. A good error message is a prompt.
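As a sketch of what "a good error message is a prompt" could mean in code, the builder below assembles the three parts: the fragment, the reason, and the fix. The suggestion map and the `buildMessage` name are hypothetical; only `bg-warning-surface` appears in this post as an example token.

```typescript
// Hypothetical lookup from an off-brand class to the token that replaces it.
// A real rule would derive this from the tokens defined in @theme.
const SUGGESTED_TOKEN: Record<string, string> = {
  "bg-amber-100": "bg-warning-surface",
};

// Build a message that names the fragment, explains the problem, and states
// the fix, so a model reading it mid-edit has everything it needs.
function buildMessage(fragment: string): string {
  const suggestion = SUGGESTED_TOKEN[fragment];
  const fix =
    suggestion !== undefined
      ? `Use ${suggestion} instead.`
      : "Pick a token defined in @theme, or add one there first.";
  return [
    `Off-brand class "${fragment}".`,
    "Default-palette colours bypass the brand tokens and drift between components.",
    fix,
  ].join(" ");
}
```

The fallback branch matters as much as the happy path: when no token exists yet, the message says where to add one instead of leaving the reader to guess.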

We only enforce the axes that actually drift. Borrowed from Panda CSS's two-axis strictness: strict about tokens, relaxed about layout. We enforce colour and font size and leave legitimate one-offs alone — grid-cols-[1fr_320px], a min-w-[44px] tap target, a transition-[grid-template-rows] list. A blanket ban on every arbitrary value would just breed fake tokens to satisfy the linter, which is worse than the disease. Spacing is a later, separate job.
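The two-axis idea can be sketched as a classifier that flags arbitrary values only when they set a colour or a font size. The regular expressions and the `enforcedAxis` name are assumptions for illustration, and the variant-prefix stripping is deliberately naive.

```typescript
type Axis = "colour" | "font-size" | "not-enforced";

// Arbitrary colours: a hex code or an rgb()/hsl() call inside the brackets.
// A var(--token) reference is deliberately not matched.
const ARBITRARY_COLOUR =
  /^(?:bg|text|border|ring|fill|stroke)-\[(?:#[0-9a-fA-F]{3,8}|(?:rgb|rgba|hsl|hsla)\(.*\))\]$/;

// Arbitrary font sizes: text-[<number><unit>].
const ARBITRARY_FONT_SIZE = /^text-\[\d*\.?\d+(?:px|rem|em)\]$/;

// Decide which enforced axis, if any, a class fragment falls on.
function enforcedAxis(fragment: string): Axis {
  // Naive variant stripping: md:text-[13px] becomes text-[13px].
  const utility = fragment.slice(fragment.lastIndexOf(":") + 1);
  if (ARBITRARY_COLOUR.test(utility)) return "colour";
  if (ARBITRARY_FONT_SIZE.test(utility)) return "font-size";
  return "not-enforced";
}

// Layout one-offs from the paragraph above pass untouched.
const layoutExamples = ["grid-cols-[1fr_320px]", "min-w-[44px]", "transition-[grid-template-rows]"];
const layoutAxes = layoutExamples.map(enforcedAxis);
```

Anything that returns "not-enforced" is simply outside the rule's remit, which is what keeps the linter from pushing people toward fake tokens.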

Completing the warm ramp fixed the root bug. Defining the missing slate-100…900 steps (step 2) means there is no longer any neutral that falls through to cool stock slate — the invisible warm/cool drift from the demo above simply can't happen any more. In all, the rule flagged 28 genuine violations across the 48 atoms and we fixed every one. The rule ships with its own unit tests, so a change that stops catching amber-600 — or starts rejecting a legitimate grid-cols-[…] — fails the build too.

Escape hatches are design debt, not a feature

Every eslint-disable for one of these rules is a crack in the guarantee, and we treat a growing list of them as a bug in the design system rather than a fact of life. The usual fix for "I need a value the rule won't let me use" is not to disable the rule — it is to add a token to @theme and mirror its name in the rule's allowlist. If you genuinely need a one-off, the disable carries a reason and gets reviewed. We audit them periodically. The moment disabling the rule becomes routine, the contract is worthless.
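A periodic audit of that kind could be as small as the sketch below. The rule names are hypothetical. The ` -- reason` suffix it looks for is ESLint's own syntax for attaching a description to a directive comment.

```typescript
// Hypothetical rule names for the colour and font-size rules.
const BRAND_RULES = ["brand/no-stock-colours", "brand/no-arbitrary-font-size"];

type DisableEntry = { line: number; rule: string; reason: string | null };

// Find every eslint-disable directive that targets a brand rule, and record
// whether it carries a reason after the "--" separator.
function auditDisables(source: string): DisableEntry[] {
  const found: DisableEntry[] = [];
  source.split("\n").forEach((text, index) => {
    const match = /eslint-disable(?:-next-line|-line)?\s+([^*\n]+)/.exec(text);
    if (!match) return;
    const [ruleList, ...reasonParts] = match[1].split(/\s--\s/);
    const reason = reasonParts.join(" -- ").trim() || null;
    for (const rule of ruleList.split(",").map((name) => name.trim())) {
      if (BRAND_RULES.indexOf(rule) !== -1) {
        found.push({ line: index + 1, rule, reason });
      }
    }
  });
  return found;
}

// Disables without a reason are the ones to chase first.
function unexplained(entries: DisableEntry[]): DisableEntry[] {
  return entries.filter((entry) => entry.reason === null);
}
```

Run over the enforced directories, the length of `auditDisables` is the size of the debt and `unexplained` is the part nobody has justified.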

The strongest rule is no class at all

A lint rule is a guardrail: it catches the off-brand class after someone writes it. The stronger move is to make the class impossible to write in the first place — and for an atom, the place an off-brand class slips in is the className prop.

Sam Pierce Lolla makes the same point in "Tips for getting LLMs to write good UI code": hand a component an open className and a model will use it to route around the design system, reaching for className="bg-red-500" instead of the variant you meant it to use. His fix is the one we landed on too — prefer a semantic prop over a class string, and take the escape hatch away.

So we piloted it on one atom: our Badge. Before, it was a Record<string, string> colour map plus a className passthrough that callers used for everything from a legitimate shrink-0 to a one-off border colour. After, it is a set of typed variants built with cva — a tiny library that maps typed props like tone="warning" to the right class string — and there is no className at all:

<Badge tone="warning" size="sm" uppercase>Needs review</Badge>

Note the prop is tone, not color, and the value is warning, not amber. That is deliberate: the caller — and the model writing the call — picks what the badge means, not which colour to paint it. The eight tones (danger, caution, pending, success, neutral, brand, warning, outline) are the only ones that exist; there is no slot to pass bg-red-500 into, because the prop is gone.
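To make the closed union concrete, here is a dependency-free sketch of the idea. It is a hand-rolled stand-in for what cva provides, not the actual Badge: the class strings are hypothetical tokens, and the `md` size is an assumption (only `sm` appears above).

```typescript
// One entry per tone; the keys are the only tones that exist.
// Class strings are hypothetical intent-named tokens.
const toneClasses = {
  danger: "bg-danger-surface text-danger",
  caution: "bg-caution-surface text-caution",
  pending: "bg-pending-surface text-pending",
  success: "bg-success-surface text-success",
  neutral: "bg-neutral-surface text-neutral",
  brand: "bg-brand-surface text-brand",
  warning: "bg-warning-surface text-warning",
  outline: "border border-neutral text-neutral",
} as const;

const sizeClasses = {
  sm: "text-xs px-2 py-0.5",
  md: "text-sm px-2.5 py-1", // assumed second size, for illustration
} as const;

type Tone = keyof typeof toneClasses; // closed union of the eight tones
type Size = keyof typeof sizeClasses;

// No className in the props: there is no slot to pass a raw class into.
type BadgeProps = { tone: Tone; size?: Size; uppercase?: boolean };

function badgeClasses({ tone, size = "sm", uppercase = false }: BadgeProps): string {
  const classes: string[] = ["rounded-full font-semibold", toneClasses[tone], sizeClasses[size]];
  if (uppercase) classes.push("uppercase tracking-wide");
  return classes.join(" ");
}

const needsReview = badgeClasses({ tone: "warning", size: "sm", uppercase: true });

// @ts-expect-error "amber" is a colour, not a tone, so this fails type checking
badgeClasses({ tone: "amber" });
```

The design choice is that the caller's vocabulary is the key set of `toneClasses`. Changing the colour behind `warning` touches one line here and no call sites.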

Here are both versions side by side. The left badge still has an open className, so anything lands — a stock red, an off-scale font, a cool-slate hex. The right one only takes typed props, so every badge you can build is on-brand:

Open className="…": anything goes, including drift

<Badge className="bg-red-500 text-white">Pro</Badge>

Typed tone / size / uppercase: pick a prop, not a class string

<Badge tone="brand" size="sm" uppercase>Pro</Badge>

A closed union

tone only accepts the eight values above. An unknown tone is a type error, not a render.

Layout moved out

shrink-0 / mb-1.5 live on a wrapper now. The atom owns its look; the parent owns its place.

One less dependency

No incoming className means nothing to merge — tailwind-merge never had to be installed.

The same badge, two ways to ask for it. With an open className, any class lands — including off-brand ones. With typed props, the only expressible badge is an on-brand one.

Three things fell out of the pilot that we did not expect:

  • The old "typed" prop wasn't typed. color had been declared keyof typeof colorMap, which reads as safe. The catch is in how the map was annotated: Record<string, string> tells TypeScript "the keys are any string," so keyof of it is just string | number, not a union of the actual colour names. The prop quietly accepted anything. The moment cva gave it a real union, the compiler flagged two status-to-colour maps that had been handing it loosely-typed values for months.
  • Layout had to move out. A few call sites used className for real layout — a shrink-0 in a flex row, a mb-1.5 above a title. Fair enough, but it is not the badge's job. Those classes now live on a wrapper around the badge: the atom owns its identity, the parent owns its placement.
  • We dropped a dependency instead of adding one. The usual companion to cva is tailwind-merge, which de-duplicates conflicting classes when a component blends its own with an incoming className. With no incoming className, there is nothing to merge. Removing the escape hatch removed the reason the library would have existed.
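The first of those surprises is easy to reproduce in isolation. The sketch below uses hypothetical map contents and names. It shows the annotation widening the keys, and `as const` (one possible fix, not necessarily the one cva applies) narrowing them again.

```typescript
// Annotated the loose way: the annotation wins, so the keys are "any string".
const looseColorMap: Record<string, string> = {
  warning: "bg-warning-surface",
  danger: "bg-danger-surface",
};
type LooseColor = keyof typeof looseColorMap; // string | number

// Compiles without complaint, and looks up nothing at runtime.
const accepted: LooseColor = "definitely-not-a-colour";
const looseLookup = looseColorMap[accepted];

// Left to inference with `as const`: the keys become a closed union.
const strictColorMap = {
  warning: "bg-warning-surface",
  danger: "bg-danger-surface",
} as const;
type StrictColor = keyof typeof strictColorMap; // "warning" | "danger"

function classFor(color: StrictColor): string {
  return strictColorMap[color];
}

// @ts-expect-error "amber" is not a key of strictColorMap
classFor("amber");
```

The two maps hold identical data; only the annotation differs. That is why the loose version reads as safe in review.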

One atom is a pilot, not a finish line. But it is the shape of the next phase: every variant map that becomes a typed prop is one more class string that no longer exists to drift.

How to apply this yourself

Our stack is Tailwind and ESLint, but the moves carry over to anywhere a model writes most of the code. Here is the order we would do it again:

  1. Put the real vocabulary somewhere a rule can read it. Your brand colours and type scale, in one place, named for meaning — warning, not amber. You can't ban a value before the on-brand one exists to replace it.
  2. Lint every string, not just className. The classes that drift hide in variant maps and template-string builders, where a className-only linter never looks. Read the whole file and match class-shaped fragments against your tokens.
  3. Turn it on at error, scoped to what's already clean. We started at the atoms. Green should mean on-brand; widen the scope as each layer gets fixed.
  4. Make the message name the fix. Half the time it's the model reading it, mid-edit. "Use bg-warning-surface" is a failure and a prompt at the same time.
  5. Remediate, then pull the escape hatch. Swap the off-brand values for tokens, then drop className on the components you've cleaned — a typed cva variant leaves nowhere to paste a stray class.
  6. Name the variants for what they mean. tone="warning", not color="amber", so the colour behind it can change later without anyone touching the call site.
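For step 3, a flat-config entry scoped to the cleaned layer might be shaped like this. The plugin and rule names are hypothetical and the rules are stubs so the sketch stands alone; the `files`, `plugins`, and `rules` keys are ESLint's flat-config shape.

```typescript
// Stub rules so the sketch is self-contained; a real plugin would export the
// actual rule objects under these (hypothetical) names.
const brandPlugin = {
  rules: {
    "no-stock-colours": { create: () => ({}) },
    "no-arbitrary-font-size": { create: () => ({}) },
  },
};

// Widen this list one layer at a time as each layer gets cleaned.
const ENFORCED_GLOBS = ["src/components/atoms/**"];

// In a real eslint.config file this array would be the default export.
const eslintConfig = [
  {
    files: ENFORCED_GLOBS,
    plugins: { brand: brandPlugin },
    rules: {
      "brand/no-stock-colours": "error",
      "brand/no-arbitrary-font-size": "error",
    },
  },
];
```

Keeping the globs in one named list makes "widen the scope" a one-line diff that is easy to review.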

Where this goes

We deliberately started narrow. Atoms are the foundation, they are small, and getting them clean is a contained job. The rest is a dial we turn up over time:

  • Next: molecules and organisms. Widen the ESLint glob one layer at a time, fixing each layer as it comes into scope. Introduce intent-named colour aliases (text-primary, text-muted, surface-card) over the raw tokens, and keep moving variant maps to typed variant props with cva — the Badge pilot above is the first one. The fewer free-text class strings exist, the less there is to drift.
  • Then: close the surface. Once the app is clean, flip @theme { --color-*: initial } and delete the default palette for good. Move dark surfaces off opacity stacks (white/10) onto light-dark() semantic tokens, so "forgot a dark: variant" becomes a bug you can't write.
  • Eventually: repo-wide. Promote the rules out of the atoms-only scope into the default config.

Even if your stack looks nothing like ours, the distinction underneath it is the part worth stealing: review is where good intentions go, and the build is where rules actually live. When a model is doing most of the typing, that stops being pedantic and becomes the only thing holding the line.