Bojador

A small language model that reports confidence you can act on, and spends its thinking only where that pays. On a laptop.

A three-billion-parameter model runs on your laptop in seconds and costs nothing to ask. Two things stop people from trusting it. It spreads its one scarce resource, thinking, evenly over easy and hard questions alike. And when it is wrong, it sounds every bit as sure as when it is right.

Bojador is a small study of two fixes for one such model, SmolLM3-3B. Each was measured against a yardstick fixed before the experiment ran, then picked apart by a second machine acting as a hostile reviewer. Everything here reproduces to the hash.

What was found

It can spend its thinking only where thinking pays. The model answers each question twice, cheaply; if the two tries agree, it stops there. Only when they disagree does it sample several more times and vote. On grade-school math that holds the accuracy of voting on every question while using about 47% of the tokens that voting on every question would, checked with a non-inferiority test fixed in advance rather than cherry-picked.
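The agree-then-stop rule above can be sketched in a few lines. This is an illustrative sketch, not the study's code: `adaptive_answer`, the `sample` callable, and the default of five extra samples are hypothetical names and values, and counting the first two tries in the final vote is an assumption the text does not settle.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_answer(sample: Callable[[], str], extra_samples: int = 5) -> Tuple[str, int]:
    """Return (answer, number_of_samples_used).

    `sample` is a hypothetical callable that asks the model once and
    returns its final answer as a normalized string.
    """
    # Two cheap tries first.
    first, second = sample(), sample()
    if first == second:
        # Agreement: stop here and spend nothing more.
        return first, 2

    # Disagreement: sample several more times and take a majority vote.
    # Assumption: the first two tries count toward the vote.
    votes = Counter([first, second])
    for _ in range(extra_samples):
        votes[sample()] += 1
    answer, _count = votes.most_common(1)[0]
    return answer, 2 + extra_samples
```

On an easy question the function returns after two samples; only a disagreement triggers the full `2 + extra_samples` budget, which is where the token saving would come from.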

It can be taught to say when it is unsure. A small adapter gets the model to put a number on its confidence that genuinely tracks whether it is about to be wrong: AUROC ≈ 0.87, where the most you can wring from the raw signal is about 0.75. The adapter never touches the answers, and the result held up on data it had not seen.
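For readers unfamiliar with the metric, here is a minimal sketch of how an AUROC of this kind can be computed: the probability that a randomly chosen correct answer received a higher confidence than a randomly chosen wrong one, with ties counted as half. The function name `auroc` and the toy numbers in the usage lines are made up for illustration and are not the study's data or code.

```python
from typing import Sequence


def auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney) AUROC of confidence as a predictor of correctness."""
    right = [c for c, ok in zip(confidence, correct) if ok]
    wrong = [c for c, ok in zip(confidence, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one wrong answer")

    # Count pairs where the correct answer got the higher confidence;
    # a tie counts as half a win.
    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0
            elif r == w:
                wins += 0.5
    return wins / (len(right) * len(wrong))


# Toy data, invented for illustration only.
perfect = auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
chance = auroc([0.9, 0.2, 0.8, 0.3], [True, True, False, False])
```

A value of 1.0 means confidence ranks every correct answer above every wrong one, and 0.5 means it ranks them no better than chance, which is the scale on which 0.87 and 0.75 should be read.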

Two things did not work, and they sit here in the same size type as the wins. A calibration claim failed to hold up across three separate attempts. An edge I hoped would carry over to other domains did not. None of this is a leaderboard entry. It is a recipe, an evaluation suite anyone can run, and a fairly drawn map of where the thing works and where it stops.

Read it · run it

“Quem quer passar além do Bojador
tem que passar além da dor.”
(“Whoever wants to pass beyond Bojador must pass beyond pain.”)
Fernando Pessoa · Mar Português