The risk factor is where the honesty lives, and in autonomous driving the most underexamined risk factor is not the technology. It is the actuary's problem of pricing a thing whose risk profile refuses to hold still. A new paper by Doyeon Jang takes that problem head-on, framing what it calls a foundational ratemaking challenge for Automated Driving System deployments: sparse experience, shifting operational design domains, and non-stationary risk across software releases. Translate that out of insurance jargon and it is the question every robotaxi balance sheet eventually has to answer: how do you set a premium for a fleet whose crash rate changes every time the software updates and every time it enters a new city?

The standard insurance answer is credibility theory: blend a specific account's own loss experience with the broader pool, weighting toward the account's own data only as that data accumulates. The trouble for robotaxis is that no individual city, on any single software version, has accumulated enough verified crashes to be credible on its own. Jang's response is a hierarchical Bayesian credibility framework that pools across cities, software versions, and territories through a learned operational-design-domain similarity kernel: essentially, a model that decides how much one city's experience should inform another based on how alike their operating conditions are. The framework is constructed so that the classic Bühlmann-Straub credibility model is a limiting case, which keeps it anchored to established actuarial practice rather than reinventing it.
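
For readers who want that limiting case in concrete form, here is a minimal Python sketch of the textbook Bühlmann-Straub credibility factor, Z = m / (m + k), where m is an account's own exposure and k is the ratio of expected process variance to the variance of the underlying means across accounts. This is the classical formula only, not Jang's hierarchical model, and both the value of k and the mileage figures are hypothetical placeholders chosen for illustration.

```python
def credibility_factor(exposure: float, k: float) -> float:
    """Textbook Buhlmann-Straub credibility factor Z = m / (m + k).

    exposure: the account's own exposure m (here, millions of miles).
    k: expected process variance divided by the variance of the
       hypothetical means (same units as exposure).
    """
    if exposure < 0 or k <= 0:
        raise ValueError("exposure must be >= 0 and k must be > 0")
    return exposure / (exposure + k)


# HYPOTHETICAL k, in millions of miles; not a value from the paper.
K_ASSUMED = 30.0

# Z rises toward 1 only as the account's own exposure accumulates.
for miles_m in (1.0, 10.0, 30.0, 100.0):
    z = credibility_factor(miles_m, K_ASSUMED)
    print(f"{miles_m:6.1f}M miles -> Z = {z:.2f}")
```

At m = k the factor is exactly one-half, so k can be read as the exposure at which an account's own record and the pool count equally.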

"Demonstrated on 648 verified-engaged Waymo crashes across four U.S. metros from the NHTSA Standing General Order database against 116 million matched miles, city-aggregate credibility weights are moderate (0.12-0.46), partial pooling decisively outperforms no pooling, and a power analysis shows the learned kernel's advantage becomes detectable at approximately twelve deployed cities." (arXiv 2606.17451)

That sentence is the entire empirical payload, and it rewards a slow read. The data source is the NHTSA Standing General Order database, built from the federal order that requires AV operators to report crashes involving their automated systems, filtered to 648 verified-engaged Waymo crashes, meaning incidents where the automated system, rather than a human driver, was verified to be engaged. Against those crashes sit 116 million matched miles of exposure. The headline finding is the credibility weights: 0.12 to 0.46. In plain terms, even a city with its own crash record should have that record weighted only 12 to 46 percent, with the remaining majority of the estimate borrowed from the pooled experience of other cities. The single-city data simply is not trustworthy enough on its own yet.
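
The arithmetic behind those weights can be sketched in a few lines of Python. The aggregate rate below is simply the quoted 648 crashes divided by the quoted 116 million miles; using it as the pooled benchmark is a simplifying assumption (the paper's pooled estimate comes from its hierarchical model), and the single-city rate is a made-up number for illustration.

```python
CRASHES = 648           # verified-engaged crashes quoted above
MILES_MILLIONS = 116.0  # matched miles quoted above, in millions

# Raw aggregate rate, crashes per million miles (about 5.59).
pooled_rate = CRASHES / MILES_MILLIONS

# HYPOTHETICAL standalone rate for one city, not from the paper.
city_rate = 8.0


def blended_rate(z: float, own: float, pooled: float) -> float:
    """Credibility-weighted estimate: Z * own + (1 - Z) * pooled."""
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    return z * own + (1.0 - z) * pooled


# The two ends of the reported 0.12-0.46 weight range.
for z in (0.12, 0.46):
    est = blended_rate(z, city_rate, pooled_rate)
    print(f"Z = {z:.2f}: blended rate = {est:.2f} per million miles")
```

Even at the top of the range, the blended figure sits closer to the pooled rate than to the city's own number, which is what a weight below one-half means in practice.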

Why "partial pooling" is the load-bearing finding

The paper's claim that partial pooling decisively outperforms no pooling sounds technical, but it cuts directly at how the robotaxi industry talks about safety. Operators love to cite per-city or per-metro safety statistics (crashes per million miles in Phoenix, in San Francisco) as if each city stood alone as evidence. The credibility analysis says those isolated numbers are statistically thin: with weights below one-half, a city's standalone crash rate carries more noise than signal, and treating it as gospel (no pooling) is the approach the paper finds decisively beaten. Borrowing strength across cities through the similarity kernel (partial pooling) is what produces a defensible estimate. For anyone reading robotaxi safety disclosures, that is a useful filter: a single-city loss number presented without a pooled comparison is closer to marketing than to ratemaking.
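
The intuition has a closed form in the classical Bühlmann-Straub setting, sketched below in Python. With expected process variance s2 and between-city variance a, a city with exposure m has expected squared error s2/m under no pooling, a under complete pooling, and Z^2 * s2/m + (1 - Z)^2 * a under credibility weighting. The sketch assumes the collective mean and both variance components are known, which is a textbook idealization and not the paper's evaluation method; the numeric inputs are hypothetical.

```python
def expected_squared_errors(s2: float, a: float, m: float) -> dict:
    """Expected squared error of three estimators of a city's true rate.

    s2: expected process variance per unit of exposure.
    a:  variance of the true rates across cities.
    m:  the city's own exposure.
    """
    k = s2 / a
    z = m / (m + k)
    return {
        "Z": z,
        "no_pooling": s2 / m,       # city's own rate only
        "complete_pooling": a,      # collective mean only
        "partial_pooling": z ** 2 * s2 / m + (1 - z) ** 2 * a,
    }


# HYPOTHETICAL variance components and exposure, for illustration only.
result = expected_squared_errors(s2=50.0, a=2.0, m=10.0)
for name, value in result.items():
    print(f"{name:>16}: {value:.3f}")
```

In this setting a weight below one-half is equivalent to s2/m > a: the sampling noise in the city's own rate exceeds the real variation between cities, which is the precise sense in which a standalone number is more noise than signal.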

The operational-design-domain framing is what makes this more than a generic insurance exercise. An ODD is the bounded set of conditions a system is designed to handle: the roads, speeds, weather, and times of day where it is rated to operate. Two cities with similar ODDs should share risk information; a sunny-grid metro and a dense rainy old-city core should not be pooled naively. The learned similarity kernel is the mechanism that decides how much to share, and the ODD it compares is precisely the variable that operators themselves keep changing. Every time a robotaxi service expands its ODD (night driving, freeways, a new weather envelope) it resets part of the risk picture, which is exactly the non-stationarity the paper names up front.
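
As a rough illustration of what a similarity kernel does, the Python sketch below scores pairs of cities with a fixed squared-exponential (RBF) kernel over ODD feature vectors. The paper's kernel is learned and its functional form is not described here, so the RBF choice, the length scale, the feature set, and the city profiles are all assumptions made for illustration.

```python
import math


def odd_similarity(a, b, length_scale=0.5):
    """Squared-exponential similarity: 1.0 for identical ODD vectors,
    decaying toward 0 as the vectors move apart."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


# HYPOTHETICAL ODD profiles, each feature scaled to [0, 1]:
# (share of rainy days, share of freeway miles, share of night miles,
#  street-network irregularity)
CITIES = {
    "sunny_grid_a": (0.05, 0.30, 0.40, 0.10),
    "sunny_grid_b": (0.10, 0.25, 0.35, 0.15),
    "rainy_old_core": (0.45, 0.05, 0.40, 0.90),
}

names = list(CITIES)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        sim = odd_similarity(CITIES[first], CITIES[second])
        print(f"{first} ~ {second}: {sim:.3f}")
```

A pooling model could use scores like these to decide how strongly one city's experience informs another. Expanding a city's ODD would move its feature vector and therefore change every similarity it participates in.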

The twelve-city threshold and what it implies

The most forward-looking line is the power analysis showing the learned kernel's advantage becomes detectable at approximately twelve deployed cities. A power analysis asks how much data you need before an effect is reliably distinguishable from chance, and here the answer is roughly a dozen cities. That is a notable marker because it sits just ahead of where the leading robotaxi operator's footprint is heading. Below that threshold, the sophisticated similarity-kernel approach is hard to statistically justify over simpler pooling; at and beyond it, the structure starts to pay off. For regulators and reinsurers, twelve cities is therefore a rough line where the more elaborate kernel-based structure can be shown to earn its place, not the point at which AV liability first becomes model-able, since partial pooling already beats no pooling on four metros. It is also a concrete, falsifiable prediction rather than a vibe about "more data is better."
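
For readers unfamiliar with the mechanics, the Python sketch below shows a generic power calculation of this kind using only the standard library. It treats each city as contributing one paired difference in predictive score between a kernel model and simpler pooling, and uses a one-sided normal approximation. The standardized effect size of 0.7 is an arbitrary placeholder rather than a value from the paper, so the city count it prints is illustrative and does not reproduce the paper's analysis.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power(n_cities: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided z-test on n paired differences.

    effect_size: mean per-city improvement divided by its standard deviation.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return _STD_NORMAL.cdf(effect_size * math.sqrt(n_cities) - z_crit)


def min_cities(effect_size: float, target: float = 0.80,
               alpha: float = 0.05, max_cities: int = 1000) -> int:
    """Smallest number of cities whose approximate power reaches target."""
    for n in range(1, max_cities + 1):
        if power(n, effect_size, alpha) >= target:
            return n
    raise ValueError("target power not reached within max_cities")


# HYPOTHETICAL standardized effect size, chosen only for illustration.
EFFECT = 0.7
for n in (4, 8, 12, 16):
    print(f"{n:2d} cities -> power = {power(n, EFFECT):.2f}")
print("cities needed for 80% power:", min_cities(EFFECT))
```

The useful property is the square-root scaling: halving the effect size roughly quadruples the number of cities needed, which is why a modest kernel advantage can stay statistically invisible on a four-metro dataset.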

To be fair to both the model and its limits: the framework is demonstrated on one operator's verified-engaged crashes across four metros, which is a real but narrow base, and the Standing General Order data carries known caveats around reporting completeness and severity classification. A credibility model prices expected loss; it does not adjudicate fault, and "verified-engaged" tells you the system was on, not that it was at fault. The estimate is also only as good as the matched-mileage denominator, and exposure measurement for AVs is itself contested. None of those caveats undercut the central contribution, which is to give the AV-insurance conversation an actuarial spine: a way to say, with numbers, how much any city's experience should count and when a sophisticated pooling structure earns its keep.

For the broader autonomy debate, the value is in reframing the question. The vision-versus-LiDAR argument and the disengagement-stat theater both obscure a quieter truth: the bottleneck on scaling robotaxis profitably is partly an insurance-pricing problem, and that problem has a shape. It is sparse, it is non-stationary, and per this work the more sophisticated ODD-aware pooling does not demonstrably pay off until roughly a dozen cities of verified experience are on the books. That is a more honest read of where AV deployment economics actually stand than any single-city safety headline.