Vision-only versus mapped gets the headlines, but the fusion camp has a quieter answer: use everything, and learn it together. US12051001B2, granted to UATC in July 2024 and naming AV researcher Raquel Urtasun, covers "Multi-task multi-sensor fusion for three-dimensional object detection."
Two words carry the design: multi-sensor and multi-task. Classified under G06N 3/084 (neural-network training), G01S 17/89 (lidar) and G06V 20/58 (object detection), the patent fuses camera, lidar and radar into one model that detects objects in 3D — and does several perception tasks at once, sharing computation across them rather than running a separate network per job.
The fusion philosophy answers the sensor wars by declining to enlist. Cameras are cheap and rich in semantics but weak on depth; lidar is precise on geometry but sparse and costly; radar sees through weather and measures velocity but is coarse. Fusing them lets each cover the others' weaknesses — the whole is more robust than any single modality the rival camps fight over.
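To make the "each covers the others' weaknesses" claim concrete, here is a minimal sketch of inverse-variance fusion applied to three range estimates. This is a hand-written late-fusion analogy, not the patent's learned method, and the sensor readings, uncertainties and names (`RangeEstimate`, `fuse_ranges`) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RangeEstimate:
    sensor: str
    range_m: float   # estimated distance to the object, in metres
    sigma_m: float   # assumed 1-sigma uncertainty, in metres (illustrative)


def fuse_ranges(estimates):
    """Inverse-variance weighted mean. Returns (fused_range_m, fused_sigma_m)."""
    weights = [1.0 / (e.sigma_m ** 2) for e in estimates]
    total = sum(weights)
    fused = sum(w * e.range_m for w, e in zip(weights, estimates)) / total
    return fused, (1.0 / total) ** 0.5


# Hypothetical readings: camera is vague on depth, lidar is precise, radar is coarse.
estimates = [
    RangeEstimate("camera", 42.0, 4.0),
    RangeEstimate("lidar", 40.1, 0.1),
    RangeEstimate("radar", 40.8, 1.0),
]

fused_range, fused_sigma = fuse_ranges(estimates)

# If lidar drops out, the remaining sensors still produce a usable estimate.
no_lidar = [e for e in estimates if e.sensor != "lidar"]
degraded_range, degraded_sigma = fuse_ranges(no_lidar)
```

Under these assumed numbers the fused uncertainty comes out slightly below the best single sensor's, and losing the lidar degrades the estimate rather than removing it. A learned fusion network aims for the same property without hand-set weights.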
The multi-task half is an efficiency argument. Running one shared network that does detection and related perception jobs together is cheaper than a zoo of specialized models — a real concern when the compute rides in a car. The patent is as much about fitting perception in a power budget as about accuracy.
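The efficiency argument reduces to simple arithmetic, sketched below. The compute figures and task names are hypothetical placeholders, not numbers from the patent; the only assumption that matters is that a shared backbone dominates the cost and each task adds a comparatively small head.

```python
def separate_cost(backbone_gflops, head_gflops):
    """Cost when every task runs its own full network (backbone plus head)."""
    return sum(backbone_gflops + h for h in head_gflops.values())


def shared_cost(backbone_gflops, head_gflops):
    """Cost when one backbone runs once and feeds every task head."""
    return backbone_gflops + sum(head_gflops.values())


# Illustrative per-frame costs in GFLOPs (assumed, not measured).
BACKBONE_GFLOPS = 80
HEAD_GFLOPS = {"detection_3d": 6, "auxiliary_a": 2, "auxiliary_b": 4}

separate = separate_cost(BACKBONE_GFLOPS, HEAD_GFLOPS)
shared = shared_cost(BACKBONE_GFLOPS, HEAD_GFLOPS)
saving = 1 - shared / separate

print(f"separate: {separate} GFLOPs, shared: {shared} GFLOPs, saving: {saving:.0%}")
```

With these assumed figures the shared design needs 92 GFLOPs per frame against 252 for three separate networks. The saving grows with each added task, because the backbone is paid for only once.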
The honest cost is complexity and calibration. Fusing three modalities means keeping three sensor types calibrated, time-synchronized and trained together — a heavier engineering burden than a vision-only stack. The fusion camp accepts that burden as the price of robustness; the vision camp rejects it as needless cost. Both are coherent positions.
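Time synchronization is the easiest part of that burden to show in code. The sketch below pairs each lidar sweep with the nearest camera and radar frames and drops any sweep that lacks a close enough match. The frame rates, the 15 ms tolerance and the function names are assumptions for illustration, not details from the patent.

```python
from bisect import bisect_left


def nearest(timestamps, t):
    """Return the value closest to t in a sorted, non-empty list of timestamps."""
    i = bisect_left(timestamps, t)
    candidates = timestamps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s - t))


def synchronize(lidar_ms, camera_ms, radar_ms, tolerance_ms=15):
    """Pair each lidar sweep with the nearest camera and radar frames.

    Sweeps with no camera or radar frame within tolerance_ms are dropped.
    """
    matched = []
    for t in lidar_ms:
        cam = nearest(camera_ms, t)
        rad = nearest(radar_ms, t)
        if abs(cam - t) <= tolerance_ms and abs(rad - t) <= tolerance_ms:
            matched.append((t, cam, rad))
    return matched


# Hypothetical timestamps in milliseconds; each sensor runs at its own rate.
lidar_ms = [0, 100, 200, 300]
camera_ms = [2, 35, 68, 101, 134, 167, 200, 233, 266]
radar_ms = [10, 60, 110, 160, 210, 290]

matched = synchronize(lidar_ms, camera_ms, radar_ms)
```

Here the last lidar sweep is discarded because the camera stream ends too early to match it. A production stack would more likely interpolate or motion-compensate than drop data, which is part of why the engineering burden is real.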
For readers tired of the binary camera-versus-lidar framing, this patent is a reminder that a serious third camp exists. The answer many AV programs actually shipped was not "pick the right sensor" but "fuse them well," and the IP behind that answer is this kind of multi-task, multi-sensor model.