GaussianDet3D: Bridging Gaussian Splatting and Sparse LiDAR Detection for Multi-View 3D Object Detection

1Technical University of Munich, 2UC Berkeley, 3Munich Center for Machine Learning, 4DeepScenario
*Corresponding author: malaz(dot)tamim(at)tum(dot)de
DriveX Workshop at CVPR 2026

Abstract

Accurate 3D object detection from cameras alone remains a fundamental challenge in autonomous driving, particularly for precise localization and velocity estimation, two metrics critical for safe trajectory planning and collision avoidance. Existing camera-based methods lift image features into dense Bird's-Eye View (BEV) grids, which struggle to capture fine-grained geometry and motion cues.

We present GaussianDet3D, to the best of our knowledge the first method to apply 3D Gaussian Splatting from multi-view images to 3D object detection in autonomous driving, treating the predicted Gaussian primitives as a pseudo-LiDAR point cloud that is fed directly into a sparse LiDAR detector. Unlike a LiDAR point, which carries only coordinates and intensity, each Gaussian encodes parameters capturing geometry, orientation, opacity, and a per-class semantic distribution. By aggregating Gaussian point clouds across multiple frames, GaussianDet3D captures temporal motion cues that enable precise velocity estimation without explicit tracking.

On the nuScenes benchmark, GaussianDet3D achieves state-of-the-art translation error and velocity error among all camera-based methods, reducing BEVFormer's errors by 8.1% and 13.1% respectively, while remaining competitive in overall detection score. These results demonstrate that Gaussian Splatting provides a geometrically precise, semantically rich representation that bridges the gap between image-based perception and LiDAR-quality spatial reasoning, particularly for the localization and motion estimation tasks most critical to autonomous driving safety.

Method

GaussianDet3D consists of four components: an image encoder (ResNet-101-DCN with FPN), a Gaussian lifter that backprojects image features into an initial set of 3D Gaussian primitives via depth estimation and Farthest Point Sampling, a Gaussian encoder that iteratively refines the Gaussians through sparse 3D convolution and deformable cross-attention, and FSD V2, a fully sparse LiDAR detector that receives the Gaussian point cloud directly.
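The four stages above can be sketched as a thin wrapper that hands each component's output to the next. All class and argument names here are illustrative stand-ins for the components named in the text, not the authors' actual code:

```python
import numpy as np

class GaussianDet3D:
    """Hypothetical sketch of the four-stage pipeline described above."""

    def __init__(self, image_encoder, gaussian_lifter, gaussian_encoder, detector):
        self.image_encoder = image_encoder        # e.g. ResNet-101-DCN + FPN
        self.gaussian_lifter = gaussian_lifter    # depth estimation + FPS initialization
        self.gaussian_encoder = gaussian_encoder  # sparse 3D conv + deformable cross-attention
        self.detector = detector                  # FSD V2, used without modification

    def forward(self, images, camera_params):
        feats = self.image_encoder(images)                       # multi-view feature maps
        gaussians = self.gaussian_lifter(feats, camera_params)   # initial 3D Gaussians
        gaussians = self.gaussian_encoder(gaussians, feats)      # iterative refinement
        # Interpret refined Gaussians as a pseudo-LiDAR point cloud:
        # (N, 3) means as coordinates, (N, 25) remaining parameters as features.
        points = np.concatenate([gaussians["mean"], gaussians["features"]], axis=-1)
        return self.detector(points)
```

The design choice to pass the Gaussians to the detector as a plain (N, 28) array is what lets the sparse LiDAR detector stage remain architecturally unchanged.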

Each Gaussian is defined as G_i = [m_i, s_i, r_i, a_i, c_i] ∈ ℝ^28, where m_i ∈ ℝ^3 is the 3D mean, s_i ∈ ℝ^3 the axis-aligned scale, r_i ∈ ℝ^4 a unit quaternion, a_i ∈ ℝ the opacity, and c_i ∈ ℝ^17 unnormalized semantic logits. The mean serves as the 3D coordinate and the remaining 25 parameters form the feature vector, yielding a pseudo-LiDAR point cloud passed directly to FSD V2 without architectural modification.

For temporal aggregation, Gaussian point clouds from K consecutive keyframes are transformed into the current ego-vehicle frame, with a dedicated timestamp channel encoding relative frame time to enable velocity estimation without explicit tracking.
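A minimal sketch of this aggregation, assuming homogeneous 4×4 rigid transforms into the current ego frame and the 28-channel layout above; the function name and signature are hypothetical:

```python
import numpy as np

def aggregate_frames(frames, ego_from_frame, timestamps):
    """Warp K per-keyframe Gaussian point clouds into the current ego frame
    and append a dedicated relative-timestamp channel.

    frames: list of (N_k, 28) arrays (xyz in columns 0-2, features after);
    ego_from_frame: list of (4, 4) rigid transforms into the current ego frame;
    timestamps: list of relative frame times (0.0 for the current keyframe).
    Returns a (sum N_k, 29) array.
    """
    out = []
    for pts, T, t in zip(frames, ego_from_frame, timestamps):
        xyz_h = np.concatenate([pts[:, :3], np.ones((len(pts), 1))], axis=-1)
        xyz = (xyz_h @ T.T)[:, :3]              # warp into the current ego frame
        ts = np.full((len(pts), 1), t)          # relative-time channel
        out.append(np.concatenate([xyz, pts[:, 3:], ts], axis=-1))
    return np.concatenate(out, axis=0)
```

The timestamp channel is what lets the downstream detector associate displaced copies of the same object across frames and regress velocity without an explicit tracker.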


Overview of the GaussianDet3D pipeline. Multi-view images are encoded into deep feature maps, lifted into initialized 3D Gaussians, refined by the Gaussian encoder, and finally interpreted as a Gaussian point cloud fed to FSD V2 for 3D bounding box prediction.

Results

nuScenes Test Set

Comparison on nuScenes test set. All methods use ResNet101-DCN backbone. Bold = best.

Method Frames mATE ↓ mASE ↓ mAVE ↓ mAOE ↓ mAAE ↓ mAP ↑ NDS ↑
DETR3D 1 0.641 0.255 0.845 0.394 0.133 0.412 0.479
Focal-PETR 1 0.617 0.250 0.862 0.398 0.146 0.426 0.486
PETR 1 0.647 0.251 0.933 0.433 0.143 0.391 0.455
PolarDETR 1 0.588 0.253 0.845 0.408 0.129 0.431 0.493
BEVFormer 4 0.631 0.257 0.435 0.405 0.143 0.445 0.535
GaussianDet3D (ours) 4 0.580 0.253 0.378 0.500 0.129 0.398 0.514

nuScenes Validation Set

Comparison on nuScenes validation set. All methods use ResNet101-DCN backbone. Bold = best.

Method Frames mATE ↓ mASE ↓ mAVE ↓ mAOE ↓ mAAE ↓ mAP ↑ NDS ↑
DETR3D 1 0.716 0.268 0.842 0.379 0.200 0.349 0.434
Focal-PETR 1 0.678 0.263 0.804 0.395 0.202 0.390 0.461
PETR 1 0.717 0.267 0.834 0.412 0.190 0.366 0.441
PolarDETR 2 0.707 0.269 0.518 0.344 0.196 0.383 0.488
BEVFormer 4 0.672 0.274 0.397 0.369 0.198 0.415 0.517
GaussianDet3D (ours) 4 0.649 0.272 0.340 0.537 0.199 0.380 0.490

Ablation Studies

Detector Backbone

Single-frame comparison with 6,400 Gaussians. All three LiDAR-style detectors are viable consumers of Gaussian point clouds.

Model mAP NDS
PointPillars 0.2677 0.3188
FocalFormer3D 0.3016 0.3628
FSD V2 0.3198 0.3676

Gaussian Feature Components

Incremental ablation of Gaussian parameters using PointPillars with 6,400 Gaussians, single frame.

Ch. Components mAP NDS
4 mean + opacity 0.1786 0.2655
7 + scale 0.1979 0.2842
11 + rotation 0.2156 0.2880
28 + semantics 0.2664 0.3187
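Each ablation row corresponds to truncating the packed feature vector. The channel ordering below (opacity before scale and rotation) mirrors the ablation rows and is assumed for illustration; it differs from the [m, s, r, a, c] storage order given in the Method section:

```python
import numpy as np

# Channel counts per ablation setting, following the table rows above:
# mean (3) + opacity (1) = 4, + scale (3) = 7, + rotation (4) = 11,
# + semantic logits (17) = 28. Ordering assumed for illustration.
ABLATION_CHANNELS = {"mean+opacity": 4, "+scale": 7, "+rotation": 11, "+semantics": 28}

def select_channels(points, setting):
    """Truncate an (N, 28) Gaussian point cloud to one ablation setting."""
    return points[:, :ABLATION_CHANNELS[setting]]
```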

Temporal Aggregation

Adding frames consistently improves detection. Going from 1 to 2 frames cuts velocity error by 64.0%.

Frames Gauss. Total mATE ↓ mASE ↓ mAVE ↓ mAOE ↓ mAP ↑ NDS ↑
1 6400 6400 0.7326 0.2772 1.2596 0.7084 0.3115 0.3619
2 6400 12800 0.7182 0.2774 0.4536 0.5981 0.3284 0.4386
4 6400 25600 0.6811 0.2772 0.3477 0.5685 0.3423 0.4637
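The quoted 64.0% reduction follows directly from the mAVE column of the table above:

```python
# Relative mAVE reduction when going from 1 to 2 frames (values from the table).
mave_1_frame = 1.2596
mave_2_frames = 0.4536
reduction = (mave_1_frame - mave_2_frames) / mave_1_frame
print(f"{reduction:.1%}")  # prints "64.0%"
```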

Qualitative Results

Detection results under diverse conditions. BEV predictions in blue, ground truth in green.

Daytime

Qualitative results - daytime scene.

Cloudy

Qualitative results - cloudy scene.

Rainy

Qualitative results - rainy scene.

Nighttime

Qualitative results - nighttime scene.

BibTeX

Citation information will be available upon arXiv publication.