World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluation. However, current research primarily evaluates these models on visual realism or downstream task performance, with limited attention to fidelity to specific action instructions, a property crucial for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, limiting reproducibility.
To address this gap, we develop an open-access evaluation framework, ACT-Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large-scale dataset that pairs short context videos from nuScenes with corresponding future trajectory data; these pairs provide the conditional input for generating future video frames and enable evaluation of how faithfully the generated motion executes the instructed action. Furthermore, Terra is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity.
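To make the pairing concrete, here is a minimal sketch of what one benchmark entry might look like. The field names and shapes are hypothetical illustrations, not the dataset's actual schema: a short conditioning clip, the instruction identifier, and the future ego trajectory the world model is asked to execute.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BenchmarkSample:
    """One hypothetical ACT-Bench entry (illustrative schema, not the real one)."""

    context_frames: np.ndarray    # (T_ctx, H, W, 3) short conditioning video
    instruction: str              # e.g. "curving_to_right/curving_to_right_sharp"
    future_trajectory: np.ndarray  # (T_fut, 2) future ego positions in metres


# Constructing a dummy sample with placeholder arrays.
sample = BenchmarkSample(
    context_frames=np.zeros((4, 8, 8, 3)),
    instruction="stopping/stopping_35kmph",
    future_trajectory=np.zeros((10, 2)),
)
print(sample.instruction)
```

A world model under evaluation would consume `context_frames` and `future_trajectory` as conditioning, then the framework scores the motion extracted from the generated video against the instructed trajectory.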
Leveraging this framework, we demonstrate that the state-of-the-art model does not fully adhere to given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.
| Metric | Vista | Terra (v1) | Terra (v2) |
|---|---|---|---|
| Accuracy (↑) | 0.307 | 0.441 | 0.632 |
| ADE (↓) | 4.50 | 3.98 | 3.86 |
| FDE (↓) | 8.66 | 8.21 | 8.05 |
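ADE and FDE are the standard displacement metrics between an estimated trajectory and a reference: ADE averages the pointwise L2 error over all timesteps, while FDE measures the error at the final timestep only. A minimal sketch, assuming trajectories are `(T, 2)` arrays of ego positions in metres:

```python
import numpy as np


def ade(pred: np.ndarray, ref: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all timesteps."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())


def fde(pred: np.ndarray, ref: np.ndarray) -> float:
    """Final Displacement Error: L2 distance at the last timestep."""
    return float(np.linalg.norm(pred[-1] - ref[-1]))


# Toy example: the predicted trajectory drifts a constant 1 m laterally
# from a straight reference, so both ADE and FDE equal 1.0.
ref = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=1)
pred = ref + np.array([0.0, 1.0])
print(ade(pred, ref))  # 1.0
print(fde(pred, ref))  # 1.0
```

Lower is better for both, which is why the table marks them with (↓).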
While Vista remains the state-of-the-art driving world model in terms of FID/FVD, its instruction adherence is lower than that of Terra v1 and v2. The confusion matrix for Vista indicates a tendency to generate videos where the ego vehicle moves straight ahead slowly, with limited ability to execute left or right turns.
In contrast, Terra v1 is markedly better at following left- and right-turn instructions, though it exhibits a bias toward veering too far to the right. Terra v2 eliminates this left/right bias and achieves the highest accuracy of the three models. Notably, its instruction-execution consistency is more than double that of Vista, highlighting its robustness in following specified trajectory instructions.
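The accuracy and confusion-matrix results above require mapping each generated motion back to an instruction class. One plausible sketch of such a step, under the assumption that the trajectory estimated from a generated video is assigned to whichever instruction template it matches most closely in ADE (the function names and templates here are illustrative, not the benchmark's actual implementation):

```python
import numpy as np


def classify_by_template(traj: np.ndarray, templates: dict) -> str:
    """Return the key of the template trajectory closest to `traj` in ADE."""
    def ade(a, b):
        return np.linalg.norm(a - b, axis=-1).mean()
    return min(templates, key=lambda k: ade(traj, templates[k]))


# Hypothetical templates: x is forward, y is lateral (negative = right).
t = np.arange(10, dtype=float)
templates = {
    "straight": np.stack([t, np.zeros_like(t)], axis=1),
    "curving_to_right": np.stack([t, -0.05 * t**2], axis=1),
}

# A generated motion that curves right should match the right-turn template.
estimated = np.stack([t, -0.04 * t**2], axis=1)
print(classify_by_template(estimated, templates))  # curving_to_right
```

Accuracy is then the fraction of samples whose predicted class matches the instructed one, and tallying (instructed, predicted) pairs yields the confusion matrix.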
Qualitative examples cover the following instruction templates:

- `curving_to_right/curving_to_right_sharp`
- `curving_to_left/curving_to_left_moderate`
- `straight_constant_speed/straight_constant_speed_35kmph`
- `straight_decelerating/straight_decelerating_35kmph`
- `stopping/stopping_35kmph`
- `straight_constant_speed/straight_constant_speed_25kmph`
- `straight_accelerating/straight_accelerating_25kmph`
- `shifting_towards_right/shifting_towards_right_short`
- `shifting_towards_left/shifting_towards_left_short`
@misc{arai2024actbench,
title={ACT-Bench: Towards Action Controllable World Models for Autonomous Driving},
author={Hidehisa Arai and Keishi Ishihara and Tsubasa Takahashi and Yu Yamaguchi},
year={2024},
eprint={2412.05337},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.05337},
}