ACT-Bench: Towards Action Controllable World Models for Autonomous Driving

Hidehisa Arai, Keishi Ishihara, Tsubasa Takahashi, Yu Yamaguchi,
Turing Inc.
ACT-Bench framework overview.

Abstract

World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluations. However, current research primarily evaluates these models on visual realism or downstream task performance, with limited focus on fidelity to specific action instructions, a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, limiting reproducibility.

To address this gap, we develop an open-access evaluation framework, ACT-Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large-scale dataset pairing short context videos from nuScenes with corresponding future trajectory data; the pairs serve as conditional inputs for generating future video frames and enable evaluation of the action fidelity of the executed motions. Furthermore, Terra is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity.

Leveraging this framework, we demonstrate that the state-of-the-art model does not fully adhere to given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.


How it works



  1. Dataset: The ACT-Bench dataset provides paired frames and trajectory instructions, which serve as input for driving world models.
  2. Video Generation: World models generate driving videos conditioned on the provided prior frames and specified trajectory instructions.
  3. Video Analysis: The generated videos are analyzed using the ACT-Estimator, a video analysis model that performs two tasks:
    • Predicts the ego vehicle's motion class.
    • Estimates the ego vehicle's trajectory.
  4. Metrics Calculation: The estimated trajectory is compared with the instructed trajectory to calculate the Trajectory Alignment metrics. Similarly, the predicted motion class is matched against the motion class derived from the instructed trajectory to compute the Instruction-Execution Consistency metric.
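The trajectory comparison in step 4 can be sketched with the standard displacement-error metrics reported below (ADE/FDE). This is a minimal illustration of the usual formulation, averaging per-timestep Euclidean distances between waypoints; the exact definitions used by ACT-Bench (coordinate frame, sampling rate) may differ.

```python
import numpy as np

def ade_fde(estimated: np.ndarray, instructed: np.ndarray) -> tuple[float, float]:
    """Average and Final Displacement Error between two (T, 2) waypoint
    trajectories in the ground plane (common formulation; the exact
    ACT-Bench definition may differ)."""
    # Per-timestep Euclidean distance between estimated and instructed waypoints
    dists = np.linalg.norm(estimated - instructed, axis=-1)
    ade = float(dists.mean())  # averaged over all timesteps
    fde = float(dists[-1])     # displacement at the final timestep
    return ade, fde

# Toy example: the estimated trajectory drifts 0.5 m laterally
# from a straight-ahead instructed trajectory.
instructed = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=-1)
estimated = instructed + np.array([0.0, 0.5])
ade, fde = ade_fde(estimated, instructed)
print(ade, fde)  # 0.5 0.5
```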

Evaluation Results of Vista, Terra v1, and Terra v2


Confusion matrices of predicted motion classes for Vista, Terra v1, and Terra v2.


                 Vista    Terra (v1)    Terra (v2)
Accuracy (↑)     0.307    0.441         0.632
ADE (↓)          4.50     3.98          3.86
FDE (↓)          8.66     8.21          8.05

While Vista remains the state-of-the-art driving world model in terms of FID/FVD, its instruction adherence is lower than that of Terra v1 and v2. Vista's confusion matrix shows a tendency to generate videos in which the ego vehicle moves slowly straight ahead, with limited ability to execute left or right turns.


In contrast, Terra v1 follows left- and right-turn instructions more reliably, but it exhibits a bias toward generating videos that veer significantly to the right. Terra v2 eliminates this left/right turning bias and achieves the highest accuracy of the three models. Notably, its instruction-execution consistency is more than double Vista's, highlighting its robustness in following specified trajectory instructions.
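The Instruction-Execution Consistency accuracy in the table can be read directly off a confusion matrix of instructed versus predicted motion classes: it is the fraction of generated videos whose predicted class matches the instruction. A minimal sketch with illustrative counts (not the paper's actual numbers):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: instructed class,
# cols: predicted class) over ["left", "straight", "right"].
# The counts below are made up for illustration only.
cm = np.array([
    [30, 15,  5],   # instructed left
    [ 2, 45,  3],   # instructed straight
    [ 4, 20, 26],   # instructed right
])

# Overall accuracy: diagonal (correctly executed instructions)
# divided by the total number of generated videos.
accuracy = np.trace(cm) / cm.sum()
print(round(float(accuracy), 3))  # 0.673
```

A per-class breakdown (each row normalized by its sum) is what the confusion-matrix figures above visualize, and it is what reveals biases such as Terra v1's rightward drift.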


Visualization of Videos Generated by Terra v2

Instruction: curving_to_right/curving_to_right_sharp

Instruction: curving_to_left/curving_to_left_moderate

Instruction: straight_constant_speed/straight_constant_speed_35kmph

Instruction: straight_decelerating/straight_decelerating_35kmph

Instruction: stopping/stopping_35kmph

Instruction: straight_constant_speed/straight_constant_speed_25kmph

Instruction: straight_accelerating/straight_accelerating_25kmph

Instruction: shifting_towards_right/shifting_towards_right_short

Instruction: shifting_towards_left/shifting_towards_left_short
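The instruction labels above appear to follow a `<motion_class>/<motion_class>_<variant>` naming scheme. A small helper, under that assumed convention, to split a label into its motion class and variant:

```python
def parse_instruction(name: str) -> dict:
    """Split a label like 'stopping/stopping_35kmph' into motion class and
    variant, assuming the '<motion_class>/<motion_class>_<variant>' scheme
    suggested by the examples above (not an official ACT-Bench API)."""
    motion_class, template = name.split("/")
    # The variant is whatever follows the repeated class name and underscore.
    variant = template[len(motion_class) + 1:] if template != motion_class else ""
    return {"motion_class": motion_class, "template": template, "variant": variant}

print(parse_instruction("curving_to_right/curving_to_right_sharp"))
# {'motion_class': 'curving_to_right', 'template': 'curving_to_right_sharp', 'variant': 'sharp'}
```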

BibTeX

@misc{arai2024actbench,
  title={ACT-Bench: Towards Action Controllable World Models for Autonomous Driving},
  author={Hidehisa Arai and Keishi Ishihara and Tsubasa Takahashi and Yu Yamaguchi},
  year={2024},
  eprint={2412.05337},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.05337},
}