Three-stage joint optimization scheme
ORION addresses the semantic-action alignment challenge with the following architectural design:
- Cross-modal alignment layer: an EVA-CLIP visual encoder (224 × 224 inputs) and a QLoRA-fine-tuned 7B-parameter LLM share an attention mechanism
- Trainable interface: a lightweight adapter (only 0.5M parameters) added to the VLM outputs encodes textual commands as kinematic parameters
- Online correction mechanism: trajectory deviations are corrected with real-time feedback from CARLA's RGB-D camera
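As a rough illustration of the lightweight adapter described above (this is a sketch, not ORION's actual code): a standard bottleneck module keeps the trainable parameter count small. The hidden size (4096, typical for a 7B LLM), the bottleneck width, and the number of kinematic outputs are all assumptions chosen so the parameter count lands near 0.5M:

```python
import numpy as np

class KinematicAdapter:
    """Hypothetical bottleneck adapter mapping VLM hidden states to
    kinematic parameters (e.g. target speed, curvature, waypoints).
    All dimensions are illustrative, not taken from the ORION repo."""

    def __init__(self, hidden_dim=4096, bottleneck=64, n_kinematic=6, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into a narrow bottleneck keeps the adapter small:
        # 2 * 4096 * 64 weights ~= 0.5M parameters.
        self.w_down = rng.normal(0, 0.02, (hidden_dim, bottleneck))
        self.w_up = rng.normal(0, 0.02, (bottleneck, hidden_dim))
        # Readout head producing the kinematic parameter vector.
        self.w_head = rng.normal(0, 0.02, (hidden_dim, n_kinematic))

    def num_params(self):
        return self.w_down.size + self.w_up.size + self.w_head.size

    def __call__(self, h):
        # Residual bottleneck (tanh used here as a stand-in nonlinearity).
        h = h + np.tanh(h @ self.w_down) @ self.w_up
        return h @ self.w_head  # kinematic parameters per input vector

adapter = KinematicAdapter()
out = adapter(np.zeros((1, 4096)))
```

Freezing the VLM and training only a module of this size is what makes the interface cheap to fine-tune.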
Operational steps:
- Preparation: download the eva02_petr_proj.pth and pretrain_qformer.pth weight files
- Training configuration: set joint_optimization=True in configs/train.yaml
- Validation: run python eval_gap.py -metric semantic_action_gap
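The text does not specify how the semantic-action gap metric is computed; one plausible reading is a matching rate between instruction embeddings and executed-action embeddings. A minimal sketch under that assumption (the function name, inputs, and the cosine-similarity threshold are all hypothetical, not the eval_gap.py implementation):

```python
import numpy as np

def instruction_action_match_rate(instr_emb, action_emb, threshold=0.5):
    """Hypothetical metric: the fraction of samples whose instruction and
    executed-action embeddings agree, i.e. cosine similarity above a
    threshold. An illustration only, not ORION's eval_gap.py."""
    instr = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    action = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    sims = np.sum(instr * action, axis=1)    # per-sample cosine similarity
    return float(np.mean(sims > threshold))  # matching rate in [0, 1]

# Toy check: identical embeddings match perfectly.
x = np.random.default_rng(0).normal(size=(8, 16))
rate = instruction_action_match_rate(x, x)
```

A metric of this shape would yield the percentage figures quoted below.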
The scheme achieves 82.3% instruction-action matching on the nuScenes validation set, a 2.1× improvement over the baseline method. It is especially strong in complex scenarios such as yielding to pedestrians.
This answer is drawn from the article "Orion: Xiaomi's Open-Source End-to-End Autonomous Driving Reasoning and Planning Framework".