AI Summary

General Motors is developing a training infrastructure for its autonomous driving systems that can simulate scenarios at 50,000 times real-world speed. The company combines large-scale simulation, reinforcement learning, and foundation models to prepare its vehicles for rare, unpredictable situations, what engineers call the "long tail": a widespread traffic-light outage in San Francisco, a mattress on the roadway, a construction zone directed by gesturing workers. To handle these complex scenarios, GM is developing Vision-Language-Action (VLA) models, built from large vision-language models with specialized decoding heads added for driving tasks. These models let a vehicle understand, for example, that a police officer's gesture overrides a red light, or visually identify a loading zone at an airport. To address the latency inherent in large models, GM has designed a "Dual Frequency VLA" architecture: a large model runs at low frequency for high-level semantic decisions ("Is that object a branch or a cinder block?"), while a lightweight model handles real-time spatial control (steering and braking). This separation provides the benefit of deep reasoning without compromising the reaction times safety requires. The models also generate reasoning traces that engineers can read, easing the debugging and validation of vehicle behavior. Autonomous driving remains one of the most demanding challenges in physical AI: a system must interpret a chaotic environment in real time, anticipate human behavior, and operate reliably across an endless variety of situations.

GM, which is targeting eyes-off highway driving first before reaching full autonomy, is betting on massive simulation to compensate for the impossibility of collecting enough real-world data on these exceptional situations.
This is a sponsored article brought to you by General Motors.

Autonomous driving is one of the most demanding problems in physical AI. An automated system must interpret a chaotic, ever-changing world in real time: navigating uncertainty, predicting human behavior, and operating safely across an immense range of environments and edge cases. At General Motors, we approach this problem from a simple premise: while most moments on the road are predictable, the rare, ambiguous, and unexpected events, the long tail, are what ultimately define whether an autonomous system is safe, reliable, and ready for deployment at scale. (Note: While here we discuss research and emerging technologies to solve the long tail required for full general autonomy, we also cover our current approach to solving 99 percent of everyday autonomous driving in a deep dive on Compound AI.)

As GM advances toward eyes-off highway driving, and ultimately toward fully autonomous vehicles, solving the long tail becomes the central engineering challenge. It requires developing systems that can be counted on to behave sensibly in the most unexpected conditions. GM is building scalable driving AI to meet that challenge, combining large-scale simulation, reinforcement learning, and foundation-model-based reasoning to train autonomous systems at a scale and speed that would be impossible in the real world alone.

Stress-testing for the long tail

Long-tail scenarios in autonomous driving come in a few varieties. Some are notable for their rarity. There's a mattress on the road. A fire hydrant bursts. A massive power outage in San Francisco disabled traffic lights and required driverless vehicles to navigate never-before-experienced challenges. These rare system-level interactions, especially in dense urban environments, show how unexpected edge cases can cascade at scale. But long-tail challenges don't just come in the form of once-in-a-lifetime rarities.
They also manifest as everyday scenarios that call for characteristically human courtesy or common sense. How do you queue for a spot without blocking traffic in a crowded parking lot? Or navigate a construction zone guided by gesturing workers and ad-hoc signs? These are simple challenges for a human driver but require inventive engineering for a machine to handle flawlessly.

[Figure: Autonomous driving scenario demand curve]

Deploying vision language models

One tool GM is developing to tackle these nuanced scenarios is Vision Language Action (VLA) models. Starting with a standard Vision Language Model, which leverages internet-scale knowledge to make sense of images, GM engineers use specialized decoding heads to fine-tune for distinct driving-related tasks. The resulting VLA can reason about vehicle trajectories and detect 3D objects on top of its general image-recognition capabilities. These tuned models enable a vehicle to recognize that a police officer's hand gesture overrides a red traffic light, or to identify what a "loading zone" at a busy airport terminal might look like. These models can also generate reasoning traces that help engineers and safety operators understand why a maneuver occurred, an important tool for debugging, validation, and trust.

Testing hazardous scenarios in high-fidelity simulations

The trouble is that driving requires split-second reaction times, so any excess latency poses an especially critical problem. To solve this, GM is developing a "Dual Frequency VLA." A large-scale model runs at a lower frequency to make high-level semantic decisions ("Is that object in the road a branch or a cinder block?"), while a smaller, highly efficient model handles the immediate, high-frequency spatial control (steering and braking). This hybrid approach allows the vehicle to benefit from deep semantic reasoning without sacrificing the split-second reaction times required for safe driving.
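The dual-frequency split can be sketched in a few lines of Python. Everything below (class names, the 30-meter threshold, the tick schedule) is illustrative stand-in logic to show the pattern of a slow semantic model feeding a fast controller, not GM's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SemanticDecision:
    label: str        # e.g. "branch" vs. "cinder_block"
    drivable: bool    # may the fast controller simply drive over it?

def slow_semantic_model(observation: str) -> SemanticDecision:
    """Stand-in for the large VLA: expensive, so it runs at low frequency."""
    harmless = {"branch", "plastic_bag"}
    return SemanticDecision(observation, observation in harmless)

def fast_controller(decision: SemanticDecision, distance_m: float) -> dict:
    """Stand-in for the lightweight spatial model: cheap, runs every tick."""
    if decision.drivable or distance_m > 30.0:
        return {"steer": 0.0, "brake": 0.0}
    return {"steer": 0.2, "brake": 0.6}   # evade and slow down

def drive(observation: str, ticks: int, slow_every: int = 10):
    """High-frequency control loop; semantics refresh only every few ticks."""
    decision = slow_semantic_model(observation)  # initial semantic pass
    distance = 40.0                              # meters to the object
    log = []
    for t in range(ticks):
        if t % slow_every == 0:                  # low-frequency update
            decision = slow_semantic_model(observation)
        cmd = fast_controller(decision, distance)  # runs every tick
        log.append((t, decision.label, cmd["brake"]))
        distance -= 1.0                          # vehicle closes in
    return log

log = drive("cinder_block", ticks=20)
print(log[-1])   # final tick: braking, since a cinder block is not drivable
```

In this toy loop the expensive call happens once every ten ticks while steering and braking are recomputed on every tick, which is the essence of the hybrid approach described above.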
But dealing with an edge case safely requires that the model not only understand what it is looking at but also know how to sensibly drive through the challenge it has identified. For that, there is no substitute for experience. That is why, each day, we run millions of high-fidelity closed-loop simulations, equivalent to tens of thousands of human driving days compressed into hours of simulation. We can replay actual events, modify real-world data to create new virtual scenarios, or design new ones entirely from scratch. This allows us to regularly test the system against hazardous scenarios that would be nearly impossible to encounter safely in the real world.

Synthetic data for the hardest cases

Where do these simulated scenarios come from? GM engineers employ a whole host of AI technologies to produce novel training data that can model extreme situations while remaining grounded in reality. GM's "Seed-to-Seed Translation" research, for instance, leverages diffusion models to transform existing real-world data, allowing a researcher to turn a clear-day recording into a rainy or