AI Summary
Amazon Bedrock, the AWS artificial intelligence platform, now offers an open source solution for analyzing videos at scale using multimodal models capable of processing images and text together. Available on GitHub, the solution is built around three distinct architectures, each suited to different use cases and cost/performance trade-offs. It addresses a growing need among organizations in sectors as varied as surveillance, media production, social media, and corporate communications. Where traditional computer vision approaches were limited to detecting predefined patterns (slow, rigid, and unable to grasp semantic context), the new foundation models in Amazon Bedrock change the game. The first approach, called "frame-based", extracts frames at regular intervals, removes visual duplicates using similarity algorithms (including Amazon's 256-dimensional Nova multimodal embeddings, or OpenCV ORB feature detection), then submits these frames to an image understanding model while the audio track is transcribed separately with Amazon Transcribe. This workflow is particularly suited to security surveillance, industrial quality control, and regulatory compliance. Two other architectures round out the offering, each optimized for different scenarios such as media scene analysis, ad break detection, and content moderation on social media. The entire pipeline is orchestrated by AWS Step Functions, providing industrial-grade scalability and reliability.

Automated video analysis at scale has become a major strategic concern for organizations that generate or receive massive volumes of visual content. Until now, this work relied largely on manual review or fixed rule-based systems, which are costly and poorly adaptable. The integration of multimodal models able to understand the meaning of a scene, answer questions about the content, or detect nuanced events represents a significant qualitative leap for automating complex business workflows.
Video content is now everywhere, from security surveillance and media production to social platforms and enterprise communications. However, extracting meaningful insights from large volumes of video remains a major challenge. Organizations need solutions that can understand not only what appears in a video, but also the context, narrative, and underlying meaning of the content.

In this post, we explore how the multimodal foundation models (FMs) of Amazon Bedrock enable scalable video understanding through three distinct architectural approaches. Each approach is designed for different use cases and cost-performance trade-offs. The complete solution is available as an open source AWS sample on GitHub.

The evolution of video analysis

Traditional video analysis approaches rely on manual review or basic computer vision techniques that detect predefined patterns. While functional, these methods face significant limitations:

- Scale constraints: Manual review is time-consuming and expensive
- Limited flexibility: Rule-based systems can't adapt to new scenarios
- Context blindness: Traditional CV lacks semantic understanding
- Integration complexity: Difficult to incorporate into modern applications

The emergence of multimodal foundation models on Amazon Bedrock changes this paradigm. These models can process both visual and textual information together. This enables them to understand scenes, generate natural language descriptions, answer questions about video content, and detect nuanced events that would be difficult to define programmatically.

Three approaches to video understanding

Understanding video content is inherently complex, combining visual, auditory, and temporal information that must be analyzed together for meaningful insights.
Different use cases, such as media scene analysis, ad break detection, IP camera tracking, or social media moderation, require distinct workflows with varying cost, accuracy, and latency trade-offs. This solution provides three distinct workflows, each using different video extraction methods optimized for specific scenarios.

Frame-based workflow: precision at scale

The frame-based approach samples image frames at fixed intervals, removes similar or redundant frames, and applies image understanding foundation models to extract visual information at the frame level. Audio transcription is performed separately using Amazon Transcribe. This workflow is ideal for:

- Security and surveillance: Detect specific conditions or events across time
- Quality assurance: Monitor manufacturing or operational processes
- Compliance monitoring: Verify adherence to safety protocols

The architecture uses AWS Step Functions to orchestrate the entire pipeline.

Smart sampling: optimizing cost and quality

A key feature of the frame-based workflow is intelligent frame deduplication, which significantly reduces processing costs by removing redundant frames while preserving visual information. The solution provides two distinct similarity comparison methods.

Nova Multimodal Embeddings (MME) comparison uses the multimodal embeddings model of Amazon Nova to generate 256-dimensional vector representations of each frame. Each frame is encoded into a vector embedding using the Nova MME model, and the cosine distance between consecutive frames is computed. Frames with a distance below the threshold (default 0.2, where lower values indicate higher similarity) are removed. This approach excels at semantic understanding of image content, remaining robust to minor variations in lighting and perspective while capturing high-level visual concepts. However, it incurs additional Amazon Bedrock API costs for embedding generation and adds slightly higher latency per frame.
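As an illustration, the consecutive-frame deduplication just described can be sketched as follows. The embeddings themselves would come from the Nova MME model via the Amazon Bedrock API; here they are passed in as plain vectors, and the function names and structure are assumptions for illustration, not the sample's actual code.

```python
import numpy as np

DISTANCE_THRESHOLD = 0.2  # default from the sample: lower distance = more similar


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors (0.0 = same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def deduplicate_frames(embeddings: list, threshold: float = DISTANCE_THRESHOLD) -> list:
    """Return indices of frames to keep.

    A frame is dropped when the cosine distance between its embedding and the
    previous frame's embedding falls below the threshold (i.e., the two frames
    are too similar to both be worth analyzing)."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(embeddings)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) >= threshold:
            kept.append(i)
    return kept
```

For example, given three embeddings where the first two point in nearly the same direction, only the first and third frames survive deduplication, and only those two are sent to the image understanding model.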
This method is recommended for content where semantic similarity matters more than pixel-level differences, such as detecting scene changes or identifying unique moments.

OpenCV ORB (Oriented FAST and Rotated BRIEF) takes a computer vision approach, using feature detection to identify and match key points between consecutive frames without requiring external API calls. ORB detects key points and computes binary descriptors for each frame, calculating the similarity score as the ratio of matched features to total key points. With a default threshold of 0.325 (where higher values indicate higher similarity), this method offers fast processing with minimal latency and no additional API costs. The rotation-invariant feature matching makes it excellent for detecting camera movement and frame transitions. However, it can be sensitive to significant lighting changes and may not capture semantic similarity as effectively as embedding-based approaches. This method is recommended for static camera scenarios like surveillance footage, or cost-sensitive applications where pixel-level similarity is sufficient.

Shot-based workflow: understanding narrative flow

Instead of sampling individual frames, the shot-based workflow segments video into short clips (shots) or