AI summary

Amazon SageMaker AI introduces enhanced metrics with a configurable publishing frequency, offering granular visibility at the level of individual EC2 instances and containers. These new metrics let you monitor CPU, GPU, and memory utilization per instance, as well as request patterns, errors, and latencies with precise dimensions. With Inference Components, it is now possible to calculate the true cost per model by tracking GPU allocation at the level of each inference component.
Running machine learning (ML) models in production requires more than infrastructure resilience and scaling efficiency. You need near-continuous visibility into performance and resource utilization. When latency increases, invocations fail, or resources become constrained, you need immediate insight to diagnose and resolve issues before they impact your customers. Until now, Amazon SageMaker AI provided Amazon CloudWatch metrics that offered useful high-level visibility, but these were aggregate metrics across all instances and containers. While helpful for overall health monitoring, these aggregated metrics obscured individual instance and container details, making it difficult to pinpoint bottlenecks, improve resource utilization, or troubleshoot effectively.

SageMaker AI endpoints now support enhanced metrics with a configurable publishing frequency. This launch provides the granular visibility needed to monitor, troubleshoot, and improve your production endpoints. With SageMaker AI endpoint enhanced metrics, you can now drill down into container-level and instance-level metrics, which provide capabilities such as:

- View specific model copy metrics. With multiple model copies deployed across a SageMaker AI endpoint using Inference Components, it's useful to view metrics per model copy, such as concurrent requests, GPU utilization, and CPU utilization, to help diagnose issues and provide visibility into production workload traffic patterns.
- View how much each model costs. With multiple models sharing the same infrastructure, calculating the true cost per model can be complex. With enhanced metrics, you can now calculate and attribute cost per model by tracking GPU allocation at the inference component level.

What's new

Enhanced metrics introduce two categories of metrics with multiple levels of granularity:

- EC2 resource utilization metrics: Track CPU, GPU, and memory consumption at the instance and container level.
- Invocation metrics: Monitor request patterns, errors, latency, and concurrency with precise dimensions.

Each category provides different levels of visibility depending on your endpoint configuration.

Instance-level metrics: available for all endpoints

Every SageMaker AI endpoint now has access to instance-level metrics, giving you visibility into what's happening on each Amazon Elastic Compute Cloud (Amazon EC2) instance in your endpoint.

Resource utilization (CloudWatch namespace: /aws/sagemaker/Endpoints)

Track CPU utilization, memory consumption, and per-GPU utilization and memory usage for every host. When an issue occurs, you can immediately identify which specific instance needs attention. For accelerator-based instances, you will see utilization metrics for each individual accelerator.

Invocation metrics (CloudWatch namespace: AWS/SageMaker)

Track request patterns, errors, and latency by drilling down to the instance level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help you pinpoint exactly which instance experienced issues. These metrics help you diagnose uneven traffic distribution, identify error-prone instances, and correlate performance issues with specific resources.

Container-level metrics: for inference components

If you're using Inference Components to host multiple models on a single endpoint, you now have container-level visibility.

Resource utilization (CloudWatch namespace: /aws/sagemaker/InferenceComponents)

Monitor resource consumption per container. See CPU, memory, GPU utilization, and GPU memory usage for each model copy. This visibility helps you understand which inference component model copies are consuming resources, maintain fair allocation in multi-tenant scenarios, and identify containers experiencing performance issues. These detailed metrics include dimensions for InferenceComponentName and ContainerId.
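As a sketch of how you might query these container-level metrics, the helper below builds one entry for the MetricDataQueries parameter of CloudWatch's get_metric_data API (callable through boto3). The metric name GPUUtilization and the dimension values are illustrative assumptions, not confirmed names; substitute the metrics your endpoint actually publishes.

```python
# Sketch: build a CloudWatch GetMetricData query entry for a
# container-level metric in the /aws/sagemaker/InferenceComponents
# namespace. Metric and dimension values below are illustrative.

def build_metric_query(query_id, namespace, metric_name, dimensions,
                       period=60, stat="Average"):
    """Return one entry for the MetricDataQueries parameter of
    cloudwatch.get_metric_data()."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in dimensions.items()
                ],
            },
            "Period": period,
            "Stat": stat,
        },
    }

query = build_metric_query(
    "gpu_util",
    "/aws/sagemaker/InferenceComponents",
    "GPUUtilization",  # assumed metric name
    {"InferenceComponentName": "my-ic", "ContainerId": "container-0"},
)
```

You would pass a list of such entries to boto3's `cloudwatch.get_metric_data()` along with a start and end time to retrieve the per-container time series.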
Invocation metrics (CloudWatch namespace: AWS/SageMaker)

Track request patterns, errors, and latency at the container level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help you pinpoint exactly where issues occurred.

Configuring enhanced metrics

Enable enhanced metrics by adding one parameter when creating your endpoint configuration:

```python
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g6.12xlarge',
        'InitialInstanceCount': 2
    }],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricsPublishFrequencyInSeconds': 10  # Default is 60 seconds
    }
)
```

Choosing your publishing frequency

After you've enabled enhanced metrics, configure the publishing frequency based on your monitoring needs:

- Standard resolution (60 seconds): The default frequency provides detailed visibility for most production workloads. This is sufficient for capacity planning, troubleshooting, and optimization, while keeping costs manageable.
- High resolution (10 seconds): Publishing metrics every 10 seconds helps you detect short-lived issues, such as sudden traffic spikes, more quickly, at the cost of publishing more data points to CloudWatch.
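The cost-per-model attribution described earlier can be sketched as a simple proportional split: once you have each inference component's GPU consumption from the enhanced metrics, divide the instance cost by each component's share. The GPU-hour figures and instance cost below are made-up illustrations, not real metric output or pricing.

```python
# Sketch: attribute shared-instance cost to models in proportion to the
# GPU-hours each inference component consumed. All numbers are
# illustrative placeholders.

def attribute_cost(gpu_hours_by_component, total_instance_cost):
    """Split total_instance_cost across components by GPU-hour share."""
    total_gpu_hours = sum(gpu_hours_by_component.values())
    return {
        name: round(total_instance_cost * hours / total_gpu_hours, 2)
        for name, hours in gpu_hours_by_component.items()
    }

# Example: two models sharing one instance over a day (hypothetical data)
usage = {"model-a": 18.0, "model-b": 6.0}        # GPU-hours per component
costs = attribute_cost(usage, total_instance_cost=96.0)
# model-a used 18 of 24 GPU-hours, so it is attributed 72.0 of the 96.0
```

In practice you would derive the GPU-hour figures from the per-component utilization metrics in the /aws/sagemaker/InferenceComponents namespace over your billing window.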