In December 2025, we announced the availability of Reinforcement Fine-Tuning (RFT) on Amazon Bedrock, starting with support for Nova models. This was followed in February 2026 by extended support for open-weight models such as OpenAI GPT OSS 20B and Qwen 3 32B. RFT in Amazon Bedrock automates the end-to-end customization workflow, allowing models to learn from feedback on multiple possible responses using a small set of prompts rather than the large training datasets traditional approaches require.

In this post, we walk through the end-to-end workflow of using RFT on Amazon Bedrock with OpenAI-compatible APIs: from setting up authentication, to deploying a Lambda-based reward function, to kicking off a training job and running on-demand inference on your fine-tuned model. We use the GSM8K math dataset as our working example and target OpenAI's gpt-oss-20B model hosted on Amazon Bedrock.

How reinforcement fine-tuning works

Reinforcement Fine-Tuning (RFT) represents a shift in how we customize large language models (LLMs). Unlike traditional supervised fine-tuning (SFT), which requires models to learn from static input-output pairs, RFT enables models to learn through an iterative feedback loop: they generate responses, receive evaluations, and continuously improve their decision-making capabilities.

The core concept: learning from feedback

At its heart, reinforcement learning is about teaching an agent (in this case, an LLM) to make better decisions by providing feedback on its actions. Think of it like training a chess player. Instead of showing them every possible move in every possible situation (which is impossible), you let them play and tell them which moves led to winning positions. Over time, the player learns to recognize patterns and make strategic decisions that lead to success.
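The same feedback loop can be sketched in a few lines. This is a toy illustration only; the names (generate_candidates, reward, rank_candidates) are hypothetical stand-ins for the actor model's sampling step and the reward signal, not Bedrock APIs.

```python
# Toy sketch of the RFT feedback loop: sample several candidate
# responses, score each with a reward function, and rank them so the
# policy can be nudged toward high-reward behavior.

def generate_candidates(prompt):
    # Stand-in for sampling multiple responses from the actor model.
    return ["The answer is 5", "The answer is 4", "I am not sure"]

def reward(prompt, response):
    # Stand-in verifiable reward: 1.0 if the response contains the
    # correct answer to "2 + 2", else 0.0.
    return 1.0 if "4" in response else 0.0

def rank_candidates(prompt):
    scored = sorted(
        ((reward(prompt, r), r) for r in generate_candidates(prompt)),
        reverse=True,
    )
    # Real RFT updates the model toward the high-reward responses;
    # here we only surface the ranking that would drive that update.
    return scored

best_score, best_response = rank_candidates("What is 2 + 2?")[0]
```

Over many such iterations, responses that score well become more likely, which is the "winning positions" signal from the chess analogy.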
For LLMs, the model generates multiple possible responses to a given prompt, receives scores (rewards) for each response based on how well it meets your criteria, and learns to favor the patterns and strategies that produce higher-scoring outputs.

Key components of RFT

Key RFT components include the agent/actor (policy) model, input states to the model, output actions from the model, and the reward function, as shown in the following diagram:

- The actor model is the foundation model (FM) that you're customizing. In Amazon Bedrock RFT, this could be Amazon Nova, Llama, Qwen, or other supported models.
- The state is the current context, including the prompt, conversation history (for multi-turn interactions), and relevant metadata.
- The action is the model's response to a prompt.
- The reward function assigns a numerical score to a (state, action) pair, evaluating how good a model response is for a given state. To do so, it can use additional information such as ground-truth responses or unit tests for code generation. This is the critical feedback signal that drives learning: higher rewards indicate better responses.

One of RFT's key advantages is that the model learns from responses it generates during training, not only from pre-collected examples. This approach unlocks several compounding benefits. Because the model actively explores novel approaches and learns from the results, it can adapt in real time: as it improves, it naturally encounters new scenarios that push it further. This also makes the process far more efficient, removing the need to pre-generate and label thousands of examples upfront. The result is a system capable of continuous improvement, growing stronger as it encounters an ever-more-diverse range of situations. This online learning capability is what enables RFT to achieve superior performance on complex tasks like code generation, mathematical reasoning, and multi-turn conversations.
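For the GSM8K working example, a verifiable reward function can be as simple as comparing the final number in the model's response against the ground-truth answer. The following is a hedged sketch in the shape of a Lambda handler; the event fields ("completion", "ground_truth") are illustrative assumptions, not the exact contract Bedrock RFT passes to your function.

```python
# Illustrative Lambda-style reward function for GSM8K-style math
# problems: binary verifiable reward based on the final numeric answer.
import re

def extract_final_number(text):
    """Return the last number appearing in the text, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def lambda_handler(event, context):
    completion = event["completion"]      # model response (the action)
    ground_truth = event["ground_truth"]  # reference answer
    predicted = extract_final_number(completion)
    expected = extract_final_number(ground_truth)
    # Binary verifiable reward: 1.0 for an exact match, else 0.0.
    score = 1.0 if predicted is not None and predicted == expected else 0.0
    return {"reward": score}
```

Because the check is fully automatic, every generated response can be scored without human labeling, which is exactly what makes math a good fit for RFT.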
For verifiable tasks like math, this is especially effective because correctness checking is fully automatic, avoiding the need for human labeling.

How Amazon Bedrock RFT works

Amazon Bedrock RFT is built to make reinforcement fine-tuning practical at the enterprise level. It handles the heavy lifting, so teams can focus on the problem they're solving rather than the infrastructure underneath it.

The entire RFT pipeline runs automatically. For each prompt in your training dataset, Amazon Bedrock generates multiple candidate responses from your actor model, managing batching, parallelization, and resource allocation behind the scenes. Reward computation scales just as seamlessly: whether you're using verifiable rewards or an LLM-as-a-judge setup, Amazon Bedrock orchestrates evaluation across thousands of prompt-response pairs while handling concurrency and error recovery without manual intervention. Policy optimization runs on Group Relative Policy Optimization (GRPO), a state-of-the-art reinforcement learning algorithm, with built-in convergence detection so training stops when it should. Throughout the process, Amazon CloudWatch metrics and the Amazon Bedrock console give you real-time visibility into reward trends, policy updates, and overall training progress.
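To give a feel for the optimization step Bedrock manages for you, here is a simplified sketch of the group-relative advantage at the heart of GRPO: rewards for a group of candidate responses to the same prompt are normalized against the group's mean and standard deviation, so the policy is pushed toward above-average responses. This is a conceptual illustration only, not the managed implementation.

```python
# Group-relative advantages as used in GRPO (simplified): normalize each
# candidate's reward against the statistics of its own group.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Above-average candidates get positive advantages, below-average
    # ones negative; eps guards against a zero-variance group.
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidates for one prompt: two correct (reward 1.0), two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Normalizing within the group means the learning signal depends on how a response compares to its siblings, not on the absolute reward scale, which keeps updates stable across prompts of very different difficulty.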