At re:Invent 2025, Peter DeSantis, Senior Vice President of Utility Computing at AWS, delivered a keynote that was equal parts retrospective and revelation. Joined by Dave Brown, Vice President of Compute and Machine Learning Services, the session unpacked two decades of architectural bets while unveiling the next wave of innovations that will define how modern applications, AI systems, and global-scale services are built.
The foundations of the cloud - security, availability, elasticity, cost efficiency, and agility - are becoming more important than ever as AI innovation takes center stage.
Security, which has always been table stakes, now carries amplified urgency as generative tools accelerate both defensive and offensive capabilities. Availability and performance remain non-negotiable as model sizes balloon and inference traffic becomes uneven and unpredictable. Elasticity has moved from convenience to necessity since AI workloads can spike dramatically, often without warning. And underlying it all is cost: in a world where experimenting, iterating, and retraining models becomes part of everyday development, organizations that can do so efficiently will be the ones that advance fastest.
Against this backdrop, AWS introduced major updates across silicon, compute, serverless, vector storage, and inference systems that not only address today’s AI builders' needs but also anticipate the demands of tomorrow’s models.
Silicon is increasingly where performance, efficiency, and scale are won, and AWS continued its long-standing investment in custom chips with several major updates.
AWS introduced Graviton5, the latest generation of its cloud-native CPU line, along with the preview of the M9g instance family. Graviton5 includes upgraded cache architecture, stronger memory throughput, and new efficiency optimizations designed for modern distributed workloads. The most compelling part of the launch came from customer results, which show how these improvements translate into production impact.
Cloud workloads running on early Graviton5 hardware have already shown significant real-world improvements, with performance gains of 20-60% across latency-sensitive services, high-throughput pipelines, and large-scale transactional systems. These results come from production environments rather than controlled benchmarks, which underscores how meaningful the architectural upgrades in Graviton5 actually are.
M9g instances make these improvements available to a wide range of workloads. General-purpose applications, large-scale microservices fleets, cache layers, and internal platforms can all benefit immediately. This continues a trend that has been clear for several years. Graviton is becoming the default compute path for organizations that want stronger performance per dollar and better energy efficiency without sacrificing compatibility.
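As a practical aside for teams weighing that move, the standard EC2 DescribeInstanceTypes API can enumerate the arm64 (Graviton) instance families available in a region. A minimal boto3 sketch follows; the region is chosen arbitrarily, and the families actually returned depend on what your account and region expose.

```python
# List arm64 (Graviton) instance types available in a region, a common first
# step when planning a migration toward families like M9g. Uses the standard
# EC2 DescribeInstanceTypes API via boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

paginator = ec2.get_paginator("describe_instance_types")
arm64_types = []
for page in paginator.paginate(
    Filters=[{"Name": "processor-info.supported-architecture", "Values": ["arm64"]}]
):
    arm64_types.extend(t["InstanceType"] for t in page["InstanceTypes"])

# Group by family, e.g. "m8g" or "c8g" today, and "m9g" once the new family
# is available in your region.
families = sorted({t.split(".")[0] for t in arm64_types})
print(families)
```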
For generative AI, AWS introduced Trainium3 along with a second generation of Trainium UltraServers. This hardware is designed for large model training and high-throughput inference at a global scale. While AWS's public materials highlight performance and cost efficiency, Peter’s keynote gave a rare behind-the-scenes look at the architectural innovations:
144 Trainium3 chips per UltraServer
20 TB of high-bandwidth memory
700 TB/s of aggregate memory bandwidth
Up to 360 PFLOPS of FP8 compute performance
A redesigned Neuron switch fabric enabling low-latency, high-throughput interconnects
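To put those aggregate figures in perspective, a quick back-of-envelope division by the 144 chips per UltraServer gives rough per-chip numbers. These are not official per-chip specifications, just arithmetic on the totals listed above.

```python
# Back-of-envelope per-chip figures derived by dividing the UltraServer
# aggregates by the 144 Trainium3 chips; not official per-chip specs.
chips = 144
hbm_total_tb = 20            # TB of high-bandwidth memory
bandwidth_total_tbps = 700   # TB/s aggregate memory bandwidth
compute_total_pflops = 360   # PFLOPS of FP8 compute

print(f"HBM per chip:       ~{hbm_total_tb / chips * 1000:.0f} GB")       # ~139 GB
print(f"Bandwidth per chip: ~{bandwidth_total_tbps / chips:.1f} TB/s")    # ~4.9 TB/s
print(f"FP8 per chip:       ~{compute_total_pflops / chips:.1f} PFLOPS")  # 2.5 PFLOPS
```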
AWS also confirmed the release of NKI, the Neuron Kernel Interface, a low-level ML kernel development language in the Neuron SDK. NKI enables developers to fine-tune kernels for Trainium and Inferentia through capabilities such as custom operator development, more efficient attention mechanisms, deep performance tuning, and zero-overhead profiling. It offers the level of control typically associated with CUDA, but tailored specifically for AWS silicon.
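For a feel of what NKI development looks like, here is a minimal kernel sketch modeled on the tensor-addition example in the public Neuron documentation. The exact module paths and decorators vary across Neuron SDK versions, so treat the imports and decorator below as assumptions rather than a verified recipe.

```python
# Minimal NKI kernel sketch (modeled on the public Neuron SDK tensor-add
# example); exact module paths and decorators depend on the SDK version.
from neuronxcc import nki
import neuronxcc.nki.language as nl


@nki.jit
def tensor_add_kernel(a_input, b_input):
    """Elementwise add of two HBM tensors, computed in on-chip memory."""
    # Allocate the output tensor in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Load both inputs from HBM into on-chip tiles.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # Compute on-chip, then write the result back to HBM.
    nl.store(c_output, value=a_tile + b_tile)
    return c_output
```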
This level of vertical integration signals a clear strategic intent. AWS is not following the trajectory of GPU-based systems. It is building an end-to-end platform for training and inference that prioritizes real-world workload efficiency, power management, and cost-effectiveness.
Dave Brown introduced Lambda Managed Instances, one of the most impactful updates to Lambda since its launch. For the first time, Lambda functions can run on EC2 instances inside a customer’s account while AWS continues to manage provisioning, patching, scaling, and availability. This model combines the operational simplicity of serverless with the predictable performance and configurability of instance-based compute.
This shift brings Lambda to a new class of workloads that previously required containers or custom orchestration. High-throughput media processing, machine learning preprocessing, and persistent data transformation workflows can now operate in a familiar environment without operational overhead. It also reduces architectural fragmentation. Teams no longer have to choose between Lambda for simplicity and EC2 for performance. They can have both without compromise.
One of the most foundational announcements was Mantle, the next-generation inference engine that powers Amazon Bedrock. Mantle redefines how large language models execute at scale. As described in the keynote, it introduces system-level innovations:
Service tiers allow customers to choose priority, standard, or flexible execution depending on latency requirements.
Per-customer fairness queues ensure consistent performance in multi-tenant inference environments (see the sketch after this list).
Journaling lets long-running inference sessions resume mid-stream rather than restart.
Fine-tuning runs on the same fleet as inference and pauses automatically during periods of high demand.
Confidential computing capabilities ensure model weights and prompts remain encrypted during execution.
This architecture reflects a deep understanding of modern AI systems. Reliability in this context comes not just from hardware performance but from how work is prioritized and executed within shared clusters.
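AWS has not published Mantle's internals, but the per-customer fairness idea can be illustrated with a simple round-robin scheduler over per-tenant queues. This is a conceptual sketch only, not Mantle's actual design.

```python
# Conceptual illustration of per-tenant fairness queues (NOT Mantle's actual
# implementation): each customer gets its own FIFO queue, and the scheduler
# services tenants round-robin so one busy tenant cannot starve the others.
from collections import deque, defaultdict


class FairScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # tenant_id -> queue of requests

    def submit(self, tenant_id, request):
        self.queues[tenant_id].append(request)

    def drain(self):
        """Yield requests one per tenant per round, skipping empty queues."""
        while any(self.queues.values()):
            for tenant_id in list(self.queues):
                if self.queues[tenant_id]:
                    yield tenant_id, self.queues[tenant_id].popleft()


sched = FairScheduler()
for i in range(3):
    sched.submit("tenant-a", f"a-req-{i}")   # a busy tenant with a backlog
sched.submit("tenant-b", "b-req-0")          # a light tenant

# tenant-b's single request is served in the first round, interleaved with
# tenant-a's backlog instead of waiting behind all of it.
print(list(sched.drain()))
```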
Retrieval has become a central part of AI application design. AWS introduced several advancements that move vector intelligence into the center of the cloud.
S3 Vectors allows customers to store and query billions of vectors directly inside Amazon S3 with no servers to manage. Query latency stays under 100 milliseconds even at massive scale. This transforms S3 from a passive data store into an active retrieval engine that supports RAG pipelines at a fraction of the cost of specialized vector databases.
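To ground this, here is a rough sketch of what writing and querying embeddings might look like with the s3vectors client in boto3. The bucket and index names are placeholders, and the operation and parameter names follow the preview documentation as currently published, so they may change.

```python
# Sketch of storing and querying embeddings with the S3 Vectors preview API
# via boto3. "media-vectors" and "scene-embeddings" are placeholder names,
# and the client parameters reflect the preview docs, so they may change.
import boto3

s3vectors = boto3.client("s3vectors", region_name="us-east-1")

# Write a batch of embeddings into an existing vector index.
s3vectors.put_vectors(
    vectorBucketName="media-vectors",
    indexName="scene-embeddings",
    vectors=[
        {
            "key": "clip-0001",
            "data": {"float32": [0.12, -0.43, 0.88, 0.05]},  # toy 4-dim embedding
            "metadata": {"title": "keynote-highlights"},
        }
    ],
)

# Query the index for the nearest neighbors of a new embedding.
response = s3vectors.query_vectors(
    vectorBucketName="media-vectors",
    indexName="scene-embeddings",
    queryVector={"float32": [0.10, -0.40, 0.90, 0.07]},
    topK=3,
    returnMetadata=True,
    returnDistance=True,
)
for match in response["vectors"]:
    print(match["key"], match.get("distance"))
```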
OpenSearch has been updated to combine keyword search with semantic and vector-based retrieval. When paired with AWS’s new multimodal embeddings model, applications can perform unified search across text, audio, images, video, and documents without juggling multiple embedding approaches. This significantly simplifies development and improves retrieval quality.
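As an illustration, a hybrid keyword-plus-vector query might look like the following with the opensearch-py client. The index, field, pipeline, and model_id values are placeholders, and the query shape follows OpenSearch's documented hybrid and neural search features rather than any AWS-specific API from the keynote.

```python
# Sketch of a hybrid (keyword + vector) query in OpenSearch using opensearch-py.
# Index name, field names, pipeline name, and model_id are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "hybrid": {
            "queries": [
                # Lexical relevance from classic keyword matching.
                {"match": {"transcript": "ultraserver memory bandwidth"}},
                # Semantic relevance from an embedding model registered in the cluster.
                {
                    "neural": {
                        "transcript_embedding": {
                            "query_text": "ultraserver memory bandwidth",
                            "model_id": "<embedding-model-id>",  # placeholder
                            "k": 10,
                        }
                    }
                },
            ]
        }
    }
}

# "hybrid-pipeline" is a placeholder search pipeline that normalizes and
# combines the keyword and vector score distributions.
results = client.search(
    index="keynote-transcripts",
    body=query,
    params={"search_pipeline": "hybrid-pipeline"},
)
print(results["hits"]["hits"][:3])
```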
The keynote highlighted real examples of how organizations are adopting these capabilities. Apple is running Swift on Graviton for services like the App Store and Apple Music with notable improvements in latency and cost. Twelve Labs is using S3 Vectors to power video intelligence across petabytes of content. Descartes demonstrated real-time visual intelligence models running entirely on Trainium3, generating live video with near-zero latency.
The announcements from Peter DeSantis and Dave Brown show a cloud platform that is being reshaped for the AI era. Compute is becoming more specialized. Storage is becoming more intelligent. Serverless is becoming more flexible. Inference is becoming more predictable. Retrieval is becoming more powerful. The cloud’s next chapter will be defined by infrastructure that understands AI deeply and supports it at every layer.
If your organization is exploring how to adopt AI, modernize infrastructure, or take advantage of the latest AWS innovations, OpsGuru can help. Our teams specialize in cloud modernization, AI platforms, and large-scale architectures that turn these capabilities into real outcomes. Contact us to begin planning your next step.