A major financial-services firm has successfully upgraded more than 2,000 Amazon EC2 instances from Windows Server 2016 to Windows Server 2025, avoiding the traditional lift-and-shift migration by harnessing an automated in-place upgrade path built on AWS Systems Manager. The project, which transformed a fleet of aging virtual machines running a soon-to-be-unsupported operating system into a modern, secure environment, stands as one of the largest-scale in-place Windows Server upgrades on AWS to date.
With Windows Server 2016 long past the end of mainstream support and the end of extended support approaching, enterprises face mounting pressure to modernize. For this financial-services customer, a greenfield redeployment of over 2,000 servers was not an option. Business-critical applications, tightly coupled configurations, and complex interdependencies meant that any migration strategy had to preserve existing identities, IP addresses, and application states. In-place upgrade emerged as the only viable path, and AWS Systems Manager Automation provided the engine to execute it at scale.
The End-of-Support Clock and Why In-Place Upgrade Won
Microsoft’s published lifecycle policy puts Windows Server 2016’s extended support end date at January 12, 2027. After that date, no more security patches or bug fixes will be provided unless customers purchase costly Extended Security Updates (ESUs). For a highly regulated financial institution, operating unsupported software is a compliance nonstarter. The customer’s security and risk teams mandated a full migration to Windows Server 2025, which offers advanced security features like SMB over QUIC, improved Windows Defender Application Control, and Secured-core server capabilities.
Two broad migration strategies exist: side-by-side migration, where a new OS instance is built and applications are moved over, and in-place upgrade, where the existing OS is transformed to the new version in situ. Side-by-side migrations, often implemented with AWS Application Migration Service or manual rebuilds, preserve the old instance as a rollback point but require reconfiguring networking, security groups, and application dependencies. For a fleet of 2,000+ servers running proprietary trading systems, risk models, and data pipelines, the operational overhead and risk of misconfiguration were simply too high.
In-place upgrade, by contrast, keeps the instance identity intact: the EC2 instance ID, private IP address, security group memberships, tags, and even the underlying EBS volumes remain the same. The operating system is swapped from within. AWS Systems Manager Automation makes this feasible by orchestrating the entire sequence—pre-upgrade checks, snapshotting, OS media injection, silent setup, and post-upgrade validation—across an arbitrary number of targets.
AWS Systems Manager Automation: The Orchestration Backbone
At the heart of the migration is AWS Systems Manager Automation, a service that lets you define and execute IT workflows as code. Automation runbooks—called documents—can be authored in YAML or JSON and can chain together steps that run on EC2 instances, call other AWS APIs, and wait for conditions to be met. For Windows in-place upgrades, AWS provides a pre-built SSM document: AWS-UpdateWindowsServerInstance. This document encapsulates the logic required to perform an in-place upgrade using an installation media image stored in an S3 bucket or from a public source.
The customer’s cloud operations team customized this document to suit their environment. They created a parameterized Automation runbook that:
- Queried the target EC2 instances via Systems Manager Inventory to confirm the existing OS version and installed applications.
- Submitted an API call to take an Amazon EBS snapshot of the root volume and any critical data volumes, providing a quick rollback mechanism without a full AMI.
- Attached a second, temporary EBS volume containing the Windows Server 2025 installation ISO (converted to a VHD for faster access) to each instance.
- Invoked the Windows Setup executable (setup.exe) with the appropriate command-line arguments to perform a silent, unattended upgrade that preserved data and settings.
- Monitored the upgrade progress by polling the instance’s status and waiting for a successful reboot.
- Performed post-upgrade validation by running a PowerShell script that checked OS version, service health, and application connectivity.
- Removed the temporary volume and cleaned up snapshots after a configurable retention period.
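As a rough illustration of the silent setup step in the list above, the following Python sketch assembles an argument list for Windows Setup. The flags are the ones this article reports for the upgrade step. The function name `build_setup_command` and the `media_root` parameter are hypothetical, and the customer's actual runbook is not shown in the source.

```python
import ntpath
from typing import List

# Flags as reported in this article for the unattended upgrade.
SETUP_FLAGS = [
    "/auto", "upgrade",           # keep files, settings, and applications
    "/quiet",                     # suppress the setup user interface
    "/noreboot",                  # leave the restart to the orchestrator
    "/compat", "ignorewarning",   # continue past dismissible compatibility warnings
    "/dynamicupdate", "disable",  # do not download updates during setup
]


def build_setup_command(media_root: str) -> List[str]:
    """Return the argument list for setup.exe located under media_root.

    media_root is a hypothetical parameter: the drive or folder where the
    installation media is mounted, for example "E:\\".
    """
    # ntpath builds Windows-style paths regardless of the host platform.
    return [ntpath.join(media_root, "setup.exe"), *SETUP_FLAGS]
```

Keeping the command as a list rather than a single string avoids quoting problems if the media path ever contains spaces.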
The Automation ran in batches of 50–100 instances, aligned with maintenance windows. Systems Manager Maintenance Windows defined the allowed upgrade windows—typically Saturday evenings—and the Automation execution was gated by an AWS Lambda function that verified that the instance was in a healthy state before proceeding. Tags on the EC2 instances controlled which batch they belonged to and when they would be upgraded.
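The tag-driven batching described above could be planned along these lines. This is a minimal sketch, not the customer's implementation: the tag key `UpgradeBatch`, the dictionary shape of each instance record, and the function name `plan_batches` are all assumptions made for illustration. Only the upper batch size of 100 comes from the article.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

MAX_BATCH_SIZE = 100  # upper bound on batch size reported in the article


def plan_batches(instances: Iterable[dict],
                 max_size: int = MAX_BATCH_SIZE) -> Dict[str, List[List[str]]]:
    """Group instance IDs by batch tag, then split each group into chunks.

    Each instance is a dict with an "id" and a "tags" mapping. Instances
    without the (hypothetical) "UpgradeBatch" tag are left unscheduled.
    """
    by_tag = defaultdict(list)
    for inst in instances:
        tag = inst.get("tags", {}).get("UpgradeBatch")
        if tag is None:
            continue  # untagged instances are never upgraded implicitly
        by_tag[tag].append(inst["id"])

    plan = {}
    for tag in sorted(by_tag):
        ids = sorted(by_tag[tag])  # stable order makes reruns predictable
        plan[tag] = [ids[i:i + max_size] for i in range(0, len(ids), max_size)]
    return plan
```

Skipping untagged instances by default means a server joins a wave only when someone has explicitly assigned it to one.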
The Upgrade Process in Detail
Each instance upgrade followed a structured, idempotent workflow that could be retried if it failed. The high-level steps were:
- Preflight Checks: The Automation verified that the instance had at least 20 GB of free space on the root volume, that it was running a supported edition (Standard or Datacenter) of Windows Server 2016, and that the Systems Manager agent was active and up to date. It also checked for incompatible software that was known to block the upgrade, such as certain antivirus drivers. A custom script pulled a list of blocked applications from a DynamoDB table maintained by the security team.
- Snapshot and Backup: A crash-consistent snapshot of all EBS volumes attached to the instance was created. Additionally, for instances running SQL Server, a VSS-aware snapshot was taken using a pre-step that invoked the Windows Volume Shadow Copy Service. The snapshot ID was stored as a tag on the Automation execution for easy recovery.
- ISO Injection: A temporary EBS volume was created from a gold-image snapshot that contained the Windows Server 2025 installation files. This volume was attached to the instance as drive letter Z:. The Automation used PowerShell to mount the ISO from the volume, making the installation media available locally.
- Silent Upgrade Execution: A command-line invocation of setup.exe was fired with parameters including /auto upgrade /quiet /noreboot /compat ignorewarning /dynamicupdate disable. The /auto upgrade flag told Windows Setup to keep all files, settings, and applications. The process typically took 30–60 minutes, during which the instance remained running but inaccessible for remote management.
- Post-Upgrade Validation: After the mandatory reboot, the Automation waited for the SSM agent to come back online, then triggered a validation script. The script confirmed the OS build number matched Windows Server 2025, checked that all previously registered services had a status of ‘Running’, and executed a simple network test to a known internal endpoint. If any check failed, the Automation automatically rolled back by stopping the instance, replacing the root volume with the pre-upgrade snapshot, and restarting, a fully automated fallback that required no human intervention.
- Cleanup: Once validation passed and a 48-hour observation period elapsed, the Automation deleted the temporary volume and the pre-upgrade snapshot, unless the snapshot was tagged for longer retention.
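The control flow of the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the customer's runbook: the `Instance` record, the callables passed to `run_upgrade`, and the outcome strings are hypothetical, and the real work (snapshots, setup, volume swaps) would be done by Systems Manager and EC2 API calls that are stubbed out here. The 20 GB threshold and the supported editions are taken from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MIN_FREE_GB = 20                                 # threshold stated in the article
SUPPORTED_EDITIONS = {"Standard", "Datacenter"}  # editions stated in the article


@dataclass
class Instance:
    """Hypothetical in-memory view of one upgrade target."""
    instance_id: str
    free_gb: float
    edition: str
    agent_online: bool
    events: List[str] = field(default_factory=list)


def preflight_problems(inst: Instance) -> List[str]:
    """Return the reasons an instance is not ready; empty means ready."""
    problems = []
    if inst.free_gb < MIN_FREE_GB:
        problems.append(f"only {inst.free_gb} GB free, need {MIN_FREE_GB}")
    if inst.edition not in SUPPORTED_EDITIONS:
        problems.append(f"unsupported edition {inst.edition}")
    if not inst.agent_online:
        problems.append("SSM agent offline")
    return problems


Step = Callable[[Instance], None]


def run_upgrade(inst: Instance, snapshot: Step, upgrade: Step,
                validate: Callable[[Instance], bool], rollback: Step) -> str:
    """Run one instance through the workflow and return its outcome."""
    problems = preflight_problems(inst)
    if problems:
        inst.events.append("preflight failed: " + "; ".join(problems))
        return "skipped"          # nothing was changed, so nothing to undo
    snapshot(inst)                # always taken before any change is made
    upgrade(inst)
    if validate(inst):
        inst.events.append("validated")
        return "upgraded"
    rollback(inst)                # restore the root volume from the snapshot
    inst.events.append("rolled back")
    return "rolled_back"
```

Because a failed preflight returns before the snapshot step, retrying a skipped instance after remediation starts from a clean state, which is one way to get the retry-friendly behavior the article describes.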
Real-World Challenges and How They Were Solved
No upgrade of this scale goes perfectly. The financial firm encountered several hurdles that shaped their final approach.
Incompatible Drivers and Agents. Many instances ran third-party monitoring agents, backup clients, and hardware-level security drivers that had no Windows Server 2025‑compatible versions available early in the project. Before the Automation was scaled, the team ran a discovery phase using Systems Manager Inventory to collect the installed driver and application list across the fleet. Incompatible components were identified and either updated through the agent’s native deployment channels or temporarily removed before upgrade via a pre‑script. A “blocking” DynamoDB table allowed the team to rapidly add new incompatible software without modifying the runbook.
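The blocklist lookup described above amounts to a set comparison between what is installed and what is known to block the upgrade. The sketch below assumes both lists have already been fetched (from Systems Manager Inventory and the DynamoDB table respectively) and are plain lists of names. The function name and the product names in the usage example are made up.

```python
from typing import Iterable, List


def find_blockers(installed: Iterable[str], blocklist: Iterable[str]) -> List[str]:
    """Return installed component names that appear on the blocklist.

    Matching is case-insensitive on the exact name. Version-aware matching
    is deliberately left out of this sketch.
    """
    blocked = {name.casefold() for name in blocklist}
    return sorted(name for name in installed if name.casefold() in blocked)


if __name__ == "__main__":
    # Hypothetical component names, for illustration only.
    installed = ["Contoso Backup Client", "ExampleAV Filter Driver", "7-Zip"]
    blocklist = ["exampleav filter driver", "Legacy Monitor Agent"]
    print(find_blockers(installed, blocklist))
```

Because the blocklist is data rather than code, adding a newly discovered incompatible component changes the input to this check and leaves the runbook itself untouched.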
SQL Server Considerations. The customer ran dozens of SQL Server 2016 and 2019 instances. While Windows Server 2025 supports these database versions, the upgrade process can break SQL Server if the service is not stopped cleanly or if the system databases become inconsistent. The Automation included a dedicated SQL Server pre-check: it stopped the SQL Server services in the correct dependency order, detached user databases when necessary, and performed a VSS snapshot. Post-upgrade, a T-SQL script verified the SQL Server instance started and that key databases were online.
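Stopping services "in the correct dependency order" means stopping dependents before the services they rely on. One way to derive that order is a topological sort, sketched here with Python's standard `graphlib` module (Python 3.9 or later). The dependency map passed in is an assumption supplied by the caller. The article does not say how the customer's pre-check computed the order.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def stop_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return service names in a safe stop order.

    depends_on maps each service to the services it needs in order to run.
    A valid start order lists dependencies first, so reversing it stops
    dependents before the services they rely on.
    """
    start_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(start_order))


if __name__ == "__main__":
    # On a default SQL Server instance, the Agent service depends on the
    # database engine, so the Agent must be stopped first.
    services = {
        "SQLSERVERAGENT": {"MSSQLSERVER"},
        "MSSQLSERVER": set(),
    }
    print(stop_order(services))
```

`TopologicalSorter` raises `CycleError` on circular dependencies, which is a useful failure mode here: a cycle means the dependency map is wrong and the pre-check should stop rather than guess.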
Domain Controller Isolation. A small number of instances were Active Directory domain controllers. Upgrading domain controllers in-place is supported only under strict conditions, and the team initially attempted to include them in the Automation. After encountering schema versioning issues, they carved out domain controllers into a separate, manually executed procedure using the AWS-UpdateWindowsServerInstance document with additional manual validation steps.
Networking and Security Groups. Because the in-place upgrade retains the instance’s private IP address, security group memberships, and network interface, application connections remained unbroken. However, the temporary volume attachment required that the instance’s security group allow traffic to the Systems Manager endpoints and the instance’s own metadata service. The team updated their CloudFormation templates and Service Catalog product to include the necessary VPC endpoints (for Systems Manager, EC2 Messages, and S3) and outbound rules before the upgrades began.
Why This Matters for Windows on AWS
In-place OS upgrades on EC2 have historically been viewed with skepticism. Unlike deploying a new AMI, which guarantees a clean, immutable artifact, an in-place upgrade carries the risk of configuration drift, leftover registry entries, and subtle compatibility issues. However, the financial firm’s success demonstrates that with rigorous pre-flight validation, automated rollback, and a disciplined batch approach, in-place upgrades can be a safe and highly efficient path for large fleets.
Amazon itself has been improving the tooling. The AWS-UpdateWindowsServerInstance document now supports Windows Server 2025 as a target, and as of the latest updates, it can pull the installation media directly from an S3 bucket or from a publicly available source without requiring a custom AMI. Additionally, Systems Manager Fleet Manager now provides a centralized dashboard to view the upgrade status of all managed instances, making it easier to track progress at scale.
The move also highlights the growing maturity of infrastructure-as-code practices in traditional enterprise IT. The entire upgrade workflow was defined in a Systems Manager document stored in a Git repository and deployed via a CI/CD pipeline. Parameters such as batch size, maintenance windows, and snapshot retention were injected from AWS Systems Manager Parameter Store, allowing the same runbook to be used across development, test, and production environments with different configurations.
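To make the per-environment parameter idea concrete, here is a small sketch of how hierarchical parameter names could resolve into one configuration per environment. Everything specific in it is assumed: the `/os-upgrade/<environment>/<key>` naming scheme, the default values, and the use of a plain dictionary in place of real Parameter Store calls.

```python
from typing import Dict, Mapping

# Hypothetical fallback values used when an environment sets no override.
DEFAULTS = {"batch_size": "50", "snapshot_retention_days": "14"}


def resolve_parameters(store: Mapping[str, str], environment: str,
                       prefix: str = "/os-upgrade") -> Dict[str, str]:
    """Merge defaults with the overrides stored under one environment path.

    store stands in for Parameter Store: a mapping from full parameter
    names such as "/os-upgrade/prod/batch_size" to string values.
    """
    base = f"{prefix}/{environment}/"   # trailing slash keeps "prod" from
    resolved = dict(DEFAULTS)           # matching "production"
    for name, value in store.items():
        if name.startswith(base):
            resolved[name[len(base):]] = value
    return resolved
```

With this shape, promoting the runbook from test to production changes only the `environment` argument, while the runbook text stays identical across environments.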
Performance and Cost Implications
One common concern with in-place upgrades is that the resulting instance may carry forward fragmentation, legacy drivers, and other performance-sapping artifacts. The customer’s performance engineering team compared pre- and post-upgrade metrics for CPU utilization, memory consumption, disk I/O, and network throughput across a representative sample of workloads. The results showed that the upgraded instances performed at parity or slightly better than their 2016 predecessors, largely because Windows Server 2025 includes kernel optimizations and an updated TCP/IP stack. Disk fragmentation, often cited as a drawback of in-place upgrades, was mitigated by the fact that EBS volumes are already thin-provisioned and do not suffer from the same rotational latency issues as physical drives.
From a cost perspective, avoiding a full redeployment saved the customer hundreds of hours of application configuration and testing. The EBS snapshots added temporary storage costs, but the team implemented a snapshot lifecycle policy that deleted rollback snapshots after 14 days, keeping the overall storage bill minimal. By using Spot Instances for the Automation fleet and scheduling upgrades during off-peak hours, they further reduced compute costs.
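The 14-day retention rule can be expressed as a simple filter over snapshot metadata. The sketch below is illustrative only: the record shape, the `Retain` tag used to exempt a snapshot, and the function name are assumptions, and in practice a managed lifecycle policy would do this work, as the article describes.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

RETENTION = timedelta(days=14)  # retention period stated in the article


def expired_snapshots(snapshots: Iterable[dict], now: datetime,
                      retention: timedelta = RETENTION) -> List[str]:
    """Return IDs of rollback snapshots that are old enough to delete.

    Each snapshot is a dict with "id", a timezone-aware "created" datetime,
    and optional "tags". A hypothetical "Retain" tag set to "true" exempts
    a snapshot from deletion.
    """
    expired = []
    for snap in snapshots:
        if snap.get("tags", {}).get("Retain") == "true":
            continue  # explicitly kept for longer retention
        if now - snap["created"] >= retention:
            expired.append(snap["id"])
    return expired


if __name__ == "__main__":
    now = datetime(2025, 6, 30, tzinfo=timezone.utc)
    sample = [{"id": "snap-old", "created": now - timedelta(days=20)}]
    print(expired_snapshots(sample, now))
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the rule easy to test against fixed dates.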
Lessons Learned and Best Practices
For enterprises considering a similar journey, several best practices emerged:
- Inventory Everything First: Use Systems Manager Inventory to build a complete catalog of installed applications, drivers, and services. Identify incompatible components early and create a remediation plan.
- Embrace Automated Rollback: The ability to automatically revert a failed upgrade by swapping the root volume with a pre-upgrade snapshot turned major incidents into minor blips. Ensure that snapshots are created before any change and that the rollback step is thoroughly tested.
- Batch and Monitor: Start with a small, non-critical batch and gradually increase size. Use Amazon CloudWatch dashboards to track Automation success rates, failure reasons, and timing. Set up SNS alerts for any failure that cannot be automatically rolled back.
- Treat Domain Controllers Differently: In-place upgrades of domain controllers are possible but riskier. If possible, demote and promote new ones on Windows Server 2025 instead.
- Test Application Compatibility: Even if the OS upgrade succeeds, applications may behave differently due to changes in .NET versions, TLS defaults, or security policies. Dedicate ample time to regression testing.
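For the monitoring practice in the list above, the figures worth tracking per batch are the success rate, the breakdown of failure reasons, and the instances that need a human. A minimal sketch follows. The result record shape and the outcome labels (`upgraded`, `rolled_back`, `failed`) are assumptions, and publishing the numbers to CloudWatch or SNS is left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize(results: Iterable[dict]) -> Tuple[float, Counter, List[str]]:
    """Return (success_rate, failure_reasons, instances_needing_alert).

    Each result is a dict with "id", "outcome" ("upgraded", "rolled_back",
    or "failed"), and an optional "reason". Only "failed" outcomes, meaning
    the automatic rollback did not complete, are flagged for alerting.
    """
    results = list(results)
    if not results:
        return 0.0, Counter(), []
    upgraded = sum(1 for r in results if r["outcome"] == "upgraded")
    reasons = Counter(
        r.get("reason", "unknown") for r in results if r["outcome"] != "upgraded"
    )
    alerts = [r["id"] for r in results if r["outcome"] == "failed"]
    return upgraded / len(results), reasons, alerts
```

Separating `rolled_back` from `failed` mirrors the alerting advice above: a clean rollback is a data point for the dashboard, while a failure that could not be rolled back warrants a page.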
The Road Ahead
With Windows Server 2016 entering its final years, many AWS customers will face the same decision. The tooling and documented patterns now exist to make in-place migration an attractive alternative to the traditional rip-and-replace model. The financial firm’s successful upgrade of 2,000+ EC2 instances is more than a case study; it is proof that automation and careful planning can turn a very large, very risky project into a predictable, repeatable engineering exercise.
AWS continues to invest in the Windows migration experience. The recent addition of Windows Server 2025 to the Automation document’s support matrix, combined with the ability to use VPC endpoints for private connectivity, means that even isolated, air-gapped environments can leverage the same automated workflows. As the end of extended support for Windows Server 2016 approaches, the window for planning is closing. For those still running 2016, the message is clear: the tools are ready, and the runway is shortening.