Industry: Internet Infrastructure / Domain Registry
Goal: Exit data centers and modernize critical registry infrastructure with a globally available, multi-region AWS platform designed for resilience, scalability, and zero-disruption migration.
OpsGuru Services: Cloud Architecture Redesign, Re-platforming, High Availability & DR Strategy, Migration Execution, DevOps Enablement, Automation & Observability
Tooling: Amazon EKS, Amazon EC2 (PostgreSQL, RabbitMQ), Karpenter, Amazon EFS (CRR), Amazon S3, AWS Global Accelerator, ALB/NLB, Transit Gateway, VPC, AWS Control Tower, AWS Backup, Amazon CloudWatch, Terraform, Terragrunt, Helm, GitHub Actions, tfsec, checkov
Seamless migration and data center exit ahead of schedule
Multi-region active/passive architecture with validated RTO of 5–10 minutes
Improved operational efficiency with IaC, CI/CD, and automated scaling
Cost-optimized DR strategy with minimal passive footprint and rapid failover
Strengthened the resilience of critical global internet infrastructure
Future-ready architecture with native IPv6 and TLD expansion capability
Tucows Registry Services provides modern, scalable back-end infrastructure for gTLD, ccTLD, and DotBrand operators. The platform underpins critical domain operations for registries around the world, delivering the stability, compliance, and performance required to keep top-level domains running reliably at global scale.
Tucows entered the registry infrastructure space in 2021 through its acquisition of UNR Corp’s registry services platform, which it subsequently expanded, modernized, and relaunched as Tucows Registry Services. Since that transition, the business has grown rapidly—from roughly 500,000 domains to a projected 6 million by the end of 2025—as national registries and large portfolio operators have adopted Tucows as their trusted back-end provider.
In addition, Radix—one of the world’s largest portfolio TLD operators—has selected Tucows Registry Services as its future back-end provider. The migration of Radix’s 11 TLDs, representing millions of additional domains, is planned for 2026, underscoring the significant expansion expected in the coming years.
This rapid growth highlighted a clear technical inflection point: the inherited on-premises architecture, while proven, was not designed for the level of resiliency, scalability, and global availability now required.
To meet the expectations of operators who depend on Tucows for mission-critical infrastructure, the company needed to modernize its core systems. This meant exiting legacy data centers and moving to a cloud-based, multi-region AWS architecture engineered for high availability, operational transparency, and long-term scalability.
The Tucows Registry Service is classified as critical infrastructure. Its availability directly affects the ability to register and manage domains worldwide. Any disruption could have a cascading effect, which made this migration especially sensitive.
The company’s requirements were clear but demanding. Tucows Registry needed to:
Fully exit its data centers
Modernize the platform to a microservices architecture on AWS
Establish a multi-region active/passive deployment with reliable failover
All of this had to be achieved with no downtime for customers or registrars.
Several technical factors increased the complexity of the project. Tucows Registry uses PostgreSQL as its primary database and RabbitMQ Streams as its messaging backbone. Replicating state across regions while preserving consistency required careful design. Amazon MQ, though initially considered, did not support RabbitMQ Streams, which meant the team had to build a self-managed messaging solution. On the networking side, the system needed global reach, IPv4 and IPv6 support, and a failover mechanism that did not depend on DNS propagation.
Finally, the disaster recovery strategy needed to balance resilience and cost efficiency. Tucows Registry required a design that allowed its secondary region to scale quickly during failover events without maintaining a fully provisioned environment at all times. This added another layer of architectural precision to the project.
To meet these goals, OpsGuru designed and delivered a multi-region active/passive platform on AWS, built for operational reliability and long-term scalability.
The new foundation was built around Amazon EKS for container orchestration. The primary region hosts production workloads, while the secondary region remains in a ready state with minimal active nodes.
This approach kept only a small amount of infrastructure running in the secondary region and used Karpenter to automatically scale it up on demand during failover. Doing so allowed Tucows to minimize standby costs while still meeting a 5–10 minute recovery time objective.
PostgreSQL was deployed on EC2 instances with synchronous replication in the primary region and asynchronous replication to the secondary. An intermediate “escrow” replica was added to improve durability and safeguard data without introducing latency. Amazon EFS with cross-region replication was used for shared storage, simplifying state management between regions.
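The replication layout described above can be modeled in a few lines. The following Python sketch is illustrative only: the replica names, the region labels, and the treatment of the escrow replica as an asynchronous standby are assumptions, not details taken from the deployment. It renders a value for PostgreSQL's `synchronous_standby_names` setting that waits only on in-region standbys, so that cross-region replicas never add commit latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str          # application_name of the standby (hypothetical)
    region: str        # "primary" or "secondary" (labels, not AWS region names)
    synchronous: bool  # True if commits wait for this standby


# Assumed topology: one synchronous standby in the primary region,
# an asynchronous escrow replica, and an asynchronous cross-region replica.
TOPOLOGY = [
    Replica("pg_standby_local", "primary", True),
    Replica("pg_escrow", "primary", False),
    Replica("pg_standby_remote", "secondary", False),
]


def synchronous_standby_names(replicas, primary_region="primary", num_sync=1):
    """Render a synchronous_standby_names value from the topology."""
    for r in replicas:
        if r.synchronous and r.region != primary_region:
            raise ValueError(f"{r.name}: cross-region replica must be asynchronous")
    sync = [r.name for r in replicas if r.synchronous]
    if not sync:
        return ""  # an empty value disables synchronous replication
    if not 1 <= num_sync <= len(sync):
        raise ValueError("num_sync must be between 1 and the number of sync standbys")
    return f"FIRST {num_sync} ({', '.join(sync)})"


print(synchronous_standby_names(TOPOLOGY))
```

Keeping the synchronous set inside one region is what lets the cross-region copy lag slightly without slowing down writes on the primary.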
For networking and traffic management, OpsGuru implemented AWS Global Accelerator as the global entry point. This provided static IPv4 and IPv6 addresses and allowed traffic to shift between regions without DNS delays.
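One way to picture a DNS-free regional shift is through the traffic dial that Global Accelerator exposes on each endpoint group. The sketch below is a hedged illustration rather than the project's actual tooling: the region labels and ARNs are placeholders, and the function `failover_plan` is hypothetical. It only builds the ordered parameter sets and makes no AWS calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointGroup:
    region: str  # label for the region (hypothetical)
    arn: str     # placeholder, not a real ARN


# Assumed layout: one endpoint group per region.
GROUPS = [
    EndpointGroup("primary", "arn:placeholder:endpoint-group/primary"),
    EndpointGroup("secondary", "arn:placeholder:endpoint-group/secondary"),
]


def failover_plan(groups, target_region):
    """Build ordered traffic-dial updates that send all traffic to target_region.

    The target group is raised to 100 first, then the others are lowered to 0,
    so there is no moment at which every group is dialed down.
    """
    if target_region not in {g.region for g in groups}:
        raise ValueError(f"unknown region: {target_region}")
    # False sorts before True, so the target group comes first.
    ordered = sorted(groups, key=lambda g: g.region != target_region)
    return [
        {
            "EndpointGroupArn": g.arn,
            "TrafficDialPercentage": 100.0 if g.region == target_region else 0.0,
        }
        for g in ordered
    ]


for step in failover_plan(GROUPS, "secondary"):
    print(step)
```

Each dictionary matches the keyword arguments of the `update_endpoint_group` operation on the boto3 Global Accelerator client, so an operator script could apply the steps in order. Because the accelerator's static addresses do not change, clients keep connecting to the same IPs throughout.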
Traffic routing was designed so each registrar’s traffic could be directed precisely where needed, and a hub-and-spoke network model using Transit Gateway and VPC peering ensured fast, secure connectivity between regions.
The solution used a manual, controlled failover approach, allowing Tucows Registry to test and execute failovers in a predictable, repeatable way.
To address messaging requirements, OpsGuru deployed self-managed RabbitMQ clusters on EC2. Instead of replicating all message traffic between regions, which would have been costly and unnecessary, the team used RabbitMQ’s Shovel capability to replicate only the streams required for business continuity. This selective approach reduced network overhead while ensuring essential traffic would continue uninterrupted during a failover.
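The selective replication idea can be sketched as a small generator of dynamic shovel definitions. This is an assumption-laden illustration: the stream names, the broker URIs, the `dr-` naming prefix, and the `shovel_definitions` helper are all hypothetical, and stream-specific consumer settings are left out. Only the definition keys (`src-uri`, `src-queue`, `dest-uri`, `dest-queue`, and the protocol fields) are standard dynamic shovel parameters.

```python
import json

# Hypothetical stream inventory: True means the stream must survive a failover.
STREAMS = {
    "registry.events": True,
    "billing.events": True,
    "debug.traces": False,
}


def shovel_definitions(streams, src_uri, dest_uri, prefix="dr-"):
    """Return {shovel_name: definition} for the streams flagged as required."""
    definitions = {}
    for name, required in streams.items():
        if not required:
            continue  # non-essential streams stay local to the primary region
        definitions[prefix + name] = {
            "src-protocol": "amqp091",
            "src-uri": src_uri,
            "src-queue": name,
            "dest-protocol": "amqp091",
            "dest-uri": dest_uri,
            "dest-queue": name,
        }
    return definitions


DEFINITIONS = shovel_definitions(
    STREAMS,
    src_uri="amqp://rabbit.primary.example",
    dest_uri="amqp://rabbit.secondary.example",
)
print(json.dumps(DEFINITIONS, indent=2))
```

Each generated definition could then be applied as a dynamic shovel parameter, for example with `rabbitmqctl set_parameter shovel <name> '<json>'`. Driving the list from an explicit inventory keeps the decision about what crosses regions visible and reviewable.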
The entire infrastructure was built and managed with Terraform and Terragrunt, and applications were deployed via Helm charts. A GitHub Actions CI/CD pipeline automated testing and deployments across development, QA, staging, and production. Security scans (tfsec, checkov) and automated validation steps were integrated into the pipeline to enforce consistency and compliance. Monitoring and observability were provided by CloudWatch and Zabbix, with additional database-level metrics collected through Munin and Prometheus exporters and visualized in Grafana.
OpsGuru also delivered knowledge transfer, runbooks, and playbooks to ensure the Tucows Registry team could autonomously manage and scale the platform after migration.
“The migration to AWS helped support our rapid growth at Tucows Registry, providing us the scalability and reliability we needed while ensuring a smooth transition from our data centers.”
— Francisco Obispo, Chief Technology Officer, Tucows Domains
The migration allowed Tucows Registry to complete its data center exit ahead of schedule, with no service disruption to customers. By moving to a cloud-based, multi-region active/passive architecture, Tucows Registry established a foundation capable of supporting its long-term growth objectives and strengthening the resilience of its global registry platform.
The new architecture was validated through controlled failover testing, demonstrating an RTO of 5–10 minutes and alignment with Tucows Registry’s SLA commitments.
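A failover drill can be scored against the objective with very little code. The sketch below shows one possible way to do that; the event names, the `measured_rto` and `meets_objective` helpers, and the timestamps are made up for illustration and are not results from the actual tests. The upper bound of the stated 5–10 minute range is used as the objective.

```python
from datetime import datetime, timedelta

# Upper bound of the stated 5-10 minute recovery time objective.
RTO_OBJECTIVE = timedelta(minutes=10)

# Illustrative drill log with invented timestamps (not measured data).
DRILL = {
    "failover_initiated": datetime(2025, 1, 1, 14, 0, 0),
    "service_restored": datetime(2025, 1, 1, 14, 7, 30),
}


def measured_rto(events):
    """Return the elapsed time between failover start and service restoration."""
    start = events["failover_initiated"]
    end = events["service_restored"]
    if end < start:
        raise ValueError("service_restored precedes failover_initiated")
    return end - start


def meets_objective(events, objective=RTO_OBJECTIVE):
    """True if the drill recovered within the objective."""
    return measured_rto(events) <= objective


print(measured_rto(DRILL), meets_objective(DRILL))
```

Recording the same two events on every drill makes results comparable over time, which is useful when failover is manual and repeatability is the goal.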
The failover mechanism using AWS Global Accelerator gives Tucows Registry predictable, low-latency routing worldwide, while selective synchronization and minimal standby capacity help keep operational costs in check. The platform is designed to scale rapidly when required, without maintaining unnecessary idle infrastructure, and supports both IPv4 and IPv6 to stay ahead of evolving internet standards.
Operationally, the introduction of IaC and CI/CD pipelines has standardized deployments, improved release quality, and reduced manual operational effort. Automated security scanning (tfsec, checkov), IAM controls, and integration with Secrets Manager have strengthened the platform’s security posture. The new environment also provides Tucows with clearer visibility into system performance and more reliable, cost-optimized disaster recovery.
Beyond the technical outcomes, this modernization positions Tucows Registry to handle future increases in domain volume, onboard new top-level domains (TLDs) more easily, and continue providing stable, global internet infrastructure for years to come.