On 12 January, Atlas experienced intermittent sync failures starting at approximately 4:10 PM. The issue was triggered by a cloud networking event that failed the production database over from SQL 5 to SQL 6. Some services did not reconnect automatically, causing sync errors for a subset of customers.
Recovery began around 4:29 PM, and full stability was confirmed before the final status page update at 5:01 PM.
This incident was unrelated to the 13 January incident (which stemmed from a database reindex deadlock).
Root cause:
The incident was caused by a cloud networking issue that triggered a database failover (SQL 5 to SQL 6). Not all sync services reconnected cleanly after the failover, resulting in intermittent sync errors until failback and service restarts were completed.
Contributing factors:
· Automatic reconnection for some services did not complete successfully
· Failover timing meant customers saw impact before services recovered
· Sync services distribute load across multiple servers, contributing to the intermittent pattern of failures
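For illustration only, the short sketch below shows how spreading load across several sync servers, where only some held stale database connections after the failover, produces exactly this intermittent pattern. The server names, states, and routing logic are assumptions made for the example, not Atlas internals.

```python
import random

# Hypothetical illustration: after the failover, only some sync servers
# still held stale connections to the old primary, so a given customer's
# requests failed only some of the time.
SYNC_SERVERS = {
    "sync-1": "healthy",  # reconnected to SQL 6 automatically
    "sync-2": "stale",    # still pointed at SQL 5 post-failover
    "sync-3": "healthy",
    "sync-4": "stale",
}

def handle_sync_request(request_id: int) -> str:
    # Load is spread across servers, so each request may land on a
    # healthy or a stale node independently of the previous request.
    server = random.choice(list(SYNC_SERVERS))
    if SYNC_SERVERS[server] == "stale":
        return f"request {request_id} via {server}: sync error"
    return f"request {request_id} via {server}: ok"

if __name__ == "__main__":
    for i in range(8):
        print(handle_sync_request(i))
```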
Service was restored by:
· Failing the database back from SQL 6 to SQL 5
· Restarting the sync services that had not reconnected automatically
During the incident:
· Incident started at approximately 4:10 PM, 12 January
· Recovery began around 4:29 PM
· Final status page update / closure: 5:01 PM
· Impact pattern: intermittent; affected some, but not all, Atlas customers, distributed effectively at random by sync server routing
Completed/Immediate:
· Post-incident verification of sync health across services
Planned/Ongoing:
· Tuning connection handling to improve resilience during infrastructure events (a sketch of the general approach follows this list)
· Scheduled replacement of the current SQL server technology over the coming months, which will provide improved stability during failovers
· Continued review of at-risk customer configurations and impact patterns
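As a rough sketch of the connection-handling direction mentioned above, the example below retries a database connection with exponential backoff and jitter, so that a service rides through a failover window instead of giving up after its first failure. The function names, hosts, and parameters are illustrative assumptions, not Atlas code.

```python
import random
import time

# A minimal sketch of reconnect-with-backoff handling; names and
# parameters here are illustrative assumptions, not Atlas code.
MAX_ATTEMPTS = 6
BASE_DELAY_SECONDS = 0.5

def connect_to_database(host: str) -> str:
    # Placeholder for a real driver call; fails randomly to simulate
    # the unstable window during a failover.
    if random.random() < 0.6:
        raise ConnectionError(f"could not reach {host}")
    return f"connected to {host}"

def connect_with_backoff(host: str) -> str:
    # Retry with exponential backoff plus jitter, so a service keeps
    # trying through a failover instead of stopping at the first error.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return connect_to_database(host)
        except ConnectionError as exc:
            delay = BASE_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
    raise ConnectionError(f"gave up on {host} after {MAX_ATTEMPTS} attempts")

if __name__ == "__main__":
    print(connect_with_backoff("sql-6.internal.example"))
```

Capping the retries while adding jitter avoids both giving up too early and having every service hammer the database at the same instant it returns.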
We apologise for the disruption this caused and appreciate your patience as we restored normal operations.