Atlas Database Issues

Incident Report for Person Centred Software

Postmortem

Atlas Database Failover Incident on 12th January 2026

Summary

On 12th January, Atlas experienced intermittent sync failures starting at approximately 4:10 PM. The issue was triggered by a cloud networking event that failed over the production database from SQL 5 to SQL 6. Some services did not reconnect automatically, resulting in intermittent sync issues for a subset of customers.

Recovery began around 4:29 PM, and full stability was confirmed before the final status page update at 5:01 PM.

This incident was unrelated to the 13th of January incident (which stemmed from a database reindex deadlock).

Root Cause

The incident was caused by a cloud networking issue that triggered a database failover (SQL 5 to SQL 6). Not all sync services reconnected cleanly post‑failover, resulting in intermittent sync errors until failback and service restarts were completed.

Contributing factors:

·         Automatic reconnection for some services did not complete successfully

·         Failover timing meant customers saw impact before services recovered

·         Sync services distribute load across multiple servers, contributing to the intermittent pattern of failures
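The reconnection logic that failed is internal to Atlas and is not described here; purely as an illustrative sketch (not the actual implementation), a service that must survive a database failover would typically retry its connection with exponential backoff and jitter rather than failing after a single attempt. The `connect` callable below is a hypothetical stand-in for a real database driver call.

```python
import random
import time

def connect_with_backoff(connect, max_attempts=6, base_delay=1.0, max_delay=30.0):
    """Retry a connection attempt with exponential backoff and jitter.

    `connect` is any zero-argument callable that returns a live connection
    or raises ConnectionError on failure (hypothetical stand-in for a
    real driver call).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter avoids a thundering-herd
            # reconnect storm when many services fail over at once.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Example: a flaky connector that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("primary unavailable")
    return "connection"

print(connect_with_backoff(flaky, base_delay=0.01))  # → connection
```

The jitter is what matters during a fleet-wide failover: without it, every service retries on the same schedule and can overwhelm the new primary the moment it comes up.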

Timeline of Events

A detailed timeline is captured in the status page updates reproduced at the end of this report.
Resolution

Service was restored by:

·         Returning the database to its primary node

·         Restarting affected sync services

·         Validating service health through enhanced monitoring
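The final recovery step, validating service health, generally means polling each service until it reports healthy for long enough to trust the result. As a minimal sketch under that assumption (the `check` callable is hypothetical, not Atlas's actual monitoring probe):

```python
import time

def wait_until_healthy(check, timeout=60.0, interval=2.0):
    """Poll a health check until it reports healthy or the timeout expires.

    `check` is a zero-argument callable returning True when the service is
    healthy (hypothetical stand-in for a real monitoring probe). Returns
    True on recovery, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Example: a service that becomes healthy on its third check.
state = {"checks": 0}
def probe():
    state["checks"] += 1
    return state["checks"] >= 3

print(wait_until_healthy(probe, timeout=5.0, interval=0.01))  # → True
```

In practice such a loop would run per service, with the timeout sized to the slowest expected restart.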

Customer Communication

During the incident:

·         Incident started at approximately 4:10 PM, 12 January

·         Final status page update / closure: 5:01 PM

·         Impact pattern: intermittent; some, but not all, Atlas customers were affected, in an effectively random distribution caused by sync server routing

Preventative Measures and Next Steps

Completed/Immediate

·         Post-incident verification of sync health across services

Planned/Ongoing

·         Tuning connection handling to improve resiliency during infrastructure events

·         Scheduled replacement of the current SQL server technology over the coming months, providing improved stability during failovers

·         Continued review of at‑risk customer configurations and impact patterns.

We apologise for the disruption this caused and appreciate your patience as we restored normal operations.

Posted Jan 28, 2026 - 16:14 GMT

Resolved

This incident has been resolved.
Posted Jan 12, 2026 - 17:01 GMT

Update

We are continuing to monitor for any further issues.
Posted Jan 12, 2026 - 16:48 GMT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 12, 2026 - 16:47 GMT

Update

We are continuing to work on a fix for this issue.
Posted Jan 12, 2026 - 16:46 GMT

Identified

The issue has been identified and a fix is being implemented.
Posted Jan 12, 2026 - 16:21 GMT

Update

We are continuing to investigate this issue.
Posted Jan 12, 2026 - 16:20 GMT

Update

We are continuing to investigate this issue.
Posted Jan 12, 2026 - 16:15 GMT

Update

We are continuing to investigate this issue.
Posted Jan 12, 2026 - 16:13 GMT

Investigating

We are currently investigating this issue.
Posted Jan 12, 2026 - 16:13 GMT
This incident affected: eMAR (Atlas Central, CAPA, CAPA inbound prescription service, Atlas Sync, eMAR App, Titan Integration).