On 12 January, Atlas experienced intermittent sync failures starting at approximately 4:10 PM. The issue was triggered by a cloud networking event that failed the production database over from SQL 5 to SQL 6. Some services did not reconnect automatically, causing sync errors for a subset of customers.
Recovery began around 4:29 PM, and full stability was confirmed before the final status page update at 5:01 PM.
This incident was unrelated to the 13 January incident (which stemmed from a database reindex deadlock).
Root cause:
The incident was caused by a cloud networking issue that triggered a database failover (SQL 5 to SQL 6). Not all sync services reconnected cleanly after the failover, resulting in intermittent sync errors until failback and service restarts were completed.
Contributing factors:
· Automatic reconnection for some services did not complete successfully
· Failover timing meant customers saw impact before services recovered
· Sync services distribute load across multiple servers, contributing to the intermittent pattern of failures
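For illustration only, the short sketch below shows how spreading load across several sync servers, where only some held stale database connections after the failover, produces exactly this intermittent pattern. The server names, states, and routing logic are assumptions made for the example, not Atlas internals.

```python
import random

# Hypothetical illustration: after the failover, only some sync servers
# still held stale connections to the old primary, so a given customer's
# requests failed only some of the time.
SYNC_SERVERS = {
    "sync-1": "healthy",  # reconnected to SQL 6 automatically
    "sync-2": "stale",    # still pointed at SQL 5 post-failover
    "sync-3": "healthy",
    "sync-4": "stale",
}

def handle_sync_request(request_id: int) -> str:
    # Load is spread across servers, so each request may land on a
    # healthy or a stale node independently of the previous request.
    server = random.choice(list(SYNC_SERVERS))
    if SYNC_SERVERS[server] == "stale":
        return f"request {request_id} via {server}: sync error"
    return f"request {request_id} via {server}: ok"

if __name__ == "__main__":
    for i in range(8):
        print(handle_sync_request(i))
```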
Service was restored by:
· Failing the database back from SQL 6 to SQL 5
· Restarting the sync services that had not reconnected automatically
During the incident:
· Incident started at approximately 4:10 PM, 12 January
· Recovery began around 4:29 PM
· Final status page update / closure: 5:01 PM
· Impact pattern: intermittent; affected some, but not all, Atlas customers, distributed effectively at random by sync server routing
Completed/Immediate:
· Post-incident verification of sync health across services
Planned/Ongoing:
· Tuning connection handling to improve resilience during infrastructure events (a sketch of the general approach follows this list)
· Scheduled replacement of the current SQL server technology over the coming months, which will provide improved stability during failovers
· Continued review of at-risk customer configurations and impact patterns
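As a rough sketch of the connection-handling direction mentioned above, the example below retries a database connection with exponential backoff and jitter, so that a service rides through a failover window instead of giving up after its first failure. The function names, hosts, and parameters are illustrative assumptions, not Atlas code.

```python
import random
import time

# A minimal sketch of reconnect-with-backoff handling; names and
# parameters here are illustrative assumptions, not Atlas code.
MAX_ATTEMPTS = 6
BASE_DELAY_SECONDS = 0.5

def connect_to_database(host: str) -> str:
    # Placeholder for a real driver call; fails randomly to simulate
    # the unstable window during a failover.
    if random.random() < 0.6:
        raise ConnectionError(f"could not reach {host}")
    return f"connected to {host}"

def connect_with_backoff(host: str) -> str:
    # Retry with exponential backoff plus jitter, so a service keeps
    # trying through a failover instead of stopping at the first error.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return connect_to_database(host)
        except ConnectionError as exc:
            delay = BASE_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
    raise ConnectionError(f"gave up on {host} after {MAX_ATTEMPTS} attempts")

if __name__ == "__main__":
    print(connect_with_backoff("sql-6.internal.example"))
```

Capping the retries while adding jitter avoids both giving up too early and having every service hammer the database at the same instant it returns.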
We apologise for the disruption this caused and appreciate your patience as we restored normal operations.