Atlas Archive Process Incident on 10th November 2025
Summary
On 10th November, a service disruption occurred when a scheduled maintenance task was run during business hours. The maintenance process required the system to safely restore data integrity, which took longer than anticipated. Service was restored in stages, with initial recovery at 2:30 PM and full service restoration by 3:20 PM.
Root Cause
The incident occurred due to a maintenance scheduling issue with our infrastructure partner.
Contributing factors:
- Scheduling Error: A background maintenance task that is normally run outside core business hours was initiated during business hours as part of troubleshooting activities.
- Extended Recovery Time: The nature of the maintenance process required additional time for the system to safely complete data operations
- Process Documentation: Our partner's procedures did not have sufficient guidance on appropriate scheduling for this type of maintenance
Timeline of Events
Resolution
Service was restored through the following steps:
- System Recovery: The maintenance process completed successfully, and systems recovered automatically
- Phased Restoration: Services were restored in stages to ensure stability
- Extended Monitoring: Additional monitoring was conducted to confirm full service stability
- Verification: All systems were verified as fully operational before closing the incident
Customer Communication
During the Incident:
- Incident began at 1:10 PM on 10th November
- First customer reports received at 1:18 PM
- Service was fully restored by 3:20 PM
- Incident closed at 4:45 PM after stability monitoring
- Total service disruption: approximately 2 hours 10 minutes
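The durations implied by these timestamps can be checked with a short calculation. The following is a minimal sketch in Python: the times are copied from this report, while the variable names and the `format_duration` helper are illustrative only. The report does not state a time zone, so the times are treated as local and naive.

```python
from datetime import datetime, timedelta

# Timestamps as stated in this report (10th November 2025, local time).
incident_start = datetime(2025, 11, 10, 13, 10)    # 1:10 PM, incident began
first_report = datetime(2025, 11, 10, 13, 18)      # 1:18 PM, first customer reports
full_restoration = datetime(2025, 11, 10, 15, 20)  # 3:20 PM, service fully restored
incident_closed = datetime(2025, 11, 10, 16, 45)   # 4:45 PM, incident closed


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as whole hours and minutes (illustrative helper)."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


# Start of incident to full restoration.
disruption = format_duration(full_restoration - incident_start)
# Start of incident to first customer report.
time_to_first_report = format_duration(first_report - incident_start)
# Full restoration to incident closure (stability monitoring period).
monitoring_window = format_duration(incident_closed - full_restoration)

print("Total disruption:", disruption)
print("Time to first customer report:", time_to_first_report)
print("Stability monitoring window:", monitoring_window)
```

The first figure corresponds to the total service disruption quoted above; the other two are derived from the same timestamps.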
Preventative Measures and Next Steps
Completed Actions:
- Partner Engagement: Conducted a comprehensive review with our infrastructure partner during a scheduled service review
- Updated Procedures: Our partner has updated their operational procedures to ensure background tasks are scheduled at appropriate times.
- Enhanced Communication: Implemented improved communication protocols requiring advance confirmation of scheduling for all maintenance activities
- Verification Process: Our partner now confirms scheduling details before initiating any maintenance processes
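As an illustration of how a scheduling check of this kind might be automated, the sketch below refuses to start a maintenance task unless the schedule has been confirmed and the start time falls outside business hours. It is an assumption-laden example, not a description of our partner's actual tooling: the business-hours window (09:00 to 17:00, Monday to Friday) and the function names are hypothetical and do not come from this report.

```python
from datetime import datetime, time

# ASSUMPTION: business hours are not defined in this report.
# 09:00-17:00 on weekdays is used purely for illustration.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(when: datetime) -> bool:
    """Return True if `when` falls on a weekday inside the assumed window."""
    is_weekday = when.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= when.time() < BUSINESS_END


def may_start_maintenance(when: datetime, confirmed: bool) -> bool:
    """Hypothetical guard: the schedule must be confirmed in advance
    and the start time must fall outside business hours."""
    return confirmed and not in_business_hours(when)


# A task started at 1:10 PM on Monday 10th November 2025 would be rejected,
# even if the schedule had been confirmed.
print(may_start_maintenance(datetime(2025, 11, 10, 13, 10), confirmed=True))
# The same task scheduled for 10:00 PM with confirmation would be allowed.
print(may_start_maintenance(datetime(2025, 11, 10, 22, 0), confirmed=True))
```

Requiring both conditions mirrors the two completed actions above: advance confirmation of scheduling, and running background tasks at appropriate times.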
Ongoing Actions:
- Continued Oversight: Regular reviews of partner procedures and scheduling practices
- Documentation Review: Ensuring all maintenance procedures have clear scheduling requirements
- Process Monitoring: Tracking adherence to updated procedures
We have worked closely with our infrastructure partner to ensure this type of incident does not happen again. Updated procedures and enhanced communication protocols are now in place to prevent maintenance activities from impacting service during business hours. We continue to monitor our partner's adherence to these improved processes.
We apologise for the inconvenience this incident caused and appreciate your patience as we worked to restore service.