Describe the circumstances that led to this incident
A cron job executed a Bash script linked to a service.
A vital service in the backup infrastructure started operation and initially proceeded without issue.
Describe what failed to work as expected
The execution went wrong, causing a memory leak and a high volume of disk I/O operations. These operations queued up and consumed most of the CPU time available on the instance.
Together, the memory leak and the I/O load exhausted memory and prevented system and user tasks from being executed by the processor.
Describe how the incident was detected
Monitoring services detected the high I/O operation count and abnormal memory consumption and alerted us.
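As a rough illustration of the kind of check that can raise such an alert, the sketch below computes the fraction of memory in use from the text of Linux's /proc/meminfo and compares it with a threshold. The function names, the sample values and the 90% threshold are assumptions made for illustration; they are not the configuration of the monitoring services used in this incident.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_fraction(meminfo):
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total


def memory_alert(text, threshold=0.90):
    """Return True when memory use meets or exceeds the (assumed) threshold."""
    return memory_used_fraction(parse_meminfo(text)) >= threshold


# Hypothetical sample; on a live host the text would come from /proc/meminfo.
SAMPLE = """MemTotal:       8000000 kB
MemFree:         200000 kB
MemAvailable:    400000 kB
"""

if __name__ == "__main__":
    print(memory_alert(SAMPLE))
```

On a real host the same functions could be fed the contents of /proc/meminfo at a fixed interval, with a similar check on I/O counters.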
Run a 5-whys analysis to understand the true causes of the incident
What steps did you take to resolve this incident?
As per incident training, we followed an ICERR approach to the situation. First, we identified the cause of the issue and its scope.
The incident affected only one host, and using remote monitoring tools we could identify possible culprits.
We then spent time gaining access to the machine and ensured that our remote management services were prioritised over everything else so that we would not get locked out. This way we would not be forced to reboot the machine and cause an outage.
Once we had access, we continued to contain the issue. We looked at the suspected processes in more detail and determined that the process responsible was one associated with the backup services in use. We then stopped the process and reverted the stored backup to the previous stored version, to ensure that the latest available backup was not incomplete or corrupted.
With the issue contained, we temporarily isolated the backup service and its children until we could properly identify the cause at the development level and release a patch to ensure the issue does not occur again.
We ensured that no lasting damage had occurred, cleaned up, and then restored the service to production operation.
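The containment step described above, in which the suspected processes were examined to find the one responsible, can be sketched as a simple ranking by each process's share of memory and I/O. This is a minimal illustration and not the tooling we used: the ProcessSample type, the scoring rule and the sample process names (including "backup-worker") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    pid: int
    name: str
    rss_kb: int    # resident memory in kB
    io_bytes: int  # bytes read plus written during the sampling window


def rank_suspects(samples, top=3):
    """Return the processes with the highest combined share of memory and I/O."""
    # "or 1" avoids division by zero when every sample reports zero.
    total_rss = sum(s.rss_kb for s in samples) or 1
    total_io = sum(s.io_bytes for s in samples) or 1

    def score(sample):
        # Assumed scoring rule: memory share plus I/O share, equally weighted.
        return sample.rss_kb / total_rss + sample.io_bytes / total_io

    return sorted(samples, key=score, reverse=True)[:top]


# Hypothetical samples; real values would come from the host's process table.
samples = [
    ProcessSample(101, "sshd", 12_000, 4_000),
    ProcessSample(202, "backup-worker", 6_500_000, 900_000_000),
    ProcessSample(303, "nginx", 150_000, 20_000_000),
]

if __name__ == "__main__":
    for suspect in rank_suspects(samples):
        print(suspect.pid, suspect.name)
```

Weighting memory and I/O equally is only one possible choice; a different weighting may suit hosts where one resource is the usual bottleneck.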
What went well? What could have gone better? What else did you learn?
This was spotted manually by the NOC during their monitoring operations. We can improve our AI/ML models so that issues like this are identified automatically in the future.