Performance issues on Finland

Incident Report for DynamicHost

Postmortem

Leadup

Describe the circumstances that led to this incident

A cron job executed a bash script linked to a service.
A vital service in the backup infrastructure started operation and initially proceeded without issue.

Fault

Describe what failed to work as expected

An issue during execution caused a memory leak and a high volume of disk I/O operations. These operations queued up and used most of the CPU time available on the instance.

The memory leak and the I/O load filled the memory and prevented system and user tasks from being executed by the processor.

Detection

Describe how the incident was detected

Monitoring services detected and made us aware of the high I/O operation count and abnormal memory consumption.
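
As an illustration of the two signals mentioned above, the following is a minimal sketch of how iowait and available memory can be derived on a Linux host from the contents of /proc/stat and /proc/meminfo. It is not the monitoring stack used during this incident; the function names are hypothetical, and any alert thresholds would need to be chosen per host.

```python
from typing import Dict

# Column names of the aggregate "cpu" line in /proc/stat (values are in jiffies).
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Share of CPU time spent waiting on I/O between two samples."""
    total = sum(after.values()) - sum(before.values())
    if total <= 0:
        return 0.0
    return 100.0 * (after["iowait"] - before["iowait"]) / total


def mem_available_percent(meminfo_text: str) -> float:
    """MemAvailable as a percentage of MemTotal, from /proc/meminfo text."""
    values: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return 100.0 * values["MemAvailable"] / values["MemTotal"]
```

In practice the /proc/stat text would be read twice, a few seconds apart, and the two parsed samples passed to iowait_percent. A sustained high value together with a low mem_available_percent would match the pattern described in this report.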

Root causes

Run a 5-whys analysis to understand the true causes of the incident

A service utilised during backups had a memory leak and was performing an abnormal number of I/O operations.
The system's memory filled rapidly and the I/O operations consumed much of the processor's queue.
This caused the scheduler to prioritise them over system and user tasks.
Vital services began failing due to timeouts, not receiving data from the disk, or OOM errors.
The server became mostly unresponsive, with major components failed or failing.

Mitigation and resolution

What steps did you take to resolve this incident?

As per incident training, we followed an ICERR approach to the situation. First, we identified the cause of the issue and its scope.
The incident only affected one host, and using remote monitoring tools we could identify possible culprits.

We then spent time gaining access to the machine and ensured that our remote management services were prioritised over everything else so that we didn’t get locked out. This way we wouldn’t be forced to reboot the machine and cause an outage.
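
One way a management process can be kept alive under memory pressure on Linux is by lowering its OOM score adjustment and raising its scheduling priority. The sketch below is illustrative only and is not a record of the exact commands run during the incident. The function names and the proc_root parameter are hypothetical, and both operations normally require root privileges.

```python
import os
from pathlib import Path

# Kernel-defined bounds for /proc/<pid>/oom_score_adj; -1000 exempts the
# process from the OOM killer entirely.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def protect_from_oom(pid: int, score: int = OOM_SCORE_ADJ_MIN,
                     proc_root: str = "/proc") -> Path:
    """Write an OOM score adjustment for the given process and return the path.

    proc_root is a hypothetical parameter so the function can be exercised
    against a temporary directory instead of the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{score}\n")
    return path


def raise_priority(pid: int, niceness: int = -10) -> None:
    """Give the process a lower nice value (higher CPU scheduling priority).

    Lowering niceness below its current value requires elevated privileges.
    """
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

Applied to the remote management daemon's PID, this makes it the last candidate for the OOM killer and favours it in CPU scheduling, which is the effect described above.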

Once we had access, we continued to contain the issue. We looked at the suspected processes in more detail and determined that the process responsible was one associated with the backup services utilised. We then stopped the process and reverted the stored backup to the previous stored version, to ensure that an incomplete or corrupted backup was not the latest option.

Once the issue was contained, we temporarily isolated the backup service and its children until we can properly identify the cause at the development level and release a patch to ensure the issue doesn’t occur again.
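
The report does not state how the isolation was implemented. As one possible approach, and purely as an assumption, a leaking process can be prevented from exhausting host memory by starting it under an address-space limit. The sketch below uses a hypothetical helper, run_with_memory_cap, and works on Linux only.

```python
import resource
import subprocess
from typing import List


def run_with_memory_cap(cmd: List[str], max_bytes: int) -> subprocess.CompletedProcess:
    """Run cmd with its virtual address space capped at max_bytes.

    The limit is set in the child just before exec and is inherited by any
    processes the command spawns. It applies per process, not to the
    process tree as a whole.
    """
    def apply_limit() -> None:
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=apply_limit,
                          capture_output=True, text=True)
```

With such a cap in place, a runaway allocation fails inside the limited process instead of triggering OOM conditions for the rest of the host. Capping the combined memory of a service and all its children would instead need a cgroup memory limit.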

We ensured no lasting damage occurred, cleaned up and then restored the service to production operation.

Lessons learnt

What went well? What could have gone better? What else did you learn?

This was spotted manually by the NOC during their monitoring operations. We can improve our AI/ML models so that issues like this are identified automatically in the future.

Posted Jul 21, 2021 - 12:17 UTC

Resolved

This incident has been resolved.

Posted Jul 21, 2021 - 12:03 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 21, 2021 - 12:02 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jul 21, 2021 - 11:46 UTC

Investigating

We are currently investigating an issue in relation to iowait on Finland 1.

Posted Jul 21, 2021 - 09:32 UTC

This incident affected: Pterodactyl Cluster (Finland 1 (Previously dn002)).