I’ve been working as the Head of SaaS at Binology for almost a year and a half already. One of my core responsibilities is planning & executing company workflows in the cloud in a cost-effective manner.
We use the AWS cloud and Docker is our primary delivery method for apps we run. Both vendors are great technology partners for a company of any size. My experience working with them was very positive.
We’re currently at a pretty early stage and our workloads are comparatively small. Because of that, we’ve been successfully using burstable T instances running Ubuntu.
Like many other AWS users running Linux on T instances, we bumped into a common issue. It’s usually summarized as “kswapd high cpu no swap” and basically it looks like this:
- Suddenly, apps running on a T instance become unresponsive. The browser times out when you try to reach a web app running there.
- It’s also likely that you won’t be able to log into the instance via SSH as you usually do.
- Checking metrics will reveal a high CPU load and a rapidly dropping CPU credit balance.
- If you’re lucky enough to have a terminal on the instance, tools like top will reveal that most of the CPU time goes to a kswapd process (usually kswapd0).
Rebooting the instance will help you for the time being but will also leave you without a clue of what just happened.
What’s going on?
There are long discussions about this issue on the Web, which you can find by googling something like “kswapd high cpu no swap”. Some of the proposed solutions work for some people and not for others. Overall, the root causes are the following:
- T instances don’t have ephemeral storage. Because of that, they don’t have actual swap space to use.
- The most commonly used junior T instances (nano/micro) have 1 GB of RAM or less.
- Whenever memory pressure is high, the kernel tries to free up memory with a high priority (that’s what the kswapd process is doing). With no swap to move pages to, it can keep spinning without reclaiming much.
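If you do get a shell on an affected instance, the conditions listed above can be checked directly. Here’s a minimal Python sketch that parses /proc/meminfo and reports whether swap is configured and how much memory is still available. The function names and the 10% threshold are my own illustrative choices, not kernel constants, and the MemAvailable field assumes a reasonably recent kernel (3.14 or later).

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most values are reported in kB; unit suffixes are ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_report(meminfo, low_available_pct=10.0):
    """Summarise swap presence and available memory.

    low_available_pct is an illustrative threshold (an assumption).
    """
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    available_pct = 100.0 * available / total if total else 0.0
    return {
        "has_swap": meminfo.get("SwapTotal", 0) > 0,
        "available_pct": round(available_pct, 1),
        "under_pressure": available_pct < low_available_pct,
    }


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/meminfo exists.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            print(memory_report(parse_meminfo(f.read())))
```

A report with has_swap set to False and under_pressure set to True matches the situation described here: the kernel is short on memory and has nowhere to swap to.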
While such system behavior looks like a bug, it’s definitely not something easy to fix. We’re likely to see it for a while.
There’s an obvious solution: attach an EBS volume and use it for swap. But that adds a DevOps burden to managing your EC2 fleet. It also increases costs compared to the ephemeral storage on more senior instances, which costs nothing extra.
How to fix or work around it?
There’s still no 100% solution working for everybody out there. In the meantime you can apply some of the following recommendations to mitigate the issue:
- Monitor your app health & uptime with something like StatusCake if you aren’t doing it already. That way you’ll get notified about availability issues and can act in a timely manner.
- Use CloudWatch alarms to get notified whenever the CPU credit balance drops below a particular threshold. 60-70% of the maximum value usually works fine. That will also give you an early warning in other situations where something is burning CPU.
- Tweak the “vm.swappiness” setting of the system and set it to 0, since there’s no swap anyway.
- Plan memory usage and leverage Docker memory limits to make sure apps won’t eat too much of the system memory.
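Two of the recommendations above involve a bit of arithmetic, so here’s a rough Python sketch of both. The first function builds the arguments for a CloudWatch alarm on the CPUCreditBalance metric at a fraction of the maximum balance. The second checks that a set of per-container limits fits into the instance’s RAM and turns them into values for Docker’s --memory flag. The function names, the instance ID, the app names and the numbers are all hypothetical. The 144 credit maximum is what AWS documents for t2.micro, so look up the right value for your instance type.

```python
def credit_alarm_params(instance_id, max_credits, fraction=0.65,
                        sns_topic_arn=None):
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    fraction=0.65 sits in the 60-70% range suggested above.
    """
    params = {
        "AlarmName": f"low-cpu-credits-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": round(max_credits * fraction, 1),
        "ComparisonOperator": "LessThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def docker_memory_flags(limits_mb, total_mb, reserved_mb):
    """Return {app: "--memory=<N>m"} if all limits fit into the budget.

    reserved_mb is the RAM kept aside for the OS and the Docker daemon
    (how much to reserve is your call; it is an assumption here).
    """
    budget = total_mb - reserved_mb
    requested = sum(limits_mb.values())
    if requested > budget:
        raise ValueError(
            f"requested {requested} MB exceeds budget of {budget} MB")
    return {app: f"--memory={mb}m" for app, mb in limits_mb.items()}


if __name__ == "__main__":
    # Hypothetical instance ID; 144 is the documented maximum for t2.micro.
    params = credit_alarm_params("i-0123456789abcdef0", max_credits=144)
    print(params["Threshold"])
    # With boto3 installed and credentials configured, the alarm could be
    # created with:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)

    # Hypothetical apps on a 1 GB instance with 256 MB kept for the system.
    print(docker_memory_flags({"web": 384, "worker": 256},
                              total_mb=1024, reserved_mb=256))
```

Each generated flag goes straight into the corresponding docker run command. If the limits don’t fit, the function raises an error, and it’s better to find that out while planning than at 3 AM.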