kswapd high cpu no swap issue with T instances on AWS

I’ve been working as the Head of SaaS at Binology for almost a year and a half already. One of my core responsibilities is planning & executing company workflows in the cloud in a cost-effective manner.

We use the AWS cloud, and Docker is our primary delivery method for the apps we run. Both vendors are great technology partners for a company of any size. My experience working with them has been very positive.

We're currently at a pretty early stage and our workloads are comparatively small. Because of that, we've been successfully using burstable T instances running Ubuntu.

The problem

Like many other AWS users running Linux on T instances, we bumped into a common issue. It's usually summarized as “kswapd high cpu no swap” and basically it looks like this:

Suddenly apps running on a T instance become unresponsive. The browser times out when you try to reach a web app running there.
It’s also likely that you won’t be able to log into the instance via SSH as you usually do.
Checking metrics will reveal a high CPU load and a CPU credit balance that is actively dropping.
If you’re lucky enough to have a terminal on the instance, tools like top will reveal that most of the CPU is consumed by the kswapd process (usually kswapd0).
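
If top isn't handy, the last symptom can also be checked from a script. Below is a minimal Python sketch, assuming a Linux /proc filesystem; the function names are mine and purely illustrative. It reads the accumulated CPU time (in clock ticks) of every kswapd thread from /proc/<pid>/stat. A number that grows quickly between two runs suggests kswapd is the one burning CPU.

```python
from pathlib import Path


def parse_stat(line):
    """Return (comm, cpu_ticks) from one /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so we split on the LAST ')' in the line.
    """
    head, _, rest = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = rest.split()
    # After the name, fields[0] is the state (field 3 in proc(5)),
    # so utime (field 14) is fields[11] and stime (field 15) is fields[12].
    utime, stime = int(fields[11]), int(fields[12])
    return comm, utime + stime


def kswapd_ticks(proc_root="/proc"):
    """Map each kswapd thread name to its accumulated CPU ticks."""
    result = {}
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            comm, ticks = parse_stat(stat_file.read_text())
        except (OSError, IndexError, ValueError):
            continue  # process exited or the line was not parseable
        if comm.startswith("kswapd"):
            result[comm] = ticks
    return result


if __name__ == "__main__":
    print(kswapd_ticks())
```

Running it twice a few seconds apart and comparing the numbers gives a rough idea of how busy kswapd0 is.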

Rebooting the instance will help you for the time being but will also leave you without a clue of what just happened.

What’s going on?

There are long discussions of this issue on the web; you can find them by googling something like “kswapd high cpu no swap”. Some of the proposed solutions work for some people and don’t work for others. Overall, the root causes are the following:

T instances don’t have ephemeral storage. Because of that, they don’t have actual swap space to use.
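The most commonly used small T instances (nano/micro) have 1 GB of RAM or less.
Whenever memory pressure is high, the OS tries to reclaim memory with a high priority (that’s the job of the kswapd kernel thread). With no swap to move pages to, it can only evict file-backed pages, which are often needed again right away, so it keeps spinning without freeing much.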
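
To see whether an instance is in this situation, you can look at /proc/meminfo. Here is a minimal Python sketch, assuming a Linux host; the helper names are hypothetical. It parses the file and reports whether any swap is configured and how much memory the kernel considers available.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values.

    Most values are reported in kB; a few counters have no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = value.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_summary(info):
    """Summarize the two numbers that matter for the kswapd issue."""
    return {
        "has_swap": info.get("SwapTotal", 0) > 0,
        "mem_available_mib": info.get("MemAvailable", 0) // 1024,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(swap_summary(parse_meminfo(f.read())))
```

If has_swap is false and the available memory is close to zero, the instance is a likely candidate for the behavior described above.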

While such system behavior looks like a bug, it’s definitely not something easy to fix. We’re likely to see it for a while.

There’s an obvious solution: attach an EBS volume for swap. But that adds a DevOps burden to managing your EC2 fleet. It also increases costs compared with the ephemeral storage on larger instances, which costs nothing extra.

How to fix or work around it?

There’s still no solution that works 100% for everybody. In the meantime, you can apply some of the following recommendations to mitigate the issue:

Monitor your app health & uptime with something like StatusCake if you aren’t doing it already. That way you’ll get notified about availability issues and can act in a timely manner.
Use CloudWatch alarms to get notified whenever the CPU credit balance drops below a particular threshold. 60-70% of the maximum value usually works fine. Such an alarm will also give you a clue that something is going wrong in other cases.
Tweak the “vm.swappiness” setting of the system and set it to 0, since there’s no swap anyway.
Plan memory usage and leverage Docker memory limits to make sure apps won’t eat too much of the system memory.
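
As an illustration of the CloudWatch recommendation, here is a minimal Python sketch that builds the parameters for an alarm on the CPUCreditBalance metric. The function name, the alarm name, and the 65% default are my assumptions. The maximum credit balance depends on the instance type, so it's passed in rather than hard-coded; check the value for your own instance type.

```python
def credit_alarm_kwargs(instance_id, max_credits, fraction=0.65):
    """Build keyword arguments for a low CPU credit balance alarm.

    max_credits is the maximum credit balance of the instance type
    (an input you have to look up); fraction is the share of that
    maximum below which the alarm should fire.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return {
        "AlarmName": f"low-cpu-credits-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # credit metrics are reported every 5 minutes
        "EvaluationPeriods": 2,
        "Threshold": max_credits * fraction,
        "ComparisonOperator": "LessThanThreshold",
    }


if __name__ == "__main__":
    # The instance ID and 144 credits are example values, not recommendations.
    kwargs = credit_alarm_kwargs("i-0123456789abcdef0", max_credits=144)
    print(kwargs)
    # With boto3 installed and AWS credentials configured, this could be sent as:
    #   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
    # To actually get notified, add AlarmActions with your SNS topic ARN.
```

Keeping the parameter building separate from the API call makes it easy to review the threshold before anything gets created in your account.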
