kswapd high cpu no swap issue with T instances on AWS

I’ve been working as the Head of SaaS at Binology for almost a year and a half already. One of my core responsibilities is planning & executing company workflows in the cloud in a cost-effective manner.

We use the AWS cloud and Docker is our primary delivery method for apps we run. Both vendors are great technology partners for a company of any size. My experience working with them was very positive.

We’re currently at a pretty early stage and our workloads are comparatively small. Because of that, we’ve been successfully using burstable T instances running Ubuntu.

The problem

Like many other AWS users running Linux on T instances, we bumped into a common issue. It’s usually summarized as “kswapd high cpu no swap” and basically it looks like this:

Suddenly apps running on a T instance become unresponsive. The browser times out when you try to reach a web app running there.
It’s also likely that you won’t be able to log into the instance via SSH as you usually do.
Checking metrics will reveal a high CPU load and a CPU credit balance that is actively dropping.
If you’re lucky enough to have a terminal on the instance, tools like top will reveal that most of the CPU time goes to a kswapd process (usually kswapd0).
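If top itself is too sluggish to use, the same symptom can be read straight from /proc. Below is a minimal Python sketch, assuming a Linux host and the /proc/<pid>/stat field layout described in proc(5). The function names `parse_cpu_ticks` and `find_kswapd_ticks` are made up for this illustration.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 in proc(5)), so utime and
    # stime (fields 14 and 15) end up at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def find_kswapd_ticks(proc_root="/proc"):
    """Map each kswapd thread name to its accumulated CPU ticks."""
    result = {}
    if not os.path.isdir(proc_root):
        return result  # not a Linux host, nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        name = line[line.index("(") + 1 : line.rindex(")")]
        if name.startswith("kswapd"):
            result[name] = parse_cpu_ticks(line)
    return result


if __name__ == "__main__":
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    for name, ticks in find_kswapd_ticks().items():
        seconds = ticks / ticks_per_second
        print(f"{name}: {seconds:.1f} s of CPU time since boot")
```

Running it twice a few seconds apart and comparing the numbers shows whether kswapd0 is the one eating the CPU right now.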

Rebooting the instance will help you for the time being but will also leave you without a clue of what just happened.

What’s going on?

There are long discussions of this issue on the Web, which you can find by googling something like “kswapd high cpu no swap”. Some of the proposed solutions work for some people and don’t work for others. Overall, the roots of the problem are the following:

T instances don’t have ephemeral (instance store) storage. Because of that, they have no swap space to use out of the box.
The most commonly used junior T instances (nano/micro) have 1 GB of RAM or less.
Whenever memory pressure is high, the OS tries to reclaim memory with a high priority (that’s what the kswapd process does). With no swap to move pages to, it has little left to reclaim and keeps burning CPU in the attempt.
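To check whether an instance matches this picture (no swap, little memory left), /proc/meminfo has everything needed. Here is a minimal sketch, assuming a Linux host. The helper names `parse_meminfo` and `memory_summary` are hypothetical and exist only for this example.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integers (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(info):
    """Report whether swap is configured and how much memory is available."""
    total = info["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return {
        "has_swap": info.get("SwapTotal", 0) > 0,
        "available_pct": round(100.0 * available / total, 1),
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(memory_summary(parse_meminfo(f.read())))
```

A result with has_swap set to False and a low available_pct is the situation described above.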

While this behavior looks like a bug, it’s definitely not something easy to fix. We’re likely to keep seeing it for a while.

There’s an obvious solution: attach an EBS volume and use it for swap. But that adds a DevOps burden to managing your EC2 fleet. It also increases costs compared to the ephemeral storage on more senior instances, which costs nothing extra.

How to fix or work around it?

There’s still no 100% solution working for everybody out there. In the meantime you can apply some of the following recommendations to mitigate the issue:

Monitor your app health & uptime with something like StatusCake if you aren’t doing it already. That way you’ll get notified about availability issues and can act in a timely manner.
Use CloudWatch alarms to get notified whenever the CPU credit balance drops below a particular threshold. 60-70% of the maximum value usually works fine. The same alarm will also tip you off in other cases where something is eating the CPU.
Set the “vm.swappiness” kernel setting to 0, since there’s no swap anyway.
Plan memory usage and leverage Docker memory limits to make sure apps won’t eat too much of the system memory.
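As an illustration of the CloudWatch recommendation, here is a minimal Python sketch that only builds the alarm parameters. The function name `credit_alarm_params`, the alarm name and the 0.65 default are assumptions made for this example. The maximum credit balance depends on the instance size, so it is passed in. The 144 used below is an assumed value for a t2.micro and should be checked against the AWS documentation for your instance type.

```python
def credit_alarm_params(instance_id, max_credit_balance, fraction=0.65,
                        sns_topic_arn=None):
    """Build keyword arguments for a CloudWatch alarm on CPUCreditBalance."""
    params = {
        "AlarmName": f"low-cpu-credits-{instance_id}",  # hypothetical naming
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # credit metrics are reported every 5 minutes
        "EvaluationPeriods": 1,
        # Alarm once the balance falls below the chosen share of the maximum.
        "Threshold": round(max_credit_balance * fraction, 1),
        "ComparisonOperator": "LessThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


if __name__ == "__main__":
    # "i-0123456789abcdef0" is a placeholder instance ID.
    print(credit_alarm_params("i-0123456789abcdef0", max_credit_balance=144))
```

With boto3 installed and credentials configured, the resulting dict can be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping the parameters in a plain function makes it easy to reuse the same threshold rule across a fleet.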

