Swap is a logical, slower extension of physical RAM. When the system runs low on physical RAM, the Linux kernel writes the least recently used data from RAM to swap, freeing up some RAM. Typically, swap is set up on a dedicated partition on secondary storage or on a separate storage device. Sometimes, specially created files on an existing filesystem are used as swap instead.
A lot of embedded systems use NAND flash or SSDs as their secondary storage. These devices are typically divided into fixed-size blocks, and a block needs to be erased before it can be written again. They have no mechanical parts to wear out. What ages them is their write/erase (program/erase) cycles, which are limited and define the device’s lifetime. Swap generates frequent writes, so this property makes them unsuitable as swap devices.
Some time ago, I faced an out-of-memory problem on a project I was working on. It was a system with 8GB of physical RAM and 8GB of NAND flash as secondary storage. Since the secondary storage was NAND flash, swap was not enabled on the system.
The root filesystem was on ramfs which, for historical reasons beyond my understanding, never switched to a filesystem on the secondary storage. So the complete root filesystem stayed in RAM for the system’s entire uptime. As extra features were added to the software, memory usage kept increasing. This put immense pressure on the Linux kernel’s memory management. Eventually, the system caved in. We started seeing frequent Out of Memory (OOM) conditions, crashes, and hangs.
I could see the following major problems with the system:
- The initial root filesystem never went away as it should have.
- It was based on ramfs which doesn’t have a backing store.
- There was no swap to handle memory pressure.
Little could be done about the first problem, for reasons beyond the scope of this post. The solution, as is usual in the software industry, was demanded at the earliest. So I turned my attention to the second and third problems.
Ramfs and its problems
Ramfs is a very simple filesystem that exports Linux's disk caching mechanisms (the page cache and dentry cache) as a dynamically resizable RAM-based filesystem.
Normally all files are cached in memory by Linux. Pages of data read from backing store (usually the block device the filesystem is mounted on) are kept around in case it's needed again, but marked as clean (freeable) in case the Virtual Memory system needs the memory for something else. Similarly, data written to files is marked clean as soon as it has been written to backing store, but kept around for caching purposes until the VM reallocates the memory. A similar mechanism (the dentry cache) greatly speeds up access to directories.
Ramfs does not have a backing store. Files written to it add pages to the page cache, but there is nowhere to write those pages to. As a result, these pages are never marked clean, and even if swap is available, they can never be swapped out. On this system there was no swap either, so even if we could figure out a way to make the root filesystem swappable, its pages would still have nowhere to go.
The other major disadvantage of ramfs is that it grows dynamically. The system will not stop you from writing data, and it may crash or hang once there is no more physical RAM available to write to.
Problem with the system
The system I was dealing with kept all its logs, databases, and whatnot on the root filesystem, so it was growing at a significant rate. The root filesystem took about 2GB of RAM (unswappable), leaving a little less than 6GB for user processes. The applications had a genuine need for memory, and a parallel effort to reduce their usage didn’t yield much.
The bigger problem was coredump generation. When a huge application crashed, the system allocated memory frantically to generate the core. Before the coredump could be saved, the system ran into an OOM condition and killed other huge applications, eventually bringing the whole system down.
The obvious solution was to add swap space. But first I had to make the root filesystem swappable, which at that time it wasn’t. On top of that, I had to find a way to add swap space without adding any hardware. I had to solve this with whatever I had at hand.
Two things helped me solve the problem: tmpfs and ZRAM.
Tmpfs has many advantages in comparison to ramfs.
- Unlike ramfs, it has a size limit. It refuses writes beyond the size that was specified while mounting it.
- It has better usage reporting than ramfs.
- It is backed by swap.
There was a patch floating around that let the root filesystem use tmpfs rather than ramfs. Even though tmpfs is backed by swap, that alone wasn’t the solution, as the system I was debugging didn’t have swap to begin with.
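The size cap mentioned in the list above is set with the tmpfs `size=` mount option. The sketch below only builds the `mount` command line for an ordinary tmpfs mount; it is not the root filesystem patch, and the mount point `/mnt/scratch` and the 512m size are hypothetical values chosen for illustration.

```python
def tmpfs_mount_command(mount_point, size):
    """Build a mount(8) argument list for a size-capped tmpfs.

    mount_point and size are illustrative inputs, e.g. "/mnt/scratch", "512m".
    """
    # size= is the tmpfs option that caps how much data the filesystem may hold.
    return ["mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", mount_point]


if __name__ == "__main__":
    # Dry run: print the command instead of executing it (running it needs root).
    print(" ".join(tmpfs_mount_command("/mnt/scratch", "512m")))
```

Once such a mount is full, further writes fail with an error instead of eating into the remaining RAM, which is the behaviour ramfs lacks.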
ZRAM is a logical in-memory block device that stores data in compressed form. The compression is done on the fly. It has two common uses: holding a filesystem mounted on /tmp, and serving as swap!
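How much the compression buys you can be estimated from the device statistics. On recent mainline kernels the zram driver exposes `/sys/block/zram0/mm_stat`, whose first two fields are the uncompressed and compressed data sizes in bytes. The backported 2.6.32 driver described in this post may not have had this file, so treat the following as a sketch for newer kernels. The sample line is made up for illustration.

```python
def zram_compression_ratio(mm_stat_line):
    """Return original/compressed size from one mm_stat line, or None if empty."""
    fields = mm_stat_line.split()
    orig_data_size = int(fields[0])   # bytes of uncompressed data stored
    compr_data_size = int(fields[1])  # bytes those data occupy once compressed
    if compr_data_size == 0:
        return None  # nothing stored on the device yet
    return orig_data_size / compr_data_size


if __name__ == "__main__":
    # Hypothetical sample; on a live system read /sys/block/zram0/mm_stat instead.
    sample = "4096000 1024000 1105920 0 1105920 0 0 0"
    print(zram_compression_ratio(sample))
```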
So ZRAM fit in as the other block of the solution. The major challenge was backporting it to the 2.6.32 kernel. Since kernel APIs change quickly, the port took some time. But the effort was well spent.
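For reference, enabling swap on zram on a current mainline kernel usually comes down to four shell steps. The sketch below only assembles those steps as strings so they can be reviewed before being run as root. The device name `zram0`, the 2GB size, and the priority of 100 are assumptions for illustration, and the backported 2.6.32 driver may have used a different control interface.

```python
def zram_swap_plan(disksize_bytes, device="zram0", priority=100):
    """Return the shell steps that would turn a zram device into swap."""
    return [
        "modprobe zram",                                          # load the driver
        f"echo {disksize_bytes} > /sys/block/{device}/disksize",  # set uncompressed capacity
        f"mkswap /dev/{device}",                                  # write the swap signature
        f"swapon -p {priority} /dev/{device}",                    # enable it as swap
    ]


if __name__ == "__main__":
    # Dry run only: print the plan for a hypothetical 2GB device.
    for step in zram_swap_plan(2 * 1024**3):
        print(step)
```

Note that `disksize` is the uncompressed capacity of the device; the RAM it actually consumes depends on how well the swapped pages compress.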
These two blocks together solved the problem.
To test the setup with the fix, I wrote an application that greedily allocates memory and never returns it. I called it “memeater”. The image below tells the story without ZRAM support.
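The original memeater is not reproduced in this post. The following is a minimal Python sketch of the same idea, with a hard limit added so it can be tried safely. The function name, the chunk size, and the 4096-byte page size are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, used only to touch each page once


def eat_memory(chunk_bytes, limit_bytes):
    """Allocate chunks until limit_bytes is reached and keep them all alive."""
    hoard = []  # holding the references is what prevents the memory being freed
    eaten = 0
    while eaten + chunk_bytes <= limit_bytes:
        chunk = bytearray(chunk_bytes)
        # Write one byte per page so the kernel has to back it with real memory.
        for offset in range(0, chunk_bytes, PAGE_SIZE):
            chunk[offset] = 1
        hoard.append(chunk)
        eaten += chunk_bytes
    return hoard


if __name__ == "__main__":
    # Small, safe demonstration: 16 chunks of 1 MiB each.
    hoard = eat_memory(1024 * 1024, 16 * 1024 * 1024)
    print(len(hoard), "chunks held")
```

Raising the limit towards the amount of installed RAM reproduces the pressure described here, so do that only on a test machine.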
|Memory Reclaim without ZRAM and INITMPFS|
As can be seen in the graph above, at about 48 seconds the free memory dropped to almost zero. At about the same time, the Linux kernel started freeing the “clean” cached pages in a desperate measure to keep up. As there weren’t many clean pages, it couldn’t keep up, and about 10 seconds later it caved in. The linear blue line shows the anonymous pages rising in the system. This is expected, since memeater keeps allocating memory with malloc.
|Memory Reclaim with ZRAM and INITMPFS|
The image above is with zram. As in the first image, at about 48 seconds the free memory (orange line) drops to almost zero, and Linux frees up clean pages (grey line). This time, Linux has swap enabled on a zram block device. So at 57 seconds, the point where the system crashed in the previous case, Linux starts swapping out pages to zram. This shows up as swap free (yellow line) sloping down. Since tmpfs is built on top of shared memory, shmem (green line) also starts sloping down. The system stays up for 50 more seconds! This is significant because memeater never returns memory to the system, which is not how actual systems behave.
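The quantities discussed for these graphs (free memory, cached pages, anonymous pages, shmem, and swap free) are all available in `/proc/meminfo`. How the graphs here were actually collected is not stated; the sketch below is one way to sample the same fields, and the sample text is made up for illustration.

```python
FIELDS = ("MemFree", "Cached", "AnonPages", "Shmem", "SwapFree")


def parse_meminfo(text, wanted=FIELDS):
    """Extract selected /proc/meminfo fields as a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        # Exact name match, so "SwapCached" is not mistaken for "Cached".
        if sep and name in wanted:
            values[name] = int(rest.split()[0])
    return values


# Hypothetical sample in the /proc/meminfo format.
SAMPLE = """\
MemTotal:        8167848 kB
MemFree:          204800 kB
Cached:          1048576 kB
SwapCached:            0 kB
AnonPages:       5242880 kB
Shmem:           2097152 kB
SwapFree:        1572864 kB
"""

if __name__ == "__main__":
    # On a live system, pass open("/proc/meminfo").read() instead of SAMPLE.
    print(parse_meminfo(SAMPLE))
```

Calling this once a second and logging the result with a timestamp gives the kind of time series the graphs are built from.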
|Actual Test One|
|Actual Test Two|
The above two graphs are taken from an actual system.
The initmpfs + zram combination worked amazingly well. After the fix, not a single out-of-memory condition was reported. A lot of other systems use zram or similar infrastructure. Google’s ChromeOS also uses zram. At the time of this writing, my macOS machine has about 1.87 GB in compressed state. If you are having a similar problem, maybe this setup would work for you. Let me know if it does!