Saturday, February 17, 2018

Handling memory pressure in embedded systems without swap


Swap is a slow, logical extension of physical RAM. Whenever the system runs out of physical RAM, the Linux kernel writes the least-used data from RAM to swap, in turn freeing up some RAM. Typically, swap is set up on a dedicated partition of secondary storage, or on a separate storage device altogether. Sometimes specially created files on an existing filesystem are used as swap too.

A lot of embedded systems use NAND flash or SSDs as their secondary storage. These devices are typically divided into fixed-size blocks, and a block must be erased before it can be written again. They have no mechanical parts that could fail; instead, it is the limited number of write/erase (program/erase) cycles that ages them out. This property makes them unsuitable for use as a swap device.

The Problem

Some time ago, I faced an out-of-memory problem on a project I was working on. It was a system with 8GB of physical RAM and 8GB of NAND flash as secondary storage. Since the secondary storage was NAND flash, swap was not enabled on the system.

The root filesystem was on ramfs and, for historical reasons beyond my understanding, never switched over to a filesystem on the secondary storage. So the complete root filesystem stayed in RAM for the system's entire uptime. With extra features being added to the software, memory usage kept increasing. This put immense pressure on the Linux kernel's memory management to keep up. Eventually, the system caved in: we started seeing frequent out-of-memory (OOM) kills, crashes, and hangs.

There were the following major problems I could see with the system:

  1. The root filesystem never went away as it should.
  2. It was based on ramfs which doesn’t have a backing store.
  3. There was no swap to handle memory pressure.

Little could be done about problem no. 1, for reasons beyond this post. The solution, as usual in the software industry, was demanded at the earliest. So I turned my eyes towards problems no. 2 and 3.

Ramfs and its problems

Ramfs is a very simple filesystem that exports Linux's disk caching mechanisms (the page cache and dentry cache) as a dynamically resizable RAM-based filesystem.

Normally all files are cached in memory by Linux.  Pages of data read from backing store (usually the block device the filesystem is mounted on) are kept around in case it's needed again, but marked as clean (freeable) in case the Virtual Memory system needs the memory for something else.  Similarly, data written to files is marked clean as soon as it has been written to backing store, but kept around for caching purposes until the VM reallocates the memory.  A similar mechanism (the dentry cache) greatly speeds up access to directories.

Ramfs does not have a backing store. Files written to it add pages to the page cache, but there is no device to write those pages back to. This means the pages can never be marked clean, so even if swap were available they could never be swapped out. And even if we could figure out a way to make the root filesystem swappable, the pages would still have nowhere to go.

The other major disadvantage of ramfs is that it grows dynamically. The system will not stop you from writing data, and it may crash or hang once there is no more physical RAM left to write to.

Problem with the system

The system I was dealing with kept all the logs, databases, and whatnot on the root filesystem, so it was growing at a significant rate. The root filesystem took about 2GB of RAM (unswappable), leaving a little less than 6GB for user processes. The applications had genuine memory requirements; a parallel effort to trim them didn't yield much.

The bigger problem was coredump generation. When a huge application crashed, the system allocated memory frantically to generate the core. Before the coredump could be saved, the system would run into an OOM condition, killing other large applications and eventually bringing the whole system down.

The Solution

The obvious solution was to get some swap space. But before that, I had to make sure the root filesystem was swappable, which at the time it wasn't. In addition, I had to find a way to add swap space without adding another piece of hardware; I had to solve this with whatever I had at hand.

I took the help of the following two things to solve the problem:

1. Tmpfs

Tmpfs has many advantages in comparison to ramfs.

  1. Unlike ramfs, it doesn't grow dynamically: it refuses writes beyond the size specified when it was mounted.
  2. It has better usage reporting than ramfs.
  3. It is backed by swap.

There was a patch floating around that added support for using tmpfs rather than ramfs for the root filesystem. But even though tmpfs is backed by swap, this wasn't the complete solution: the system I was debugging didn't have swap to begin with.


2. ZRAM

ZRAM is a logical, in-memory block device that stores data in compressed form. The compression is done on the fly. Its two common uses are being mounted on /tmp and being used as swap!

So ZRAM fit in as the other block of the solution. The major challenge was backporting it to the 2.6.32 kernel. Since kernel APIs change at a fast pace, the port took some time, but the effort was well spent.

These two blocks together solved the problem.
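Putting the two together amounts to a few privileged setup steps at boot time. Below is a minimal, hypothetical C sketch of what such early-init code could look like; the mount point, device path, and sizes are illustrative, it must run as root on a kernel with zram support, and the actual system used a backported driver with its own init path:

```c
/* Hypothetical early-init sketch: mount a size-capped tmpfs and enable
 * swap on a zram device. Paths and sizes are illustrative; must run as
 * root on a kernel with zram support. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/swap.h>

int main(void)
{
    /* Cap the tmpfs at 512 MiB so it cannot eat all of RAM like ramfs. */
    if (mount("tmpfs", "/tmp", "tmpfs", 0, "size=512m") != 0)
        perror("mount tmpfs");

    /* Size the zram device (1 GiB uncompressed) through sysfs, then
     * write a swap signature on it. */
    if (system("echo 1073741824 > /sys/block/zram0/disksize") != 0 ||
        system("mkswap /dev/zram0") != 0)
        fprintf(stderr, "zram setup failed\n");

    /* Finally, enable swapping on the compressed device. */
    if (swapon("/dev/zram0", 0) != 0)
        perror("swapon");
    return 0;
}
```

The same steps can of course be done from an init script instead; doing it in C just matches how an embedded init process would wire it up.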

The Testing

To test the setup with the fix, I wrote an application that would greedily allocate memory and never give it back. I called it "memeater". The image below tells the story without ZRAM support.

Memory Reclaim without ZRAM and INITTMPFS

As can be seen in the graph above, at about 48 seconds the free memory dropped to almost zero. At about the same time, the Linux kernel started freeing up "clean" cached pages in a desperate attempt to keep up. As there weren't many clean pages, it couldn't, and about 10 seconds later the system caved in. The rising blue line shows anonymous pages growing in the system, which is expected, since memeater is mallocing memory.

Memory Reclaim with ZRAM and INITTMPFS

The image above is with ZRAM. As in the first image, at about 48 seconds the free memory (orange line) drops to almost zero, and Linux frees up the clean pages (grey line). But now Linux has swap enabled on a ZRAM block device. So at 57 seconds, where in the previous case the system crashed, Linux starts swapping pages out to ZRAM. This can be seen in the swap free (yellow line) sloping down. Since tmpfs is built on top of shared memory, the shmem (green line) slopes down as well. The system stays up for 50 more seconds! That is significant, because memeater never returns memory to the system, which real workloads do.

Actual Test 1

Actual Test 2

The above two graphs are taken from an actual system.


The inittmpfs + ZRAM combination worked amazingly well. After the fix, not a single out-of-memory condition was reported. A lot of other systems use ZRAM or similar infrastructure; Google's ChromeOS also uses ZRAM. At the time of this writing, my macOS has about 1.87 GB of memory in a compressed state. If you are facing a similar problem, maybe this setup will work for you too. Let me know if it does!

Sunday, April 21, 2013

MIPS Bootstrapping

Bootstrapping is the process of taking a CPU just out of reset, fetching and executing instructions serially, to a more complex running environment. The program that does this is called a "boot loader", "bootstrap code", or simply "boot code".

First Instruction Fetch

When power is applied to a processor and it comes out of reset, it fetches its first instruction from a hardwired address known as the "boot vector" or the "reset vector". On MIPS processors, the boot vector is located at physical address 0x1FC00000. MIPS processors have the MMU enabled as soon as they are powered on, so the core presents the virtual address 0xBFC00000, which the MMU translates to the physical address 0x1FC00000, the boot vector. This translation, again, is hardwired. Typically, a boot device is present at this address and responds to the processor's read request. See Figure 1. The offset 0 of the bootstrap code in Figure 1 may not always hold: your hardware designer may change it by pulling physical address lines high, thereby changing the final physical address.

A boot device is a permanent storage device that offers random access on reads, just like RAM. NOR flash and NVRAM are typical boot devices.

U-Boot: The Boot Loader

U-Boot is a boot loader from Denx Software Engineering. It supports multiple architectures, including MIPS. We can draw a general outline of what U-Boot does as part of bootstrapping MIPS; board-specific and other details will be left out.

U-Boot can be divided into two stages: stage 1 and stage 2.

Stage 1 Loader

Stage 1 is completely or partially written in assembly language. A program written in C requires RAM to be working and a stack pointer set in the stack register. When the CPU is executing stage 1 code, the RAM is usually not yet available, so C code cannot run. For this reason, the first few instructions are coded in assembly.

How soon the assembly code can jump to C depends on what features the CPU provides. Some MIPS implementations provide a way to lock cache lines and use them as a stack; some have separate SRAM that can serve as a temporary stack. In these cases, assembly can quickly set up a temporary stack and jump to C code. Otherwise, all the code to initialize RAM has to be written in assembly, which can be quite a pain. Whatever the case, all this time the CPU is running by fetching instructions directly from the boot device. Accesses to the boot device are not as fast as RAM, so the CPU cannot continue this way for long. The next step for the stage 1 loader is therefore to copy the rest of the code from the boot device to RAM and relocate it. What now runs in RAM can be called the stage 2 loader.

Stage 2 Loader

The stage 2 loader is responsible for doing the following:

  1. Prepare and initialize the other subsystems: storage devices, USB, networking, etc.
  2. Initialize the environment variables.
  3. Give the user a prompt for further commands, or autoboot a predefined operating system.

Loading Linux Kernel

U-Boot usually loads Linux like any other ELF binary: the ELF header is parsed and the entry point is taken from it. The kernel entry function pointer declaration below shows how U-Boot passes control over to Linux and what information is provided along with it.

void (*theKernel)(int, char **, char **, int *);

U-Boot passes 4 arguments when it calls the Linux kernel. The following are the arguments passed:
  1. Number of arguments.
  2. Linux arguments.
  3. Linux environment variables.
  4. Nothing (0).

As part of the environment variables, information like memsize, initrd start, and flash start is passed over. The Linux arguments contain whatever is defined in the bootargs environment variable of U-Boot. The four arguments are passed via the a0, a1, a2, and a3 registers.

Friday, April 20, 2012

What's left before prompt?

Here is what is pending:
  1. Verify that the IDT installation is working fine. Before that I need to
    install at least one task segment. In Intel 64, hardware task switching
    is not supported, but at least one TSS must still be present because it
    is used for loading the interrupt/exception-time stack. Thank god they
    moved on from conditional stack switching.
  2. Need a keyboard driver before we can show the prompt.
  3. Need to get the timer working and, as a result, get the scheduler to schedule the X-Visor thread.
  4. Need to get the hyperthread support up and running.

What is X-Visor?

X-Visor aims to provide an open-source virtualization solution that is light-weight, portable, and flexible. It tries to ensure a small memory footprint and low virtualization overhead in every functionality. Open-source projects such as Linux, NetBSD, FreeBSD, and QEMU have made a great impact on Xvisor's design & development. X-Visor has most of the features expected from a modern, full-fledged hypervisor, such as:

  • Tree based Configuration (Device Tree)
  • CPU virtualization (Guest, Virtual CPUs, Virtual IRQs)
  • MMU virtualization (Virtual MMU, Virtual Guest Address Space)
  • IO virtualization (Device Emulation Framework, Emulators)
  • Device Driver Framework (Host Address Space, Host IRQs, Drivers)
  • Threading Framework (Hypervisor Threads)
  • Management Terminal (Mterm)
  • Serial Port Virtualization (Virtual Serial)

Xvisor's source code is highly portable. In fact, its development was initiated on 3 different architectures (ARM, MIPS, and Intel 64) simultaneously, to ensure flexibility and portability from the beginning. It is easily portable to most general-purpose 32- or 64-bit architectures, as long as they have a paged memory management unit (PMMU) and a port of the GNU C compiler (gcc, part of The GNU Compiler Collection, GCC).

The MIPS port has been abandoned because of a lack of interest from MIPS designers. Intel has virtualization support and ARM is adding it, but there are no such plans for MIPS. I don't want to waste time on something people don't care about, hence I took up Intel 64. Why Intel 64? I think the 32-bit days will be over soon; 64-bit architectures are the future, and I don't want to get into that 32-bit clutter. If somebody wants to take up 32-bit support, be my guest.

Thursday, August 4, 2011

MIPS Virtualization: VCPU Address Map

            +-------------------------------+ 0xFFFFFFFF
            |                               |
            |    X V I S O R                |
            |    A D D R E S S   S P A C E  |
            |                               |
            |    (N O T  V I S I B L E      |
            |     T O  G U E S T  V C P U)  |
            +-------------------------------+ 0x80000000
            |                               |
            |                               | USEG0/KUSEG0
            |   KSEG2 / USEG0 (ASID based)  |
            |                               |
            |                               |
            +-------------------------------+ 0x40000000
            |                               | 0x3FC00000
            |   KSEG1 / USEG0 (ASID based)  |
            |                               |
            +-------------------------------+ 0x20000000
            |                               |
            |   KSEG0 / USEG0 (ASID based)  |
            |                               |
            +-------------------------------+ 0x00000000
                        FIGURE 1.
                 (Address map for a VCPU)

Xvisor sees and works on the actual address map as defined by
the MIPS architecture. But for the virtual CPUs (VCPUs), the
address space is different.

The VCPU running the guest OS sees KSEG0 (the mapped, cached
region) starting from 0x00000000. This is where the RAM for the
VCPU is mapped. When the VCPU is in kernel mode, or its EXL bit
is set, the 2GB region starting at 0x00000000 pretends to be the
region enclosed between addresses 0x80000000 - 0xFFFFFFFF on an
actual MIPS CPU. When not running in kernel mode, this 2GB
region is the regular user segment.

When a VCPU starts, its EXL bit is set and it is essentially
running in kernel mode. So USEG0 of the CPU is presented by the
hypervisor as the usual kernel-mode segments. Since in this mode
the 512MB region starting at 0x20000000 becomes KSEG1 (mapped,
uncached), the VCPU starts running at virtual address 0x3FC00000
(i.e. it becomes the start_pc of the VCPU). This is the region
marked as the "ROM" region under the guest in the DTS file. The
physical address mapped to this virtual address can point to a
partition in NOR/boot flash, or it can be regular memory. A
bootloader, typically U-Boot, is supposed to be present at this
address. The important thing to note here is that, since this
U-Boot will be running in guest mode, it can't be linked at the
regular boot address, i.e. 0xBFC00000. Rather, it must be linked
at 0x3FC00000, since this is the new virtual address where the
VCPU will start executing. This U-Boot will also be suitably
modified so that it expects the RAM's physical address in one of
the parameter registers (a0, a1, a2, a3). Also, this modified
U-Boot shouldn't try to initialize the DRAM. It can, though,
initialize a UART and other peripherals. This U-Boot will then
load the guest operating system.

The point of making USEG0 appear as the regular KSEG regions
when the VCPU's EXL bit is set is that we should be able to boot
a VCPU like a regular MIPS CPU and do the major dirty work in
the boot code. Needless to say, this also makes the picture more
comprehensible.

Saturday, May 28, 2011

Atomthreads on MIPS architecture

Atomthreads' port to the MIPS architecture is complete. A few of the test cases are working fine; some still need to be fixed, and I am working on them. The source can be pulled from here. It's up to Kelvin now when he wants to pull in the changes. Since this is a new port to a 32-bit architecture, there are some changes in the core kernel code itself. Hopefully, the MIPS port will get merged soon.

Once I am done validating and fixing all the test cases, I can go back to the reason I started this port: my hypervisor. I have not worked on it for one and a half weeks now; I was busy porting Atomthreads to MIPS.

I started this port because, for testing the hypervisor, I wanted a small OS that doesn't play with the MMU much. My first target is to test the instruction emulation framework, for which Linux would be too heavy.

Hope I will get back to hypervisor soon.

Tuesday, April 26, 2011

MIPS: non-virtualizable architecture (Part 1)

MIPS is a wonderful architecture, but when it was designed, virtualization wasn't much in the air. As a result, the architecture isn't completely virtualizable, though it isn't as notorious as x86.

One requirement for virtualization is privileged access to system registers, and MIPS does provide this. All system registers are in CP0; any access from user space will fault, and modification access to the TLBs will also fault. Then what is the problem?

The problem is the way the virtual address space is laid out. On a 32-bit architecture, 4 GB of virtual address space is available, and the MIPS specification reserves the upper 2 GB for kernel/supervisor mode. Any access to the third GB in kernel mode goes untranslated by the MMU. The last (fourth) GB goes via the MMU but is usable only in kernel mode.

For running a guest, only the first 2 GB are available: the guest and its userspace programs must share just 2 GB of virtual address space.
Linux assumes many things about the address space layout. It is linked in KSEG0, and kmap addresses are present in KSEG2. For these and other reasons, an unmodified Linux guest cannot run on the MIPS architecture. Modifications will be required, though hopefully not up to the level of "hypercalls".

Wednesday, February 23, 2011

Making of Hyperthreads

I have been working on a fun project where I am trying to virtualize a MIPS machine. For a QEMU-emulated NE2000 network device and a couple of system daemons, I needed a light threading framework. In the earlier design, anything in hypervisor context ran directly on the CPU and was thus completely serialized. So we decided to do something different here.

We created a virtual CPU and named it the hypercore. This virtual CPU runs anything in the context of the hypervisor and, whenever scheduled, always works in kernel mode.

The idea was simple: anything of a _maintenance_ sort was scheduled to run later on the hypercore. I then went on to implement the threading mechanism. It's very similar to what is known as kernel threads in the Linux(R) world. There is one function that the thread executes. Threads have a lifetime: they are created, run, paused, run again, and finally destroyed.

The initial implementation was non-preemptible. This means threads had to behave themselves and _yield_ the VCPU if they didn't have much to do. In this implementation, when a thread wanted to yield, it called schedule. The hypercore scheduler then found the next runnable thread and switched the context. As in any context-switching code, the current thread was halted and the new thread loaded. The next time the older thread was loaded, it returned back into the same "schedule" function and continued its task. Initially I had a little hard time implementing the schedule code, but it became easy once I understood this fact:

"Current running thread enters the schedule but when schedule returns it returns in next thread context and this continues."

One other thing that helps in context switching is an independent stack for each thread. We allocate 4KiB of memory per thread: the top works as the stack, and at the base lies the thread information, similar to what we have in Linux(R). This makes the information very easy to access: just mask off the low 3 nibbles of the stack pointer and you have the current thread's information. Also, when a context switch happens, much of the information about where schedule was called from is saved by the compiler on the thread's stack. After switching out, this information stays intact; when the thread switches back in, the compiler finds the same important pieces of information at the same place!

But very soon after this non-preemptible implementation, it became quite important to have preemptible scheduling. For preemptible scheduling, we gave each thread a time slice, and on every interrupt the currently running thread's tick count was incremented. Once it reached the maximum, the thread was scheduled out and a new thread pulled in. This is how it happens:

If thread A is running and an interrupt happens, A's complete context is saved on a temporary stack. It is the same stack every time, since we don't support nested interrupts right now. The current thread's tick count is incremented. If it has reached the maximum, the next thread to run, B, is picked from the scheduler's queue. A's context is saved in A's thread info and B's context is loaded onto the same interrupt stack. Once the function that does all this returns, the interrupt handler finds the EPC, RA, and every other register updated with B's context. When eret is executed, B's EPC is loaded into the PC and the system actually starts executing B, not A.

Scheduling of VCPUs, TLB-miss handling, and the other core virtualization work are still handled outside this VCPU. This has a benefit: all threads doing non-core work have to share the time slice given to the hypercore, so we _don't_ penalize the guest. No matter how many threads are on the system, the guest gets its guaranteed time.

Sunday, January 24, 2010

Null Trace domain registered today

Today I registered the NullTrace domain. Since I did it with Google's help, Google Apps came along, and everything was a breeze. You can now access this blog at www[dot]nulltrace[dot]org. I *think* I will write some random rants, kernel topics, and the posts that won't fit on Twitter ;-)