Introduction to jails
For newcomers more familiar with VMware and similar products, a FreeBSD jail is an operating system level partition of the userland. This means that the kernel, with all its functions and resources, is shared among the instantiated virtual machines, but the individual systems cannot directly influence one another's processes.
There are several consequences of this method. The first one is that, obviously, it is operating system-specific. The kernel is shared, so the individual "guest" virtual machines cannot choose to run some other kernel. A specific edge case for FreeBSD is that jails can run any userland the kernel supports; in particular this means that, for example, an older version of FreeBSD userland (like FreeBSD 4 - not that it will matter since the kernel is 8.x) can be used, or a Linux userland.
The second major consequence is that all resources really are shared. Most importantly, the CPUs and memory are shared, and this can be either a good thing or a bad thing, depending on the specific usage scenario. Jails can be restricted to specific CPUs if needed, with the granularity of a logical CPU, but there are currently no such limits available for memory sharing (though some are in development). As a consequence, disk caches are shared among the guests, which can be very nicely exploited by using nullfs to mount across jails (keeping only one physical copy of libraries and other binaries).
The network stack is currently also shared, though there is work to introduce more virtualization in it. As a special feature ready right now, up to 16 separate forwarding information bases (FIBs) can exist.
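Restricting jails to specific CPUs, as mentioned above, could be scripted roughly as follows. This is a minimal sketch that only builds command lines and does not run them. It assumes the cpuset(1) utility with its -l (CPU list) and -j (jail ID) options; the function names and the round-robin placement policy are illustrative, not something the experiment used.

```python
# Hypothetical helper: build cpuset(1) command lines that pin jails to
# logical CPUs. Nothing is executed here; the commands are only constructed.


def cpuset_command(jail_id, cpus):
    """Return the argv list that would pin one jail to the given logical CPUs."""
    cpu_list = ",".join(str(c) for c in cpus)
    # Assumed invocation: cpuset -l <cpu-list> -j <jail-id>
    return ["cpuset", "-l", cpu_list, "-j", str(jail_id)]


def spread_jails(jail_ids, ncpus):
    """Assign each jail to a single logical CPU in round-robin order."""
    return {
        jid: cpuset_command(jid, [i % ncpus])
        for i, jid in enumerate(jail_ids)
    }


if __name__ == "__main__":
    # Example: four jails spread over two logical CPUs.
    for jid, argv in spread_jails([1, 2, 3, 4], ncpus=2).items():
        print(jid, " ".join(argv))
```

On a real host the resulting argument lists would be passed to something like subprocess.run as root; that step is left out so the sketch stays side-effect free.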
I'd like to emphasize once again how cheap (with regard to resource consumption) this sort of virtualization is - in practice, it is not much different from starting the set of basic processes N times, each of which behaves as it normally does.
The 1000 jails challenge
FreeBSD has jails integrated with the overall system and that extends to their startup and shutdown. By default, all common jail configuration can be done in exactly the same way as the rest of the system is configured - in /etc/rc.conf. Some additional files may be needed per-jail, like a jail-specific fstab.
To automate the creation of all that configuration I wrote a simple script, mkjails.py, which generates the configuration files for a set of 1000 identical light jails. Each individual jail will have these properties:
- It will null-mount the relevant binary directories from the host (like /bin, /usr/bin, /lib, /usr/lib, etc.) but will have its own /etc, /var and /usr/local
- It will have its own single IP address
- It will start with a set of default FreeBSD processes like cron, syslog and sshd
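The null-mount arrangement in the list above could be expressed as a per-jail fstab generated along these lines. This is a minimal sketch, not the actual mkjails.py: the directory list, the /jails/j0001 path and the read-only option are assumptions made for illustration.

```python
# Hypothetical sketch: build a per-jail fstab that null-mounts host
# directories into the jail's root directory.

# Assumed set of shared directories; the article names only some of them.
SHARED_DIRS = ["/bin", "/sbin", "/lib", "/libexec",
               "/usr/bin", "/usr/sbin", "/usr/lib"]


def jail_fstab(jail_root, shared_dirs=SHARED_DIRS):
    """Return fstab text with one nullfs line per shared host directory."""
    lines = []
    for d in shared_dirs:
        # fstab fields: device, mount point, fs type, options, dump, pass.
        # "ro" (read-only) is an assumption, chosen so the jail cannot
        # modify the host's binaries.
        lines.append(f"{d}\t{jail_root}{d}\tnullfs\tro\t0\t0")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(jail_fstab("/jails/j0001"), end="")
```

Directories such as /etc, /var and /usr/local are deliberately absent from the list, since each jail keeps its own copy of those.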
Each jail will have its own configuration and, once created, is practically ready to be handed to an independent administrator who will be root within the jail. This administrator will be able to do everything to the system except upgrade its kernel and base userland (e.g. the admin will be able to install apache and have complete control over it, but will not be able to upgrade /bin/ls).
A single section of rc.conf.jails (generated by mkjails.py for a single jail) looks something like this:
The properties described above mean the host system ended up with 1001 IP addresses assigned to a single NIC and with more than 14,000 individual mount points for the nullfs mounts.
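Generating the per-jail rc.conf settings and handing out one address per jail could be sketched as follows. This is an illustration under stated assumptions, not the original mkjails.py or its output: the jail names (j0001 and so on), the example.org hostnames, the 10.0.0.1 base address assumed to belong to the host, the /jails directory and the fstab path are all hypothetical. The variable names follow the jail_<name>_* convention read by the rc.d jail script of that FreeBSD era.

```python
import ipaddress


def jail_name(index):
    """Hypothetical naming scheme: j0001, j0002, ..."""
    return f"j{index:04d}"


def rc_conf_section(index, base_ip="10.0.0.1", jail_dir="/jails"):
    """Return the rc.conf lines for one jail.

    The host is assumed to hold base_ip itself; jail number N gets
    base_ip + N, so 1000 jails plus the host use 1001 addresses.
    """
    name = jail_name(index)
    ip = ipaddress.ip_address(base_ip) + index
    settings = {
        "rootdir": f"{jail_dir}/{name}",
        "hostname": f"{name}.example.org",
        "ip": str(ip),
        "devfs_enable": "YES",
        "mount_enable": "YES",
        "fstab": f"/etc/fstab.{name}",
    }
    return "".join(
        f'jail_{name}_{key}="{value}"\n' for key, value in settings.items()
    )


def jail_list(count):
    """Return the jail_list line naming every jail to start at boot."""
    names = " ".join(jail_name(i) for i in range(1, count + 1))
    return f'jail_list="{names}"\n'


if __name__ == "__main__":
    print(rc_conf_section(1), end="")
```

Using the ipaddress module for the arithmetic means the address rolls over octet boundaries correctly once the jail count passes 254.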
One particular technology not shown here is the integration between jails and ZFS, which lets the root user inside a jail administer ZFS properties, including creating file systems.
Without much more talk, here is what starting 1000 jails looks like:
[Ogg Theora video: starting 1000 jails]
(for the impatient, feel free to skip to 29:45 for a bit less boring view)
The machine which did all that is equipped with 2x quad-core CPUs and 4 GB of RAM. During the experiment I feared that the relatively low amount of RAM would prevent all 1000 jails from being created, but it appears that I could easily create twice as many without problems. In reality, memory is probably the only limiting factor here and I could have created an arbitrary number of VMs, but 1000 is a nice round number.
As can be seen in the frames with "top" running, CPU usage is almost 0. This is because the jails are not doing anything in particular once they are started. The most CPU-intensive part was the ssh RSA key generation.
In retrospect, this is an awesome result. Everything from the kernel downwards was perfectly stable before and after the experiment and the jails ran flawlessly. This experiment was done without special kernel tuning - an out-of-the-box GENERIC kernel was used, with no tunables or sysctls set.
- It gets really interesting when every minute 1000 crons wake up to do their work :) Load averages spike to > 100. Obviously, crons would need to be turned off where unneeded.
- To drive more than about 1000 jails, kern.maxproc will need to be increased (and probably kern.maxprocperuid).
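The process-limit note above comes down to simple arithmetic, sketched here as a rough estimator. All the numbers are assumptions: three processes per idle jail (cron, syslog and sshd, as listed earlier), a guessed 100 host processes, and 25% headroom. The function name is hypothetical, and the result is only a starting point for choosing a kern.maxproc value.

```python
import math


def required_maxproc(jails, procs_per_jail=3, host_procs=100, headroom=1.25):
    """Estimate a kern.maxproc value large enough for the given jail count.

    procs_per_jail: idle processes per jail (assumed: cron, syslog, sshd)
    host_procs:     processes on the host itself (assumed figure)
    headroom:       multiplier leaving room for logins, cron jobs, etc.
    """
    baseline = jails * procs_per_jail + host_procs
    return math.ceil(baseline * headroom)


if __name__ == "__main__":
    for n in (1000, 2000, 5000):
        print(n, "jails ->", required_maxproc(n))
```

Since every jail's daemons may run as the same numeric user IDs, the per-user limit kern.maxprocperuid would likely need a similar adjustment.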
So there it is - cheap, easy, low-weight virtualization that can be quickly set up and destroyed.