Today’s data centers and High-Performance Computing (HPC) clusters work with thousands of computers. With every computer running its own operating system, even a team of highly skilled administrators would find it hard to keep up with the demands of maintaining each system. Typically, admins strain to keep the many types of computers they manage functioning as they should. They must duplicate their efforts over and over as they install new systems with new system software and handle configuration problems individually as they arise on each system. However, things need not be so complicated and difficult.
In an alternative scenario, all the admins must do is reboot a machine for it to enter a pre-configured operational environment. Several such environments, called images, can exist side by side, with each image acting as an individual container for the system software, configuration and behavior of the group of nodes it was designed to run on.
For example, an image that manages the operational requirements of a large multi-user cluster would contain the necessary software and its configuration, including the behavior of compute nodes, admin nodes, login nodes, IO nodes and anything else needed. A second image might be based on the latest Linux distribution, currently under test for a future deployment. Other images could be configured for web servers, database programs, application servers, user desktops or render farms.
With one root image controlling the behavior of all machines, the complexity of the overall system and the overhead of system administration are scaled down drastically. It also leads to a stable environment as administrators can focus on hardening only one system instead of spreading their attention thin across the various setups.
In such a cluster, individual computers typically have no hard disks, although computers with disks are also supported. Any subset of nodes can be diskless, and each may be booted into any image as required. When an image is changed, all the nodes using it see the change simultaneously. Switching a node to a different image takes only a reboot. Moreover, an image may be cloned with a simple copy. Any number of functionally different machines may use the same image, and a simple synchronization propagates a working modification made on one system to all the others. Since the image remains the same no matter which machine is using it, each configured node behaves consistently.
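To make the "clone with a simple copy" idea concrete, here is a minimal Python sketch. It assumes an image is nothing more than a directory tree on the server; the directory names (images/production, images/testing) and the helper clone_image are hypothetical and are not part of oneSIS itself.

```python
import shutil
import tempfile
from pathlib import Path


def clone_image(src: Path, dst: Path) -> Path:
    """Clone an image directory tree with a plain recursive copy."""
    # symlinks=True copies symbolic links as links instead of following
    # them, so the clone keeps the same layout as the original tree.
    shutil.copytree(src, dst, symlinks=True)
    return dst


def demo() -> list[str]:
    """Build a tiny stand-in image, clone it, and list the clone's contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical image layout, used only for illustration.
        production = root / "images" / "production"
        (production / "etc").mkdir(parents=True)
        (production / "etc" / "hostname.conf").write_text("cluster\n")

        testing = clone_image(production, root / "images" / "testing")
        return sorted(p.relative_to(testing).as_posix() for p in testing.rglob("*"))


if __name__ == "__main__":
    print(demo())
```

A real root image also carries ownership, device nodes and other metadata that a plain copytree call does not necessarily preserve, so an actual deployment would more likely rely on an archive-preserving copy tool. The point here is only that a clone is an ordinary copy of a directory tree.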
Local networks may have many nodes, with the image cloned for each network and each clone capable of serving the image to as many diskless clients as the network or the serving machine can handle. Each node operates normally under the configuration designed for it, which determines its role at boot time. Once a node has booted, its functional role can still be changed on the fly.
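One way to picture how a single shared configuration can give each node its role at boot time is a table of hostname patterns. The Python sketch below is only an illustration under that assumption: the patterns, the role names and the function resolve_role are hypothetical, and this is not oneSIS's actual configuration syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern-to-role table shared by every node using the image.
# Rules are checked in order and the first matching pattern wins.
ROLE_RULES = [
    ("admin*", "admin"),
    ("login*", "login"),
    ("io*", "io"),
    ("cn*", "compute"),
]


def resolve_role(hostname: str, rules=ROLE_RULES, default: str = "unassigned") -> str:
    """Return the role of the first rule whose pattern matches the hostname."""
    for pattern, role in rules:
        # fnmatchcase does shell-style, case-sensitive matching on all platforms.
        if fnmatchcase(hostname, pattern):
            return role
    return default


if __name__ == "__main__":
    for host in ("admin1", "login2", "io03", "cn0417", "spare9"):
        print(host, resolve_role(host))
```

Because every node reads the same table, the image stays identical across machines while each machine still ends up with its own role; changing a role means editing one rule rather than touching the node.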
An open-source software package, oneSIS, offers such a method for building and maintaining compute systems of any size. Lightweight, flexible and easy to configure, the package drastically reduces the cost of cluster administration.