The Cassandra database is designed to hold massive amounts of data and typically runs on large clusters of computers. Now, a lecturer from the University of Dundee is running it on the Raspberry Pi (RBPi), the tiny, credit-card-sized, ARM-based single-board computer.
At Cassandra Summit 2013, Andy Cobley, a lecturer at the University of Dundee, Scotland, presented his process for running Cassandra on multiple RBPis, which serve as Ethernet-connected computers for his students. The advantage: no server racks and no data centers required.
With 512MB of memory and a 700MHz ARM processor booting off an SD card, the Linux-running RBPi does not look like a suitable candidate for usefully running Cassandra, the Java-based database built for big data. Facebook originally contributed Cassandra as an Apache project. Organizations such as CERN, Twitter, eBay and Netflix use it to process huge amounts of data, and for this they use powerful servers in multiple data centers. Cassandra stores data and spreads the load over several clusters of connected disks and RAM-loaded servers. Connecting these clusters over highly constrained links results in a reliable, resilient database with international reach.
Andy Cobley wanted to make it possible to run Cassandra on multiple RBPis, so that his students could experience running a database on multiple computers connected via Ethernet, without having to build data centers and server racks. For this, Andy had to accept some compromises.
Cassandra is designed to write data to disk at high speed. Typically, in the time a laptop completes 12,000 write operations, a single RBPi manages only 200 writes to its SD card. Making it write to an external USB drive only slows it down further. Moreover, the Ethernet port of the RBPi shares the same bus as its external USB port and the SD card. Because Cassandra is very network-centric, this matters: on a shared bus, any gain in disk throughput comes at the cost of a drastic reduction in network performance. The route data takes through a system therefore affects its performance.
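To put those two figures side by side, here is a back-of-the-envelope calculation. It uses only the 12,000 and 200 write counts quoted above; the function name is illustrative, and the sketch assumes both counts were measured over the same time interval.

```python
def write_slowdown(laptop_writes: int, pi_writes: int) -> float:
    """Return how many times more writes the laptop completes in the same interval."""
    if pi_writes <= 0:
        raise ValueError("pi_writes must be positive")
    return laptop_writes / pi_writes


LAPTOP_WRITES = 12_000  # figure quoted in the article
PI_WRITES = 200         # figure quoted in the article

ratio = write_slowdown(LAPTOP_WRITES, PI_WRITES)
print(f"The laptop completes {ratio:.0f}x as many writes as one RBPi")
```

On these numbers a single RBPi is roughly 60 times slower at writes than the laptop, before any contention on the shared bus is taken into account.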
With four to eight RBPis powered from USB hubs and all attached to an Ethernet switch, Andy was able to run Cassandra. Each RBPi ran Raspbian, a Debian Linux variant. Although he was unable to run the then-current Oracle JDK on this setup, he ran Cassandra on OpenJDK. Running Cassandra in this manner, although complicated, led to some bugs in Cassandra being fixed. For example, Andy had to make the startup script resilient to the system reporting no CPU cores.
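The kind of guard described above can be sketched as follows. This is not Cassandra's actual startup script, which the article does not show; it is a minimal Python illustration that assumes the core count is derived by counting `processor` entries in `/proc/cpuinfo`-style text. The function name and both sample strings are hypothetical.

```python
import re


def count_cpu_cores(cpuinfo_text: str) -> int:
    """Count lowercase 'processor :' entries, falling back to 1 if none are found."""
    matches = re.findall(r"^processor\s*:", cpuinfo_text, flags=re.MULTILINE)
    # A count of zero would break any later sizing arithmetic, so clamp to 1.
    return max(len(matches), 1)


# Hypothetical sample with two conventional 'processor' entries.
two_core_sample = (
    "processor\t: 0\nmodel name\t: Example CPU\n"
    "processor\t: 1\nmodel name\t: Example CPU\n"
)
# Hypothetical sample in which no line matches the expected pattern.
no_match_sample = "Processor\t: Example ARM core\nBogoMIPS\t: 697.95\n"

print(count_cpu_cores(two_core_sample))
print(count_cpu_cores(no_match_sample))
```

Clamping to at least one core is one simple way to keep downstream calculations from dividing or multiplying by zero; other fallbacks are equally plausible.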
Cassandra uses compression to boost performance. However, Andy could not use the default method, Google’s Snappy compressor, which relies on native code. Instead, he had to settle for the Java-based Deflate compressor, which is slower and carries a penalty in write performance. A further performance boost came from adjusting the RBPi's memory split so that the CPU is allotted more memory than the GPU.
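For readers unfamiliar with Deflate, the following sketch shows the algorithm at work using Python's standard `zlib` module. It only illustrates Deflate compression in general, not Cassandra's Java implementation, and the sample payload, function name, and compression level are assumptions chosen for the example.

```python
import zlib
from typing import Tuple


def deflate_round_trip(payload: bytes, level: int = 6) -> Tuple[bytes, float]:
    """Compress with Deflate, verify the round trip, and return (compressed, ratio)."""
    if not payload:
        raise ValueError("payload must not be empty")
    compressed = zlib.compress(payload, level)
    if zlib.decompress(compressed) != payload:
        raise RuntimeError("round trip failed")
    return compressed, len(compressed) / len(payload)


# Hypothetical, highly repetitive payload standing in for database rows.
sample = b"sensor_id=42,temperature=21.5;" * 200
compressed, ratio = deflate_round_trip(sample)
print(f"{len(sample)} bytes -> {len(compressed)} bytes (ratio {ratio:.3f})")
```

Higher compression levels generally trade more CPU time for smaller output, which is the sort of cost that weighs on a 700MHz processor.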
Andy has scaled down the Cassandra platform for his students, without actually rewriting it, making it easier for them to examine how a combination of Linux and Java runs on an RBPi cluster.