Thursday, March 17, 2022

Modern Clustering

Back in the late 2000s, I discovered OpenMOSIX, a Linux software suite and kernel patch for making clusters.  In case you aren't already aware, a cluster is a group of computers networked together in a way that allows heavy workloads to be shared across the systems in the cluster.  OpenMOSIX was fairly unique in being a Single System Image (SSI) cluster.  The kernel patch provided OS level support for load sharing, producing a composite system that appeared to be a single computer to user space programs running on the cluster.  The patch allowed the kernel to automatically migrate processes between systems for load balancing, and the user space software allowed the user to set priorities and manually migrate processes.  This made OpenMOSIX extremely powerful.  A high performance computing application merely had to be coded to run across multiple processes, and OpenMOSIX would handle the rest.

In contrast, other clustering software requires programs to use special libraries to explicitly farm tasks out to other nodes, to trigger migrations, and so forth, and it doesn't provide tools that allow users to easily control migration.  Some earlier clustering software even required all of the systems in the cluster to have nearly identical hardware.  OpenMOSIX made clustering trivially easy, but in 2007, the project was officially ended.  Of course, it hadn't seen work for several years, so in reality, it ended around 2004, the year of the last release.  Only three years later did one of the founders announce that OpenMOSIX was officially done.  He justified this with the assertion that multi-core CPUs made clusters obsolete.  This severely annoyed me.  I respect open source developers, and I understand that they won't always have time to continue work on complex projects, but I lose most of that respect when they lie about things like this.  The real reason OpenMOSIX was terminated was that the developers did not have the time or motivation to port it to Linux 2.6.  To their credit, they tried.  There's a beta for one of the first 2.6 kernels.  If that's their reason for giving up, I'm fine with that.  Yeah, I'm sad and disappointed, but it's their time and their lives, and I can't expect them to keep working on it for free, merely because they used to do so.

But clusters were not made obsolete by multi-core CPUs.  In fact, they were made far more valuable, because now we have processors with far stronger computational resources, allowing for cheaper clusters that are also immensely more powerful.  Before multi-core CPUs became common, clusters were often severely limited by network bandwidth.  Beyond a certain number of systems, network congestion would cause lag in the cluster, resulting in rapidly diminishing returns.  Increasing network speeds helped to mitigate this, but it was still a serious issue.  The solution was custom systems with a huge number of separate CPUs, but even then bus congestion could become an issue quite quickly.  In addition, the master node could become a bottleneck, when node traffic piled up and many nodes were stuck waiting for the master node to become free to service their requests.  Multi-core processors solved many of these issues.  Each network connection could be serving 2, 4, 6, or more cores, reducing congestion and collisions dramatically.  A master node with several cores can service nodes faster by spreading the load between cores.  And of course, more computing power is available from significantly fewer systems, reducing the cost of the cluster and even reducing the need for expensive custom systems.  OpenMOSIX on a cluster of multi-core systems could have been incredibly powerful, and the need for high performance computing has only increased.  Multi-core systems made clustering more valuable, not obsolete.

Without OpenMOSIX, modern clustering is a mess once again.  OpenMPI and OpenMP provide clustering over SSH, but programs have to be custom-made for these, and they have to manage significant parts of the load balancing themselves.  Processes can't be migrated; tasks can be terminated and restarted somewhere else, but that isn't the same thing.  Other SSI clustering software has popped up, but it's all dead as well.  A promising replacement for OpenMOSIX, Kerrighed, did make it to the 2.6 kernel, but it died in 2010, with the last release being in June of that year.  MOSIX, the proprietary clustering software OpenMOSIX was originally forked from, does still exist and has seen continued development into 2017, with the latest release being in October of that year.  The latest version doesn't even require a kernel patch, but it uses some virtualization, which increases overhead.  In addition, it isn't open source, which creates significant security concerns.  The LinuxPMI project picked up OpenMOSIX development when Moshe Bar announced it was being abandoned, but while it appears to have made significant progress, there are still no pre-packaged kernels or distros using it, and if you want to use it, you have to patch and compile the kernel yourself, which is no trivial task.  It really seems like the cancellation of the OpenMOSIX project ultimately killed SSI clusters, the most useful and easiest to use kind of cluster.

What now?  Is clustering just a dead technology that is mainly used by nostalgic nerds?  Not quite.  It's still possible to do informal clustering, by writing programs that handle the elements of clustering themselves.  This is better than OpenMPI/OpenMP only in that you have more control, and you don't have to set up all of the operating systems for clustering.  Distributed computing has remained alive and well, even as clustering has largely collapsed, but unlike with OpenMOSIX and SSI clustering, the developer of the software still has to know how to do it.

The most popular use of clustering is probably distributed rendering.  Back in the day, Linux distributions like Dyne:Bolic used a 2.4 Linux kernel with OpenMOSIX to make distributed rendering with Blender trivial.  Blender already had the capability to distribute rendering across multiple processes, and an OpenMOSIX kernel would automatically migrate those processes across the cluster to balance the load.  Modern Blender has network rendering plugins that can be used to accomplish the same thing, but unlike an OpenMOSIX cluster, Blender has to be installed on all of the systems and set up to connect to the master instance.  And of course, someone had to write all of the networking code for the plugins to communicate.  This is not as good as OpenMOSIX, but it is better than nothing, and it is 100% portable.

Modern clustering doesn't require special operating system support.  SSI clusters are still incredibly valuable, and they do things that are difficult or impossible to do without centralized management, but for your average distributed workload, they aren't strictly necessary.

During my undergrad training, I wrote a distributed prime number finder for my networking course.  If this sounds like it is way beyond undergrad level work, that's because it is.  I already had 20 years of experience, so I had to find ways to keep things interesting.  I often did this by doing far more advanced projects than were required, and in addition to keeping things interesting, this also allowed me to learn at my own level, giving my degree more meaning than a mere piece of paper saying I knew what I already knew before I even started it.  Anyhow, this distributed prime number finder was a challenge.  The concept was simple.  Counting by twos starting from 3 (in other words, skipping even numbers), it used trial division up to the square root of each number, storing primes in a file.  I had written a single threaded version of this around 8 years earlier, so the concept was very familiar to me.  This time I used Python, which has solid built in networking libraries (I used C the first time, but ran into the limits of C's data types).  This made the network communication fairly trivial.  So I wrote two programs, using a client/server architecture, where the server is the cluster master node program, and the client is a worker node program.  The master sends out ranges of numbers to test, and the clients return lists of the primes in those ranges.  And this is fast, despite being written in Python.  (That said, if I were to do it again, and I might, I would write the testing math in C and turn that into a Python module, to accelerate the math.)  This was very successful, and I ended up with a solid prime finding system that worked in a clustered fashion without needing a formal OS level cluster.
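
To give a flavor of how simple the worker side of something like this can be, here is a minimal sketch of a worker node.  This is not the original code; the message format (one JSON object per line), the port number, and the function names are assumptions made purely for illustration.  The structure is the point: connect to the master, receive a range, test it with trial division, and send back the primes.

    # Hypothetical worker node sketch, not the original program.
    import json
    import socket

    def primes_in_range(start, end):
        """Trial division up to the square root, testing odd numbers only."""
        found = []
        for n in range(start | 1, end, 2):  # force an odd starting point
            if n < 3:
                continue
            is_prime = True
            d = 3
            while d * d <= n:
                if n % d == 0:
                    is_prime = False
                    break
                d += 2
            if is_prime:
                found.append(n)
        return found

    def run_worker(host, port=5000):
        with socket.create_connection((host, port)) as conn:
            f = conn.makefile("rw")
            for line in f:                  # the master sends one range per line
                task = json.loads(line)
                if task.get("done"):        # the master signals when there is no more work
                    break
                primes = primes_in_range(task["start"], task["end"])
                f.write(json.dumps({"start": task["start"], "primes": primes}) + "\n")
                f.flush()

    if __name__ == "__main__":
        run_worker("192.168.1.10")          # hypothetical address of the master node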

This wasn't without some challenges though.  There was one in particular: ordering.  The primes were still just written to a file, but the program was no longer linear.  It might farm out different ranges to five different systems, but the results might not come back in order, and the file needed to be ordered.  The master created a thread for each client, much like a modern HTTP server, and those threads would store the results in objects passed into them.  When a client completed its range, a new range needed to be sent out to it, so I couldn't just make a queue, wait for the head client to finish, give it a new range, and send it to the end.  I needed to be able to give finished clients new work, even if they finished before the next one in order.  I also needed to be able to handle cases where a client connection closed unexpectedly, so that ranges wouldn't get skipped in those cases.

To solve this problem, I started with the queue.  The queue contained objects that tracked their range and the output of their associated threads, as well as the client for that thread.  When a client completed its range and returned the results, the thread recorded the results and terminated.  The master went through a loop, scanning the entire queue and creating new threads and assignments for any idle client.  When this was complete, it would check the first object in the queue to see if it was done.  If it was, the results would be written to the file, and the object removed.  Of course, this required mutexes, to avoid race conditions with reads and writes, but this was trivial in Python.  Compared to my single threaded, non-distributed version, this was incredibly fast on just a handful of machines.  In addition, this model was able to take advantage of multi-core CPUs in a highly controllable way.  If your system has multiple cores, you run one instance of the client for each core you want to use.  If you only want to use half of your cores, so that you can use the computer for something else without lag, run half as many client instances as you have cores.  This is something I did in testing, and it worked very well.
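
The original source isn't reproduced here, but the idea looks roughly like the sketch below.  The class name, chunk size, and function names are invented for illustration.  The point is that assignments sit in the queue in numeric order, any idle client can be handed the next unassigned range at any time, and results are only flushed from the head of the queue, so the output file stays in order no matter which range finishes first.

    # Hypothetical sketch of the ordering logic on the master, not the original code.
    import threading

    class Assignment:
        def __init__(self, start, end):
            self.start = start
            self.end = end
            self.primes = None          # filled in by a client's thread when its range is done

    lock = threading.Lock()             # guards the queue and the output file
    queue = []                          # assignments, kept in numeric order
    next_start = 3
    CHUNK = 100000                      # hypothetical range size

    def hand_out_range():
        """Create the next assignment, append it to the ordered queue, and return it."""
        global next_start
        with lock:
            a = Assignment(next_start, next_start + CHUNK)
            next_start += CHUNK
            queue.append(a)
        return a

    def flush_finished(outfile):
        """Write completed assignments to the file, but only from the head of the
        queue, so the output stays ordered even when later ranges finish first."""
        with lock:
            while queue and queue[0].primes is not None:
                a = queue.pop(0)
                if a.primes:
                    outfile.write("\n".join(str(p) for p in a.primes) + "\n")
            outfile.flush()

In this sketch, a per-client thread would call hand_out_range() whenever its worker went idle, fill in the primes field when the results came back, and the main loop would call flush_finished() periodically.  A dropped connection can be handled by handing that assignment's range to a different client rather than creating a new assignment, so nothing gets skipped.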

My final cluster was a quad core desktop machine, a Dell Xeon server with two CPUs (Pentium 4 class, so both single core), my dual core laptop, and a few Raspberry Pis.  This was before multi-core Pis, so I ran the server on one of the Pis along with one client node.  On my desktop, I ran four nodes.  On my laptop, I ran two nodes, and I ran two nodes on my dual CPU server.  On the other Pis, I also ran one node each.  This worked exactly as expected.  Overnight I ended up with several GB of prime numbers in a file on my Raspberry Pi master node, whose microSD card had very limited capacity.

What did I learn from this?

First, formal clusters aren't strictly necessary for doing high performance computing.  It's not that hard to write a client/server application for distributed computing.  And even load balancing isn't that hard.  My distributed prime finder automatically load balances, because it sends small tasks and only assigns new ones when the previous ones are complete.  This inherently gives more powerful systems more work and less powerful ones less work.

Second, client/server distributed computing models can be complex when order matters, but while that is a challenge, there are simple solutions that most experienced developers should be able to figure out.

Third, SSI clusters can simplify things by eliminating the networking stuff, but things like order dependence will still be a challenge whenever you are distributing work across multiple processes.

Fourth, if I were to do this again or wanted to make this into a polished product, there are a few things I would do differently.  Using Python for the master node is perfect, but I might use C, or a C module for Python, to further improve performance.  What's the point of distributed computing if you aren't going to optimize the program itself?  Python is an excellent language, but I would almost certainly get a big performance boost using C instead for the heavy lifting.  (That said, for this particular application, part of the reason I used Python was its arbitrary precision integers, which C does not have.  I suppose I could use GMP for that though, and 64 bit systems provide a much larger range, especially with GCC's support for 128 bit variables and math.)  I would also want to make this more easily tunable.  Currently, if I want to control the number of cores used, I have to start new instances or kill running ones.  It would be better if the clients were threaded and the number of threads could be controlled from the master node, so that only one client instance runs on each machine and the whole thing can be controlled from a single node.  And maybe, for use on untrusted systems, a redundancy level could be specified, such that each range is calculated on multiple different machines and the collective results are used to verify accuracy.
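
That last idea is straightforward to sketch.  Nothing like this exists in the current program, and the function below is purely hypothetical, but the redundancy check could be as simple as a vote over the results returned for the same range.

    # Hypothetical sketch of the redundancy check described above; no such code
    # exists in the current program.  Each range would be sent to several
    # workers, and the result is only accepted when enough of them agree.
    from collections import Counter

    def verify_redundant(results, required_agreement=2):
        """results is a list of prime lists returned by different workers for
        the same range; return the agreed-upon list, or None if too few agree."""
        if not results:
            return None
        counts = Counter(tuple(r) for r in results)
        answer, votes = counts.most_common(1)[0]
        return list(answer) if votes >= required_agreement else None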


The advantage SSI clusters have over this is that people with limited programming experience and knowledge can use them to distribute the work of existing programs that are already multi-process, while this informal clustering requires specially designed applications.  This strategy is certainly better (in my opinion) than OpenMPI, but it still requires copying the client program to all of the worker nodes and running it there, and it requires the application to be designed for it.

But there might still be a solution.  Python can parse and execute code at runtime, which means it should be possible to write a simple library with two parts.  The client portion could be run as-is, given just an IP address and port.  The server portion would provide ordered, unordered, and perhaps other modes, taking an "assignment" function that gives out tasks and a string that is sent to the clients as the compute program.  The downside is that you wouldn't want to be a client to an untrusted master node, because the program sent might be malicious, and Python sandboxing is notoriously difficult.
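
To be clear, no such library exists yet; this is just a rough sketch of what the client portion might look like, with the message format (a JSON setup message carrying the code, then one JSON task per line) and the compute() entry point both invented for illustration.

    # Hypothetical sketch of the client side of such a library.  The protocol
    # and the compute() entry point are assumptions, and exec()ing code from
    # the network is only reasonable on a trusted LAN.
    import json
    import socket

    def run_client(host, port=6000):
        with socket.create_connection((host, port)) as conn:
            f = conn.makefile("rw")
            setup = json.loads(f.readline())    # first message carries the compute program
            namespace = {}
            exec(setup["code"], namespace)      # expected to define compute(task)
            compute = namespace["compute"]
            for line in f:                      # then one task per line until the master disconnects
                task = json.loads(line)
                f.write(json.dumps(compute(task)) + "\n")
                f.flush()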

Maybe the value of SSI clusters is diminished, but if so, it's not because multi-core CPUs reduced it.  It's because distributed computing isn't that hard even without SSI clusters.  That said, SSI clusters are still incredibly valuable, and their value is only increased by multi-core CPUs and other improvements to hardware.  There will always be problems that take very large amounts of time to compute, where SSI clusters are significantly easier to use than networked distributed applications.