Threads, processes, cores, sockets
I wrote this response to a user asking about threads/cores/sockets and how they relate to Slurm. Perhaps useful for documentation once it's been cleaned up and edited for clarity and accuracy?
The essay
This is a source of constant confusion, since Slurm is not always consistent with standard terminology, or even with itself! I'll try to clear this up, but please let me know if you have further questions.
It helps to have some background on Slurm's history and that of the HPC landscape in general. When Slurm was first released in 2002 out of Lawrence Livermore National Laboratory, HPC was primarily used by fields with massively parallel simulation work, like geophysics and computational chemistry. These fields largely required tightly-coupled parallelism and multi-node communication, and the overwhelming majority of their codes used communication libraries conforming to the Message Passing Interface (MPI) standard to do so. MPI follows a task-based parallelization paradigm - when you run an MPI program over n tasks, you launch n independent copies of the program, which are collected into a communication "world" and each assigned a number (a "rank," in MPI terminology) that governs which task communicates with which other tasks.
So, Slurm was developed with MPI workloads in mind. The main problem to solve at the time was how to launch these MPI tasks on separate servers and facilitate the communication between them. To MPI, it matters very little whether task A and task B are on the same node or not; it just needs to know where to look and what protocol to use to send the data. Since this was the prevailing paradigm during Slurm's development, tasks are Slurm's main unit of work. The flag "-n" refers to tasks, which default to one CPU core each (more on that in a minute). For an MPI program, you use "-n" to specify the total number of tasks you want to start, then in the batch script launch your program with "srun", which starts the n copies on the appropriate nodes using the appropriate resources.
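As a minimal sketch (the task count and program name are placeholders, not recommendations), a batch script for a pure MPI job might look like this:

    #!/bin/bash
    #SBATCH --ntasks=8           # 8 MPI tasks, 1 core each by default
    #SBATCH --time=01:00:00
    #SBATCH --mem-per-cpu=2G

    # srun launches one copy of the program per task; Slurm may place the
    # 8 copies on one node or spread them across several.
    srun ./my_mpi_program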
MPI is fantastic, but it has a critical weakness (though "weakness" is debatable - it's really just a feature of the parallelization model). By default, each MPI task manages its own memory and cannot access the memory of other tasks within the same communication world. So, if task A needs to tell task B something, task A has to explicitly send it; task B can't just look at task A's memory and retrieve it itself. (Note: this is now an oversimplification - the MPI-3 standard added tools for true shared-memory programming. But distributed memory is still the default.)
Somewhat concurrently with MPI's development, OpenMP also emerged in the 90s as a method for standardizing parallelization. (Note: it's sometimes confused with OpenMPI, an MPI implementation, but they're distinct things.) In contrast to MPI, OpenMP operates by default with shared memory: every OpenMP thread has access to the same memory and can read or write to it at will. This has some major advantages - threads don't need to send or wait to receive information from other threads, and there's much less data duplication overall, generally resulting in a lower memory footprint. However, because the threads must share memory, this doesn't work when multiple servers need to be marshalled to solve a problem.
Computational scientists developed hybrid parallel programming to leverage the best of both worlds. In this method, MPI serves as the top layer of parallelization so the workload can scale across servers, while each task uses OpenMP internally to leverage multiple cores for local computation. So, on a node with 16 cores, you could run 4 MPI tasks, each using 4 OpenMP threads.
Slurm incorporated this paradigm by separating the concept of a task from that of a core. Annoyingly, the flag used to set cores per task is actually --cpus-per-task=X, which causes confusion since it doesn't use the standard definition of a CPU.
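A minimal sketch of such a hybrid job on the 16-core node described above (the program name is a placeholder) could look like this:

    #!/bin/bash
    #SBATCH --nodes=1            # keep the whole job on one node for this example
    #SBATCH --ntasks=4           # 4 MPI tasks
    #SBATCH --cpus-per-task=4    # 4 cores (OpenMP threads) per task = 16 cores

    # Tell OpenMP to use as many threads as Slurm gave each task
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_hybrid_program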
So, what is standard terminology then? Here's what I think is largely agreed upon (from smallest to largest unit):
Thread: this is more of a software concept than a hardware one. A thread is essentially a sequence of instructions being executed. A multithreaded program can, in theory, run its threads in parallel. However, if your hardware can't actually run those threads simultaneously, the OS decides which thread gets to execute when; a thread may not run start to finish on one hardware core and will move around as the OS sees fit. Switching between threads has overhead, so if you have too many threads and they're all fairly active, your OS will have to swap between them constantly, degrading performance overall. This is muddied a bit by Intel's hyperthreading (their name for simultaneous multithreading), where one physical core presents itself as two logical CPUs that can each run a thread; it's turned off across the board on Unity since it doesn't generally work well for HPC.
Process: also a software concept; loosely, a running program made up of one or more threads that share the same memory.
Core: a core is the part of a CPU that can execute a thread independently. So, a 16-core CPU can, in theory, execute 16 simultaneous threads. Note that performance scaling is murkier - throwing 16 cores at a problem will rarely give a 16x speedup, since memory access and communication between processes/tasks/threads add up.
CPU (hardware) socket: hardware sockets, loosely, can be thought of as how many CPUs a server can seat. Sockets can matter a lot for certain types of computation. Loading data into, seeking to the correct position in, and reading data from memory all take time, which adds up quickly if you perform a lot of those operations during a computation. On non-uniform memory access (NUMA) nodes, the most common design on Unity, the server's RAM has locality to a certain socket. So, accessing RAM from a task on a socket that does not have locality to that RAM will be slower than if the RAM and the task are on the same socket. Sockets are more or less synonymous with CPUs nowadays.
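To see how these terms map onto real hardware, you can run lscpu on a node. On a hypothetical dual-socket node with 64-core CPUs and hyperthreading disabled (the numbers here are illustrative, not a description of any specific Unity node), the relevant lines would look something like:

    $ lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
    CPU(s):                128
    Thread(s) per core:    1
    Core(s) per socket:    64
    Socket(s):             2

Here the total of 128 is just 2 sockets x 64 cores per socket x 1 thread per core.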
Now that that's sorted out, let's see how this matches up with Slurm flags:
--sockets-per-node: This selects nodes with at least that number of sockets. The majority of nodes on Unity have two sockets (though not all! I believe the ARM architecture nodes have one socket only, as is typical for ARM systems).
--cores-per-socket: This is essentially cores per CPU. For example, cpu050 has two sockets, each containing a 64-core AMD EPYC 7763. So, setting --cores-per-socket=64 could select cpu050 or anything with a higher core count per socket.
--threads-per-core: setting this to anything but 1 on Unity will just result in your job not being scheduled (except on the power9 nodes, I think). Threads per core is set in the Slurm config file, and we have no x86_64 nodes that support hyperthreading.
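As an illustrative sketch combining the three selection flags above (the numbers and program name are placeholders, not recommendations), you could ask for a node shaped like cpu050 while still only being allocated a slice of it:

    #!/bin/bash
    #SBATCH --sockets-per-node=2     # only consider nodes with at least 2 sockets
    #SBATCH --cores-per-socket=64    # ...each with at least 64 cores
    #SBATCH --threads-per-core=1     # anything else won't schedule on Unity's x86_64 nodes
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16       # the actual allocation: 16 cores of that node

    ./my_program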
Those three selection flags are about picking a node, but more or less don't affect the resources your job is allocated once a node is selected. For that, we have:
--nodes, -N: the number of nodes requested. This is frequently used to confine tasks to a certain number of nodes; otherwise, Slurm will distribute tasks however it can fit them in fastest. Also frequently used with --exclusive, which allocates all resources on the selected node(s) to the job.
--ntasks, -n: sets the number of tasks. This is usually meant for MPI tasks; there are ways to leverage multiple tasks outside of the MPI paradigm, but it's not typical. Tasks can be allocated on different nodes. Defaults to 1.
--cpus-per-task, -c: sets the number of cores allocated to each task. So, if you set --ntasks=3 and --cpus-per-task=4, your job will require 12 cores total. Note: Slurm will never split one task's cores across nodes, so a single task always has all of its cores on one node. This is the most frustratingly named Slurm setting, IMO.
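To make the arithmetic concrete, here's a sketch (the counts are arbitrary examples and the program name is a placeholder) combining the three allocation flags:

    #!/bin/bash
    #SBATCH --nodes=2            # confine the job to exactly 2 nodes
    #SBATCH --ntasks=8           # 8 tasks total (e.g. 8 MPI ranks)
    #SBATCH --cpus-per-task=4    # 4 cores per task -> 32 cores in total

    # Each task keeps its 4 cores together on one node; the 8 tasks are
    # spread across the 2 nodes however Slurm can place them.
    srun ./my_mpi_program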
There are about a thousand more Slurm settings for tweaking what resources you're allocated, but those are the big ones for CPU parallel programs and are sufficient for most users.
Now, over 1300 words later, I can start to answer your question about what you should do, practically. Computational biology is a relative newcomer to the HPC landscape and historically doesn't use MPI workloads. Workloads in your field are much more likely to be arrays of jobs that are either serial or have a relatively low parallelization need (20 cores is relatively low nowadays!). Typically, the software also has no idea what Slurm means when it allocates multiple tasks, so what I frequently see happen is that a user requests n tasks for a program's n processes or threads, but the program runs entirely within a single task and ends up jamming everything onto that task's 1 core. So, -c n / --cpus-per-task=n is the correct choice here, which gives your single task all n cores on one node. Once that's allocated, Slurm doesn't care how your program handles parallelization internally; it simply allows it access to all of the cores requested.
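In practice, that looks something like the following sketch; the program name and its --threads option are hypothetical stand-ins for whatever your actual tool accepts:

    #!/bin/bash
    #SBATCH --ntasks=1            # a single task...
    #SBATCH --cpus-per-task=20    # ...allocated 20 cores on one node

    # Hand the allocated core count to the program's own threading option
    ./my_bio_tool --threads $SLURM_CPUS_PER_TASK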
Please let me know what further questions you have. This is not easy to grapple with and Slurm is famously inconsistent in labeling, which makes matters worse!