Processor binding on an Opteron
The AMD Opteron is an inherently NUMA design: each CPU has a built-in
memory controller and its own set of local memory. As with almost
all NUMA designs, if a process moves to a CPU that is not local to its
memory there is a performance degradation. The Linux scheduler supports
natural CPU affinity: the scheduler attempts to keep processes on
the same CPU for as long as practical. For most applications this may be
good enough, but for HPC we can see significant performance variation
between runs of the same program depending on if and when a process gets
migrated off its CPU.
It is possible to set the CPU affinity so that a process will run only
on the CPUs we specify, using the command taskset.
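taskset can also report or change the affinity of a process that is already running, via its -p option. A minimal sketch (the PID 1234 is a hypothetical stand-in for a real process):

```shell
# Print the current shell's affinity mask (taskset reports it in hex).
taskset -p $$

# Pin an already-running process to cpu1 (mask 2); 1234 is hypothetical:
#   taskset -p 2 1234
```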
For serial programs it is fairly trivial to bind the program to a
CPU when we start it with

taskset <mask> <program>
where mask is a bitmask over the CPUs, so that 1 selects cpu0, 2 selects
cpu1, 4 selects cpu2, 8 selects cpu3, 16 selects cpu4, and so on. Other
values could of course be used to select groups of CPUs, but this would
still allow process migration within the group and we would still see
the performance degradation.
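For example, binding a command to cpu0 (mask 1) and confirming that a child process inherits the mask (the echo is a stand-in for a real program):

```shell
# Pin the command to cpu0 (mask 1); replace "echo ..." with your program.
taskset 1 echo hello-from-cpu0

# Children inherit the binding, so a quick check is to have the child
# report its own affinity mask:
taskset 1 sh -c 'taskset -p $$'
```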
For MPI programs this is a little more complicated, since we
want to bind each process to a different CPU. On our cluster Maxwell I
experimented with a number of ways of doing this using a wrapper around
MPI, but they all had problems. The solution I settled on in the end
makes use of the fact that MPI uses ssh to start the processes. I
created an sshrc file (see the sshd(8) manual page for more information)
that is executed by ssh when the user logs in, just before the user's
command is run. In this script I change the processor binding of the
script's parent process, i.e. the sshd, alternating which CPU each
new process is bound to. The command that ssh then spawns inherits
this processor binding.
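The approach above can be sketched roughly as follows, as a possible sshrc. The CPU count and the counter file location are assumptions to adapt to your own nodes; the key steps are reading a per-node counter, computing a single-CPU mask from it, and rebinding the parent sshd:

```shell
# Sketch of an sshrc that round-robins CPU bindings across ssh sessions.
NCPUS=4                          # CPUs per node (assumption)
COUNTER=/tmp/cpu-binding-counter # per-node counter file (assumption)

# Read and increment the per-node counter.
n=$(cat "$COUNTER" 2>/dev/null || echo 0)
echo $((n + 1)) > "$COUNTER"

# Compute a single-CPU mask: bit (n mod NCPUS), printed in hex.
mask=$(printf '%x' $((1 << (n % NCPUS))))

# Rebind our parent (the sshd for this session); the user's command,
# spawned by that same sshd, inherits the binding.
taskset -p "$mask" $PPID
```

Because each incoming ssh connection increments the counter, successive MPI processes landing on the same node end up pinned to different CPUs.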