Processor binding on an Opteron
The AMD Opteron is an inherently NUMA design: each CPU has a built-in
memory controller and its own set of local memory. As with almost
all NUMA designs, if a process moves to a CPU that is not local to its
memory there is a performance degradation. The Linux scheduler supports
natural CPU affinity: the scheduler attempts to keep processes on
the same CPU for as long as practical. For most applications this may be
good enough, but for HPC we can see significant performance variation
between runs of the same program depending on if and when a process gets
migrated off its CPU.
It is possible to set the CPU affinity so that a process will run only
on the CPUs we specify, using the command taskset.
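taskset can also report or change the affinity of a process that is already running, via its -p option. A minimal sketch (the PID 1234 is a hypothetical stand-in for a real process):

```shell
# Print the current shell's affinity mask (taskset reports it in hex).
taskset -p $$

# Pin an already-running process to cpu1 (mask 2); 1234 is hypothetical:
#   taskset -p 2 1234
```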
For serial programs it is fairly trivial to bind the program to a
CPU when we start it with

taskset <mask> <program>
where mask is a bitmask over the CPUs, so that 1 selects cpu0, 2 selects
cpu1, 4 selects cpu2, 8 selects cpu3, 16 selects cpu4, and so on. Other
values could of course be used to select groups of CPUs, but this would
still allow process migration within the group and we would still see
the performance degradation.
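For example, binding a command to cpu0 (mask 1) and confirming that a child process inherits the mask (the echo is a stand-in for a real program):

```shell
# Pin the command to cpu0 (mask 1); replace "echo ..." with your program.
taskset 1 echo hello-from-cpu0

# Children inherit the binding, so a quick check is to have the child
# report its own affinity mask:
taskset 1 sh -c 'taskset -p $$'
```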
For MPI programs this is a little more complicated, since we
want to bind each process to a different CPU. On our cluster Maxwell I
experimented with a number of ways of doing this using a wrapper around
MPI, but they all had problems. The solution I settled on in the end
makes use of the fact that MPI uses ssh to start the processes. I
created an sshrc file (see the sshd(8) manual page for more information)
that is executed by ssh when the user logs in, just before the user's
command is run. In this script I change the processor binding of the
script's parent process, i.e. the sshd, alternating which CPU each
new process is bound to. The command that ssh then spawns inherits
this processor binding.
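The approach above can be sketched roughly as follows, as a possible sshrc. The CPU count and the counter file location are assumptions to adapt to your own nodes; the key steps are reading a per-node counter, computing a single-CPU mask from it, and rebinding the parent sshd:

```shell
# Sketch of an sshrc that round-robins CPU bindings across ssh sessions.
NCPUS=4                          # CPUs per node (assumption)
COUNTER=/tmp/cpu-binding-counter # per-node counter file (assumption)

# Read and increment the per-node counter.
n=$(cat "$COUNTER" 2>/dev/null || echo 0)
echo $((n + 1)) > "$COUNTER"

# Compute a single-CPU mask: bit (n mod NCPUS), printed in hex.
mask=$(printf '%x' $((1 << (n % NCPUS))))

# Rebind our parent (the sshd for this session); the user's command,
# spawned by that same sshd, inherits the binding.
taskset -p "$mask" $PPID
```

Because each incoming ssh connection increments the counter, successive MPI processes landing on the same node end up pinned to different CPUs.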