History of High Performance Computing in Cambridge

The nodes visible from outside the HPCF are hartree.hpcf.cam.ac.uk and hartree-2.hpcf.cam.ac.uk. Both support ssh. These are hartree_a and hartree_e respectively and are I/O nodes. The compute nodes are hartree_b to hartree_j, skipping e. Compilation and job submission can be done interactively on either I/O node.

Compilation

Code is compiled on hartree_a, which has power3 processors, but runs on the compute nodes, which have power4 processors, so some care is needed.
The default options tune for power4 but allow the code to run on power3 for debugging.

Compilers

The compilers mpxlf, mpxlf90 and mpxlc handle MPI programs in Fortran, Fortran 90 and C respectively. Note that there is no need to add -lmpi: the MPI library is linked automatically.

The current MPI release supports 64-bit programs only with the thread-safe versions of the compilers, e.g. mpxlf_r for Fortran.

mpxlf90 will not compile F90 programs whose names end in `.f90'. For these, use `mpxlf90 -qsuffix=f=f90'.
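
As a small illustration of the suffix rule, the following shell sketch picks the compile command from the file extension. The function name f90_compile_cmd and the file names are hypothetical, and the function only prints the command rather than running it, so it can be tried on any machine.

```shell
# Hypothetical helper: print the mpxlf90 command for one source file.
# It only echoes the command; it does not invoke the compiler.
f90_compile_cmd() {
    case "$1" in
        *.f90) echo "mpxlf90 -qsuffix=f=f90 -c $1" ;;  # .f90 needs -qsuffix
        *)     echo "mpxlf90 -c $1" ;;                  # other names passed as is
    esac
}

f90_compile_cmd solver.f90   # hypothetical file name
f90_compile_cmd legacy.f     # hypothetical file name
```

A makefile rule or wrapper script could use the same test on the extension.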

Libraries

For BLAS etc, add -lessl

For LAPACK add -llapack. (This is merely a version I have compiled. I used the Call Conversion Interface, which means that the parts of LAPACK in the ESSL library are used in preference to the standard Fortran ones. It is therefore linked against the ESSL, so you have to link with -lessl as well.)
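
The linking rule above (LAPACK always needs the ESSL too) can be sketched as a small shell function. The name lib_flags and the keywords "blas" and "lapack" are made up for this illustration; only the -llapack and -lessl flags come from the text.

```shell
# Hypothetical helper: print the link flags for the requested libraries.
# Accepts the keywords "blas" and "lapack" (names invented for this sketch).
lib_flags() {
    flags=""
    need_essl=no
    for lib in "$@"; do
        case "$lib" in
            blas)   need_essl=yes ;;
            lapack) flags="$flags -llapack"; need_essl=yes ;;  # LAPACK pulls in ESSL
            *)      echo "unknown library: $lib" >&2; return 1 ;;
        esac
    done
    if [ "$need_essl" = yes ]; then
        flags="$flags -lessl"
    fi
    echo "${flags# }"   # strip the leading space
}

# Print (not run) a link line for a hypothetical program using LAPACK.
echo "mpxlf90 -o prog prog.f $(lib_flags lapack)"
```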

You might want to try the mass library as an alternative to the standard maths library. The maths functions it contains can be up to twice as fast as those in the standard maths library. The vectorized version, which requires you to alter your code, can yield up to a sixfold speed-up.
-lmass      scalar version
-lmassp3v   vector version, but needs code changes

"In some cases MASS is not as accurate as the system library"
"Compared to the standard mathematical library, libm.a, the MASS library can only differ in the last bit"
Further details of performance and accuracy are given in the MASS documentation.

Compiler Options

The default compiler options on hartree are -O3 -qtune=pwr4.

To get any reasonable optimisation, specify at least -O3
To improve performance further, -O4 or -O5 can be used. They set -qessl -qhot -qipa -qarch=auto -qtune=auto -qcache=auto. The auto settings tune for the machine you are compiling on, so you need to add -qtune=pwr4 -qarch=pwr4 if you use -O4 or higher. Other options that might help include -qlargepage.
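
A minimal sketch of the rule above: for -O4 and -O5 the power4 settings are added explicitly, while -O3 keeps the documented defaults. The function name opt_flags is hypothetical, and the function only prints flags.

```shell
# Hypothetical helper: print optimisation flags for a given -O level.
opt_flags() {
    case "$1" in
        3)   echo "-O3 -qtune=pwr4" ;;                  # the documented defaults
        4|5) echo "-O$1 -qarch=pwr4 -qtune=pwr4" ;;     # replace the auto settings
        *)   echo "unsupported level: $1" >&2; return 1 ;;
    esac
}

# Print (not run) a compile line for a hypothetical source file.
echo "mpxlf90 $(opt_flags 4) -o prog prog.f"
```

Bear in mind that code built with -qarch=pwr4 is aimed at the compute nodes only.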

Use -qstrict if you are worried about non-bitwise identical results.
Consider using -qipa and -Q to perform interprocedural optimisations and more procedure inlining. (The former can dramatically increase compile time.)
Consider using -qfloat=hsflt for faster single precision expressions. hsflt is the highest performing option, but it is unsafe since exponent overflow can go undetected. The highest performing safe option is -qfloat=hssngl.
However, there is a potentially important compiler optimization (reciprocal multiply) that is enabled only if -qfloat=hsflt is specified.

Finally, if it is worth the effort, compile with the additional option
-qpdf1
Run the program with a variety of typical data sets, then recompile with
-qpdf2 instead of -qpdf1
This process should optimise your code based on how it actually runs.
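
The three steps can be sketched as a shell function. This is only an outline under stated assumptions: the name pdf_build, the file names, and the way the program is given its data (as a command-line argument) are all hypothetical, and how jobs are actually run on the compute nodes is not covered here. By default the script prints each command instead of executing it.

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN="" in the environment to execute the commands for real.
RUN=${RUN-echo}

# Usage: pdf_build source executable dataset...
pdf_build() {
    src=$1; exe=$2; shift 2
    # Step 1: build an instrumented executable.
    $RUN mpxlf90 -O3 -qpdf1 -o "$exe" "$src"
    # Step 2: run it on each typical data set to collect profile data.
    for data in "$@"; do
        $RUN "./$exe" "$data"
    done
    # Step 3: rebuild with the same options, swapping -qpdf1 for -qpdf2.
    $RUN mpxlf90 -O3 -qpdf2 -o "$exe" "$src"
}

pdf_build prog.f prog small.dat large.dat   # hypothetical names
```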

Memory usage and Linking

By default the compiler works in 32-bit mode. To use more than 256 MB of memory you must link with -bmaxdata:0xY0000000, where Y is the number of 256 MB segments you want (8 to use the maximum available).

The same applies to -bmaxstack:0xY0000000, but the default is 512 MB.
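
To make the arithmetic concrete: one segment is 0x10000000 bytes (2^28, i.e. 256 MB), so Y segments give 0xY0000000. The sketch below builds the -bmaxdata flag from a segment count; the function name maxdata_flag is hypothetical and it only prints the flag.

```shell
# Hypothetical helper: print the -bmaxdata flag for N segments of 256 MB.
# N must be a single digit from 1 to 8, as described in the text.
maxdata_flag() {
    case "$1" in
        [1-8]) printf '%s\n' "-bmaxdata:0x${1}0000000" ;;
        *)     echo "segments must be 1 to 8" >&2; return 1 ;;
    esac
}

maxdata_flag 4    # 4 x 256 MB = 1 GB of data
maxdata_flag 8    # 8 x 256 MB = 2 GB, the maximum
```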

Further Information

Online manuals: http://www.hpcf.cam.ac.uk/manuals/ibm/

The man command!

Email support: hpcf-support@ucs.cam.ac.uk