Cambridge-Cranfield HPCF > History
The CCHPCF, formerly the HPCF, began in the early 1990s through the efforts of a consortium of scientists from different departments who found themselves unable to obtain sufficient computing resources either locally or nationally. The consortium was successful in obtaining financial support from the University, the Newton Trust and Hitachi, as well as from grant applications including bids to the JREI.
This page gives details of past HPCF machines.
There are also some old benchmarks.
Turing was installed in 1996 and decommissioned at the end of February 1999.
Turing was a single-processor Hitachi S3600 vector supercomputer.
Turing had both a scalar and a vector processor. The vector unit could sustain speeds of over 1 GFLOPS, whereas the scalar unit struggled to get above 10 MFLOPS. Code which could not be vectorised would thus run faster on a standard departmental workstation. Some simple examples of the effect of vectorisation and vector length are given below:
Some Fortran timings (MFLOPS):

Length       +     *     /   SQRT   EXP   LOG   SIN   ATAN
1            3     3     2      0     0     0     0      0
10          34    33    25      5     5     3     6      4
100        291   287   177     43    16    31    65     36
1,000      715   715   371     63    18    46   111     54
10,000     827   826   379     63    18    46   111     54
100,000    850   849   386     63    18    46   111     54
1,000,000  857   870   389     63    18    46   111     54
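As an illustration of the kind of loops such figures describe, here is a minimal Fortran sketch (hypothetical code, not the benchmark actually used; names and sizes are invented):

    ! Hypothetical sketch of vectorisable and non-vectorisable loops;
    ! not the original benchmark code.
    program vec_demo
      implicit none
      integer, parameter :: n = 100000
      double precision :: a(n), b(n), c(n)
      integer :: i
      call random_number(a)
      call random_number(b)
      ! Independent, stride-1 operations: a vectorising compiler issues
      ! these to the vector unit, and throughput rises with the vector
      ! length, much as the table above shows.
      do i = 1, n
        c(i) = a(i) + b(i)
      end do
      ! A loop-carried dependence cannot be vectorised, so it runs on
      ! the scalar unit at a small fraction of the vector rate.
      do i = 2, n
        c(i) = c(i-1) + b(i)
      end do
      print *, c(n)
    end program vec_demo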
Memory access speeds also depended very much on the type of access: random (scalar) operations could access only about 2,500,000 values a second, whereas vector ones could access about 2,000,000,000.
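A minimal sketch of the two access patterns (the array and index names here are invented for illustration):

    ! Illustrative contrast between the two access patterns described
    ! above; invented names, not original HPCF code.
    subroutine access_demo(n, a, idx, s_seq, s_rnd)
      implicit none
      integer, intent(in) :: n
      integer, intent(in) :: idx(n)      ! a randomly ordered permutation of 1..n
      double precision, intent(in) :: a(n)
      double precision, intent(out) :: s_seq, s_rnd
      integer :: i
      s_seq = 0.0d0
      s_rnd = 0.0d0
      ! Sequential access to consecutive elements: the fast (vector) case.
      do i = 1, n
        s_seq = s_seq + a(i)
      end do
      ! Randomly ordered access through the index array: the slow case
      ! the paragraph above describes.
      do i = 1, n
        s_rnd = s_rnd + a(idx(i))
      end do
    end subroutine access_demo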
More details are available, as are a programming guide and a guide to tuning.
Babbage was installed in 1997 and decommissioned in June 2001.
Babbage was a Hitachi SR2201 parallel supercomputer, with 256 computation nodes and 16 I/O nodes (mainly for system processes and interactive use). Each processor was based on HP's PA-RISC technology, but with important enhancements which enabled it to deliver performance levels more normally associated with vector processors on suitably vectorisable code.
CPU:        256 x HARP-1E, 150 MHz; 2 floating-point pipelines, 1 load/store pipeline
GFLOPS:     256 x 0.3 = 76.8
Registers:  128 (as 4 slide-windowed sets)
Caches:     16 KB / 16 KB instruction/data primary cache
            512 KB / 512 KB secondary cache (both caches direct mapped and write through)
Memory:     256 MB per node, 56 GB total
Topology:   3-D crossbar, 300 MB/s
I/O:        16 additional dedicated I/O nodes
Disk:       350 GB local RAID disk, plus 30 GB of other local disks
The architecture of Babbage was thus similar to that of other distributed-memory parallel computers such as the Cray T3D and T3E, but unlike shared-memory computers such as the Silicon Graphics Origin 2000 series. Distributed-memory architectures tend to be faster and more scalable, provided that the code run on them parallelises well.
The innovative 'pseudovectorisation' extensions to the PA-RISC architecture enabled the preloading and poststoring of data between registers and memory, bypassing the caches. One 8-byte word could be read from or written to memory each clock cycle in this way, without interrupting the arithmetic units. A simple complex*16 inner product could thus sustain over 220 MFLOPS on a single processor for any large array size, even one far beyond the secondary cache size, and the BLAS function ZGEMM could exceed 90% of peak on realistic problem sizes. In these respects the machine was more similar to a vector machine like the Cray YMP than to a RISC-based one like the T3E.
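The "simple complex*16 inner product" meant here is just an ordinary dot-product loop; a minimal sketch (an invented name, not the tuned library code) is:

    ! Minimal sketch of a complex*16 inner product (invented name, not a
    ! BLAS routine).  On the SR2201, pseudovectorisation preloaded a(i)
    ! and b(i) straight from memory, bypassing the caches, which is why
    ! a loop like this could sustain much the same rate at any length.
    function zdot_sketch(n, a, b) result(s)
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, intent(in) :: n
      complex(dp), intent(in) :: a(n), b(n)   ! complex*16 in the older notation
      complex(dp) :: s
      integer :: i
      s = (0.0_dp, 0.0_dp)
      do i = 1, n
        s = s + a(i) * b(i)
      end do
    end function zdot_sketch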
Hitachi also added block TLBs (translation lookaside buffers) to the PA-RISC design, giving 6 BTLBs. The usual TLB had 256 entries, each covering a 4 KB page, and thus a total coverage of 1 MB; a single BTLB entry, in contrast, covered up to 32 MB. This sort of coverage is very important for the many scientific codes which use arrays much greater than 1 MB in size.
Babbage ran an MPP version of HI-UX, a 32-bit version of UNIX, and appeared to the user to be very similar to a single UNIX workstation.
Babbage was broken down by software into separate partitions:
No. of partitions   Nodes   Time limit   Memory    GFLOPS
3                      64   8 hours      14.4 GB     19.2
1                      32   8 hours       7.2 GB      9.6
1                      16   8 hours       3.6 GB      4.8
1                       8   8 hours       1.8 GB      2.4
1                       4   1 hour        0.9 GB      1.2
1                       4   10 min        0.9 GB      1.2
Jobs were submitted from two front-end workstations, Hooke and Lovelace.
Hodgkin was installed in 1999 and decommissioned in October 2003.
Hodgkin was an SGI Origin 2000 SMP supercomputer, with 64 processors and an Onyx 2 graphics module. The image shows the graphics module on the left and three of the four CPU racks; Hartree is just visible behind Hodgkin.
CPU:       64 x MIPS R12000, 300 MHz
Caches:    32 KB 2-way set-associative data cache
           32 KB 2-way set-associative instruction cache
           8 MB secondary cache
Memory:    4 GB per node, 128 GB total
Topology:  2 CPUs per node, 2 nodes per router; routers form a 4-D hypercube
Disk:      1 TB local disk
Graphics:  2 "Infinite Reality" graphics pipelines
The Origin 2000 was a very well-known design: the most scalable of the cache-coherent shared-memory machines, it provided a nearly 'flat' access model to all of its main memory. The time taken to load a datum from main memory into a processor varied from about 1 microsecond for nearby memory to about 10 microseconds for the furthest memory on a 512-processor system; on Hodgkin, the worst-case memory access time was about 6 microseconds.
Hodgkin ran IRIX 6, a 64-bit version of UNIX, and appeared to the user to be very similar to a single UNIX workstation.
Hodgkin was not broken down into separate partitions, but the scheduler attempted to balance different sizes of job; the following sizes were supported:
CPUs   Memory   GFLOPS
32     64 GB     19.2   (for testing and development only)
16     32 GB      9.6
8      16 GB      4.8
4       8 GB      2.4
2       4 GB      1.2
1       2 GB      0.6
There were three types of queue: production queues (with a 24-hour limit), development queues (with a 2-hour limit) and testing queues (with a 10-minute limit). All combinations of type and size existed, and all queues ran continuously, except that the 32-CPU production queue was available only by special arrangement.
Biography of Dorothy Hodgkin
Hartree was installed in 2001 and decommissioned in January 2005.
Hartree was an IBM SP parallel supercomputer; from 2001 until June 2003 it comprised 10 computation nodes and 2 I/O nodes (mainly for system processes and interactive use). Each computation node contained 16 Power3-II processors running at 375 MHz and 12 GB of memory. The nodes were connected via IBM's high-performance SP Switch2. The above picture shows the three racks which contained Hartree's nodes and the slightly shorter rack containing its RAID array.
CPU:       10 x 16 x Power3-II, 375 MHz, 4 floating-point pipelines
GFLOPS:    160 x 1.5 = 240
Caches:    128 KB 128-way set-associative instruction cache
           128 KB 128-way set-associative data cache
           8 MB secondary cache
Memory:    12 GB per node, 120 GB total
Topology:  crossbar (effectively), 500 MB/s
I/O:       2 additional dedicated I/O nodes with 4 CPUs each
Disk:      1 TB local RAID disk
From June 2003 until it was decommissioned in January 2005, Hartree comprised 8 computation nodes and 2 I/O nodes. Each computation node contained 8 Power4 processors running at 1.1 GHz and 16 GB of memory; the I/O nodes each contained four 375 MHz Power3 processors and 8 GB of memory. The nodes were connected via IBM's high-performance SP Switch2.
CPU:       8 x 8 x Power4, 1.1 GHz, 4 floating-point pipelines
GFLOPS:    64 x 4.4 = 282
Caches:    128 KB 128-way set-associative instruction cache
           128 KB 128-way set-associative data cache
           8 MB secondary cache
Memory:    16 GB per node, 128 GB total
Topology:  crossbar (effectively), 500 MB/s
I/O:       2 additional dedicated I/O nodes with 4 CPUs each
Disk:      1 TB local RAID disk
The IBM SP thus provided individual nodes which were themselves of considerable compute power, connected by a high-performance switch. The CPUs were cache-based RISC processors, so optimisation issues were similar on the SP and on a normal desktop workstation. The processors were capable both of explicit prefetching and of automatic streaming, in order to sustain the high memory bandwidths that many large scientific simulations require.
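By way of illustration (the routine name here is invented), the kind of unit-stride sweep that this prefetch and streaming hardware handles well is:

    ! Illustrative unit-stride update (a DAXPY-like sweep, invented name).
    ! The hardware detects the sequential access pattern and streams x
    ! and y in from memory ahead of use; there is no vector unit, so the
    ! tuning concerns are much as on an ordinary cache-based workstation.
    subroutine axpy_sketch(n, alpha, x, y)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in) :: alpha, x(n)
      double precision, intent(inout) :: y(n)
      integer :: i
      do i = 1, n
        y(i) = y(i) + alpha * x(i)
      end do
    end subroutine axpy_sketch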
Hartree ran AIX, IBM's UNIX.
In its initial configuration, Hartree's queues allowed jobs of up to 12 hours and 64 processors (96 GFLOPS and 40 GB); larger jobs (up to 128 processors and 80 GB) could be run when required. In its second configuration, the queues allowed jobs of up to 12 hours and 32 processors (140 GFLOPS and 64 GB).
Biography of Douglas Hartree
The CCHPCF also had a four-processor Origin 2000 fileserver with 1 TB of disk (Mott, shown left). Mott was retired at the end of 2003.