Cambridge-Cranfield HPCF > History
The CCHPCF, formerly the HPCF, began in the early 1990s through the efforts of a consortium of scientists from different departments who found themselves unable to obtain sufficient computing resources either locally or nationally. The consortium was successful in obtaining financial support from the University, the Newton Trust and Hitachi, as well as from grant applications including bids to the JREI.
This page gives details of past HPCF machines.
There are also some old benchmarks.
Turing was installed in 1996 and decommissioned at the end of February 1999.
Turing was a single-processor Hitachi S3600 vector supercomputer.
Turing had both a scalar and a vector processor. The vector unit could sustain speeds of over 1 GFLOPS, whereas the scalar unit struggled to get above 10 MFLOPS. Code which could not be vectorised would thus run faster on a standard departmental workstation. Some simple examples of the effect of vectorisation and vector length are given below:
Some Fortran timings (MFLOPS):

Length       +     *     /   SQRT   EXP   LOG   SIN   ATAN
1            3     3     2      0     0     0     0      0
10          34    33    25      5     5     3     6      4
100        291   287   177     43    16    31    65     36
1,000      715   715   371     63    18    46   111     54
10,000     827   826   379     63    18    46   111     54
100,000    850   849   386     63    18    46   111     54
1,000,000  857   870   389     63    18    46   111     54
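As an illustration of the kind of loops such figures describe, here is a minimal Fortran sketch (hypothetical code, not the benchmark actually used; names and sizes are invented):

    ! Hypothetical sketch of vectorisable and non-vectorisable loops;
    ! not the original benchmark code.
    program vec_demo
      implicit none
      integer, parameter :: n = 100000
      double precision :: a(n), b(n), c(n)
      integer :: i
      call random_number(a)
      call random_number(b)
      ! Independent, stride-1 operations: a vectorising compiler issues
      ! these to the vector unit, and throughput rises with the vector
      ! length, much as the table above shows.
      do i = 1, n
        c(i) = a(i) + b(i)
      end do
      ! A loop-carried dependence cannot be vectorised, so it runs on
      ! the scalar unit at a small fraction of the vector rate.
      do i = 2, n
        c(i) = c(i-1) + b(i)
      end do
      print *, c(n)
    end program vec_demo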
Memory access speeds also depended very much on the type of access: random (scalar) operations could access only about 2,500,000 values a second, whereas vector ones could access about 2,000,000,000.
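A minimal sketch of the two access patterns (the array and index names here are invented for illustration):

    ! Illustrative contrast between the two access patterns described
    ! above; invented names, not original HPCF code.
    subroutine access_demo(n, a, idx, s_seq, s_rnd)
      implicit none
      integer, intent(in) :: n
      integer, intent(in) :: idx(n)      ! a randomly ordered permutation of 1..n
      double precision, intent(in) :: a(n)
      double precision, intent(out) :: s_seq, s_rnd
      integer :: i
      s_seq = 0.0d0
      s_rnd = 0.0d0
      ! Sequential access to consecutive elements: the fast (vector) case.
      do i = 1, n
        s_seq = s_seq + a(i)
      end do
      ! Randomly ordered access through the index array: the slow case
      ! the paragraph above describes.
      do i = 1, n
        s_rnd = s_rnd + a(idx(i))
      end do
    end subroutine access_demo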
More details are available, as are a programming guide and a guide to tuning.
Babbage was installed in 1997 and decommissioned in June 2001.
Babbage was a Hitachi SR2201 parallel supercomputer, with 256 computation nodes and 16 I/O nodes (mainly for system processes and interactive use). Each processor was based on HP's PA-RISC technology, but with important enhancements which enabled it to deliver performance levels more normally associated with vector processors on suitably vectorisable code.
CPU:        256 x HARP-1E, 150 MHz; 2 floating-point pipelines, 1 load/store pipeline
GFLOPS:     256 x 0.3 = 76.8
Registers:  128 (as 4 slide-windowed sets)
Caches:     16 KB / 16 KB instruction/data primary cache
            512 KB / 512 KB secondary cache (both caches direct mapped and write through)
Memory:     256 MB per node, 56 GB total
Topology:   3-D crossbar, 300 MB/s
I/O:        16 additional dedicated I/O nodes
Disk:       350 GB local RAID disk, plus 30 GB of other local disks
The architecture of Babbage was thus similar to that of other distributed-memory parallel computers such as the Cray T3D and T3E, but unlike shared-memory computers such as the Silicon Graphics Origin 2000 series. Distributed-memory architectures tend to be faster and more scalable, provided that the code run on them parallelises well.
The innovative 'pseudovectorisation' extensions to the PA-RISC architecture enabled the preloading and poststoring of data between registers and memory, bypassing the caches. One 8-byte word could be read from or written to memory each clock cycle in this way, without interrupting the arithmetic units. A simple complex*16 inner product could thus sustain over 220 MFLOPS on a single processor for any large array size, even one far beyond the secondary cache size, and the BLAS function ZGEMM could exceed 90% of peak on realistic problem sizes. In these respects the machine was more similar to a vector machine like the Cray YMP than to a RISC-based one like the T3E.
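The "simple complex*16 inner product" meant here is just an ordinary dot-product loop; a minimal sketch (an invented name, not the tuned library code) is:

    ! Minimal sketch of a complex*16 inner product (invented name, not a
    ! BLAS routine).  On the SR2201, pseudovectorisation preloaded a(i)
    ! and b(i) straight from memory, bypassing the caches, which is why
    ! a loop like this could sustain much the same rate at any length.
    function zdot_sketch(n, a, b) result(s)
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, intent(in) :: n
      complex(dp), intent(in) :: a(n), b(n)   ! complex*16 in the older notation
      complex(dp) :: s
      integer :: i
      s = (0.0_dp, 0.0_dp)
      do i = 1, n
        s = s + a(i) * b(i)
      end do
    end function zdot_sketch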
Hitachi also added block TLBs (translation lookaside buffers) to the PA-RISC design, giving 6 BTLBs. The usual TLB had 256 entries, each covering a 4 KB page, and thus a total coverage of 1 MB; a single BTLB entry, in contrast, covered up to 32 MB. This sort of coverage is very important for the many scientific codes which use arrays much greater than 1 MB in size.
Babbage ran an MPP version of HI-UX, a 32-bit version of UNIX, and appeared to the user to be very similar to a single UNIX workstation.
Babbage was broken down by software into separate partitions:
No. of partitions   Nodes   Time limit   Memory    GFLOPS
3                      64   8 hours      14.4 GB     19.2
1                      32   8 hours       7.2 GB      9.6
1                      16   8 hours       3.6 GB      4.8
1                       8   8 hours       1.8 GB      2.4
1                       4   1 hour        0.9 GB      1.2
1                       4   10 min        0.9 GB      1.2
Jobs were submitted from two front-end workstations, Hooke and Lovelace.
Hodgkin was installed in 1999 and decommissioned in October 2003.
Hodgkin was an SGI Origin 2000 SMP supercomputer, with 64 processors and an Onyx 2 graphics module. The image shows the graphics module on the left and three of the four CPU racks; Hartree is just visible behind Hodgkin.
CPU:       64 x MIPS R12000, 300 MHz
Caches:    32 KB 2-way set-associative data cache
           32 KB 2-way set-associative instruction cache
           8 MB secondary cache
Memory:    4 GB per node, 128 GB total
Topology:  2 CPUs per node, 2 nodes per router; routers form a 4-D hypercube
Disk:      1 TB local disk
Graphics:  2 "Infinite Reality" graphics pipelines
The Origin 2000 was a very well-known design: the most scalable of the cache-coherent shared-memory machines, it provided a nearly 'flat' access model to all of its main memory. The time taken to load a datum from main memory into a processor varied from about 1 microsecond for nearby memory to about 10 microseconds for the furthest memory on a 512-processor system; on Hodgkin, the worst-case memory access time was about 6 microseconds.
Hodgkin ran IRIX 6, a 64-bit version of UNIX, and appeared to the user to be very similar to a single UNIX workstation.
Hodgkin was not broken down into separate partitions, but the scheduler attempted to balance different sizes of job; the following sizes were supported:
CPUs   Memory   GFLOPS
32     64 GB     19.2   (for testing and development only)
16     32 GB      9.6
8      16 GB      4.8
4       8 GB      2.4
2       4 GB      1.2
1       2 GB      0.6
There were three types of queue: production queues (with a 24-hour limit), development queues (with a 2-hour limit) and testing queues (with a 10-minute limit). All combinations of type and size existed, and all queues ran continuously, except that the 32-CPU production queue was available only by special arrangement.
Biography of Dorothy Hodgkin
Hartree was installed in 2001 and decommissioned in January 2005.
Hartree was an IBM SP parallel supercomputer; from 2001 until June 2003 it comprised 10 computation nodes and 2 I/O nodes (mainly for system processes and interactive use). Each computation node contained 16 Power3-II processors running at 375 MHz and 12 GB of memory. The nodes were connected via IBM's high-performance SP Switch2. The above picture shows the three racks which contained Hartree's nodes and the slightly shorter rack containing its RAID array.
CPU:       10 x 16 x Power3-II, 375 MHz, 4 floating-point pipelines
GFLOPS:    160 x 1.5 = 240
Caches:    128 KB 128-way set-associative instruction cache
           128 KB 128-way set-associative data cache
           8 MB secondary cache
Memory:    12 GB per node, 120 GB total
Topology:  crossbar (effectively), 500 MB/s
I/O:       2 additional dedicated I/O nodes with 4 CPUs each
Disk:      1 TB local RAID disk
From June 2003 until it was decommissioned in January 2005, Hartree comprised 8 computation nodes and 2 I/O nodes. Each computation node contained 8 Power4 processors running at 1.1 GHz and 16 GB of memory; the I/O nodes each contained four 375 MHz Power3 processors and 8 GB of memory. The nodes were connected via IBM's high-performance SP Switch2.
CPU:       8 x 8 x Power4, 1.1 GHz, 4 floating-point pipelines
GFLOPS:    64 x 4.4 = 282
Caches:    128 KB 128-way set-associative instruction cache
           128 KB 128-way set-associative data cache
           8 MB secondary cache
Memory:    16 GB per node, 128 GB total
Topology:  crossbar (effectively), 500 MB/s
I/O:       2 additional dedicated I/O nodes with 4 CPUs each
Disk:      1 TB local RAID disk
The IBM SP thus provided individual nodes which were themselves of considerable compute power, connected by a high-performance switch. The CPUs were cache-based RISC processors, so optimisation issues were similar on the SP and on a normal desktop workstation. The processors were capable both of explicit prefetching and of automatic streaming, in order to sustain the high memory bandwidths that many large scientific simulations require.
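By way of illustration (the routine name here is invented), the kind of unit-stride sweep that this prefetch and streaming hardware handles well is:

    ! Illustrative unit-stride update (a DAXPY-like sweep, invented name).
    ! The hardware detects the sequential access pattern and streams x
    ! and y in from memory ahead of use; there is no vector unit, so the
    ! tuning concerns are much as on an ordinary cache-based workstation.
    subroutine axpy_sketch(n, alpha, x, y)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in) :: alpha, x(n)
      double precision, intent(inout) :: y(n)
      integer :: i
      do i = 1, n
        y(i) = y(i) + alpha * x(i)
      end do
    end subroutine axpy_sketch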
Hartree ran AIX, IBM's UNIX.
In its initial configuration, Hartree's queues allowed jobs of up to 12 hours and 64 processors (96 GFLOPS and 40 GB); larger jobs (up to 128 processors and 80 GB) could be run when required. In its second configuration, the queues allowed jobs of up to 12 hours and 32 processors (140 GFLOPS and 64 GB).
Biography of Douglas Hartree
The CCHPCF also had a four-processor Origin 2000 fileserver with 1 TB of disk (Mott, shown left). Mott was retired at the end of 2003.