3D Black Hole Simulation

The H code simulates the evolution of a black hole in three dimensions using a hyperbolic formulation of the Einstein equations by Joan Masso. The simulation runs for 100 time steps and uses minimal I/O operations. The H code is currently a single multipurpose code, compatible with HPF variants, F90, and F77 with message passing. When tuned for the CM5, the code achieves over 15 Gflops on 200^3 grids.
This performance study includes the following parallel architectures (and software):

	IBM SP1 at ANL (MPI/F90)
	TMC CM5 at NCSA (CM-Fortran,MPI/F77)
	CRAY T3D at PSC (HPF subset)
	CRAY C90 at PSC (F90)
	SGI Power Challenge at NCSA (F90 subset)

Problem Size: nx = ny = nz = 32

SGI Power Challenge

no. processors     cpu time       speedup    ~ Mflops

      1             240.6           1.00        63.4
      2             118.9           2.02       128.3
      4              58.4           4.12       261.0
      8              31.9           7.54       477.8
     15              22.5          10.69       677.7
     16             ***** only 15 processors configured due to a failed processor
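The speedup figures in these tables are simply the single-processor time divided by the p-processor time, and the Mflops rate follows from a fixed total operation count. A minimal sketch reproducing the SGI 32^3 numbers (the total flop count is inferred here from the reported single-processor rate, since the report does not give the H code's actual operation count):

```python
# Speedup and approximate Mflops from wall-clock timings.
# cpu_times: measured seconds for the 32^3 problem on the SGI Power Challenge.
cpu_times = {1: 240.6, 2: 118.9, 4: 58.4, 8: 31.9, 15: 22.5}

# Total Mflop for the run, back-derived from the reported 63.4 Mflops
# single-processor rate (an assumption; the true count is not given).
total_mflop = 63.4 * cpu_times[1]

for p, t in cpu_times.items():
    speedup = cpu_times[1] / t          # T(1) / T(p)
    mflops = total_mflop / t            # same work, less time
    print(f"{p:3d} procs  speedup {speedup:5.2f}  ~{mflops:6.1f} Mflops")
```

The 2-processor speedup of 2.02 exceeding 2.0 is the superlinear effect noted in the remarks: the per-processor working set shrinks enough to fit better in cache.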

TMC CM5

no. processors     cpu time       speedup      ~ Gflops

     32              23.6                        0.6
     64              14.8                        1.0
    128               9.9                        1.5
    256               7.7                        2.0
    512               6.9                        2.2

Cray T3D

no. processors     cpu time       speedup      ~ Gflops

      2             754.1                        0.02
      4             382.0                        0.04
      8             190.1                        0.08
     16              98.9                        0.15
     32              49.8                        0.31
     64              25.9                        0.59
    128              16.6                        0.92
    256               7.8                        1.96
    512              57.1                        0.27

IBM SP1

no. processors     cpu time       speedup      ~ Gflops

     64              13.9                        1.09


Problem Size: nx = ny = nz = 64

SGI Power Challenge

no. processors     cpu time       speedup     ~ Mflops

      1            2067.1           1.00        59.0
      2            1057.1           1.96       115.3
      4             537.3           3.85       227.0
      8             282.1           7.33       432.1
     15             175.2          11.80       695.8
     16             ***** only 15 processors configured due to a failed processor

TMC CM5

no. processors     cpu time       speedup   ~ Gflops

     32            150.8			 0.8
     64             79.2			 1.5
    128             41.9			 2.9
    256             24.1			 5.0
    512             15.4                         7.9

Cray T3D

no. processors     cpu time       speedup   ~ Gflops

      8            1520.5                        0.08
     16             769.3                        0.16
     32             389.3                        0.31
     64             194.3                        0.63
    128             100.3                        1.20
    256              50.2                        2.40
    512              26.6                        4.60

IBM SP1

no. processors     cpu time       speedup   ~ Gflops

     64              68.6                        1.8


Problem Size: nx = ny = nz = 128

SGI Power Challenge

no. processors     cpu time       speedup   ~ Mflops

     15            1370.3                      711.6


TMC CM5

no. processors     cpu time       speedup   ~ Gflops

     64             574.1                        1.7
    128             294.0                        3.3
    256             154.0                        6.3
    512              81.9                       11.9

Cray T3D

no. processors     cpu time       speedup   ~ Gflops

     64            1536.0                        0.63
    128             774.8                        1.30
    256             391.1                        2.50
    512             197.3                        4.90

Peak Architecture Performance for Problem Size n = 32^3

Machine		Processors	CPU Time	Gflops

CRAY T3D           512            7.8            2.0
TMC CM5            512            6.9            2.2
CRAY C90            16           19.0            0.5
IBM SP1             64           13.94           1.1
SGI PC              15           22.49           0.7

Peak Architecture Performance for Problem Size n = 64^3

Machine		Processors	CPU Time	Gflops

CRAY T3D           512           26.6            4.6
TMC CM5            512           15.4            7.9
CRAY C90            16           23.1            5.3
IBM SP1             64           68.6            1.8
SGI PC              15          175.1            0.7

Peak Gflop Performance

Machine		Processors	n	Gflops

CRAY T3D           512          256            5.1
TMC CM5            512          128           11.9
CRAY C90            16          128            7.4
IBM SP1             64           64            1.8
SGI PC              15          128            0.7

key algorithms:
MacCormack finite difference scheme
key contact:
Rob Gjertsen (gjertsen@ncsa.uiuc.edu)
remarks:
The problem sizes for some architectures differed from the standard
32^3, 64^3, and 128^3 sizes due to performance considerations; the times
were scaled accordingly in these cases. The occurrence of superlinear
speedup can be attributed to cache utilization.
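The MacCormack scheme named above is a two-step predictor-corrector finite difference method, second-order in space and time. A minimal 1D sketch for the linear advection equation u_t + a u_x = 0 with periodic boundaries (the H code's actual 3D Einstein-equation update is far more involved; all names and parameters here are illustrative):

```python
import numpy as np

def maccormack_step(u, a, dt, dx):
    """One MacCormack step for u_t + a u_x = 0, periodic boundaries.

    Predictor uses a forward difference, corrector a backward
    difference on the predicted values; averaging the two gives
    second-order accuracy in space and time.
    """
    c = a * dt / dx                                  # Courant number, need |c| <= 1
    u_star = u - c * (np.roll(u, -1) - u)            # predictor (forward diff)
    return 0.5 * (u + u_star - c * (u_star - np.roll(u_star, 1)))  # corrector

# Advect a Gaussian pulse once around the unit periodic domain.
nx = 200
x = np.linspace(0.0, 1.0, nx, endpoint=False)
u = np.exp(-200.0 * (x - 0.5) ** 2)
a, dx = 1.0, x[1] - x[0]
dt = 0.5 * dx / a                                    # CFL number 0.5
for _ in range(int(round(1.0 / (a * dt)))):          # one full period
    u = maccormack_step(u, a, dt, dx)
# After one period the pulse should return near x = 0.5, only slightly damped.
```

For smooth data such as this pulse the scheme is dissipative only at poorly resolved wavelengths, which is why it suits the H code's smooth black-hole evolution fields.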