TIKSL
This page is intended to start a discussion about I/O matters in the gigabit project. The main ideas are described, though not all of them in the greatest detail. If you feel we missed a part necessary for understanding, please let us know!
Please feel free to send any comments, ideas and suggestions to the authors (tradke@aei-potsdam.mpg.de, merzky@zib.de) or to the project mailing list (gigabit@aei-potsdam.mpg.de).
The TIKSL Project is part of the
"Gigabit Testbed
Süd+Berlin", and focuses on remote and distributed
visualization for numerical relativity applications, that is, for
Cactus. The main
goals for the project are:
With regard to the goals stated above, the following points characterize the current state:
Remark: Recombination is (as I understand it) nothing more than reading some (chunked) data files, recreating the data structures, and writing them out into a single data file.
Mainly due to point C3 we decided to use a Data Server connecting Amira indirectly to the simulation.
Pro:
Contra:
Thus, the DS should run as a daemon!
According to our T3E staff it is no problem to run daemons on the T3E, even if they consume much computing time. But in any case a daemon has to share the available computing time with other applications, since it runs on the interactive command nodes, where compiling etc. also takes place... A more severe limit might be the available memory, because this, too, is shared between all interactive applications.
Fortunately, an OS upgrade is planned during the next few months, which will provide an extra node (with 'big' memory) for memory-intensive interactive applications/daemons. For us, so to say... ;-) I will check with Garching that this is OK for them as well.
On an O2k a daemon will certainly compete with the application for memory and CPU time as well, especially if there are as many processes as nodes. But there is no way to prevent this (or any ideas?).
For file I/O we plan to use HDF5 with a parallel, _distributed_ file layout, and MPI-IO, as well as for _distributed_ collective I/O. This is of course the weakest point of the whole design, so we need your comments/suggestions/help right here! :
Parallel HDF5 relies on MPI-IO, which has been available on the T3E since mid-March. We have not done much testing yet, but with MPI-IO available, parallel HDF5 should be coming soon on the T3E as well (right??). (Remark: the parallel HDF5 tests coming with the HDF5 distribution do not compile. They seem to be very old anyway... According to Albert Cheng, some tests on the T3E have already been done.)
HDF5 usually writes a single file. We would like to have separate files, for better performance (see Thomas Radke's I/O tests on turmoil, modi4 and origin!). HDF5 supports a so-called mounting scheme for logical data groups, which has some parallels to Unix file system mounting, but the term does not refer to really mounting external files into an HDF5 data set. Well, we would like to have exactly this, so is there a chance to get it???
Actually, Werner pointed out that this should work already; at least there have been some comments about it on the mailing lists. According to Mike Folk at NCSA it is not implemented yet. Can anyone clarify this, or even provide a piece of code as an example?
We want even more: if there were a possibility to write an HDF5 data set into split files on one file system, why can't we do this distributed, on distributed file systems? One problem: MPI-IO (ROMIO) does not support writing on distributed file systems (right??). Steven Tuecke from Argonne once mentioned that the ROMIO group might be interested in distributed I/O too, so is there any chance that distributed I/O will be supported in the near future?
The final file layout will look like this:

Every 'Data Chunk' is meant to be written to a single file. FS means file system: the files might be located on different file systems.
The advantages are clear: all processors can do local output, but since all data chunks plus the header form a single logical HDF5 file, access to all the data is transparent for every application, and no recombination etc. is necessary. (Actually, since the remote files are read by another process, and the local files are processed by the HDF5 lib, it is more a recombination 'on the fly'.)
It would probably be better to keep some information redundantly in each file; then a special header file is maybe not really necessary. Additionally, it might be easier to keep one single file per machine, so that MPI-IO can do some optimization for collective calls (according to Rajeev Thakur).
Additionally, we can use multiple file systems, an optimum size of I/O blocks (by collecting data from a certain subset of processors, here 3), and multiple files, and we minimize network transfer; all of this should optimize I/O speed!
The only drawback we see right now: a process is necessary which reads the data located on a remote file system. We want to do this via an entry in inetd.conf on all servers: if a process reads a file and the lib realizes, due to the header information, that parts are missing on a remote server, the daemon there is contacted (and possibly started automatically), reads the necessary parts of the file, and sends them back through a socket to the application. The lib could additionally do some caching/staging as well...
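To make the remote-read idea concrete, here is a minimal sketch of what such a daemon could do per connection. Everything here (the byte-range request format, the fixed header layout) is a hypothetical placeholder, not a decided protocol: the client asks for a byte range of a named file, and the daemon sends the bytes back over the socket.

```python
# Sketch of the per-connection work of the remote-read daemon.
# Request wire format (hypothetical): 8-byte offset, 8-byte length,
# 2-byte file name length, then the file name; all big-endian.
import struct

def recv_exact(conn, n):
    # Sockets may return short reads, so loop until n bytes arrived.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def serve_one(conn):
    offset, length, namelen = struct.unpack("!QQH", recv_exact(conn, 18))
    name = recv_exact(conn, namelen).decode()
    with open(name, "rb") as f:          # read only the requested part
        f.seek(offset)
        data = f.read(length)
    # Reply: 8-byte payload length, then the payload itself.
    conn.sendall(struct.pack("!Q", len(data)) + data)
```

Under inetd the daemon would simply get the connected socket as stdin/stdout; the caching/staging mentioned above would sit on the client side of this exchange.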
The other possible problem: if we want to get a hyperslab, a read possibly spans multiple files. Thus, if every file contains a self-contained HDF group structure, the hyperslab specification has to cover multiple groups! Or, alternatively, a group can extend over multiple (possibly all) data chunks. Is this possible?
Is there any chance to get the HDF5 group and the ROMIO group together to help provide such functionality (in a time frame not exceeding the project time of 2 years...)???
Right now, mainly the request of data is described below. We are not yet sure how to send the data through the net. One of the nicest ways would be to get MPI-IO to write onto a socket instead of a file, but who does the sorting then? Another question is which data format is to be used here. HDF5 is not streamable; FlexIO is, but we don't want someone rewriting/supporting the FlexIO library only for misusing it as a stream format! Another possibility is to use CORBA for this, but here we have to learn a lot, and test how it performs for large data sets. Additionally, we don't know if CORBA (Mico) is available for all platforms. Other candidates are DCE and PVM. If all this turns out to be impossible for any reason, we will probably end up with our own creation, using a protocol like the one described for the request part below.
So, if anyone has a better idea about this, let us know!
The I/O API is thought to be very similar to HDF5. The reason is that we can then do the file I/O very transparently by forwarding the calls to the native HDF5 library. In the case of stream I/O, the calls have to be mapped to our own stream lib (see the section about stream I/O), which has to implement a protocol able to handle all possible data types.
In principle, we would like to have the possibility of downsampling the data before writing them to disk or sending them over the net. If HDF5 provides this functionality (this seems to be the case according to Albert Cheng), we want to leave this task to the HDF5 lib (we probably can't do it better ;), and implement only the things we need for streaming.

In case this is not possible in HDF5, we have to do our own reduction, which will then be a more general layer:
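Just to illustrate the kind of reduction meant here (independent of where it ends up living), a toy version of downsampling by a factor k in each dimension could look like this; the function name and the nested-list representation are illustration only:

```python
# Toy downsampling of a 2D field: keep every k-th point in each dimension.
# A real implementation would operate on the HDF5/stream data structures.
def downsample(grid, k):
    return [row[::k] for row in grid[::k]]
```

The same slicing idea extends to 3D fields; the open question above is only whether the HDF5 lib can do this for us during the write.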

Anyway, the basic I/O layer is thought to be a switch, which decides (by looking at the URLs/file handles) which lower layer has to be served.
'DS' stands for Data Server.
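The switch could be as simple as a dispatch on the URL scheme. This is only a sketch; the scheme names 'stream' and 'ds' and the layer labels are made-up placeholders for whatever the lower layers end up being called:

```python
# Sketch of the basic I/O layer switch: pick the lower layer from the URL.
from urllib.parse import urlparse

def select_layer(url):
    scheme = urlparse(url).scheme
    if scheme in ("", "file"):
        return "hdf5-file"    # plain file: forward to the native HDF5 lib
    if scheme == "stream":
        return "stream-lib"   # our own stream I/O library
    if scheme == "ds":
        return "data-server"  # route the request through a Data Server
    raise ValueError("no I/O layer for %r" % url)
```

An application would then open "ds://host:port/field" or "/scratch/data.h5" through the same API and never see which layer serves it.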





All requests are identified by a unique request ID, generated on the visualization side. Sent data can thus be associated with a certain request by this ID.
The list of former recipients is extended by every transmitter station (any intermediate data server) to prevent loops and multiple handling of the same request. This is most important for broadcasting.
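The loop check amounts to: drop a request whose recipient list already contains our own station ID, otherwise append ourselves and pass it on. A sketch (the dict-based request and the helper name are hypothetical; the real requests use the syntax below):

```python
# Sketch of the RCPTS loop check done at every transmitter station.
def handle_or_drop(request, my_id, forward):
    rcpts = request.setdefault("RCPTS", [])
    if my_id in rcpts:
        return False          # we handled this request before: break the loop
    rcpts.append(my_id)       # record ourselves as a former recipient
    forward(request)          # then pass the request on
    return True
```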
Syntax:
<REQUEST>
<HEADER>
<ID = 12345> # unique request ID
<VERSION = 0.1> # protocol version
<TAG = "donald"> # request tag
<TARGET = [<thorn_name> | DS | application | broadcast ]>
# request target
<TYPE = [init|info|control|data|previz|keepalive|answer]>
# request type
<SIZE = 5432> # size of request content
<RCPTS = [D727885EDB88EA62A779398BEE1099BA
5EE10EDB88EA62A77939A8BD7278899B
EDB88EA65EE1099BA2A779398BD72788]>
# list of former recipients of request
</HEADER>
<CONTENT>
<---something-depending-on-request-type--->
<---something-depending-on-request-type--->
<---something-depending-on-request-type--->
</CONTENT>
</REQUEST>
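For a feel of how little is needed to read such a header, here is a toy parser for the flat "<KEY = VALUE>" lines shown above. It is deliberately simplified (one tag per line, '#' comments stripped, no nesting of REQUEST/HEADER/CONTENT); a real implementation would need a proper recursive parser:

```python
# Toy parser for flat "<KEY = VALUE>  # comment" header lines.
import re

TAG = re.compile(r"<(\w+)\s*=\s*(.+?)>$")

def parse_header(text):
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        m = TAG.match(line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields
```

The SIZE field would then tell the receiver how many bytes of CONTENT to read before parsing the type-specific part.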
INIT:
Associates an application with the according data server
running on the same host. This request should be the first
one received by a data server, and should be sent by the
creator of the server. In this way, pairs of appls/ds
are formed, which will work together very closely.
For the situation that an application is distributed, and
for this (or any other reason) more than one data server
is to be associated with the application, these data servers
should be announced to each other as well. A new request type
for this is not necessary, I think; the ASSOCIATED_ID could be
recognized as known, or we set TYPE to "ASSOCIATED_DS" or so.
<PORT = 2500>
<ASSOCIATED_ID = "EDB88EA62A779398BD727885EE1099BA">
<PROGRAM NAME = "Cactus"
TYPE = "CCTK"
HOST = "berte.zib.de"
UID = 185
GID = 7
USER = radke>
.
.
.
Possibly other parameters a DS/application might
need for configuration.
INFO:
Request information, like available services, uptime, statistics etc.
<TYPE = IsRunning>
<TYPE = WhenStarted>
<TYPE = WhichTimeStep>
<TYPE = ListThorns>
<TYPE = ListFields>
<TYPE = ListControls>
<TYPE = ThornExists NAME = "<name>">
<TYPE = FieldExists NAME = "<name>">
<TYPE = ControlExists NAME = "<name>">
CONTROL:
Try to control the remote application. The process
invoking such requests should get information about
available controls, and can afterwards refer to
obtained control id's.
<SetValue NAME = "sample_size"
TYPE = "INTEGER"
VALUE = 128
>
<ToggleValue NAME = "light">
<SIGNAL = REGRID>
<SIGNAL = CHECKPOINT>
<SIGNAL = STOP>
<SIGNAL = SUSP>
<SIGNAL = CONT>
.
.
.
PREVIZ:
Request some computation on application side, and the
results. Here the same remark as for Control holds:
get information about available PreViz routines first.
<PREVIZ TYPE = "line"
START = "0.1, 1.4, -5.1"
>
<PREVIZ TYPE = "isosurface"
LEVEL = 0.001
>
.
.
.
Possible parameters depend on thorn to be invoked.
DATA:
Try to obtain actual data fields / sets from application.
Again, first get information about availability.
<DATA NAME = "fieldname"
RESOLUTION = 128
DOWNSAMPLING = 2
HYPERSLAB = "some HDF5 parameters"
XMIN = 2
XMAX = 4
YMIN = 3
YMAX = 5
ZMIN = 4
ZMAX = 6>
<PARAM NAME=hyperslab
VALUE=[hdf5-hyperslab-syntax]>
.
.
.
Possible parameters depend on field type and
visualization needs.
KEEPALIVE:
Now and then something should be sent just
to say: we are still out here.
<TIME = 924691596> # seconds since 01.01.1970 0:00
.
.
.
Possibly other parameters I cannot think of right now... ;).
ANSWER:
This request always refers to another, already handled
one, and gives information about its handling. This request
should naturally always flow in the direction opposite to the
original request.
<ERRORCODE = [ NO_ERROR | ERROR_PERMISSION... ]>
.
.
.
This does not include any data to be sent, but only
information about the request handling itself (acknowledgement etc.).
Request Handling:
The thorn_handleRequests does nothing but:
The thorns themselves then have to parse the contents and act according to them. Sending back data, acknowledging etc. is completely done by the thorn itself, probably via an easy-to-handle API (well, _everybody_ should be able to write thorns...). If something goes wrong during parsing etc., the thorn has to generate an appropriate error message.
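The thorn-side pattern described above can be sketched as follows. The dict-based request, the REFERS_TO field and the ERROR_PARSE code are assumptions for illustration (the protocol above only defines NO_ERROR and ERROR_PERMISSION as examples); the point is simply that every handled request produces an ANSWER, with an error code filled in when parsing fails:

```python
# Sketch of thorn-side request handling: parse the CONTENT ourselves
# and always build an ANSWER request referring to the original ID.
def thorn_handle(request, parse_content):
    answer = {"TYPE": "answer", "REFERS_TO": request["ID"]}
    try:
        result = parse_content(request.get("CONTENT", ""))
        answer["ERRORCODE"] = "NO_ERROR"
        answer["RESULT"] = result
    except Exception as err:
        answer["ERRORCODE"] = "ERROR_PARSE"   # hypothetical error code
        answer["MESSAGE"] = str(err)          # tell the requester what failed
    return answer
```

The easy-to-handle API mentioned above would essentially wrap this try/answer boilerplate so that a thorn writer only supplies the parse_content part.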