[GE users] SGE -- running jobs across SGI Altix Partitions

Michael T Witkowski Michael_T_Witkowski at raytheon.com
Sat Nov 17 17:26:13 GMT 2007


Is anyone using SGE on multiple Altix nodes with the following setup?

1) Each node is a single system image spanning at least 16 processor
   cores (up to 1024)
2) Parallel job workload using MPI, typically with 30, 60, 120 or more
   CPU slots (cores) per job
3) Nodes (independent systems) NUMAlinked together
4) SGE configuration using per-job cpusets

What we are interested in doing is running parallel jobs across partitions 
that are NUMAlinked together.

So an example would be:
======================

System A -- An Altix 4700 with 1024 processor cores 
System B -- An Altix 4700 with 1024 processor cores

(For simplicity, I omit the boot CPUs/cpuset.)
(The systems are identical in HW, SW and configuration.)

Now, if I run a few jobs:
=========================

1) At time t1, job J1 starts on System A with 768 slots allocated
   (and an associated cpuset).

2) At time t2 (after t1), job J2 starts on System B with 768 slots
   allocated (and an associated cpuset).

3) At time t3 (after t2), I want job J3 to start. It requests 512 slots
   and an associated cpuset.
   It cannot run on System A or System B alone, since each has only
   256 free slots.
   But it could run on a set of resources from both (256 from System A
   and 256 from System B).

Or, alternatively, just assume we want to run a single job with between 
1025 and 2048 slots, i.e. more than a single system can provide.
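
To make the slot arithmetic of the three-job scenario concrete, here is a
minimal Python sketch. It only does the bookkeeping; it is not SGE
configuration and uses no SGE API. The names TOTAL_SLOTS, free_slots,
hosts_that_fit and fits_across are hypothetical, chosen for illustration.

```python
# Illustrative slot accounting only; all names here are hypothetical
# and nothing in this sketch talks to SGE.

# Two identical systems with 1024 cores each (boot cpuset omitted).
TOTAL_SLOTS = {"A": 1024, "B": 1024}


def free_slots(total, running):
    """Return free slots per system after subtracting running jobs."""
    free = dict(total)
    for system, slots in running:
        free[system] -= slots
    return free


def hosts_that_fit(free, request):
    """Systems that could hold the whole request on their own."""
    return [system for system, n in free.items() if n >= request]


def fits_across(free, request):
    """True if the request fits when free slots of all systems are pooled."""
    return sum(free.values()) >= request


# J1 holds 768 slots on System A, J2 holds 768 slots on System B.
running = [("A", 768), ("B", 768)]
free = free_slots(TOTAL_SLOTS, running)

j3_request = 512
print(free)                              # {'A': 256, 'B': 256}
print(hosts_that_fit(free, j3_request))  # []
print(fits_across(free, j3_request))     # True
```

So J3 has no single system it can start on, but it fits if slots from
both partitions can be combined into one job.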


The information I would greatly appreciate is:
======================================

Thoughts on actual or potential configurations to accomplish this:
   *** Parallel environments
   *** Cpusets
   *** Queue structures
   *** etc.

and/or pointers to any documentation, references, or points of contact.


Thanks much

Michael Witkowski



