[GE users] SGE -- running jobs across SGI Altix Partitions
Michael T Witkowski
Michael_T_Witkowski at raytheon.com
Sat Nov 17 17:26:13 GMT 2007
Is anyone using SGE on multiple Altix nodes with the following setup?
1) Node => single system image across a minimum of 16 processor cores
(up to 1024)
2) Parallel job workload using MPI, typically with 30, 60, 120 or more
cpu slots (cores) per job
3) Nodes (independent systems) NUMAlinked together
4) SGE configuration using per-job cpusets
We are interested in running parallel jobs across partitions that are
NUMAlinked together.
So an example would be:
======================
System A -- An Altix 4700 with 1024 processor cores
System B -- An Altix 4700 with 1024 processor cores
(For simplicity, I omit the boot cpus/cpuset)
(Systems are identical in HW, SW and configurations)
Now, if I run a few jobs:
====================
1) At time t1, job J1 starts on System A and has 768 slots allocated
(and an associated cpuset).
2) At time t2 (after t1), job J2 starts on System B and has 768 slots
allocated (and an associated cpuset).
3) At time t3 (after t2), I want job J3 to start. It requests 512 slots
and an associated cpuset.
It cannot run on System A or System B alone, since neither has enough
free slots.
But it could run on a set of resources from both (256 from System A and
256 from System B).
Or, alternatively, just assume we want to run a single job that needs
between 1026 and 2048 slots.
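
To make the slot arithmetic in the example concrete, here is a minimal Python sketch. It is only an illustration of the placement we are hoping for, not anything SGE does or provides: the names place_job and commit and the free-slot table are hypothetical, and the "prefer one system, otherwise span both" policy is an assumption.

```python
# Hypothetical sketch of the desired placement; none of this is SGE's API.

def place_job(free_slots, request):
    """Return {system: slots} for a request, or None if it cannot fit.

    Assumed policy: use a single system when one has enough free slots,
    otherwise span systems in the order they are listed.
    """
    # First choice: the whole job inside one single system image.
    for system, free in free_slots.items():
        if free >= request:
            return {system: request}

    # Not enough free slots anywhere, even when combined.
    if sum(free_slots.values()) < request:
        return None

    # Fall back to spanning the NUMAlinked systems.
    allocation = {}
    remaining = request
    for system, free in free_slots.items():
        take = min(free, remaining)
        if take > 0:
            allocation[system] = take
            remaining -= take
        if remaining == 0:
            break
    return allocation


def commit(free_slots, allocation):
    """Subtract an allocation from the free-slot table in place."""
    for system, slots in allocation.items():
        free_slots[system] -= slots


if __name__ == "__main__":
    free = {"A": 1024, "B": 1024}   # boot cpus/cpuset omitted, as above
    for name, slots in [("J1", 768), ("J2", 768), ("J3", 512)]:
        alloc = place_job(free, slots)
        if alloc is not None:
            commit(free, alloc)
        print(name, alloc, "free:", free)
```

With the three jobs above, J1 lands on A, J2 on B, and J3 is split 256/256 across both. A request larger than one system is split the same way on two empty systems.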
The information I would greatly appreciate is:
======================================
Thoughts on actual or potential configurations to accomplish this
*** Parallel Environments
*** Cpusets
*** Queue structures
*** etc
and/or pointers to any documentation, references, or points of contact.
Thanks much
Michael Witkowski