[GE users] GridEngine on Sierra systems

Fedele STABILE fedele at fis.unical.it
Thu May 29 12:09:49 BST 2008


Here is a link to the HP AlphaServer SC (Sierra system) features:
http://h18002.www1.hp.com/alphaserver/archive/sc/sys_sc45_features.html

Here you can read the System SPD (system specifications); the first 5 pages
are about RMS:
http://webdocs.caspur.it/hp_doc/sc_doc/web/index.htm

You can see that some people use LSF to manage scheduling and queuing.

On the second question:
In a Sierra cluster each node generally has a configuration and features
similar to the others, so I think it's possible to do without the
communication between HP-RMS and GE-RMS.
Let me explain:
1) on my installation I defined a consumable complex named n_cpu (with
an initial value equal to the number of CPUs available);
2) each time a user submits a job, they indicate how many CPUs it
uses/reserves (-l n_cpu=# );
3) GridEngine validates the request, assigning CPUs from one or more nodes
in the cluster, and then runs the submitted command;
4) HP RMS receives the run (prun) or allocate request wherever it is
executed, because it is cluster-wide, and RMS itself decides the location
and amount of resources.

So the GridEngine resource n_cpu is a consumable value that GE needs
for scheduling purposes (if the cluster has 10 CPUs available and the job
fits, GE executes the job by sending the command to HP-RMS).
HP RMS uses the n_cpu value to allocate the resources on the cluster,
following the user's placement choice if there is one.
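
As a rough illustration of the accounting described above, here is a
minimal Python sketch of how a consumable such as n_cpu is debited when a
job starts and credited back when it ends. It is only a toy model, not
GridEngine code: the names Consumable, try_start, finish and
launch_command are hypothetical, and passing the CPU count to prun with
"-n" is an assumption that should be checked against the RMS documentation.

```python
"""Toy model of a consumable resource such as n_cpu (illustration only)."""


class Consumable:
    def __init__(self, name, total):
        self.name = name
        self.total = total   # initial value, e.g. number of CPUs available
        self.in_use = {}     # job id -> amount currently debited

    @property
    def free(self):
        # What is still available for new jobs.
        return self.total - sum(self.in_use.values())

    def try_start(self, job_id, amount):
        """Debit `amount` if it fits; return True if the job may start."""
        if amount <= 0 or job_id in self.in_use:
            raise ValueError("invalid request")
        if amount > self.free:
            return False     # not enough left: the job stays pending
        self.in_use[job_id] = amount
        return True

    def finish(self, job_id):
        """Credit the debited amount back when the job ends."""
        self.in_use.pop(job_id)


def launch_command(n_cpu, program):
    # Hypothetical command line a job script might hand to HP-RMS.
    # ASSUMPTION: prun takes the process count via "-n".
    return ["prun", "-n", str(n_cpu), program]
```

With a pool of 10 CPUs, a 6-CPU job starts at once, while a second 6-CPU
request has to wait until the first job finishes and its CPUs are credited
back. Where the processes actually run is still left to HP-RMS.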

Fedele

On Thu, 29/05/2008 at 00:15 +0200, Reuti wrote:
> Hi,
> 
> On 28.05.2008 at 13:33, Fedele STABILE wrote:
> 
> > In my installation GE doesn't replace the native resource manager.
> > I need to explain the mechanism:
> > The HP resource manager (which we call RMS) is part of the operating
> > system.
> > So if I need to run jobs or allocate CPUs on the cluster, I need to use
> > RMS commands like prun or allocate.
> > HP-RMS uses a database to manage information on the state of the
> > system, but it doesn't manage any job queue.
> > When I submit a job via GridEngine, I must interact with HP-RMS to
> > execute my job on the cluster. If I don't have any particular
> > requirement (for example, the location of the reserved resources), I can
> > use prun instead of mpirun to launch the job.
> > Killing the job is easy because the signal also kills all processes, but
> > suspending and resuming are problems because these signals are not
> > propagated.
> > For this reason I developed scripts that I'm testing, and so far they
> > look OK.
> 
> can you provide any link for HP's RMS? I found something, and at
> first glance it looks to me like a plain resource manager, just
> without a scheduler or queuing facility (similar to what you mentioned).
> 
> > Now there's another question to solve: is it possible that GE reserves
> > one resource and HP-RMS another?
> 
> AFAICS: yes, this might happen. So I would try to stick with only one
> resource manager and not use HP's. We use HP-MPI also on Linux,
> and it will work with a PE in SGE if you adjust the generated
> machinefile slightly (in case HP-MPI is your main interest) and
> "export MPI_REMSH=rsh" in your jobscript before "mpirun", so that
> SGE's rsh-wrapper will work.
> 
> -- Reuti
> 
> 
> > Fedele
> >
> >
> > On Mon, 26/05/2008 at 16:48 +0200, Reuti wrote:
> >> Hi,
> >>
> >> On 26.05.2008 at 16:40, Fedele STABILE wrote:
> >>
> >>> I've installed GridEngine on my Sierra System and it works!!
> >>>
> >>> Sierra System is an HP supercomputer project that uses a QSW network
> >>> to connect a cluster of HP servers. It uses a Resource Manager that
> >>> loads the parallel executable, as mpirun does, and sends it to the
> >>> nodes in the cluster.
> >>> So to submit a job, GridEngine needs to communicate with this Resource
> >>> Manager.
> >>>
> >>> I have created scripts for the "execution methods" that can suspend
> >>> and resume parallel jobs; no modification was needed for starting and
> >>> termination.
> >>>
> >>> Is anyone interested in discussing this topic?
> >>
> >> if you have it running already, it would be nice if you could prepare
> >> a Howto for it (how it's working in Sierra Systems and what the
> >> scripts do). Is it still a Tight Integration, although it sends the
> >> jobs in the end to HP's resource manager?
> >>
> >> -- Reuti
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>



