[GE users] SGE and physical machine slot allocation

Reuti reuti at staff.uni-marburg.de
Fri Apr 21 06:42:16 BST 2006


On 20.04.2006 at 20:39, lukacm at pdx.edu wrote:

> Hello,
>
> thank you for your help, it seems SGE is now accounting correctly
> for the Job/Slot/Machine allocation.
>
> How come changing the machine file allows this? I thought that,
> according to the docs, the machine file MUST contain all available
> machines. Isn't that true?

This depends on the point of view - which documentation are you
referring to: MPICH or SGE? Without any queuing system, it will most
likely contain all machines in the cluster, but with a queuing system
it should only contain the nodes granted to this job. - Reuti
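Putting the pieces of this thread together, the fix is to point mpirun
at the machinefile the PE start procedure generates instead of a static
cluster-wide file. A minimal sketch of such a submit script, assembled
from the flags and paths quoted below (illustrative only, not a tested
script):

```shell
#!/bin/sh
# Request 4 slots from the mpich PE (same as "qsub -pe mpich 4 mbsub.sh")
#$ -pe mpich 4
# Environment for MPICH's ch_p4 device, as used in this thread
#$ -v P4_RSHCOMMAND=ssh
#$ -v P4_GLOBMEMSIZE=10000000
#$ -v MPICH_PROCESS_GROUP=no
#$ -v CONV_RSH=ssh

# Use the machinefile written by the PE start procedure (startmpi.sh):
# it lists only the nodes SGE actually granted to this job, unlike a
# hand-maintained file such as /home/visible/mbmachines.
$MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    /home/visible/apps/MrBayes/mb anolis.nex
```

With this, the processes end up on exactly the hosts qstat reports for
the job, and SGE can clean them up when the job is deleted.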


> thanks again
>
> martin
>
> Quoting Reuti <reuti at staff.uni-marburg.de>:
>
>> On 20.04.2006 at 19:05, lukacm at pdx.edu wrote:
>>
>>> Hello,
>>>
>>> Yes the command line from the submit file is as follows:
>>>
>>> $MPIR_HOME/mpirun -np $NSLOTS -v -machinefile /home/visible/mbmachines
>>> /home/visible/apps/MrBayes/mb anolis.nex
>>
>> Nope, this will use any machine in the cluster. The machinefile to
>> use is $TMPDIR/machines, which is created by the start procedure of
>> the PE.
>>
>> Please have a look at the supplied mpi.sh script in $SGE_ROOT/mpi
>>
>> -- Reuti
>>
>>
>>> All variables are defined and so on.
>>>
>>> However, concerning this: "So you also renamed the created link in
>>> startmpi.sh to create an ssh wrapper?" I am not sure. I did not
>>> find anything about it in the FAQs. Is there any doc on this? I
>>> modified 'rsh' in the /gridengine/opt/ directory. All links in
>>> startmpi.sh in the same directory are pointing correctly to that
>>> wrapper. So I guess I am confused about your question.
>>>
>>> martin
>>>
>>>
>>> Quoting Reuti <reuti at staff.uni-marburg.de>:
>>>
>>>> On 19.04.2006 at 23:30, lukacm at pdx.edu wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> yes the job is running fine, but not as SGE scheduled it on the
>>>>> physical machines, i.e. parallel slots.
>>>>>
>>>>> the qsub command looks like qsub -pe mpich 4 mbsub.sh
>>>>>
>>>>> inside it, the main important flags are
>>>>>
>>>>> #$ -v P4_RSHCOMMAND=ssh
>>>>> #$ -v P4_GLOBMEMSIZE=10000000
>>>>> #$ -v MPICH_PROCESS_GROUP=no
>>>>> #$ -v CONV_RSH=ssh
>>>>
>>>> So you also renamed the created link in startmpi.sh to create an
>>>> ssh wrapper?
>>>>
>>>> Have you given any hostlist to the mpirun command? - Reuti
>>>>
>>>>
>>>>>
>>>>> I also did the tight integration of MPICH and SGE using method
>>>>> number 2.
>>>>>
>>>>> In general I would not mind this issue, but when I have to clean
>>>>> up a set of zombies from the same user, and I do not know which
>>>>> processes are zombies and which are not, it becomes a problem.
>>>>>
>>>>> martin
>>>>>
>>>>> Quoting Reuti <reuti at staff.uni-marburg.de>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 19.04.2006 at 21:59, lukacm at pdx.edu wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> a job run with SGE generates the following strangeness.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> arc.q at compute-0-11.local       BIPC  2/2       1.00     lx26-amd64
>>>>>>>    3964 0.55500 tas        ruedas       r     04/19/2006 10:50:59     2
>>>>>>> ---------------------------------------------------------------------
>>>>>>> arc.q at compute-0-12.local       BIPC  1/2       0.00     lx26-amd64
>>>>>>>    3964 0.55500 tas        ruedas       r     04/19/2006 10:50:59     1
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> The slots allocated by SGE do not correspond to the queues
>>>>>>> shown by qstat. Is there a remedy to tightly integrate SGE
>>>>>>> with the physical machines?
>>>>>>
>>>>>> this seems not to be a problem of SGE, but of the integration of
>>>>>> your parallel job into SGE. So this job got three slots, but is
>>>>>> only using one slot according to the load, you mean?
>>>>>>
>>>>>> What is your defined queue, PE, the defined scripts for this PE
>>>>>> and your qsub command?
>>>>>>
>>>>>> Is your job instead running on other nodes than the intended ones?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> thank you
>>>>>>>
>>>>>>>
>>>>>>> martin
>>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list