[GE users] Having problems setting up PEs

reuti reuti at staff.uni-marburg.de
Thu Nov 13 22:47:58 GMT 2008


Margaret,

On 13.11.2008 at 23:18, Margaret Doll wrote:

> I am running sge-V6lu4-1 and rocks-sge-5.0-2
>
> If I run my programs using
>
> qsub -pe mpich 4 shll
>
> where shll contains:
>
> #!/bin/bash
> #$ -o $HOME/works-1/Out
> #$ -j y
>
> /opt/openmpi/bin/mpirun -v -n 4 /home/mad/works-1/mad   ,
>
> this works but the job is assigned to an arbitrary compute node.
>
>
> I created a PE called chemistry. My version of qmon does not have the option
> of assigning the PE to a queue in the PE list setup page as shown on page 247

Maybe you accidentally used the manual for 5.3 while running version 6.x?
Since 6.0, the PE list is part of the queue definition instead.

> of the "Administration and User's Guide." I therefore assigned a unique
> machine file to the startup shell. See below.
>
> When I run
>
> qsub -pe chemistry 4 shll

First you have to decide which MPI implementation to use. If it's
Open MPI, then the instructions in the Open MPI FAQ apply. The
reference to the manual was only about how to set up a PE in general.

If you supply your own node list for a parallel job, you are most
likely overriding the allocation SGE actually granted to the job.
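To see which hosts in a custom list fall outside what SGE actually granted, a small check like the following can help. It is only a sketch: the function name hosts_outside_allocation is made up, and it assumes the usual pe_hostfile layout of host, slots, queue and processor range per line.

```bash
#!/bin/bash
# Hypothetical helper (not part of SGE): print every host in a custom
# machine file that does not appear in the job's pe_hostfile.
#   $1 = pe_hostfile (lines: host slots queue processor-range)
#   $2 = custom machine file (host name in the first column)
hosts_outside_allocation() {
    # While reading the first file, remember each granted host name;
    # in the second file, print non-empty lines whose host was never granted.
    awk 'FILENAME == ARGV[1] { granted[$1] = 1; next }
         NF && !($1 in granted) { print $1 }' "$1" "$2"
}
```

Inside a job script you would call it as `hosts_outside_allocation "$PE_HOSTFILE" /opt/gridengine/mpi/chem-machinefile`. Host names are compared literally, so compute-0-10 and compute-0-10.local count as different hosts.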


> my job is stuck in the pending bin with the following errors:
>
> scheduling info:	queue instance "all.q@compute-0-1.local" dropped because it is temporarily not available
> 			queue instance "all.q@compute-0-2.local" dropped because it is temporarily not available
> 			queue instance "het@compute-0-32.local" dropped because it is full
> 			queue instance "het@compute-0-  ...
> Error for job 17330:  11/13/2008 ...: exit_status of pe_start = 1
> Error for job 17330:  11/13/2008 ...: exit_status of pe_start = 1
>
>
> PE List
>
> 	chemistry	PE Name		chemistry
> 			Slots		16
> 			Users		chemistry
> 			Xusers		NONE
> 			Start Proc Args	/opt/gridengine/mpi/startmpi.sh -unique /opt/gridengine/mpi/chem-machinefile

For Open MPI you don't need entries for start_proc_args/stop_proc_args;
/bin/true is fine.
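A sketch of such a PE for Open MPI, assuming the tight-integration settings described in the Open MPI FAQ (control_slaves TRUE so the slave tasks are started under SGE's control, job_is_first_task FALSE); the remaining values are copied from your chemistry PE. You can edit it with `qconf -mp chemistry`:

```
pe_name            chemistry
slots              16
user_lists         chemistry
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```

The job script then calls mpirun with -n $NSLOTS and no machine file; an Open MPI built with SGE support should pick up the granted hosts by itself.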

For future reference: the purpose of startmpi.sh is to take the list of
granted nodes in SGE's internal format ($pe_hostfile) and convert it
into a format MPICH understands, as this varies from parallel library
to parallel library (PVM, MPICH1, MPICH2, Linda, Open MPI, Charm++,
...). The option -unique pipes the final list through the Linux uniq
command, i.e. you get only the list of granted nodes, but no
information about how many slots you got per node. This is sometimes
useful, e.g. when you have a fixed allocation rule and can be sure to
have the same number of slots on all nodes anyway.
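To illustrate what that conversion amounts to, here is a rough sketch in bash. It is not the startmpi.sh shipped with SGE, and the function name pe_to_machines is made up; it only assumes the pe_hostfile layout of host, slots, queue and processor range per line.

```bash
#!/bin/bash
# Hypothetical sketch of the conversion startmpi.sh performs; NOT the
# script shipped with SGE. Input lines of a pe_hostfile look like:
#   host slots queue processor-range
# Usage: pe_to_machines [-unique] <pe_hostfile>
pe_to_machines() {
    local unique=0
    if [ "$1" = "-unique" ]; then
        unique=1
        shift
    fi
    # Expand each line into one host name per granted slot (MPICH style).
    local list
    list=$(awk 'NF { for (i = 0; i < $2; i++) print $1 }' "$1")
    if [ "$unique" -eq 1 ]; then
        # Collapse consecutive duplicates: one line per host, slot counts lost.
        printf '%s\n' "$list" | uniq
    else
        printf '%s\n' "$list"
    fi
}
```

In a PE, SGE substitutes the job's real host file for $pe_hostfile, so the conversion always sees the actual allocation, whereas a fixed file such as chem-machinefile never changes with what was granted.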

-- Reuti


> 			Stop Proc Args	/opt/gridengine/mpt/stopmpi.sh	
> 			Allocation Rule	$fill_up
> 			Urgency Slots	min
>
> Contents of /opt/gridengine/mpi/chem-machinefile
> 	compute-0-10 7 mem8.q 8
> 	compute-0-11 7 mem8.q 8
>
> Referenced PEs for all.q and mem16.q
>
> 	chemistry
> 	make
> 	mpi
> 	mpich
>
>
> 	mem16.q and all.q, both contain nodes compute-0-10 and compute-0-11.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88704
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88705




More information about the gridengine-users mailing list