[GE users] batch job execution behaviour

mark garey garey at biostat.ucsf.edu
Wed Mar 30 21:43:19 BST 2005


greetings all,

i have a 6-node, 12-cpu xserve cluster running panther 10.3.8 and sge 5.3p6.

i'm at the point (in my clustering efforts) where i can submit a job to the
scheduler via qsub, have it queue, transfer, execute on the specified
hosts/slots, and exit without error - but not every time i submit the same job.

the job is an R batch job that spawns N mpi slaves on N lamhosts.

i can run the job interactively via qrsh in a single slot, varying the number
of mpi slaves and lamhosts, and it succeeds every time. but when i submit the
job with qsub, things get a bit weird.
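
for the record, the submission looks roughly like this - submit.sh and myjob.R
are placeholder names and the slot count is just an example, not my exact setup:

#!/bin/sh
#$ -S /bin/sh        # run the job script under /bin/sh
#$ -cwd              # execute in the directory i submitted from
#$ -pe lam 4         # ask for 4 slots from the lam pe
# the R script spawns its own mpi slaves (Rmpi's mpi.spawn.Rslaves)
R CMD BATCH --no-save myjob.R myjob.Rout

submitted with 'qsub submit.sh'.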

when all goes well, the job queues, begins transferring (at which point the
relevant queues drop out of the list because their slots are full), the mpi
slaves corresponding to the slots used all run at max cpu, the job ends,
everything cleans up, and i have a nice R data image.
when all does not go well, the job queues, transfers, and the mpi slaves clock
a bit of cpu and then fall back to zero cpu. the job then sits there doing
nothing until i qdel it.
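
for what it's worth, when it hangs i poke at it with roughly the following
before giving up (<jobid> is whatever qstat reports):

qstat -f              # full queue listing - the slots still show as in use
qstat -j <jobid>      # scheduler / resource details for the stuck job
lamnodes              # run on the job's master node, shows which lam nodes answered
qdel <jobid>          # kill the job and let the pe stop script clean up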

i'm using a parallel environment i set up for lam that looks like this:

pe_name            lam
queue_list         statcluster-node1.q statcluster-node2.q statcluster-node3.q statcluster-node5.q statcluster-node4.q statcluster.q
slots              12
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/local/bin/lambooter
stop_proc_args     /usr/local/bin/lamhalter
allocation_rule    2
control_slaves     FALSE
job_is_first_task  FALSE
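
(in case it helps anyone reproduce this, that output is what qconf reports;
the relevant commands are roughly:)

qconf -spl       # list all configured parallel environments
qconf -sp lam    # show the lam pe (the output above)
qconf -mp lam    # open the lam pe in $EDITOR to change it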

something else to note: the stderr of a failed job shows that one or two mpi
slaves core dumped (the count depends on the number of mpi slaves specified in
the R job and the number of lamhosts requested via the -pe flag), though i have
yet to find an actual core file.
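
my guess is that a core-size limit somewhere is suppressing the dumps; if
anyone else chases this, the places i'd look are roughly (the queue name is
just one of mine):

qconf -sq statcluster-node1.q | grep core    # the queue's s_core / h_core limits
# and inside the job script, before the mpi slaves start:
ulimit -c                                    # what the shell currently allows
ulimit -c unlimited                          # lift it, if the queue limit permits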

something else somewhat related: can i create a queue that uses n slots? what
i have now was set up by default by the gridengine install, that is, 2 slots
per node. can't i have an all.q that specs all of the slots except the default
slots on the head node? otherwise only something like
'qsub -q statcluster.q,statcluster-node5.q,statcluster-node4.q,statcluster-node3.q,statcluster-node2.q,statcluster-node1.q ...'
seems to occupy all of the slots.
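
to frame the question, what i've been doing to look at and tweak slots per
queue is roughly this (submit.sh is the placeholder script from above):

qconf -sql                        # list all queues
qconf -sq statcluster-node1.q     # show a queue's config, including 'slots 2'
qconf -mq statcluster-node1.q     # edit it to change the slot count
# and to span queues without naming them all on -q, go through the pe:
qsub -pe lam 10 submit.sh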

one other thing. various commands display the total memory for my head node
(4GB) as negative:

statcluster          darwin         2  1.00     2.0G    -1.4G      0.0       0.0

is that a bug?
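
that line is qhost-style output; to reproduce it or dig a little deeper, roughly:

qhost -h statcluster      # per-host summary (the columns shown above)
qconf -se statcluster     # the exec host entry, including the load values the execd reports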

hope this is not too much for a first post. i'm pretty excited that we have
parallel jobs running via gridengine, but i'm not sure what resource limit is
being hit to cause the spurious behaviour.

often, at 2:30am after a long day, things get a bit blurry and it's hard to
recall what works and when.

thanks in advance,

mark+


--
mark garey
ucsf department of epidemiology and biostatistics
500 parnassus ave, mu420w
san francisco, ca. 94143
415-502-8870





