[GE users] batch job execution behaviour

Reuti reuti at staff.uni-marburg.de
Wed Mar 30 21:51:00 BST 2005



please read:


This should also work on Macs; it only seems mandatory to have a fixed
allocation rule of 2 for them. Up to now I don't know exactly why.
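
If you need to check or change that rule, the usual qconf calls should
do (a sketch; "lam" is the PE name from the configuration quoted below):

    # show the current definition of the 'lam' parallel environment
    qconf -sp lam

    # open the definition in $EDITOR, e.g. to fix allocation_rule at 2
    qconf -mp lam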

Cheers - Reuti

Quoting mark garey <garey at biostat.ucsf.edu>:

> greetings all,
> I have a 6-node, 12-CPU Xserve cluster: Panther 10.3.8, SGE 5.3p6.
> I'm at the point (in my clustering efforts) where I can submit a job
> to the scheduler via qsub, have the job queue, transfer, execute on
> the specified hosts/slots, and exit without error, though not every
> time I submit the same job.
> The job is an R batch job that spawns N MPI slaves on N lamhosts.
> I can run the job interactively using qrsh and a single slot, varying
> the number of MPI slaves and lamhosts, with success on every attempt.
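
For reference, a minimal sketch of what such a batch submission could
look like (bootstrap.R is a placeholder name; the real script would
spawn its own slaves, e.g. via Rmpi's mpi.spawn.Rslaves()):

    #!/bin/sh
    #$ -S /bin/sh
    #$ -cwd
    #$ -pe lam 4
    # run R non-interactively; the script itself starts the MPI slaves
    R CMD BATCH bootstrap.R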
> But when I submit the job with qsub, things get a bit weird.
> When all goes well, the job queues, begins transferring (at which
> point all the relevant queues are dropped because their slots are
> full), the MPI slaves corresponding to the slots used all get maximum
> CPU, the job ends, everything cleans up, and I have a nice R data
> image.
> When all does not go well, the job queues and transfers, the MPI
> slaves clock a bit of CPU and then back off to zero CPU, and the job
> sits there doing nothing until I qdel it.
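
One thing worth ruling out when a run hangs like that: with
control_slaves set to FALSE (see the PE below), the LAM daemons and
slaves run outside Grid Engine's control, so processes left over from
an earlier run can block a new one. Assuming the standard LAM/MPI
tools are on each node, a manual check would be something like:

    # list the nodes of any still-running LAM universe
    lamnodes

    # kill leftover user processes inside that universe
    lamclean

    # shut the LAM daemons down entirely
    lamhalt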
> I'm using a parallel environment I set up for LAM that looks like
> this:
>
> pe_name            lam
> queue_list         statcluster-node1.q statcluster-node2.q
>                    statcluster-node3.q statcluster-node5.q
>                    statcluster-node4.q statcluster.q
> slots              12
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /usr/local/bin/lambooter
> stop_proc_args     /usr/local/bin/lamhalter
> allocation_rule    2
> control_slaves     FALSE
> job_is_first_task  FALSE
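
The contents of lambooter and lamhalter aren't shown, but for a
loosely integrated LAM PE like this one, the start script typically
turns Grid Engine's $PE_HOSTFILE into a LAM boot schema. A minimal
sketch (an assumption about the scripts, not their actual contents):

    #!/bin/sh
    # each $PE_HOSTFILE line reads: hostname slots queue processor-range
    # convert it into a LAM boot schema of the form "hostname cpu=N"
    awk '{ print $1 " cpu=" $2 }' $PE_HOSTFILE > $TMPDIR/lamhosts
    lamboot -v $TMPDIR/lamhosts

The matching stop script would then call lamhalt (or wipe) to tear the
daemons down again.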
> Something else to note: the stderr of a failed job shows that two MPI
> slaves core dumped (or one, depending on the number of MPI slaves
> specified in the R job and the number of lamhosts requested via the
> -pe flag), though I have yet to find an actual core file.
> Somewhat related: can I create a queue that uses n slots? What I have
> now was set up by the gridengine install defaults, i.e. 2 slots per
> node. Can't I have an all.q that covers every slot except the default
> slots on the head node? Otherwise only something like
>
> qsub -q statcluster.q,statcluster-node5.q,statcluster-node4.q,statcluster-node3.q,statcluster-node2.q,statcluster-node1.q ...
>
> seems to occupy all of the slots.
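
Since all six queues are already listed in the PE's queue_list, it
shouldn't be necessary to enumerate them with -q at all; requesting
enough slots through the PE lets the scheduler pick the queues itself
(job.sh being a placeholder for the actual job script):

    # ask for all 12 slots via the PE instead of naming queues
    qsub -pe lam 12 job.sh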
> One other thing: various commands display the total memory of my head
> node (4 GB) as negative, e.g.:
>
> statcluster   darwin   2   1.00   2.0G   -1.4G   0.0   0.0
>
> Is that a bug?
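
For what it's worth, comparing the raw values the execution daemon
reports against the formatted output might narrow that down; assuming
the standard SGE tools:

    # raw load values reported for the head node
    qconf -se statcluster

    # the formatted per-host view where the negative value shows up
    qhost -h statcluster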
> Hope this is not too much for a first post. I'm pretty excited that
> we have parallel jobs running via the gridengine, but I'm not sure
> which resource limit is being hit to cause the spurious behaviour.
> Often at 2:30am after a long day things get a bit blurry, and it's
> hard to recall what works and when.
> Thanks in advance,
> mark+
> --
> mark garey
> ucsf department of epidemiology and biostatistics
> 500 parnassus ave, mu420w
> san francisco, ca. 94143
> 415-502-8870

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

More information about the gridengine-users mailing list