[GE users] batch job execution behaviour
garey at biostat.ucsf.edu
Wed Mar 30 21:43:19 BST 2005
i have a 6 node, 12 cpu xserve cluster - panther 10.3.8, sge 5.3p6.
i'm at a point (in my clustering efforts) now where i can submit a job
to the scheduler via qsub,
have the job queue, transfer, execute on the specified hosts/slots,
and exit without error. but not without error every time i submit the same
job.
the job is an R batch job that spawns N mpi slaves on N lamhosts.
i can run the job interactively using qrsh and a single slot, varying
the number of mpi slaves and lamhosts, with success every attempt.
but when i submit the job with qsub things get a bit weird.
when all goes well, the job queues, begins transferring (at which time
the relevant queues are dropped because their slots are full), the mpi
slaves corresponding to the slots used all get max cpu, the job ends and
cleans up and i have a nice R data image.
when all does not go well, the job queues, transfers, the mpi slaves
clock a bit of cpu then back off to zero cpu. the job will sit there
doing nothing until i qdel it.
i'm using a parallel environment i set up for lam that looks like this:
queue_list statcluster-node1.q statcluster-node2.q
statcluster-node3.q statcluster-node5.q statcluster-node4.q
something else to note is the stderr of a failed job will show that two
(or one, depending on the number of mpi slaves spec'd in the R job and
lamhosts spec'd in the -pe flag) mpi slaves core dumped (though i have
yet to find a real core file).
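
one thing that could explain the missing core file: when the soft core-size limit of a process is 0, the kernel writes no core file at all. the sketch below is only a diagnostic idea, not something taken from sge. it prints a handful of resource limits so the output of a qrsh session can be compared with the output of a qsub job. the choice of limits and the helper names (collect_limits, core_dumps_enabled) are assumptions made for illustration.

```python
import resource

# limits worth comparing between an interactive (qrsh) and a batch (qsub) run.
# this selection is an assumption for illustration, not a list defined by sge.
LIMIT_NAMES = ["RLIMIT_CORE", "RLIMIT_STACK", "RLIMIT_DATA",
               "RLIMIT_NOFILE", "RLIMIT_CPU"]


def fmt(value):
    """render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits(names=LIMIT_NAMES):
    """return {name: (soft, hard)} for each limit this platform defines."""
    limits = {}
    for name in names:
        key = getattr(resource, name, None)
        if key is None:
            # not every unix exposes every limit; skip the missing ones
            continue
        limits[name] = resource.getrlimit(key)
    return limits


def core_dumps_enabled(limits):
    """a soft core limit of 0 means no core file will be written."""
    soft, _hard = limits["RLIMIT_CORE"]
    return soft != 0


if __name__ == "__main__":
    limits = collect_limits()
    for name, (soft, hard) in sorted(limits.items()):
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
    print("core files enabled:", core_dumps_enabled(limits))
```

running the same script once under qrsh and once from inside the qsub job script, then diffing the two outputs, would show whether the batch environment runs with tighter limits than the interactive one.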
something else somewhat related is whether or not i can create a queue
that uses n slots. e.g. what i have
now was set up by default by the gridengine install, that is 2 slots
per node. can't i have an all.q that
specs all slots but the default slots on the head node? otherwise only
seems to occupy all of the slots.
one other thing. various commands display the total memory for my head
node (4GB) as being negative:
statcluster darwin 2 1.00 2.0G -1.4G 0.0
is that a bug?
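
one guess at where a negative figure like this could come from, and it is only a guess that nothing in the sge documentation or source confirms here: a byte count above 2G being held in a signed 32-bit integer somewhere along the way. the sketch below just works through that arithmetic. under that assumption a true value of about 2.6G would print as -1.4G. the function names are made up for the example.

```python
GIB = 1024 ** 3


def as_signed_32(n_bytes):
    """reinterpret the low 32 bits of n_bytes as a signed two's-complement integer."""
    low = n_bytes & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


def show(n_bytes):
    """format the wrapped value in G with one decimal, like the output above."""
    return f"{as_signed_32(n_bytes) / GIB:.1f}G"


if __name__ == "__main__":
    # assumed sample values, chosen only to illustrate the wraparound
    for gib in (1.0, 2.0, 2.6, 3.5):
        wrapped = show(int(gib * GIB))
        print(f"{gib}G true -> {wrapped} if squeezed into signed 32 bits")
```

values below 2G pass through unchanged, which would fit with only the 4GB head node showing the problem, but that is circumstantial rather than proof.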
hope this is not too much for a first post. i'm pretty excited that we
have parallel jobs that will run via
the gridengine but am not sure what resource limit is being hit to
cause the spurious behaviour.
often at 2:30am after a long day, things get a bit blurry and it's hard
to recall what works and when.
thanks in advance,
ucsf department of epidemiology and biostatistics
500 parnassus ave, mu420w
san francisco, ca. 94143