[GE users] qconf -tsm
isakrejda at lbl.gov
Tue Apr 13 17:36:19 BST 2010
Last night I had trouble figuring out why a set of jobs was not entering
execution, and since it was late and I was tired I took the easy way out
and ran qconf -tsm. I missed the fact that the number of pending jobs had
crept up to 10k, and with 1k job slots that one cycle of the scheduler
took 3h and almost drained the cluster. So I have a few questions:
1. Is there a way to turn this debugging on for just one job through 1
cycle of the scheduler?
We do not have the option that keeps track of why a job is waiting
switched on, because it puts too much load on the system and in the past
caused draining of the cluster.
2. Once I realize that it's going to take too much time, what is the best
way to interrupt the cycle?
If I stop and restart the master, is it going to kill the scheduler or
wait until the scheduler finishes its cycle? If I just kill the qmaster,
is there a flag somewhere that would tell it to trigger the logging of
the scheduler cycle again, or is it all in memory?
3. Is there a way to redirect the output of that debugging run out of its
default directory? The directory is on a fairly heavily used common fs. I
thought about creating a link to a local fs before issuing the command.
Would that work? I remember (but have not tried it for a while) that if
<ge_root>/<cell>/common/schedd_runlog exists, sge appends to it.
So would that work and would it help? I wonder whether it was purely I/O
that slowed the run so much, or whether something else is going on when I
throw in that qconf -tsm.
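
To make the link idea in question 3 concrete, here is a rough sketch of what it could look like. It is untested against 6.2u5, and whether the scheduler follows the link and appends, or replaces it with a fresh file, is exactly the open question. The function name redirect_runlog and the local path /var/tmp/sge_debug are made up for illustration. Since in 6.2 the scheduler runs as a thread inside sge_qmaster, the local directory would presumably have to be on the qmaster host.

```shell
#!/bin/sh
# Sketch only: replace schedd_runlog in the common directory with a
# symlink to a file on a local filesystem. redirect_runlog is a
# hypothetical helper, not part of Grid Engine.
#   $1 = the <ge_root>/<cell>/common directory
#   $2 = a directory on a local filesystem (assumed to be on the qmaster host)
redirect_runlog() {
    common_dir=$1
    local_dir=$2
    runlog="$common_dir/schedd_runlog"

    mkdir -p "$local_dir" || return 1

    # Move a pre-existing regular file aside rather than losing it.
    if [ -e "$runlog" ] && [ ! -L "$runlog" ]; then
        mv "$runlog" "$runlog.bak" || return 1
    fi

    # Create the target first so the link is never dangling.
    touch "$local_dir/schedd_runlog" || return 1

    # Drop any old link, then point the usual path at the local file.
    rm -f "$runlog"
    ln -s "$local_dir/schedd_runlog" "$runlog"
}

# Intended use (not run here; /var/tmp/sge_debug is a hypothetical path):
#   redirect_runlog "$SGE_ROOT/$SGE_CELL/common" /var/tmp/sge_debug
#   qconf -tsm
```

After the run, checking whether <ge_root>/<cell>/common/schedd_runlog is still a link would show whether the scheduler wrote through it or replaced it.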
Given the consequences, I am reluctant to experiment, so any insight
would be appreciated.
I am running 6.2u5 version.