[GE users] qconf -tsm

isakrejda isakrejda at lbl.gov
Tue Apr 13 17:36:19 BST 2010


Hi,

Last night I had trouble figuring out why a set of jobs was not entering
execution, and since it was late and I was tired I took the easy way out
and ran qconf -tsm. I missed the fact that the number of pending jobs had
crept up to 10k, and with 1k job slots that one scheduler cycle took 3h
and almost drained the cluster. So I have a few questions.
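
A guard against repeating this could be sketched as below. It is only a
minimal sketch: the function names and the MAX_PENDING threshold are
hypothetical, and it counts job lines in the qstat listing, so an array
job counts once however many tasks it has.

```shell
#!/bin/sh
# Hypothetical guard: skip `qconf -tsm` when the pending list is large.
# MAX_PENDING is an assumed, site-specific threshold.
MAX_PENDING=${MAX_PENDING:-1000}

# Count job lines in qstat output read from stdin.
# Job lines start with a numeric job ID; header lines do not.
count_pending() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Trigger one monitored scheduler run only if the pending count is small.
safe_tsm() {
    pending=$(qstat -s p -u '*' | count_pending)
    if [ "$pending" -gt "$MAX_PENDING" ]; then
        echo "refusing: $pending pending jobs (limit $MAX_PENDING)" >&2
        return 1
    fi
    qconf -tsm
}
```

Calling safe_tsm instead of qconf -tsm directly would have refused to
run in the situation described above.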

1. Is there a way to turn on this debugging for just one job through one
cycle of the scheduler? We do not have the option of permanently keeping
track of why a job is waiting, because it puts too much load on the
system and in the past caused draining of the cluster.

2. Once I realize that it is going to take too much time, what is the
best way to interrupt the cycle? If I stop and restart the master, is it
going to kill the scheduler or wait until the scheduler finishes its
cycle? If I just kill the qmaster, is there a flag somewhere that would
trigger the logging of the scheduler cycle again, or is it all in
memory?

3. Is there a way to redirect the output of that debugging run out of
the <ge_root>/<cell>/common/ directory? The directory is on a fairly
heavily used common fs. I thought about creating a link to a local fs
before issuing the command. Would that work? I remember (but haven't
done it for a while) that if <ge_root>/<cell>/common/schedd_runlog
exists, sge appends to it. So would that work, and would it help? I
wonder whether it was pure I/O that slowed the run so much, or whether
something else is going on when I throw in that qconf -tsm.
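
The link idea from question 3 could be sketched as follows. This is an
untested sketch, not a confirmed procedure: link_runlog and the
directories passed to it are hypothetical, and whether the scheduler
follows the link is exactly the open question.

```shell
#!/bin/sh
# link_runlog COMMON_DIR LOCAL_DIR
# Hypothetical helper: make COMMON_DIR/schedd_runlog a symlink to a file
# in LOCAL_DIR, keeping any existing regular file as schedd_runlog.bak.
link_runlog() {
    common_dir=$1
    local_dir=$2
    target="$local_dir/schedd_runlog"
    link="$common_dir/schedd_runlog"

    mkdir -p "$local_dir" || return 1
    : >> "$target" || return 1        # create the local file if missing

    # Move an existing regular file aside instead of overwriting it.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        mv "$link" "$link.bak" || return 1
    fi

    rm -f "$link"                     # drop a stale link, if any
    ln -s "$target" "$link"
}
```

It would be called with the real common directory and a directory on a
local fs of the host that writes the file, before running qconf -tsm.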

Taking the consequences into account, I am reluctant to experiment, so
any insight would be appreciated.

I am running version 6.2u5.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253263
