[GE users] Strange behavior with sge_qmaster

Sean Dilda agrajag at dragaera.net
Thu Jul 8 20:11:12 BST 2004


On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:

> 
> > Something else I've had happen several times this weekend is that SGE
> > will stop scheduling jobs.  There will be several jobs submitted to SGE,
> > for which there are resources, yet SGE will not launch the jobs.  If I
> > shut down sge_qmaster, then start it up again, those jobs are launched
> > immediately.  I have a feeling that the scheduling loop may be
> > stopping.  I have schedd_job_info set to false.  However when this
> > occurs, I change it to true, yet no matter how long I wait, scheduling
> > info for the jobs never shows up.  Originally I had flush_submit_sec and
> > flush_finish_sec set to '1'.  However when this started I changed them
> > back to '0', but the problem didn't go away.
> 
> --> dto. Please provide more information, e.g. what does
> 
>     qconf -sss
> 
> show? If qmaster doesn't get order from scheduler you will get a "no
> scheduling host defined" answer.
> 
>     Is the scheduler busy (at least from time to time?)

Just noticed the problem happening again.  'qconf -sss' gave 'no
scheduling host defined'.
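
A minimal sketch of how this check could be automated, so the deregistration is caught when it happens rather than when jobs pile up. It assumes only what is shown above: that 'qconf -sss' answers with 'no scheduling host defined' when no scheduler is registered. The function names and the qconf_path parameter are hypothetical, not part of SGE.

```python
import subprocess

# Answer text reported above when qmaster has no registered scheduler.
NO_SCHEDULER_MARKER = "no scheduling host defined"


def scheduler_registered(qconf_output: str) -> bool:
    """Return False if the output contains the 'no scheduling host' answer."""
    return NO_SCHEDULER_MARKER not in qconf_output.lower()


def check_scheduler(qconf_path: str = "qconf") -> bool:
    """Run 'qconf -sss' and report whether a scheduler host is shown.

    qconf_path is a hypothetical parameter; point it at the local binary.
    Both stdout and stderr are inspected, since it is an assumption here
    which stream carries the answer.
    """
    result = subprocess.run(
        [qconf_path, "-sss"], capture_output=True, text=True
    )
    return scheduler_registered(result.stdout + result.stderr)
```

Calling check_scheduler() from cron every few minutes and mailing on a False result would at least timestamp the moment the scheduler drops out.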

In the messages file for qmaster, I found this:
07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "head4"
07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1 deregistered

In the schedd messages file, I saw this:
07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
07/08/2004 01:50:30|schedd|head4|E|unable to send message to qmaster using port 535 on host "head4": got send error
07/08/2004 01:50:31|schedd|head4|W|qmaster alive timeout expired

Another interesting thing I noticed: the messages file for schedd seems
to be full of messages like this:
07/08/2004 01:45:16|schedd|head4|E|can't find parallel task 21384.1 task 1.node10 for update in function pe_task_update_master_list_usage
07/08/2004 01:45:16|schedd|head4|E|callback function for event "565298. EVENT JOB 21384.1 task 1.node10 USAGE" failed
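
To get a feel for how much of each messages file is noise like this, a rough sketch of a parser for the pipe-delimited format seen in the excerpts above (timestamp|daemon|host|level|message). The field layout is inferred from those lines only, and the names LogEntry, parse_line and summarize are made up for illustration. Lines without the four separators, such as wrapped continuations, are skipped.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class LogEntry(NamedTuple):
    date: str
    time: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one 'timestamp|daemon|host|level|message' line, else None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # continuation or unrelated line
    stamp, daemon, host, level, message = parts
    date, _, time = stamp.partition(" ")
    return LogEntry(date, time, daemon, host, level, message)


def summarize(lines: Iterable[str], levels=("E", "W")) -> Counter:
    """Count entries per (daemon, level) for the given severity levels."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level in levels:
            counts[(entry.daemon, entry.level)] += 1
    return counts
```

Feeding it an open file object for the schedd or qmaster messages file (the path depends on the installation) gives a per-daemon count of errors and warnings.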


