[GE users] Strange behavior with sge_qmaster
agrajag at dragaera.net
Thu Jul 8 20:11:12 BST 2004
On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
> > Something else I've had happen several times this weekend is that SGE
> > will stop scheduling jobs. There will be several jobs submitted to SGE,
> > for which there are resources, yet SGE will not launch the jobs. If I
> > shut down sge_qmaster, then start it up again, those jobs are launched
> > immediately. I have a feeling that the scheduling loop may be
> > stopping. I have schedd_job_info set to false. However when this
> > occurs, I change it to true, yet no matter how long I wait, scheduling
> > info for the jobs never shows up. Originally I had flush_submit_sec and
> > flush_finish_sec set to '1'. However when this started I changed them
> > back to '0', but the problem didn't go away.
> --> Ditto. Please provide more information, e.g. what does
> qconf -sss
> show? If qmaster doesn't get orders from the scheduler you will get a "no
> scheduling host defined" answer.
> Is the scheduler busy (at least from time to time)?
Just noticed the problem happening again. 'qconf -sss' gave 'no
scheduling host defined'.
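A minimal sketch of how that check could be scripted, so the condition gets
noticed without someone running qconf by hand. It assumes 'qconf' is on the
PATH and that the failure text is exactly the one quoted above; the function
names are made up for illustration.

```python
import subprocess

# Text reported by 'qconf -sss' in this thread when no scheduler is registered.
NO_SCHEDULER_TEXT = "no scheduling host defined"


def scheduler_registered(qconf_output: str) -> bool:
    """Return True only if the output is non-empty and lacks the failure text."""
    text = qconf_output.strip().lower()
    return bool(text) and NO_SCHEDULER_TEXT not in text


def check_scheduler() -> bool:
    # Assumption: 'qconf' is installed and on PATH on this host.
    result = subprocess.run(["qconf", "-sss"], capture_output=True, text=True)
    # The message may arrive on stdout or stderr, so look at both.
    return scheduler_registered(result.stdout + result.stderr)
```

Calling check_scheduler() from cron and alerting on False would at least
timestamp when the scheduler drops off, which can then be matched against the
messages files.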
In the messages file for qmaster, I found this:
07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "head4"
07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1
In the schedd messages file, I saw this:
07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
07/08/2004 01:50:30|schedd|head4|E|unable to send message to qmaster using port 535 on host "head4": got send error
07/08/2004 01:50:31|schedd|head4|W|qmaster alive timeout expired
Another interesting thing I noticed: the messages file for schedd seems
to be full of messages like this:
07/08/2004 01:45:16|schedd|head4|E|can't find parallel task 21384.1 task 1.node10 for update in function pe_task_update_master_list_usage
07/08/2004 01:45:16|schedd|head4|E|callback function for event "565298. EVENT JOB 21384.1 task 1.node10 USAGE" failed
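Since the schedd messages file is mostly this noise, a small filter helps to
find the timeout entries. Below is a rough sketch, assuming every entry has
the 'date time|daemon|host|level|text' layout seen in the excerpts above and
that any line not matching it is a wrapped continuation of the previous entry.
The function names and the noise list are hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the excerpts in this mail: date time|daemon|host|level|text
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)


def parse_messages(lines):
    """Parse messages-file lines into dicts, re-joining wrapped lines."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            stamp, daemon, host, level, text = match.groups()
            records.append({
                "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
                "daemon": daemon,
                "host": host,
                "level": level,
                "text": text,
            })
        elif records and line.strip():
            # Not a new entry: treat it as a continuation of the previous one.
            records[-1]["text"] += " " + line.strip()
    return records


def interesting(records, levels=("E", "W"), noise=()):
    """Keep entries at the given levels whose text contains no noise substring."""
    return [
        r for r in records
        if r["level"] in levels and not any(n in r["text"] for n in noise)
    ]
```

For example, interesting(parse_messages(open(path)), noise=("can't find parallel task", "callback function for event"))
would leave only the timeout and send-error entries from the excerpt above,
where path points at the schedd messages file.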