[GE users] Jobs getting rescheduled

amfortas n.gresham at manchester.ac.uk
Mon Aug 16 18:53:28 BST 2010


> Am 16.08.2010 um 19:19 schrieb amfortas:
> 
> > Many thanks for responding.
> > 
> >> Were the jobs submitted with "-r y", and/or does the queue have the flag "rerun TRUE" set?
> > 
> > Yes, that is set for the queue, to catch the occasional job that may need to be rescheduled owing to a problem on a work-node.
> 
> Is the job rescheduling itself, or does it only happen when a node goes "unheard" for some time?
> 
> 

I suspected that the 'unheard' re-scheduling trigger may be getting invoked. I have the following parameters:

load_report_time             00:01:00
max_unheard                  00:10:00
reschedule_unknown           00:15:00

But nothing is ever reported as 'unheard', there is never anything in state 'u': all nodes appear to be OK under 'watch -d qhost', for example.
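
To put numbers on those three settings, here is a minimal Python sketch that converts them to seconds and works out how long a node would have to stay silent before its jobs could be rescheduled. It assumes that the reschedule_unknown timer only starts once the host has already been marked unknown after max_unheard; that is my reading of the settings, not something confirmed in this thread. The helper names are purely illustrative and are not part of Grid Engine.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as used in the cluster config, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the configuration quoted above.
CONF = {
    "load_report_time": "00:01:00",
    "max_unheard": "00:10:00",
    "reschedule_unknown": "00:15:00",
}


def missed_reports_before_unknown(conf: dict) -> int:
    """How many load-report intervals fit into max_unheard."""
    return to_seconds(conf["max_unheard"]) // to_seconds(conf["load_report_time"])


def earliest_reschedule_delay(conf: dict) -> int:
    """Seconds of silence before rescheduling could start.

    ASSUMPTION: reschedule_unknown is counted from the moment the host
    enters the unknown state, i.e. it is added on top of max_unheard.
    """
    return to_seconds(conf["max_unheard"]) + to_seconds(conf["reschedule_unknown"])


if __name__ == "__main__":
    print("load reports missed before 'unknown':", missed_reports_before_unknown(CONF))
    print("seconds of silence before reschedule:", earliest_reschedule_delay(CONF))
```

Under that assumption, a node would have to miss about ten consecutive load reports and then stay unknown for a further 15 minutes, which should be hard to miss in 'qhost'.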

> > But what is surprising is that every job in the entire queue is getting rescheduled at the same time, even those that seem to be running quite happily. Is this the intended behaviour when "rerun TRUE" or '-r y' is set?
> 
> No.
> 
> 
> >> Was there any entry in the messages file of the qmaster (while "loglevel log_info" is set)?
> > 
> > Log level was already set to 'log_info', but there is nothing informative in the qmaster 'messages' file.
> > 
> >> Someone issued `qmod -rj "*"` by accident?
> > 
> > I don't think so, no.
> 
> Just as a note: if someone who has manager rights does this, all jobs will be rescheduled.
> 
> Anything in the accounting record? Usually one is written when a job gets rescheduled.

Nothing found there, just the "job? didn't get resources" message in 'reporting'

Regards

[NG]

> 
> -- Reuti
> 
> 
> > Regards
> > 
> > [NG]
> > 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274775
> > 
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274787

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


