[GE users] Jobs getting rescheduled

reuti reuti at staff.uni-marburg.de
Mon Aug 16 19:25:13 BST 2010


On 16.08.2010 at 19:53, amfortas wrote:

>> On 16.08.2010 at 19:19, amfortas wrote:
>> 
>>> Many thanks for responding.
>>> 
>>>> Were the jobs submitted with "-r y", and/or does the queue have the flag "rerun TRUE" set?
>>> 
>>> Yes, that is set for the queue, to catch the occasional job that may need to be rescheduled owing to a problem on a work-node.
>> 
>> Is the job rescheduling itself, or does it happen only when a node has been "unheard" for some time?
>> 
>> 
> 
> I suspected that the 'unheard' re-scheduling trigger may be getting invoked. I have the following parameters:

That would imply that all nodes lost contact at the same time.


> load_report_time             00:01:00
> max_unheard                  00:10:00
> reschedule_unknown           00:15:00
> 
> But nothing is ever reported as 'unheard', there is never anything in state 'u': all nodes appear to be OK under 'watch -d qhost', for example.
> 
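
As a rough sanity check of those values, here is a minimal sketch of the timeline I would expect if the "unheard" trigger were responsible. It assumes, as I read sge_conf(5), that a host is flagged as unknown once nothing has been heard from its execd for max_unheard, and that reschedule_unknown only starts counting from that point. The helper names are made up for illustration and are not part of Grid Engine.

```python
from datetime import timedelta

def parse_hms(value):
    """Convert an SGE-style "HH:MM:SS" string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def unheard_timeline(load_report_time, max_unheard, reschedule_unknown):
    """Estimate when a silent host would be flagged and its jobs rescheduled.

    Assumption: reschedule_unknown is counted from the moment the host
    enters the unknown state, not from the last load report.
    """
    report = parse_hms(load_report_time)
    unheard = parse_hms(max_unheard)
    resched = parse_hms(reschedule_unknown)
    return {
        # how many load reports must go missing before the host is 'u'
        "missed_reports_before_unknown": unheard // report,
        # silence needed before the host shows up as unknown
        "host_unknown_after": unheard,
        # total silence needed before rerunnable jobs are rescheduled
        "jobs_rescheduled_after": unheard + resched,
    }

# Values quoted above
timeline = unheard_timeline("00:01:00", "00:10:00", "00:15:00")
for key, value in timeline.items():
    print(key, value)
```

If that reading is right, a node would have to stay silent for about 25 minutes, and be visible in state 'u' for the last 15 of them, before any of its jobs were touched.
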
>>> But what is surprising is that every job in the entire queue is getting rescheduled at the same time: even those that seem to be running quite happily. Is this the intended behaviour when "rerun TRUE" or '-r y' are set?
>> 
>> No.
>> 
>> 
>>>> Was there any entry in the messages file of the qmaster (while "loglevel log_info" is set)?
>>> 
>>> Log level was already set to 'log_info', but there is nothing informative in the qmaster 'messages' file.
>>> 
>>>> Did someone issue `qmod -rj "*"` by accident?
>>> 
>>> I don't think so, no.
>> 
>> Just as a note: if someone who has manager rights does this, all jobs will be rescheduled.
>> 
>> Anything in the accounting record? Usually one is written when a job gets rescheduled.
> 
> Nothing found there, just the "job? didn't get resources" message in 'reporting'

There is only one place in the source where this message is issued, and the statement just before it writes a record to the accounting file. So I wonder why there is none. `qacct -j <job_id>` should show a record for each and every rescheduled job, even when some fields are left empty.
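
To double-check the accounting file directly, something like the following sketch could be used. It assumes the colon-separated layout described in accounting(5), with the job number in the sixth field, and the usual location $SGE_ROOT/$SGE_CELL/common/accounting. The function name and the shortened sample lines are made up for illustration.

```python
from collections import Counter

# Zero-based index of the job number field; an assumption taken from accounting(5).
JOB_NUMBER_FIELD = 5

def records_per_job(lines):
    """Count accounting records per job number, skipping comments and short lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) <= JOB_NUMBER_FIELD:
            continue
        counts[fields[JOB_NUMBER_FIELD]] += 1
    return counts

# Shortened, invented sample lines; real records carry many more fields.
sample = [
    "# comment line",
    "all.q:node01:users:ng:sim.sh:4711:sge:0",
    "all.q:node02:users:ng:sim.sh:4711:sge:0",
    "all.q:node03:users:ng:other.sh:4712:sge:0",
]
counts = records_per_job(sample)
multiple = sorted(job for job, n in counts.items() if n > 1)
print(multiple)
```

On a real installation the open accounting file can be passed instead of `sample`. A job number with more than one record is only a hint, since the tasks of an array job also produce several records under the same job number.
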

-- Reuti


> Regards
> 
> [NG]
> 
>> 
>> -- Reuti
>> 
>> 
>>> Regards
>>> 
>>> [NG]
>>> 
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274775
>>> 
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274794



