[GE users] Queue Error - any advice?

Reuti reuti at staff.uni-marburg.de
Wed Jan 23 11:20:09 GMT 2008


Hi,

On 23.01.2008 at 11:01, Neil Baker wrote:

> A very useful article. However, the OS logs don't seem to indicate any
> problems.  Perhaps our syslog level isn't set to be verbose enough.
>
> We're running 5.3v7 and I've now managed to speak to the user whose
> jobs caused the problems.  The only big difference between his jobs and
> other users' jobs is the options he passes to qsub:
>
> -m as -M james.nealand at crl.toshiba.co.uk
>
> I can't see how this can cause the problems we're seeing, and not all
> of his jobs using these options caused problems.  Out of the 32 or more
> jobs he submitted, only 17 caused errors.  Each error resulted in only
> one line in the messages log file for that host.
>
> For the 7 errors that didn't cause the exec hosts to go into an error
> state, the error was:
>
> Tue Jan 22 09:25:48 2008|execd|stg-dell19|E|abnormal termination of shepherd for job 7350890.1: "exit_status" file is empty

AFAIR it was also possible in 5.3 to set "loglevel log_info" to see
more information in the messages file(s). Maybe it's a problem with the
mail system?
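
To compare what the messages files say across hosts, something like the
following Python sketch might help. It assumes each line has the five
pipe-separated fields seen in the lines quoted in this thread (timestamp,
daemon, host, level, text) and that "job 7350890.1" means job id and task
id. The function names are made up for illustration and are not part of
SGE.

```python
import re
from collections import defaultdict
from datetime import datetime

# Sample lines copied from this thread (rejoined after mail wrapping).
SAMPLE = [
    'Tue Jan 22 09:25:48 2008|execd|stg-dell19|E|abnormal termination of '
    'shepherd for job 7350890.1: "exit_status" file is empty',
    'Tue Jan 22 09:25:48 2008|execd|stg-dell10|E|"abnormal termination of '
    'shepherd for job 7350903.1: no "exit_status" file"',
]

# Assumption: "for job <id>.<task>" as seen in the two lines above.
JOB_RE = re.compile(r"for job (\d+)\.(\d+)")


def parse_line(line):
    """Split one messages line into its fields; return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, text = parts
    try:
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    match = JOB_RE.search(text)
    return {
        "time": when,
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text.strip('"'),
        "job": (int(match.group(1)), int(match.group(2))) if match else None,
    }


def errors_by_time(lines):
    """Group level-E entries by timestamp as lists of (host, job) pairs."""
    groups = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "E":
            groups[rec["time"]].append((rec["host"], rec["job"]))
    return dict(groups)


if __name__ == "__main__":
    for when, entries in sorted(errors_by_time(SAMPLE).items()):
        print(when, entries)
```

Feeding the messages files of the affected hosts through errors_by_time
groups the E entries by timestamp, so it is easy to see whether several
hosts failed in the same second.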

-- Reuti


> For the 10 errors that did cause the exec hosts to go into an error
> state (disabling them from the grid), the error was:
>
> Tue Jan 22 09:25:48 2008|execd|stg-dell10|E|"abnormal termination of shepherd for job 7350903.1: no "exit_status" file"
>
> As all these jobs ran at 9:25am yesterday and all 10 machines went into
> the Error state at the same time, it does look job related rather than
> purely OS / hardware related.
>
> As a result I've cleared the error state and jobs seem to be running ok
> on those affected machines.
>
> I wouldn't say that the problem is solved, but I think that re-enabling
> them isn't likely to cause problems for most staff, only for the jobs of
> the user in question.
>
> Can anyone recommend how we can turn on better logging to try to capture
> more detailed information?
>
> Regards
>
> Neil
>
> -----Original Message-----
> From: Chris Dagdigian [mailto:dag at sonsorol.org]
> Sent: 22 January 2008 17:12
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Queue Error - any advice?
>
>
> I wrote about queue state "E" a few days ago, not sure if this is of
> interest:
>
> http://gridengine.info/articles/2008/01/20/understanding-queue-error-state-e
>
> { more comments inline ... }
>
> Regards,
> Chris
>
>
> On Jan 22, 2008, at 11:48 AM, Neil Baker wrote:
>
>> Hi,
>>
>> I'm Neil, Richard's colleague.  Today I've been doing a lot more
>> research into this problem of Error queues.  Your replies to Richard's
>> original posting have been very useful BTW.
>>
>> Before I go into the details of my findings: when a queue is in an
>> "Error" state, is it ok to just re-enable the queue (i.e. is it just
>> indicating that the last job failed), or has some serious damage been
>> caused to the exec host and all its queues which needs to be fixed
>> first?  If it needs fixing, does anyone have any suggestions?
>>
>> The good news is that a particular job sent by a single member of staff
>> seems to have caused all the queues on these affected exec hosts to be
>> set into an Error state.  Unfortunately the member of staff works in a
>> different time zone on the other side of the world, so I can't contact
>> him to find out what he is doing until he reads his email tomorrow.
>> Even after I speak to him, he's unlikely to know exactly what his jobs
>> have done to cause this.
>>
>> In the meantime, as all the queues have been set to the Error state
>> rather than just the queue running his job, it seems to indicate that
>> the whole exec host has been affected, and this is the main reason why
>> I've not re-enabled these exec hosts so far.  Other jobs that have run
>> on other enabled exec hosts since these problems occurred don't seem to
>> be causing any problems, so hopefully it looks more like a job problem
>> than a problem with the qmaster.
>
> When your entire SGE system is globally showing state "E" it usually
> means that the "bad" job that triggered the problem in the first place
> was submitted with the "rerunnable" option set to "yes". In these
> situations, your job will fail and will immediately get redispatched to
> another queue instance where it will also fail. Eventually your entire
> cluster goes offline.
>
> The moral here is to use the rerunnable option carefully, and only on
> scripts and workflows that have been tested beforehand.
>
> But ... rerunnable is not always the culprit:
>
> The other option (if rerunnable is not set) is that multiple jobs sent
> to different nodes all caused the problem. This would happen if, for
> instance, someone who does not actually have a user account on the
> remote nodes submitted a bunch of jobs, or if many jobs were submitted
> with impossible output file paths etc.
>
>
>>
>>
>> However I was wondering if anyone had experienced anything similar and
>> how they went about fixing the broken exec hosts (can I get away with
>> just re-enabling the queues on the broken exec hosts, for example?).
>> As they would be re-enabled into a live environment, where staff are
>> trying to meet deadlines, I'd rather not do this if there is a chance
>> they are faulty.
>>
>> Thanks in advance for any advice you can give.
>>
>> Regards
>>
>> Neil
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________
>






More information about the gridengine-users mailing list