[GE users] Queue Error - any advice?

Neil Baker neil.baker at crl.toshiba.co.uk
Wed Jan 23 10:01:34 GMT 2008


Hi Chris,

A very useful article. However, the OS logs don't seem to indicate any
problems.  Perhaps our level of syslog isn't set at a low enough resolution.

We're running 5.3v7 and I've now managed to speak to the user whose jobs
caused the problems.  The only big difference between his jobs and other
users' jobs is the set of options he passes to qsub:

-m as -M james.nealand at crl.toshiba.co.uk

I can't see how this can cause the problems we're seeing, and not all of his
jobs using these options caused problems.  Of the 32 or more jobs he
submitted, only 17 caused errors.  Each error resulted in only one line in
the messages log file for that host.
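
For reference, the letters in the -m option above can be decoded with a small lookup. This is only a sketch: the meanings are taken from the qsub man page, while the MAIL_EVENTS table and the decode_mail_option helper are hypothetical names introduced for illustration.

```python
# Meanings of the qsub -m letters, as described in the qsub man page.
# MAIL_EVENTS and decode_mail_option are illustrative names, not SGE APIs.
MAIL_EVENTS = {
    "b": "beginning of the job",
    "e": "end of the job",
    "a": "job aborted or rescheduled",
    "s": "job suspended",
    "n": "no mail",
}


def decode_mail_option(letters):
    """Return the mail events selected by a qsub -m argument such as 'as'."""
    # The option may be written as 'as' or 'a,s'; commas are skipped.
    return [MAIL_EVENTS[ch] for ch in letters if ch != ","]


if __name__ == "__main__":
    for event in decode_mail_option("as"):
        print(event)
```

So "-m as" asks for mail when a job is aborted or rescheduled and when it is suspended, with -M giving the recipient address.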

For the 7 errors that didn't cause the exec hosts to go into an error state
the error was:

Tue Jan 22 09:25:48 2008|execd|stg-dell19|E|abnormal termination of shepherd for job 7350890.1: "exit_status" file is empty

For the 10 errors that did cause the exec hosts to go into an error state
(disabling them from the grid) the error was:

Tue Jan 22 09:25:48 2008|execd|stg-dell10|E|"abnormal termination of shepherd for job 7350903.1: no "exit_status" file"

As all these jobs ran at 9:25am yesterday and all 10 machines went into the
Error state at the same time, it does look job related rather than purely OS
/ Hardware related.
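
To check that kind of correlation across many hosts, the messages lines can be tallied with a short script. The following is a rough sketch, assuming each message sits on a single line in the pipe-delimited layout shown above (timestamp, daemon, host, level, text). The function names and the "empty" / "missing" labels are made up for illustration; the two sample lines are the ones quoted in this message.

```python
import re
from collections import Counter, defaultdict

# The two messages quoted above, each joined onto a single line.
SAMPLE = [
    'Tue Jan 22 09:25:48 2008|execd|stg-dell19|E|abnormal termination of '
    'shepherd for job 7350890.1: "exit_status" file is empty',
    'Tue Jan 22 09:25:48 2008|execd|stg-dell10|E|"abnormal termination of '
    'shepherd for job 7350903.1: no "exit_status" file"',
]

JOB_RE = re.compile(r"shepherd for job (\d+\.\d+)")


def parse_line(line):
    """Split one messages line into fields; return None if it does not match."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, text = parts
    match = JOB_RE.search(text)
    if level != "E" or match is None:
        return None
    # Classify by the two wordings seen in this thread.
    if 'no "exit_status" file' in text:
        kind = "missing"
    elif '"exit_status" file is empty' in text:
        kind = "empty"
    else:
        kind = "other"
    return {"time": timestamp, "host": host, "job": match.group(1), "kind": kind}


def summarise(lines):
    """Count shepherd errors per kind and collect the hosts affected by each."""
    counts = Counter()
    hosts = defaultdict(set)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["kind"]] += 1
            hosts[record["kind"]].add(record["host"])
    return counts, hosts


if __name__ == "__main__":
    counts, hosts = summarise(SAMPLE)
    for kind in sorted(counts):
        print(kind, counts[kind], sorted(hosts[kind]))
```

Feeding it the messages files from all exec hosts would show whether the "missing" errors line up with the hosts that went into the Error state.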

As a result I've cleared the error state and jobs seem to be running ok on
those affected machines.

I wouldn't say that the problem is solved, but I think that re-enabling them
isn't likely to cause problems for most staff, only for the jobs of the user
in question.

Can anyone recommend how we can turn on better logging to try to capture
more detailed information?

Regards

Neil

-----Original Message-----
From: Chris Dagdigian [mailto:dag at sonsorol.org] 
Sent: 22 January 2008 17:12
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Queue Error - any advice?


I wrote about queue state "E" a few days ago, not sure if this is of  
interest:

http://gridengine.info/articles/2008/01/20/understanding-queue-error-state-e

{ more comments inline ... }

Regards,
Chris


On Jan 22, 2008, at 11:48 AM, Neil Baker wrote:

> Hi,
>
> I'm Neil, Richard's colleague.  Today I've been doing a lot more research
> into this problem of Error queues.  Your replies to Richard's original
> posting have been very useful BTW.
>
> Before I go into the details of my findings, when a queue is in an "Error"
> state, is it ok to just re-enable the queue (i.e. is it just indicating that
> the last job failed), or has some serious damage been caused to the exec
> host and all its queues which needs to be fixed first?  If it needs fixing,
> does anyone have any suggestions?
>
> The good news is that a particular job sent by a single member of staff
> seems to have caused all the queues on these affected exec hosts to be set
> into an Error state.  Unfortunately the member of staff works in a different
> time zone on the other side of the world so I can't contact him to find out
> what he is doing until he reads his email tomorrow.  Even after I speak
> to him, he's unlikely to know exactly what his jobs have done to cause this.
>
> In the meantime, as all the queues have been set to the Error state rather
> than just the queue running his job, it seems to indicate that the whole
> exec host has been affected and this is the main reason why I've not
> re-enabled these exec hosts so far.  Other jobs running since these problems
> occurred on other enabled exec hosts don't seem to be causing any problems,
> so hopefully it looks more like a job problem rather than a problem with the
> qmaster.

When your entire SGE system is globally showing state "E" it usually
means that the "bad" job that triggered the problem in the first place
was submitted with the "rerunnable" option set to "yes". In these
situations, your job will fail and will immediately get redispatched to
another queue instance where it will also fail. Eventually your entire
cluster goes offline.

The moral here is to use the rerunnable option carefully, and only on
scripts and workflows that have been tested beforehand.

But ... rerunnable is not always the culprit:

The other option (if rerunnable is not set) is that multiple jobs sent
to different nodes all caused the problem. This would happen if, for
instance, someone who does not actually have a user account on the
remote nodes submitted a bunch of jobs, or if many jobs were submitted
with impossible output file paths, etc.
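
The difference between the two cases can be pictured with a toy model. This is only an illustrative sketch of the behaviour described above, not how the scheduler is implemented; the simulate function and the queue names are hypothetical.

```python
def simulate(queues, rerunnable):
    """Return the queues left in state E by one job that fails everywhere.

    Toy model: the job puts whichever queue it lands on into state E.
    A rerunnable job is then dispatched again to the next queue that is
    still available; a non-rerunnable job stops after its first failure.
    """
    errored = []
    available = list(queues)
    while available:
        queue = available.pop(0)   # job is dispatched to the next free queue
        errored.append(queue)      # it fails there and the queue goes to E
        if not rerunnable:
            break                  # no redispatch, so the damage stops here
    return errored


if __name__ == "__main__":
    queues = ["node1.q", "node2.q", "node3.q"]  # hypothetical queue names
    print("rerunnable:    ", simulate(queues, rerunnable=True))
    print("not rerunnable:", simulate(queues, rerunnable=False))
```

In this model a single rerunnable job takes out every queue in turn, whereas without rerunnable it takes many separate failing jobs to produce the same picture.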


>
>
> However I was wondering if anyone had experienced anything similar and how
> they went about fixing the broken exec hosts (can I get away with just
> re-enabling the queues on the broken exec hosts for example).  As they would
> be re-enabled into a live environment, where staff are trying to meet
> deadlines, I'd rather not do this if there is a chance they are faulty.
>
> Thanks in advance for any advice you can give.
>
> Regards
>
> Neil
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________




