[GE users] Queue Error - any advice?

Chris Dagdigian dag at sonsorol.org
Tue Jan 22 17:12:10 GMT 2008


I wrote about queue state "E" a few days ago; not sure if this is of
interest:

http://gridengine.info/articles/2008/01/20/understanding-queue-error-state-e

{ more comments inline ... }

Regards,
Chris


On Jan 22, 2008, at 11:48 AM, Neil Baker wrote:

> Hi,
>
> I'm Neil, Richard's colleague.  Today I've been doing a lot more
> research into this problem of Error queues.  Your replies to Richard's
> original posting have been very useful, BTW.
>
> Before I go into the details of my findings, when a queue is in an
> "Error" state, is it OK to just re-enable the queue (i.e. is it just
> indicating that the last job failed), or has some serious damage been
> caused to the exec host and all its queues which needs to be fixed
> first?  If it needs fixing, does anyone have any suggestions?
>
> The good news is that a particular job sent by a single member of
> staff seems to have caused all the queues on these affected exec hosts
> to be set into an Error state.  Unfortunately the member of staff
> works in a different time zone on the other side of the world, so I
> can't contact him to find out what he was doing until he reads his
> email tomorrow.  Even after I speak to him, he's unlikely to know
> exactly what his jobs have done to cause this.
>
> In the meantime, as all the queues have been set to the Error state,
> rather than just the queue running his job, it seems to indicate that
> the whole exec host has been affected, which is the main reason why
> I've not re-enabled these exec hosts so far.  Other jobs running on
> other enabled exec hosts since these problems occurred don't seem to
> be causing any problems, so hopefully it looks more like a job problem
> rather than a problem with the qmaster.

When your entire SGE system is globally showing state "E", it usually
means that the "bad" job that triggered the problem in the first place
was submitted with the "rerunnable" option set to "yes". In that
situation the job fails, putting its queue instance into the error
state, and is immediately redispatched to another queue instance, where
it fails again. Eventually your entire cluster goes offline.

The moral here is to use the rerunnable option carefully, and only on
scripts and workflows that have been tested beforehand.
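For reference, rerunnable is controlled per-job at submit time with
"qsub -r", and can also be forced on by the queue's "rerun" attribute
(visible via "qconf -sq <queue>"). A quick sketch, with hypothetical
script names:

   # explicitly NOT rerunnable -- a failed job stays failed instead of
   # being redispatched to the next queue instance
   qsub -r n my_job.sh

   # rerunnable -- only for scripts/workflows tested beforehand
   qsub -r y tested_workflow.sh

If I remember right, "qstat -j <jobid>" on a pending or running job
will also show whether a restart was requested.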

But ... rerunnable is not always the culprit:

The other possibility (if rerunnable is not set) is that multiple jobs
sent to different nodes all caused the problem. This would happen if,
for instance, someone who does not actually have a user account on the
remote nodes submitted a bunch of jobs, or if many jobs were submitted
with impossible output file paths, etc.
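To answer the "is it OK to just re-enable" question: the "E" state is
cleared with qmod rather than enabled, and clearing it is harmless in
itself -- if the underlying cause is still there, the queue instance
will simply drop back into error the next time a job fails the same
way. Roughly speaking, the usual recovery path looks like this (the
spool path below assumes a default cell name and classic local spool
directories, so adjust for your install):

   # show which queue instances are in error, and the reason recorded
   qstat -f -explain E

   # dig into the execd messages file on an affected host for detail
   less $SGE_ROOT/default/spool/<hostname>/messages

   # inspect the suspect job
   qstat -j <jobid>        # if it is still pending or running
   qacct -j <jobid>        # if it has already finished

   # once the cause is understood (or the bad job has been deleted),
   # clear the error state
   qmod -c '*@<hostname>'  # queue instances on one host
   qmod -c '*'             # every queue instance in the cluster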


>
>
> However, I was wondering if anyone had experienced anything similar
> and how they went about fixing the broken exec hosts (can I get away
> with just re-enabling the queues on the broken exec hosts, for
> example?).  As they would be re-enabled into a live environment, where
> staff are trying to meet deadlines, I'd rather not do this if there is
> a chance they are faulty.
>
> Thanks in advance for any advice you can give.
>
> Regards
>
> Neil
>
