[GE users] Queue Error - any advice?

Neil Baker neil.baker at crl.toshiba.co.uk
Tue Jan 22 16:48:53 GMT 2008


Hi,

I'm Neil, Richard's colleague.  Today I've been doing a lot more research
into this problem of Error queues.  Your replies to Richard's original
posting have been very useful BTW.

Before I go into the details of my findings, when a queue is in an "Error"
state, is it ok to just re-enable the queue (i.e. is it just indicating that
the last job failed), or has some serious damage been caused to the exec
host and all its queues which needs to be fixed first?  If it needs fixing,
does anyone have any suggestions?

The good news is that a particular job sent by a single member of staff
seems to have caused all the queues on these affected exec hosts to be set
into an Error state.  Unfortunately the member of staff works in a different
time zone on the other side of the world so I can't contact him to find out
what he is doing until he reads his email tomorrow.  Even after I speak
to him, he's unlikely to know exactly what his jobs have done to cause this.

In the meantime, as all the queues have been set to the Error state rather
than just the queue running his job, it seems to indicate that the whole
exec host has been affected and this is the main reason why I've not
re-enabled these exec hosts so far.  Other jobs running since these problems
occurred on other enabled exec hosts don't seem to be causing any problems,
so hopefully it looks more like a job problem rather than a problem with the
qmaster.  

However, I was wondering whether anyone has experienced anything similar and
how they went about fixing the broken exec hosts (for example, can I get away
with just re-enabling the queues on the broken exec hosts?).  As they would
be re-enabled into a live environment, where staff are trying to meet
deadlines, I'd rather not do this if there is a chance they are faulty.

Thanks in advance for any advice you can give.

Regards

Neil

-----Original Message-----
From: Richard Hobbs [mailto:richard.hobbs at crl.toshiba.co.uk] 
Sent: 22 January 2008 10:03
To: users at gridengine.sunsource.net
Subject: [GE users] Queue Error - any advice?

Hello,

We are seeing the following:

======================================================================
stg-dell30:~ # qhost -q -h stg-lotus5
HOSTNAME             ARCH       NPROC  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
-------------------------------------------------------------------------------
global               -              -     -        -        -        -        -
stg-lotus5           glinux         4  0.14  1010.2M    88.5M     2.0G    12.4M
   lotus5F1             BIP   0/1      E
   lotus5F2             BIP   0/1      E
   lotus5H1             BIP   0/1      E
   lotus5H2             BIP   0/1      E
   lotus5L1             BIP   0/1      E
   lotus5L2             BIP   0/1      E
   lotus5L3             BIP   0/1      E
   lotus5L4             BIP   0/1      E
   lotus5S1             BIP   0/1      E
   lotus5S2             BIP   0/1      E
stg-dell30:~ #
======================================================================
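
For reference, a minimal sketch of how the queues flagged "E" could be pulled
out of output like the above. It assumes that queue lines are indented and have
four whitespace-separated fields (name, type, used/total, state), as in this
listing. The function name list_error_queues is hypothetical, and the sample
text is copied from the listing rather than read from a live cluster.

```shell
# Sample input copied from the qhost listing above (first three queues only).
qhost_output='stg-lotus5           glinux         4  0.14  1010.2M    88.5M     2.0G    12.4M
   lotus5F1             BIP   0/1      E
   lotus5F2             BIP   0/1      E
   lotus5H1             BIP   0/1      E'

# Hypothetical helper: print the names of queues whose state column contains "E".
# Host lines are skipped because they are not indented; queue lines with an
# empty state column are skipped because they have only three fields.
list_error_queues() {
    awk '/^[ \t]/ && NF == 4 && $4 ~ /E/ { print $1 }'
}

printf '%s\n' "$qhost_output" | list_error_queues
```

On a live system the same filter could be fed from the command shown above,
i.e. "qhost -q -h stg-lotus5 | list_error_queues".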

Does anyone know what I can run to investigate further?

Just for reference, "stg-lotus5" has a load average of 0 and is running
no jobs.

Thanks in advance, people!

Richard.

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Cambridge Research Laboratory
Email: richard.hobbs at crl.toshiba.co.uk
Web: http://www.toshiba-europe.com/research/
Tel: +44 1223 436999        Mobile: +44 7811 803377

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________




