[GE users] Queue Error - any advice?

Ken Tang kentang at berkeley.edu
Tue Jan 22 17:03:03 GMT 2008


What I have usually done in the past, so as not to disturb jobs already 
running on the exec hosts and the qmaster machine, is to delete the job 
that is causing the problem and then clear the error state specifically 
on that exec host, along with the process ID of the job.  Sometimes this 
is enough to bring the queues back.  Other times I have had to simply 
restart the sge_execd service.
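
As a rough sketch of the first approach: the helper below (error_queues 
is a name I made up, it is not part of Grid Engine) reads "qhost -q" 
output and prints the queues whose state column contains E.  It assumes 
the four-column queue lines shown in Richard's output; other releases 
may lay the columns out differently.  The qmod and qstat calls in the 
comments are the standard commands, but check the man pages for your 
version before running them on a live cluster.

```shell
#!/bin/sh
# error_queues: hypothetical helper, not shipped with Grid Engine.
# Reads "qhost -q" output on stdin and prints the name of every queue
# whose state column contains "E" (error).
# Assumes queue lines are indented and have exactly four fields:
#   name  qtype  used/total  state
error_queues() {
    awk '/^[[:space:]]/ && NF == 4 && $4 ~ /E/ { print $1 }'
}

# Example use on the qmaster (commented out so nothing runs by accident):
#
#   qhost -q -h stg-lotus5 | error_queues
#
# On 6.x releases, qstat can report why a queue went into error state:
#
#   qstat -f -explain E
#
# After deleting the offending job (qdel <job_id>), clear the error
# state on each affected queue:
#
#   for q in $(qhost -q -h stg-lotus5 | error_queues); do
#       qmod -c "$q"
#   done
```

If the queues drop straight back into the E state after being cleared, 
that is when I would restart sge_execd on the host.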


Neil Baker wrote:
> Hi,
>
> I'm Neil, Richard's colleague.  Today I've been doing a lot more research
> into this problem of Error queues.  Your replies to Richard's original
> posting have been very useful BTW.
>
> Before I go into the details of my findings, when a queue is in an "Error"
> state, is it ok to just re-enable the queue (i.e. is it just indicating that
> the last job failed), or has some serious damage been caused to the exec
> host and all its queues which needs to be fixed first?  If it needs fixing,
> does anyone have any suggestions?
>
> The good news is that a particular job sent by a single member of staff
> seems to have caused all the queues on these affected exec hosts to be set
> into an Error state.  Unfortunately the member of staff works in a different
> time zone on the other side of the world so I can't contact him to find out
> what he is doing until he reads his email tomorrow.  Even after I speak
> to him, he's unlikely to know exactly what his jobs have done to cause this.
>
> In the meantime, as all the queues have been set to the Error state rather
> than just the queue running his job, it seems to indicate that the whole
> exec host has been affected and this is the main reason why I've not
> re-enabled these exec hosts so far.  Other jobs running since these problems
> occurred on other enabled exec hosts don't seem to be causing any problems,
> so hopefully it looks more like a job problem rather than a problem with the
> qmaster.  
>
> However I was wondering if anyone had experienced anything similar and how
> they went about fixing the broken exec hosts (can I get away with just
> re-enabling the queues on the broken exec hosts for example).  As they would
> be re-enabled into a live environment, where staff are trying to meet
> deadlines, I'd rather not do this if there is a chance they are faulty.
>
> Thanks in advance for advice you can give.
>
> Regards
>
> Neil
>
> -----Original Message-----
> From: Richard Hobbs [mailto:richard.hobbs at crl.toshiba.co.uk] 
> Sent: 22 January 2008 10:03
> To: users at gridengine.sunsource.net
> Subject: [GE users] Queue Error - any advice?
>
> Hello,
>
> We are seeing the following:
>
> ======================================================================
> stg-dell30:~ # qhost -q -h stg-lotus5
> HOSTNAME             ARCH       NPROC  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
> -------------------------------------------------------------------------------
> global               -              -     -        -        -        -        -
> stg-lotus5           glinux         4  0.14  1010.2M    88.5M     2.0G    12.4M
>    lotus5F1             BIP   0/1      E
>    lotus5F2             BIP   0/1      E
>    lotus5H1             BIP   0/1      E
>    lotus5H2             BIP   0/1      E
>    lotus5L1             BIP   0/1      E
>    lotus5L2             BIP   0/1      E
>    lotus5L3             BIP   0/1      E
>    lotus5L4             BIP   0/1      E
>    lotus5S1             BIP   0/1      E
>    lotus5S2             BIP   0/1      E
> stg-dell30:~ #
> ======================================================================
>
> Does anyone know what I can run to investigate further?
>
> Just for reference, "stg-lotus5" has a load average of 0 and is running
> no jobs.
>
> Thanks in advance, people!
>
> Richard.
>
>   

