[GE users] clearing error condition on queues

John Hearns john.hearns at streamline-computing.com
Wed Jul 4 15:29:57 BST 2007



Dave Love wrote:
> My system (running 6.0u8) has got in a mess and is refusing to
> schedule stuff on most of the nodes, saying
> 
>   all.q marked QERROR as result of job 1659's failure at host node1...
> 
> (repeated multiple times).

Look in the messages files, both the qmaster's and the execution host's:
$SGE_ROOT/default/spool/qmaster/messages
$SGE_ROOT/default/spool/node1/messages
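
To pick out the relevant lines from either file, something like the following may help. It is only a sketch: `sge_errors` is a made-up helper name, and it assumes the usual pipe-separated layout of the messages files (date and time, daemon, host, level, text), with `E` and `C` marking error and critical entries. Check a few lines of your own files before relying on it.

```shell
# Hypothetical helper: print the last N error (E) or critical (C) lines
# from an SGE messages file.
# Assumed line layout: date time|daemon|host|level|text
sge_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$1" | tail -n "${2:-20}"
}

# Usage, with the two paths given above (adjust if your spool lives elsewhere):
#   sge_errors "$SGE_ROOT/default/spool/qmaster/messages"
#   sge_errors "$SGE_ROOT/default/spool/node1/messages" 50
```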

In my experience, the errors reported there are almost always genuine,
i.e. SGE does not emit spurious errors, so pay attention to what it says
there.
Fix the underlying problem first, then clear the error state on the queue.
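
For the clearing step, a rough sketch follows. The wrapper name `clear_queue_error` is purely illustrative, and it assumes the SGE client commands are on your PATH. `qstat -f -explain E` lists the reason each queue instance is in the error state, and `qmod -c` clears that state. Later releases may prefer a differently spelled option, so check the qmod man page for your version.

```shell
# Illustrative wrapper: show why queue instances are in the E(rror) state,
# then clear that state on the named queue (defaults to all.q).
# Only run this after the underlying problem has been fixed.
clear_queue_error() {
    queue="${1:-all.q}"
    # Print the full queue listing with the reason for each error state.
    qstat -f -explain E || return 1
    # Clear the error state on the queue (or on a single queue instance).
    qmod -c "$queue"
}

# Usage:
#   clear_queue_error all.q
#   clear_queue_error all.q@node1
```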

Also, is job 1659 still in the queue?
It might be worth putting a hold on it and checking that other jobs can
run while it is held.
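
Roughly, and assuming the job is still pending, that could look like the sketch below. The function name `hold_and_check` is just for illustration. `qhold` places a hold on the job, `qstat -s r -u '*'` lists running jobs for all users, and `qrls` releases the hold afterwards.

```shell
# Illustrative helper: hold a suspect job, then list running jobs so you can
# see whether anything else gets scheduled while it is held.
hold_and_check() {
    job_id="$1"
    # Place a hold on the job so the scheduler leaves it alone.
    qhold "$job_id" || return 1
    # List running jobs for all users.
    qstat -s r -u '*'
}

# Usage:
#   hold_and_check 1659
#   qrls 1659    # release the hold once the node problem is sorted out
```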




-- 
      John Hearns
      Senior HPC Engineer
      Streamline Computing,
      The Innovation Centre, Warwick Technology Park,
      Gallows Hill, Warwick CV34 6UW
      Office: 01926 623130 Mobile: 07841 231235




