[GE users] Problem with 6.1 after vacation

John Hearns john.hearns at streamline-computing.com
Thu Aug 9 15:44:18 BST 2007


On Thu, 2007-08-09 at 07:15 -0700, Brett_W_Grant at raytheon.com wrote:
> 
> Anyone have any ideas? 
> 
> Thanks, 
> Brett Grant

Brett,
  the problem might be that one (or more) of your exec hosts has
dropped an NFS mount.
In my experience, the Gridengine messages file does tell you the truth,
and is very helpful. This "unable to find job" error might indicate that
one or more exec hosts no longer have the SGE spool directory mounted, or
the user's home directory, or wherever the job executable lives.
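
As a rough sketch of that check, something like the following could be run
on an exec host to see which of the directories a job needs are not sitting
on an NFS mount. It only parses text in the /proc/mounts format (Linux). The
function names and the example paths are made up for illustration, so
substitute the real spool, home and application directories for your cell.

```python
def nfs_mount_points(mounts_text):
    """Return the set of mount points whose filesystem type is NFS."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.add(fields[1])
    return points


def missing_nfs_paths(required, mounts_text):
    """Return the required paths that are not on or under any NFS mount point."""
    mounted = nfs_mount_points(mounts_text)
    missing = []
    for path in required:
        covered = any(
            path == mp or path.startswith(mp.rstrip("/") + "/")
            for mp in mounted
        )
        if not covered:
            missing.append(path)
    return missing


# Hypothetical paths, assumed for illustration only.
EXAMPLE_REQUIRED = ["/opt/sge/default/spool", "/home"]
```

On a node you would call it as
missing_nfs_paths(EXAMPLE_REQUIRED, open("/proc/mounts").read()) and look at
anything it returns. Note that a stale mount can still be listed in
/proc/mounts, so an empty result does not prove the mount is healthy; it
only catches mounts that have gone missing altogether.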

Plan of attack:

a) check the NFS server is running on the qmaster machine

b) log into an idle node, check the NFS mounts, unmount an NFS drive or two,
and remount them

c) reboot this idle node

d) do some simple (non-array) job submissions.
    run a 'qrsh hostname'
    qsub a script which is simply a sleep

e) submit a simple array job, which is lots and lots of sleeps as above

f) consider rebooting exec hosts if they have got in a real mess with
NFS

g) train up a PFY so you can enjoy vacation without the cellphone
   http://catb.org/jargon/html/P/PFY.html
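
Steps d) and e) could be scripted along these lines. This is only a sketch:
it assumes qrsh and qsub are on the PATH of a submit host, and it uses the
standard qsub options -b y (submit a binary rather than a script), -cwd,
-N and -t (array task range). The job names, the sleep length, the task
count and the helper names are arbitrary choices for illustration.

```python
import subprocess


def smoke_test_commands(sleep_seconds=60, array_tasks=100):
    """Build the test commands: an interactive job, one sleep, an array of sleeps."""
    sleep = ["sleep", str(sleep_seconds)]
    return [
        # d) simple, non-array submissions
        ["qrsh", "hostname"],
        ["qsub", "-b", "y", "-cwd", "-N", "sleeptest"] + sleep,
        # e) an array job made of many identical sleeps
        ["qsub", "-b", "y", "-cwd", "-N", "sleeparray",
         "-t", "1-%d" % array_tasks] + sleep,
    ]


def run_all(commands, runner=subprocess.call):
    """Run each command in order and return a list of (command, exit status)."""
    return [(cmd, runner(cmd)) for cmd in commands]
```

Calling run_all(smoke_test_commands()) submits the three tests in order. A
non-zero status from qrsh or qsub is the cue to go and read the messages
file on the qmaster and on the exec host concerned.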




