[GE users] Machines constantly in Error state, and won't stay cleared...

craffi dag at sonsorol.org
Tue Jan 19 20:13:34 GMT 2010



The best debugging tip is to check the output of your jobs for error
messages.

The next best is to look at the .o and .e files that SGE may have
created for the job.

After that, check the qmaster messages file and the messages file of the
execution host where the job ran:

$SGE_ROOT/$SGE_CELL/spool/qmaster/messages
$SGE_ROOT/$SGE_CELL/spool/<nodename>/messages
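
A minimal shell sketch of that lookup, in case it helps. The function names are made up for illustration, and falling back to a cell called "default" when SGE_CELL is unset is an assumption. If your site spools execd data to local disk (the execd_spool_dir setting shown by "qconf -sconf"), the node's messages file will probably live under that directory instead.

```shell
#!/bin/sh
# Print the two spool "messages" paths for one execution host.
# Falling back to the "default" cell is an assumption, not a rule.
sge_message_files() {
    node=$1
    root=${SGE_ROOT:?SGE_ROOT must be set}
    cell=${SGE_CELL:-default}
    printf '%s\n' "$root/$cell/spool/qmaster/messages" \
                  "$root/$cell/spool/$node/messages"
}

# Show the lines that mention a job id in whichever of those files
# are readable from the machine you run this on.
grep_job_in_messages() {
    job_id=$1
    node=$2
    sge_message_files "$node" | while IFS= read -r f; do
        if [ -r "$f" ]; then
            echo "== $f"
            grep -n -w -- "$job_id" "$f" || echo "(no lines mention $job_id)"
        else
            echo "== $f (not readable on this host)"
        fi
    done
}
```

For the jobs in the original post this would be called as "grep_job_in_messages 367 air", using whatever short or full hostname the spool directory is actually named after.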

Then look in /tmp on the nodes where the job ran to see if there are any 
panic log messages.


From memory, here are the causes of error state E on jobs:

- user does not exist on the remote node
- user has a different UID/GID on the remote node
- really badly messed up fileserver or directory permissions
- missing NFS mount or bad file path error
- really messed up local OS situation on the compute node
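
The first two causes in the list above are easy to check by hand. Here is a rough sketch, assuming passwordless ssh from the submit host to each compute node; the function names are hypothetical and nothing here is SGE-specific. It compares the uid:gid pair reported locally with the one each node reports.

```shell
#!/bin/sh
# Print "uid:gid" for a user on this machine; fail if the user is unknown.
user_ids() {
    uid=$(id -u "$1" 2>/dev/null) || return 1
    gid=$(id -g "$1" 2>/dev/null) || return 1
    printf '%s:%s\n' "$uid" "$gid"
}

# Compare the local uid:gid with what each listed host reports.
# Assumes key-based ssh access to the nodes (BatchMode never prompts).
check_user_on_hosts() {
    user=$1
    shift
    want=$(user_ids "$user") || { echo "local: user $user not found"; return 1; }
    rc=0
    for host in "$@"; do
        got=$(ssh -o BatchMode=yes "$host" "id -u '$user' && id -g '$user'" 2>/dev/null | paste -s -d: -)
        if [ "$got" = "$want" ]; then
            echo "$host: ok ($got)"
        else
            echo "$host: MISMATCH (local $want, remote ${got:-user not found or host unreachable})"
            rc=1
        fi
    done
    return $rc
}
```

Something like "check_user_on_hosts eharvey air" plus the other three hostnames would show quickly whether the two failing machines differ from the two working ones.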


What you need to understand is that error state E is caused by a job 
failing in such a spectacular manner that SGE decides that something is 
really wrong and closes off the entire queue instance to avoid a "black 
hole" effect where all pending jobs drain and exit on error on a 
malfunctioning system.

Trust me, the E state is for your protection. Heh.
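
Since the E state names the jobs that triggered it, one way to work backwards is to pull those job ids out of the "qstat -explain E" output and look each one up. A small sketch follows; the function name is hypothetical and the pattern is based only on the message text quoted below.

```shell
#!/bin/sh
# Read "qstat -explain E" output on stdin and print each job id that
# caused a QERROR, once, in numeric order.
qerror_jobs() {
    sed -n "s/.*QERROR as result of job \([0-9][0-9]*\)'s failure.*/\1/p" | sort -un
}

# Typical use on a live cluster (not run here):
#   qstat -f -explain E | qerror_jobs
#   qacct -j 367                       # accounting record, if one was written
#   qmod -cq camb.q@air.lyricsemi.hdq  # clear the state once the cause is fixed
```

Clearing the state with qmod before the underlying problem is fixed just means the next job puts the queue instance straight back into E.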


-Chris



sgenedharvey wrote:
> I can't figure out what's causing this problem. I installed execution
> host on 4 machines, precisely the same, by copying & pasting the
> commands precisely the same on each machine. Two of the machines are
> working flawlessly, and two of the machines keep going into Error state.
> Even if I uninstall/reinstall execution host daemon, completely removing
> all SGE files from the systems ... Even after clearing the Error state
> from the queues ... The first time the machine is scheduled to run a
> job, the job fails, and the machine returns to Error state.
>
>
> If I check the reason for Error state:
> [eharvey@air gridout]$ qstat -explain E
> queuename                  qtype resv/used/tot. load_avg arch        states
> ---------------------------------------------------------------------------------
> camb.q@air.lyricsemi.hdq   BIP   0/0/2          2.11     lx24-amd64  E
>     queue camb.q marked QERROR as result of job 367's failure at host air.lyricsemi.hdq
>     queue camb.q marked QERROR as result of job 369's failure at host air.lyricsemi.hdq
>
>
> So, obviously, I want to know why those 2 jobs failed .... But can't
> seem to find any record anywhere...
>
>
> If I check the man page, it says "Please check the error logfile of that
> sge_execd"
> But I can't find any logfile ... Can anybody tell me where to find the
> logfile? Or any other method to figure out why these machines keep going
> into error state?
>
>
> I am running SGE 6.2u4
>
> Thanks....

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239789
