[GE users] Machines constantly in Error state, and won't stay cleared...

templedf dan.templeton at sun.com
Wed Jan 20 14:32:53 GMT 2010

Sounds like an NFS or name service issue.  Check in the /tmp directory 
on air to see if the execd is dumping an emergency log file there.  You 
should also try touching a file in /gridware/sge/default as the admin 
user to see if it works manually.
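
The two checks above could be scripted roughly as follows. This is only a sketch: the helper names can_write_dir and list_logs are made up for illustration, and the "*messages*" name pattern is a guess, since the exact name of the emergency log file is not given here.

```shell
# Sketch only: helper names and the default log-name pattern are
# assumptions for illustration, not part of Grid Engine.

# Try to create and then remove a scratch file in the given directory.
# Returns 0 if the write succeeded, 1 otherwise.
can_write_dir() {
    dir=$1
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# List regular files directly inside a directory whose names match a
# pattern (default: anything containing "messages").
# Prints one path per line, or nothing if there are no matches.
list_logs() {
    dir=$1
    pattern=${2:-'*messages*'}
    find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null
}
```

Run as the admin user on air, a non-zero return from "can_write_dir /gridware/sge/default" would point toward an NFS or permissions problem, and "list_logs /tmp" would show any candidate emergency log files.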


sgenedharvey wrote:
> Ahh.  Well, this led somewhere.
> I have no execution directory on the machines that are failing.  I can't
> seem to figure out why...
> During installation, I run "./inst_sge -x" and after a while, it asks:
>     The spool directory is currently set to:
>     <</gridware/sge/default/spool/air>>
> And I just hit "enter" for default.
> On a "good" machine, when sge_execd starts, it comes up instantly (less than
> a second or two) and it creates the various subdirs of the spool directory
> automatically.
> On a "bad" machine, when sge_execd starts, it takes a long time (I guess 30
> seconds) and there is no spool directory created.  There is also no error
> message. 
> So ... I have no idea why the behavior would be different...
> On 1/19/10 3:22 PM, "templedf" <dan.templeton at sun.com> wrote:
>> Look in the "messages" file in the execd's spool directory. It's
>> probably located at $SGE_ROOT/$SGE_CELL/spool/air/messages. If it's not
>> there, look at qconf -sconf air or qconf -sconf to find the location of
>> the spool directory.
>> Daniel
>> sgenedharvey wrote:
>>> I can't figure out what's causing this problem. I installed execution
>>> host on 4 machines, precisely the same, by copying & pasting the
>>> commands precisely the same on each machine. Two of the machines are
>>> working flawlessly, and two of the machines keep going into Error
>>> state. Even if I uninstall/reinstall execution host daemon, completely
>>> removing all SGE files from the systems ... Even after clearing the
>>> Error state from the queues ... The first time the machine is
>>> scheduled to run a job, the job fails, and the machine returns to
>>> Error state.
>>> If I check the reason for Error state:
>>> [eharvey@air gridout]$ qstat -explain E
>>> queuename                      qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> camb.q@air.lyricsemi.hdq       BIP   0/0/2          2.11     lx24-amd64    E
>>>     queue camb.q marked QERROR as result of job 367's failure at host air.lyricsemi.hdq
>>>     queue camb.q marked QERROR as result of job 369's failure at host air.lyricsemi.hdq
>>> So, obviously, I want to know why those 2 jobs failed .... But can't
>>> seem to find any record anywhere...
>>> If I check the man page, it says "Please check the error logfile of
>>> that sge_execd"
>>> But I can't find any logfile ... Can anybody tell me where to find the
>>> logfile? Or any other method to figure out why these machines keep
>>> going into error state?
>>> I am running SGE 6.2u4
>>> Thanks....
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239790
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239810

