[GE users] Machines constantly in Error state, and won't stay cleared...

torsten torsten.blix at sun.com
Wed Jan 20 07:06:13 GMT 2010



On 01/19/10 23:18, sgenedharvey wrote:
> On a "bad" machine, when sge_execd starts, it takes a long time (I guess 30
> seconds) and there is no spool directory created.  There is also no error
> message. 
> 
> So ... I have no idea why the behavior would be different...

Before the spool directory is created, the messages file is first 
created in /tmp, so check there to see whether sge_execd logged anything.
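
A minimal sketch of that check, to be run on one of the "bad" hosts right
after starting sge_execd. The function name show_tmp_messages is made up
for illustration, and the '*messages*' file name pattern is an assumption
(the exact name of the temporary file is not given above), so adjust it
to whatever actually shows up in /tmp on your version.

```shell
# Print the tail of every candidate early-startup log file in a directory.
# Assumption: the temporary file has "messages" somewhere in its name.
show_tmp_messages() {
    dir="${1:-/tmp}"      # directory to search, /tmp by default
    lines="${2:-20}"      # number of trailing lines to print per file
    find "$dir" -maxdepth 1 -type f -name '*messages*' 2>/dev/null |
    while IFS= read -r f; do
        printf '==> %s <==\n' "$f"
        tail -n "$lines" "$f"
    done
}
```

Calling show_tmp_messages with no arguments searches /tmp and prints the
last 20 lines of each match; a directory and a line count can be passed
as the first and second argument.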

Cheers,
Torsten

> On 1/19/10 3:22 PM, "templedf" <dan.templeton at sun.com> wrote:
> 
>> Look in the "messages" file in the execd's spool directory. It's
>> probably located at $SGE_ROOT/$SGE_CELL/spool/air/messages. If it's not
>> there, look at qconf -sconf air or qconf -sconf to find the location of
>> the spool directory.
>>
>> Daniel
>>
>> sgenedharvey wrote:
>>> I can't figure out what's causing this problem. I installed execution
>>> host on 4 machines, precisely the same, by copying & pasting the
>>> commands precisely the same on each machine. Two of the machines are
>>> working flawlessly, and two of the machines keep going into Error
>>> state. Even if I uninstall/reinstall execution host daemon, completely
>>> removing all SGE files from the systems ... Even after clearing the
>>> Error state from the queues ... The first time the machine is
>>> scheduled to run a job, the job fails, and the machine returns to
>>> Error state.
>>>
>>>
>>> If I check the reason for Error state:
>>> [eharvey@air gridout]$ qstat -explain E
>>> queuename qtype resv/used/tot. load_avg arch states
>>> ---------------------------------------------------------------------------------
>>> camb.q@air.lyricsemi.hdq BIP 0/0/2 2.11 lx24-amd64 E
>>> queue camb.q marked QERROR as result of job 367's failure at host
>>> air.lyricsemi.hdq
>>> queue camb.q marked QERROR as result of job 369's failure at host
>>> air.lyricsemi.hdq
>>>
>>>
>>> So, obviously, I want to know why those 2 jobs failed .... But can't
>>> seem to find any record anywhere...
>>>
>>>
>>> If I check the man page, it says "Please check the error logfile of
>>> that sge_execd"
>>> But I can't find any logfile ... Can anybody tell me where to find the
>>> logfile? Or any other method to figure out why these machines keep
>>> going into error state?
>>>
>>>
>>> I am running SGE 6.2u4
>>>
>>> Thanks....
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239790
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239810
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239886



