[GE users] Machines constantly in Error state, and won't stay cleared...

sgenedharvey sge at nedharvey.com
Tue Jan 19 22:18:39 GMT 2010



Ahh.  Well, this led somewhere.

There is no execd spool directory on the machines that are failing, and I
can't figure out why...

During installation, I run "./inst_sge -x" and after a while, it asks:
    The spool directory is currently set to:
    <</gridware/sge/default/spool/air>>
And I just hit "enter" for default.

On a "good" machine, when sge_execd starts, it comes up almost instantly
(within a second or two) and it creates the various subdirs of the spool
directory automatically.

On a "bad" machine, when sge_execd starts, it takes a long time (roughly 30
seconds) and no spool directory is created.  There is also no error
message.

So ... I have no idea why the behavior would be different...
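
A minimal sketch of the check described above, for comparing a "good" and a
"bad" host. The function names are hypothetical (they are not part of SGE),
and the /gridware/sge and default fallbacks are only the values from the
install prompt quoted above:

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration, not SGE commands.

# Build the default execd spool path for a host, following the
# $SGE_ROOT/$SGE_CELL/spool/<host> layout mentioned in this thread.
# The fallback values are assumptions taken from the install prompt.
execd_spool_dir() {
    host=$1
    printf '%s/%s/spool/%s\n' \
        "${SGE_ROOT:-/gridware/sge}" "${SGE_CELL:-default}" "$host"
}

# Report whether the spool directory exists and whether it already
# holds a "messages" file. Returns 1 only when the directory is missing.
check_spool_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ -f "$dir/messages" ]; then
        echo "ok: $dir/messages"
    else
        echo "no messages file in $dir"
    fi
    return 0
}
```

Running check_spool_dir "$(execd_spool_dir air)" on each host shows at a
glance which ones never got a spool directory. If the directory is
configured somewhere else, qconf -sconf air | grep execd_spool_dir should
show the location actually in use.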



On 1/19/10 3:22 PM, "templedf" <dan.templeton at sun.com> wrote:

> Look in the "messages" file in the execd's spool directory. It's
> probably located at $SGE_ROOT/$SGE_CELL/spool/air/messages. If it's not
> there, look at qconf -sconf air or qconf -sconf to find the location of
> the spool directory.
> 
> Daniel
> 
> sgenedharvey wrote:
>> I can't figure out what's causing this problem. I installed execution
>> host on 4 machines, precisely the same, by copying & pasting the
>> commands precisely the same on each machine. Two of the machines are
>> working flawlessly, and two of the machines keep going into Error
>> state. Even if I uninstall/reinstall execution host daemon, completely
>> removing all SGE files from the systems ... Even after clearing the
>> Error state from the queues ... The first time the machine is
>> scheduled to run a job, the job fails, and the machine returns to
>> Error state.
>> 
>> 
>> If I check the reason for Error state:
>> [eharvey@air gridout]$ qstat -explain E
>> queuename                 qtype resv/used/tot. load_avg arch       states
>> ---------------------------------------------------------------------------------
>> camb.q@air.lyricsemi.hdq  BIP   0/0/2          2.11     lx24-amd64 E
>> queue camb.q marked QERROR as result of job 367's failure at host air.lyricsemi.hdq
>> queue camb.q marked QERROR as result of job 369's failure at host air.lyricsemi.hdq
>> 
>> 
>> So, obviously, I want to know why those 2 jobs failed .... But can't
>> seem to find any record anywhere...
>> 
>> 
>> If I check the man page, it says "Please check the error logfile of
>> that sge_execd"
>> But I can't find any logfile ... Can anybody tell me where to find the
>> logfile? Or any other method to figure out why these machines keep
>> going into error state?
>> 
>> 
>> I am running SGE 6.2u4
>> 
>> Thanks....
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239790
> 
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239810

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


