[GE users] Machines constantly in Error state, and won't stay cleared...

sgenedharvey sge at nedharvey.com
Tue Jan 19 19:59:39 GMT 2010



I can't figure out what's causing this problem.  I installed the execution host on 4 machines in exactly the same way, copying and pasting the same commands on each machine.  Two of the machines are working flawlessly, and two keep going into Error state.  Uninstalling and reinstalling the execution host daemon, with all SGE files completely removed from the systems, doesn't help, and neither does clearing the Error state from the queues: the first time the machine is scheduled to run a job, the job fails, and the machine returns to Error state.


If I check the reason for Error state:
[eharvey@air gridout]$ qstat -explain E
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
camb.q@air.lyricsemi.hdq       BIP   0/0/2          2.11     lx24-amd64    E
    queue camb.q marked QERROR as result of job 367's failure at host air.lyricsemi.hdq
    queue camb.q marked QERROR as result of job 369's failure at host air.lyricsemi.hdq
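
For anyone who wants to collect the failing job IDs from several hosts at once, here is a rough sketch in Python that pulls them out of the explanation lines. It is not part of SGE: the function name parse_qerror_lines and the regular expression are my own assumptions, based only on the wording of the two lines shown above, and other SGE versions may word the message differently.

```python
import re

# Assumed pattern, derived only from the two explanation lines quoted above.
QERROR_RE = re.compile(
    r"queue (?P<queue>\S+) marked QERROR as result of job (?P<job>\d+)'s "
    r"failure at host (?P<host>\S+)"
)


def parse_qerror_lines(text):
    """Return a (queue, job_id, host) tuple for each QERROR explanation line."""
    return [
        (m.group("queue"), int(m.group("job")), m.group("host"))
        for m in QERROR_RE.finditer(text)
    ]


# The explanation lines exactly as they appear in the qstat output above.
SAMPLE = """\
    queue camb.q marked QERROR as result of job 367's failure at host air.lyricsemi.hdq
    queue camb.q marked QERROR as result of job 369's failure at host air.lyricsemi.hdq
"""

if __name__ == "__main__":
    for queue, job_id, host in parse_qerror_lines(SAMPLE):
        print(f"{queue} on {host}: job {job_id}")
```

In practice the text would come from the output of qstat -explain E rather than from a hard-coded string; the sample is only there so the sketch runs on its own.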


So, obviously, I want to know why those 2 jobs failed, but I can't seem to find any record of the failure anywhere.


If I check the man page, it says "Please check the error logfile of that sge_execd", but I can't find any logfile.  Can anybody tell me where to find it?  Or is there any other way to figure out why these machines keep going into Error state?


I am running SGE 6.2u4

Thanks....


