[GE users] Dead processes

Reuti reuti at staff.uni-marburg.de
Fri Sep 10 19:15:28 BST 2004


>	Had a scenario yesterday, running sge-6.0u1 on a mix of Redhat 7.3 and SUSE 
>9.1, where a couple of grid jobs were running on a SUSE exec host, and the 
>system died for a reason completely unrelated to SGE.  When it came back up, 
>starting the sgeexecd gave me some message about these two process ids - I 
>guess it was trying to recover.  It didn't, and the machine wouldn't be 
>recognized by the master - it showed an "AU" status.  Restarting the daemon 
>made no difference.  The only way I could get it back up was to dig through 
>the $SGE_ROOT/default/spool/foosystem subdirectories, manually delete all 
>the various jobs and scripts floating around from those crashed jobs, and 
>restart the daemon.

This can happen (as you noted) after a machine crashes. Usually something is 
out of sync: a pid is recorded for a running job, but the job itself is no 
longer there. There are always hints about this in the messages files of the 
qmaster and the node.
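
To illustrate the out-of-sync condition, here is a minimal Python sketch that 
checks whether a list of recorded pids still refers to live processes. It is 
not what the execd does internally; the function names are hypothetical, and 
how you collect the pids from the spool directory is left open. Note that 
after a reboot a pid may have been reused by an unrelated process, so "alive" 
does not prove it is the old job.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if some process with this pid currently exists (Unix)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def stale_pids(recorded):
    """Return the recorded pids that no longer have a matching process."""
    return [pid for pid in recorded if not pid_is_alive(pid)]
```

Called as stale_pids([...]) with the pids mentioned in the execd's startup 
message, it returns the ones that did not survive the crash.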

>	My questions are:  is there a method I can use to clear this state and get 
>the machine back online without the manual digging?  Also, is this sort of 
>thing common?  It seemed to me that SGE was expecting specific PIDs from the 
>previous session, couldn't find them, and instead of giving some sort of 
>error and coming up properly, it seems to lock.  That doesn't seem to be very 
>nice behaviour...seems to put the job in a higher priority over exec machine 
>status.  To me, resubmitting a task is usually trivial - having that machine 
>up and running is more important.

You mean that, on a crashed exec host, a restart of the execd should clean out 
all of the directories (or copy their contents to a backup area)?
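
The "copy to a backup area" variant could be sketched in Python roughly as 
follows, to be run by hand while the execd is stopped. This is only an 
illustration under assumptions: the subdirectory names in JOB_SUBDIRS are my 
assumption about the execd spool layout and should be checked against your 
installation, and backup_job_spool is a hypothetical helper, not part of SGE.

```python
import shutil
import time
from pathlib import Path

# Assumed names of the per-job subdirectories in the execd spool directory.
JOB_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def backup_job_spool(spool_dir, backup_root, subdirs=JOB_SUBDIRS):
    """Move leftover job entries into a timestamped backup directory.

    The subdirectories themselves are kept (now empty); only their
    contents are moved. Returns the list of new locations.
    """
    spool_dir = Path(spool_dir)
    dest = Path(backup_root) / time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for name in subdirs:
        src = spool_dir / name
        if not src.is_dir():
            continue  # layout differs from the assumption, skip it
        for entry in sorted(src.iterdir()):
            target_dir = dest / name
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / entry.name
            shutil.move(str(entry), str(target))
            moved.append(target)
    return moved
```

Moving instead of deleting keeps the leftovers around for a later look at what 
the crashed jobs were doing.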

CU - Reuti
