[GE users] Dead processes

John Coldrick jc at axyzfx.com
Fri Sep 10 14:21:43 BST 2004

	I had a scenario yesterday, running sge-6.0u1 on a mix of Red Hat 7.3 and SUSE 
9.1, where a couple of grid jobs were running on a SUSE exec host and the 
system died for a reason completely unrelated to SGE.  When it came back up, 
starting sgeexecd gave me a message about those two process IDs, so I guess 
it was trying to recover.  It didn't, and the master wouldn't recognize the 
machine, showing it with an "AU" status.  Restarting the daemon made no 
difference.  The only way I could get it back up was to dig through 
the $SGE_ROOT/default/spool/foosystem subdirectories by hand, delete all 
the job files and scripts left over from those crashed jobs, and 
restart the daemon.
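
	For reference, the manual cleanup described above could be sketched roughly 
as follows.  This is only a sketch under assumptions, not a supported SGE 
procedure: the subdirectory names in JOB_SUBDIRS and the helper 
clear_stale_jobs are hypothetical and should be checked against the actual 
contents of the exec host's spool directory.  It defaults to a dry run that 
only lists what it would delete.

```python
import shutil
from pathlib import Path

# Assumed names of the job-related subdirectories inside an exec host's
# spool directory; verify them against the real spool layout before use.
JOB_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def clear_stale_jobs(spool_dir, dry_run=True):
    """Remove leftover job entries under spool_dir.

    Returns the list of paths that were (or, in a dry run, would be) removed.
    Files outside the assumed job subdirectories are left untouched.
    """
    affected = []
    for name in JOB_SUBDIRS:
        subdir = Path(spool_dir) / name
        if not subdir.is_dir():
            continue  # this subdirectory does not exist on this host
        for entry in sorted(subdir.iterdir()):
            affected.append(entry)
            if dry_run:
                continue  # only report, do not delete
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return affected
```

	The idea would be to stop the exec daemon first, call 
clear_stale_jobs() on the host's spool directory (the 
$SGE_ROOT/default/spool/foosystem path above) to review the list, call it 
again with dry_run=False, and then restart the daemon.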

	My questions are:  is there a method I can use to clear this state and get 
the machine back online without the manual digging?  Also, is this sort of 
thing common?  It seemed to me that SGE was expecting specific PIDs from the 
previous session, couldn't find them, and instead of reporting an 
error and coming up properly, it seemed to lock up.  That doesn't seem to be very 
nice behaviour; it seems to give the jobs a higher priority than the exec 
machine's status.  To me, resubmitting a task is usually trivial, and having 
that machine up and running is more important.



John Coldrick                  www.axyzfx.com        Axyz Animation
Houdini/Renderman/Discreet                           425 Adelaide St W
416-504-0425                                         Toronto, ON Canada
jc at axyzfx.com                                        M5V 1S4
Always remember that you are unique.  Just like everyone else.
