[GE users] Yet another qdel mpich problem (SGE 6.0u1)

Vladimir Florinski vflorins at ucr.edu
Wed Sep 8 03:11:44 BST 2004


On Tue, 2004-09-07 at 18:29, Ron Chen wrote:
> (I didn't read the whole thread, so maybe there's
> something that I missed)
> 
> The code in shepherd.c - shepherd_signal_job() is
> supposed to signal all the processes in the job. You
> can get more detail here:
> 
> http://gridengine.sunsource.net/servlets/ReadMsg?msgId=11809&listName=users
> 

OK, I found two places where some code has been disabled for Linux,
Solaris and Alpha. I removed the #if 0 and recompiled SGE (a correct
build procedure requires installing Berkeley DB, but I was lazy and built
with -spool-classic; I don't think it matters, since I only took the
sge_shepherd program out of the build).

Unfortunately, using the rebuilt sge_shepherd did not change the
situation: the processes are still running after a qdel. Again, I would
like to emphasize that on one of the nodes (always the same one) one of
the processes (with a smaller PID) is always killed correctly. Is it
possible that some file that is supposed to be local to each node is
instead shared on this system (this is a diskless cluster, so everything
is mounted via NFS)? I know there is information about the child
processes kept in /tmp/<job.queue>/, but all the files there had unique
names, so there could be no conflict.
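
For illustration only, here is a minimal sketch of what "signal all the
processes in the job" means at the operating-system level, and of one way a
process can end up outside the reach of such a signal. It is not the SGE code
and is not a diagnosis of this cluster. It assumes a POSIX system, and the
names demo_group_signal, leader, member and escaped are hypothetical. The
assumption being shown is that a signal sent to a process group reaches only
the processes that are still in that group; a process that has started its
own session is not affected.

```python
import os
import signal
import subprocess
import sys

# A child that does nothing but sleep, so it stays alive until signalled.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def demo_group_signal():
    # "leader" stands in for the first process of a job: it starts a new
    # process group, so its process group ID equals its PID.
    leader = subprocess.Popen(SLEEPER, preexec_fn=os.setpgrp)
    pgid = leader.pid

    # "member" joins the leader's group, like an ordinary child of the job.
    member = subprocess.Popen(SLEEPER, preexec_fn=lambda: os.setpgid(0, pgid))

    # "escaped" starts its own session (and therefore its own group),
    # like a process launched through a detached daemon.
    escaped = subprocess.Popen(SLEEPER, start_new_session=True)

    # One signal addressed to the whole group.
    os.killpg(pgid, signal.SIGKILL)

    result = {
        "leader": leader.wait(),    # negative signal number if killed
        "member": member.wait(),
        "escaped_alive": escaped.poll() is None,
    }

    # Clean up the process that the group signal did not reach.
    escaped.kill()
    escaped.wait()
    return result


if __name__ == "__main__":
    print(demo_group_signal())
```

In this sketch the leader and the member are killed by the group signal,
while the process in its own session keeps running until it is killed
explicitly by PID. Whether anything like this applies to the mpich processes
discussed here is an open question, not something the sketch establishes.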


-- 
Vladimir Florinski
Assistant Research Physicist
Institute of Geophysics and Planetary Physics
University of California
Riverside, CA 92521
phone: 1-909-787-3943
fax: 1-909-787-4509

