[GE users] jobs never die on nodes with mpich

Reuti reuti at staff.uni-marburg.de
Fri Aug 13 22:56:48 BST 2004


>setgroups'. This is done in libs/uti/sge_set_uid_gid.c, which is 
>called from the shepherd.
>> How will any of them know, that a new bash was created (forked)?
>That's the idea of this additional group. As only root can set it, 
>it's not possible to evade from it. So when processes have to be 
>killed, all those that are part of the same additional group are 
>targeted. (SGE sets a different additional group per job).

>I was just looking at your list of processes and remembered one thing
>that might make a difference. SGE on Linux does not have by default
>enabled the code that sends the signal to the group of processes, see:

Thanks for pointing this out. I now understand the idea behind it (and why it's 
not working with the contributed binaries). I had already taken a brief look at 
shepherd.c and was indeed wondering about the "#if 0" in some places.

So, the qrsh_starter will write the pid of the started program to a file in 
$TMPDIR, and the process noted there will be killed (along with all processes 
belonging to the same additional group). As long as you have exactly one 
child (or only threads), it works even without additional groups - is this the 
mechanism? My idea of avoiding bash forks with "exec" depends, of course, on 
the programs you use; maybe it's an option for ssh, to avoid the recompilation. 
Or: why not include an already tuned sshd with the necessary changes in SGE?
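
To illustrate how processes carrying a given additional group could be found on Linux, here is a minimal sketch in Python. It is not SGE's code: the function names are made up for this example, and it only assumes that /proc/<pid>/status contains a "Groups:" line listing the supplementary group IDs, as it does on Linux.

```python
import os

def supplementary_groups(status_text):
    """Return the supplementary group IDs found in a /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the "Groups:" label is a whitespace-separated GID list.
            return {int(g) for g in line.split()[1:]}
    return set()

def pids_in_group(gid, proc_root="/proc"):
    """List PIDs whose supplementary groups include gid (hypothetical helper, Linux only)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                if gid in supplementary_groups(fh.read()):
                    pids.append(int(entry))
        except OSError:
            # The process vanished (or is unreadable) between listdir() and open().
            continue
    return sorted(pids)
```

Calling pids_in_group() with the job's additional group ID would return every process still tagged with it, no matter how often it forked or whether it changed its process group.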

On the other hand: I saw the problem only on the slaves, and only when a fork 
left the process group. When a job creates forks on just one exec host, 
it works. I noticed that when a script is used, all created forks have the 
same process group as the starting shell. Will SGE use this process group in 
that case anyway? In some places in shepherd.c I found a "-" prefix on the pid 
passed to the kill call. Starting the forking program from an interactive shell 
instead gives it a new process group.

I will look into it further over the next few days...

Cheers - Reuti
