[GE users] jobs never die on nodes with mpich

Reuti reuti at staff.uni-marburg.de
Thu Aug 12 11:57:33 BST 2004


All,

over the last few days Michel, who posted the original question on this list, 
and I looked into the behavior of the slave processes that qrsh starts on the 
nodes. In the end, we came to this conclusion:

Everything works fine as long as the process started by qrsh on the slave is 
a single command/program. If there is more than one command (like "cd ~; 
myprogram"), the qrsh_starter creates a helper bash to handle these commands 
and loses control of the child that bash creates (I read the hint on the list 
that the created bash gets a new process group). Running a whole bash script 
on the slaves gives the same problem.
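
To make the difference visible, here is a minimal sketch that can be run in 
any POSIX shell, without Grid Engine. It is only an illustration: "sh -c" 
plays the role of the helper shell, and the inner "sh -c" that prints its own 
PID is a hypothetical stand-in for myprogram. Note that some newer shells 
optimize the last command of a "-c" string and exec it on their own, so the 
first case may or may not show two different PIDs on your system.

```shell
#!/bin/sh
# Sketch only: the outer "sh -c" stands in for the helper shell, the inner
# "sh -c" is a hypothetical stand-in for myprogram and prints its own PID.

# Without exec: the helper shell usually forks a child for the program,
# so the two printed PIDs differ (unless the shell optimizes the fork away).
plain=$(sh -c 'cd /; echo "$$"; sh -c "echo \$\$"')

# With exec: the program replaces the helper shell and keeps its PID,
# so no shell is left in the middle.
with_exec=$(sh -c 'cd /; echo "$$"; exec sh -c "echo \$\$"')

# First line is the helper shell's PID, second line the program's PID.
helper_pid=$(printf '%s\n' "$with_exec" | sed -n 1p)
program_pid=$(printf '%s\n' "$with_exec" | sed -n 2p)

echo "without exec (helper PID, then program PID):"
echo "$plain"
echo "with exec: helper=$helper_pid program=$program_pid"
```

With "exec" the two PIDs are always the same, which is the situation in which 
a signal sent to the started process reaches the program directly.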

This is the case with some mpirun scripts, e.g. mpich.ch_gm.pl from Myrinet 
("cd ~; env blabla=7845 myprogram"). It seems to work if the last command in 
each call is prefixed with "exec": "cd ~; exec env blabla=7845 myprogram". 
Then there is no bash-in-the-middle left, and qdel works fine. It's easy to 
put the "exec" in front of the calls in the Myrinet Perl script where the 
command line is defined.
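
The rewrite of the command line can be sketched as a small shell function. 
This is not the code from mpich.ch_gm.pl (there the change is made by hand in 
the Perl source); add_exec is a hypothetical helper, and it naively assumes 
that the commands are separated by ";" and that no ";" appears inside quotes.

```shell
#!/bin/sh
# Hypothetical helper: put "exec" in front of the last command of a
# ";"-separated command line. Naive: does not handle ";" inside quotes.
add_exec() {
    ae_cmd=$1
    case $ae_cmd in
        *\;*)
            ae_head=${ae_cmd%;*}     # everything before the last ";"
            ae_tail=${ae_cmd##*;}    # the last command
            ae_tail=${ae_tail# }     # drop one leading space, if any
            printf '%s; exec %s\n' "$ae_head" "$ae_tail"
            ;;
        *)
            # A single command already works fine, so leave it unchanged.
            printf '%s\n' "$ae_cmd"
            ;;
    esac
}

add_exec 'cd ~; env blabla=7845 myprogram'
# prints: cd ~; exec env blabla=7845 myprogram
```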

I don't have ssh running on the cluster, but I read the questions on the list 
about these problems and about recompiling everything to get a process group 
attached to the ssh process on the slaves. Maybe putting an "exec" in the 
right place will solve the problem in those cases as well.

Cheers - Reuti
