[GE users] SGE 6.2u1 qrsh breaks tight pvm integration

schnepper klaus.schnepper at dlr.de
Wed Feb 11 08:43:07 GMT 2009


The new builtin interactive job support in qrsh breaks tight pvm integration 
in grid engine 6.2u1.

Symptom: Slave pvm daemons get started with "qrsh -V -inherit" as expected and 
talk back to the primary pvm daemon through the qrsh connection (stdout and 
stderr). Then the slave pvm daemons hang, and eventually the primary pvm 
daemon gives up on them; the slaves, however, are still running in a 
waiting state below the shepherd. This does not happen when the pvm slaves 
are started using ssh or rsh instead of qrsh.

Reason for the behaviour:
Slave pvm daemons wait for their stdin to be closed (blocking read on file 
descriptor 0) before finishing their startup sequence. The primary pvm daemon 
starts the slave daemons through rsh or qrsh connected to pipes, and closes 
its file descriptors to this qrsh once the slave daemon has reported via the 
returning (stdout, stderr) pipes that it has started.
This closing of the file descriptors (in particular stdin) is seen by qrsh but 
not passed on to the shepherd on the slave side, so the shepherd never 
closes the slave daemon's stdin.
This differs from the behaviour of rsh and ssh, which both close 
the slave's (client's) stdin.
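
The blocking read described above can be reproduced without pvm or grid 
engine. The following is only a minimal sketch under that assumption: the 
function name demo_stdin_close_wakes_reader is made up for illustration, the 
child stands in for the slave pvmd and the parent for whatever starts it. The 
child returns from read() only when the last write end of its stdin pipe is 
closed.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration only, not PVM or Grid Engine code.
 * The child blocks in read() on fd 0, as the slave pvmd does; the parent
 * plays the starter and releases it by closing the write end of the pipe.
 * Returns 0 if the child saw end-of-file on stdin, -1 otherwise.
 */
int demo_stdin_close_wakes_reader(void)
{
    int to_child[2];
    if (pipe(to_child) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    if (pid == 0) {
        char c;
        ssize_t n;

        /* Drop the inherited write end; otherwise the child itself would
         * keep the pipe open and could never see end-of-file. */
        close(to_child[1]);
        if (to_child[0] != 0) {
            if (dup2(to_child[0], 0) == -1)
                _exit(2);
            close(to_child[0]);
        }
        n = read(0, &c, 1);      /* blocks until data arrives or EOF */
        _exit(n == 0 ? 0 : 1);   /* a return value of 0 means EOF */
    }

    /* Parent: keep only the write end. */
    close(to_child[0]);
    /* A real starter would wait for the slave's startup report here. */
    close(to_child[1]);          /* this close is what wakes the child */

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

If the close of to_child[1] in the parent is left out, the child stays in 
read() indefinitely, which matches the hang seen with the builtin qrsh.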

The builtin internal job start method of gridengine should be modified to 
a) detect the closing of stdin in qrsh, and
b) transmit this file descriptor state change to the shepherd and/or 
qrsh_starter, so that it closes the stdin of the command executed from 
qrsh_starter.
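
To make a) and b) concrete, here is a rough sketch of a relay loop that 
forwards end-of-file instead of swallowing it. It is not grid engine code: 
the function name relay_and_propagate_eof is hypothetical, and two local 
file descriptors stand in for the qrsh to shepherd connection, which in 
reality crosses the network and would need a message of its own.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Illustration only. Copies everything from in_fd to out_fd. When in_fd
 * reports end-of-file, out_fd is closed as well, so the process reading
 * from the other side of out_fd sees the end-of-file too.
 * Returns the number of bytes relayed, or -1 on error.
 */
long relay_and_propagate_eof(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {
            /* a) end-of-file detected on the input side ... */
            close(out_fd);   /* b) ... and passed on to the reader */
            return total;
        }

        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The point of the sketch is the n == 0 branch: a relay that simply stops 
reading there, without closing the outgoing side, produces exactly the 
behaviour described above.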

Workaround: (there are basically two workarounds)
1) Modify the pvmd source in the file .../src/pvmd.c: in the routine 
slave_config, remove or comment out the lines:
#ifndef WIN32

#if !defined(IMA_OS2) && !defined(CYGWIN)
	if (!ms)
		(void)read(0, (char*)&i, 1);
#endif

This will remove the blocking read from the slave pvm daemon.

2) Write a pvmd wrapper program that starts the actual slave pvm daemon and 
monitors its stdin, stdout and stderr connections. In the wrapper, filter the 
slave pvm daemon's stdout for a line starting with "ddpro"; after passing 
that line on to the wrapper's own stdout, explicitly close the stdin 
connection (pipe) to the slave pvm daemon started from the wrapper. This 
wakes the slave pvm daemon, which then finishes its startup sequence.

An example for such a wrapper is attached.
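
For readers without the attachment, the core of workaround 2 can be 
sketched roughly as follows. This is not the attached sge_pvmd.c but a much 
reduced illustration: the names run_and_release_stdin and write_all and the 
out_fd parameter are made up here, stderr is simply inherited by the child, 
and the sketch assumes file descriptors 0 to 2 are already open in the 
caller.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write the whole buffer, retrying after short writes. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/*
 * Illustration only. Starts argv[0] with its stdin and stdout on pipes,
 * copies its stdout to out_fd and closes its stdin once a complete line
 * starting with `marker` has been passed on.
 * Returns the child's exit status, or -1 on error.
 */
int run_and_release_stdin(char *const argv[], const char *marker, int out_fd)
{
    int in_pipe[2], out_pipe[2];
    if (pipe(in_pipe) == -1)
        return -1;
    if (pipe(out_pipe) == -1) {
        close(in_pipe[0]); close(in_pipe[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(in_pipe[0]); close(in_pipe[1]);
        close(out_pipe[0]); close(out_pipe[1]);
        return -1;
    }
    if (pid == 0) {
        dup2(in_pipe[0], 0);           /* child's stdin  <- wrapper */
        dup2(out_pipe[1], 1);          /* child's stdout -> wrapper */
        close(in_pipe[0]); close(in_pipe[1]);
        close(out_pipe[0]); close(out_pipe[1]);
        execvp(argv[0], argv);
        _exit(127);                    /* exec failed */
    }
    close(in_pipe[0]);
    close(out_pipe[1]);

    size_t mlen = strlen(marker), matched = 0, col = 0;
    int stdin_open = 1;
    char buf[4096];
    ssize_t n;

    while ((n = read(out_pipe[0], buf, sizeof buf)) > 0) {
        /* Pass the output on first, then look for the marker line. */
        if (write_all(out_fd, buf, (size_t)n) == -1)
            break;
        for (ssize_t i = 0; stdin_open && i < n; i++) {
            if (buf[i] == '\n') {
                if (matched == mlen) { /* line started with marker */
                    close(in_pipe[1]); /* wakes the blocked child */
                    stdin_open = 0;
                }
                matched = 0;
                col = 0;
            } else {
                if (col == matched && matched < mlen && buf[i] == marker[matched])
                    matched++;
                col++;
            }
        }
    }
    if (stdin_open)
        close(in_pipe[1]);
    close(out_pipe[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A wrapper main() would call this with the real pvmd command line, the 
marker "ddpro" and file descriptor 1 as out_fd. The marker is matched across 
read() boundaries on purpose, since a line from the daemon may arrive in 
more than one piece.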


With best regards,

Klaus Schnepper


-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Klaus Schnepper                         | Email: Klaus.Schnepper at dlr.de
DLR                                     | Tel.: +49 (0)8153 28 2434
Institut fuer Robotik und Mechatronik   | Fax:  +49 (0)8153 28 1441
D 82230 Wessling                        |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103363


    [ Part 2, Text/X-CSRC (charset: iso 8859-15) (Name: "sge_pvmd.c") 370 ]
    [ lines. ]
    [ Unable to print this part. ]


