[GE users] Grid Engine problem

Harald Pollinger Harald.Pollinger at Sun.COM
Fri Aug 24 12:27:33 BST 2007


Uri Moszkowicz wrote:
> Hi,
> My company uses Sun Grid Engine to distribute regression jobs, but we
> are getting infrequent but regular errors. Once every few days the
> regression process submits a job to the grid using qrsh, but the task
> ends in a timeout with qrsh unable to read the exit code from
> "shepherd", which isn't a machine name.

"shepherd" is the "sge_shepherd" process, not a host. Please see the
sge_shepherd(8) man page.


> These errors cause false
> positive regression failures and waste debug time, so we'd like to fix
> them. Does anyone know what's going on or have any hints on how to
> debug this problem? Perhaps an explanation of how the return code is
> communicated from the compute host to the submit host would help.
> 
> <path>/rsh exited with exit code 0
> reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 4
> error: error reading returncode of remote command
> cleaning up after abnormal exit of <path>/rsh

The "qrsh" client starts the "rsh" command, and the "sge_shepherd"
starts the "rshd" daemon. "qrsh" and "sge_shepherd" are directly
connected via TCP. At job end, the "qrsh" client tries to read the exit
status of the "rshd" from the "sge_shepherd".
If the "rsh" really exits abnormally and the "rshd" doesn't notice
this, the "qrsh" waits for the "rshd" exit code, but the "rshd" hasn't
exited yet. The "sge_shepherd" therefore doesn't send its exit code
before the timeout expires.
(I suggest you draw yourself a picture of the components to follow this.)
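
To make the waiting step concrete, here is a minimal Python sketch of a
client that waits on a socket for an exit code and gives up after a
timeout. It is not Grid Engine code and does not reproduce the real
qrsh/sge_shepherd wire format: the function name read_exit_code, the
one-line text message and the socketpair standing in for the TCP
connection are all assumptions made for illustration.

```python
import socket


def read_exit_code(conn, timeout_s):
    """Wait up to timeout_s seconds for an exit code on conn.

    Hypothetical protocol: the peer sends the exit code as ASCII text.
    Returns the integer exit code, or None if nothing arrived in time
    or the peer closed the connection without sending anything.
    """
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(64)
    except socket.timeout:
        # Same situation as "timeout ... expired while waiting on socket".
        return None
    if not data:
        return None
    return int(data.decode().strip())


if __name__ == "__main__":
    # Normal case: the "shepherd" side reports the exit code.
    client, shepherd = socket.socketpair()
    shepherd.sendall(b"0\n")
    print("exit code:", read_exit_code(client, 1.0))

    # Failure case: the "shepherd" side never sends, so the client times out.
    client2, shepherd2 = socket.socketpair()
    print("exit code:", read_exit_code(client2, 0.2))
```

In the second case the sending side is still alive but silent, which
matches the situation described above: the "rshd" has not exited, so
there is no exit code to send yet.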

The question is why the "rsh" exited abnormally.

Regards,
Harald

> 
> Thanks,
> Uri


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
