[GE users] Grid Engine problem

Uri Moszkowicz uri at 4refs.com
Wed Aug 22 19:27:35 BST 2007


Hi,
My company uses Sun Grid Engine to distribute regression jobs, but we are
getting infrequent yet recurring errors. Once every few days the regression
process submits a job to the grid using qrsh, but the task ends in a
timeout, with qrsh unable to read the exit code from "shepherd", which isn't
a machine name. These errors cause false-positive regression failures and
waste debug time, so we'd like to fix them. Does anyone know what's going on,
or have any hints on how to debug this problem? Perhaps an explanation of how
the return code is communicated from the compute host to the submit host
would help.

<path>/rsh exited with exit code 0
reading exit code from shepherd ...
timeout (60 s) expired while waiting on socket fd 4
error: error reading returncode of remote command
cleaning up after abnormal exit of <path>/rsh
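
In case it helps frame the problem, here is a minimal sketch of how a regression wrapper might tell this failure apart from a genuine test failure by matching the messages above in qrsh's stderr. The function name is hypothetical, and it assumes the message wording is stable across Grid Engine versions, which I have not verified.

```python
import re

# Patterns taken from the qrsh output quoted above. Assumption: the wording
# of these messages does not change between Grid Engine versions.
_INFRA_PATTERNS = [
    re.compile(r"timeout \(\d+ s\) expired while waiting on socket fd \d+"),
    re.compile(r"error reading returncode of remote command"),
]


def is_qrsh_infrastructure_failure(stderr_text: str) -> bool:
    """Return True if stderr shows qrsh failed to read the job's exit code.

    Hypothetical helper: a True result means the job's real exit status is
    unknown, so the run should not be counted as a regression failure.
    """
    return any(p.search(stderr_text) for p in _INFRA_PATTERNS)
```

A wrapper could then flag such runs as "unknown" or resubmit them instead of reporting a failure, though that only hides the symptom and does not explain the timeout.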

Thanks,
Uri


