[GE users] Protocol error, closed connection when using qrsh for hpux 11i
nir at chipx.co.il
Mon Jun 7 10:14:58 BST 2004
When I run qrsh -l hostname=some-hpux-hostname I get the following error:
Protocol error, some-hpux-hostname closed connection
This suddenly started happening on all my HP-UX systems, which run
HP-UX 11i and Grid Engine 5.3p2.
None of the hosts reported anything in spool/hostname/messages, nor
in qmaster/messages.
I then installed a new host and got the same error, but this time it did
record the following.
In the qmaster messages:
Mon Jun 7 11:24:00 2004|qmaster|endor|I|starting up 5.3p2 (sgeee)
Mon Jun 7 11:24:34 2004|qmaster|endor|W|job 26099.1 failed on host
jabba assumedly after job because: job 26099.1 died through signal KILL
In the host's messages:
Mon Jun 7 11:18:13 2004|execd|jabba|I|starting up 5.3p2 (sgeee)
Mon Jun 7 11:19:33 2004|execd|jabba|W|can't receive request: WRITE
Mon Jun 7 11:40:57 2004|execd|jabba|W|can't receive request: READ ERROR
Mon Jun 7 11:42:11 2004|execd|jabba|W|can't receive request: READ ERROR
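For anyone sifting through the same logs, here is a minimal Python sketch for pulling the non-informational entries out of a messages file. It assumes the pipe-delimited layout seen in the excerpts above (timestamp|daemon|host|level|message) and treats any level other than I as worth a look. The names LogEntry, parse_line and non_info_entries are made up for illustration and are not part of Grid Engine.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages line into its five fields, or return None."""
    # Limit the split to 4 so any '|' inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # wrapped or malformed line
    return LogEntry(*(p.strip() for p in parts))


def non_info_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Keep entries whose level is not 'I' (assumed to mean informational)."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level != "I":
            entries.append(entry)
    return entries


# Sample taken from the execd excerpt above.
sample = [
    "Mon Jun 7 11:18:13 2004|execd|jabba|I|starting up 5.3p2 (sgeee)",
    "Mon Jun 7 11:19:33 2004|execd|jabba|W|can't receive request: WRITE",
    "Mon Jun 7 11:40:57 2004|execd|jabba|W|can't receive request: READ ERROR",
    "Mon Jun 7 11:42:11 2004|execd|jabba|W|can't receive request: READ ERROR",
]

for entry in non_info_entries(sample):
    print(entry.timestamp, entry.host, entry.level, entry.message)
```

Run against both the qmaster and execd messages files, this makes it easier to line up the warnings by time and host.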
When I use qrsh -l hostname=some-solaris-hostname, it works just fine, and
the same goes for all Linux hosts.
qsub -l hostname=some-hpux-hostname -cwd ./worker.sh works on all my
HP-UX hosts, so I guess the problem is specific to qrsh.
rsh some-hpux-hostname ~/worker.sh works too.
I searched the users' list archive and found similar errors related to
service ports and host resolution, but nothing that applied to my case.
Does anyone have any idea why this could start happening suddenly, or
better yet, pointers for resolving it?