[GE users] Protocol error, closed connection when using qrsh for hpux 11i

Nir Dvir nir at chipx.co.il
Mon Jun 7 10:14:58 BST 2004


When I run qrsh -l hostname=some-hpux-hostname, I immediately get this
error:

Protocol error, some-hpux-hostname closed connection

This suddenly started happening on all my HP-UX systems, which run
HP-UX 11i and Grid Engine 5.3p2 (sgeee).

None of the hosts logged anything in spool/hostname/messages, and there
was nothing in qmaster/messages either.

I then installed a new host and got the same error, but this time the
following was logged:

In the qmaster messages:

Mon Jun  7 11:24:00 2004|qmaster|endor|I|starting up 5.3p2 (sgeee)
Mon Jun  7 11:24:34 2004|qmaster|endor|W|job 26099.1 failed on host jabba assumedly after job because: job 26099.1 died through signal KILL (9)

In the host's messages:

Mon Jun  7 11:18:13 2004|execd|jabba|I|starting up 5.3p2 (sgeee)
Mon Jun  7 11:19:33 2004|execd|jabba|W|can't receive request: WRITE ERROR
Mon Jun  7 11:40:57 2004|execd|jabba|W|can't receive request: READ ERROR
Mon Jun  7 11:42:11 2004|execd|jabba|W|can't receive request: READ ERROR
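
For anyone comparing messages files across several hosts, here is a
minimal sketch of pulling out the non-informational entries. It assumes
only the five pipe-separated fields visible in the lines above
(timestamp, daemon, host, level, text) and that each entry sits on a
single line. The names LogEntry, parse_line and non_info_entries are
made up for illustration and are not part of Grid Engine; treating every
level other than "I" as interesting is likewise an assumption.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One messages-file entry, split into the five fields seen above."""
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|daemon|host|level|text' line.

    Returns None for lines that do not have all five fields
    (blank lines, wrapped fragments, and so on).
    """
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


def non_info_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Return entries whose level is anything other than 'I'.

    Assumption: 'I' marks informational entries, as in the
    'starting up' lines quoted above.
    """
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level != "I"]
```

Passing an open messages file (for example the execd host's
spool/hostname/messages) to non_info_entries would return just the
warning lines, which makes it easier to line up timestamps between the
qmaster and execd sides.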

When I use qrsh -l hostname=some-solaris-hostname, it works just fine,
and the same goes for all my Linux hosts.

qsub -l hostname=some-hpux-hostname -cwd ./worker.sh works on all my
HP-UX hosts, so I guess the problem is specific to qrsh.

rsh some-hpux-hostname ~/worker.sh works too.

I searched the users' list archive and found similar errors that were
related to service ports and host resolution, but nothing there
explained my case.

Does anyone have any idea why this could suddenly start happening or,
better yet, any pointers for resolving it?

Thanks,

Nir