[GE users] RE: [GE users] Protocol error, closed connection when using qrsh for hpux 11i

Nir Dvir nir at chipx.co.il
Tue Jun 8 03:32:41 BST 2004



No, never experienced it before.

The only recent change made to these hosts is that I installed the latest X11 libraries, but it worked for a few days after that. Furthermore, there is one host that I did not install this patch on, and it gives the same error, so I ruled this out.


I added one of my HPs as a submit host and tried to qrsh from it, and got this error: "rcmd: Lost connection"
qrsh to a Sun host works just fine from this host too.

I am mapping the service via a NIS map:
ypcat services | grep sge
sge_commd       536/tcp             # communication port for Grid Engine

nsswitch.conf  = services:     nis [NOTFOUND=return] files

but I think it is not related to this since qsub works just fine.
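
To see what the resolver on a given host actually returns for the service (as opposed to what the NIS map contains), a quick check along these lines might help. This is only a sketch, assuming Python is available on the host; `lookup_port` is a made-up helper name, and the service name and protocol are taken from the map entry above.

```python
import socket


def lookup_port(service="sge_commd", proto="tcp"):
    """Return the port the local resolver gives for service/proto, or None.

    The lookup goes through the system resolver, so it should reflect the
    services order configured in nsswitch.conf on the host where it runs.
    """
    try:
        return socket.getservbyname(service, proto)
    except OSError:
        # Raised when the service/protocol pair cannot be resolved.
        return None


if __name__ == "__main__":
    port = lookup_port()
    if port is None:
        print("sge_commd/tcp is not resolvable on this host")
    else:
        print("sge_commd/tcp resolves to port", port)
```

Running it on both a working Solaris host and a failing HP-UX host would show whether the two resolve the service to the same port.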

My guess is that it is related to the interactive mode, as qrsh and qsh stopped working.

What does this mean:
Tue Jun  8 04:17:02 2004|execd|jabba|W|can't receive request: READ ERROR
Tue Jun  8 05:26:16 2004|execd|jabba|W|can't receive request: READ ERROR
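
This does not explain the warning, but to see how often it occurs and on which hosts, the messages files could be tallied with something like the sketch below. It assumes every line has the five `|`-separated fields seen above; the field names (timestamp, daemon, host, level, message) are my reading of those lines, not taken from the Grid Engine documentation, and `parse_line` and `count_messages` are hypothetical helpers.

```python
from collections import Counter


def parse_line(line):
    """Split one messages line into its five fields, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected timestamp|daemon|host|level|message shape
    timestamp, daemon, host, level, message = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def count_messages(lines, level="W"):
    """Count (host, message) pairs among lines with the given level."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] == level:
            counts[(record["host"], record["message"])] += 1
    return counts


# The two lines quoted above, used as sample input.
sample = [
    "Tue Jun  8 04:17:02 2004|execd|jabba|W|can't receive request: READ ERROR",
    "Tue Jun  8 05:26:16 2004|execd|jabba|W|can't receive request: READ ERROR",
]

if __name__ == "__main__":
    for (host, message), n in count_messages(sample).items():
        print(host, n, message)
```

For a real file, the `sample` list can be replaced with an open file object, for example the execd messages file under the host's spool directory.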


Nir

-----Original Message-----
From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
Sent: Tue 08 June 2004 4:25
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Protocol error, closed connection when using qrsh for hpux 11i

Did it happen before?

 -Ron

--- Nir Dvir <nir at chipx.co.il> wrote:
> When trying to use qrsh -l
> hostname=some-hpux-hostname I receive an
> immediate error 
> 
>  
> 
> Protocol error, some-hpux-hostname closed connection
> 
>  
> 
> This suddenly happens on all my hpux systems running
> hp-ux 11i and 5.3p2
> (sgeee)
> 
>  
> 
> None of the hosts reported anything in the
> spool/hostname/messages, nor
> in the qmaster/messages.
> 
>  
> 
> I installed a new host and got the same error, but
> it did record the
> following:
> 
>  
> 
> In the qmaster messages:
> 
> Mon Jun  7 11:24:00 2004|qmaster|endor|I|starting up
> 5.3p2 (sgeee)
> 
> Mon Jun  7 11:24:34 2004|qmaster|endor|W|job 26099.1
> failed on host
> jabba assumedly after job because: job 26099.1 died
> through signal KILL
> (9)
> 
>  
> 
> In the host's messages:
> 
> Mon Jun  7 11:18:13 2004|execd|jabba|I|starting up
> 5.3p2 (sgeee)
> 
> Mon Jun  7 11:19:33 2004|execd|jabba|W|can't receive
> request: WRITE
> ERROR
> 
> Mon Jun  7 11:40:57 2004|execd|jabba|W|can't receive
> request: READ ERROR
> 
> Mon Jun  7 11:42:11 2004|execd|jabba|W|can't receive
> request: READ ERROR
> 
>  
> 
> When I use qrsh -l hostname=some-solaris-hostname ,
> it works just fine
> the same works for all linux hosts.
> 
>  
> 
> qsub -l hostname=some-hpux-hostname -cwd ./worker.sh
> works on all my
> hpux hosts, so I guess it has something to do with
> qrsh.
> 
>  
> 
> rsh  some-hpux-hostname ~/worker.sh works too. 
> 
>  
> 
> I tried looking at the users' archive and found
> errors that related to
> service ports, host resolution, but found nothing
> there.
> 
>  
> 
> Anyone has any idea why this could happen suddenly,
> or better yet,
> pointers for resolving this?
> 
>  
> 
> Thanks
> 
>  
> 
> Nir



	
		

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net







