[GE users] Jumbo Frame and Gridengine

ron ron_chen_123 at yahoo.com
Thu Jul 9 06:23:05 BST 2009


The latest version of SGE is 6.2 update 3, but I don't think it has any changes related to commlib.

Can you compile SGE with CL_DEFINE_DATA_BUFFER_SIZE set to something larger, like 16K? Then set up a 2-node test cluster and see whether it works.
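
For what it's worth, a read buffer smaller than a frame should not by itself lose data on a TCP socket: TCP hands the application a byte stream, so a short buffer just means more read calls. Below is a minimal Python sketch of that idea. It is not SGE or commlib code, and READ_BUFFER_SIZE, PAYLOAD_SIZE, read_exactly and demo are names made up for illustration. It sends 9000 bytes over a loopback TCP connection and reads them back through a 4K buffer:

```python
import socket
import threading

READ_BUFFER_SIZE = 4 * 1024  # same 4K value quoted in this thread (illustrative)
PAYLOAD_SIZE = 9000          # roughly one jumbo frame's worth of data (assumption)


def read_exactly(sock, total, bufsize):
    """Read `total` bytes from a stream socket, at most `bufsize` per recv()."""
    chunks = []
    received = 0
    while received < total:
        chunk = sock.recv(min(bufsize, total - received))
        if not chunk:
            raise ConnectionError("peer closed before all data arrived")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks), len(chunks)


def demo():
    """Send PAYLOAD_SIZE bytes over loopback TCP and read them in 4K pieces."""
    payload = bytes(i % 251 for i in range(PAYLOAD_SIZE))

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def send():
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(payload)

    sender = threading.Thread(target=send)
    sender.start()
    conn, _ = server.accept()
    with conn:
        data, reads = read_exactly(conn, PAYLOAD_SIZE, READ_BUFFER_SIZE)
    sender.join()
    server.close()
    return data == payload, reads


if __name__ == "__main__":
    intact, reads = demo()
    print("payload intact:", intact, "| recv() calls:", reads)
```

This only illustrates stream semantics on loopback. It does not exercise the NIC MTU, the switch, or commlib itself, so the 2-node test is still needed.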

 -Ron



--- On Thu, 7/9/09, hargitai <joseph.hargitai at nyu.edu> wrote:
> From the qmaster on the master node:
> 
> 07/08/2009 13:56:44|qmaster|cardiac|E|commlib error: endpoint is not unique error (endpoint "cardiac.es.its.nyu.edu/qmaster/1" is already connected)
> 07/08/2009 13:56:46|qmaster|cardiac|E|commlib error: got read error (closing "cardiac.es.its.nyu.edu/qstat/8395")
> 
> 
> Is the new SGE version released? 
> 
> best,
> 
> j
> 
> ----- Original Message -----
> From: rayson <rayrayson at gmail.com>
> Date: Wednesday, July 8, 2009 2:13 pm
> Subject: Re: [GE users] Jumbo Frame and Gridengine
> 
> > Anything from the qmaster's side??
> > 
> > Rayson
> > 
> > 
> > 
> > On 7/8/09, hargitai <joseph.hargitai at nyu.edu> wrote:
> > > When you enable jumbo frames, the node drops out of SGE in about a
> > > minute. qstat -f shows the node as au, and restarting the SGE client
> > > on the node itself does not work.
> > >
> > >
> > > When you unset jumbo frames, the node becomes available right away
> > > without communication.
> > >
> > > This is the message in the SGE messages file on the node while jumbo
> > > frames are enabled:
> > >
> > > (No route to host)
> > > 07/08/2009 11:57:23|execd|compute-8-8|E|commlib error: got read error (closing "cardiac.es.its.nyu.edu/qmaster/1")
> > > 07/08/2009 11:57:23|execd|compute-8-8|W|can't register at "qmaster": unable to contact qmaster using port 536 on host "cardiac.es.its.nyu.edu"
> > > 07/08/2009 12:04:44|execd|compute-8-8|W|can't register at "qmaster": unable to send message to qmaster using port 536 on host "cardiac.es.its.nyu.edu": got message ackno
> > > 07/08/2009 12:48:36|execd|compute-8-8|I|controlled shutdown 6.1u4
> > > 07/08/2009 12:54:08|execd|compute-8-8|I|starting up GE 6.1u4 (lx26-amd64)
> > > 07/08/2009 13:53:58|execd|compute-8-8|E|commlib error: got read error (closing "cardiac.es.its.nyu.edu/qmaster/1")
> > > 07/08/2009 13:56:43|execd|compute-8-8|E|commlib error: endpoint is not unique error (endpoint "cardiac.es.its.nyu.edu/qmaster/1" is already connected)
> > > 07/08/2009 13:57:43|execd|compute-8-8|E|acknowledge for unknown job 8188.1/master
> > >
> > > j
> > >
> > > ----- Original Message -----
> > > From: rayson <rayrayson at gmail.com>
> > > Date: Wednesday, July 8, 2009 1:58 pm
> > > Subject: Re: [GE users] Jumbo Frame and Gridengine
> > >
> > > > Did you get anything in the log files or "messages"??
> > > >
> > > > As a test, can you enable jumbo frames and run some client commands,
> > > > like qhost and qstat and see if you get any response from qmaster??
> > > >
> > > > Looking at the commlib code, we have CL_DEFINE_DATA_BUFFER_SIZE
> > > > defined to 1024 * 4. However, 4K is smaller than the size of a jumbo
> > > > frame, which can be as big as 9KB. Note that 4K is used as the size of
> > > > the read buffer and the write buffer (libs/comm/cl_communication.c).
> > > >
> > > > My socket programming is a bit rusty, and I forgot how ethernet frames
> > > > get assembled into TCP segments and presented to applications... I may
> > > > need to do a bit of googling to see how it affects user-applications.
> > > >
> > > > Rayson
> > > >
> > > >
> > > >
> > > >
> > > > On 7/8/09, hargitai <joseph.hargitai at nyu.edu> wrote:
> > > > > Hey all:
> > > > >
> > > > > We enabled jumbo frames on our cluster and SGE services stopped
> > > > > communicating on eth0 - while ssh was/is working.
> > > > >
> > > > > Once jumbo frames were unset - SGE picked up and worked again.
> > > > >
> > > > > Is there a way to have SGE collaborate with jumbo frame settings?
> > > > >
> > > > > thanks,
> > > > > joseph
> > > > >
> > > > >
> > > > > ------------------------------------------------------
> > > > > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206185
> > > > >
> > > > > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> > > > >

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206249

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


