[GE users] Help with error messages (better formatted)

Viktor Oudovenko udo at physics.rutgers.edu
Fri May 20 02:39:19 BST 2005


Hi Mac,

Thank you very much for the prompt answer.

Yes, sge_execd is running on all nodes.
I should mention that I upgraded the nodes to a new Linux version and
forgot to change the /etc/services files.
I have now changed them, but it did not help.
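
For anyone checking the same thing on their own nodes, here is a rough sketch of how the /etc/services entries could be verified. It assumes the entries use the conventional service names sge_qmaster and sge_execd; the function name check_sge_services is made up for illustration and is not part of Grid Engine.

```shell
#!/bin/sh
# Report which Grid Engine service entries are present in a services file.
# Assumption: the entries are named sge_qmaster and sge_execd.
# check_sge_services is a hypothetical helper, not an SGE command.
check_sge_services() {
    file=${1:-/etc/services}
    status=0
    for name in sge_qmaster sge_execd; do
        # Match the service name at the start of a line, followed by whitespace.
        if grep -Eq "^${name}[[:space:]]" "$file"; then
            echo "$name: found"
        else
            echo "$name: MISSING"
            status=1
        fi
    done
    # Non-zero exit status if at least one entry is missing.
    return $status
}
```

Running `check_sge_services /etc/services` on each node should print one line per service name. This only confirms that the entries exist, not that the port numbers agree between the nodes and the master.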

See here the info:

=============================================================================
On one of my compute nodes:


sub04n01:/tmp # qping -info rupc-cs04b 536 qmaster 1
05/19/2005 21:22:29:
SIRM version:             0.1
SIRM message id:          1
start time:               05/19/2005 18:40:09 (1116542409)
run time [s]:             9756
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 163
status:                   0
info:                     EDT: R (0.17) | TET: R (6.71) | MT: R (0.17) | SIGT: R (9755.92) | ok


Does this look OK?


----------------------------------------------------------------------------

Then, on the master node (rupc-cs04b), I ran:
/etc/init.d/sgemaster softstop

and then:
/etc/init.d/sgemaster



----------------------------------------------------------------------------

rupc-cs04b:~ # ps -axuf
sgeadmin 19318  0.3  0.4 71580 14508 ?       S    21:25   0:00 /opt/SGE/bin/lx24-x86/sge_qmaster
root     19320  0.0  0.4 71580 14508 ?       S    21:25   0:00  \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19321  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19322  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19323  0.7  0.4 71580 14508 ?       S    21:25   0:01      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19324  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19326  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19327  0.4  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19328  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19329  2.7  0.4 71580 14508 ?       S    21:25   0:03      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19332  0.8  0.2  8196 6892 ?        S    21:25   0:01 /opt/SGE/bin/lx24-x86/sge_schedd

rupc-cs04b:~ # 

I think it looks fine, and qping works again.
But I still get the same messages!


The messages file in rupc-cs04b:/opt/SGE/default/spool/qmaster/schedd contains:

05/19/2005 18:40:07|schedd|rupc-cs04b|I|starting up 6.0u3
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got read error (closing connection)
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got pipe error (closing connection)
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: can't connect to service (socket error errno=111)
05/19/2005 18:40:10|schedd|rupc-cs04b|W|qmaster alive timeout expired
05/19/2005 20:59:50|schedd|rupc-cs04b|I|starting up 6.0u3
05/19/2005 21:24:16|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
05/19/2005 21:25:01|schedd|rupc-cs04b|C|Please set the environment variable SGE_ROOT.
05/19/2005 21:25:56|schedd|rupc-cs04b|I|starting up 6.0u3

(The last restart was at 21:25.)
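
To pick out only the problem entries from a messages file like this one, something along the following lines might help. It is a sketch that assumes the fields are separated by "|" and that the fourth field is the severity letter (I, W, E, C as seen above); the function name sge_errors is hypothetical.

```shell
#!/bin/sh
# Print only the E and C entries of a Grid Engine messages file.
# Assumed field layout: date time|daemon|host|level|message
# sge_errors is a made-up helper name, not an SGE command.
sge_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$1"
}
```

For example, `sge_errors /opt/SGE/default/spool/qmaster/schedd/messages` would list the commlib errors and the SGE_ROOT line from the log above while skipping the I and W entries.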

=============================================================================

With kind regards,
v

> Hi Viktor,
> 
> Are the sge_execd's running on your compute nodes?  Are there 
> any messages in their messages files?  What happens when you 
> stop/start one of the sge_execd's? You could try a qping 
> command from one of your compute nodes back to the qmaster to 
> see if the port assignments are correct in your environment.  
> It looks like the scheduler did not start at all this time 
> when you restarted the qmaster. Are there any error messages in its 
> messages file? 
> 
> mac mccalla
>  
> 
> -----Original Message-----
> From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
> Sent: 19 May 2005 17:59
> To: users at gridengine.sunsource.net
> Subject: [GE users] Help with error messages (better formatted)
> 
> 
> 
> Hi, I just retyped my previous e-mail with better formatting:
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> ........................................................................
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc01.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc02.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> ...............................................
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Thank you for your help,
> v
> 
> > -----Original Message-----
> > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > Sent: Thursday, May 19, 2005 18:52
> > To: users at gridengine.sunsource.net
> > Subject: [GE users] Help with error messages
> > 
> > 
> > Hello to everybody,
> > 
> > Does anybody know what these errors mean and how to get rid of them?
> > File: /opt/SGE/default/spool/qmaster/messages
> > 
> > I restarted sgemaster:
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 
> > ........................................................................
> > MANY MESSAGES LIKE THOSE ONES (probably as many as number of hosts)
> > ........................................................................
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|removing reference to no longer existing job 19881 of user "udo"
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n102 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n103 to send conf notification
> > .....................................................
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> > 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> > 
> > Thank you very much for your help, comments etc.
> > Regards,
> > Viktor
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > 

