[GE users] Help with error messages (better formatted)

McCalla, Mac macmccalla at hess.com
Fri May 20 03:59:49 BST 2005


I think the qping output is OK. I do not understand the scheduler
message about setting the SGE_ROOT environment variable if the
scheduler was started by the sgemaster script. If it was started by
typing the program name on the command line, then it certainly depends
on what was defined in the shell environment the command was issued
from. The messages from the 18:40 restart make me suspect that the
qmaster/execd port assignments were not correct at that time. You can
run the $SGE_ROOT/..../util/dl.sh script, set the debugging level, and
start the qmaster again to perhaps see more information. I have used
this technique when my own "classic spooling" system became corrupted,
to see which file was actually being read at the time of the error.
IIRC there is a section on debugging in the N1GE administrator's
manual that discusses this.
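
One way to check the port assignments is to compare the Grid Engine
entries in /etc/services on the master and on a compute node. The
following is only a rough Python sketch, not anything shipped with Grid
Engine. It assumes the daemons are registered under the service names
sge_qmaster and sge_execd (the thread does not name them). In the sample
text, port 536 is the qmaster port used in the qping command quoted
further down, while 537 for the execd is made up for illustration.

```python
import re


def service_ports(services_text, names=("sge_qmaster", "sge_execd")):
    """Return {service_name: port} for TCP entries in /etc/services-style text.

    The default service names are an assumption, not taken from the thread.
    """
    ports = {}
    for line in services_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        match = re.fullmatch(r"(\d+)/tcp", fields[1])
        if match and fields[0] in names:
            ports[fields[0]] = int(match.group(1))
    return ports


# Hypothetical file contents; only the number 536 comes from the thread.
SAMPLE = """\
sge_qmaster     536/tcp   # Grid Engine qmaster
sge_execd       537/tcp   # Grid Engine execd (assumed port)
ssh             22/tcp
"""

if __name__ == "__main__":
    print(service_ports(SAMPLE))
```

Calling service_ports(open("/etc/services").read()) on the master and on
a node, and comparing the two results, should show whether an entry is
missing or different on one side.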

Sorry to leave you now, but I must get up early in the am.....mac


-----Original Message-----
From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
Sent: 19 May 2005 20:39
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Help with error messages (better formatted)


Hi Mac,

Thank you very much for the prompt answer.

Yes, I run sge_execd on all nodes.
I should mention that I updated the nodes to a new Linux version and
forgot to change the /etc/services files. I have changed them now, but
it did not help.

Here is the info:

========================================================================
On one of my compute nodes:


sub04n01:/tmp # qping -info rupc-cs04b 536  qmaster 1

05/19/2005 21:22:29:
SIRM version:             0.1
SIRM message id:          1
start time:               05/19/2005 18:40:09 (1116542409)
run time [s]:             9756
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 163
status:                   0
info:                     EDT: R (0.17) | TET: R (6.71) | MT: R (0.17) | SIGT: R (9755.92) | ok


Is it OK?


------------------------------------------------------------------------

Then on the master node (rupc-cs04b) I did:
/etc/init.d/sgemaster softstop

And then:
/etc/init.d/sgemaster

------------------------------------------------------------------------

rupc-cs04b:~ # ps -axuf

sgeadmin 19318  0.3  0.4 71580 14508 ?       S    21:25   0:00 /opt/SGE/bin/lx24-x86/sge_qmaster
root     19320  0.0  0.4 71580 14508 ?       S    21:25   0:00  \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19321  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19322  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19323  0.7  0.4 71580 14508 ?       S    21:25   0:01      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
root     19324  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19326  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19327  0.4  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19328  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19329  2.7  0.4 71580 14508 ?       S    21:25   0:03      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
sgeadmin 19332  0.8  0.2  8196 6892 ?        S    21:25   0:01 /opt/SGE/bin/lx24-x86/sge_schedd

rupc-cs04b:~ # 

I think it looks fine. qping works again.
But I still get the same messages!


In rupc-cs04b:/opt/SGE/default/spool/qmaster/schedd, the messages file:

05/19/2005 18:40:07|schedd|rupc-cs04b|I|starting up 6.0u3
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got read error (closing connection)
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got pipe error (closing connection)
05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: can't connect to service (socket error errno=111)
05/19/2005 18:40:10|schedd|rupc-cs04b|W|qmaster alive timeout expired
05/19/2005 20:59:50|schedd|rupc-cs04b|I|starting up 6.0u3
05/19/2005 21:24:16|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
05/19/2005 21:25:01|schedd|rupc-cs04b|C|Please set the environment variable SGE_ROOT.
05/19/2005 21:25:56|schedd|rupc-cs04b|I|starting up 6.0u3

(this is the last restart) : 21:25 

========================================================================
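
The messages file lines quoted above appear to follow a pipe-separated
layout: timestamp|component|host|level|text. A minimal Python sketch for
pulling out only the error-type entries could look like the following.
The field names, and the reading of the level letters (I, W, E, C) as
info, warning, error and critical, are assumptions inferred from the
lines in this thread, not taken from the Grid Engine documentation.

```python
from collections import namedtuple

# Field names are illustrative; they are inferred from the quoted log lines.
Entry = namedtuple("Entry", "timestamp component host level text")


def parse_messages(lines):
    """Parse 'timestamp|component|host|level|text' lines into Entry tuples."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split("|", 4)
        if len(parts) == 5:
            entries.append(Entry(*parts))
        elif entries:
            # Anything else is treated as a wrapped continuation line.
            last = entries[-1]
            entries[-1] = last._replace(text=last.text + " " + line)
    return entries


def problems(entries, levels=("E", "C")):
    """Keep only entries whose level letter is in `levels`."""
    return [e for e in entries if e.level in levels]


SAMPLE = [
    "05/19/2005 18:40:07|schedd|rupc-cs04b|I|starting up 6.0u3",
    "05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: can't connect to",
    "service (socket error errno=111)",
    "05/19/2005 18:40:10|schedd|rupc-cs04b|W|qmaster alive timeout expired",
    "05/19/2005 21:25:01|schedd|rupc-cs04b|C|Please set the environment variable SGE_ROOT.",
]

if __name__ == "__main__":
    for entry in problems(parse_messages(SAMPLE)):
        print(entry.timestamp, entry.level, entry.text)
```

Splitting with a maximum of four separators keeps any "|" characters
inside the message text intact, and the continuation branch rejoins
lines that a mail client has wrapped.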

With kind regards,
v

> Hi Viktor,
> 
> Are the sge_execd's running on your compute nodes?  Are there 
> any messages in their messages files?  What happens when you 
> stop/start one of the sge_execd's? You could try a qping 
> command from one of your compute nodes back to the qmaster to 
> see if the port assignments are correct in your environment.  
> It looks like the scheduler did not start at all this time 
> when you restarted the qmaster. Are there any error messages 
> in its messages file? 
> 
> mac mccalla
>  
> 
> -----Original Message-----
> From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
> Sent: 19 May 2005 17:59
> To: users at gridengine.sunsource.net
> Subject: [GE users] Help with error messages (better formatted)
> 
> 
> 
> Hi, I just retyped my previous e-mail with better formatting:
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> ......................................................................
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc01.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc02.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> ...............................................
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Thank you for your help,
> v
> 
> > -----Original Message-----
> > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > Sent: Thursday, May 19, 2005 18:52
> > To: users at gridengine.sunsource.net
> > Subject: [GE users] Help with error messages
> > 
> > 
> > Hello to everybody,
> > 
> > Does anybody know what these errors mean and how to get rid of them?
> > File: /opt/SGE/default/spool/qmaster/messages
> > 
> > I restarted sgemaster:
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 
> > ......................................................................
> > MANY MESSAGES LIKE THOSE ONES (probably as many as number of hosts)
> > ......................................................................
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|removing reference to no longer existing job 19881 of user "udo"
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n102 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n103 to send conf notification
> > .....................................................
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> > 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> > 
> > Thank you very much for your help, comments etc.
> > Regards,
> > Viktor
> > 
> > 
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > 



