[GE users] Help with error messages (better formatted)

Viktor Oudovenko udo at physics.rutgers.edu
Fri May 20 04:18:22 BST 2005


Thank you very much, Mac!

Please forget the messages about SGE_ROOT: I use cron to detect whether the
scheduler is running and, if not, to restart it, and I forgot to source
settings.sh there. Now it is OK. This is my workaround for the problem for
the time being.
Thanks for the advice about dl.sh!
It sounds very helpful.
Best regards,
v
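
For anyone who finds this thread later, here is a minimal sketch of the kind
of cron check described above. It is not my exact script. The settings.sh
location is an assumption based on the spool path in this thread
(/opt/SGE/default/...), the sge_schedd path is the one shown in the ps output
below, and the function names and the --run switch are made up for the
example. The point is only that cron runs jobs with an almost empty
environment, so the settings file has to be sourced before a daemon is
started.

```shell
#!/bin/sh
# Sketch of a cron watchdog for sge_schedd. All paths are assumptions
# based on this thread (SGE_ROOT=/opt/SGE, cell "default", arch lx24-x86).

SGE_SETTINGS="${SGE_SETTINGS:-/opt/SGE/default/common/settings.sh}"
SCHEDD_BIN="${SCHEDD_BIN:-/opt/SGE/bin/lx24-x86/sge_schedd}"

# Source the settings file if it is readable, then report whether
# SGE_ROOT is set. cron does not inherit the login shell environment.
load_sge_env() {
    [ -r "$1" ] && . "$1"
    [ -n "${SGE_ROOT:-}" ]
}

# Succeed if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# watchdog SETTINGS_FILE PROCESS_NAME START_COMMAND
# Does nothing if the process is up; otherwise loads the environment
# and runs the start command. Returns non-zero if SGE_ROOT is not set.
watchdog() {
    if is_running "$2"; then
        echo "$2 is running"
        return 0
    fi
    if ! load_sge_env "$1"; then
        echo "SGE_ROOT not set; cannot source $1" >&2
        return 1
    fi
    echo "$2 not running; starting $3"
    "$3"
}

# Only act when called with --run (hypothetical switch), so the file
# can also be sourced without side effects.
if [ "${1:-}" = "--run" ]; then
    watchdog "$SGE_SETTINGS" sge_schedd "$SCHEDD_BIN"
fi
```

A crontab entry such as `*/5 * * * * /usr/local/sbin/sge_watchdog.sh --run`
(the script path is hypothetical) would run the check every five minutes.
Whether the daemon should be started as root or as the SGE admin user depends
on the installation.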

> -----Original Message-----
> From: McCalla, Mac [mailto:macmccalla at hess.com] 
> Sent: Thursday, May 19, 2005 23:00
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Help with error messages (better formatted)
> 
> 
> I think the qping output is OK. I do not understand the scheduler
> message about setting the SGE_ROOT environment variable if the
> scheduler is started by the sgemaster script. If it was started by
> typing the program name on the command line, then it certainly depends
> on what was defined in the shell environment the command was issued
> from. The messages from the 18:40 restart make me suspect that the
> qmaster/execd port assignments were not correct at that time. You can
> run the $SGE_ROOT/..../util/dl.sh script, set the debugging level, and
> start the qmaster again to perhaps see more info. I have used this
> technique when my own "classic spooling" system became corrupted, to
> see what file was actually being read at the time of the error.
> IIRC there is a section on debugging in the N1GE administrator's
> manual that discusses this.
> 
> Sorry to leave you now, but I must get up early in the am.....mac
> 
> 
> -----Original Message-----
> From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu] 
> Sent: 19 May 2005 20:39
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Help with error messages (better formatted)
> 
> 
> Hi, Mac,
> 
> Thank you very much for the prompt answer.
> 
> Yes, I run sge_execd on all nodes.
> I should mention that I updated the nodes to a new Linux version and
> forgot to change the /etc/services files. I have changed them now,
> but it did not help.
> 
> See here the info:
> 
> ========================================================================
> On one of my compute nodes:
> 
> 
> sub04n01:/tmp # qping -info rupc-cs04b 536  qmaster 1
> 
> 05/19/2005 21:22:29:
> 
> SIRM version:             0.1
> 
> SIRM message id:          1
> 
> start time:               05/19/2005 18:40:09 (1116542409)
> 
> run time [s]:             9756
> 
> messages in read buffer:  0
> 
> messages in write buffer: 0
> 
> nr. of connected clients: 163
> 
> status:                   0
> 
> info:                     EDT: R (0.17) | TET: R (6.71) | MT: R (0.17) | SIGT: R (9755.92) | ok
> 
> 
> Is it OK?
> 
> 
> ------------------------------------------------------------------------
> 
> Then, on the master node (rupc-cs04b), I did:
> /etc/init.d/sgemaster softstop
> 
> And then 
> /etc/init.d/sgemaster
> 
> 
> 
> ------------------------------------------------------------------------
> 
> rupc-cs04b:~ # ps -axuf
> 
> sgeadmin 19318  0.3  0.4 71580 14508 ?       S    21:25   0:00 /opt/SGE/bin/lx24-x86/sge_qmaster
> root     19320  0.0  0.4 71580 14508 ?       S    21:25   0:00  \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> root     19321  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> root     19322  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> root     19323  0.7  0.4 71580 14508 ?       S    21:25   0:01      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> root     19324  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> sgeadmin 19326  0.1  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> sgeadmin 19327  0.4  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> sgeadmin 19328  0.0  0.4 71580 14508 ?       S    21:25   0:00      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> sgeadmin 19329  2.7  0.4 71580 14508 ?       S    21:25   0:03      \_ /opt/SGE/bin/lx24-x86/sge_qmaster
> sgeadmin 19332  0.8  0.2  8196 6892 ?        S    21:25   0:01 /opt/SGE/bin/lx24-x86/sge_schedd
> 
> rupc-cs04b:~ # 
> 
> I think it looks fine. qping works again.
> But I still get the same messages!
> 
> 
> In rupc-cs04b:/opt/SGE/default/spool/qmaster/schedd :
> Message file:
> 
> 05/19/2005 18:40:07|schedd|rupc-cs04b|I|starting up 6.0u3
> 05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got read error (closing connection)
> 05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: got pipe error (closing connection)
> 05/19/2005 18:40:08|schedd|rupc-cs04b|E|commlib error: can't connect to service (socket error errno=111)
> 05/19/2005 18:40:10|schedd|rupc-cs04b|W|qmaster alive timeout expired
> 05/19/2005 20:59:50|schedd|rupc-cs04b|I|starting up 6.0u3
> 05/19/2005 21:24:16|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
> 05/19/2005 21:25:01|schedd|rupc-cs04b|C|Please set the environment variable SGE_ROOT.
> 05/19/2005 21:25:56|schedd|rupc-cs04b|I|starting up 6.0u3
> 
> (this is the last restart) : 21:25 
> 
> ========================================================================
> 
> With kind regards,
> v
> 
> > Hi Viktor,
> > 
> > Are the sge_execd's running on your compute nodes?  Are there
> > any messages in their messages files?  What happens when you 
> > stop/start one of the sge_execd's? You could try a qping 
> > command from one of your compute nodes back to the qmaster to 
> > see if the port assignments are correct in your environment.  
> > It looks like the scheduler did not start at all this time 
> > when you restarted the qmaster. any error messages in its 
> > messages file? 
> > 
> > mac mccalla
> >  
> > 
> > -----Original Message-----
> > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > Sent: 19 May 2005 17:59
> > To: users at gridengine.sunsource.net
> > Subject: [GE users] Help with error messages (better formatted)
> > 
> > 
> > 
> > Hi, I just retyped my previous e-mail with better formatting:
> > 
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > ..............................................................
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc01.rutgers.edu to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc02.rutgers.edu to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> > ...............................................
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> > 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> > 
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 
> > Thank you for your help,
> > v
> > 
> > > -----Original Message-----
> > > From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> > > Sent: Thursday, May 19, 2005 18:52
> > > To: users at gridengine.sunsource.net
> > > Subject: [GE users] Help with error messages
> > > 
> > > 
> > > Hello to everybody,
> > > 
> > > Does anybody know what these errors mean and how to get rid of them?
> > > file: /opt/SGE/default/spool/qmaster/messages
> > > 
> > > I restarted sgemaster:
> > > 
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|read job database with 24 entries in 0 seconds
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 
> > > ..............................................................
> > > MANY MESSAGES LIKE THOSE ONES (probably as many as number of hosts)
> > > ..............................................................
> > > 
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|removing reference to no longer existing job 19881 of user "udo"
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|W|received unkown event: 5
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will use max. 1004 file descriptors for communication
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|qmaster will accept max. 99 dynamic event clients
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n101 to send conf notification
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n102 to send conf notification
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n103 to send conf notification
> > > .....................................................
> > > 
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> > > 05/19/2005 18:40:09|qmaster|rupc-cs04b|I|starting up 6.0u3
> > > 05/19/2005 18:40:10|qmaster|rupc-cs04b|E|no event client known with id 1 to modify
> > > 
> > > Thank you very much for your help, comments etc.
> > > Regards,
> > > Viktor
> > > 
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > 





