[GE users] "The Scheduler dies" COMPLETE information

Viktor Oudovenko udo at physics.rutgers.edu
Sun May 22 19:46:57 BST 2005


Hi Stephan, and anybody else who can help!

Could you have a look at the attachment to see what is going on with my
scheduler?
As you advised, I ran the scheduler daemon in "dl 1" debug mode and waited
until it crashed.
And it did. It dies even without any events: you will find two lines from
the messages file from when the scheduler died for no apparent reason. The
last crash, however, happened when one of the Myrinet jobs finished.
Could you give me any hint as to what the cause could be and what can be
done about it?
I am running SuSE Linux 8.2 on the server and 9.0 and 9.2 on the slaves.
I also have a few Opterons (8 machines). I am happy to provide any further
information if necessary.
Please help.

With kind regards,
Viktor
P.S. In the attachment I included not only the last iteration but also a
couple of successful ones.
In debug mode the scheduler updates its information roughly every 5-10
seconds.
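
A note for anyone repeating this debug run: because the scheduler prints an
update every few seconds, the full log gets long (the attachment here is
about 1,766 lines), while only the output just before the crash was asked
for. The sketch below is one possible way to keep just that tail. It is not
part of SGE. The function name tail_of_output and the script name
keep_tail.py are made up for illustration, and it assumes the daemon stays
in the foreground and writes its debug output to stdout or stderr once
"dl 1" has been set in the calling shell.

```python
import collections
import subprocess
import sys


def tail_of_output(cmd, keep=200):
    """Run cmd and return (exit_status, last `keep` lines of its output)."""
    # A deque with maxlen acts as a ring buffer: old lines fall off the front.
    tail = collections.deque(maxlen=keep)
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
    )
    for line in proc.stdout:
        tail.append(line.rstrip("\n"))
    return proc.wait(), list(tail)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation (the path is the one quoted in this thread):
    #   python keep_tail.py /opt/SGE/bin/lx24-x86/sge_schedd
    status, lines = tail_of_output(sys.argv[1:])
    print("\n".join(lines))
    print("exit status:", status)
```

On POSIX systems a negative exit status from Popen means the process was
terminated by the signal with that number, which may help tell a crash from
a clean exit.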

> -----Original Message-----
> From: Stephan Grell - Sun Germany - SSG - Software Engineer 
> [mailto:stephan.grell at sun.com] 
> Sent: Friday, May 20, 2005 3:05
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler dies like a hell
> 
> 
> Hi,
> 
> I am not sure that a corrupted file is the problem. The 
> qmaster does some validation during startup. Could you 
> run the scheduler in debug mode and post the output just 
> before it dies?
> 
> You can set the debug mode with:
> 
> source $SGE_ROOT/<CELL>/common/settings.csh
> source $SGE_ROOT/util/dl.csh
> dl 1
> 
> bin/<arch>/sge_schedd
> 
> Or, do you have a stack trace of the scheduler?
> 
> Which version are you running on which arch?
> 
> Thanks,
> Stephan
> 
> Viktor Oudovenko wrote:
> 
> >Ron,
> >
> >Can I try to cut part of the accounting file? I mean, to EDIT it MANUALLY
> >even though it says not to do so? Best regards,
> >v
> >
> >  
> >
> >>-----Original Message-----
> >>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>Sent: Thursday, May 19, 2005 22:02
> >>To: users at gridengine.sunsource.net
> >>Subject: RE: [GE users] Scheduler dies like a hell
> >>
> >>
> >>It is not easy to find out which file gets corrupted
> >>:(
> >>
> >>One thing you can try is to move spooled job files (in
> >>default/spool/qmaster/jobs) to a backup directory.
> >>Also, you can use qconf to dump the configuration for
> >>the queues/users/hosts, and see if the values "make
> >>sense".
> >>
> >>Of course the best way to fix this is to restore from
> >>backup!
> >>
> >> -Ron
> >>
> >>
> >>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>    
> >>
> >>>Hi, Ron,
> >>>
> >>>I am using classic spooling.
> >>>Which file should I look for corruption? Can I edit
> >>>it manually?
> >>>Thank you very much in advance.
> >>>v
> >>>
> >>>      
> >>>
> >>>>-----Original Message-----
> >>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>>>Sent: Thursday, May 19, 2005 20:38
> >>>>To: users at gridengine.sunsource.net
> >>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>
> >>>>
> >>>>Are you using classic spooling or Berkeley DB
> >>>>spooling?
> >>>>
> >>>>With classic spooling, when the machine crashes, the
> >>>>files may get corrupted. And when qmaster reads in the
> >>>>corrupted files, it may also corrupt the qmaster's
> >>>>data structures.
> >>>>
> >>>>IIRC, Berkeley DB handles recovery itself, but I have
> >>>>never played with it myself :)
> >>>>
> >>>> -Ron
> >>>>
> >>>>
> >>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>>>
> >>>>>Hi, Mac,
> >>>>>Thank you very much for your advice!
> >>>>>I'll try. I think one of the running or finished jobs
> >>>>>left a bad record somewhere
> >>>>>(like the jobs directory).
> >>>>>Best regards,
> >>>>>v
> >>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: McCalla, Mac [mailto:macmccalla at hess.com]
> >>>>>>Sent: Thursday, May 19, 2005 15:12
> >>>>>>To: users at gridengine.sunsource.net
> >>>>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>>>
> >>>>>>Hi,
> >>>>>>
> >>>>>>Some things to look at: any messages in
> >>>>>>$SGE_ROOT/......../qmaster/schedd/messages ? To get more
> >>>>>>info about what the scheduler is doing while it is running, see
> >>>>>>the info about the scheduler params profile and monitor; you can set
> >>>>>>them equal to 1 to turn on
> >>>>>>some scheduler diagnostics, see man sched_conf.
> >>>>>>To extend the timeout value for the scheduler you can set
> >>>>>>qmaster_params SCHEDULER_TIMEOUT to some value greater than
> >>>>>>600 (seconds).
> >>>>>>You can also use the system command strace to get a trace of
> >>>>>>scheduler activity while it is running, to perhaps get a
> >>>>>>better idea of what it is spending its time doing.
> >>>>>>
> >>>>>>Hope this helps,
> >>>>>>
> >>>>>>mac mccalla
> >>>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> >>>>>>Sent: Thursday, May 19, 2005 12:00 PM
> >>>>>>To: users at gridengine.sunsource.net
> >>>>>>Subject: [GE users] Scheduler dies like a hell
> >>>>>>
> >>>>>>Hi, everybody,
> >>>>>>
> >>>>>>I am asking for your help and ideas on what could be done to restore
> >>>>>>normal operation of the scheduler. First, what happened: a few
> >>>>>>times during the last week our main server died, and I needed to
> >>>>>>reboot it and even replace it. The jobs that used automount
> >>>>>>kept running, but since yesterday or the day before yesterday the
> >>>>>>scheduler daemon keeps dying. I tried restarting sge_qmaster, but it
> >>>>>>did not help. Now, whenever the daemon dies, I start it manually by
> >>>>>>simply typing:
> >>>>>>
> >>>>>>/opt/SGE/bin/lx24-x86/sge_schedd
> >>>>>>
> >>>>>>but after some time it dies again. Please advise: what could it be?
> >>>>>>Below please find some lines from the messages file:
> >>>>>>
> >>>>>>
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> >>>>>>05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> >>>>>>05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> >>>>>>05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>>>>>05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> >>>>>>05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> >>>>>>05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> >>>>>>05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>>>>>
> >>=== message truncated ===
> >>
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> 
> 
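
The qmaster messages lines quoted in this thread appear to follow a
"date time|daemon|host|level|text" layout. As a rough aid for spotting the
scheduler-related entries among the job and execd noise, here is a minimal
sketch that splits such lines and keeps only those mentioning the schedd
event client. The names Entry, parse_line and scheduler_events are
hypothetical helpers, not SGE tools, and the field layout is inferred only
from the lines quoted above.

```python
from collections import namedtuple
from datetime import datetime

# Field names are illustrative; the layout is inferred from the quoted log.
Entry = namedtuple("Entry", "when daemon host level text")


def parse_line(line):
    """Split one 'date time|daemon|host|level|text' line; None if malformed."""
    parts = line.rstrip("\n").split("|", 4)  # text itself may contain '|'
    if len(parts) != 5:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return Entry(when, *parts[1:])


def scheduler_events(lines):
    """Yield parsed entries whose text mentions the schedd event client."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "schedd" in entry.text:
            yield entry


# Sample lines copied from the messages excerpt quoted in this thread.
SAMPLE = [
    '05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3',
    '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host '
    'sub04n203 assumedly after job because: job 21171.1 died through '
    'signal TERM (15)',
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    '05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" '
    '(rupc-cs04b/schedd/1) reregistered - it will need a total update',
    '05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
]

for e in scheduler_events(SAMPLE):
    print(e.when.isoformat(sep=" "), e.level, e.text)
```

To run it against a real file, pass an open file object instead of SAMPLE,
since scheduler_events accepts any iterable of lines.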


    [ Part 2, Text/PLAIN (Name: "scheduler_all.txt") ~1,766 lines. ]
    [ Unable to print this part. ]


    [ Part 3: "Attached Text" ]

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


