[GE users] Scheduler dies like a hell

Viktor Oudovenko udo at physics.rutgers.edu
Fri May 20 17:29:20 BST 2005


Hi, Stephan!

Thank you very much for looking into my problem.
I just ran the scheduler with the dl 1 option, and I did not expect so much
output to be printed.
I restarted it and am sending you the output as an attachment.
Please have a look. At first glance it looks like a jungle to me.

I appreciate your help very much,
With kind regards,
V


> -----Original Message-----
> From: Stephan Grell - Sun Germany - SSG - Software Engineer 
> [mailto:stephan.grell at sun.com] 
> Sent: Friday, May 20, 2005 3:05
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler dies like a hell
> 
> 
> Hi,
> 
> I am not sure that a corrupted file is the problem. The 
> qmaster does some validation during startup. Could you 
> run the scheduler in debug mode and post the output from just 
> before it dies?
> 
> You can set the debug mode with:
> 
> source $SGE_ROOT/<CELL>/common/settings.csh
> source $SGE_ROOT/util/dl.csh
> dl 1
> 
> bin/<arch>/sge_schedd
> 
> Or, do you have a stack trace of the scheduler?
> 
> Which version are you running on which arch?
> 
> Thanks,
> Stephan
> 
> Viktor Oudovenko wrote:
> 
> >Ron,
> >
> >Can I try to cut out part of the accounting file? I mean, EDIT it MANUALLY,
> >even though it says not to do so? Best regards,
> >v
> >
> >  
> >
> >>-----Original Message-----
> >>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>Sent: Thursday, May 19, 2005 22:02
> >>To: users at gridengine.sunsource.net
> >>Subject: RE: [GE users] Scheduler dies like a hell
> >>
> >>
> >>It is not easy to find out which file gets corrupted
> >>:(
> >>
> >>One thing you can try is to move spooled job files (in
> >>default/spool/qmaster/jobs) to a backup directory.
> >>Also, you can use qconf to dump the configuration for
> >>the queues/users/hosts, and see if the values "make
> >>sense".
> >>
> >>Of course the best way to fix this is to restore from
> >>backup!
> >>
> >> -Ron
> >>
> >>
> >>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>    
> >>
> >>>Hi, Ron,
> >>>
> >>>I am using classic spooling.
> >>>Which file should I look for corruption? Can I edit
> >>>it manually?
> >>>Thank you very much in advance.
> >>>v
> >>>
> >>>      
> >>>
> >>>>-----Original Message-----
> >>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>>>Sent: Thursday, May 19, 2005 20:38
> >>>>To: users at gridengine.sunsource.net
> >>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>
> >>>>
> >>>>Are you using classic spooling or Berkeley DB
> >>>>spooling?
> >>>>
> >>>>With classic spooling, when the machine crashes, the
> >>>>files may get corrupted. And when qmaster reads in the
> >>>>corrupted files, it may also corrupt the qmaster's
> >>>>data structures.
> >>>>
> >>>>IIRC, Berkeley DB handles recovery itself, but I have
> >>>>never played with it myself :)
> >>>>
> >>>> -Ron
> >>>>
> >>>>
> >>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>>>
> >>>>>Hi, Mac,
> >>>>>Thank you very much for your advice!
> >>>>>I'll try it. I think one of the running or finished jobs
> >>>>>wrote a bad record somewhere (for example, in the jobs directory).
> >>>>>Best regards,
> >>>>>v
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: McCalla, Mac [mailto:macmccalla at hess.com]
> >>>>>>Sent: Thursday, May 19, 2005 15:12
> >>>>>>To: users at gridengine.sunsource.net
> >>>>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>>>
> >>>>>>Hi,
> >>>>>>
> >>>>>>Some things to look at: are there any messages in
> >>>>>>$SGE_ROOT/......../qmaster/schedd/messages ?
> >>>>>>To get more info about what the scheduler is doing while it
> >>>>>>is running, see the info about the scheduler params profile
> >>>>>>and monitor; you can set them equal to 1 to turn on some
> >>>>>>scheduler diagnostics, see man sched_conf.
> >>>>>>To extend the timeout value for the scheduler you can set
> >>>>>>qmaster_params SCHEDULER_TIMEOUT to some value greater than
> >>>>>>600 (seconds).
> >>>>>>You can also use the system command strace to get a trace of
> >>>>>>scheduler activity while it is running, to perhaps get a
> >>>>>>better idea of what it is spending its time doing.
> >>>>>>
> >>>>>>Hope this helps,
> >>>>>>
> >>>>>>mac mccalla
> >>>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> >>>>>>Sent: Thursday, May 19, 2005 12:00 PM
> >>>>>>To: users at gridengine.sunsource.net
> >>>>>>Subject: [GE users] Scheduler dies like a hell
> >>>>>>
> >>>>>>Hi, everybody,
> >>>>>>
> >>>>>>I am asking for your help and ideas on what could be done to
> >>>>>>restore normal operation of the scheduler. First, what
> >>>>>>happened: a few times during the last week our main server died
> >>>>>>and I needed to reboot it and even replace it. Jobs which used
> >>>>>>automount kept running. But since yesterday or the day before
> >>>>>>yesterday the scheduler daemon has been dying. I tried to restart
> >>>>>>sge_master but it did not help. Now, when the daemon dies, I
> >>>>>>start it manually by simply typing:
> >>>>>>
> >>>>>>/opt/SGE/bin/lx24-x86/sge_schedd
> >>>>>>
> >>>>>>but after some time it dies again. Please advise what it could be.
> >>>>>>Below please find some info from the messages file:
> >>>>>>
> >>>>>>
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> >>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> >>>>>>05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> >>>>>>05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> >>>>>>05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>>>>>05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> >>>>>>05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> >>>>>>05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> >>>>>>05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>=== message truncated ===
> >>
> >>
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> >>    
> >>
> >
> >
> >
> >  
> >
> 
> 
> 
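
The qmaster messages quoted in this thread appear to use a pipe-separated
layout: date and time, daemon, host, a one-letter level (E, W, I), then the
message text. As a rough sketch only, assuming that layout holds for the whole
messages file, the scheduler-related entries (event client re-registrations
and acknowledge timeouts) could be filtered out with a short Python script.
The names LogEntry, parse_messages and scheduler_events are made up for this
illustration and are not part of Grid Engine.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    """One line of the messages file (hypothetical helper type)."""
    timestamp: str
    daemon: str
    host: str
    level: str  # single letter as seen in the thread: E, W or I
    text: str


# Assumed layout: "MM/DD/YYYY HH:MM:SS|daemon|host|level|message text"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]+)\|([^|]+)\|([A-Z])\|(.*)$"
)


def parse_messages(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield an entry for every line matching the assumed layout; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield LogEntry(*match.groups())


def scheduler_events(entries: Iterable[LogEntry]) -> List[LogEntry]:
    """Keep entries whose text mentions the scheduler event client or a timeout."""
    keywords = ("event client", "acknowledge timeout")
    return [e for e in entries if any(k in e.text for k in keywords)]


if __name__ == "__main__":
    # Sample lines copied from the messages quoted in this thread.
    sample = [
        "05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3",
        '05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" '
        "(rupc-cs04b/schedd/1) reregistered - it will need a total update",
        "05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after "
        '600 seconds for event client (schedd:1) on host "rupc-cs04b"',
    ]
    for entry in scheduler_events(parse_messages(sample)):
        print(entry.timestamp, entry.level, entry.text)
```

To run it against a real file, the sample list can be replaced by an open
file handle, since parse_messages accepts any iterable of lines.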



    [ Part 2: "Attached Text" ]



