[GE users] Scheduler dies like a hell

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Fri May 20 08:04:30 BST 2005


Hi,

I am not sure that a corrupted file is the problem. The qmaster does some
validation during startup. Could you run the scheduler in debug mode and
post the output from just before it dies?

You can set the debug mode with:

source $SGE_ROOT/<CELL>/common/settings.csh
source $SGE_ROOT/util/dl.csh
dl 1

bin/<arch>/sge_schedd
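
Since only the output just before the crash is of interest, a small wrapper can keep the last lines and drop the rest. The following is only a sketch and not part of Grid Engine: the name run_and_keep_tail and the default of 200 lines are made up for illustration. It assumes it is started from the same shell in which "dl 1" was set (so the debug environment is inherited) and that the scheduler stays in the foreground and writes its debug output to stdout or stderr.

```python
import subprocess
import sys
from collections import deque
from typing import List, Sequence


def run_and_keep_tail(cmd: Sequence[str], keep: int = 200) -> List[str]:
    """Run cmd and return only the last `keep` lines it printed.

    stderr is merged into stdout so the order of the lines is preserved.
    The helper name and the default of 200 lines are illustrative only.
    """
    tail = deque(maxlen=keep)  # older lines fall off the front automatically
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        errors="replace",
    )
    for line in proc.stdout:
        tail.append(line.rstrip("\n"))
    proc.wait()
    return list(tail)


# Stand-in command for demonstration; replace it with the real scheduler
# binary, e.g. the bin/<arch>/sge_schedd path from above.
demo = run_and_keep_tail(
    [sys.executable, "-c", "for i in range(10): print(i)"], keep=3
)
print(demo)
```

The deque with maxlen keeps memory bounded even if the scheduler runs for hours before it dies.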

Or, do you have a stack trace of the scheduler?

Which version are you running on which arch?
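
In the qmaster messages quoted below, the scheduler-related lines are the ones that mention the event client. A rough sketch for pulling those out of a messages file follows. It assumes the pipe-separated layout visible in the quoted lines (date time|daemon|host|level|text); the function name scheduler_entries and the keyword list are illustrative choices, not anything Grid Engine provides.

```python
import re
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


# Assumed layout, taken from the quoted log lines:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|text
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)

# Illustrative keywords; adjust to whatever you are hunting for.
KEYWORDS = ("event client", "schedd")


def scheduler_entries(lines: Iterable[str]) -> List[Entry]:
    """Return the parsed entries whose text mentions one of KEYWORDS."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        entry = Entry(*match.groups())
        if any(keyword in entry.text for keyword in KEYWORDS):
            found.append(entry)
    return found


# Three lines copied from the messages quoted in this thread.
sample = [
    "05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3",
    '05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "rupc-cs04b"',
    "05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host "
    "sub04n29 assumedly after job because: job 20222.1 died through signal "
    "KILL (9)",
]
hits = scheduler_entries(sample)
for hit in hits:
    print(hit.timestamp, hit.level, hit.text)
```

For a real file, pass the open file object instead of the sample list, for example scheduler_entries(open(path)) with path pointing at your qmaster messages file.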

Thanks,
Stephan

Viktor Oudovenko wrote:

>Ron,
>
>Can I try to cut part of the accounting file? I mean to EDIT it MANUALLY,
>even though it says not to do so?
>Best regards,
>v
>
>  
>
>>-----Original Message-----
>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com] 
>>Sent: Thursday, May 19, 2005 22:02
>>To: users at gridengine.sunsource.net
>>Subject: RE: [GE users] Scheduler dies like a hell
>>
>>
>>It is not easy to find out which file gets corrupted
>>:(
>>
>>One thing you can try is to move spooled job files (in
>>default/spool/qmaster/jobs) to a backup directory.
>>Also, you can use qconf to dump the configuration for
>>the queues/users/hosts, and see if the values "make
>>sense".
>>
>>Of course the best way to fix this is to restore from
>>backup!
>>
>> -Ron
>>
>>
>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
>>    
>>
>>>Hi, Ron,
>>>
>>>I am using classic spooling.
>>>Which file should I look for corruption? Can I edit
>>>it manually?
>>>Thank you very much in advance.
>>>v
>>>
>>>      
>>>
>>>>-----Original Message-----
>>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
>>>>Sent: Thursday, May 19, 2005 20:38
>>>>To: users at gridengine.sunsource.net
>>>>Subject: RE: [GE users] Scheduler dies like a hell
>>>>
>>>>
>>>>Are you using classic spooling or Berkeley DB
>>>>spooling?
>>>>
>>>>With classic spooling, when the machine crashes, the
>>>>files may get corrupted. And when qmaster reads in the
>>>>corrupted files, it may also corrupt the qmaster's
>>>>data structures.
>>>>
>>>>IIRC, Berkeley DB handles recovery itself, but I have
>>>>never played with it myself :)
>>>>
>>>> -Ron
>>>>
>>>>
>>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
>>>>
>>>>>Hi, Mac,
>>>>>Thank you very much for your advice!
>>>>>I'll try. I think one of the running or finished jobs
>>>>>wrote a bad record somewhere (like the jobs directory).
>>>>>Best regards,
>>>>>v
>>>>>
>>>>>>-----Original Message-----
>>>>>>From: McCalla, Mac [mailto:macmccalla at hess.com]
>>>>>>Sent: Thursday, May 19, 2005 15:12
>>>>>>To: users at gridengine.sunsource.net
>>>>>>Subject: RE: [GE users] Scheduler dies like a hell
>>>>>>
>>>>>>Hi,
>>>>>>
>>>>>>Some things to look at: any messages in
>>>>>>$SGE_ROOT/......../qmaster/schedd/messages ?
>>>>>>To get more info about what the scheduler is doing while it is
>>>>>>running, see the info about the scheduler params profile and
>>>>>>monitor; you can set them equal to 1 to turn on some scheduler
>>>>>>diagnostics, see man sched_conf.
>>>>>>To extend the timeout value for the scheduler you can set
>>>>>>qmaster_params SCHEDULER_TIMEOUT to some value greater than
>>>>>>600 (seconds).
>>>>>>You can also use the system command strace to get a trace of
>>>>>>scheduler activity while it is running, to perhaps get a
>>>>>>better idea of what it is spending its time doing.
>>>>>>
>>>>>>Hope this helps,
>>>>>>
>>>>>>mac mccalla
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
>>>>>>Sent: Thursday, May 19, 2005 12:00 PM
>>>>>>To: users at gridengine.sunsource.net
>>>>>>Subject: [GE users] Scheduler dies like a hell
>>>>>>
>>>>>>Hi, everybody,
>>>>>>
>>>>>>I am asking for your help and ideas on what could be done to restore
>>>>>>normal operation of the scheduler. First, what happened. A few
>>>>>>times during the last week our main server died and I needed to
>>>>>>reboot it and even replace it. But jobs which used automount
>>>>>>continued to run. But since yesterday or the day before yesterday
>>>>>>the scheduler daemon keeps dying. I tried to restart sge_qmaster but
>>>>>>it did not help. Now when the daemon dies I start it manually,
>>>>>>simply typing:
>>>>>>
>>>>>>/opt/SGE/bin/lx24-x86/sge_schedd
>>>>>>
>>>>>>but after some time it dies again. Please advise, what could it be?
>>>>>>
>>>>>>Below please find some info from the messages file:
>>>>>>
>>>>>>
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
>>>>>>05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
>>>>>>05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>>>>05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
>>>>>>05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
>>>>>>05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
>>>>>>05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>>>>05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
>>>>>>05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
>>>>>>05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
>>>>>>
>>=== message truncated ===
>>
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>    
>>
>
>





