[GE users] "The Scheduler dies" COMPLETE information

Viktor Oudovenko udo at physics.rutgers.edu
Mon May 23 08:50:20 BST 2005


Hi, Stephan,

Thank you for the answer.
When will u4 be issued, and where can I read about issue 1416?

Meanwhile I have tried many things, but nothing has helped so far.
As to why my scheduler reregisters so often: after it dies, I restart it
manually, simply by issuing the command:
$SGE_ROOT/bin/lx..../sge_schedd
Then the information about reregistering appears.
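[Editor's note: if the daemon has to be restarted by hand anyway, it may be worth enabling core dumps first, so the next crash leaves a core file from which a stack trace can be pulled (Stephan asks for one further down in this thread). A rough sketch, assuming a Bourne-style shell and that gdb is available; the arch directory and core path are illustrative:]

```shell
# Allow core files to be written before starting the scheduler
ulimit -c unlimited

# Start the scheduler (substitute your actual arch directory for lx24-x86)
$SGE_ROOT/bin/lx24-x86/sge_schedd

# After the next crash, extract a backtrace from the core file, e.g.:
#   gdb $SGE_ROOT/bin/lx24-x86/sge_schedd /path/to/core
#   (gdb) bt
```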

Thank you very much for your help.
v

> -----Original Message-----
> From: Stephan Grell - Sun Germany - SSG - Software Engineer 
> [mailto:stephan.grell at sun.com] 
> Sent: Monday, May 23, 2005 3:45
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] "The Scheduler dies" COMPLETE information
> 
> 
> Hi Viktor,
> 
> you encounter issue 1416. This is fixed with u4.
> However, the important question is, why your scheduler is 
> reregistering so often.
> 
> Stephan
> 
> Viktor Oudovenko wrote:
> 
> >Hi, Stephan and anybody who can help!
> >
> >Could you have a look at the attachment to see what is going on with my
> >scheduler? As you advised, I ran the scheduler daemon in "dl 1" mode and
> >waited until it crashed. And it did. It even dies without any events: you
> >will find two lines in the messages file where the scheduler died for no
> >apparent reason. The last crash, however, happened when one of the
> >myrinet jobs finished.
> >Could you give any hint as to what this could be and what could be done?
> >I am running Linux SuSE 8.2 on the server and 9.0 and 9.2 on the slaves.
> >I also have a few Opterons (8 machines). I am happy to provide any
> >further information if necessary.
> >Please help.
> >
> >With kind regards,
> >Viktor
> >P.S. In the attachment I put not only the last iteration but also a
> >couple of successful ones. In debug mode the scheduler actually updates
> >its information every 5-10 seconds or so.
> >
> >  
> >
> >>-----Original Message-----
> >>From: Stephan Grell - Sun Germany - SSG - Software Engineer
> >>[mailto:stephan.grell at sun.com] 
> >>Sent: Friday, May 20, 2005 3:05
> >>To: users at gridengine.sunsource.net
> >>Subject: Re: [GE users] Scheduler dies like a hell
> >>
> >>
> >>Hi,
> >>
> >>I am not sure that a corrupted file is the problem. The
> >>qmaster does some validation during startup. Could you
> >>run the scheduler in debug mode and post the output from just
> >>before it dies?
> >>
> >>You can set the debug mode with:
> >>
> >>source $SGE_ROOT/<CELL>/common/settings.csh
> >>source $SGE_ROOT/util/dl.csh
> >>dl 1
> >>
> >>bin/<arch>/sge_schedd
> >>
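[Editor's note: to keep the debug output for later inspection, the scheduler can be started with its output redirected to a file. A minimal sketch of the steps above, assuming a csh-style session and the standard "default" cell name; the log filename is illustrative:]

```shell
# Load the SGE environment and enable debug level 1 (as described above)
source $SGE_ROOT/default/common/settings.csh
source $SGE_ROOT/util/dl.csh
dl 1

# Run the scheduler in the foreground, capturing stdout and stderr to a
# log file so the lines just before a crash are preserved (csh syntax).
$SGE_ROOT/bin/lx24-x86/sge_schedd >& schedd_debug.log
```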
> >>Or, do you have a stack trace of the scheduler?
> >>
> >>Which version are you running on which arch?
> >>
> >>Thanks,
> >>Stephan
> >>
> >>Viktor Oudovenko wrote:
> >>
> >>>Ron,
> >>>
> >>>Can I try to cat part of the accounting file? I mean, can I EDIT it
> >>>MANUALLY, even though it is written that one should not do so?
> >>>
> >>>Best regards,
> >>>v
> >>>
> >>>>-----Original Message-----
> >>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>>>Sent: Thursday, May 19, 2005 22:02
> >>>>To: users at gridengine.sunsource.net
> >>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>
> >>>>
> >>>>It is not easy to find out which file gets corrupted
> >>>>:(
> >>>>
> >>>>One thing you can try is to move spooled job files (in
> >>>>default/spool/qmaster/jobs) to a backup directory.
> >>>>Also, you can use qconf to dump the configuration for
> >>>>the queues/users/hosts, and see if the values "make
> >>>>sense".
> >>>>
> >>>>Of course the best way to fix this is to restore from backup!
> >>>>
> >>>>-Ron
> >>>>
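[Editor's note: Ron's qconf suggestion can be done along these lines. A sketch only; flags per the qconf man page, assuming an SGE 6.0 installation with the environment already sourced:]

```shell
# Dump queue, host and user configuration and check the values "make sense"
qconf -sql                                        # list all cluster queues
for q in $(qconf -sql); do qconf -sq "$q"; done   # show each queue's config
qconf -sel                                        # list execution hosts
qconf -suserl                                     # list configured users
```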
> >>>>
> >>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>>>
> >>>>>Hi, Ron,
> >>>>>
> >>>>>I am using classic spooling.
> >>>>>Which file should I look for corruption? Can I edit
> >>>>>it manually?
> >>>>>Thank you very much in advance.
> >>>>>v
> >>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
> >>>>>>Sent: Thursday, May 19, 2005 20:38
> >>>>>>To: users at gridengine.sunsource.net
> >>>>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>>>
> >>>>>>
> >>>>>>Are you using classic spooling or Berkeley DB
> >>>>>>spooling?
> >>>>>>
> >>>>>>With classic spooling, when the machine crashes, the
> >>>>>>files may get corrupted. And when qmaster reads in the
> >>>>>>corrupted files, it may also corrupt the qmaster's
> >>>>>>data structures.
> >>>>>>
> >>>>>>IIRC, Berkeley DB handles recovery itself, but I have
> >>>>>>never played with it myself :)
> >>>>>>
> >>>>>>-Ron
> >>>>>>
> >>>>>>
> >>>>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
> >>>>>>
> >>>>>>>Hi, Mac,
> >>>>>>>Thank you very much for your advice!
> >>>>>>>I'll try. I think one of the running or finished jobs
> >>>>>>>wrote a bad record somewhere
> >>>>>>>(like the jobs directory).
> >>>>>>>Best regards,
> >>>>>>>v
> >>>>>>>
> >>>>>>>>-----Original Message-----
> >>>>>>>>From: McCalla, Mac [mailto:macmccalla at hess.com]
> >>>>>>>>Sent: Thursday, May 19, 2005 15:12
> >>>>>>>>To: users at gridengine.sunsource.net
> >>>>>>>>Subject: RE: [GE users] Scheduler dies like a hell
> >>>>>>>>
> >>>>>>>>Hi,
> >>>>>>>>
> >>>>>>>>Some things to look at: any messages in
> >>>>>>>>$SGE_ROOT/......../qmaster/schedd/messages ? To get more
> >>>>>>>>info about what the scheduler is doing while it is running,
> >>>>>>>>see the info about the scheduler params "profile" and
> >>>>>>>>"monitor"; you can set them equal to 1 to turn on
> >>>>>>>>some scheduler diagnostics. See man sched_conf.
> >>>>>>>>To extend the timeout value for the scheduler you can set
> >>>>>>>>qmaster_params SCHEDULER_TIMEOUT to some value greater than
> >>>>>>>>600 (seconds).
> >>>>>>>>You can also use the system command strace to get a trace of
> >>>>>>>>scheduler activity while it is running, to perhaps get a
> >>>>>>>>better idea of what it is spending its time doing.
> >>>>>>>>
> >>>>>>>>Hope this helps,
> >>>>>>>>
> >>>>>>>>mac mccalla
> >>>>>>>>
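[Editor's note: Mac's suggestions can be applied with qconf. A sketch only, assuming admin rights on an SGE 6.0 cluster; check the exact parameter syntax against man sched_conf and man sge_conf before using it:]

```shell
# Inspect the current scheduler configuration (look for the "params" line)
qconf -ssconf

# Edit the scheduler configuration and enable diagnostics, e.g.:
#   params   PROFILE=1,MONITOR=1
qconf -msconf

# Edit the global configuration and raise the scheduler timeout, e.g.:
#   qmaster_params   SCHEDULER_TIMEOUT=1200
qconf -mconf

# Attach strace to the running scheduler (substitute the real PID if
# pgrep is unavailable)
strace -p $(pgrep sge_schedd) -o schedd.strace
```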
> >>>>>>>>-----Original Message-----
> >>>>>>>>From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
> >>>>>>>>Sent: Thursday, May 19, 2005 12:00 PM
> >>>>>>>>To: users at gridengine.sunsource.net
> >>>>>>>>Subject: [GE users] Scheduler dies like a hell
> >>>>>>>>
> >>>>>>>>Hi, everybody,
> >>>>>>>>
> >>>>>>>>I am asking for your help and ideas on what could be done to
> >>>>>>>>restore normal operation of the scheduler. First, what
> >>>>>>>>happened: a few times during the last week our main server
> >>>>>>>>died, and I needed to reboot it and eventually replace it.
> >>>>>>>>The jobs which used automount continued to run. But since
> >>>>>>>>yesterday or the day before, the scheduler daemon keeps
> >>>>>>>>dying. I tried to restart sge_qmaster, but it did not help.
> >>>>>>>>Now whenever the daemon dies I start it manually, simply by
> >>>>>>>>typing:
> >>>>>>>>
> >>>>>>>>/opt/SGE/bin/lx24-x86/sge_schedd
> >>>>>>>>
> >>>>>>>>but after some time it dies again. Please advise: what could
> >>>>>>>>it be?
> >>>>>>>>Below please find some info from the messages file:
> >>>>>>>>
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
> >>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
> >>>>>>>>05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
> >>>>>>>>05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>>>05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
> >>>>>>>>05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>>>>>>>05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
> >>>>>>>>05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >>>>>>>>05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
> >>>>>>>>05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
> >>>>>>>>05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
> >>>>>>>>
> >>>>=== message truncated ===
> >>>>
> >>>>
> >>>>---------------------------------------------------------------------
> >>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>>
> >
> >-----------------------------------------------------------------------
> >
> >128133  25368 16384     SENDING 22 ORDERS TO QMASTER
> >128134  25368 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >128135  25368 16384     reresolve port timeout in 340
> >128136  25368 16384     returning cached port value: 536
> >--------------STOP-SCHEDULER-RUN-------------
> >128137  25368 16384     ec_get retrieving events - will do max 20 fetches
> >128138  25368 16384     doing sync fetch for messages, 20 still to do
> >128139  25368 16384     try to get request from qmaster, id 1
> >128140  25368 16384     Checking 55 events (44303-44357) while waiting for #44303
> >128141  25368 16384     check complete, 55 events in list
> >128142  25368 16384     got 55 events till 44357
> >128143  25368 16384     doing async fetch for messages, 19 still to do
> >128144  25368 16384     try to get request from qmaster, id 1
> >128145  25368 16384     reresolve port timeout in 320
> >128146  25368 16384     returning cached port value: 536
> >128147  25368 16384     Sent ack for all events lower or equal 44357
> >128148  25368 16384     ec_get - received 55 events
> >128149  25368 16384     44303. EVENT MOD EXECHOST sub04n147
> >128150  25368 16384     44304. EVENT MOD USER udo
> >128151  25368 16384     44305. EVENT MOD USER iber
> >128152  25368 16384     44306. EVENT MOD USER dieguez
> >128153  25368 16384     44307. EVENT MOD USER karenjoh
> >128154  25368 16384     44308. EVENT MOD USER lorenzo
> >128155  25368 16384     44309. EVENT MOD USER parcolle
> >128156  25368 16384     44310. EVENT MOD USER cfennie
> >128157  25368 16384     44311. EVENT MOD USER civelli
> >128158  25368 16384     44312. EVENT MOD EXECHOST sub04n135
> >128159  25368 16384     44313. EVENT MOD EXECHOST sub04n141
> >128160  25368 16384     44314. EVENT MOD EXECHOST sub04n127
> >128161  25368 16384     44315. EVENT MOD EXECHOST sub04n145
> >128162  25368 16384     44316. EVENT MOD EXECHOST sub04n133
> >128163  25368 16384     44317. EVENT MOD EXECHOST sub04n148
> >128164  25368 16384     44318. EVENT MOD EXECHOST sub04n74
> >128165  25368 16384     44319. EVENT JOB 21542.1 task 2.sub04n74 USAGE
> >128166  25368 16384     44320. EVENT JOB 21542.1 task 1.sub04n74 USAGE
> >128167  25368 16384     44321. EVENT MOD EXECHOST rupc03.rutgers.edu
> >128168  25368 16384     44322. EVENT MOD EXECHOST sub04n139
> >128169  25368 16384     44323. EVENT MOD EXECHOST rupc02.rutgers.edu
> >128170  25368 16384     44324. EVENT MOD EXECHOST sub04n80
> >128171  25368 16384     44325. EVENT MOD EXECHOST sub04n207
> >128172  25368 16384     44326. EVENT MOD EXECHOST sub04n180
> >128173  25368 16384     44327. EVENT MOD EXECHOST sub04n23
> >128174  25368 16384     44328. EVENT MOD EXECHOST sub04n30
> >128175  25368 16384     44329. EVENT MOD EXECHOST sub04n203
> >128176  25368 16384     44330. EVENT MOD EXECHOST sub04n109
> >128177  25368 16384     44331. EVENT MOD EXECHOST rupc04.rutgers.edu
> >128178  25368 16384     44332. EVENT MOD EXECHOST sub04n114
> >128179  25368 16384     44333. EVENT MOD EXECHOST sub04n106
> >128180  25368 16384     44334. EVENT MOD EXECHOST sub04n88
> >128181  25368 16384     44335. EVENT JOB 21507.1 task 6.sub04n88 USAGE
> >128182  25368 16384     44336. EVENT JOB 21507.1 task 5.sub04n88 USAGE
> >128183  25368 16384     44337. EVENT MOD EXECHOST sub04n157
> >128184  25368 16384     44338. EVENT MOD EXECHOST sub04n20
> >128185  25368 16384     44339. EVENT MOD EXECHOST sub04n156
> >128186  25368 16384     44340. EVENT MOD EXECHOST sub04n26
> >128187  25368 16384     44341. EVENT JOB 21213.1 USAGE
> >128188  25368 16384     44342. EVENT MOD EXECHOST sub04n05
> >128189  25368 16384     44343. EVENT MOD EXECHOST sub04n103
> >128190  25368 16384     44344. EVENT MOD EXECHOST sub04n164
> >128191  25368 16384     44345. EVENT MOD EXECHOST sub04n09
> >128192  25368 16384     44346. EVENT MOD EXECHOST sub04n105
> >128193  25368 16384     44347. EVENT MOD EXECHOST sub04n113
> >128194  25368 16384     44348. EVENT MOD EXECHOST sub04n28
> >128195  25368 16384     44349. EVENT MOD EXECHOST sub04n76
> >128196  25368 16384     44350. EVENT MOD EXECHOST sub04n162
> >128197  25368 16384     44351. EVENT MOD EXECHOST sub04n108
> >128198  25368 16384     44352. EVENT MOD EXECHOST sub04n38
> >128199  25368 16384     44353. EVENT MOD EXECHOST sub04n04
> >128200  25368 16384     44354. EVENT MOD EXECHOST sub04n116
> >128201  25368 16384     44355. EVENT MOD EXECHOST sub04n179
> >128202  25368 16384     44356. EVENT MOD EXECHOST sub04n160
> >128203  25368 16384     44357. EVENT MOD EXECHOST sub04n107
> >Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7
> >128204  25368 16384     ================[SCHEDULING-EPOCH]==================
> >128205  25368 16384     JOB 20937.1 start_time = 1116447112 running_time 338079 decay_time = 450
> >128206  25368 16384     JOB 20938.1 start_time = 1116374344 running_time 410847 decay_time = 450
> >128207  25368 16384     JOB 21040.1 start_time = 1116443073 running_time 342118 decay_time = 450
> >128208  25368 16384     JOB 21076.1 start_time = 1116451351 running_time 333840 decay_time = 450
> >128209  25368 16384     JOB 21210.1 start_time = 1116514970 running_time 270221 decay_time = 450
> >128210  25368 16384     JOB 21213.1 start_time = 1116515250 running_time 269941 decay_time = 450
> >128211  25368 16384     JOB 21338.1 start_time = 1116543252 running_time 241939 decay_time = 450
> >128212  25368 16384     JOB 21423.1 start_time = 1116629274 running_time 155917 decay_time = 450
> >128213  25368 16384     JOB 21424.1 start_time = 1116631365 running_time 153826 decay_time = 450
> >128214  25368 16384     JOB 21440.1 start_time = 1116632934 running_time 152257 decay_time = 450
> >128215  25368 16384     JOB 21441.1 start_time = 1116632994 running_time 152197 decay_time = 450
> >128216  25368 16384     JOB 21443.1 start_time = 1116633602 running_time 151589 decay_time = 450
> >128217  25368 16384     JOB 21474.1 start_time = 1116655118 running_time 130073 decay_time = 450
> >128218  25368 16384     JOB 21503.1 start_time = 1116707395 running_time 77796 decay_time = 450
> >128219  25368 16384     JOB 21507.1 start_time = 1116714061 running_time 71130 decay_time = 450
> >128220  25368 16384     JOB 21528.1 start_time = 1116707641 running_time 77550 decay_time = 450
> >128221  25368 16384     JOB 21530.1 start_time = 1116714453 running_time 70738 decay_time = 450
> >128222  25368 16384     JOB 21537.1 start_time = 1116724845 running_time 60346 decay_time = 450
> >128223  25368 16384     JOB 21542.1 start_time = 1116782511 running_time 2680 decay_time = 450
> >128224  25368 16384     verified threshold of 169 queues
> >128225  25368 16384     queue myrinet at sub04n61 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128226  25368 16384     queue myrinet at sub04n62 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128227  25368 16384     queue myrinet at sub04n65 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128228  25368 16384     queue myrinet at sub04n66 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128229  25368 16384     queue myrinet at sub04n67 tagged to be 
> overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128230  25368 16384     queue myrinet at sub04n68 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128231  25368 16384     queue myrinet at sub04n69 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128232  25368 16384     queue myrinet at sub04n70 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128233  25368 16384     queue myrinet at sub04n71 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128234  25368 16384     queue myrinet at sub04n72 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128235  25368 16384     queue myrinet at sub04n75 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128236  25368 16384     queue myrinet at sub04n77 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128237  25368 16384     queue myrinet at sub04n78 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128238  25368 16384     queue myrinet at sub04n79 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128239  25368 16384     queue myrinet at sub04n81 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128240  25368 16384     queue myrinet at sub04n84 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128241  25368 16384     queue myrinet at sub04n85 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128242  25368 16384     queue myrinet at sub04n86 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128243  25368 16384     queue myrinet at sub04n87 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128244  25368 16384     queue myrinet at sub04n88 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128245  25368 16384     queue myrinet at sub04n89 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128246  25368 16384     queue myrinet at sub04n90 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128247  25368 16384     queue myrinet at sub04n91 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128248  25368 16384     queue myrinet at sub04n63 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128249  25368 16384     queue myrinet at sub04n64 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128250  25368 16384     queue myrinet at sub04n73 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128251  25368 16384     queue myrinet at sub04n74 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128252  25368 16384     queue opteronp at sub04n202 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128253  25368 16384     queue opteronp at sub04n205 tagged to 
> be overloaded: load_medium=1.010000 (no load adjustment) >= 1.0
> >
> >128254  25368 16384     queue opteronp at sub04n206 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128255  25368 16384     queue opteronp at sub04n208 tagged to 
> be overloaded: load_medium=1.010000 (no load adjustment) >= 1.0
> >
> >128256  25368 16384     queue parallel at sub04n121 tagged to 
> be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128257  25368 16384     queue parallel at sub04n139 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128258  25368 16384     queue parallel at sub04n140 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128259  25368 16384     queue parallel at sub04n141 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128260  25368 16384     queue parallel at sub04n142 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128261  25368 16384     queue parallel at sub04n143 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128262  25368 16384     queue parallel at sub04n144 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128263  25368 16384     queue parallel at sub04n146 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128264  25368 16384     queue parallel at sub04n02 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128265  25368 16384     queue parallel at sub04n03 tagged to be 
> overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128266  25368 16384     queue parallel at sub04n04 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128267  25368 16384     queue parallel at sub04n05 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128268  25368 16384     queue parallel at sub04n06 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128269  25368 16384     queue parallel at sub04n07 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128270  25368 16384     queue parallel at sub04n08 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128271  25368 16384     queue parallel@sub04n09 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128272  25368 16384     queue parallel@sub04n10 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128273  25368 16384     queue parallel@sub04n11 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128274  25368 16384     verified threshold of 169 queues
> >128275  25368 16384     STARTING PASS 1 WITH 0 PENDING JOBS
> >128276  25368 16384     Not enrolled ja_tasks: 0
> >128277  25368 16384     Enrolled ja_tasks: 1
> >128278  25368 16384     Not enrolled ja_tasks: 0
> >128279  25368 16384     Enrolled ja_tasks: 1
> >128280  25368 16384     Not enrolled ja_tasks: 0
> >128281  25368 16384     Enrolled ja_tasks: 1
> >128282  25368 16384     Not enrolled ja_tasks: 0
> >128283  25368 16384     Enrolled ja_tasks: 1
> >128284  25368 16384     Not enrolled ja_tasks: 0
> >128285  25368 16384     Enrolled ja_tasks: 1
> >128286  25368 16384     Not enrolled ja_tasks: 0
> >128287  25368 16384     Enrolled ja_tasks: 1
> >128288  25368 16384     Not enrolled ja_tasks: 0
> >128289  25368 16384     Enrolled ja_tasks: 1
> >128290  25368 16384     Not enrolled ja_tasks: 0
> >128291  25368 16384     Enrolled ja_tasks: 1
> >128292  25368 16384     Not enrolled ja_tasks: 0
> >128293  25368 16384     Enrolled ja_tasks: 1
> >128294  25368 16384     Not enrolled ja_tasks: 0
> >128295  25368 16384     Enrolled ja_tasks: 1
> >128296  25368 16384     Not enrolled ja_tasks: 0
> >128297  25368 16384     Enrolled ja_tasks: 1
> >128298  25368 16384     Not enrolled ja_tasks: 0
> >128299  25368 16384     Enrolled ja_tasks: 1
> >128300  25368 16384     Not enrolled ja_tasks: 0
> >128301  25368 16384     Enrolled ja_tasks: 1
> >128302  25368 16384     Not enrolled ja_tasks: 0
> >128303  25368 16384     Enrolled ja_tasks: 1
> >128304  25368 16384     Not enrolled ja_tasks: 0
> >128305  25368 16384     Enrolled ja_tasks: 1
> >128306  25368 16384     Not enrolled ja_tasks: 0
> >128307  25368 16384     Enrolled ja_tasks: 1
> >128308  25368 16384     Not enrolled ja_tasks: 0
> >128309  25368 16384     Enrolled ja_tasks: 1
> >128310  25368 16384     Not enrolled ja_tasks: 0
> >128311  25368 16384     Enrolled ja_tasks: 1
> >128312  25368 16384     Not enrolled ja_tasks: 0
> >128313  25368 16384     Enrolled ja_tasks: 1
> >128314  25368 16384     STARTING PASS 2 WITH 0 PENDING JOBS
> >128315  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128316  25368 16384        slots: 1.000000 * 1000.000000 * 6 
>    ---> 6000.000000
> >128317  25368 16384     slot request assumed for static 
> urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
> >128318  25368 16384        slots: 1.000000 * 1000.000000 * 
> 20    ---> 20000.000000
> >128319  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128320  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128321  25368 16384        slots: 1.000000 * 1000.000000 * 6 
>    ---> 6000.000000
> >128322  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128323  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128324  25368 16384     slot request assumed for static 
> urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
> >128325  25368 16384        slots: 1.000000 * 1000.000000 * 2 
>    ---> 2000.000000
> >128326  25368 16384        slots: 1.000000 * 1000.000000 * 8 
>    ---> 8000.000000
> >128327  25368 16384     ASU min = 1000.00000000000, ASU max 
> = 20000.00000000000
> >128328  25368 16384     
> >128329  25368 16384     no DDJU: do_usage: 1 finished_jobs 0
> >128330  25368 16384     
> >128331  25368 16384     =====================[Pass 
> 0]======================
> >128332  25368 16384     =====================[Pass 
> 1]======================
> >128333  25368 16384     =====================[Pass 
> 2]======================
> >128334  25368 16384     
> >128335  25368 16384     no DDJU: do_usage: 0 finished_jobs 0
> >128336  25368 16384     
> >128337  25368 16384     =====================[Pass 
> 0]======================
> >128338  25368 16384     =====================[Pass 
> 1]======================
> >128339  25368 16384     =====================[Pass 
> 2]======================
> >128340  25368 16384     Normalizing tickets using 
> 0.000000/18.333333 as min_tix/max_tix
> >128341  25368 16384        got 19 running jobs
> >128342  25368 16384        added 19 ticket orders for running jobs
> >128343  25368 16384        added 1 orders for updating usage of user
> >128344  25368 16384        added 0 orders for updating usage 
> of project
> >128345  25368 16384        added 0 orders for updating share tree
> >128346  25368 16384        added 1 orders for scheduler configuration
> >128347  25368 16384     SENDING 22 ORDERS TO QMASTER
> >128348  25368 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >128349  25368 16384     reresolve port timeout in 320
> >128350  25368 16384     returning cached port value: 536
> >--------------STOP-SCHEDULER-RUN-------------
> >128351  25368 16384     ec_get retrieving events - will do 
> max 20 fetches
> >128352  25368 16384     doing sync fetch for messages, 20 still to do
> >128353  25368 16384     try to get request from qmaster, id 1
> >128354  25368 16384     Checking 120 events (44358-44477) 
> while waiting for #44358
> >128355  25368 16384     check complete, 120 events in list
> >128356  25368 16384     got 120 events till 44477
> >128357  25368 16384     doing async fetch for messages, 19 
> still to do
> >128358  25368 16384     try to get request from qmaster, id 1
> >128359  25368 16384     reresolve port timeout in 300
> >128360  25368 16384     returning cached port value: 536
> >128361  25368 16384     Sent ack for all events lower or equal 44477
> >128362  25368 16384     ec_get - received 120 events
> >128363  25368 16384     44358. EVENT MOD EXECHOST sub04n166
> >128364  25368 16384     44359. EVENT MOD EXECHOST sub04n90
> >128365  25368 16384     44360. EVENT JOB 21503.1 task 
> 2.sub04n90 USAGE
> >128366  25368 16384     44361. EVENT JOB 21503.1 task 
> 1.sub04n90 USAGE
> >128367  25368 16384     44362. EVENT MOD EXECHOST sub04n168
> >128368  25368 16384     44363. EVENT MOD EXECHOST sub04n112
> >128369  25368 16384     44364. EVENT MOD EXECHOST sub04n08
> >128370  25368 16384     44365. EVENT MOD EXECHOST sub04n75
> >128371  25368 16384     44366. EVENT JOB 21040.1 task 
> 6.sub04n75 USAGE
> >128372  25368 16384     44367. EVENT JOB 21040.1 task 
> 5.sub04n75 USAGE
> >128373  25368 16384     44368. EVENT MOD USER udo
> >128374  25368 16384     44369. EVENT MOD USER iber
> >128375  25368 16384     44370. EVENT MOD USER dieguez
> >128376  25368 16384     44371. EVENT MOD USER karenjoh
> >128377  25368 16384     44372. EVENT MOD USER lorenzo
> >128378  25368 16384     44373. EVENT MOD USER parcolle
> >128379  25368 16384     44374. EVENT MOD USER cfennie
> >128380  25368 16384     44375. EVENT MOD USER civelli
> >128381  25368 16384     44376. EVENT MOD EXECHOST sub04n14
> >128382  25368 16384     44377. EVENT MOD EXECHOST sub04n150
> >128383  25368 16384     44378. EVENT MOD EXECHOST sub04n169
> >128384  25368 16384     44379. EVENT MOD EXECHOST sub04n165
> >128385  25368 16384     44380. EVENT MOD EXECHOST sub04n136
> >128386  25368 16384     44381. EVENT MOD EXECHOST sub04n81
> >128387  25368 16384     44382. EVENT JOB 21507.1 task 
> 6.sub04n81 USAGE
> >128388  25368 16384     44383. EVENT JOB 21507.1 task 
> 5.sub04n81 USAGE
> >128389  25368 16384     44384. EVENT MOD EXECHOST sub04n176
> >128390  25368 16384     44385. EVENT MOD EXECHOST sub04n161
> >128391  25368 16384     44386. EVENT MOD EXECHOST sub04n124
> >128392  25368 16384     44387. EVENT MOD EXECHOST sub04n01
> >128393  25368 16384     44388. EVENT MOD EXECHOST sub04n158
> >128394  25368 16384     44389. EVENT MOD EXECHOST sub04n159
> >128395  25368 16384     44390. EVENT MOD EXECHOST sub04n134
> >128396  25368 16384     44391. EVENT MOD EXECHOST sub04n143
> >128397  25368 16384     44392. EVENT MOD EXECHOST sub04n121
> >128398  25368 16384     44393. EVENT MOD EXECHOST sub04n15
> >128399  25368 16384     44394. EVENT MOD EXECHOST sub04n13
> >128400  25368 16384     44395. EVENT MOD EXECHOST sub04n118
> >128401  25368 16384     44396. EVENT MOD EXECHOST sub04n64
> >128402  25368 16384     44397. EVENT JOB 21542.1 task 
> 2.sub04n64 USAGE
> >128403  25368 16384     44398. EVENT JOB 21542.1 task 
> 1.sub04n64 USAGE
> >128404  25368 16384     44399. EVENT MOD EXECHOST sub04n151
> >128405  25368 16384     44400. EVENT MOD EXECHOST sub04n154
> >128406  25368 16384     44401. EVENT MOD EXECHOST sub04n149
> >128407  25368 16384     44402. EVENT MOD EXECHOST sub04n16
> >128408  25368 16384     44403. EVENT MOD EXECHOST sub04n155
> >128409  25368 16384     44404. EVENT MOD EXECHOST sub04n152
> >128410  25368 16384     44405. EVENT MOD EXECHOST sub04n163
> >128411  25368 16384     44406. EVENT MOD EXECHOST sub04n86
> >128412  25368 16384     44407. EVENT JOB 21423.1 task 
> 2.sub04n86 USAGE
> >128413  25368 16384     44408. EVENT JOB 21423.1 task 
> 1.sub04n86 USAGE
> >128414  25368 16384     44409. EVENT MOD EXECHOST sub04n43
> >128415  25368 16384     44410. EVENT MOD EXECHOST sub04n204
> >128416  25368 16384     44411. EVENT MOD EXECHOST rupc01.rutgers.edu
> >128417  25368 16384     44412. EVENT MOD EXECHOST sub04n125
> >128418  25368 16384     44413. EVENT MOD EXECHOST sub04n03
> >128419  25368 16384     44414. EVENT JOB 21076.1 USAGE
> >128420  25368 16384     44415. EVENT MOD EXECHOST sub04n44
> >128421  25368 16384     44416. EVENT MOD EXECHOST sub04n32
> >128422  25368 16384     44417. EVENT MOD EXECHOST sub04n21
> >128423  25368 16384     44418. EVENT MOD EXECHOST sub04n22
> >128424  25368 16384     44419. EVENT MOD EXECHOST sub04n35
> >128425  25368 16384     44420. EVENT MOD EXECHOST sub04n201
> >128426  25368 16384     44421. EVENT MOD EXECHOST sub04n146
> >128427  25368 16384     44422. EVENT MOD EXECHOST sub04n111
> >128428  25368 16384     44423. EVENT MOD EXECHOST sub04n177
> >128429  25368 16384     44424. EVENT MOD EXECHOST sub04n89
> >128430  25368 16384     44425. EVENT JOB 21530.1 task 
> 2.sub04n89 USAGE
> >128431  25368 16384     44426. EVENT JOB 21530.1 task 
> 1.sub04n89 USAGE
> >128432  25368 16384     44427. EVENT JOB 21530.1 USAGE
> >128433  25368 16384     44428. EVENT MOD EXECHOST sub04n205
> >128434  25368 16384     44429. EVENT JOB 21440.1 USAGE
> >128435  25368 16384     44430. EVENT MOD EXECHOST sub04n208
> >128436  25368 16384     44431. EVENT JOB 21528.1 USAGE
> >128437  25368 16384     44432. EVENT MOD EXECHOST sub04n104
> >128438  25368 16384     44433. EVENT MOD EXECHOST sub04n24
> >128439  25368 16384     44434. EVENT JOB 21210.1 USAGE
> >128440  25368 16384     44435. EVENT MOD EXECHOST sub04n18
> >128441  25368 16384     44436. EVENT MOD EXECHOST sub04n31
> >128442  25368 16384     44437. EVENT JOB 20937.1 USAGE
> >128443  25368 16384     44438. EVENT MOD EXECHOST sub04n202
> >128444  25368 16384     44439. EVENT JOB 21443.1 USAGE
> >128445  25368 16384     44440. EVENT MOD EXECHOST sub04n171
> >128446  25368 16384     44441. EVENT MOD EXECHOST sub04n37
> >128447  25368 16384     44442. EVENT MOD EXECHOST sub04n36
> >128448  25368 16384     44443. EVENT MOD EXECHOST sub04n40
> >128449  25368 16384     44444. EVENT MOD EXECHOST sub04n12
> >128450  25368 16384     44445. EVENT MOD EXECHOST sub04n172
> >128451  25368 16384     44446. EVENT MOD EXECHOST sub04n79
> >128452  25368 16384     44447. EVENT JOB 21040.1 task 
> 6.sub04n79 USAGE
> >128453  25368 16384     44448. EVENT JOB 21040.1 task 
> 5.sub04n79 USAGE
> >128454  25368 16384     44449. EVENT JOB 21040.1 USAGE
> >128455  25368 16384     44450. EVENT MOD EXECHOST sub04n61
> >128456  25368 16384     44451. EVENT JOB 21040.1 task 
> 6.sub04n61 USAGE
> >128457  25368 16384     44452. EVENT JOB 21040.1 task 
> 5.sub04n61 USAGE
> >128458  25368 16384     44453. EVENT MOD EXECHOST sub04n170
> >128459  25368 16384     44454. EVENT MOD EXECHOST sub04n41
> >128460  25368 16384     44455. EVENT JOB 20938.1 USAGE
> >128461  25368 16384     44456. EVENT MOD EXECHOST sub04n153
> >128462  25368 16384     44457. EVENT MOD EXECHOST sub04n39
> >128463  25368 16384     44458. EVENT MOD EXECHOST sub04n83
> >128464  25368 16384     44459. EVENT MOD EXECHOST sub04n82
> >128465  25368 16384     44460. EVENT MOD EXECHOST sub04n174
> >128466  25368 16384     44461. EVENT MOD EXECHOST sub04n173
> >128467  25368 16384     44462. EVENT MOD EXECHOST sub04n85
> >128468  25368 16384     44463. EVENT JOB 21423.1 task 
> 2.sub04n85 USAGE
> >128469  25368 16384     44464. EVENT JOB 21423.1 task 
> 1.sub04n85 USAGE
> >128470  25368 16384     44465. EVENT MOD EXECHOST sub04n68
> >128471  25368 16384     44466. EVENT JOB 21474.1 task 
> 14.sub04n68 USAGE
> >128472  25368 16384     44467. EVENT JOB 21474.1 task 
> 13.sub04n68 USAGE
> >128473  25368 16384     44468. EVENT MOD EXECHOST beowulf.rutgers.edu
> >128474  25368 16384     44469. EVENT MOD EXECHOST sub04n91
> >128475  25368 16384     44470. EVENT JOB 21423.1 task 
> 2.sub04n91 USAGE
> >128476  25368 16384     44471. EVENT JOB 21423.1 task 
> 1.sub04n91 USAGE
> >128477  25368 16384     44472. EVENT JOB 21423.1 USAGE
> >128478  25368 16384     44473. EVENT MOD EXECHOST sub04n29
> >128479  25368 16384     44474. EVENT MOD EXECHOST sub04n69
> >128480  25368 16384     44475. EVENT JOB 21474.1 task 
> 14.sub04n69 USAGE
> >128481  25368 16384     44476. EVENT JOB 21474.1 task 
> 13.sub04n69 USAGE
> >128482  25368 16384     44477. EVENT MOD EXECHOST sub04n175
> >Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, 
> CKPT:0 US:15 PR:4 S:nd:12/lf:7 
> >128483  25368 16384     
> ================[SCHEDULING-EPOCH]==================
> >128484  25368 16384     JOB 20937.1 start_time = 1116447112 
> running_time 338099 decay_time = 450
> >128485  25368 16384     JOB 20938.1 start_time = 1116374344 
> running_time 410867 decay_time = 450
> >128486  25368 16384     JOB 21040.1 start_time = 1116443073 
> running_time 342138 decay_time = 450
> >128487  25368 16384     JOB 21076.1 start_time = 1116451351 
> running_time 333860 decay_time = 450
> >128488  25368 16384     JOB 21210.1 start_time = 1116514970 
> running_time 270241 decay_time = 450
> >128489  25368 16384     JOB 21213.1 start_time = 1116515250 
> running_time 269961 decay_time = 450
> >128490  25368 16384     JOB 21338.1 start_time = 1116543252 
> running_time 241959 decay_time = 450
> >128491  25368 16384     JOB 21423.1 start_time = 1116629274 
> running_time 155937 decay_time = 450
> >128492  25368 16384     JOB 21424.1 start_time = 1116631365 
> running_time 153846 decay_time = 450
> >128493  25368 16384     JOB 21440.1 start_time = 1116632934 
> running_time 152277 decay_time = 450
> >128494  25368 16384     JOB 21441.1 start_time = 1116632994 
> running_time 152217 decay_time = 450
> >128495  25368 16384     JOB 21443.1 start_time = 1116633602 
> running_time 151609 decay_time = 450
> >128496  25368 16384     JOB 21474.1 start_time = 1116655118 
> running_time 130093 decay_time = 450
> >128497  25368 16384     JOB 21503.1 start_time = 1116707395 
> running_time 77816 decay_time = 450
> >128498  25368 16384     JOB 21507.1 start_time = 1116714061 
> running_time 71150 decay_time = 450
> >128499  25368 16384     JOB 21528.1 start_time = 1116707641 
> running_time 77570 decay_time = 450
> >128500  25368 16384     JOB 21530.1 start_time = 1116714453 
> running_time 70758 decay_time = 450
> >128501  25368 16384     JOB 21537.1 start_time = 1116724845 
> running_time 60366 decay_time = 450
> >128502  25368 16384     JOB 21542.1 start_time = 1116782511 
> running_time 2700 decay_time = 450
> >128503  25368 16384     verified threshold of 169 queues
> >128504  25368 16384     queue myrinet@sub04n61 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128505  25368 16384     queue myrinet@sub04n62 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128506  25368 16384     queue myrinet@sub04n65 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128507  25368 16384     queue myrinet@sub04n66 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128508  25368 16384     queue myrinet@sub04n67 tagged to be 
> overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128509  25368 16384     queue myrinet@sub04n68 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128510  25368 16384     queue myrinet@sub04n69 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128511  25368 16384     queue myrinet@sub04n70 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128512  25368 16384     queue myrinet@sub04n71 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128513  25368 16384     queue myrinet@sub04n72 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128514  25368 16384     queue myrinet@sub04n75 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128515  25368 16384     queue myrinet@sub04n77 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128516  25368 16384     queue myrinet@sub04n78 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128517  25368 16384     queue myrinet@sub04n79 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128518  25368 16384     queue myrinet@sub04n81 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128519  25368 16384     queue myrinet@sub04n84 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128520  25368 16384     queue myrinet@sub04n85 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128521  25368 16384     queue myrinet@sub04n86 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128522  25368 16384     queue myrinet@sub04n87 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128523  25368 16384     queue myrinet@sub04n88 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128524  25368 16384     queue myrinet@sub04n89 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128525  25368 16384     queue myrinet@sub04n90 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128526  25368 16384     queue myrinet@sub04n91 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128527  25368 16384     queue myrinet@sub04n63 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128528  25368 16384     queue myrinet@sub04n64 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128529  25368 16384     queue myrinet@sub04n73 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128530  25368 16384     queue myrinet@sub04n74 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128531  25368 16384     queue opteronp@sub04n202 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128532  25368 16384     queue opteronp@sub04n205 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128533  25368 16384     queue opteronp@sub04n206 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128534  25368 16384     queue opteronp@sub04n208 tagged to 
> be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128535  25368 16384     queue parallel@sub04n121 tagged to 
> be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128536  25368 16384     queue parallel@sub04n139 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128537  25368 16384     queue parallel@sub04n140 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128538  25368 16384     queue parallel@sub04n141 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128539  25368 16384     queue parallel@sub04n142 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128540  25368 16384     queue parallel@sub04n143 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128541  25368 16384     queue parallel@sub04n144 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128542  25368 16384     queue parallel@sub04n146 tagged to 
> be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128543  25368 16384     queue parallel@sub04n02 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128544  25368 16384     queue parallel@sub04n03 tagged to be 
> overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128545  25368 16384     queue parallel@sub04n04 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128546  25368 16384     queue parallel@sub04n05 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128547  25368 16384     queue parallel@sub04n06 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128548  25368 16384     queue parallel@sub04n07 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128549  25368 16384     queue parallel@sub04n08 tagged to be 
> overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128550  25368 16384     queue parallel@sub04n09 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128551  25368 16384     queue parallel@sub04n10 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128552  25368 16384     queue parallel@sub04n11 tagged to be 
> overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128553  25368 16384     verified threshold of 169 queues
> >128554  25368 16384     STARTING PASS 1 WITH 0 PENDING JOBS
> >128555  25368 16384     Not enrolled ja_tasks: 0
> >128556  25368 16384     Enrolled ja_tasks: 1
> >128557  25368 16384     Not enrolled ja_tasks: 0
> >128558  25368 16384     Enrolled ja_tasks: 1
> >128559  25368 16384     Not enrolled ja_tasks: 0
> >128560  25368 16384     Enrolled ja_tasks: 1
> >128561  25368 16384     Not enrolled ja_tasks: 0
> >128562  25368 16384     Enrolled ja_tasks: 1
> >128563  25368 16384     Not enrolled ja_tasks: 0
> >128564  25368 16384     Enrolled ja_tasks: 1
> >128565  25368 16384     Not enrolled ja_tasks: 0
> >128566  25368 16384     Enrolled ja_tasks: 1
> >128567  25368 16384     Not enrolled ja_tasks: 0
> >128568  25368 16384     Enrolled ja_tasks: 1
> >128569  25368 16384     Not enrolled ja_tasks: 0
> >128570  25368 16384     Enrolled ja_tasks: 1
> >128571  25368 16384     Not enrolled ja_tasks: 0
> >128572  25368 16384     Enrolled ja_tasks: 1
> >128573  25368 16384     Not enrolled ja_tasks: 0
> >128574  25368 16384     Enrolled ja_tasks: 1
> >128575  25368 16384     Not enrolled ja_tasks: 0
> >128576  25368 16384     Enrolled ja_tasks: 1
> >128577  25368 16384     Not enrolled ja_tasks: 0
> >128578  25368 16384     Enrolled ja_tasks: 1
> >128579  25368 16384     Not enrolled ja_tasks: 0
> >128580  25368 16384     Enrolled ja_tasks: 1
> >128581  25368 16384     Not enrolled ja_tasks: 0
> >128582  25368 16384     Enrolled ja_tasks: 1
> >128583  25368 16384     Not enrolled ja_tasks: 0
> >128584  25368 16384     Enrolled ja_tasks: 1
> >128585  25368 16384     Not enrolled ja_tasks: 0
> >128586  25368 16384     Enrolled ja_tasks: 1
> >128587  25368 16384     Not enrolled ja_tasks: 0
> >128588  25368 16384     Enrolled ja_tasks: 1
> >128589  25368 16384     Not enrolled ja_tasks: 0
> >128590  25368 16384     Enrolled ja_tasks: 1
> >128591  25368 16384     Not enrolled ja_tasks: 0
> >128592  25368 16384     Enrolled ja_tasks: 1
> >128593  25368 16384     STARTING PASS 2 WITH 0 PENDING JOBS
> >128594  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128595  25368 16384        slots: 1.000000 * 1000.000000 * 6 
>    ---> 6000.000000
> >128596  25368 16384     slot request assumed for static 
> urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
> >128597  25368 16384        slots: 1.000000 * 1000.000000 * 
> 20    ---> 20000.000000
> >128598  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128599  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128600  25368 16384        slots: 1.000000 * 1000.000000 * 6 
>    ---> 6000.000000
> >128601  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128602  25368 16384        slots: 1.000000 * 1000.000000 * 1 
>    ---> 1000.000000
> >128603  25368 16384     slot request assumed for static 
> urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
> >128604  25368 16384        slots: 1.000000 * 1000.000000 * 2 
>    ---> 2000.000000
> >128605  25368 16384        slots: 1.000000 * 1000.000000 * 8 
>    ---> 8000.000000
> >128606  25368 16384     ASU min = 1000.00000000000, ASU max 
> = 20000.00000000000
> >128607  25368 16384     
> >128608  25368 16384     no DDJU: do_usage: 1 finished_jobs 0
> >128609  25368 16384     
> >128610  25368 16384     =====================[Pass 
> 0]======================
> >128611  25368 16384     =====================[Pass 
> 1]======================
> >128612  25368 16384     =====================[Pass 
> 2]======================
> >128613  25368 16384     
> >128614  25368 16384     no DDJU: do_usage: 0 finished_jobs 0
> >128615  25368 16384     
> >128616  25368 16384     =====================[Pass 
> 0]======================
> >128617  25368 16384     =====================[Pass 
> 1]======================
> >128618  25368 16384     =====================[Pass 
> 2]======================
> >128619  25368 16384     Normalizing tickets using 
> 0.000000/18.333333 as min_tix/max_tix
> >128620  25368 16384        got 19 running jobs
> >128621  25368 16384        added 19 ticket orders for running jobs
> >128622  25368 16384        added 1 orders for updating usage of user
> >128623  25368 16384        added 0 orders for updating usage 
> of project
> >128624  25368 16384        added 0 orders for updating share tree
> >128625  25368 16384        added 1 orders for scheduler configuration
> >128626  25368 16384     SENDING 22 ORDERS TO QMASTER
> >128627  25368 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >128628  25368 16384     reresolve port timeout in 300
> >128629  25368 16384     returning cached port value: 536
> >--------------STOP-SCHEDULER-RUN-------------
> >128630  25368 16384     ec_get retrieving events - will do 
> max 20 fetches
> >128631  25368 16384     doing sync fetch for messages, 20 still to do
> >128632  25368 16384     try to get request from qmaster, id 1
> >128633  25368 16384     Checking 84 events (44478-44561) 
> while waiting for #44478
> >128634  25368 16384     check complete, 84 events in list
> >128635  25368 16384     got 84 events till 44561
> >128636  25368 16384     doing async fetch for messages, 19 
> still to do
> >128637  25368 16384     try to get request from qmaster, id 1
> >128638  25368 16384     reresolve port timeout in 280
> >128639  25368 16384     returning cached port value: 536
> >128640  25368 16384     Getting host by name - Linux
> >128641  25368 16384     1 names in h_addr_list
> >128642  25368 16384     0 names in h_aliases
> >128643  25368 16384     Sent ack for all events lower or equal 44561
> >128644  25368 16384     ec_get - received 84 events
> >128645  25368 16384     44478. EVENT MOD EXECHOST sub04n167
> >128646  25368 16384     44479. EVENT MOD EXECHOST sub04n63
> >128647  25368 16384     44480. EVENT JOB 21542.1 task 
> 2.sub04n63 USAGE
> >128648  25368 16384     44481. EVENT JOB 21542.1 task 
> 1.sub04n63 USAGE
> >128649  25368 16384     44482. EVENT JOB 21542.1 USAGE
> >128650  25368 16384     44483. EVENT MOD EXECHOST sub04n71
> >128651  25368 16384     44484. EVENT JOB 21537.1 task 
> 2.sub04n71 USAGE
> >128652  25368 16384     44485. EVENT JOB 21537.1 task 
> 1.sub04n71 USAGE
> >128653  25368 16384     44486. EVENT MOD EXECHOST sub04n65
> >128654  25368 16384     44487. EVENT JOB 21424.1 task 
> 2.sub04n65 USAGE
> >128655  25368 16384     44488. EVENT JOB 21424.1 task 
> 1.sub04n65 USAGE
> >128656  25368 16384     44489. EVENT MOD USER udo
> >128657  25368 16384     44490. EVENT MOD USER iber
> >128658  25368 16384     44491. EVENT MOD USER dieguez
> >128659  25368 16384     44492. EVENT MOD USER karenjoh
> >128660  25368 16384     44493. EVENT MOD USER lorenzo
> >128661  25368 16384     44494. EVENT MOD USER parcolle
> >128662  25368 16384     44495. EVENT MOD USER cfennie
> >128663  25368 16384     44496. EVENT MOD USER civelli
> >128664  25368 16384     44497. EVENT MOD EXECHOST sub04n25
> >128665  25368 16384     44498. EVENT MOD EXECHOST sub04n144
> >128666  25368 16384     44499. EVENT MOD EXECHOST sub04n206
> >128667  25368 16384     44500. EVENT JOB 21441.1 USAGE
> >128668  25368 16384     44501. EVENT MOD EXECHOST sub04n87
> >128669  25368 16384     44502. EVENT JOB 21503.1 task 
> 2.sub04n87 USAGE
> >128670  25368 16384     44503. EVENT JOB 21503.1 task 
> 1.sub04n87 USAGE
> >128671  25368 16384     44504. EVENT MOD EXECHOST sub04n70
> >128672  25368 16384     44505. EVENT JOB 21503.1 task 2.sub04n70 USAGE
> >128673  25368 16384     44506. EVENT JOB 21503.1 task 1.sub04n70 USAGE
> >128674  25368 16384     44507. EVENT JOB 21503.1 USAGE
> >128675  25368 16384     44508. EVENT MOD EXECHOST sub04n19
> >128676  25368 16384     44509. EVENT JOB 21338.1 USAGE
> >128677  25368 16384     44510. EVENT MOD EXECHOST sub04n84
> >128678  25368 16384     44511. EVENT JOB 21424.1 task 2.sub04n84 USAGE
> >128679  25368 16384     44512. EVENT JOB 21424.1 task 1.sub04n84 USAGE
> >128680  25368 16384     44513. EVENT MOD EXECHOST sub04n178
> >128681  25368 16384     44514. EVENT MOD EXECHOST sub04n67
> >128682  25368 16384     44515. EVENT JOB 21474.1 task 14.sub04n67 USAGE
> >128683  25368 16384     44516. EVENT JOB 21474.1 task 13.sub04n67 USAGE
> >128684  25368 16384     44517. EVENT JOB 21474.1 USAGE
> >128685  25368 16384     44518. EVENT MOD EXECHOST sub04n27
> >128686  25368 16384     44519. EVENT MOD EXECHOST sub04n34
> >128687  25368 16384     44520. EVENT MOD EXECHOST sub04n72
> >128688  25368 16384     44521. EVENT JOB 21537.1 task 2.sub04n72 USAGE
> >128689  25368 16384     44522. EVENT JOB 21537.1 task 1.sub04n72 USAGE
> >128690  25368 16384     44523. EVENT MOD EXECHOST sub04n78
> >128691  25368 16384     44524. EVENT JOB 21507.1 task 6.sub04n78 USAGE
> >128692  25368 16384     44525. EVENT JOB 21507.1 task 5.sub04n78 USAGE
> >128693  25368 16384     44526. EVENT JOB 21507.1 USAGE
> >128694  25368 16384     44527. EVENT MOD EXECHOST sub04n17
> >128695  25368 16384     44528. EVENT MOD EXECHOST sub04n07
> >128696  25368 16384     44529. EVENT MOD EXECHOST sub04n128
> >128697  25368 16384     44530. EVENT MOD EXECHOST sub04n42
> >128698  25368 16384     44531. EVENT MOD EXECHOST sub04n62
> >128699  25368 16384     44532. EVENT JOB 21424.1 task 2.sub04n62 USAGE
> >128700  25368 16384     44533. EVENT JOB 21424.1 task 1.sub04n62 USAGE
> >128701  25368 16384     44534. EVENT JOB 21424.1 USAGE
> >128702  25368 16384     44535. EVENT MOD EXECHOST sub04n10
> >128703  25368 16384     44536. EVENT MOD EXECHOST sub04n77
> >128704  25368 16384     44537. EVENT JOB 21537.1 task 2.sub04n77 USAGE
> >128705  25368 16384     44538. EVENT JOB 21537.1 task 1.sub04n77 USAGE
> >128706  25368 16384     44539. EVENT MOD EXECHOST sub04n11
> >128707  25368 16384     44540. EVENT MOD EXECHOST sub04n02
> >128708  25368 16384     44541. EVENT MOD EXECHOST sub04n120
> >128709  25368 16384     44542. EVENT MOD EXECHOST sub04n115
> >128710  25368 16384     44543. EVENT MOD EXECHOST sub04n101
> >128711  25368 16384     44544. EVENT MOD EXECHOST sub04n66
> >128712  25368 16384     44545. EVENT JOB 21537.1 task 2.sub04n66 USAGE
> >128713  25368 16384     44546. EVENT JOB 21537.1 task 1.sub04n66 USAGE
> >128714  25368 16384     44547. EVENT JOB 21537.1 USAGE
> >128715  25368 16384     44548. EVENT MOD EXECHOST sub04n142
> >128716  25368 16384     44549. EVENT MOD EXECHOST sub04n123
> >128717  25368 16384     44550. EVENT MOD EXECHOST sub04n33
> >128718  25368 16384     44551. EVENT MOD EXECHOST sub04n126
> >128719  25368 16384     44552. EVENT MOD EXECHOST sub04n140
> >128720  25368 16384     44553. EVENT MOD EXECHOST sub04n119
> >128721  25368 16384     44554. EVENT MOD EXECHOST sub04n102
> >128722  25368 16384     44555. EVENT MOD EXECHOST sub04n110
> >128723  25368 16384     44556. EVENT MOD EXECHOST sub04n117
> >128724  25368 16384     44557. EVENT MOD EXECHOST sub04n06
> >128725  25368 16384     44558. EVENT MOD EXECHOST sub04n73
> >128726  25368 16384     44559. EVENT JOB 21542.1 task 2.sub04n73 USAGE
> >128727  25368 16384     44560. EVENT JOB 21542.1 task 1.sub04n73 USAGE
> >128728  25368 16384     44561. EVENT MOD EXECHOST sub04n122
> >Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7 
> >128729  25368 16384     ================[SCHEDULING-EPOCH]==================
> >128730  25368 16384     JOB 20937.1 start_time = 1116447112 running_time 338119 decay_time = 450
> >128731  25368 16384     JOB 20938.1 start_time = 1116374344 running_time 410887 decay_time = 450
> >128732  25368 16384     JOB 21040.1 start_time = 1116443073 running_time 342158 decay_time = 450
> >128733  25368 16384     JOB 21076.1 start_time = 1116451351 running_time 333880 decay_time = 450
> >128734  25368 16384     JOB 21210.1 start_time = 1116514970 running_time 270261 decay_time = 450
> >128735  25368 16384     JOB 21213.1 start_time = 1116515250 running_time 269981 decay_time = 450
> >128736  25368 16384     JOB 21338.1 start_time = 1116543252 running_time 241979 decay_time = 450
> >128737  25368 16384     JOB 21423.1 start_time = 1116629274 running_time 155957 decay_time = 450
> >128738  25368 16384     JOB 21424.1 start_time = 1116631365 running_time 153866 decay_time = 450
> >128739  25368 16384     JOB 21440.1 start_time = 1116632934 running_time 152297 decay_time = 450
> >128740  25368 16384     JOB 21441.1 start_time = 1116632994 running_time 152237 decay_time = 450
> >128741  25368 16384     JOB 21443.1 start_time = 1116633602 running_time 151629 decay_time = 450
> >128742  25368 16384     JOB 21474.1 start_time = 1116655118 running_time 130113 decay_time = 450
> >128743  25368 16384     JOB 21503.1 start_time = 1116707395 running_time 77836 decay_time = 450
> >128744  25368 16384     JOB 21507.1 start_time = 1116714061 running_time 71170 decay_time = 450
> >128745  25368 16384     JOB 21528.1 start_time = 1116707641 running_time 77590 decay_time = 450
> >128746  25368 16384     JOB 21530.1 start_time = 1116714453 running_time 70778 decay_time = 450
> >128747  25368 16384     JOB 21537.1 start_time = 1116724845 running_time 60386 decay_time = 450
> >128748  25368 16384     JOB 21542.1 start_time = 1116782511 running_time 2720 decay_time = 450
> >128749  25368 16384     verified threshold of 169 queues
> >128750  25368 16384     queue myrinet@sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128751  25368 16384     queue myrinet@sub04n62 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128752  25368 16384     queue myrinet@sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128753  25368 16384     queue myrinet@sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128754  25368 16384     queue myrinet@sub04n67 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128755  25368 16384     queue myrinet@sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128756  25368 16384     queue myrinet@sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128757  25368 16384     queue myrinet@sub04n70 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128758  25368 16384     queue myrinet@sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128759  25368 16384     queue myrinet@sub04n72 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128760  25368 16384     queue myrinet@sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128761  25368 16384     queue myrinet@sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128762  25368 16384     queue myrinet@sub04n78 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128763  25368 16384     queue myrinet@sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128764  25368 16384     queue myrinet@sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128765  25368 16384     queue myrinet@sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128766  25368 16384     queue myrinet@sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128767  25368 16384     queue myrinet@sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128768  25368 16384     queue myrinet@sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128769  25368 16384     queue myrinet@sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128770  25368 16384     queue myrinet@sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128771  25368 16384     queue myrinet@sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128772  25368 16384     queue myrinet@sub04n91 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128773  25368 16384     queue myrinet@sub04n63 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128774  25368 16384     queue myrinet@sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128775  25368 16384     queue myrinet@sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128776  25368 16384     queue myrinet@sub04n74 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128777  25368 16384     queue opteronp@sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128778  25368 16384     queue opteronp@sub04n205 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128779  25368 16384     queue opteronp@sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128780  25368 16384     queue opteronp@sub04n208 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128781  25368 16384     queue parallel@sub04n121 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128782  25368 16384     queue parallel@sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128783  25368 16384     queue parallel@sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128784  25368 16384     queue parallel@sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128785  25368 16384     queue parallel@sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128786  25368 16384     queue parallel@sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128787  25368 16384     queue parallel@sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128788  25368 16384     queue parallel@sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128789  25368 16384     queue parallel@sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128790  25368 16384     queue parallel@sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128791  25368 16384     queue parallel@sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128792  25368 16384     queue parallel@sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128793  25368 16384     queue parallel@sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128794  25368 16384     queue parallel@sub04n07 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128795  25368 16384     queue parallel@sub04n08 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128796  25368 16384     queue parallel@sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128797  25368 16384     queue parallel@sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128798  25368 16384     queue parallel@sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128799  25368 16384     verified threshold of 169 queues
> >128800  25368 16384     STARTING PASS 1 WITH 0 PENDING JOBS
> >128801  25368 16384     Not enrolled ja_tasks: 0
> >128802  25368 16384     Enrolled ja_tasks: 1
> >128803  25368 16384     Not enrolled ja_tasks: 0
> >128804  25368 16384     Enrolled ja_tasks: 1
> >128805  25368 16384     Not enrolled ja_tasks: 0
> >128806  25368 16384     Enrolled ja_tasks: 1
> >128807  25368 16384     Not enrolled ja_tasks: 0
> >128808  25368 16384     Enrolled ja_tasks: 1
> >128809  25368 16384     Not enrolled ja_tasks: 0
> >128810  25368 16384     Enrolled ja_tasks: 1
> >128811  25368 16384     Not enrolled ja_tasks: 0
> >128812  25368 16384     Enrolled ja_tasks: 1
> >128813  25368 16384     Not enrolled ja_tasks: 0
> >128814  25368 16384     Enrolled ja_tasks: 1
> >128815  25368 16384     Not enrolled ja_tasks: 0
> >128816  25368 16384     Enrolled ja_tasks: 1
> >128817  25368 16384     Not enrolled ja_tasks: 0
> >128818  25368 16384     Enrolled ja_tasks: 1
> >128819  25368 16384     Not enrolled ja_tasks: 0
> >128820  25368 16384     Enrolled ja_tasks: 1
> >128821  25368 16384     Not enrolled ja_tasks: 0
> >128822  25368 16384     Enrolled ja_tasks: 1
> >128823  25368 16384     Not enrolled ja_tasks: 0
> >128824  25368 16384     Enrolled ja_tasks: 1
> >128825  25368 16384     Not enrolled ja_tasks: 0
> >128826  25368 16384     Enrolled ja_tasks: 1
> >128827  25368 16384     Not enrolled ja_tasks: 0
> >128828  25368 16384     Enrolled ja_tasks: 1
> >128829  25368 16384     Not enrolled ja_tasks: 0
> >128830  25368 16384     Enrolled ja_tasks: 1
> >128831  25368 16384     Not enrolled ja_tasks: 0
> >128832  25368 16384     Enrolled ja_tasks: 1
> >128833  25368 16384     Not enrolled ja_tasks: 0
> >128834  25368 16384     Enrolled ja_tasks: 1
> >128835  25368 16384     Not enrolled ja_tasks: 0
> >128836  25368 16384     Enrolled ja_tasks: 1
> >128837  25368 16384     Not enrolled ja_tasks: 0
> >128838  25368 16384     Enrolled ja_tasks: 1
> >128839  25368 16384     STARTING PASS 2 WITH 0 PENDING JOBS
> >128840  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >128841  25368 16384        slots: 1.000000 * 1000.000000 * 6    ---> 6000.000000
> >128842  25368 16384     slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
> >128843  25368 16384        slots: 1.000000 * 1000.000000 * 20    ---> 20000.000000
> >128844  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >128845  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >128846  25368 16384        slots: 1.000000 * 1000.000000 * 6    ---> 6000.000000
> >128847  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >128848  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >128849  25368 16384     slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
> >128850  25368 16384        slots: 1.000000 * 1000.000000 * 2    ---> 2000.000000
> >128851  25368 16384        slots: 1.000000 * 1000.000000 * 8    ---> 8000.000000
> >128852  25368 16384     ASU min = 1000.00000000000, ASU max = 20000.00000000000
> >128853  25368 16384     
> >128854  25368 16384     no DDJU: do_usage: 1 finished_jobs 0
> >128855  25368 16384     
> >128856  25368 16384     =====================[Pass 0]======================
> >128857  25368 16384     =====================[Pass 1]======================
> >128858  25368 16384     =====================[Pass 2]======================
> >128859  25368 16384     
> >128860  25368 16384     no DDJU: do_usage: 0 finished_jobs 0
> >128861  25368 16384     
> >128862  25368 16384     =====================[Pass 0]======================
> >128863  25368 16384     =====================[Pass 1]======================
> >128864  25368 16384     =====================[Pass 2]======================
> >128865  25368 16384     Normalizing tickets using 0.000000/18.333333 as min_tix/max_tix
> >128866  25368 16384        got 19 running jobs
> >128867  25368 16384        added 19 ticket orders for running jobs
> >128868  25368 16384        added 1 orders for updating usage of user
> >128869  25368 16384        added 0 orders for updating usage of project
> >128870  25368 16384        added 0 orders for updating share tree
> >128871  25368 16384        added 1 orders for scheduler configuration
> >128872  25368 16384     SENDING 22 ORDERS TO QMASTER
> >128873  25368 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >128874  25368 16384     reresolve port timeout in 280
> >128875  25368 16384     returning cached port value: 536
> >--------------STOP-SCHEDULER-RUN-------------
> >128876  25368 16384     ec_get retrieving events - will do max 20 fetches
> >128877  25368 16384     doing sync fetch for messages, 20 still to do
> >128878  25368 16384     try to get request from qmaster, id 1
> >128879  25368 16384     Checking 55 events (44562-44616) while waiting for #44562
> >128880  25368 16384     check complete, 55 events in list
> >128881  25368 16384     got 55 events till 44616
> >128882  25368 16384     doing async fetch for messages, 19 still to do
> >128883  25368 16384     try to get request from qmaster, id 1
> >128884  25368 16384     reresolve port timeout in 260
> >128885  25368 16384     returning cached port value: 536
> >128886  25368 16384     Sent ack for all events lower or equal 44616
> >128887  25368 16384     ec_get - received 55 events
> >128888  25368 16384     44562. EVENT MOD EXECHOST sub04n147
> >128889  25368 16384     44563. EVENT MOD USER udo
> >128890  25368 16384     44564. EVENT MOD USER iber
> >128891  25368 16384     44565. EVENT MOD USER dieguez
> >128892  25368 16384     44566. EVENT MOD USER karenjoh
> >128893  25368 16384     44567. EVENT MOD USER lorenzo
> >128894  25368 16384     44568. EVENT MOD USER parcolle
> >128895  25368 16384     44569. EVENT MOD USER cfennie
> >128896  25368 16384     44570. EVENT MOD USER civelli
> >128897  25368 16384     44571. EVENT MOD EXECHOST sub04n135
> >128898  25368 16384     44572. EVENT MOD EXECHOST sub04n141
> >128899  25368 16384     44573. EVENT MOD EXECHOST sub04n127
> >128900  25368 16384     44574. EVENT MOD EXECHOST sub04n145
> >128901  25368 16384     44575. EVENT MOD EXECHOST sub04n133
> >128902  25368 16384     44576. EVENT MOD EXECHOST sub04n148
> >128903  25368 16384     44577. EVENT MOD EXECHOST sub04n74
> >128904  25368 16384     44578. EVENT JOB 21542.1 task 2.sub04n74 USAGE
> >128905  25368 16384     44579. EVENT JOB 21542.1 task 1.sub04n74 USAGE
> >128906  25368 16384     44580. EVENT MOD EXECHOST rupc03.rutgers.edu
> >128907  25368 16384     44581. EVENT MOD EXECHOST sub04n139
> >128908  25368 16384     44582. EVENT MOD EXECHOST rupc02.rutgers.edu
> >128909  25368 16384     44583. EVENT MOD EXECHOST sub04n80
> >128910  25368 16384     44584. EVENT MOD EXECHOST sub04n207
> >128911  25368 16384     44585. EVENT MOD EXECHOST sub04n180
> >128912  25368 16384     44586. EVENT MOD EXECHOST sub04n23
> >128913  25368 16384     44587. EVENT MOD EXECHOST sub04n30
> >128914  25368 16384     44588. EVENT MOD EXECHOST sub04n203
> >128915  25368 16384     44589. EVENT MOD EXECHOST sub04n109
> >128916  25368 16384     44590. EVENT MOD EXECHOST rupc04.rutgers.edu
> >128917  25368 16384     44591. EVENT MOD EXECHOST sub04n114
> >128918  25368 16384     44592. EVENT MOD EXECHOST sub04n106
> >128919  25368 16384     44593. EVENT MOD EXECHOST sub04n88
> >128920  25368 16384     44594. EVENT JOB 21507.1 task 6.sub04n88 USAGE
> >128921  25368 16384     44595. EVENT JOB 21507.1 task 5.sub04n88 USAGE
> >128922  25368 16384     44596. EVENT MOD EXECHOST sub04n157
> >128923  25368 16384     44597. EVENT MOD EXECHOST sub04n20
> >128924  25368 16384     44598. EVENT MOD EXECHOST sub04n156
> >128925  25368 16384     44599. EVENT MOD EXECHOST sub04n26
> >128926  25368 16384     44600. EVENT JOB 21213.1 USAGE
> >128927  25368 16384     44601. EVENT MOD EXECHOST sub04n09
> >128928  25368 16384     44602. EVENT MOD EXECHOST sub04n05
> >128929  25368 16384     44603. EVENT MOD EXECHOST sub04n103
> >128930  25368 16384     44604. EVENT MOD EXECHOST sub04n164
> >128931  25368 16384     44605. EVENT MOD EXECHOST sub04n105
> >128932  25368 16384     44606. EVENT MOD EXECHOST sub04n113
> >128933  25368 16384     44607. EVENT MOD EXECHOST sub04n28
> >128934  25368 16384     44608. EVENT MOD EXECHOST sub04n76
> >128935  25368 16384     44609. EVENT MOD EXECHOST sub04n162
> >128936  25368 16384     44610. EVENT MOD EXECHOST sub04n108
> >128937  25368 16384     44611. EVENT MOD EXECHOST sub04n38
> >128938  25368 16384     44612. EVENT MOD EXECHOST sub04n116
> >128939  25368 16384     44613. EVENT MOD EXECHOST sub04n179
> >128940  25368 16384     44614. EVENT MOD EXECHOST sub04n04
> >128941  25368 16384     44615. EVENT MOD EXECHOST sub04n160
> >128942  25368 16384     44616. EVENT MOD EXECHOST sub04n107
> >Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7 
> >128943  25368 16384     ================[SCHEDULING-EPOCH]==================
> >128944  25368 16384     JOB 20937.1 start_time = 1116447112 running_time 338139 decay_time = 450
> >128945  25368 16384     JOB 20938.1 start_time = 1116374344 running_time 410907 decay_time = 450
> >128946  25368 16384     JOB 21040.1 start_time = 1116443073 running_time 342178 decay_time = 450
> >128947  25368 16384     JOB 21076.1 start_time = 1116451351 running_time 333900 decay_time = 450
> >128948  25368 16384     JOB 21210.1 start_time = 1116514970 running_time 270281 decay_time = 450
> >128949  25368 16384     JOB 21213.1 start_time = 1116515250 running_time 270001 decay_time = 450
> >128950  25368 16384     JOB 21338.1 start_time = 1116543252 running_time 241999 decay_time = 450
> >128951  25368 16384     JOB 21423.1 start_time = 1116629274 running_time 155977 decay_time = 450
> >128952  25368 16384     JOB 21424.1 start_time = 1116631365 running_time 153886 decay_time = 450
> >128953  25368 16384     JOB 21440.1 start_time = 1116632934 running_time 152317 decay_time = 450
> >128954  25368 16384     JOB 21441.1 start_time = 1116632994 running_time 152257 decay_time = 450
> >128955  25368 16384     JOB 21443.1 start_time = 1116633602 running_time 151649 decay_time = 450
> >128956  25368 16384     JOB 21474.1 start_time = 1116655118 running_time 130133 decay_time = 450
> >128957  25368 16384     JOB 21503.1 start_time = 1116707395 running_time 77856 decay_time = 450
> >128958  25368 16384     JOB 21507.1 start_time = 1116714061 running_time 71190 decay_time = 450
> >128959  25368 16384     JOB 21528.1 start_time = 1116707641 running_time 77610 decay_time = 450
> >128960  25368 16384     JOB 21530.1 start_time = 1116714453 running_time 70798 decay_time = 450
> >128961  25368 16384     JOB 21537.1 start_time = 1116724845 running_time 60406 decay_time = 450
> >128962  25368 16384     JOB 21542.1 start_time = 1116782511 running_time 2740 decay_time = 450
> >128963  25368 16384     verified threshold of 169 queues
> >128964  25368 16384     queue myrinet@sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128965  25368 16384     queue myrinet@sub04n62 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128966  25368 16384     queue myrinet@sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128967  25368 16384     queue myrinet@sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128968  25368 16384     queue myrinet@sub04n67 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128969  25368 16384     queue myrinet@sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128970  25368 16384     queue myrinet@sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128971  25368 16384     queue myrinet@sub04n70 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >128972  25368 16384     queue myrinet@sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128973  25368 16384     queue myrinet@sub04n72 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128974  25368 16384     queue myrinet@sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128975  25368 16384     queue myrinet@sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128976  25368 16384     queue myrinet@sub04n78 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128977  25368 16384     queue myrinet@sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128978  25368 16384     queue myrinet@sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128979  25368 16384     queue myrinet@sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128980  25368 16384     queue myrinet@sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128981  25368 16384     queue myrinet@sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128982  25368 16384     queue myrinet@sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128983  25368 16384     queue myrinet@sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128984  25368 16384     queue myrinet@sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128985  25368 16384     queue myrinet@sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128986  25368 16384     queue myrinet@sub04n91 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128987  25368 16384     queue myrinet@sub04n63 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128988  25368 16384     queue myrinet@sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128989  25368 16384     queue myrinet@sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128990  25368 16384     queue myrinet@sub04n74 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128991  25368 16384     queue opteronp@sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128992  25368 16384     queue opteronp@sub04n205 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128993  25368 16384     queue opteronp@sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128994  25368 16384     queue opteronp@sub04n208 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
> >
> >128995  25368 16384     queue parallel@sub04n121 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >128996  25368 16384     queue parallel@sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128997  25368 16384     queue parallel@sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128998  25368 16384     queue parallel@sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >128999  25368 16384     queue parallel@sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129000  25368 16384     queue parallel@sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129001  25368 16384     queue parallel@sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129002  25368 16384     queue parallel@sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129003  25368 16384     queue parallel@sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129004  25368 16384     queue parallel@sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
> >
> >129005  25368 16384     queue parallel@sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129006  25368 16384     queue parallel@sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129007  25368 16384     queue parallel@sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129008  25368 16384     queue parallel@sub04n07 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >129009  25368 16384     queue parallel@sub04n08 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
> >
> >129010  25368 16384     queue parallel@sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129011  25368 16384     queue parallel@sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129012  25368 16384     queue parallel@sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
> >
> >129013  25368 16384     verified threshold of 169 queues
> >129014  25368 16384     STARTING PASS 1 WITH 0 PENDING JOBS
> >129015  25368 16384     Not enrolled ja_tasks: 0
> >129016  25368 16384     Enrolled ja_tasks: 1
> >129017  25368 16384     Not enrolled ja_tasks: 0
> >129018  25368 16384     Enrolled ja_tasks: 1
> >129019  25368 16384     Not enrolled ja_tasks: 0
> >129020  25368 16384     Enrolled ja_tasks: 1
> >129021  25368 16384     Not enrolled ja_tasks: 0
> >129022  25368 16384     Enrolled ja_tasks: 1
> >129023  25368 16384     Not enrolled ja_tasks: 0
> >129024  25368 16384     Enrolled ja_tasks: 1
> >129025  25368 16384     Not enrolled ja_tasks: 0
> >129026  25368 16384     Enrolled ja_tasks: 1
> >129027  25368 16384     Not enrolled ja_tasks: 0
> >129028  25368 16384     Enrolled ja_tasks: 1
> >129029  25368 16384     Not enrolled ja_tasks: 0
> >129030  25368 16384     Enrolled ja_tasks: 1
> >129031  25368 16384     Not enrolled ja_tasks: 0
> >129032  25368 16384     Enrolled ja_tasks: 1
> >129033  25368 16384     Not enrolled ja_tasks: 0
> >129034  25368 16384     Enrolled ja_tasks: 1
> >129035  25368 16384     Not enrolled ja_tasks: 0
> >129036  25368 16384     Enrolled ja_tasks: 1
> >129037  25368 16384     Not enrolled ja_tasks: 0
> >129038  25368 16384     Enrolled ja_tasks: 1
> >129039  25368 16384     Not enrolled ja_tasks: 0
> >129040  25368 16384     Enrolled ja_tasks: 1
> >129041  25368 16384     Not enrolled ja_tasks: 0
> >129042  25368 16384     Enrolled ja_tasks: 1
> >129043  25368 16384     Not enrolled ja_tasks: 0
> >129044  25368 16384     Enrolled ja_tasks: 1
> >129045  25368 16384     Not enrolled ja_tasks: 0
> >129046  25368 16384     Enrolled ja_tasks: 1
> >129047  25368 16384     Not enrolled ja_tasks: 0
> >129048  25368 16384     Enrolled ja_tasks: 1
> >129049  25368 16384     Not enrolled ja_tasks: 0
> >129050  25368 16384     Enrolled ja_tasks: 1
> >129051  25368 16384     Not enrolled ja_tasks: 0
> >129052  25368 16384     Enrolled ja_tasks: 1
> >129053  25368 16384     STARTING PASS 2 WITH 0 PENDING JOBS
> >129054  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >129055  25368 16384        slots: 1.000000 * 1000.000000 * 6    ---> 6000.000000
> >129056  25368 16384     slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
> >129057  25368 16384        slots: 1.000000 * 1000.000000 * 20    ---> 20000.000000
> >129058  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >129059  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >129060  25368 16384        slots: 1.000000 * 1000.000000 * 6    ---> 6000.000000
> >129061  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >129062  25368 16384        slots: 1.000000 * 1000.000000 * 1    ---> 1000.000000
> >129063  25368 16384     slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
> >129064  25368 16384        slots: 1.000000 * 1000.000000 * 2    ---> 2000.000000
> >129065  25368 16384        slots: 1.000000 * 1000.000000 * 8    ---> 8000.000000
> >129066  25368 16384     ASU min = 1000.00000000000, ASU max = 20000.00000000000
> >129067  25368 16384     
> >129068  25368 16384     no DDJU: do_usage: 1 finished_jobs 0
> >129069  25368 16384     
> >129070  25368 16384     =====================[Pass 0]======================
> >129071  25368 16384     =====================[Pass 1]======================
> >129072  25368 16384     =====================[Pass 2]======================
> >129073  25368 16384     
> >129074  25368 16384     no DDJU: do_usage: 0 finished_jobs 0
> >129075  25368 16384     
> >129076  25368 16384     =====================[Pass 0]======================
> >129077  25368 16384     =====================[Pass 1]======================
> >129078  25368 16384     =====================[Pass 2]======================
> >129079  25368 16384     Normalizing tickets using 
> 0.000000/18.333333 as min_tix/max_tix
> >129080  25368 16384        got 19 running jobs
> >129081  25368 16384        added 19 ticket orders for running jobs
> >129082  25368 16384        added 1 orders for updating usage of user
> >129083  25368 16384        added 0 orders for updating usage of project
> >129084  25368 16384        added 0 orders for updating share tree
> >129085  25368 16384        added 1 orders for scheduler configuration
> >129086  25368 16384     SENDING 22 ORDERS TO QMASTER
> >129087  25368 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >129088  25368 16384     reresolve port timeout in 260
> >129089  25368 16384     returning cached port value: 536
> >--------------STOP-SCHEDULER-RUN-------------
> >129090  25368 16384     ec_get retrieving events - will do max 20 fetches
> >129091  25368 16384     doing sync fetch for messages, 20 still to do
> >129092  25368 16384     try to get request from qmaster, id 1
> >129093  25368 16384     Checking 154 events (44617-44770) while waiting for #44617
> >129094  25368 16384     check complete, 154 events in list
> >129095  25368 16384     got 154 events till 44770
> >129096  25368 16384     doing async fetch for messages, 19 still to do
> >129097  25368 16384     try to get request from qmaster, id 1
> >129098  25368 16384     reresolve port timeout in 240
> >129099  25368 16384     returning cached port value: 536
> >129100  25368 16384     Sent ack for all events lower or equal 44770
> >129101  25368 16384     ec_get - received 154 events
> >129102  25368 16384     44617. EVENT MOD EXECHOST sub04n08
> >129103  25368 16384     44618. EVENT MOD EXECHOST sub04n166
> >129104  25368 16384     44619. EVENT MOD EXECHOST sub04n168
> >129105  25368 16384     44620. EVENT MOD EXECHOST sub04n112
> >129106  25368 16384     44621. EVENT MOD EXECHOST sub04n90
> >129107  25368 16384     44622. EVENT JOB 21503.1 task 2.sub04n90 USAGE
> >129108  25368 16384     44623. EVENT JOB 21503.1 task 1.sub04n90 USAGE
> >129109  25368 16384     44624. EVENT MOD USER udo
> >129110  25368 16384     44625. EVENT MOD USER iber
> >129111  25368 16384     44626. EVENT MOD USER dieguez
> >129112  25368 16384     44627. EVENT MOD USER karenjoh
> >129113  25368 16384     44628. EVENT MOD USER lorenzo
> >129114  25368 16384     44629. EVENT MOD USER parcolle
> >129115  25368 16384     44630. EVENT MOD USER cfennie
> >129116  25368 16384     44631. EVENT MOD USER civelli
> >129117  25368 16384     44632. EVENT MOD EXECHOST sub04n14
> >129118  25368 16384     44633. EVENT MOD EXECHOST sub04n75
> >129119  25368 16384     44634. EVENT JOB 21040.1 task 6.sub04n75 USAGE
> >129120  25368 16384     44635. EVENT JOB 21040.1 task 5.sub04n75 USAGE
> >129121  25368 16384     44636. EVENT MOD EXECHOST sub04n150
> >129122  25368 16384     44637. EVENT MOD EXECHOST sub04n169
> >129123  25368 16384     44638. EVENT MOD EXECHOST sub04n165
> >129124  25368 16384     44639. EVENT MOD EXECHOST sub04n136
> >129125  25368 16384     44640. EVENT MOD EXECHOST sub04n176
> >129126  25368 16384     44641. EVENT MOD EXECHOST sub04n81
> >129127  25368 16384     44642. EVENT JOB 21507.1 task 6.sub04n81 USAGE
> >129128  25368 16384     44643. EVENT JOB 21507.1 task 5.sub04n81 USAGE
> >129129  25368 16384     44644. EVENT JOB 21507.1 task past_usage USAGE
> >129130  25368 16384     44645. EVENT DEL PETASK 21507.1 task 6.sub04n88
> >129131  25368 16384     44646. EVENT JOB 21507.1 task past_usage USAGE
> >129132  25368 16384     44647. EVENT DEL PETASK 21507.1 task 6.sub04n78
> >129133  25368 16384     44648. EVENT JOB 21507.1 task past_usage USAGE
> >129134  25368 16384     44649. EVENT DEL PETASK 21507.1 task 6.sub04n81
> >129135  25368 16384     44650. EVENT JOB 21507.1 task past_usage USAGE
> >129136  25368 16384     44651. EVENT DEL PETASK 21507.1 task 5.sub04n81
> >129137  25368 16384     44652. EVENT JOB 21507.1 task past_usage USAGE
> >129138  25368 16384     44653. EVENT DEL PETASK 21507.1 task 5.sub04n88
> >129139  25368 16384     44654. EVENT JOB 21507.1 task past_usage USAGE
> >129140  25368 16384     44655. EVENT DEL PETASK 21507.1 task 5.sub04n78
> >129141  25368 16384     44656. EVENT MOD EXECHOST sub04n161
> >129142  25368 16384     44657. EVENT MOD EXECHOST sub04n124
> >129143  25368 16384     44658. EVENT ADD PETASK 21507.1 task 7.sub04n88
> >129144  25368 16384     44659. EVENT ADD PETASK 21507.1 task 7.sub04n78
> >129145  25368 16384     44660. EVENT MOD EXECHOST sub04n158
> >129146  25368 16384     44661. EVENT MOD EXECHOST sub04n01
> >129147  25368 16384     44662. EVENT MOD EXECHOST sub04n159
> >129148  25368 16384     44663. EVENT ADD PETASK 21507.1 task 7.sub04n81
> >129149  25368 16384     44664. EVENT MOD EXECHOST sub04n134
> >129150  25368 16384     44665. EVENT ADD PETASK 21507.1 task 8.sub04n88
> >129151  25368 16384     44666. EVENT ADD PETASK 21507.1 task 8.sub04n78
> >129152  25368 16384     44667. EVENT ADD PETASK 21507.1 task 8.sub04n81
> >129153  25368 16384     44668. EVENT MOD EXECHOST sub04n121
> >129154  25368 16384     44669. EVENT MOD EXECHOST sub04n143
> >129155  25368 16384     44670. EVENT MOD EXECHOST sub04n15
> >129156  25368 16384     44671. EVENT MOD EXECHOST sub04n13
> >129157  25368 16384     44672. EVENT MOD EXECHOST sub04n64
> >129158  25368 16384     44673. EVENT JOB 21542.1 task 2.sub04n64 USAGE
> >129159  25368 16384     44674. EVENT JOB 21542.1 task 1.sub04n64 USAGE
> >129160  25368 16384     44675. EVENT MOD EXECHOST sub04n118
> >129161  25368 16384     44676. EVENT MOD EXECHOST sub04n151
> >129162  25368 16384     44677. EVENT MOD EXECHOST sub04n154
> >129163  25368 16384     44678. EVENT MOD EXECHOST sub04n149
> >129164  25368 16384     44679. EVENT MOD EXECHOST sub04n16
> >129165  25368 16384     44680. EVENT MOD EXECHOST sub04n155
> >129166  25368 16384     44681. EVENT MOD EXECHOST sub04n152
> >129167  25368 16384     44682. EVENT MOD EXECHOST sub04n163
> >129168  25368 16384     44683. EVENT MOD EXECHOST sub04n43
> >129169  25368 16384     44684. EVENT MOD EXECHOST sub04n86
> >129170  25368 16384     44685. EVENT JOB 21423.1 task 2.sub04n86 USAGE
> >129171  25368 16384     44686. EVENT JOB 21423.1 task 1.sub04n86 USAGE
> >129172  25368 16384     44687. EVENT MOD EXECHOST sub04n03
> >129173  25368 16384     44688. EVENT JOB 21076.1 USAGE
> >129174  25368 16384     44689. EVENT MOD EXECHOST sub04n204
> >129175  25368 16384     44690. EVENT MOD EXECHOST rupc01.rutgers.edu
> >129176  25368 16384     44691. EVENT MOD EXECHOST sub04n125
> >129177  25368 16384     44692. EVENT MOD EXECHOST sub04n44
> >129178  25368 16384     44693. EVENT MOD EXECHOST sub04n32
> >129179  25368 16384     44694. EVENT MOD EXECHOST sub04n21
> >129180  25368 16384     44695. EVENT MOD EXECHOST sub04n22
> >129181  25368 16384     44696. EVENT MOD EXECHOST sub04n35
> >129182  25368 16384     44697. EVENT MOD EXECHOST sub04n201
> >129183  25368 16384     44698. EVENT MOD EXECHOST sub04n205
> >129184  25368 16384     44699. EVENT JOB 21440.1 USAGE
> >129185  25368 16384     44700. EVENT MOD EXECHOST sub04n111
> >129186  25368 16384     44701. EVENT MOD EXECHOST sub04n89
> >129187  25368 16384     44702. EVENT JOB 21530.1 task 2.sub04n89 USAGE
> >129188  25368 16384     44703. EVENT JOB 21530.1 task 1.sub04n89 USAGE
> >129189  25368 16384     44704. EVENT JOB 21530.1 USAGE
> >129190  25368 16384     44705. EVENT MOD EXECHOST sub04n177
> >129191  25368 16384     44706. EVENT MOD EXECHOST sub04n146
> >129192  25368 16384     44707. EVENT ADD PETASK 21507.1 task 9.sub04n88
> >129193  25368 16384     44708. EVENT JOB 21507.1 task past_usage USAGE
> >129194  25368 16384     44709. EVENT DEL PETASK 21507.1 task 7.sub04n88
> >Segmentation fault
> >You have new mail in /var/spool/mail/root
> >rupc-cs04b:/opt/SGE/util #
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >/opt/SGE/default/spool/qmaster
> >
> >Sun May 22 14:25:16 EDT 2005
> >05/22/2005 00:20:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >05/22/2005 00:32:40|qmaster|rupc-cs04b|W|job 21538.1 failed on host sub04n63 in recognising job because: execd doesn't know this job
> >05/22/2005 00:32:49|qmaster|rupc-cs04b|E|execd sub04n63 reports running state for job (21538.1/master) in queue "myrinet at sub04n63" while job is in state 65536
> >05/22/2005 00:33:49|qmaster|rupc-cs04b|E|execd at sub04n63 reports running job (21538.1/master) in queue "myrinet at sub04n63" that was not supposed to be there - killing
> >05/22/2005 02:10:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "udo"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "iber"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "dieguez"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "zayak"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "karenjoh"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "lorenzo"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "parcolle"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "cfennie"
> >05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "civelli"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "udo"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "iber"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "dieguez"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "zayak"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "karenjoh"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "lorenzo"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "parcolle"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "cfennie"
> >05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "civelli"
> >05/22/2005 03:02:47|qmaster|rupc-cs04b|E|tightly integrated parallel task 21539.1 task 3.sub04n83 failed - killing job
> >05/22/2005 03:10:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update    <-- YOU SEE THESE 2 lines : THE SCHEDULER DIED EVEN WITHOUT ANY EVENTS, JUST by itself !!!
> >05/22/2005 07:30:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
> >05/22/2005 11:11:39|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update    <-- BEFORE THE LAST CRASH
> >05/22/2005 14:07:53|qmaster|rupc-cs04b|E|tightly integrated parallel task 21507.1 task 10.sub04n88 failed - killing job    <-- THIS IS WHAT TRIGGERED the CRASH
> >05/22/2005 14:09:14|qmaster|rupc-cs04b|W|job 21507.1 failed on host sub04n78 assumedly after job because: job 21507.1 died through signal TERM (15)
> >05/22/2005 14:10:00|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update    <-- SCHEDULER START AFTER THE CRASH
> >
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >SCHEDULER  messages  BELOW
> >
> >05/22/2005 00:20:01|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 02:10:01|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 02:30:26|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
> >05/22/2005 02:31:10|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 02:34:06|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
> >05/22/2005 02:40:00|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 03:10:01|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 07:30:01|schedd|rupc-cs04b|I|starting up 6.0u3
> >05/22/2005 11:11:39|schedd|rupc-cs04b|I|starting up 6.0u3    <--- before the last crash (I started debug mode)
> >05/22/2005 14:10:00|schedd|rupc-cs04b|I|starting up 6.0u3    <--- AFTER the last crash
> >
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >  
> >
> >-------------------------------------------------------------------------
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >For additional commands, e-mail: users-help at gridengine.sunsource.net
> >  
> >
> 
> 
> 






More information about the gridengine-users mailing list