[GE users] The Scheduler dies" COMPLETE information
Stephan Grell - Sun Germany - SSG - Software Engineer
stephan.grell at sun.com
Mon May 23 09:53:26 BST 2005
The workaround is to remove all PE jobs, start the scheduler, and then
resubmit the PE jobs.
u4 will be available soon; I do not know the date, sorry. However, you can
compile it yourself by checking out the u4 tag.
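
To get a feel for how often the scheduler re-registers or runs into the
acknowledge timeout, the qmaster messages file quoted further down in this
thread can be tallied. The following is only a minimal sketch: the field
layout (timestamp|daemon|host|level|text) is inferred from the quoted lines
rather than taken from the Grid Engine documentation, and the names
parse_line and tally are made up for illustration.

```python
from collections import Counter

# Sample lines copied from the qmaster messages file quoted in this thread.
SAMPLE = """\
05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
"""


def parse_line(line):
    """Split one line into (timestamp, daemon, host, level, text).

    Assumes the pipe-separated layout seen in the quoted log; returns None
    for anything that does not have exactly five fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(parts)


def tally(lines):
    """Count scheduler re-registrations and acknowledge timeouts."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip wrapped or otherwise malformed lines
        text = record[4]
        if "reregistered" in text:
            counts["reregistered"] += 1
        elif text.startswith("acknowledge timeout"):
            counts["ack_timeout"] += 1
    return counts


counts = tally(SAMPLE.splitlines())
print(dict(counts))
```

To run it against a real file, pass the open file object to tally() instead
of the sample lines; the path of the messages file depends on the local
installation.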
Stephan
Viktor Oudovenko wrote:
>Hi, Stephan,
>
>Thank you for the answer.
>When will u4 be released, and where can I read about issue 1416?
>
>Meanwhile I have tried many things, but nothing has helped so far.
>As for why my scheduler reregisters so often: after it dies I restart it
>manually, simply by issuing the command:
>$SGE_ROOT/bin/lx..../sge_schedd
>Then the message about reregistering appears.
>
>Thank you very much for your help.
>v
>
>>-----Original Message-----
>>From: Stephan Grell - Sun Germany - SSG - Software Engineer
>>[mailto:stephan.grell at sun.com]
>>Sent: Monday, May 23, 2005 3:45
>>To: users at gridengine.sunsource.net
>>Subject: Re: [GE users] The Scheduler dies" COMPLETE information
>>
>>
>>Hi Viktor,
>>
>>You are hitting issue 1416, which is fixed in u4.
>>However, the important question is why your scheduler is
>>reregistering so often.
>>
>>Stephan
>>
>>Viktor Oudovenko wrote:
>>
>>>Hi, Stephan and anybody who can help!
>>>
>>>Could you have a look at the attachment to see what is going on with my
>>>scheduler? As you advised, I ran the scheduler daemon in dl 1 mode and
>>>waited until it crashed. And it did. It dies even without any events: you
>>>will find two lines from the messages file where the scheduler died
>>>without any reason. But the last crash happened because one of the
>>>myrinet jobs finished.
>>>Could you give any hint as to what it could be and what could be done?
>>>I am running Linux SuSE 8.2 on the server and 9.0 and 9.2 on the slaves.
>>>I also have a few Opterons (8 machines). I am happy to provide any
>>>further information if necessary.
>>>Please help.
>>>
>>>With kind regards,
>>>Viktor
>>>P.S. In the attachment I included not only the last iteration but also a
>>>couple of successful ones. In debug mode the scheduler updates its
>>>information roughly every 5-10 seconds.
>>>
>>>>-----Original Message-----
>>>>From: Stephan Grell - Sun Germany - SSG - Software Engineer
>>>>[mailto:stephan.grell at sun.com]
>>>>Sent: Friday, May 20, 2005 3:05
>>>>To: users at gridengine.sunsource.net
>>>>Subject: Re: [GE users] Scheduler dies like a hell
>>>>
>>>>
>>>>Hi,
>>>>
>>>>I am not sure that a corrupted file is the problem. The
>>>>qmaster does some validation during startup. Could you
>>>>run the scheduler in debug mode and post the output just
>>>>before it dies?
>>>>
>>>>You can set the debug mode with:
>>>>
>>>>source $SGE_ROOT/<CELL>/common/settings.csh
>>>>source $SGE_ROOT/util/dl.csh
>>>>dl 1
>>>>
>>>>bin/<arch>/sge_schedd
>>>>
>>>>Or, do you have a stack trace of the scheduler?
>>>>
>>>>Which version are you running on which arch?
>>>>
>>>>Thanks,
>>>>Stephan
>>>>
>>>>Viktor Oudovenko wrote:
>>>>
>>>>>Ron,
>>>>>
>>>>>Can I try to cut part of the accounting file? I mean EDIT it MANUALLY,
>>>>>even though it is written not to do so? Best regards,
>>>>>v
>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
>>>>>>Sent: Thursday, May 19, 2005 22:02
>>>>>>To: users at gridengine.sunsource.net
>>>>>>Subject: RE: [GE users] Scheduler dies like a hell
>>>>>>
>>>>>>
>>>>>>It is not easy to find out which file gets corrupted :(
>>>>>>
>>>>>>One thing you can try is to move spooled job files (in
>>>>>>default/spool/qmaster/jobs) to a backup directory.
>>>>>>Also, you can use qconf to dump the configuration for
>>>>>>the queues/users/hosts, and see if the values "make
>>>>>>sense".
>>>>>>
>>>>>>Of course the best way to fix this is to restore from backup!
>>>>>>
>>>>>>-Ron
>>>>>>
>>>>>>
>>>>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
>>>>>>
>>>>>>>Hi, Ron,
>>>>>>>
>>>>>>>I am using classic spooling.
>>>>>>>Which file should I check for corruption? Can I edit
>>>>>>>it manually?
>>>>>>>Thank you very much in advance.
>>>>>>>v
>>>>>>>
>>>>>>>>-----Original Message-----
>>>>>>>>From: Ron Chen [mailto:ron_chen_123 at yahoo.com]
>>>>>>>>Sent: Thursday, May 19, 2005 20:38
>>>>>>>>To: users at gridengine.sunsource.net
>>>>>>>>Subject: RE: [GE users] Scheduler dies like a hell
>>>>>>>>
>>>>>>>>
>>>>>>>>Are you using classic spooling or Berkeley DB
>>>>>>>>spooling?
>>>>>>>>
>>>>>>>>With classic spooling, when the machine crashes, the
>>>>>>>>files may get corrupted. And when qmaster reads in the
>>>>>>>>corrupted files, it may also corrupt the qmaster's
>>>>>>>>data structures.
>>>>>>>>
>>>>>>>>IIRC, Berkeley DB handles recovery itself, but I have
>>>>>>>>never played with it myself :)
>>>>>>>>
>>>>>>>>-Ron
>>>>>>>>
>>>>>>>>--- Viktor Oudovenko <udo at physics.rutgers.edu> wrote:
>>>>>>>>
>>>>>>>>>Hi, Mac,
>>>>>>>>>Thank you very much for your advice!
>>>>>>>>>I'll try. I think one of the running or finished jobs
>>>>>>>>>left a bad record somewhere (such as the jobs directory).
>>>>>>>>>Best regards,
>>>>>>>>>v
>>>>>>>>>
>>>>>>>>>>-----Original Message-----
>>>>>>>>>>From: McCalla, Mac [mailto:macmccalla at hess.com]
>>>>>>>>>>Sent: Thursday, May 19, 2005 15:12
>>>>>>>>>>To: users at gridengine.sunsource.net
>>>>>>>>>>Subject: RE: [GE users] Scheduler dies like a hell
>>>>>>>>>>
>>>>>>>>>>Hi,
>>>>>>>>>>
>>>>>>>>>>Some things to look at: are there any messages in
>>>>>>>>>>$SGE_ROOT/......../qmaster/schedd/messages? To get more
>>>>>>>>>>information about what the scheduler is doing while it is running,
>>>>>>>>>>see the scheduler params profile and monitor; you can set
>>>>>>>>>>them equal to 1 to turn on some scheduler diagnostics (see man
>>>>>>>>>>sched_conf). To extend the timeout value for the scheduler, you can
>>>>>>>>>>set qmaster_params SCHEDULER_TIMEOUT to some value greater than
>>>>>>>>>>600 (seconds).
>>>>>>>>>>You can also use the system command strace to trace scheduler
>>>>>>>>>>activity while it is running, to perhaps get a better idea of what
>>>>>>>>>>it is spending its time doing.
>>>>>>>>>>
>>>>>>>>>>Hope this helps,
>>>>>>>>>>
>>>>>>>>>>mac mccalla
>>>>>>>>>>
>>>>>>>>>>-----Original Message-----
>>>>>>>>>>From: Viktor Oudovenko [mailto:udo at physics.rutgers.edu]
>>>>>>>>>>Sent: Thursday, May 19, 2005 12:00 PM
>>>>>>>>>>To: users at gridengine.sunsource.net
>>>>>>>>>>Subject: [GE users] Scheduler dies like a hell
>>>>>>>>>>
>>>>>>>>>>Hi, everybody,
>>>>>>>>>>
>>>>>>>>>>I am asking for your help and ideas on what could be done to restore
>>>>>>>>>>normal operation of the scheduler. First, what happened: a few
>>>>>>>>>>times during the last week our main server died, and I needed to
>>>>>>>>>>reboot it and even replace it. Jobs that used automount
>>>>>>>>>>kept running. But since yesterday or the day before yesterday the
>>>>>>>>>>scheduler daemon keeps dying. I tried to restart sge_qmaster, but it
>>>>>>>>>>did not help. Now, whenever the daemon dies, I start it manually by
>>>>>>>>>>simply typing:
>>>>>>>>>>
>>>>>>>>>>/opt/SGE/bin/lx24-x86/sge_schedd
>>>>>>>>>>
>>>>>>>>>>but after some time it dies again. Please advise what it could be.
>>>>>>>>>>
>>>>>>>>>>Below please find some lines from the messages file:
>>>>>>>>>>
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n87 to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n88 to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n89 to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n90 to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host sub04n91 to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|E|no execd known on host rupc04.rutgers.edu to send conf notification
>>>>>>>>>>05/19/2005 01:02:37|qmaster|rupc-cs04b|I|starting up 6.0u3
>>>>>>>>>>05/19/2005 01:08:11|qmaster|rupc-cs04b|E|commlib error: got read error (closing connection)
>>>>>>>>>>05/19/2005 01:11:06|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>>>>>>>>05/19/2005 01:24:31|qmaster|rupc-cs04b|W|job 21171.1 failed on host sub04n203 assumedly after job because: job 21171.1 died through signal TERM (15)
>>>>>>>>>>05/19/2005 05:17:19|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
>>>>>>>>>>05/19/2005 09:29:03|qmaster|rupc-cs04b|W|job 21060.1 failed on host sub04n74 assumedly after job because: job 21060.1 died through signal TERM (15)
>>>>>>>>>>05/19/2005 09:30:37|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>>>>>>>>05/19/2005 11:04:21|qmaster|rupc-cs04b|W|job 20222.1 failed on host sub04n29 assumedly after job because: job 20222.1 died through signal KILL (9)
>>>>>>>>>>05/19/2005 11:05:50|qmaster|rupc-cs04b|W|job 21212.1 failed on host sub04n25 assumedly after job because: job 21212.1 died through signal KILL (9)
>>>>>>>>>>05/19/2005 12:04:51|qmaster|rupc-cs04b|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "rupc-cs04b"
>>>>>>=== message truncated ===
>>>>>>
>>>>>>---------------------------------------------------------------------
>>>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
>>>
>>>128133 25368 16384 SENDING 22 ORDERS TO QMASTER
>>>128134 25368 16384 RESETTING BUSY STATE OF EVENT CLIENT
>>>128135 25368 16384 reresolve port timeout in 340
>>>128136 25368 16384 returning cached port value: 536
>>>--------------STOP-SCHEDULER-RUN-------------
>>>128137 25368 16384 ec_get retrieving events - will do max 20 fetches
>>>128138 25368 16384 doing sync fetch for messages, 20 still to do
>>>128139 25368 16384 try to get request from qmaster, id 1
>>>128140 25368 16384 Checking 55 events (44303-44357) while waiting for #44303
>>>128141 25368 16384 check complete, 55 events in list
>>>128142 25368 16384 got 55 events till 44357
>>>128143 25368 16384 doing async fetch for messages, 19 still to do
>>>128144 25368 16384 try to get request from qmaster, id 1
>>>128145 25368 16384 reresolve port timeout in 320
>>>128146 25368 16384 returning cached port value: 536
>>>128147 25368 16384 Sent ack for all events lower or equal 44357
>>>128148 25368 16384 ec_get - received 55 events
>>>128149 25368 16384 44303. EVENT MOD EXECHOST sub04n147
>>>128150 25368 16384 44304. EVENT MOD USER udo
>>>128151 25368 16384 44305. EVENT MOD USER iber
>>>128152 25368 16384 44306. EVENT MOD USER dieguez
>>>128153 25368 16384 44307. EVENT MOD USER karenjoh
>>>128154 25368 16384 44308. EVENT MOD USER lorenzo
>>>128155 25368 16384 44309. EVENT MOD USER parcolle
>>>128156 25368 16384 44310. EVENT MOD USER cfennie
>>>128157 25368 16384 44311. EVENT MOD USER civelli
>>>128158 25368 16384 44312. EVENT MOD EXECHOST sub04n135
>>>128159 25368 16384 44313. EVENT MOD EXECHOST sub04n141
>>>128160 25368 16384 44314. EVENT MOD EXECHOST sub04n127
>>>128161 25368 16384 44315. EVENT MOD EXECHOST sub04n145
>>>128162 25368 16384 44316. EVENT MOD EXECHOST sub04n133
>>>128163 25368 16384 44317. EVENT MOD EXECHOST sub04n148
>>>128164 25368 16384 44318. EVENT MOD EXECHOST sub04n74
>>>128165 25368 16384 44319. EVENT JOB 21542.1 task 2.sub04n74 USAGE
>>>128166 25368 16384 44320. EVENT JOB 21542.1 task 1.sub04n74 USAGE
>>>128167 25368 16384 44321. EVENT MOD EXECHOST rupc03.rutgers.edu
>>>128168 25368 16384 44322. EVENT MOD EXECHOST sub04n139
>>>128169 25368 16384 44323. EVENT MOD EXECHOST rupc02.rutgers.edu
>>>128170 25368 16384 44324. EVENT MOD EXECHOST sub04n80
>>>128171 25368 16384 44325. EVENT MOD EXECHOST sub04n207
>>>128172 25368 16384 44326. EVENT MOD EXECHOST sub04n180
>>>128173 25368 16384 44327. EVENT MOD EXECHOST sub04n23
>>>128174 25368 16384 44328. EVENT MOD EXECHOST sub04n30
>>>128175 25368 16384 44329. EVENT MOD EXECHOST sub04n203
>>>128176 25368 16384 44330. EVENT MOD EXECHOST sub04n109
>>>128177 25368 16384 44331. EVENT MOD EXECHOST rupc04.rutgers.edu
>>>128178 25368 16384 44332. EVENT MOD EXECHOST sub04n114
>>>128179 25368 16384 44333. EVENT MOD EXECHOST sub04n106
>>>128180 25368 16384 44334. EVENT MOD EXECHOST sub04n88
>>>128181 25368 16384 44335. EVENT JOB 21507.1 task 6.sub04n88 USAGE
>>>128182 25368 16384 44336. EVENT JOB 21507.1 task 5.sub04n88 USAGE
>>>128183 25368 16384 44337. EVENT MOD EXECHOST sub04n157
>>>128184 25368 16384 44338. EVENT MOD EXECHOST sub04n20
>>>128185 25368 16384 44339. EVENT MOD EXECHOST sub04n156
>>>128186 25368 16384 44340. EVENT MOD EXECHOST sub04n26
>>>128187 25368 16384 44341. EVENT JOB 21213.1 USAGE
>>>128188 25368 16384 44342. EVENT MOD EXECHOST sub04n05
>>>128189 25368 16384 44343. EVENT MOD EXECHOST sub04n103
>>>128190 25368 16384 44344. EVENT MOD EXECHOST sub04n164
>>>128191 25368 16384 44345. EVENT MOD EXECHOST sub04n09
>>>128192 25368 16384 44346. EVENT MOD EXECHOST sub04n105
>>>128193 25368 16384 44347. EVENT MOD EXECHOST sub04n113
>>>128194 25368 16384 44348. EVENT MOD EXECHOST sub04n28
>>>128195 25368 16384 44349. EVENT MOD EXECHOST sub04n76
>>>128196 25368 16384 44350. EVENT MOD EXECHOST sub04n162
>>>128197 25368 16384 44351. EVENT MOD EXECHOST sub04n108
>>>128198 25368 16384 44352. EVENT MOD EXECHOST sub04n38
>>>128199 25368 16384 44353. EVENT MOD EXECHOST sub04n04
>>>128200 25368 16384 44354. EVENT MOD EXECHOST sub04n116
>>>128201 25368 16384 44355. EVENT MOD EXECHOST sub04n179
>>>128202 25368 16384 44356. EVENT MOD EXECHOST sub04n160
>>>128203 25368 16384 44357. EVENT MOD EXECHOST sub04n107
>>>Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7
>>>128204 25368 16384 ================[SCHEDULING-EPOCH]==================
>>>128205 25368 16384 JOB 20937.1 start_time = 1116447112 running_time 338079 decay_time = 450
>>>128206 25368 16384 JOB 20938.1 start_time = 1116374344 running_time 410847 decay_time = 450
>>>128207 25368 16384 JOB 21040.1 start_time = 1116443073 running_time 342118 decay_time = 450
>>>128208 25368 16384 JOB 21076.1 start_time = 1116451351 running_time 333840 decay_time = 450
>>>128209 25368 16384 JOB 21210.1 start_time = 1116514970 running_time 270221 decay_time = 450
>>>128210 25368 16384 JOB 21213.1 start_time = 1116515250 running_time 269941 decay_time = 450
>>>128211 25368 16384 JOB 21338.1 start_time = 1116543252 running_time 241939 decay_time = 450
>>>128212 25368 16384 JOB 21423.1 start_time = 1116629274 running_time 155917 decay_time = 450
>>>128213 25368 16384 JOB 21424.1 start_time = 1116631365 running_time 153826 decay_time = 450
>>>128214 25368 16384 JOB 21440.1 start_time = 1116632934 running_time 152257 decay_time = 450
>>>128215 25368 16384 JOB 21441.1 start_time = 1116632994 running_time 152197 decay_time = 450
>>>128216 25368 16384 JOB 21443.1 start_time = 1116633602 running_time 151589 decay_time = 450
>>>128217 25368 16384 JOB 21474.1 start_time = 1116655118 running_time 130073 decay_time = 450
>>>128218 25368 16384 JOB 21503.1 start_time = 1116707395 running_time 77796 decay_time = 450
>>>128219 25368 16384 JOB 21507.1 start_time = 1116714061 running_time 71130 decay_time = 450
>>>128220 25368 16384 JOB 21528.1 start_time = 1116707641 running_time 77550 decay_time = 450
>>>128221 25368 16384 JOB 21530.1 start_time = 1116714453 running_time 70738 decay_time = 450
>>>128222 25368 16384 JOB 21537.1 start_time = 1116724845 running_time 60346 decay_time = 450
>>>128223 25368 16384 JOB 21542.1 start_time = 1116782511 running_time 2680 decay_time = 450
>>>128224 25368 16384 verified threshold of 169 queues
>>>128225 25368 16384 queue myrinet@sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128226 25368 16384 queue myrinet@sub04n62 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128227 25368 16384 queue myrinet@sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128228 25368 16384 queue myrinet@sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128229 25368 16384 queue myrinet@sub04n67 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128230 25368 16384 queue myrinet@sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128231 25368 16384 queue myrinet@sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128232 25368 16384 queue myrinet@sub04n70 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128233 25368 16384 queue myrinet@sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128234 25368 16384 queue myrinet@sub04n72 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128235 25368 16384 queue myrinet@sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128236 25368 16384 queue myrinet@sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128237 25368 16384 queue myrinet@sub04n78 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128238 25368 16384 queue myrinet@sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128239 25368 16384 queue myrinet@sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128240 25368 16384 queue myrinet@sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128241 25368 16384 queue myrinet@sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128242 25368 16384 queue myrinet@sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128243 25368 16384 queue myrinet@sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128244 25368 16384 queue myrinet@sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128245 25368 16384 queue myrinet@sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128246 25368 16384 queue myrinet@sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128247 25368 16384 queue myrinet@sub04n91 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128248 25368 16384 queue myrinet@sub04n63 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128249 25368 16384 queue myrinet@sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128250 25368 16384 queue myrinet@sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128251 25368 16384 queue myrinet@sub04n74 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128252 25368 16384 queue opteronp@sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128253 25368 16384 queue opteronp@sub04n205 tagged to be overloaded: load_medium=1.010000 (no load adjustment) >= 1.0
>>>128254 25368 16384 queue opteronp@sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128255 25368 16384 queue opteronp@sub04n208 tagged to be overloaded: load_medium=1.010000 (no load adjustment) >= 1.0
>>>128256 25368 16384 queue parallel@sub04n121 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128257 25368 16384 queue parallel@sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128258 25368 16384 queue parallel@sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128259 25368 16384 queue parallel@sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128260 25368 16384 queue parallel@sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128261 25368 16384 queue parallel@sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128262 25368 16384 queue parallel@sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128263 25368 16384 queue parallel@sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128264 25368 16384 queue parallel@sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128265 25368 16384 queue parallel@sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128266 25368 16384 queue parallel@sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128267 25368 16384 queue parallel@sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128268 25368 16384 queue parallel@sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128269 25368 16384 queue parallel@sub04n07 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128270 25368 16384 queue parallel@sub04n08 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128271 25368 16384 queue parallel@sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128272 25368 16384 queue parallel@sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128273 25368 16384 queue parallel@sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128274 25368 16384 verified threshold of 169 queues
>>>128275 25368 16384 STARTING PASS 1 WITH 0 PENDING JOBS
>>>128276 25368 16384 Not enrolled ja_tasks: 0
>>>128277 25368 16384 Enrolled ja_tasks: 1
>>>128278 25368 16384 Not enrolled ja_tasks: 0
>>>128279 25368 16384 Enrolled ja_tasks: 1
>>>128280 25368 16384 Not enrolled ja_tasks: 0
>>>128281 25368 16384 Enrolled ja_tasks: 1
>>>128282 25368 16384 Not enrolled ja_tasks: 0
>>>128283 25368 16384 Enrolled ja_tasks: 1
>>>128284 25368 16384 Not enrolled ja_tasks: 0
>>>128285 25368 16384 Enrolled ja_tasks: 1
>>>128286 25368 16384 Not enrolled ja_tasks: 0
>>>128287 25368 16384 Enrolled ja_tasks: 1
>>>128288 25368 16384 Not enrolled ja_tasks: 0
>>>128289 25368 16384 Enrolled ja_tasks: 1
>>>128290 25368 16384 Not enrolled ja_tasks: 0
>>>128291 25368 16384 Enrolled ja_tasks: 1
>>>128292 25368 16384 Not enrolled ja_tasks: 0
>>>128293 25368 16384 Enrolled ja_tasks: 1
>>>128294 25368 16384 Not enrolled ja_tasks: 0
>>>128295 25368 16384 Enrolled ja_tasks: 1
>>>128296 25368 16384 Not enrolled ja_tasks: 0
>>>128297 25368 16384 Enrolled ja_tasks: 1
>>>128298 25368 16384 Not enrolled ja_tasks: 0
>>>128299 25368 16384 Enrolled ja_tasks: 1
>>>128300 25368 16384 Not enrolled ja_tasks: 0
>>>128301 25368 16384 Enrolled ja_tasks: 1
>>>128302 25368 16384 Not enrolled ja_tasks: 0
>>>128303 25368 16384 Enrolled ja_tasks: 1
>>>128304 25368 16384 Not enrolled ja_tasks: 0
>>>128305 25368 16384 Enrolled ja_tasks: 1
>>>128306 25368 16384 Not enrolled ja_tasks: 0
>>>128307 25368 16384 Enrolled ja_tasks: 1
>>>128308 25368 16384 Not enrolled ja_tasks: 0
>>>128309 25368 16384 Enrolled ja_tasks: 1
>>>128310 25368 16384 Not enrolled ja_tasks: 0
>>>128311 25368 16384 Enrolled ja_tasks: 1
>>>128312 25368 16384 Not enrolled ja_tasks: 0
>>>128313 25368 16384 Enrolled ja_tasks: 1
>>>128314 25368 16384 STARTING PASS 2 WITH 0 PENDING JOBS
>>>128315 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128316 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128317 25368 16384 slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
>>>128318 25368 16384 slots: 1.000000 * 1000.000000 * 20 ---> 20000.000000
>>>128319 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128320 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128321 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128322 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128323 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128324 25368 16384 slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
>>>128325 25368 16384 slots: 1.000000 * 1000.000000 * 2 ---> 2000.000000
>>>128326 25368 16384 slots: 1.000000 * 1000.000000 * 8 ---> 8000.000000
>>>128327 25368 16384 ASU min = 1000.00000000000, ASU max = 20000.00000000000
>>>128328 25368 16384
>>>128329 25368 16384 no DDJU: do_usage: 1 finished_jobs 0
>>>128330 25368 16384
>>>128331 25368 16384 =====================[Pass 0]======================
>>>128332 25368 16384 =====================[Pass 1]======================
>>>128333 25368 16384 =====================[Pass 2]======================
>>>128334 25368 16384
>>>128335 25368 16384 no DDJU: do_usage: 0 finished_jobs 0
>>>128336 25368 16384
>>>128337 25368 16384 =====================[Pass 0]======================
>>>128338 25368 16384 =====================[Pass 1]======================
>>>128339 25368 16384 =====================[Pass 2]======================
>>>128340 25368 16384 Normalizing tickets using 0.000000/18.333333 as min_tix/max_tix
>>>128341 25368 16384 got 19 running jobs
>>>128342 25368 16384 added 19 ticket orders for running jobs
>>>128343 25368 16384 added 1 orders for updating usage of user
>>>128344 25368 16384 added 0 orders for updating usage of project
>>>128345 25368 16384 added 0 orders for updating share tree
>>>128346 25368 16384 added 1 orders for scheduler configuration
>>>128347 25368 16384 SENDING 22 ORDERS TO QMASTER
>>>128348 25368 16384 RESETTING BUSY STATE OF EVENT CLIENT
>>>128349 25368 16384 reresolve port timeout in 320
>>>128350 25368 16384 returning cached port value: 536
>>>--------------STOP-SCHEDULER-RUN-------------
>>>128351 25368 16384 ec_get retrieving events - will do max 20 fetches
>>>128352 25368 16384 doing sync fetch for messages, 20 still to do
>>>128353 25368 16384 try to get request from qmaster, id 1
>>>128354 25368 16384 Checking 120 events (44358-44477) while waiting for #44358
>>>128355 25368 16384 check complete, 120 events in list
>>>128356 25368 16384 got 120 events till 44477
>>>128357 25368 16384 doing async fetch for messages, 19 still to do
>>>128358 25368 16384 try to get request from qmaster, id 1
>>>128359 25368 16384 reresolve port timeout in 300
>>>128360 25368 16384 returning cached port value: 536
>>>128361 25368 16384 Sent ack for all events lower or equal 44477
>>>128362 25368 16384 ec_get - received 120 events
>>>128363 25368 16384 44358. EVENT MOD EXECHOST sub04n166
>>>128364 25368 16384 44359. EVENT MOD EXECHOST sub04n90
>>>128365 25368 16384 44360. EVENT JOB 21503.1 task 2.sub04n90 USAGE
>>>128366 25368 16384 44361. EVENT JOB 21503.1 task 1.sub04n90 USAGE
>>>128367 25368 16384 44362. EVENT MOD EXECHOST sub04n168
>>>128368 25368 16384 44363. EVENT MOD EXECHOST sub04n112
>>>128369 25368 16384 44364. EVENT MOD EXECHOST sub04n08
>>>128370 25368 16384 44365. EVENT MOD EXECHOST sub04n75
>>>128371 25368 16384 44366. EVENT JOB 21040.1 task 6.sub04n75 USAGE
>>>128372 25368 16384 44367. EVENT JOB 21040.1 task 5.sub04n75 USAGE
>>>128373 25368 16384 44368. EVENT MOD USER udo
>>>128374 25368 16384 44369. EVENT MOD USER iber
>>>128375 25368 16384 44370. EVENT MOD USER dieguez
>>>128376 25368 16384 44371. EVENT MOD USER karenjoh
>>>128377 25368 16384 44372. EVENT MOD USER lorenzo
>>>128378 25368 16384 44373. EVENT MOD USER parcolle
>>>128379 25368 16384 44374. EVENT MOD USER cfennie
>>>128380 25368 16384 44375. EVENT MOD USER civelli
>>>128381 25368 16384 44376. EVENT MOD EXECHOST sub04n14
>>>128382 25368 16384 44377. EVENT MOD EXECHOST sub04n150
>>>128383 25368 16384 44378. EVENT MOD EXECHOST sub04n169
>>>128384 25368 16384 44379. EVENT MOD EXECHOST sub04n165
>>>128385 25368 16384 44380. EVENT MOD EXECHOST sub04n136
>>>128386 25368 16384 44381. EVENT MOD EXECHOST sub04n81
>>>128387 25368 16384 44382. EVENT JOB 21507.1 task 6.sub04n81 USAGE
>>>128388 25368 16384 44383. EVENT JOB 21507.1 task 5.sub04n81 USAGE
>>>128389 25368 16384 44384. EVENT MOD EXECHOST sub04n176
>>>128390 25368 16384 44385. EVENT MOD EXECHOST sub04n161
>>>128391 25368 16384 44386. EVENT MOD EXECHOST sub04n124
>>>128392 25368 16384 44387. EVENT MOD EXECHOST sub04n01
>>>128393 25368 16384 44388. EVENT MOD EXECHOST sub04n158
>>>128394 25368 16384 44389. EVENT MOD EXECHOST sub04n159
>>>128395 25368 16384 44390. EVENT MOD EXECHOST sub04n134
>>>128396 25368 16384 44391. EVENT MOD EXECHOST sub04n143
>>>128397 25368 16384 44392. EVENT MOD EXECHOST sub04n121
>>>128398 25368 16384 44393. EVENT MOD EXECHOST sub04n15
>>>128399 25368 16384 44394. EVENT MOD EXECHOST sub04n13
>>>128400 25368 16384 44395. EVENT MOD EXECHOST sub04n118
>>>128401 25368 16384 44396. EVENT MOD EXECHOST sub04n64
>>>128402 25368 16384 44397. EVENT JOB 21542.1 task 2.sub04n64 USAGE
>>>128403 25368 16384 44398. EVENT JOB 21542.1 task 1.sub04n64 USAGE
>>>128404 25368 16384 44399. EVENT MOD EXECHOST sub04n151
>>>128405 25368 16384 44400. EVENT MOD EXECHOST sub04n154
>>>128406 25368 16384 44401. EVENT MOD EXECHOST sub04n149
>>>128407 25368 16384 44402. EVENT MOD EXECHOST sub04n16
>>>128408 25368 16384 44403. EVENT MOD EXECHOST sub04n155
>>>128409 25368 16384 44404. EVENT MOD EXECHOST sub04n152
>>>128410 25368 16384 44405. EVENT MOD EXECHOST sub04n163
>>>128411 25368 16384 44406. EVENT MOD EXECHOST sub04n86
>>>128412 25368 16384 44407. EVENT JOB 21423.1 task 2.sub04n86 USAGE
>>>128413 25368 16384 44408. EVENT JOB 21423.1 task 1.sub04n86 USAGE
>>>128414 25368 16384 44409. EVENT MOD EXECHOST sub04n43
>>>128415 25368 16384 44410. EVENT MOD EXECHOST sub04n204
>>>128416 25368 16384 44411. EVENT MOD EXECHOST rupc01.rutgers.edu
>>>128417 25368 16384 44412. EVENT MOD EXECHOST sub04n125
>>>128418 25368 16384 44413. EVENT MOD EXECHOST sub04n03
>>>128419 25368 16384 44414. EVENT JOB 21076.1 USAGE
>>>128420 25368 16384 44415. EVENT MOD EXECHOST sub04n44
>>>128421 25368 16384 44416. EVENT MOD EXECHOST sub04n32
>>>128422 25368 16384 44417. EVENT MOD EXECHOST sub04n21
>>>128423 25368 16384 44418. EVENT MOD EXECHOST sub04n22
>>>128424 25368 16384 44419. EVENT MOD EXECHOST sub04n35
>>>128425 25368 16384 44420. EVENT MOD EXECHOST sub04n201
>>>128426 25368 16384 44421. EVENT MOD EXECHOST sub04n146
>>>128427 25368 16384 44422. EVENT MOD EXECHOST sub04n111
>>>128428 25368 16384 44423. EVENT MOD EXECHOST sub04n177
>>>128429 25368 16384 44424. EVENT MOD EXECHOST sub04n89
>>>128430 25368 16384 44425. EVENT JOB 21530.1 task 2.sub04n89 USAGE
>>>128431 25368 16384 44426. EVENT JOB 21530.1 task 1.sub04n89 USAGE
>>>128432 25368 16384 44427. EVENT JOB 21530.1 USAGE
>>>128433 25368 16384 44428. EVENT MOD EXECHOST sub04n205
>>>128434 25368 16384 44429. EVENT JOB 21440.1 USAGE
>>>128435 25368 16384 44430. EVENT MOD EXECHOST sub04n208
>>>128436 25368 16384 44431. EVENT JOB 21528.1 USAGE
>>>128437 25368 16384 44432. EVENT MOD EXECHOST sub04n104
>>>128438 25368 16384 44433. EVENT MOD EXECHOST sub04n24
>>>128439 25368 16384 44434. EVENT JOB 21210.1 USAGE
>>>128440 25368 16384 44435. EVENT MOD EXECHOST sub04n18
>>>128441 25368 16384 44436. EVENT MOD EXECHOST sub04n31
>>>128442 25368 16384 44437. EVENT JOB 20937.1 USAGE
>>>128443 25368 16384 44438. EVENT MOD EXECHOST sub04n202
>>>128444 25368 16384 44439. EVENT JOB 21443.1 USAGE
>>>128445 25368 16384 44440. EVENT MOD EXECHOST sub04n171
>>>128446 25368 16384 44441. EVENT MOD EXECHOST sub04n37
>>>128447 25368 16384 44442. EVENT MOD EXECHOST sub04n36
>>>128448 25368 16384 44443. EVENT MOD EXECHOST sub04n40
>>>128449 25368 16384 44444. EVENT MOD EXECHOST sub04n12
>>>128450 25368 16384 44445. EVENT MOD EXECHOST sub04n172
>>>128451 25368 16384 44446. EVENT MOD EXECHOST sub04n79
>>>128452 25368 16384 44447. EVENT JOB 21040.1 task 6.sub04n79 USAGE
>>>128453 25368 16384 44448. EVENT JOB 21040.1 task 5.sub04n79 USAGE
>>>128454 25368 16384 44449. EVENT JOB 21040.1 USAGE
>>>128455 25368 16384 44450. EVENT MOD EXECHOST sub04n61
>>>128456 25368 16384 44451. EVENT JOB 21040.1 task 6.sub04n61 USAGE
>>>128457 25368 16384 44452. EVENT JOB 21040.1 task 5.sub04n61 USAGE
>>>128458 25368 16384 44453. EVENT MOD EXECHOST sub04n170
>>>128459 25368 16384 44454. EVENT MOD EXECHOST sub04n41
>>>128460 25368 16384 44455. EVENT JOB 20938.1 USAGE
>>>128461 25368 16384 44456. EVENT MOD EXECHOST sub04n153
>>>128462 25368 16384 44457. EVENT MOD EXECHOST sub04n39
>>>128463 25368 16384 44458. EVENT MOD EXECHOST sub04n83
>>>128464 25368 16384 44459. EVENT MOD EXECHOST sub04n82
>>>128465 25368 16384 44460. EVENT MOD EXECHOST sub04n174
>>>128466 25368 16384 44461. EVENT MOD EXECHOST sub04n173
>>>128467 25368 16384 44462. EVENT MOD EXECHOST sub04n85
>>>128468 25368 16384 44463. EVENT JOB 21423.1 task 2.sub04n85 USAGE
>>>128469 25368 16384 44464. EVENT JOB 21423.1 task 1.sub04n85 USAGE
>>>128470 25368 16384 44465. EVENT MOD EXECHOST sub04n68
>>>128471 25368 16384 44466. EVENT JOB 21474.1 task 14.sub04n68 USAGE
>>>128472 25368 16384 44467. EVENT JOB 21474.1 task 13.sub04n68 USAGE
>>>128473 25368 16384 44468. EVENT MOD EXECHOST beowulf.rutgers.edu
>>>128474 25368 16384 44469. EVENT MOD EXECHOST sub04n91
>>>128475 25368 16384 44470. EVENT JOB 21423.1 task 2.sub04n91 USAGE
>>>128476 25368 16384 44471. EVENT JOB 21423.1 task 1.sub04n91 USAGE
>>>128477 25368 16384 44472. EVENT JOB 21423.1 USAGE
>>>128478 25368 16384 44473. EVENT MOD EXECHOST sub04n29
>>>128479 25368 16384 44474. EVENT MOD EXECHOST sub04n69
>>>128480 25368 16384 44475. EVENT JOB 21474.1 task 14.sub04n69 USAGE
>>>128481 25368 16384 44476. EVENT JOB 21474.1 task 13.sub04n69 USAGE
>>>128482 25368 16384 44477. EVENT MOD EXECHOST sub04n175
>>>Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7
>>>128483 25368 16384 ================[SCHEDULING-EPOCH]==================
>>>128484 25368 16384 JOB 20937.1 start_time = 1116447112 running_time 338099 decay_time = 450
>>>128485 25368 16384 JOB 20938.1 start_time = 1116374344 running_time 410867 decay_time = 450
>>>128486 25368 16384 JOB 21040.1 start_time = 1116443073 running_time 342138 decay_time = 450
>>>128487 25368 16384 JOB 21076.1 start_time = 1116451351 running_time 333860 decay_time = 450
>>>128488 25368 16384 JOB 21210.1 start_time = 1116514970 running_time 270241 decay_time = 450
>>>128489 25368 16384 JOB 21213.1 start_time = 1116515250 running_time 269961 decay_time = 450
>>>128490 25368 16384 JOB 21338.1 start_time = 1116543252 running_time 241959 decay_time = 450
>>>128491 25368 16384 JOB 21423.1 start_time = 1116629274 running_time 155937 decay_time = 450
>>>128492 25368 16384 JOB 21424.1 start_time = 1116631365 running_time 153846 decay_time = 450
>>>128493 25368 16384 JOB 21440.1 start_time = 1116632934 running_time 152277 decay_time = 450
>>>128494 25368 16384 JOB 21441.1 start_time = 1116632994 running_time 152217 decay_time = 450
>>>128495 25368 16384 JOB 21443.1 start_time = 1116633602 running_time 151609 decay_time = 450
>>>128496 25368 16384 JOB 21474.1 start_time = 1116655118 running_time 130093 decay_time = 450
>>>128497 25368 16384 JOB 21503.1 start_time = 1116707395 running_time 77816 decay_time = 450
>>>128498 25368 16384 JOB 21507.1 start_time = 1116714061 running_time 71150 decay_time = 450
>>>128499 25368 16384 JOB 21528.1 start_time = 1116707641 running_time 77570 decay_time = 450
>>>128500 25368 16384 JOB 21530.1 start_time = 1116714453 running_time 70758 decay_time = 450
>>>128501 25368 16384 JOB 21537.1 start_time = 1116724845 running_time 60366 decay_time = 450
>>>128502 25368 16384 JOB 21542.1 start_time = 1116782511 running_time 2700 decay_time = 450
>>>128503 25368 16384 verified threshold of 169 queues
>>>128504 25368 16384 queue myrinet at sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128505 25368 16384 queue myrinet at sub04n62 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128506 25368 16384 queue myrinet at sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128507 25368 16384 queue myrinet at sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128508 25368 16384 queue myrinet at sub04n67 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128509 25368 16384 queue myrinet at sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128510 25368 16384 queue myrinet at sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128511 25368 16384 queue myrinet at sub04n70 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128512 25368 16384 queue myrinet at sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128513 25368 16384 queue myrinet at sub04n72 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128514 25368 16384 queue myrinet at sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128515 25368 16384 queue myrinet at sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128516 25368 16384 queue myrinet at sub04n78 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128517 25368 16384 queue myrinet at sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128518 25368 16384 queue myrinet at sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128519 25368 16384 queue myrinet at sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128520 25368 16384 queue myrinet at sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128521 25368 16384 queue myrinet at sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128522 25368 16384 queue myrinet at sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128523 25368 16384 queue myrinet at sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128524 25368 16384 queue myrinet at sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128525 25368 16384 queue myrinet at sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128526 25368 16384 queue myrinet at sub04n91 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128527 25368 16384 queue myrinet at sub04n63 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128528 25368 16384 queue myrinet at sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128529 25368 16384 queue myrinet at sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128530 25368 16384 queue myrinet at sub04n74 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128531 25368 16384 queue opteronp at sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128532 25368 16384 queue opteronp at sub04n205 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128533 25368 16384 queue opteronp at sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128534 25368 16384 queue opteronp at sub04n208 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128535 25368 16384 queue parallel at sub04n121 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128536 25368 16384 queue parallel at sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128537 25368 16384 queue parallel at sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128538 25368 16384 queue parallel at sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128539 25368 16384 queue parallel at sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128540 25368 16384 queue parallel at sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128541 25368 16384 queue parallel at sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128542 25368 16384 queue parallel at sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128543 25368 16384 queue parallel at sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128544 25368 16384 queue parallel at sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128545 25368 16384 queue parallel at sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128546 25368 16384 queue parallel at sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128547 25368 16384 queue parallel at sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128548 25368 16384 queue parallel at sub04n07 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128549 25368 16384 queue parallel at sub04n08 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128550 25368 16384 queue parallel at sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128551 25368 16384 queue parallel at sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128552 25368 16384 queue parallel at sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128553 25368 16384 verified threshold of 169 queues
>>>128554 25368 16384 STARTING PASS 1 WITH 0 PENDING JOBS
>>>128555 25368 16384 Not enrolled ja_tasks: 0
>>>128556 25368 16384 Enrolled ja_tasks: 1
>>>128557 25368 16384 Not enrolled ja_tasks: 0
>>>128558 25368 16384 Enrolled ja_tasks: 1
>>>128559 25368 16384 Not enrolled ja_tasks: 0
>>>128560 25368 16384 Enrolled ja_tasks: 1
>>>128561 25368 16384 Not enrolled ja_tasks: 0
>>>128562 25368 16384 Enrolled ja_tasks: 1
>>>128563 25368 16384 Not enrolled ja_tasks: 0
>>>128564 25368 16384 Enrolled ja_tasks: 1
>>>128565 25368 16384 Not enrolled ja_tasks: 0
>>>128566 25368 16384 Enrolled ja_tasks: 1
>>>128567 25368 16384 Not enrolled ja_tasks: 0
>>>128568 25368 16384 Enrolled ja_tasks: 1
>>>128569 25368 16384 Not enrolled ja_tasks: 0
>>>128570 25368 16384 Enrolled ja_tasks: 1
>>>128571 25368 16384 Not enrolled ja_tasks: 0
>>>128572 25368 16384 Enrolled ja_tasks: 1
>>>128573 25368 16384 Not enrolled ja_tasks: 0
>>>128574 25368 16384 Enrolled ja_tasks: 1
>>>128575 25368 16384 Not enrolled ja_tasks: 0
>>>128576 25368 16384 Enrolled ja_tasks: 1
>>>128577 25368 16384 Not enrolled ja_tasks: 0
>>>128578 25368 16384 Enrolled ja_tasks: 1
>>>128579 25368 16384 Not enrolled ja_tasks: 0
>>>128580 25368 16384 Enrolled ja_tasks: 1
>>>128581 25368 16384 Not enrolled ja_tasks: 0
>>>128582 25368 16384 Enrolled ja_tasks: 1
>>>128583 25368 16384 Not enrolled ja_tasks: 0
>>>128584 25368 16384 Enrolled ja_tasks: 1
>>>128585 25368 16384 Not enrolled ja_tasks: 0
>>>128586 25368 16384 Enrolled ja_tasks: 1
>>>128587 25368 16384 Not enrolled ja_tasks: 0
>>>128588 25368 16384 Enrolled ja_tasks: 1
>>>128589 25368 16384 Not enrolled ja_tasks: 0
>>>128590 25368 16384 Enrolled ja_tasks: 1
>>>128591 25368 16384 Not enrolled ja_tasks: 0
>>>128592 25368 16384 Enrolled ja_tasks: 1
>>>128593 25368 16384 STARTING PASS 2 WITH 0 PENDING JOBS
>>>128594 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128595 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128596 25368 16384 slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
>>>128597 25368 16384 slots: 1.000000 * 1000.000000 * 20 ---> 20000.000000
>>>128598 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128599 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128600 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128601 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128602 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128603 25368 16384 slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
>>>128604 25368 16384 slots: 1.000000 * 1000.000000 * 2 ---> 2000.000000
>>>128605 25368 16384 slots: 1.000000 * 1000.000000 * 8 ---> 8000.000000
>>>128606 25368 16384 ASU min = 1000.00000000000, ASU max = 20000.00000000000
>>>128607 25368 16384
>>>128608 25368 16384 no DDJU: do_usage: 1 finished_jobs 0
>>>128609 25368 16384
>>>128610 25368 16384 =====================[Pass 0]======================
>>>128611 25368 16384 =====================[Pass 1]======================
>>>128612 25368 16384 =====================[Pass 2]======================
>>>128613 25368 16384
>>>128614 25368 16384 no DDJU: do_usage: 0 finished_jobs 0
>>>128615 25368 16384
>>>128616 25368 16384 =====================[Pass 0]======================
>>>128617 25368 16384 =====================[Pass 1]======================
>>>128618 25368 16384 =====================[Pass 2]======================
>>>128619 25368 16384 Normalizing tickets using 0.000000/18.333333 as min_tix/max_tix
>>>128620 25368 16384 got 19 running jobs
>>>128621 25368 16384 added 19 ticket orders for running jobs
>>>128622 25368 16384 added 1 orders for updating usage of user
>>>128623 25368 16384 added 0 orders for updating usage of project
>>>128624 25368 16384 added 0 orders for updating share tree
>>>128625 25368 16384 added 1 orders for scheduler configuration
>>>128626 25368 16384 SENDING 22 ORDERS TO QMASTER
>>>128627 25368 16384 RESETTING BUSY STATE OF EVENT CLIENT
>>>128628 25368 16384 reresolve port timeout in 300
>>>128629 25368 16384 returning cached port value: 536
>>>--------------STOP-SCHEDULER-RUN-------------
>>>128630 25368 16384 ec_get retrieving events - will do max 20 fetches
>>>128631 25368 16384 doing sync fetch for messages, 20 still to do
>>>128632 25368 16384 try to get request from qmaster, id 1
>>>128633 25368 16384 Checking 84 events (44478-44561) while waiting for #44478
>>>128634 25368 16384 check complete, 84 events in list
>>>128635 25368 16384 got 84 events till 44561
>>>128636 25368 16384 doing async fetch for messages, 19 still to do
>>>128637 25368 16384 try to get request from qmaster, id 1
>>>128638 25368 16384 reresolve port timeout in 280
>>>128639 25368 16384 returning cached port value: 536
>>>128640 25368 16384 Getting host by name - Linux
>>>128641 25368 16384 1 names in h_addr_list
>>>128642 25368 16384 0 names in h_aliases
>>>128643 25368 16384 Sent ack for all events lower or equal 44561
>>>128644 25368 16384 ec_get - received 84 events
>>>128645 25368 16384 44478. EVENT MOD EXECHOST sub04n167
>>>128646 25368 16384 44479. EVENT MOD EXECHOST sub04n63
>>>128647 25368 16384 44480. EVENT JOB 21542.1 task 2.sub04n63 USAGE
>>>128648 25368 16384 44481. EVENT JOB 21542.1 task 1.sub04n63 USAGE
>>>128649 25368 16384 44482. EVENT JOB 21542.1 USAGE
>>>128650 25368 16384 44483. EVENT MOD EXECHOST sub04n71
>>>128651 25368 16384 44484. EVENT JOB 21537.1 task 2.sub04n71 USAGE
>>>128652 25368 16384 44485. EVENT JOB 21537.1 task 1.sub04n71 USAGE
>>>128653 25368 16384 44486. EVENT MOD EXECHOST sub04n65
>>>128654 25368 16384 44487. EVENT JOB 21424.1 task 2.sub04n65 USAGE
>>>128655 25368 16384 44488. EVENT JOB 21424.1 task 1.sub04n65 USAGE
>>>128656 25368 16384 44489. EVENT MOD USER udo
>>>128657 25368 16384 44490. EVENT MOD USER iber
>>>128658 25368 16384 44491. EVENT MOD USER dieguez
>>>128659 25368 16384 44492. EVENT MOD USER karenjoh
>>>128660 25368 16384 44493. EVENT MOD USER lorenzo
>>>128661 25368 16384 44494. EVENT MOD USER parcolle
>>>128662 25368 16384 44495. EVENT MOD USER cfennie
>>>128663 25368 16384 44496. EVENT MOD USER civelli
>>>128664 25368 16384 44497. EVENT MOD EXECHOST sub04n25
>>>128665 25368 16384 44498. EVENT MOD EXECHOST sub04n144
>>>128666 25368 16384 44499. EVENT MOD EXECHOST sub04n206
>>>128667 25368 16384 44500. EVENT JOB 21441.1 USAGE
>>>128668 25368 16384 44501. EVENT MOD EXECHOST sub04n87
>>>128669 25368 16384 44502. EVENT JOB 21503.1 task 2.sub04n87 USAGE
>>>128670 25368 16384 44503. EVENT JOB 21503.1 task 1.sub04n87 USAGE
>>>128671 25368 16384 44504. EVENT MOD EXECHOST sub04n70
>>>128672 25368 16384 44505. EVENT JOB 21503.1 task 2.sub04n70 USAGE
>>>128673 25368 16384 44506. EVENT JOB 21503.1 task 1.sub04n70 USAGE
>>>128674 25368 16384 44507. EVENT JOB 21503.1 USAGE
>>>128675 25368 16384 44508. EVENT MOD EXECHOST sub04n19
>>>128676 25368 16384 44509. EVENT JOB 21338.1 USAGE
>>>128677 25368 16384 44510. EVENT MOD EXECHOST sub04n84
>>>128678 25368 16384 44511. EVENT JOB 21424.1 task 2.sub04n84 USAGE
>>>128679 25368 16384 44512. EVENT JOB 21424.1 task 1.sub04n84 USAGE
>>>128680 25368 16384 44513. EVENT MOD EXECHOST sub04n178
>>>128681 25368 16384 44514. EVENT MOD EXECHOST sub04n67
>>>128682 25368 16384 44515. EVENT JOB 21474.1 task 14.sub04n67 USAGE
>>>128683 25368 16384 44516. EVENT JOB 21474.1 task 13.sub04n67 USAGE
>>>128684 25368 16384 44517. EVENT JOB 21474.1 USAGE
>>>128685 25368 16384 44518. EVENT MOD EXECHOST sub04n27
>>>128686 25368 16384 44519. EVENT MOD EXECHOST sub04n34
>>>128687 25368 16384 44520. EVENT MOD EXECHOST sub04n72
>>>128688 25368 16384 44521. EVENT JOB 21537.1 task 2.sub04n72 USAGE
>>>128689 25368 16384 44522. EVENT JOB 21537.1 task 1.sub04n72 USAGE
>>>128690 25368 16384 44523. EVENT MOD EXECHOST sub04n78
>>>128691 25368 16384 44524. EVENT JOB 21507.1 task 6.sub04n78 USAGE
>>>128692 25368 16384 44525. EVENT JOB 21507.1 task 5.sub04n78 USAGE
>>>128693 25368 16384 44526. EVENT JOB 21507.1 USAGE
>>>128694 25368 16384 44527. EVENT MOD EXECHOST sub04n17
>>>128695 25368 16384 44528. EVENT MOD EXECHOST sub04n07
>>>128696 25368 16384 44529. EVENT MOD EXECHOST sub04n128
>>>128697 25368 16384 44530. EVENT MOD EXECHOST sub04n42
>>>128698 25368 16384 44531. EVENT MOD EXECHOST sub04n62
>>>128699 25368 16384 44532. EVENT JOB 21424.1 task 2.sub04n62 USAGE
>>>128700 25368 16384 44533. EVENT JOB 21424.1 task 1.sub04n62 USAGE
>>>128701 25368 16384 44534. EVENT JOB 21424.1 USAGE
>>>128702 25368 16384 44535. EVENT MOD EXECHOST sub04n10
>>>128703 25368 16384 44536. EVENT MOD EXECHOST sub04n77
>>>128704 25368 16384 44537. EVENT JOB 21537.1 task 2.sub04n77 USAGE
>>>128705 25368 16384 44538. EVENT JOB 21537.1 task 1.sub04n77 USAGE
>>>128706 25368 16384 44539. EVENT MOD EXECHOST sub04n11
>>>128707 25368 16384 44540. EVENT MOD EXECHOST sub04n02
>>>128708 25368 16384 44541. EVENT MOD EXECHOST sub04n120
>>>128709 25368 16384 44542. EVENT MOD EXECHOST sub04n115
>>>128710 25368 16384 44543. EVENT MOD EXECHOST sub04n101
>>>128711 25368 16384 44544. EVENT MOD EXECHOST sub04n66
>>>128712 25368 16384 44545. EVENT JOB 21537.1 task 2.sub04n66 USAGE
>>>128713 25368 16384 44546. EVENT JOB 21537.1 task 1.sub04n66 USAGE
>>>128714 25368 16384 44547. EVENT JOB 21537.1 USAGE
>>>128715 25368 16384 44548. EVENT MOD EXECHOST sub04n142
>>>128716 25368 16384 44549. EVENT MOD EXECHOST sub04n123
>>>128717 25368 16384 44550. EVENT MOD EXECHOST sub04n33
>>>128718 25368 16384 44551. EVENT MOD EXECHOST sub04n126
>>>128719 25368 16384 44552. EVENT MOD EXECHOST sub04n140
>>>128720 25368 16384 44553. EVENT MOD EXECHOST sub04n119
>>>128721 25368 16384 44554. EVENT MOD EXECHOST sub04n102
>>>128722 25368 16384 44555. EVENT MOD EXECHOST sub04n110
>>>128723 25368 16384 44556. EVENT MOD EXECHOST sub04n117
>>>128724 25368 16384 44557. EVENT MOD EXECHOST sub04n06
>>>128725 25368 16384 44558. EVENT MOD EXECHOST sub04n73
>>>128726 25368 16384 44559. EVENT JOB 21542.1 task 2.sub04n73 USAGE
>>>128727 25368 16384 44560. EVENT JOB 21542.1 task 1.sub04n73 USAGE
>>>128728 25368 16384 44561. EVENT MOD EXECHOST sub04n122
>>>Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7
>>>128729 25368 16384 ================[SCHEDULING-EPOCH]==================
>>>128730 25368 16384 JOB 20937.1 start_time = 1116447112 running_time 338119 decay_time = 450
>>>128731 25368 16384 JOB 20938.1 start_time = 1116374344 running_time 410887 decay_time = 450
>>>128732 25368 16384 JOB 21040.1 start_time = 1116443073 running_time 342158 decay_time = 450
>>>128733 25368 16384 JOB 21076.1 start_time = 1116451351 running_time 333880 decay_time = 450
>>>128734 25368 16384 JOB 21210.1 start_time = 1116514970 running_time 270261 decay_time = 450
>>>128735 25368 16384 JOB 21213.1 start_time = 1116515250 running_time 269981 decay_time = 450
>>>128736 25368 16384 JOB 21338.1 start_time = 1116543252 running_time 241979 decay_time = 450
>>>128737 25368 16384 JOB 21423.1 start_time = 1116629274 running_time 155957 decay_time = 450
>>>128738 25368 16384 JOB 21424.1 start_time = 1116631365 running_time 153866 decay_time = 450
>>>128739 25368 16384 JOB 21440.1 start_time = 1116632934 running_time 152297 decay_time = 450
>>>128740 25368 16384 JOB 21441.1 start_time = 1116632994 running_time 152237 decay_time = 450
>>>128741 25368 16384 JOB 21443.1 start_time = 1116633602 running_time 151629 decay_time = 450
>>>128742 25368 16384 JOB 21474.1 start_time = 1116655118 running_time 130113 decay_time = 450
>>>128743 25368 16384 JOB 21503.1 start_time = 1116707395 running_time 77836 decay_time = 450
>>>128744 25368 16384 JOB 21507.1 start_time = 1116714061 running_time 71170 decay_time = 450
>>>128745 25368 16384 JOB 21528.1 start_time = 1116707641 running_time 77590 decay_time = 450
>>>128746 25368 16384 JOB 21530.1 start_time = 1116714453 running_time 70778 decay_time = 450
>>>128747 25368 16384 JOB 21537.1 start_time = 1116724845 running_time 60386 decay_time = 450
>>>128748 25368 16384 JOB 21542.1 start_time = 1116782511 running_time 2720 decay_time = 450
>>>128749 25368 16384 verified threshold of 169 queues
>>>128750 25368 16384 queue myrinet@sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128751 25368 16384 queue myrinet@sub04n62 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128752 25368 16384 queue myrinet@sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128753 25368 16384 queue myrinet@sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128754 25368 16384 queue myrinet@sub04n67 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128755 25368 16384 queue myrinet@sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128756 25368 16384 queue myrinet@sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128757 25368 16384 queue myrinet@sub04n70 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128758 25368 16384 queue myrinet@sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128759 25368 16384 queue myrinet@sub04n72 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128760 25368 16384 queue myrinet@sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128761 25368 16384 queue myrinet@sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128762 25368 16384 queue myrinet@sub04n78 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128763 25368 16384 queue myrinet@sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128764 25368 16384 queue myrinet@sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128765 25368 16384 queue myrinet@sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128766 25368 16384 queue myrinet@sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128767 25368 16384 queue myrinet@sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128768 25368 16384 queue myrinet@sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128769 25368 16384 queue myrinet@sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128770 25368 16384 queue myrinet@sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128771 25368 16384 queue myrinet@sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128772 25368 16384 queue myrinet@sub04n91 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128773 25368 16384 queue myrinet@sub04n63 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128774 25368 16384 queue myrinet@sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128775 25368 16384 queue myrinet@sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128776 25368 16384 queue myrinet@sub04n74 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128777 25368 16384 queue opteronp@sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128778 25368 16384 queue opteronp@sub04n205 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128779 25368 16384 queue opteronp@sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128780 25368 16384 queue opteronp@sub04n208 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128781 25368 16384 queue parallel@sub04n121 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128782 25368 16384 queue parallel@sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128783 25368 16384 queue parallel@sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128784 25368 16384 queue parallel@sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128785 25368 16384 queue parallel@sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128786 25368 16384 queue parallel@sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128787 25368 16384 queue parallel@sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128788 25368 16384 queue parallel@sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128789 25368 16384 queue parallel@sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128790 25368 16384 queue parallel@sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128791 25368 16384 queue parallel@sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128792 25368 16384 queue parallel@sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128793 25368 16384 queue parallel@sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128794 25368 16384 queue parallel@sub04n07 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128795 25368 16384 queue parallel@sub04n08 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128796 25368 16384 queue parallel@sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128797 25368 16384 queue parallel@sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128798 25368 16384 queue parallel@sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128799 25368 16384 verified threshold of 169 queues
>>>128800 25368 16384 STARTING PASS 1 WITH 0 PENDING JOBS
>>>128801 25368 16384 Not enrolled ja_tasks: 0
>>>128802 25368 16384 Enrolled ja_tasks: 1
>>>128803 25368 16384 Not enrolled ja_tasks: 0
>>>128804 25368 16384 Enrolled ja_tasks: 1
>>>128805 25368 16384 Not enrolled ja_tasks: 0
>>>128806 25368 16384 Enrolled ja_tasks: 1
>>>128807 25368 16384 Not enrolled ja_tasks: 0
>>>128808 25368 16384 Enrolled ja_tasks: 1
>>>128809 25368 16384 Not enrolled ja_tasks: 0
>>>128810 25368 16384 Enrolled ja_tasks: 1
>>>128811 25368 16384 Not enrolled ja_tasks: 0
>>>128812 25368 16384 Enrolled ja_tasks: 1
>>>128813 25368 16384 Not enrolled ja_tasks: 0
>>>128814 25368 16384 Enrolled ja_tasks: 1
>>>128815 25368 16384 Not enrolled ja_tasks: 0
>>>128816 25368 16384 Enrolled ja_tasks: 1
>>>128817 25368 16384 Not enrolled ja_tasks: 0
>>>128818 25368 16384 Enrolled ja_tasks: 1
>>>128819 25368 16384 Not enrolled ja_tasks: 0
>>>128820 25368 16384 Enrolled ja_tasks: 1
>>>128821 25368 16384 Not enrolled ja_tasks: 0
>>>128822 25368 16384 Enrolled ja_tasks: 1
>>>128823 25368 16384 Not enrolled ja_tasks: 0
>>>128824 25368 16384 Enrolled ja_tasks: 1
>>>128825 25368 16384 Not enrolled ja_tasks: 0
>>>128826 25368 16384 Enrolled ja_tasks: 1
>>>128827 25368 16384 Not enrolled ja_tasks: 0
>>>128828 25368 16384 Enrolled ja_tasks: 1
>>>128829 25368 16384 Not enrolled ja_tasks: 0
>>>128830 25368 16384 Enrolled ja_tasks: 1
>>>128831 25368 16384 Not enrolled ja_tasks: 0
>>>128832 25368 16384 Enrolled ja_tasks: 1
>>>128833 25368 16384 Not enrolled ja_tasks: 0
>>>128834 25368 16384 Enrolled ja_tasks: 1
>>>128835 25368 16384 Not enrolled ja_tasks: 0
>>>128836 25368 16384 Enrolled ja_tasks: 1
>>>128837 25368 16384 Not enrolled ja_tasks: 0
>>>128838 25368 16384 Enrolled ja_tasks: 1
>>>128839 25368 16384 STARTING PASS 2 WITH 0 PENDING JOBS
>>>128840 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128841 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128842 25368 16384 slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
>>>128843 25368 16384 slots: 1.000000 * 1000.000000 * 20 ---> 20000.000000
>>>128844 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128845 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128846 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>128847 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128848 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>128849 25368 16384 slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
>>>128850 25368 16384 slots: 1.000000 * 1000.000000 * 2 ---> 2000.000000
>>>128851 25368 16384 slots: 1.000000 * 1000.000000 * 8 ---> 8000.000000
>>>128852 25368 16384 ASU min = 1000.00000000000, ASU max = 20000.00000000000
>>>128853 25368 16384
>>>128854 25368 16384 no DDJU: do_usage: 1 finished_jobs 0
>>>128855 25368 16384
>>>128856 25368 16384 =====================[Pass 0]======================
>>>128857 25368 16384 =====================[Pass 1]======================
>>>128858 25368 16384 =====================[Pass 2]======================
>>>128859 25368 16384
>>>128860 25368 16384 no DDJU: do_usage: 0 finished_jobs 0
>>>128861 25368 16384
>>>128862 25368 16384 =====================[Pass 0]======================
>>>128863 25368 16384 =====================[Pass 1]======================
>>>128864 25368 16384 =====================[Pass 2]======================
>>>128865 25368 16384 Normalizing tickets using 0.000000/18.333333 as min_tix/max_tix
>>>128866 25368 16384 got 19 running jobs
>>>128867 25368 16384 added 19 ticket orders for running jobs
>>>128868 25368 16384 added 1 orders for updating usage of user
>>>128869 25368 16384 added 0 orders for updating usage of project
>>>128870 25368 16384 added 0 orders for updating share tree
>>>128871 25368 16384 added 1 orders for scheduler configuration
>>>128872 25368 16384 SENDING 22 ORDERS TO QMASTER
>>>128873 25368 16384 RESETTING BUSY STATE OF EVENT CLIENT
>>>128874 25368 16384 reresolve port timeout in 280
>>>128875 25368 16384 returning cached port value: 536
>>>--------------STOP-SCHEDULER-RUN-------------
>>>128876 25368 16384 ec_get retrieving events - will do max 20 fetches
>>>128877 25368 16384 doing sync fetch for messages, 20 still to do
>>>128878 25368 16384 try to get request from qmaster, id 1
>>>128879 25368 16384 Checking 55 events (44562-44616) while waiting for #44562
>>>128880 25368 16384 check complete, 55 events in list
>>>128881 25368 16384 got 55 events till 44616
>>>128882 25368 16384 doing async fetch for messages, 19 still to do
>>>128883 25368 16384 try to get request from qmaster, id 1
>>>128884 25368 16384 reresolve port timeout in 260
>>>128885 25368 16384 returning cached port value: 536
>>>128886 25368 16384 Sent ack for all events lower or equal 44616
>>>128887 25368 16384 ec_get - received 55 events
>>>128888 25368 16384 44562. EVENT MOD EXECHOST sub04n147
>>>128889 25368 16384 44563. EVENT MOD USER udo
>>>128890 25368 16384 44564. EVENT MOD USER iber
>>>128891 25368 16384 44565. EVENT MOD USER dieguez
>>>128892 25368 16384 44566. EVENT MOD USER karenjoh
>>>128893 25368 16384 44567. EVENT MOD USER lorenzo
>>>128894 25368 16384 44568. EVENT MOD USER parcolle
>>>128895 25368 16384 44569. EVENT MOD USER cfennie
>>>128896 25368 16384 44570. EVENT MOD USER civelli
>>>128897 25368 16384 44571. EVENT MOD EXECHOST sub04n135
>>>128898 25368 16384 44572. EVENT MOD EXECHOST sub04n141
>>>128899 25368 16384 44573. EVENT MOD EXECHOST sub04n127
>>>128900 25368 16384 44574. EVENT MOD EXECHOST sub04n145
>>>128901 25368 16384 44575. EVENT MOD EXECHOST sub04n133
>>>128902 25368 16384 44576. EVENT MOD EXECHOST sub04n148
>>>128903 25368 16384 44577. EVENT MOD EXECHOST sub04n74
>>>128904 25368 16384 44578. EVENT JOB 21542.1 task 2.sub04n74 USAGE
>>>128905 25368 16384 44579. EVENT JOB 21542.1 task 1.sub04n74 USAGE
>>>128906 25368 16384 44580. EVENT MOD EXECHOST rupc03.rutgers.edu
>>>128907 25368 16384 44581. EVENT MOD EXECHOST sub04n139
>>>128908 25368 16384 44582. EVENT MOD EXECHOST rupc02.rutgers.edu
>>>128909 25368 16384 44583. EVENT MOD EXECHOST sub04n80
>>>128910 25368 16384 44584. EVENT MOD EXECHOST sub04n207
>>>128911 25368 16384 44585. EVENT MOD EXECHOST sub04n180
>>>128912 25368 16384 44586. EVENT MOD EXECHOST sub04n23
>>>128913 25368 16384 44587. EVENT MOD EXECHOST sub04n30
>>>128914 25368 16384 44588. EVENT MOD EXECHOST sub04n203
>>>128915 25368 16384 44589. EVENT MOD EXECHOST sub04n109
>>>128916 25368 16384 44590. EVENT MOD EXECHOST rupc04.rutgers.edu
>>>128917 25368 16384 44591. EVENT MOD EXECHOST sub04n114
>>>128918 25368 16384 44592. EVENT MOD EXECHOST sub04n106
>>>128919 25368 16384 44593. EVENT MOD EXECHOST sub04n88
>>>128920 25368 16384 44594. EVENT JOB 21507.1 task 6.sub04n88 USAGE
>>>128921 25368 16384 44595. EVENT JOB 21507.1 task 5.sub04n88 USAGE
>>>128922 25368 16384 44596. EVENT MOD EXECHOST sub04n157
>>>128923 25368 16384 44597. EVENT MOD EXECHOST sub04n20
>>>128924 25368 16384 44598. EVENT MOD EXECHOST sub04n156
>>>128925 25368 16384 44599. EVENT MOD EXECHOST sub04n26
>>>128926 25368 16384 44600. EVENT JOB 21213.1 USAGE
>>>128927 25368 16384 44601. EVENT MOD EXECHOST sub04n09
>>>128928 25368 16384 44602. EVENT MOD EXECHOST sub04n05
>>>128929 25368 16384 44603. EVENT MOD EXECHOST sub04n103
>>>128930 25368 16384 44604. EVENT MOD EXECHOST sub04n164
>>>128931 25368 16384 44605. EVENT MOD EXECHOST sub04n105
>>>128932 25368 16384 44606. EVENT MOD EXECHOST sub04n113
>>>128933 25368 16384 44607. EVENT MOD EXECHOST sub04n28
>>>128934 25368 16384 44608. EVENT MOD EXECHOST sub04n76
>>>128935 25368 16384 44609. EVENT MOD EXECHOST sub04n162
>>>128936 25368 16384 44610. EVENT MOD EXECHOST sub04n108
>>>128937 25368 16384 44611. EVENT MOD EXECHOST sub04n38
>>>128938 25368 16384 44612. EVENT MOD EXECHOST sub04n116
>>>128939 25368 16384 44613. EVENT MOD EXECHOST sub04n179
>>>128940 25368 16384 44614. EVENT MOD EXECHOST sub04n04
>>>128941 25368 16384 44615. EVENT MOD EXECHOST sub04n160
>>>128942 25368 16384 44616. EVENT MOD EXECHOST sub04n107
>>>Q:169, AQ:343 J:19(19), H:169(170), C:49, A:4, D:3, P:7, CKPT:0 US:15 PR:4 S:nd:12/lf:7
>>>128943 25368 16384 ================[SCHEDULING-EPOCH]==================
>>>128944 25368 16384 JOB 20937.1 start_time = 1116447112 running_time 338139 decay_time = 450
>>>128945 25368 16384 JOB 20938.1 start_time = 1116374344 running_time 410907 decay_time = 450
>>>128946 25368 16384 JOB 21040.1 start_time = 1116443073 running_time 342178 decay_time = 450
>>>128947 25368 16384 JOB 21076.1 start_time = 1116451351 running_time 333900 decay_time = 450
>>>128948 25368 16384 JOB 21210.1 start_time = 1116514970 running_time 270281 decay_time = 450
>>>128949 25368 16384 JOB 21213.1 start_time = 1116515250 running_time 270001 decay_time = 450
>>>128950 25368 16384 JOB 21338.1 start_time = 1116543252 running_time 241999 decay_time = 450
>>>128951 25368 16384 JOB 21423.1 start_time = 1116629274 running_time 155977 decay_time = 450
>>>128952 25368 16384 JOB 21424.1 start_time = 1116631365 running_time 153886 decay_time = 450
>>>128953 25368 16384 JOB 21440.1 start_time = 1116632934 running_time 152317 decay_time = 450
>>>128954 25368 16384 JOB 21441.1 start_time = 1116632994 running_time 152257 decay_time = 450
>>>128955 25368 16384 JOB 21443.1 start_time = 1116633602 running_time 151649 decay_time = 450
>>>128956 25368 16384 JOB 21474.1 start_time = 1116655118 running_time 130133 decay_time = 450
>>>128957 25368 16384 JOB 21503.1 start_time = 1116707395 running_time 77856 decay_time = 450
>>>128958 25368 16384 JOB 21507.1 start_time = 1116714061 running_time 71190 decay_time = 450
>>>128959 25368 16384 JOB 21528.1 start_time = 1116707641 running_time 77610 decay_time = 450
>>>128960 25368 16384 JOB 21530.1 start_time = 1116714453 running_time 70798 decay_time = 450
>>>128961 25368 16384 JOB 21537.1 start_time = 1116724845 running_time 60406 decay_time = 450
>>>128962 25368 16384 JOB 21542.1 start_time = 1116782511 running_time 2740 decay_time = 450
>>>128963 25368 16384 verified threshold of 169 queues
>>>128964 25368 16384 queue myrinet@sub04n61 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128965 25368 16384 queue myrinet@sub04n62 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128966 25368 16384 queue myrinet@sub04n65 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128967 25368 16384 queue myrinet@sub04n66 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128968 25368 16384 queue myrinet@sub04n67 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128969 25368 16384 queue myrinet@sub04n68 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128970 25368 16384 queue myrinet@sub04n69 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128971 25368 16384 queue myrinet@sub04n70 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>128972 25368 16384 queue myrinet@sub04n71 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128973 25368 16384 queue myrinet@sub04n72 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128974 25368 16384 queue myrinet@sub04n75 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128975 25368 16384 queue myrinet@sub04n77 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128976 25368 16384 queue myrinet@sub04n78 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128977 25368 16384 queue myrinet@sub04n79 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128978 25368 16384 queue myrinet@sub04n81 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128979 25368 16384 queue myrinet@sub04n84 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128980 25368 16384 queue myrinet@sub04n85 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128981 25368 16384 queue myrinet@sub04n86 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128982 25368 16384 queue myrinet@sub04n87 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128983 25368 16384 queue myrinet@sub04n88 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128984 25368 16384 queue myrinet@sub04n89 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128985 25368 16384 queue myrinet@sub04n90 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128986 25368 16384 queue myrinet@sub04n91 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128987 25368 16384 queue myrinet@sub04n63 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128988 25368 16384 queue myrinet@sub04n64 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128989 25368 16384 queue myrinet@sub04n73 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128990 25368 16384 queue myrinet@sub04n74 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128991 25368 16384 queue opteronp@sub04n202 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128992 25368 16384 queue opteronp@sub04n205 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128993 25368 16384 queue opteronp@sub04n206 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128994 25368 16384 queue opteronp@sub04n208 tagged to be overloaded: load_medium=1.000000 (no load adjustment) >= 1.0
>>>128995 25368 16384 queue parallel@sub04n121 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>128996 25368 16384 queue parallel@sub04n139 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128997 25368 16384 queue parallel@sub04n140 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128998 25368 16384 queue parallel@sub04n141 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>128999 25368 16384 queue parallel@sub04n142 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129000 25368 16384 queue parallel@sub04n143 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129001 25368 16384 queue parallel@sub04n144 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129002 25368 16384 queue parallel@sub04n146 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129003 25368 16384 queue parallel@sub04n02 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129004 25368 16384 queue parallel@sub04n03 tagged to be overloaded: load_avg=2.020000 (no load adjustment) >= 1.4
>>>129005 25368 16384 queue parallel@sub04n04 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129006 25368 16384 queue parallel@sub04n05 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129007 25368 16384 queue parallel@sub04n06 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129008 25368 16384 queue parallel@sub04n07 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>129009 25368 16384 queue parallel@sub04n08 tagged to be overloaded: load_avg=2.010000 (no load adjustment) >= 1.4
>>>129010 25368 16384 queue parallel@sub04n09 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129011 25368 16384 queue parallel@sub04n10 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129012 25368 16384 queue parallel@sub04n11 tagged to be overloaded: load_avg=2.000000 (no load adjustment) >= 1.4
>>>129013 25368 16384 verified threshold of 169 queues
>>>129014 25368 16384 STARTING PASS 1 WITH 0 PENDING JOBS
>>>129015 25368 16384 Not enrolled ja_tasks: 0
>>>129016 25368 16384 Enrolled ja_tasks: 1
>>>129017 25368 16384 Not enrolled ja_tasks: 0
>>>129018 25368 16384 Enrolled ja_tasks: 1
>>>129019 25368 16384 Not enrolled ja_tasks: 0
>>>129020 25368 16384 Enrolled ja_tasks: 1
>>>129021 25368 16384 Not enrolled ja_tasks: 0
>>>129022 25368 16384 Enrolled ja_tasks: 1
>>>129023 25368 16384 Not enrolled ja_tasks: 0
>>>129024 25368 16384 Enrolled ja_tasks: 1
>>>129025 25368 16384 Not enrolled ja_tasks: 0
>>>129026 25368 16384 Enrolled ja_tasks: 1
>>>129027 25368 16384 Not enrolled ja_tasks: 0
>>>129028 25368 16384 Enrolled ja_tasks: 1
>>>129029 25368 16384 Not enrolled ja_tasks: 0
>>>129030 25368 16384 Enrolled ja_tasks: 1
>>>129031 25368 16384 Not enrolled ja_tasks: 0
>>>129032 25368 16384 Enrolled ja_tasks: 1
>>>129033 25368 16384 Not enrolled ja_tasks: 0
>>>129034 25368 16384 Enrolled ja_tasks: 1
>>>129035 25368 16384 Not enrolled ja_tasks: 0
>>>129036 25368 16384 Enrolled ja_tasks: 1
>>>129037 25368 16384 Not enrolled ja_tasks: 0
>>>129038 25368 16384 Enrolled ja_tasks: 1
>>>129039 25368 16384 Not enrolled ja_tasks: 0
>>>129040 25368 16384 Enrolled ja_tasks: 1
>>>129041 25368 16384 Not enrolled ja_tasks: 0
>>>129042 25368 16384 Enrolled ja_tasks: 1
>>>129043 25368 16384 Not enrolled ja_tasks: 0
>>>129044 25368 16384 Enrolled ja_tasks: 1
>>>129045 25368 16384 Not enrolled ja_tasks: 0
>>>129046 25368 16384 Enrolled ja_tasks: 1
>>>129047 25368 16384 Not enrolled ja_tasks: 0
>>>129048 25368 16384 Enrolled ja_tasks: 1
>>>129049 25368 16384 Not enrolled ja_tasks: 0
>>>129050 25368 16384 Enrolled ja_tasks: 1
>>>129051 25368 16384 Not enrolled ja_tasks: 0
>>>129052 25368 16384 Enrolled ja_tasks: 1
>>>129053 25368 16384 STARTING PASS 2 WITH 0 PENDING JOBS
>>>129054 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>129055 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>129056 25368 16384 slot request assumed for static urgency is 20 for ,20-64 PE range due to PE's "mpi" setting "min"
>>>129057 25368 16384 slots: 1.000000 * 1000.000000 * 20 ---> 20000.000000
>>>129058 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>129059 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>129060 25368 16384 slots: 1.000000 * 1000.000000 * 6 ---> 6000.000000
>>>129061 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>129062 25368 16384 slots: 1.000000 * 1000.000000 * 1 ---> 1000.000000
>>>129063 25368 16384 slot request assumed for static urgency is 2 for ,2-8 PE range due to PE's "mpich_myri" setting "min"
>>>129064 25368 16384 slots: 1.000000 * 1000.000000 * 2 ---> 2000.000000
>>>129065 25368 16384 slots: 1.000000 * 1000.000000 * 8 ---> 8000.000000
>>>129066 25368 16384 ASU min = 1000.00000000000, ASU max = 20000.00000000000
>>>129067 25368 16384
>>>129068 25368 16384 no DDJU: do_usage: 1 finished_jobs 0
>>>129069 25368 16384
>>>129070 25368 16384 =====================[Pass 0]======================
>>>129071 25368 16384 =====================[Pass 1]======================
>>>129072 25368 16384 =====================[Pass 2]======================
>>>129073 25368 16384
>>>129074 25368 16384 no DDJU: do_usage: 0 finished_jobs 0
>>>129075 25368 16384
>>>129076 25368 16384 =====================[Pass 0]======================
>>>129077 25368 16384 =====================[Pass 1]======================
>>>129078 25368 16384 =====================[Pass 2]======================
>>>129079 25368 16384 Normalizing tickets using 0.000000/18.333333 as min_tix/max_tix
>>>129080 25368 16384 got 19 running jobs
>>>129081 25368 16384 added 19 ticket orders for running jobs
>>>129082 25368 16384 added 1 orders for updating usage of user
>>>129083 25368 16384 added 0 orders for updating usage of project
>>>129084 25368 16384 added 0 orders for updating share tree
>>>129085 25368 16384 added 1 orders for scheduler configuration
>>>129086 25368 16384 SENDING 22 ORDERS TO QMASTER
>>>129087 25368 16384 RESETTING BUSY STATE OF EVENT CLIENT
>>>129088 25368 16384 reresolve port timeout in 260
>>>129089 25368 16384 returning cached port value: 536
>>>--------------STOP-SCHEDULER-RUN-------------
>>>129090 25368 16384 ec_get retrieving events - will do max 20 fetches
>>>129091 25368 16384 doing sync fetch for messages, 20 still to do
>>>129092 25368 16384 try to get request from qmaster, id 1
>>>129093 25368 16384 Checking 154 events (44617-44770) while waiting for #44617
>>>129094 25368 16384 check complete, 154 events in list
>>>129095 25368 16384 got 154 events till 44770
>>>129096 25368 16384 doing async fetch for messages, 19 still to do
>>>129097 25368 16384 try to get request from qmaster, id 1
>>>129098 25368 16384 reresolve port timeout in 240
>>>129099 25368 16384 returning cached port value: 536
>>>129100 25368 16384 Sent ack for all events lower or equal 44770
>>>129101 25368 16384 ec_get - received 154 events
>>>129102 25368 16384 44617. EVENT MOD EXECHOST sub04n08
>>>129103 25368 16384 44618. EVENT MOD EXECHOST sub04n166
>>>129104 25368 16384 44619. EVENT MOD EXECHOST sub04n168
>>>129105 25368 16384 44620. EVENT MOD EXECHOST sub04n112
>>>129106 25368 16384 44621. EVENT MOD EXECHOST sub04n90
>>>129107 25368 16384 44622. EVENT JOB 21503.1 task 2.sub04n90 USAGE
>>>129108 25368 16384 44623. EVENT JOB 21503.1 task 1.sub04n90 USAGE
>>>129109 25368 16384 44624. EVENT MOD USER udo
>>>129110 25368 16384 44625. EVENT MOD USER iber
>>>129111 25368 16384 44626. EVENT MOD USER dieguez
>>>129112 25368 16384 44627. EVENT MOD USER karenjoh
>>>129113 25368 16384 44628. EVENT MOD USER lorenzo
>>>129114 25368 16384 44629. EVENT MOD USER parcolle
>>>129115 25368 16384 44630. EVENT MOD USER cfennie
>>>129116 25368 16384 44631. EVENT MOD USER civelli
>>>129117 25368 16384 44632. EVENT MOD EXECHOST sub04n14
>>>129118 25368 16384 44633. EVENT MOD EXECHOST sub04n75
>>>129119 25368 16384 44634. EVENT JOB 21040.1 task 6.sub04n75 USAGE
>>>129120 25368 16384 44635. EVENT JOB 21040.1 task 5.sub04n75 USAGE
>>>129121 25368 16384 44636. EVENT MOD EXECHOST sub04n150
>>>129122 25368 16384 44637. EVENT MOD EXECHOST sub04n169
>>>129123 25368 16384 44638. EVENT MOD EXECHOST sub04n165
>>>129124 25368 16384 44639. EVENT MOD EXECHOST sub04n136
>>>129125 25368 16384 44640. EVENT MOD EXECHOST sub04n176
>>>129126 25368 16384 44641. EVENT MOD EXECHOST sub04n81
>>>129127 25368 16384 44642. EVENT JOB 21507.1 task 6.sub04n81 USAGE
>>>129128 25368 16384 44643. EVENT JOB 21507.1 task 5.sub04n81 USAGE
>>>129129 25368 16384 44644. EVENT JOB 21507.1 task past_usage USAGE
>>>129130 25368 16384 44645. EVENT DEL PETASK 21507.1 task 6.sub04n88
>>>129131 25368 16384 44646. EVENT JOB 21507.1 task past_usage USAGE
>>>129132 25368 16384 44647. EVENT DEL PETASK 21507.1 task 6.sub04n78
>>>129133 25368 16384 44648. EVENT JOB 21507.1 task past_usage USAGE
>>>129134 25368 16384 44649. EVENT DEL PETASK 21507.1 task 6.sub04n81
>>>129135 25368 16384 44650. EVENT JOB 21507.1 task past_usage USAGE
>>>129136 25368 16384 44651. EVENT DEL PETASK 21507.1 task 5.sub04n81
>>>129137 25368 16384 44652. EVENT JOB 21507.1 task past_usage USAGE
>>>129138 25368 16384 44653. EVENT DEL PETASK 21507.1 task 5.sub04n88
>>>129139 25368 16384 44654. EVENT JOB 21507.1 task past_usage USAGE
>>>129140 25368 16384 44655. EVENT DEL PETASK 21507.1 task 5.sub04n78
>>>129141 25368 16384 44656. EVENT MOD EXECHOST sub04n161
>>>129142 25368 16384 44657. EVENT MOD EXECHOST sub04n124
>>>129143 25368 16384 44658. EVENT ADD PETASK 21507.1 task 7.sub04n88
>>>129144 25368 16384 44659. EVENT ADD PETASK 21507.1 task 7.sub04n78
>>>129145 25368 16384 44660. EVENT MOD EXECHOST sub04n158
>>>129146 25368 16384 44661. EVENT MOD EXECHOST sub04n01
>>>129147 25368 16384 44662. EVENT MOD EXECHOST sub04n159
>>>129148 25368 16384 44663. EVENT ADD PETASK 21507.1 task 7.sub04n81
>>>129149 25368 16384 44664. EVENT MOD EXECHOST sub04n134
>>>129150 25368 16384 44665. EVENT ADD PETASK 21507.1 task 8.sub04n88
>>>129151 25368 16384 44666. EVENT ADD PETASK 21507.1 task 8.sub04n78
>>>129152 25368 16384 44667. EVENT ADD PETASK 21507.1 task 8.sub04n81
>>>129153 25368 16384 44668. EVENT MOD EXECHOST sub04n121
>>>129154 25368 16384 44669. EVENT MOD EXECHOST sub04n143
>>>129155 25368 16384 44670. EVENT MOD EXECHOST sub04n15
>>>129156 25368 16384 44671. EVENT MOD EXECHOST sub04n13
>>>129157 25368 16384 44672. EVENT MOD EXECHOST sub04n64
>>>129158 25368 16384 44673. EVENT JOB 21542.1 task 2.sub04n64 USAGE
>>>129159 25368 16384 44674. EVENT JOB 21542.1 task 1.sub04n64 USAGE
>>>129160 25368 16384 44675. EVENT MOD EXECHOST sub04n118
>>>129161 25368 16384 44676. EVENT MOD EXECHOST sub04n151
>>>129162 25368 16384 44677. EVENT MOD EXECHOST sub04n154
>>>129163 25368 16384 44678. EVENT MOD EXECHOST sub04n149
>>>129164 25368 16384 44679. EVENT MOD EXECHOST sub04n16
>>>129165 25368 16384 44680. EVENT MOD EXECHOST sub04n155
>>>129166 25368 16384 44681. EVENT MOD EXECHOST sub04n152
>>>129167 25368 16384 44682. EVENT MOD EXECHOST sub04n163
>>>129168 25368 16384 44683. EVENT MOD EXECHOST sub04n43
>>>129169 25368 16384 44684. EVENT MOD EXECHOST sub04n86
>>>129170 25368 16384 44685. EVENT JOB 21423.1 task 2.sub04n86 USAGE
>>>129171 25368 16384 44686. EVENT JOB 21423.1 task 1.sub04n86 USAGE
>>>129172 25368 16384 44687. EVENT MOD EXECHOST sub04n03
>>>129173 25368 16384 44688. EVENT JOB 21076.1 USAGE
>>>129174 25368 16384 44689. EVENT MOD EXECHOST sub04n204
>>>129175 25368 16384 44690. EVENT MOD EXECHOST rupc01.rutgers.edu
>>>129176 25368 16384 44691. EVENT MOD EXECHOST sub04n125
>>>129177 25368 16384 44692. EVENT MOD EXECHOST sub04n44
>>>129178 25368 16384 44693. EVENT MOD EXECHOST sub04n32
>>>129179 25368 16384 44694. EVENT MOD EXECHOST sub04n21
>>>129180 25368 16384 44695. EVENT MOD EXECHOST sub04n22
>>>129181 25368 16384 44696. EVENT MOD EXECHOST sub04n35
>>>129182 25368 16384 44697. EVENT MOD EXECHOST sub04n201
>>>129183 25368 16384 44698. EVENT MOD EXECHOST sub04n205
>>>129184 25368 16384 44699. EVENT JOB 21440.1 USAGE
>>>129185 25368 16384 44700. EVENT MOD EXECHOST sub04n111
>>>129186 25368 16384 44701. EVENT MOD EXECHOST sub04n89
>>>129187 25368 16384 44702. EVENT JOB 21530.1 task 2.sub04n89 USAGE
>>>129188 25368 16384 44703. EVENT JOB 21530.1 task 1.sub04n89 USAGE
>>>129189 25368 16384 44704. EVENT JOB 21530.1 USAGE
>>>129190 25368 16384 44705. EVENT MOD EXECHOST sub04n177
>>>129191 25368 16384 44706. EVENT MOD EXECHOST sub04n146
>>>129192 25368 16384 44707. EVENT ADD PETASK 21507.1 task 9.sub04n88
>>>129193 25368 16384 44708. EVENT JOB 21507.1 task past_usage USAGE
>>>129194 25368 16384 44709. EVENT DEL PETASK 21507.1 task 7.sub04n88
>>>Segmentation fault
>>>You have new mail in /var/spool/mail/root
>>>rupc-cs04b:/opt/SGE/util #
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>/opt/SGE/default/spool/qmaster
>>>
>>>Sun May 22 14:25:16 EDT 2005
>>>05/22/2005 00:20:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>05/22/2005 00:32:40|qmaster|rupc-cs04b|W|job 21538.1 failed on host sub04n63 in recognising job because: execd doesn't know this job
>>>05/22/2005 00:32:49|qmaster|rupc-cs04b|E|execd sub04n63 reports running state for job (21538.1/master) in queue "myrinet at sub04n63" while job is in state 65536
>>>05/22/2005 00:33:49|qmaster|rupc-cs04b|E|execd at sub04n63 reports running job (21538.1/master) in queue "myrinet at sub04n63" that was not supposed to be there - killing
>>>05/22/2005 02:10:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "udo"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "iber"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "dieguez"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "zayak"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "karenjoh"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "lorenzo"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "parcolle"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "cfennie"
>>>05/22/2005 02:30:26|qmaster|rupc-cs04b|E|orders user/project version (1035) is not uptodate (1036) for user/project "civelli"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "udo"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "iber"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "dieguez"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "zayak"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "karenjoh"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "lorenzo"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "parcolle"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "cfennie"
>>>05/22/2005 02:34:06|qmaster|rupc-cs04b|E|orders user/project version (1044) is not uptodate (1045) for user/project "civelli"
>>>05/22/2005 03:02:47|qmaster|rupc-cs04b|E|tightly integrated parallel task 21539.1 task 3.sub04n83 failed - killing job
>>>05/22/2005 03:10:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update <-- YOU SEE THESE 2 lines : THE SCHEDULER DIED EVEN WITHOUT ANY EVENTS , JUST by itself !!!
>>>05/22/2005 07:30:01|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update
>>>05/22/2005 11:11:39|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update <-- BEFORE THE LAST CRASH
>>>05/22/2005 14:07:53|qmaster|rupc-cs04b|E|tightly integrated parallel task 21507.1 task 10.sub04n88 failed - killing job <-- THIS IS WHAT TRIGGERED the CRASH
>>>05/22/2005 14:09:14|qmaster|rupc-cs04b|W|job 21507.1 failed on host sub04n78 assumedly after job because: job 21507.1 died through signal TERM (15)
>>>05/22/2005 14:10:00|qmaster|rupc-cs04b|E|event client "scheduler" (rupc-cs04b/schedd/1) reregistered - it will need a total update <- SCHEDULER START AFTER THE CRASH
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>SCHEDULER messages BELOW
>>>
>>>05/22/2005 00:20:01|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 02:10:01|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 02:30:26|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
>>>05/22/2005 02:31:10|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 02:34:06|schedd|rupc-cs04b|I|controlled shutdown 6.0u3
>>>05/22/2005 02:40:00|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 03:10:01|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 07:30:01|schedd|rupc-cs04b|I|starting up 6.0u3
>>>05/22/2005 11:11:39|schedd|rupc-cs04b|I|starting up 6.0u3 <--- before the last crash (I started debug mode)
>>>05/22/2005 14:10:00|schedd|rupc-cs04b|I|starting up 6.0u3 <--- AFTER the last crash
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>------------------------------------------------------------------------
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>>
>>>