[GE users] Bug causing 100% memory usage?
richard.hobbs at crl.toshiba.co.uk
Tue Mar 22 16:22:11 GMT 2005
Mainly because of the old saying "if it ain't broke, don't fix it"... We
didn't deem the consumable issues serious enough to warrant upgrading, but
if this memory usage issue is a bug, we may have to.
Please see attached txt file for "qstat -f mem_slot" output.
How easy is it to upgrade to the latest 5 release? How easy is it to upgrade
to the latest 6 release?
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Speech Technology Group
Email: richard.hobbs at crl.toshiba.co.uk
Tel: +44 1223 376964 Mobile: +44 7811 803377
> -----Original Message-----
> From: owner-it at mail.crl.toshiba.co.uk
> [mailto:owner-it at mail.crl.toshiba.co.uk] On Behalf Of Stephan
> Grell - Sun Germany - SSG - Software Engineer
> Sent: 22 March 2005 16:14
> To: users at gridengine.sunsource.net
> Cc: it at crl.toshiba.co.uk
> Subject: Re: [GE users] Bug causing 100% memory usage?
> Hello Richard,
> is there a reason why you are using 5.3p1 and did not upgrade to
> p6 or even SGE 6u3?
> I did a quick query for the two bug descriptions and did not find an
> issue. However, I did not see these issues in 5.4p6 and/or SGE 6u3.
> One could assume that they are fixed in the newer releases.
> If not, a more detailed description of your configuration and of what
> the jobs request would be very helpful. The qstat -F XXX output for the
> problematic consumable would also be very helpful.
> Kind Regards,
> Richard Hobbs wrote:
> >We are using SGE 5.3p1 at the moment, and while the GUI isn't great, it
> >is working for us, apart from a couple of bugs...
> >The first bug is not *that* serious. We have set up a consumable resource
> >on each execution host whose initial value equals the number of CPUs in
> >the machine. Each host may have 10 or more queues, but if it only has two
> >CPUs, it should only ever be able to run 2 jobs or fewer. However, when
> >lots of jobs are submitted at once, the machine sometimes ends up running
> >more than 2 jobs, and the remaining amount of the consumable resource
> >drops to -1, or sometimes even -2 or -3. Is there a way we can stop this
> >from happening? Is it a known bug for which there is a fix in a later
> >release?
> >The second bug is more serious. We have around 38 machines, with a total
> >of just over 100 consumable resources (2 or 4 queues in use simultaneously
> >per machine). However, one of our users accidentally submitted 1021 jobs
> >yesterday, and as a result the qmaster process started eating memory
> >until all of the RAM and swap was full; the machine then simply hung and
> >had to be hard-rebooted. The machine has 1GB RAM and 2GB swap.
> >Is this also a bug, or was the qmaster legitimately asking for more than
> >3GB of RAM in order to complete its scheduling operation? If it is a bug,
> >is there a fix or a patch?
> >Also, if there is any information or log files I can send you to help
> >you diagnose this issue, please let me know.
> >Thanks in advance,
> >Richard Hobbs.
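The first symptom is a per-host consumable that starts at the CPU count and drops to -1, -2 or -3. One possible way a counter ends up below zero is for several jobs to be admitted against the same stale view of it before any of them is charged.

The Python sketch below illustrates that pattern. It is a hypothetical model, not Grid Engine's scheduler code. The `Host` class and both dispatch functions are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical execution host with one consumable whose initial
    value equals its CPU count (illustration only, not SGE code)."""
    name: str
    cpus: int
    available: int = field(init=False)

    def __post_init__(self) -> None:
        # The consumable starts out equal to the number of CPUs.
        self.available = self.cpus


def dispatch_from_snapshot(host: Host, jobs: int) -> int:
    """Faulty pattern: every job is checked against the same snapshot
    of the consumable, and all admitted jobs are charged afterwards."""
    snapshot = host.available
    admitted = [job for job in range(jobs) if snapshot >= 1]
    host.available -= len(admitted)  # can drive the counter below zero
    return len(admitted)


def dispatch_with_booking(host: Host, jobs: int) -> int:
    """Safe pattern: check and decrement together, one job at a time,
    so the counter can never go below zero."""
    started = 0
    for _ in range(jobs):
        if host.available < 1:
            break
        host.available -= 1
        started += 1
    return started


if __name__ == "__main__":
    a = Host("node01", cpus=2)
    print(dispatch_from_snapshot(a, 4), a.available)  # 4 -2
    b = Host("node02", cpus=2)
    print(dispatch_with_booking(b, 4), b.available)   # 2 0
```

With two CPUs and four jobs arriving at once, the snapshot version starts all four and leaves the consumable at -2. The booking version starts only two and stops at 0.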
[ Part 2, Text/PLAIN (Name: "mem_slot.txt") ~1,877 lines. ]