Opened 8 years ago

Last modified 3 years ago

#1301 new enhancement

more sophisticated memory accounting

Reported by: dlove Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2u5
Severity: minor Keywords:
Cc:

Description

Reconsider how compute node memory accounting is done, and try to distinguish
shared and mmapped memory, at least on Linux, to avoid counting the same pages
multiple times. Code from top (http://www.unixtop.org) is apparently used
currently, but from a fairly old version, and I'm not sure where it lurks.
Check the current top for useful general updates, and
http://procps.sourceforge.net/ (LGPL) for Linux-based systems. We can probably
get the relevant information from Linux's /proc.

See also #62.
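
For illustration, the multiple counting could be avoided on Linux by summing
the Pss ("proportional set size") fields from /proc/<pid>/smaps, which charge
each shared page fractionally across the processes mapping it. A minimal
sketch, assuming a kernel recent enough to report Pss; this is not the code
the PDC currently uses:

{{{
/* Sketch only: sum Pss from /proc/<pid>/smaps so that shared and
 * mmapped pages are charged fractionally instead of once per process.
 * Assumes a kernel recent enough to report Pss; minimal error handling. */
#include <stdio.h>
#include <sys/types.h>

/* Return total Pss in kB for pid, or -1 if smaps can't be read. */
long pss_kb(pid_t pid)
{
    char path[64], line[256];
    long total = 0, v;
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%d/smaps", (int)pid);
    if ((fp = fopen(path, "r")) == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL)
        if (sscanf(line, "Pss: %ld kB", &v) == 1)
            total += v;             /* one Pss line per mapping */
    fclose(fp);
    return total;
}
}}}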

Attachments (1)

0001-Added-execd_params-RLIMIT_VMEM-configuration-option.patch (4.5 KB) - added by markdixon 3 years ago.


Change History (11)

comment:1 Changed 7 years ago by dlove

Fixed on Linux by [4410].
May need attention on other systems, so not closed.

comment:2 Changed 6 years ago by markdixon

Hi Dave,

I think [4410] is only a partial fix for this issue.

As you know, gridengine uses two mechanisms for limiting memory usage in response to setting h_vmem:

1) The PDC (corrected in [4410])
2) Shepherd setting a process limit (daemons/shepherd/setrlimits.c)

That second mechanism is apparently still in play as of 8.1.3 and, at least on Linux, uses RLIMIT_AS. This causes many operations that have little effect on real memory usage to fail.

For example, trying to mmap a large file into a process address space fails.

This sort of thing can lead to a mystery segfault at worst, or interesting application error messages at best. What it doesn't do is result in informative gridengine messages.
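
A standalone demo of the failure mode (illustrative, not gridengine code):
once RLIMIT_AS is capped, a large read-only mmap fails with ENOMEM even
though it would consume hardly any real memory:

{{{
/* Demo: cap RLIMIT_AS as shepherd does, then mmap a large file
 * read-only.  The mapping needs address space but almost no real
 * memory, yet mmap fails with ENOMEM. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    struct rlimit rl = { .rlim_cur = 256UL << 20, .rlim_max = 256UL << 20 };
    int fd;
    void *p;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <large-file>\n", argv[0]);
        return 2;
    }
    setrlimit(RLIMIT_AS, &rl);          /* 256 MiB, as h_vmem might set */
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror("open");
        return 1;
    }
    /* Try to map 1 GiB of the file: fails under the 256 MiB cap. */
    p = mmap(NULL, 1UL << 30, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        fprintf(stderr, "mmap: %s\n", strerror(errno));
    return 0;
}
}}}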

I've previously posted patches that introduced a new complex to control address space (h_as / s_as) and decouple the two mechanisms but, personally, I'm currently leaning towards ditching control of RLIMIT_AS entirely.

What do you think?

Mark

comment:3 follow-up: Changed 6 years ago by dlove

Mark Dixon <m.c.dixon@…> writes:

> Hi Dave,
>
> I think [4410] is only a partial fix for this issue.

It fixes the accounting, strictly, I think, but point taken.

> As you know, gridengine uses two mechanisms for limiting memory usage
> in response to setting h_vmem:
>
> 1) The PDC (corrected in [4410])
> 2) Shepherd setting a process limit (daemons/shepherd/setrlimits.c)

> That second mechanism is apparently still in play as of 8.1.3 and, at
> least on Linux, uses RLIMIT_AS. This causes many operations that have
> little effect on real memory usage to fail.
>
> For example, trying to mmap a large file into a process address space
> fails.

Right -- conscious of mmap since SunOS 4 days.

> This sort of thing can lead to a mystery segfault at worst, or
> interesting application error messages at best. What it doesn't do is
> result in informative gridengine messages.

Do you get informative (accounting) messages ever? Normally when jobs
fail here, they're less than helpful, even when killed by execd
("assumedly after job"), and it seems the same on polaris. I was going
to tart up qacct-summary, but figured I should really fix the basic
reporting, where the initial useful reporting state is trashed later.
Did you look into that by any chance?

> I've previously posted patches that introduced a new complex to
> control address space (h_as / s_as) and decouple the two mechanisms
> but, personally, I'm currently leaning towards ditching control of
> RLIMIT_AS entirely.
>
> What do you think?

Yes, I intend to allow turning off the rlimit (mixed up with other
changes that haven't got finished yet). I'd hoped to avoid a new
comms/spooling version, though. Would you be happy with an execd
parameter, or does it really need to be done queue-wise?

comment:4 Changed 6 years ago by dlove

> I was going
> to tart up qacct-summary, but figured I should really fix the basic
> reporting, where the initial useful reporting state is trashed later.
> Did you look into that by any chance?

Never mind. I think I did enough while looking at something else.

comment:5 Changed 6 years ago by Dave Love <d.love@…>

In 4564/sge:

Report SSTATE_QMASTER_ENFORCED_LIMIT in accounting when appropriate
Refs #1301

comment:6 Changed 6 years ago by markdixon

On Fri, 26 Jul 2013, Dave Love wrote:
...

> Do you get informative (accounting) messages ever? Normally when jobs
> fail here, they're less than helpful, even when killed by execd
> ("assumedly after job"), and it seems the same on polaris. I was going
> to tart up qacct-summary, but figured I should really fix the basic
> reporting, where the initial useful reporting state is trashed later.
> Did you look into that by any chance?

Yes you do, in the execd log. You get something like this:

11/15/2012 19:28:32| main|g11s13n4|W|job 2727.1 exceeds job hard limit "h_vmem" of queue "polaris1.q@…" (24594026496.00000 > limit:15032385536.00000) - sending SIGKILL

I'm trying to recall what I did with qacct-summary, but I think that if it finds an "assumedly after job" in the qmaster message file, it greps for this through the execd message files.

...

> Yes, I intend to allow turning off the rlimit (mixed up with other
> changes that haven't got finished yet). I'd hoped to avoid a new
> comms/spooling version, though. Would you be happy with an execd
> parameter, or does it really need to be done queue-wise?

Personally, I'm happy with an execd parameter now.

I think this all got mixed up with me wanting a different name for the new way to measure memory, to try to force our local users into re-evaluating how much memory their jobs required. I gave up on that after looking at the execd code - and realising that our users would just copy and paste the old values anyway.

Cheers,

Mark
--
Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


comment:7 Changed 6 years ago by dlove

>> Do you get informative (accounting) messages ever? Normally when
>> jobs fail here, they're less than helpful, even when killed by execd
>> ("assumedly after job"), and it seems the same on polaris. I was
>> going to tart up qacct-summary, but figured I should really fix the
>> basic reporting, where the initial useful reporting state is trashed
>> later. Did you look into that by any chance?

> Yes you do, in the execd log. You get something like this:
>
> 11/15/2012 19:28:32| main|g11s13n4|W|job 2727.1 exceeds job hard limit "h_vmem" of queue "polaris1.q@…" (24594026496.00000 > limit:15032385536.00000) - sending SIGKILL

Yes, but I expect to see something useful in the accounting data that's
conveniently available via qacct. It now shows

failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit
...
ru_wallclock 7s
...
category ... -l h_rt=5 ...

for the job, or for the PE master task with accounting_summary false (in
which case you still see the "assumedly after job" for the other tasks,
which may not be ideal).

comment:8 in reply to: ↑ 3 Changed 3 years ago by markdixon

Replying to dlove:

> Yes, I intend to allow turning off the rlimit (mixed up with other
> changes that haven't got finished yet). I'd hoped to avoid a new
> comms/spooling version, though. Would you be happy with an execd
> parameter, or does it really need to be done queue-wise?

Hi Dave,

I don't know if you have this ready or not, but is the attached patch useful?

Cheers,

Mark
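
(Going by the attachment's filename, the patch adds an execd_params option
called RLIMIT_VMEM. Assuming that, disabling the shepherd's rlimit might look
something like the following in the host or global configuration; the exact
option name and semantics are defined by the patch itself:)

{{{
# Hypothetical sketch based only on the attachment's filename --
# exact option name and semantics are defined by the patch itself.
execd_params    RLIMIT_VMEM=false
}}}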

comment:10 Changed 3 years ago by dlove

I'm unsure what to do about this, but I think it needs per-job control
of the behaviour. It seems to me that in most cases the h_vmem is OK,
and allows jobs to abort sanely when they hit the limit (unlike cgroups,
sigh), but it falls down in some cases with mmap, at least.

I wonder if the job can be done with a convention for juggling h_vmem
and s_vmem, but I didn't convince myself originally, and haven't thought
about it for some time. What do you think?

comment:11 Changed 3 years ago by markdixon

On Tue, 6 Sep 2016, Dave Love wrote:

> I'm unsure what to do about this, but I think it needs per-job control
> of the behaviour. It seems to me that in most cases the h_vmem is OK,
> and allows jobs to abort sanely when they hit the limit (unlike cgroups,
> sigh), but it falls down in some cases with mmap, at least.
>
> I wonder if the job can be done with a convention for juggling h_vmem
> and s_vmem, but I didn't convince myself originally, and haven't thought
> about it for some time. What do you think?

I couldn't find a way of getting h_vmem / s_vmem juggling to work either: they already have fairly well-described purposes, even though the underlying vmem measurement method isn't documented.

Must say though: we've had _plenty_ of queries about why memory doesn't go as far as it should, but none about why a job failed after trying to use memory it had successfully malloc'd. Aren't most of us used to OOM killers by now? If it's critical, don't people allocate and initialise memory instead of just allocating?
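
(A sketch of that "allocate and initialise" idiom: touching the pages up
front makes memory pressure visible at start-up rather than as a mystery
OOM kill partway through the run:)

{{{
/* Sketch of "allocate and initialise": fault every page in at
 * allocation time, so under Linux overcommit any shortfall shows
 * up immediately rather than mid-run. */
#include <stdlib.h>
#include <string.h>

void *alloc_touched(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        memset(p, 0, n);    /* touch the pages now */
    return p;
}
}}}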

But if it helps, I wrote a patch for per-job control (by introducing new h_as / s_as controls) instead of per-execd some time ago. I can dust that off if you like.

Mark
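
(For illustration, assuming the h_as / s_as complexes from those patches set
only the address-space rlimit while h_vmem stays with the PDC accounting, a
job with large read-only mappings might request something like the following.
Names and semantics come from the unmerged patches, not from any released
gridengine:)

{{{
# Hypothetical, using the h_as complex from the unmerged patches:
# schedule and account against 4G of real memory, but permit 64G of
# address space for large read-only mmaps.
qsub -l h_vmem=4G,h_as=64G myjob.sh
}}}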
