[GE users] Setting memlock limit with SGE 6.2; was: Re: [GE users] worth a wiki entry for SGE with OpenMPI and Infiniband

Andy Schwierskott andy.schwierskott at sun.com
Mon Jul 21 09:21:55 BST 2008


Just a side note regarding the 'memlock' resource limit issues which are
reported here sometimes: For SGE 6.2 (it's not part of the Beta and Beta
refresh however) we added as a last minute feature the ability to configure
the memlock limit and a few others on the execd level, i.e. via the
'execd_params' cluster config setting.

Full background: in the SGE queue config you can configure most but not all
Unix resource limits, like CPU time, max. virtual memory and so on. There are
a few others, like the maximum file descriptor limit, which exist on virtually
all OS'es, and some which exist on just one or a few OS'es (like the "memlock"
limit). It was too late to extend the queue configuration, and we found the
workaround of configuring these limits indirectly by hacking the SGE execd
startup scripts (this is the chain through which the job inherits such limits
if they are not set) too implicit and error prone. Therefore we decided to
enable an admin to set such limits via the execd_params setting.

It's not a 100% perfect solution: it's an execd setting valid for all jobs
running in all queues on that host, and it does not provide a way to use
the configured system wide limits set e.g. in /etc/security/limits.conf on
Linux. Nevertheless it's much better than having to edit the job scripts
or the execd startup scripts, which could get overwritten by an update and
would not take effect if the execd is started directly for testing purposes,
e.g. in debug mode.

For the interested reader here's an excerpt from the SGE 6.2 sge_conf(5) man
page which describes the syntax and semantics of these settings:

           Specifies soft and hard resource limits as  implemented
           by  the  setrlimit(2) system call. See this manual page
           on your system for more information.  These  parameters
           complete  the list of limits set by the RESOURCE LIMITS
           parameter of the queue configuration  as  described  in
           queue_conf(5).  Unlike the resource limits in the queue
           configuration, these resource limits are set for  every
           job  on  this  execution host. If a value is not speci-
           fied, the resource limit is inherited from  the  execu-
           tion   daemon  process.  Because  this  would  lead  to
           unpredicted results, if only one limit of a resource is
           set  (soft  or  hard), the corresponding other limit is
           set to the same value.
           S_DESCRIPTORS and H_DESCRIPTORS  specify  a  value  one
           greater  than  the  maximum file descriptor number that
           can be opened by any process of a job.
           S_MAXPROC and H_MAXPROC specify the maximum  number  of
           processes  that  can be created by the job user on this
           execution host
           S_MEMORYLOCKED and H_MEMORYLOCKED specify  the  maximum
           number  of  bytes  of virtual memory that may be locked
           into RAM.
           S_LOCKS and H_LOCKS specify the maximum number of  file
           locks any process of a job may establish.
           All of these values can be specified using  the  multi-
           plier letters k, K, m, M, g and G, see sge_types(1) for

So you would simply set

   execd_params H_MEMORYLOCKED=unlimited

to set the soft and hard Linux "memlock" limit to unlimited.
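
The reason one H_ entry is enough is the pairing rule from the man page excerpt
above: if only one of soft/hard is given, the other gets the same value. A
minimal sketch of that rule in shell follows. The function name effective_limits
is made up for illustration and is not part of SGE, and splitting the
execd_params value on commas and blanks is an assumption on my side:

```shell
#!/bin/sh
# effective_limits PARAMS NAME   (hypothetical helper, not an SGE tool)
#   PARAMS: an execd_params value, e.g. "H_MEMORYLOCKED=unlimited"
#   NAME:   limit name without the S_/H_ prefix, e.g. MEMORYLOCKED
# Prints the soft and hard value that would apply under the pairing rule;
# "inherited" means neither was set, so the execd's own limit is passed on.
effective_limits() {
    params=$1
    name=$2
    soft=
    hard=
    # Assumption: entries are separated by commas and/or blanks.
    for kv in $(printf '%s' "$params" | tr ',' ' '); do
        case $kv in
            "S_$name="*) soft=${kv#*=} ;;
            "H_$name="*) hard=${kv#*=} ;;
        esac
    done
    # Pairing rule: a missing soft or hard value copies the other one.
    if [ -z "$soft" ]; then soft=$hard; fi
    if [ -z "$hard" ]; then hard=$soft; fi
    printf 'soft=%s hard=%s\n' "${soft:-inherited}" "${hard:-inherited}"
}

effective_limits "H_MEMORYLOCKED=unlimited" MEMORYLOCKED
```

This only models the rule as described in the excerpt; the real parsing is done
by the execd and may differ in details.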

On OS'es which do not support one of these limits the setting will be
silently ignored.

There's still a gotcha: if you use the old interactive job support rather than
the new builtin default (i.e. qrsh without a command, which calls the system
rlogind; qlogin, which uses the system telnetd; and likely ssh(d)), the SGE
setting would get overridden, since those daemons adhere to
/etc/security/limits.conf on Linux. They are started after the shepherd sets
those limits.
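
To see which limits a given session or job really ends up with, it may help to
print both the soft and the hard value from inside it. A small sketch, assuming
a shell whose ulimit builtin knows the -l option (bash does); the function name
report_memlock is only an illustration:

```shell
#!/bin/sh
# report_memlock (hypothetical helper): print the soft and hard locked-memory
# limit of the current shell, in the units ulimit reports (kilobytes in bash).
report_memlock() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

Run from a job script or an interactive session, the output shows whether the
execd_params value survived or was replaced by what the login daemon applied.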

For SGE 6.1 and earlier the best workaround in my opinion is to set those
limits in the execd startup script. At least this eliminates differences in
behavior between an execd started at system boot time and one started later
by an interactively logged in root user. As stated above, care has to be taken
when the execd startup script is changed, a new execd is installed, or the
execd is started directly without using the startup script.
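
As a rough sketch of that workaround: the fragment below could sit near the top
of the execd startup script, before the execd itself is launched. The function
name raise_memlock and the messages are my own illustration, not part of the
shipped script, and it assumes a shell whose ulimit supports -l:

```shell
#!/bin/sh
# raise_memlock (hypothetical helper): try to lift the locked-memory limit for
# this shell, so that an execd started from it, and its jobs, inherit it.
raise_memlock() {
    if ulimit -l unlimited 2>/dev/null; then
        echo "memlock limit is now: $(ulimit -l)"
    else
        # Raising the hard limit normally needs root; report and carry on.
        echo "could not raise memlock limit, still: $(ulimit -l)" >&2
        return 1
    fi
}

raise_memlock || true
```

Reporting the failure instead of silently continuing makes it easier to spot the
case where the script is run by a user who is not allowed to raise the limit.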


On Sun, 20 Jul 2008, John Leidel wrote:

> I second Joe's motion.  I've done this for quite some time manually by
> creating a set of startup/pre/post wrapper scripts such that...
> for a in "$SGE_ROOT"/scripts/pre/*; do
>    # run each script in turn; a bare "exec" would replace the shell after the first one
>    [ -x "$a" ] && "$a"
> done
> ....blah blah blah
> cheers
> john
> On Sun, Jul 20, 2008 at 9:43 AM, Joe Landman
> <landman at scalableinformatics.com> wrote:
>> Hi folks
>>  On a related note, for this same cluster, we were using infiniband. One of
>> the issues with OpenMPI and SGE is that the maximum locked memory (on linux)
>> is set way too low for Infiniband, and it can't lock enough memory.  You can
>> "fix" this with settings in /etc/security/limits.conf, simply add these two
>> lines to the file
>>        *               soft    memlock unlimited
>>        *               hard    memlock unlimited
>> However, it appears that this works for running OpenMPI over Infiniband apps
>> by hand, but not through SGE.  I found that I needed to insert an
>>        ulimit -l unlimited
>> in the SGE execd run script, right near the top, or
>>        qrsh ulimit -l
>> would always return 32 (kilobytes), and the Infiniband based job wouldn't
>> run.
>> I would like to suggest including a line like this in your execd startup
>> script.
>> For the SGE developers, if you could include an environment
>> startup/scripting/tweaking section right before you fire off the main
>> sgeexecd process, this could help with other (future) issues like this.
>>  Might be worth creating an $SGE/execd_environment directory to contain the
>> scripts/settings we need.
>> Just a thought.
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>>       http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
