Opened 11 years ago

Last modified 9 years ago

#637 new defect

IZ2916: qrsh large memory consumption in IA64

Reported by: jlopez Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2u1
Severity: Keywords: execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2916]

Issue #:              2916
Summary:              qrsh large memory consumption in IA64
Component:            gridengine
Subcomponent:         execution
Version:              6.2u1
Platform:             All
OS:                   All
Status:               NEW
Resolution:
Priority:             P3
Issue type:           DEFECT
Target milestone:     ---
Reporter:             jlopez (jlopez)
Assigned to:          pollinger (pollinger)
QA Contact:           pollinger
CC:                   None defined
URL:
Status whiteboard:
Attachments:
Issue 2916 blocks:
Votes for issue 2916:


   Opened: Mon Feb 16 07:57:00 -0700 2009 
------------------------


We have found that qrsh processes using the builtin method use more than
500MB of virtual memory per process on our IA64 cluster.

This means that memory consumption on the MASTER node increases rapidly
as the number of slaves increases.

Here is an example:
18481 aurelio   15   0  519m 4128 3440 S    0  0.0   0:00.02 qrsh
18482 aurelio   15   0  519m 4128 3440 S    0  0.0   0:00.01 qrsh
18475 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18476 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.02 qrsh
18477 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18478 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18479 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18480 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18483 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.02 qrsh
18484 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18485 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18486 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18487 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.02 qrsh
18488 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.01 qrsh
18489 aurelio   15   0  519m 3968 3296 S    0  0.0   0:00.00 qrsh

And here is the same job resubmitted, but using ssh to launch the processes:
19560 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19561 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19562 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.03 ssh
19563 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.03 ssh
19564 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19565 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19566 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19567 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19568 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19569 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.01 ssh
19570 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.02 ssh
19571 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.03 ssh
19572 aurelio   15   0 12240 5152 3920 S    0  0.0   0:00.04 ssh
19573 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.03 ssh
19574 aurelio   16   0 12240 5152 3920 S    0  0.0   0:00.03 ssh

As can be seen, in the first case the virtual memory consumed by the
job increases by more than 7GB (15 qrsh processes at 519MB each, against
about 12MB each with ssh).
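
The 7GB figure can be checked from the two listings above. This is only a minimal sketch: it takes the virtual memory column as shown (519m per qrsh process, 12240KB per ssh process) and the 15 rows in each listing, and the variable and function names are purely illustrative.

```python
# Figures copied from the two top listings in this ticket (virtual memory column).
QRSH_VIRT_MB = 519     # each qrsh process, shown as "519m"
SSH_VIRT_KB = 12240    # each ssh process, shown in KB
PROCESSES = 15         # rows in each listing


def total_virt_mb(per_process_mb, count):
    """Total virtual memory in MB for `count` identical processes."""
    return per_process_mb * count


qrsh_total_mb = total_virt_mb(QRSH_VIRT_MB, PROCESSES)
ssh_total_mb = total_virt_mb(SSH_VIRT_KB / 1024, PROCESSES)
extra_gb = (qrsh_total_mb - ssh_total_mb) / 1024

print(f"qrsh total: {qrsh_total_mb} MB")
print(f"ssh total:  {ssh_total_mb:.0f} MB")
print(f"difference: {extra_gb:.1f} GB")
```

The difference grows linearly with the number of slave processes, which is why the master node is the one that runs short of memory.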


In some cases the problem is even worse: some qrsh processes consume
4GB of virtual memory after running for several hours:
25141 csedamsp  15   0 4104m 3984 3296 S    0  0.0   0:00.00 qrsh
25142 csedamsp  16   0 4104m 3984 3296 S    0  0.0   0:00.01 qrsh
25140 csedamsp  15   0 4103m 3968 3296 S    0  0.0   0:00.01 qrsh
25143 csedamsp  15   0 4103m 3968 3296 S    0  0.0   0:00.02 qrsh

We tried recompiling qrsh with the Intel compiler and got the same behavior.

   ------- Additional comments from jlopez Tue Mar 17 03:38:37 -0700 2009 -------
I think we have found the reason for the large memory consumption of qrsh
using the builtin method on IA64: we are using a stack limit of 256MB.

These are the facts and conclusions that I think could be helpful to others
experiencing similar problems:
- qrsh uses threads: 3 in general and 4 if it has to expand the builtin
shell (this can be avoided with the -noshell option). Bash is not
recognised as a supported shell, so if the rsh wrapper is written in
bash, qrsh expands the builtin shell unless the -noshell option is
given.
- Memory consumption is about twice the stack limit set on the system
(ulimit -s). If the stack is unlimited, a reference value of 32MB seems
to be used on IA64.
- Optimum memory consumption is obtained with a stack limit of 1MB
(ulimit -s 1024). In this case qrsh consumes 4MB if no builtin shell is
expanded, or 10MB if the shell is expanded.
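
The "about twice the stack limit" observation can be turned into a rough estimate. This is a sketch rather than a description of qrsh internals: the factor of two and the 32MB reference for an unlimited stack are taken from the points above, the function names are hypothetical, and the current limit is read with Python's standard `resource` module.

```python
import resource

# Reference value reported in this ticket for an unlimited stack on IA64.
IA64_UNLIMITED_REFERENCE_KB = 32 * 1024


def estimated_qrsh_virt_kb(stack_limit_kb):
    """Rough estimate from this ticket: about twice the stack limit.

    Pass None for an unlimited stack.
    """
    if stack_limit_kb is None:
        stack_limit_kb = IA64_UNLIMITED_REFERENCE_KB
    return 2 * stack_limit_kb


def current_stack_limit_kb():
    """Soft stack limit of this process in KB, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft // 1024


# The 256MB limit from this ticket gives an estimate close to the 519m observed.
print(estimated_qrsh_virt_kb(256 * 1024) // 1024, "MB")
print("current soft stack limit (KB):", current_stack_limit_kb())
```

The estimate is only a guide for large limits: at 1MB the ticket reports 4MB to 10MB, so fixed overhead dominates there.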

The final conclusion is that the best alternative on IA64 is to set the
stack limit to 1024KB (ulimit -s 1024) and to write the rsh wrapper in sh.
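
One way to check that a child process really sees the 1024KB limit, before changing the wrapper itself, is to start a command with a lowered soft stack limit and print what it reports. The sketch below assumes a POSIX system with an `sh` whose `ulimit` accepts `-s`. The helper `run_with_stack_limit` is hypothetical, and the `sh -c "ulimit -s"` command is only a stand-in for the real wrapper.

```python
import resource
import subprocess


def run_with_stack_limit(command, stack_kb=1024):
    """Run `command` with its soft stack limit set to `stack_kb` KB."""

    def lower_stack_limit():
        # Runs in the child just before exec; the hard limit is left unchanged.
        _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (stack_kb * 1024, hard))

    return subprocess.run(
        command,
        preexec_fn=lower_stack_limit,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in command: ask a child shell which stack limit it sees.
result = run_with_stack_limit(["sh", "-c", "ulimit -s"])
print(result.stdout.strip())
```

Only the child is affected; the calling process keeps its own limit, which is the same effect as putting `ulimit -s 1024` at the top of an sh wrapper.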
