Custom Query (1181 matches)

Results (187 - 189 of 1181)

Ticket Resolution Summary Owner Reporter
#682 fixed IZ3050: 6.2u2_1 qmaster large memory leak steelah1
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050]

        Issue #:      3050             Platform:     Other    Reporter: steelah1 (steelah1)
       Component:     gridengine          OS:        Linux
     Subcomponent:    qmaster          Version:      6.2u2       CC:    None defined
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
          URL:
       * Summary:     6.2u2_1 qmaster large memory leak
   Status whiteboard:
      Attachments:

     Issue 3050 blocks:
   Votes for issue 3050:


   Opened: Mon Jun 15 15:24:00 -0700 2009 
------------------------


I upgraded 6.0u8 to 6.2u2_1, and I keep getting large memory leaks from sge_qmaster. It approaches 100 percent of the memory within a few
minutes, and submitted jobs just sit in queue wait status. I don't see anything in the messages file
(/local/sge/default/common/spool/qmaster/messages), and simply restarting the daemon doesn't fix the problem. I have to kill the sge_execd
on the execution hosts/compute nodes, and then restart them one at a time every few seconds. Any jobs that are running, I can leave their
sge_execd going, but I have to restart all the other ones. This way the memory leak goes away. Any ideas or info would be greatly
appreciated, as this was working fine before with 6.0u8.

   ------- Additional comments from joga Wed Jun 24 05:24:56 -0700 2009 -------
*** Issue 3051 has been marked as a duplicate of this issue. ***

   ------- Additional comments from joga Wed Jun 24 05:34:19 -0700 2009 -------
please provide some more information:
- architecture (as delivered by $SGE_ROOT/util/arch)
- the scheduler configuration (qconf -ssconf) and global configuration (qconf -sconf)
- what type of jobs are you running, e.g. array jobs, parallel jobs, having special resource requests, etc.
- are you using some special configuration options like access lists, a sharetree, etc.
- how big is your cluster (number of exec hosts), and how many jobs are in the cluster when this happens?

Just guessing: if you have enabled schedd_job_info in the scheduler configuration, try disabling it.
If you need to know why a job cannot be scheduled and currently take that information from qstat -j <job_id>,
try qalter -w p <job_id> instead.
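
A minimal sketch of how the suggested change could be scripted rather than made interactively. The function name disable_schedd_job_info and the temporary file path are illustrative assumptions, not part of Grid Engine. The sketch also assumes that qconf -Msconf accepts a configuration file on the installed version; check the qconf man page before relying on it.

```shell
#!/bin/sh
# Sketch: rewrite the schedd_job_info line of a scheduler configuration dump.
# Reads `qconf -ssconf` style text on stdin and writes the modified text to stdout.
# All other lines pass through unchanged.
disable_schedd_job_info() {
    sed 's/^schedd_job_info[[:space:]][[:space:]]*.*$/schedd_job_info                   false/'
}

# Intended use on a live cluster (assumption: qconf -Msconf loads a file):
#   qconf -ssconf | disable_schedd_job_info > /tmp/sched.conf
#   qconf -Msconf /tmp/sched.conf
```

Running qconf -ssconf afterwards should then show schedd_job_info set to false.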

   ------- Additional comments from steelah1 Wed Jun 24 07:52:31 -0700 2009 -------
/local/sge/util/arch
lx24-amd64

qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 seqno
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  0:10:0

qconf -sconf
#global:
execd_spool_dir              /local/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           hpcauth@inl.gov,sheljk@inl.gov
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /usr/local/bin/ssh_qlogin
qlogin_daemon                /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

We run mostly parallel and serial jobs, no array jobs, no special requests.

We have some access lists for users for a couple of specific queues, but for the main queue it's wide open, so anyone who can get on the
machine can run jobs.

Our cluster is a combination of 166 dell 1950 dual core and quad core compute nodes running opensuse 11.1, with one login/head node (dell
1950, quadcore, opensuse 11.1)

   ------- Additional comments from steelah1 Wed Jun 24 13:49:01 -0700 2009 -------
Also, I recently changed schedd_job_info from true to false using qconf -msconf

   ------- Additional comments from joga Thu Jun 25 09:10:13 -0700 2009 -------
In the given scheduler config, you have the schedd_job_info enabled.
If you disable it, do you still see the problem?

If you can reproduce the issue,
we could try to get a core dump from qmaster when this problem occurs - this might help us to understand where exactly the problem is.
It is possible to get a core dump from a running process via the gcore command usually available on Linux.

So you could either manually call gcore <qmaster_pid> when you see the problem, or use a script I prepared for this purpose:
It monitors the qmaster size, and calls gcore when qmaster reaches a certain size,
and repeats calling gcore for a configurable number of times after certain steps of growth.
When to call gcore, and how often, can be configured at the beginning of the script.
You can download it from the following URL:
http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh
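
The monitoring approach described above can be sketched roughly as follows. This is not the monitor_qmaster.sh script from the URL; it is a hypothetical illustration, and every threshold, path, and function name in it is an assumption. It polls the resident size of a process and calls gcore once the size reaches a threshold, then again after each further step of growth, up to a fixed number of dumps.

```shell
#!/bin/sh
# Hypothetical sketch, not the monitor_qmaster.sh referenced above.
# All thresholds and names below are illustrative assumptions.
THRESHOLD_KB=2000000   # first dump once the resident size reaches this many kB
STEP_KB=500000         # further dumps after each additional STEP_KB of growth
MAX_DUMPS=3            # stop after this many core dumps

# next_threshold THRESHOLD STEP DUMPS_TAKEN -> size at which the next dump is due
next_threshold() {
    echo $(( $1 + $2 * $3 ))
}

# should_dump RSS_KB THRESHOLD STEP DUMPS_TAKEN MAX -> exit status 0 if a dump is due
should_dump() {
    [ "$4" -lt "$5" ] && [ "$1" -ge "$(next_threshold "$2" "$3" "$4")" ]
}

# monitor PID: poll the resident size and call gcore whenever a dump is due
monitor() {
    pid=$1
    dumps=0
    while [ "$dumps" -lt "$MAX_DUMPS" ] && kill -0 "$pid" 2>/dev/null; do
        rss=$(ps -o rss= -p "$pid" | tr -d ' ')
        if [ -n "$rss" ] && should_dump "$rss" "$THRESHOLD_KB" "$STEP_KB" "$dumps" "$MAX_DUMPS"; then
            gcore -o "/tmp/qmaster_core_$dumps" "$pid"
            dumps=$((dumps + 1))
        fi
        sleep 10
    done
}
```

After sourcing the functions, something like monitor "$(pgrep -x sge_qmaster)" would start the loop. It needs to run as a user allowed to signal and attach to the qmaster process, and the target directory needs enough free space for the core files.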
#690 fixed IZ3072: gui jobs on windows vista only starting when there is a user logged into the system crei
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3072]

        Issue #:      3072             Platform:     All             Reporter: crei (crei)
       Component:     gridengine          OS:        Windows Vista
     Subcomponent:    execution        Version:      6.2u3              CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    pollinger (pollinger)
      QA Contact:     pollinger
          URL:
       * Summary:     gui jobs on windows vista only starting when there is a user logged into the system
   Status whiteboard:
      Attachments:

     Issue 3072 blocks:
   Votes for issue 3072:


   Opened: Wed Jul 1 01:07:00 -0700 2009 
------------------------


It is not possible to start a windows gui job on windows vista hosts without having a user logged in.

When a job is submitted to a vista host where no user is logged in the following error is logged into
the execd messages file:

06/30/2009 15:26:19|  main|host|E|06/30/2009 15:26:19 [1049715:3121]: Getting Logged On User Token failed: Only part of a ReadProcessMemory
or WriteProcessMemory request was completed. (errno=299)


The job was submitted with the following command line:
> qsub -l h=host,display_win_gui=true -b yes -shell no /dev/fs/C/WINDOWS/notepad.exe
Your job 7 ("notepad.exe") has been submitted

The job goes into error state when it is dispatched to the host:
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      7 0.55500 notepad.ex cr114091     Eqw   06/30/2009 15:26:11     1

   ------- Additional comments from crei Wed Jul 1 01:11:58 -0700 2009 -------
Evaluation:

The vista job start code on the helper service differs from the standard implementation.
The problem is that the function GetInteractiveUserToken() cannot get a session token when no user is logged in.

How to Fix:
a) Find out how to startup GUI jobs when no user is logged into the vista host
b) Add a load report consumable which reports if it is possible to startup a windows GUI job on this host and
   request e.g. WGUISupport=true when a job is submitted.

Workaround:
A user must be logged into the windows vista system.
The GUI job also starts up correctly when the screen is locked.
#693 fixed IZ3076: syntax error for empty SGE_QMASTER/EXECD_PORT mpospisil
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3076]

        Issue #:      3076             Platform:     Sun      Reporter: mpospisil (mpospisil)
       Component:     gridengine          OS:        All
     Subcomponent:    install          Version:      6.0         CC:    None defined
        Status:       NEW              Priority:     P4
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    mpospisil (mpospisil)
      QA Contact:     dom
          URL:
       * Summary:     syntax error for empty SGE_QMASTER/EXECD_PORT
   Status whiteboard:
      Attachments:

     Issue 3076 blocks:
   Votes for issue 3076:


   Opened: Mon Jul 6 11:25:00 -0700 2009 
------------------------


When just <enter> is pressed at the unused port prompt, the following screen is shown:

Grid Engine TCP/IP service >sge_qmaster<

----------------------------------------

Please enter an unused port number >>
expr: syntax error

Invalid input. Must be a number.
Hit <RETURN> to continue >>


Using the environment variable

   $SGE_QMASTER_PORT=as port for communication.\n\n

infotext: too few arguments




Grid Engine TCP/IP service >sge_execd<

--------------------------------------

Please enter an unused port number >>
expr: syntax error

Invalid input. Must be a number.
Hit <RETURN> to continue >>


Using the environment variable

   $SGE_EXECD_PORT=as port for communication.\n\n

infotext: too few arguments
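
The "expr: syntax error" lines above suggest that the empty input reaches expr without being checked first. A minimal sketch of a guard that rejects empty or non-numeric input before any arithmetic is attempted. The function name is_valid_port is hypothetical and is not taken from the Grid Engine install scripts.

```shell
#!/bin/sh
# Hypothetical helper, not code from the Grid Engine install scripts.
# is_valid_port VALUE -> exit status 0 if VALUE is a decimal number in 1..65535
is_valid_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty input or any non-digit character
    esac
    [ "${#1}" -le 5 ] || return 1  # reject overlong digit strings before comparing
    [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}
```

An install prompt could loop until is_valid_port "$INP" succeeds, where INP is a hypothetical variable holding the user's answer, and only then use the value as the port.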