IZ3050: 6.2u2_1 qmaster large memory leak
Reported by: steelah1
Owned by:
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050]
Issue #: 3050
Platform: Other
Reporter: steelah1 (steelah1)
Component: gridengine
OS: Linux
Subcomponent: qmaster
Version: 6.2u2
CC: None defined
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: 6.2u2_1 qmaster large memory leak
Status whiteboard:
Attachments:
Issue 3050 blocks:
Votes for issue 3050:

Opened: Mon Jun 15 15:24:00 -0700 2009
------------------------

I upgraded from 6.0u8 to 6.2u2_1, and I keep getting large memory leaks from sge_qmaster. It approaches 100% of the memory within a few minutes, and submitted jobs just sit in queue wait status.

I don't see anything in the messages file (/local/sge/default/common/spool/qmaster/messages), and simply restarting the daemon doesn't fix the problem. I have to kill the sge_execd on the execution hosts/compute nodes, and then restart them one at a time every few seconds. On hosts with running jobs I can leave sge_execd going, but I have to restart all the other ones. This way the memory leak goes away.

Any ideas or info would be greatly appreciated, as this was working fine before with 6.0u8.

------- Additional comments from joga Wed Jun 24 05:24:56 -0700 2009 -------

*** Issue 3051 has been marked as a duplicate of this issue. ***

------- Additional comments from joga Wed Jun 24 05:34:19 -0700 2009 -------

Please provide some more information:
- architecture (as delivered by $SGE_ROOT/util/arch)
- the scheduler configuration (qconf -ssconf) and global configuration (qconf -sconf)
- what type of jobs are you running, e.g. array jobs, parallel jobs, jobs with special resource requests, etc.
- are you using some special configuration options like access lists, a sharetree, etc.
- how big is your cluster (number of exec hosts), and how many jobs are in the cluster when this happens?

Just guessing: if you have enabled schedd_job_info in the scheduler configuration, try disabling it. If you need to know why a job cannot be scheduled and currently get that from qstat -j <job_id>, try qalter -w p <job_id> instead.

------- Additional comments from steelah1 Wed Jun 24 07:52:31 -0700 2009 -------

/local/sge/util/arch

    lx24-amd64

qconf -ssconf

    algorithm                         default
    schedule_interval                 0:0:15
    maxujobs                          0
    queue_sort_method                 seqno
    job_load_adjustments              np_load_avg=0.50
    load_adjustment_decay_time        0:7:30
    load_formula                      np_load_avg
    schedd_job_info                   true
    flush_submit_sec                  0
    flush_finish_sec                  0
    params                            none
    reprioritize_interval             0:0:0
    halftime                          168
    usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
    compensation_factor               5.000000
    weight_user                       0.250000
    weight_project                    0.250000
    weight_department                 0.250000
    weight_job                        0.250000
    weight_tickets_functional         0
    weight_tickets_share              0
    share_override_tickets            TRUE
    share_functional_shares           TRUE
    max_functional_jobs_to_schedule   200
    report_pjob_tickets               TRUE
    max_pending_tasks_per_job         50
    halflife_decay_list               none
    policy_hierarchy                  OFS
    weight_ticket                     0.010000
    weight_waiting_time               0.000000
    weight_deadline                   3600000.000000
    weight_urgency                    0.100000
    weight_priority                   1.000000
    max_reservation                   0
    default_duration                  0:10:0

qconf -sconf

    #global:
    execd_spool_dir              /local/sge/default/spool
    mailer                       /bin/mail
    xterm                        /usr/bin/X11/xterm
    load_sensor                  none
    prolog                       none
    epilog                       none
    shell_start_mode             posix_compliant
    login_shells                 sh,ksh,csh,tcsh
    min_uid                      0
    min_gid                      0
    user_lists                   none
    xuser_lists                  none
    projects                     none
    xprojects                    none
    enforce_project              false
    enforce_user                 auto
    load_report_time             00:00:40
    max_unheard                  00:05:00
    reschedule_unknown           00:00:00
    loglevel                     log_warning
    administrator_mail           firstname.lastname@example.org,email@example.com
    set_token_cmd                none
    pag_cmd                      none
    token_extend_time            none
    shepherd_cmd                 none
    qmaster_params               none
    execd_params                 none
    reporting_params             accounting=true reporting=false \
                                 flush_time=00:00:15 joblog=false sharelog=00:00:00
    finished_jobs                100
    gid_range                    20000-20100
    qlogin_command               /usr/local/bin/ssh_qlogin
    qlogin_daemon                /usr/sbin/sshd -i
    rlogin_daemon                /usr/sbin/sshd -i
    rlogin_command               /usr/bin/ssh
    rsh_command                  /usr/bin/ssh
    rsh_daemon                   /usr/sbin/sshd -i
    max_aj_instances             2000
    max_aj_tasks                 75000
    max_u_jobs                   0
    max_jobs                     0
    auto_user_oticket            0
    auto_user_fshare             0
    auto_user_default_project    none
    auto_user_delete_time        86400
    delegated_file_staging       false
    reprioritize                 false
    jsv_url                      none
    jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

We run mostly parallel and serial jobs, no array jobs, no special requests. We have some access lists for users for a couple of specific queues, but the main queue is wide open, so anyone who can get on the machine can run jobs.

Our cluster is a combination of 166 Dell 1950 dual-core and quad-core compute nodes running openSUSE 11.1, with one login/head node (Dell 1950, quad-core, openSUSE 11.1).

------- Additional comments from steelah1 Wed Jun 24 13:49:01 -0700 2009 -------

Also, I recently changed schedd_job_info from true to false using qconf -msconf.

------- Additional comments from joga Thu Jun 25 09:10:13 -0700 2009 -------

In the given scheduler config, you have schedd_job_info enabled. If you disable it, do you still see the problem?

If you can reproduce the issue, we could try to get a core dump from qmaster when this problem occurs - this might help us to understand where exactly the problem is. It is possible to get a core dump from a running process via the gcore command, usually available on Linux.

So you could either manually call gcore <qmaster_pid> when you see the problem, or use a script I prepared for this purpose: it monitors the qmaster size, calls gcore when qmaster reaches a certain size, and repeats calling gcore a configurable number of times after certain steps of growth. When to call gcore, and how often, can be configured at the beginning of the script.

You can download it from the following URL:
http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh
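
For readers who cannot retrieve the script, the monitoring approach described in the last comment can be sketched roughly as follows. This is not the contents of monitor_qmaster.sh; it is a minimal illustration under stated assumptions: "qmaster size" is taken to mean the resident set size (VmRSS) read from /proc on Linux, and all function names, parameters, and the core file prefix are hypothetical.

```python
import subprocess
import time


def read_rss_kb(pid):
    """Return the resident set size in kB from /proc/<pid>/status (Linux only).

    Assumption: VmRSS stands in for the "qmaster size" mentioned in the ticket.
    """
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")


def dump_thresholds(first_kb, step_kb, max_dumps):
    """Sizes at which a core dump is taken: first_kb, then every step_kb of growth."""
    return [first_kb + i * step_kb for i in range(max_dumps)]


def run_gcore(pid, index):
    """Write a core file of the running process; gcore appends the pid to the prefix."""
    # Hypothetical prefix; gcore itself is the tool named in the ticket.
    subprocess.run(["gcore", "-o", f"qmaster_core_{index}", str(pid)], check=True)


def monitor(pid, first_kb, step_kb, max_dumps, interval_s=10,
            read_size=read_rss_kb, dump=run_gcore, sleep=time.sleep):
    """Poll the process size and dump once per threshold reached.

    Returns the number of dumps taken. Stops early if the process disappears.
    """
    thresholds = dump_thresholds(first_kb, step_kb, max_dumps)
    taken = 0
    while taken < len(thresholds):
        try:
            size_kb = read_size(pid)
        except FileNotFoundError:
            break  # process has exited
        if size_kb >= thresholds[taken]:
            dump(pid, taken)
            taken += 1
        sleep(interval_s)
    return taken
```

A call such as monitor(qmaster_pid, first_kb=2_000_000, step_kb=1_000_000, max_dumps=3) would take a first dump at about 2 GB resident and two more after each further 1 GB of growth; these numbers are placeholders and should be chosen to fit the memory of the qmaster host. Core files of a process this large need comparable free disk space.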
Change History (9)
comment:1 Changed 4 years ago by ppoilbarbe
- Severity set to minor