Custom Query (431 matches)
Results (157 - 159 of 431)
Ticket | Resolution | Summary | Owner | Reporter |
---|---|---|---|---|
#681 | fixed | IZ3049: memory leak in qmaster when submitting special jobs | ah_sunsource | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3049]

Issue #: 3049
Platform: All
Reporter: ah_sunsource (ah_sunsource)
Component: gridengine
OS: Linux
Subcomponent: qmaster
Version: 6.2u2
CC: None defined
Status: NEW
Priority: P4
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
Summary: memory leak in qmaster when submitting special jobs
Status whiteboard:
Attachments:
Issue 3049 blocks:
Votes for issue 3049:

Opened: Thu Jun 11 23:54:00 -0700 2009
------------------------
Hi,

qmaster leaks memory when submitting a parallel job with array tasks and requiring reservation. Example:

qsub -t 1-100 -pe mpi 16-32 -R y -l h_cpu=1:00:00,h_vmem=1.2G mpi-job.sh

Within seconds qmaster's memory consumption grows to several gigabytes and finally crashes the system.

Cheers, Andreas

------- Additional comments from joga Wed Jun 24 05:21:22 -0700 2009 -------
Hi Andreas,

I cannot reproduce the issue in my test cluster. Can you please share some more information, e.g.
- definition of the mpi pe (qconf -mp mpi)
- scheduler and global configuration (qconf -ssconf, qconf -sconf)
- definition of the h_cpu and h_vmem complex variables; is h_vmem consumable?
- does the mpi-job.sh contain any special comments with additional submit options?

Thanks, Joachim

------- Additional comments from joga Thu Jun 25 09:04:15 -0700 2009 -------
If you can reproduce the issue, we could try to get a core dump from qmaster when this problem occurs - this might help us to understand where exactly the problem is. It is possible to get a core dump from a running process via the gcore command. I prepared a small script which monitors the qmaster size, calls gcore when qmaster reaches a certain size, and repeats calling gcore a configurable number of times after certain steps of growth (see the sketch after this ticket). When to call gcore, and how often, can be configured at the beginning of the script. You can download it from the following URL: http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh

------- Additional comments from ah_sunsource Wed Jul 1 05:51:45 -0700 2009 -------
Hi,

I've upgraded to 6.2u3 and cannot reproduce the problem right now. But I will answer your questions; the configuration did not change. Maybe it helps.

[oreade38] ~ % qconf -sp mpi
pe_name             mpi
slots               512
user_lists          NONE
xuser_lists         NONE
start_proc_args     /bin/true
stop_proc_args      /bin/true
allocation_rule     $round_robin
control_slaves      TRUE
job_is_first_task   FALSE
urgency_slots       min
accounting_summary  FALSE

[oreade38] ~ % qconf -ssconf
algorithm                          default
schedule_interval                  0:0:10
maxujobs                           0
queue_sort_method                  load
job_load_adjustments               np_load_avg=1.0
load_adjustment_decay_time         00:07:30
load_formula                       np_load_avg
schedd_job_info                    true
flush_submit_sec                   0
flush_finish_sec                   0
params                             none
reprioritize_interval              0:0:0
halftime                           168
usage_weight_list                  cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor                5.000000
weight_user                        0.250000
weight_project                     0.250000
weight_department                  0.250000
weight_job                         0.250000
weight_tickets_functional          1000
weight_tickets_share               3000
share_override_tickets             TRUE
share_functional_shares            TRUE
max_functional_jobs_to_schedule    200
report_pjob_tickets                TRUE
max_pending_tasks_per_job          50
halflife_decay_list                none
policy_hierarchy                   OFS
weight_ticket                      0.010000
weight_waiting_time                0.000000
weight_deadline                    3600000.000000
weight_urgency                     0.100000
weight_priority                    1.000000
max_reservation                    0
default_duration                   INFINITY

[oreade38] ~ % qconf -sconf
#global:
execd_spool_dir              /usr/gridengine/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       root@/usr/gridengine/util/prolog
epilog                       root@/usr/gridengine/util/epilog
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:30
max_unheard                  00:2:30
reschedule_unknown           00:05:00
loglevel                     log_info
administrator_mail           ahaupt@ifh.de
set_token_cmd                /usr/gridengine/util/set_token_cmd
pag_cmd                      /usr/heimdal/bin/pagsh
token_extend_time            24:0:0
shepherd_cmd                 none
qmaster_params               none
execd_params                 SHARETREE_RESERVED_USAGE ENABLE_ADDGRP_KILL
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    50000-50100
qlogin_command               ssh -tt -o GSSAPIDelegateCredentials=no
qlogin_daemon                /usr/gridengine/util/rshd-wrapper
rlogin_command               ssh -tt -o GSSAPIDelegateCredentials=no
rlogin_daemon                /usr/gridengine/util/rshd-wrapper
rsh_command                  ssh -tt -o GSSAPIDelegateCredentials=no
rsh_daemon                   /usr/gridengine/util/rshd-wrapper
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      /usr/gridengine/util/job_verifier.pl
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

[oreade38] ~ % qconf -sc | egrep '(#name|h_cpu|h_vmem)'
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_cpu    h_cpu     TIME    <=     YES          NO          0:0:0    0
h_vmem   h_vmem    MEMORY  <=     YES          YES         512M     0

No other SGE flags have been set within the job script.

Cheers, Andreas
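The monitor_qmaster.sh script mentioned in the comments is only linked, not shown. The following is a minimal sketch of the approach Joachim describes: watch the qmaster's resident size, take a core with gcore when it crosses a threshold, and take further cores after each additional step of growth. It is an illustration, not the original script; the threshold, step, count, interval, and output prefix are assumed values, and it presumes a Linux /proc filesystem with pgrep and gdb's gcore installed.

    #!/bin/sh
    # Minimal sketch (not the original monitor_qmaster.sh): watch sge_qmaster's
    # resident size and take core dumps with gcore as it grows.
    # Thresholds and the output prefix below are illustrative assumptions.

    THRESHOLD_KB=2000000   # first core at roughly 2 GB RSS
    STEP_KB=1000000        # another core after each further ~1 GB of growth
    MAX_CORES=3            # stop after this many dumps
    INTERVAL=10            # polling interval in seconds

    PID=$(pgrep -o sge_qmaster) || { echo "sge_qmaster not running" >&2; exit 1; }

    taken=0
    next=$THRESHOLD_KB
    while [ "$taken" -lt "$MAX_CORES" ] && [ -d "/proc/$PID" ]; do
        rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")
        if [ "${rss_kb:-0}" -ge "$next" ]; then
            echo "$(date): qmaster RSS ${rss_kb} kB, calling gcore"
            gcore -o "qmaster_core_$taken" "$PID"
            taken=$((taken + 1))
            next=$((next + STEP_KB))
        fi
        sleep "$INTERVAL"
    done

Run on the qmaster host before reproducing the leak, the resulting cores can then be inspected with gdb against the sge_qmaster binary.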
#682 | fixed | IZ3050: 6.2u2_1 qmaster large memory leak | steelah1 | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050]

Issue #: 3050
Platform: Other
Reporter: steelah1 (steelah1)
Component: gridengine
OS: Linux
Subcomponent: qmaster
Version: 6.2u2
CC: None defined
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
Summary: 6.2u2_1 qmaster large memory leak
Status whiteboard:
Attachments:
Issue 3050 blocks:
Votes for issue 3050:

Opened: Mon Jun 15 15:24:00 -0700 2009
------------------------
I upgraded 6.0u8 to 6.2u2_1, and I keep getting large memory leaks from sge_qmaster. It approaches 100% of memory within a few minutes, and submitted jobs just sit in queue wait status. I don't see anything in the messages file (/local/sge/default/common/spool/qmaster/messages), and simply restarting the daemon doesn't fix the problem. I have to kill the sge_execd on the execution hosts/compute nodes and then restart them one at a time every few seconds. For jobs that are still running I can leave their sge_execd going, but I have to restart all the other ones. This way the memory leak goes away. Any ideas or info would be greatly appreciated, as this was working fine before with 6.0u8.

------- Additional comments from joga Wed Jun 24 05:24:56 -0700 2009 -------
*** Issue 3051 has been marked as a duplicate of this issue. ***

------- Additional comments from joga Wed Jun 24 05:34:19 -0700 2009 -------
Please provide some more information:
- architecture (as delivered by $SGE_ROOT/util/arch)
- the scheduler configuration (qconf -ssconf) and global configuration (qconf -sconf)
- what type of jobs are you running, e.g. array jobs, parallel jobs, jobs with special resource requests, etc.
- are you using special configuration options like access lists, a sharetree, etc.?
- how big is your cluster (number of exec hosts), and how many jobs are in the cluster when this happens?

Just guessing: if you have enabled schedd_job_info in the scheduler configuration, try disabling it. If you need the information why a job cannot be scheduled and currently take it from qstat -j <job_id>, try qalter -w p <job_id> instead.

------- Additional comments from steelah1 Wed Jun 24 07:52:31 -0700 2009 -------
/local/sge/util/arch
lx24-amd64

qconf -ssconf
algorithm                          default
schedule_interval                  0:0:15
maxujobs                           0
queue_sort_method                  seqno
job_load_adjustments               np_load_avg=0.50
load_adjustment_decay_time         0:7:30
load_formula                       np_load_avg
schedd_job_info                    true
flush_submit_sec                   0
flush_finish_sec                   0
params                             none
reprioritize_interval              0:0:0
halftime                           168
usage_weight_list                  cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor                5.000000
weight_user                        0.250000
weight_project                     0.250000
weight_department                  0.250000
weight_job                         0.250000
weight_tickets_functional          0
weight_tickets_share               0
share_override_tickets             TRUE
share_functional_shares            TRUE
max_functional_jobs_to_schedule    200
report_pjob_tickets                TRUE
max_pending_tasks_per_job          50
halflife_decay_list                none
policy_hierarchy                   OFS
weight_ticket                      0.010000
weight_waiting_time                0.000000
weight_deadline                    3600000.000000
weight_urgency                     0.100000
weight_priority                    1.000000
max_reservation                    0
default_duration                   0:10:0

qconf -sconf
#global:
execd_spool_dir              /local/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           hpcauth@inl.gov,sheljk@inl.gov
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /usr/local/bin/ssh_qlogin
qlogin_daemon                /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

We run mostly parallel and serial jobs, no array jobs, no special requests. We have some access lists for users for a couple of specific queues, but the main queue is wide open, so anyone who can get on the machine can run jobs. Our cluster is a combination of 166 Dell 1950 dual-core and quad-core compute nodes running openSUSE 11.1, with one login/head node (Dell 1950, quad-core, openSUSE 11.1).

------- Additional comments from steelah1 Wed Jun 24 13:49:01 -0700 2009 -------
Also, I recently changed schedd_job_info from true to false using qconf -msconf.

------- Additional comments from joga Thu Jun 25 09:10:13 -0700 2009 -------
In the given scheduler config you have schedd_job_info enabled. If you disable it, do you still see the problem?

If you can reproduce the issue, we could try to get a core dump from qmaster when this problem occurs - this might help us to understand where exactly the problem is. It is possible to get a core dump from a running process via the gcore command, usually available on Linux. So you could either manually call gcore <qmaster_pid> when you see the problem, or use a script I prepared for this purpose: it monitors the qmaster size, calls gcore when qmaster reaches a certain size, and repeats calling gcore a configurable number of times after certain steps of growth. When to call gcore, and how often, can be configured at the beginning of the script. You can download it from the following URL: http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh (see also the command sketch after this ticket).
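Joachim's advice above boils down to a few standard SGE commands. A small sketch, assuming a default 6.2 installation; the job id is a placeholder:

    # Check whether schedd_job_info is enabled (it was "true" in the posted config).
    qconf -ssconf | grep schedd_job_info

    # Disable it: qconf -msconf opens the scheduler configuration in $EDITOR;
    # change the line to "schedd_job_info false" and save.
    qconf -msconf

    # With schedd_job_info off, ask the scheduler on demand why a pending job
    # cannot be dispatched instead of reading it from "qstat -j <job_id>":
    qalter -w p <job_id>

    # If the leak still appears, a core dump of the running master can be taken
    # manually (gcore ships with gdb on most Linux distributions):
    gcore $(pgrep -o sge_qmaster)

With schedd_job_info enabled, qmaster keeps per-job scheduling messages in memory, which is presumably why it is the first suspect for memory growth in larger clusters.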
#690 | fixed | IZ3072: gui jobs on windows vista only starting when there is a user logged into the system | crei | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3072]

Issue #: 3072
Platform: All
Reporter: crei (crei)
Component: gridengine
OS: Windows Vista
Subcomponent: execution
Version: 6.2u3
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: ENHANCEMENT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
Summary: gui jobs on windows vista only starting when there is a user logged into the system
Status whiteboard:
Attachments:
Issue 3072 blocks:
Votes for issue 3072:

Opened: Wed Jul 1 01:07:00 -0700 2009
------------------------
It is not possible to start a Windows GUI job on Windows Vista hosts without having a user logged in. When a job is submitted to a Vista host where no user is logged in, the following error is logged in the execd messages file:

06/30/2009 15:26:19| main|host|E|06/30/2009 15:26:19 [1049715:3121]: Getting Logged On User Token failed: Only part of a ReadProcessMemory or WriteProcessMemory request was completed. (errno=299)

The job was submitted with the following command line:

> qsub -l h=host,display_win_gui=true -b yes -shell no /dev/fs/C/WINDOWS/notepad.exe
Your job 7 ("notepad.exe") has been submitted

The job goes into error state when it is dispatched to the host:

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      7 0.55500 notepad.ex cr114091     Eqw   06/30/2009 15:26:11     1

------- Additional comments from crei Wed Jul 1 01:11:58 -0700 2009 -------
Evaluation:
The Vista job start code in the helper service differs from the standard implementation. The problem is that the function GetInteractiveUserToken() cannot get a session token when no user is logged in.

How to fix:
a) Find out how to start GUI jobs when no user is logged into the Vista host, or
b) Add a load report consumable which reports whether it is possible to start a Windows GUI job on this host, and request e.g. WGUISupport=true when a job is submitted (a hypothetical sketch follows this ticket).

Workaround:
A user must be logged into the Windows Vista system. The GUI job also starts up correctly when the screen is locked.
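Fix proposal (b) is not implemented in 6.2u3; the sketch below only illustrates what such a setup could look like with standard SGE means. The WGUISupport name comes from the comment above, while the complex definition, the way the value gets reported, and the submit line are assumptions.

    # Hypothetical sketch of fix (b). Define a boolean complex via "qconf -mc",
    # using the column layout shown by "qconf -sc" (see issue 3049 above):
    #
    #   #name        shortcut  type  relop  requestable  consumable  default  urgency
    #   WGUISupport  wgui      BOOL  ==     YES          NO          FALSE    0
    #
    # An execd load sensor (or a fixed complex_values entry on hosts known to
    # have an interactive session) would report WGUISupport=true, and a GUI job
    # would then request it together with display_win_gui:
    qsub -l h=host,display_win_gui=true,WGUISupport=true -b yes -shell no /dev/fs/C/WINDOWS/notepad.exe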