Custom Query (431 matches)
Results (127 - 129 of 431)
Ticket | Resolution | Summary | Owner | Reporter |
---|---|---|---|---|
#467 | fixed | IZ2393: qrsh -inherit should access the 'rsh_command'-cache more persevering | brs | |
Description:
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2393]

```
Issue #: 2393             Platform: PC                   Reporter: brs (brs)
Component: gridengine     OS: Linux                      Subcomponent: execution
Version: 6.1u4            CC: None defined               Status: STARTED
Priority: P3              Resolution:                    Issue type: DEFECT
Target milestone: ---     Assigned to: roland (roland)   QA Contact: pollinger
URL: *
Summary: qrsh -inherit should access the 'rsh_command'-cache more persevering
Status whiteboard:        Attachments:                   Issue 2393 blocks:
Votes for issue 2393:
Opened: Mon Oct 8 08:57:00 -0700 2007
```

This is an intermittent problem that occurs with greater frequency as you increase the number of slots used in a parallel environment. It seems to affect only OpenMPI jobs. Jobs would fail with the following messages from OpenMPI from a fairly random number of qrsh processes (sometimes 2 or 3, sometimes -- considering 128 processors -- almost half of the calls would fail):

```
[rcn-ib-0013.rc.usf.edu:32328] ERROR: A daemon on node rcn-ib-0003.rc.usf.edu failed to start as expected.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: There may be more information available from
[rcn-ib-0013.rc.usf.edu:32328] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: If the problem persists, please restart the
[rcn-ib-0013.rc.usf.edu:32328] ERROR: Grid Engine PE job
[rcn-ib-0013.rc.usf.edu:32328] ERROR: The daemon exited unexpectedly with status 255.
```

Then a core from the qrsh (or qsh) executable would be dumped in the user's home directory. A quick backtrace with gdb revealed that qrsh was failing on a call to execvp. After some investigation, this call was found in source/clients/qsh/qsh.c:789 (as of a cvs checkout from 10/08/2007). Though I am reporting this bug with 6.0u11, I also reproduced it at one point with 6.1u2. I've gotten around this bug for the time being by LD_PRELOADing the following code:

```c
/*
 * Place holder library to fix bug with qsh.c not correctly passing
 * arguments to execvp()
 */
#define _GNU_SOURCE            /* needed for RTLD_NEXT */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/types.h>
#include <dlfcn.h>

int execvp(const char *file, char *const argv[])
{
    int pid, i, argc = 0;
    const char *file_n;
    char **args;
    static int (*func)(const char *, char *const *);
    struct utsname u_name;

    uname(&u_name);
    pid = (int)getpid();

    /* substitute a sane binary if the file argument is missing */
    if (file == NULL) {
        file_n = "/opt/sge-tools/rsh";
        fprintf(stderr, "[%s:%d] execvp(): had to modify null arguments\n",
                u_name.nodename, pid);
    } else {
        file_n = file;
    }

    /* copy the argument vector, fixing up a missing argv[0] */
    while (argv[argc] != NULL)
        argc++;
    args = malloc(((argc > 0 ? argc : 1) + 1) * sizeof(char *));
    if (args == NULL)
        return -1;
    args[0] = (argv[0] != NULL) ? (char *)argv[0] : "/opt/sge-tools/rsh";
    for (i = 1; i < argc; i++)
        args[i] = (char *)argv[i];
    args[argc > 0 ? argc : 1] = NULL;

    /* chain to the real execvp() */
    func = (int (*)(const char *, char *const *))dlsym(RTLD_NEXT, "execvp");
    return func(file_n, args);
}
```

Basically, it ensures that the arguments passed to execvp are sane and modifies them if they are not. I also needed a wrapper for rsh (hence /opt/sge-tools/rsh) so that the correct arch would be selected in $SGE_ROOT/utilbin.
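The rsh wrapper at /opt/sge-tools/rsh is not included in the ticket. As a rough illustration only, such a wrapper might look like the sketch below; the reliance on the $SGE_ROOT/util/arch script and the exact paths are assumptions about the reporter's setup, not something stated in the report.

```c
/* Hypothetical sketch of an rsh wrapper like /opt/sge-tools/rsh: determine
 * the SGE architecture string and exec the matching rsh binary under
 * $SGE_ROOT/utilbin with the caller's arguments.  Paths are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *sge_root = getenv("SGE_ROOT");
    char cmd[4096], arch[256], rsh_path[4096];
    FILE *p;

    (void)argc;
    if (sge_root == NULL) {
        fprintf(stderr, "rsh wrapper: SGE_ROOT is not set\n");
        return 1;
    }

    /* ask the SGE arch script which utilbin subdirectory to use */
    snprintf(cmd, sizeof(cmd), "%s/util/arch", sge_root);
    p = popen(cmd, "r");
    if (p == NULL || fgets(arch, sizeof(arch), p) == NULL) {
        fprintf(stderr, "rsh wrapper: could not determine SGE arch\n");
        if (p != NULL)
            pclose(p);
        return 1;
    }
    pclose(p);
    arch[strcspn(arch, "\n")] = '\0';

    /* exec the arch-specific rsh with the original argument vector */
    snprintf(rsh_path, sizeof(rsh_path), "%s/utilbin/%s/rsh", sge_root, arch);
    argv[0] = rsh_path;
    execv(rsh_path, argv);

    perror("rsh wrapper: execv");
    return 1;
}
```

Pointing rsh_command at a wrapper of this kind (as in the qconf -sconf output below) is one way to make sure the architecture-specific utilbin binary is picked on mixed clusters.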
Here's some general information:

```
qconf -sq rcnib.q
qname                 rcnib.q
hostlist              @rcnibNodes
seq_no                51
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            2
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich-p4 ompi-ddr mpich-schrod
rerun                 FALSE
slots                 4
tmpdir                /opt/sge/tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            operators
xuser_lists           NONE
subordinate_list      NONE
complex_values        i_ib=true,p_low=false,t_devel=false
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                2M
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
```

```
qconf -sp ompi-ddr
pe_name            ompi-ddr
slots              1024
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```

```
qconf -sconf
global:
execd_spool_dir              /opt/sge/rc/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           admins@rc.usf.edu
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    10000-10500
qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind
rsh_command                  /opt/sge-tools/rsh
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 1
```

And some more detailed output. With the following code LD_PRELOADed, a good call to execvp looks like this:

```
[rcn-ib-0037.rc.usf.edu:8123] execvp(): /opt/sge-tools/rsh /opt/sge-tools/rsh -n -p 34391 rcn-ib-0027.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0027/active_jobs/4268.1/1.rcn-ib-0027' noshell
```

A bad call will look something like this:

```
[rcn-ib-0037.rc.usf.edu:8150] execvp(): (null) (null) -n -p 34537 rcn-ib-0028.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0028/active_jobs/4268.1/1.rcn-ib-0028' noshell
```

Here's the code:

```c
#define _GNU_SOURCE            /* needed for RTLD_NEXT */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/types.h>
#include <dlfcn.h>

int execvp(const char *file, char *const argv[])
{
    int pid, i;
    static int (*func)(const char *, char *const *);
    struct utsname u_name;

    uname(&u_name);
    pid = (int)getpid();

    /* log the full argument vector; glibc prints "(null)" for NULL strings */
    fprintf(stderr, "[%s:%d] execvp(): %s %s ", u_name.nodename, pid, file, argv[0]);
    for (i = 1; argv[i] != NULL; i++)
        fprintf(stderr, "%s ", argv[i]);
    fprintf(stderr, "\n");

    /* chain to the real execvp() */
    func = (int (*)(const char *, char *const *))dlsym(RTLD_NEXT, "execvp");
    return func(file, argv);
}
```

I can provide more information if required.
------- Additional comments from brs Mon Oct 8 13:02:57 -0700 2007 -------

Ok, the actual execvp is in source/clients/qrsh/qrsh_starter.c, line 728. I should have looked more carefully.

------- Additional comments from brs Mon Oct 8 13:05:21 -0700 2007 -------

Changed the ticket summary to reflect the correct source file.

------- Additional comments from brs Sun Apr 6 10:20:19 -0700 2008 -------

This issue still exists in 6.1u4. Who can look at this issue with me?

------- Additional comments from brs Sun Apr 6 10:28:49 -0700 2008 -------

I would like someone to look at this ticket. Perhaps it was sent to the wrong person.

------- Additional comments from andreas Mon Apr 7 08:55:36 -0700 2008 -------

Moving this to the execution subcomponent.

------- Additional comments from brs Mon Apr 28 11:53:04 -0700 2008 -------

From more investigation, I think I've narrowed the problem down. In short, using NFS (at least with Solaris 10 as the NFS server) doesn't work reliably. It seems that multiple qrsh -inherit processes are making GDI requests to qmaster for the rsh_command data. I don't know how many GDI requests qmaster can handle simultaneously from a single job, but it appears that some of the qrsh -inherit processes spawned by mpirun can't read the data from qrsh_client_cache and reach the gdi2_get_configuration call before the first qrsh can write its data out. This results in a largish parallel job making a great many GDI requests to qmaster. My guess is that some of these GDI requests aren't handled properly, so some qrsh processes end up running execvp without any command string. Long story short: an NFS server on Solaris 10 with SGE: bad. Let me know what you think.

------- Additional comments from brs Mon Apr 28 11:53:54 -0700 2008 -------

For clarification, I meant NFS as a path for tmpdir.

------- Additional comments from brs Mon Apr 28 11:55:10 -0700 2008 -------

Also, it appears that some type of file locking would completely fix this issue if the GDI hypothesis is true.

------- Additional comments from andreas Tue Apr 29 08:15:24 -0700 2008 -------

The NFS issue I can't evaluate, but I think the code in get_client_name() should be more persevering in dealing with such cases. As it is now, qrsh -inherit tries only once to read the cache and then immediately falls back to a GDI request. The GDI query is certainly needed as a fallback, but doing a couple of retries is certainly not wrong if qrsh could stat(2) the cache file beforehand:

```c
if (tmpdir != NULL) {
   FILE *cache;

   /* build filename of cache file */
   sge_dstring_init(&cache_name_dstring, cache_name_buffer, SGE_PATH_MAX);
   cache_name = sge_dstring_sprintf(&cache_name_dstring, "%s/%s",
                                    tmpdir, QRSH_CLIENT_CACHE);
   cache = fopen(cache_name, "r");
   if (cache != NULL) {
      char cached_command[SGE_PATH_MAX];

      if (fgets(cached_command, SGE_PATH_MAX, cache) != NULL) {
         if (strcasecmp(cached_command, "builtin") == 0) {
            g_new_interactive_job_support = true;
         }
         FCLOSE(cache);
         DPRINTF(("found cached client name: %s\n", cached_command));
         DEXIT;
         return strdup(cached_command);
      }
      FCLOSE(cache);
   }
}
```
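To make the retry suggestion above concrete, here is a minimal standalone sketch of how the cache could be stat(2)'d and re-read a few times before giving up. The retry count, the delay, and the helper names are illustrative assumptions, not the actual gridengine implementation.

```c
/* Hypothetical sketch: retry reading the qrsh_client_cache file a few times
 * before falling back to a GDI request.  Constants and helpers are assumed
 * for illustration and are not the real gridengine code. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

#define CACHE_RETRIES     5         /* assumed retry count   */
#define CACHE_RETRY_USEC  200000    /* assumed delay: 200 ms */

static char *read_cached_command(const char *cache_name, char *buf, size_t len)
{
    FILE *cache = fopen(cache_name, "r");

    if (cache == NULL)
        return NULL;
    if (fgets(buf, (int)len, cache) == NULL) {
        fclose(cache);
        return NULL;
    }
    fclose(cache);
    return buf;
}

/* Returns the cached rsh_command, or NULL so the caller can fall back
 * to the GDI query. */
char *get_cached_client_name(const char *cache_name)
{
    static char cached_command[4096];
    struct stat st;
    int attempt;

    for (attempt = 0; attempt < CACHE_RETRIES; attempt++) {
        /* only retry if the cache file at least exists; another
         * qrsh -inherit may still be in the middle of writing it */
        if (stat(cache_name, &st) == 0 &&
            read_cached_command(cache_name, cached_command,
                                sizeof(cached_command)) != NULL) {
            return cached_command;
        }
        usleep(CACHE_RETRY_USEC);
    }
    return NULL;   /* caller falls back to gdi2_get_configuration() */
}
```

Bounding the retries keeps the GDI fallback available while giving the first qrsh time to write the cache, which would reduce the burst of GDI requests described in the comments above.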
#33 | fixed | IZ243: Memory leak in sge_schedd | ernst | |
Description:
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=243]

```
Issue #: 243              Platform: All                  Reporter: ernst (ernst)
Component: gridengine     OS: All                        Subcomponent: kernel
Version: 5.3              CC: None defined               Status: VERIFIED
Priority: P3              Resolution: FIXED              Issue type: DEFECT
Target milestone: not determined                         Assigned to: ernst (ernst)
QA Contact: andreas       URL: *
Summary: Memory leak in sge_schedd
Status whiteboard:        Attachments:                   Issue 243 blocks:
Votes for issue 243:
Opened: Mon Apr 29 05:20:00 -0700 2002
```

Memory leak in sge_schedd. According to the code discussion with Patrik Koch, we have identified several errors in the sge_schedd process:

> Code discussion
> ===============
>
> - sge_process_events.c, event_handles_default_scheduler()
>   line 650: if (is_running)... possible???
>   How can it be that tasks of a newly added job are already running?
>   And a few lines above at the beginning of sgeE_JOB_ADD:
>   How can the job and the relevant task be already in the joblist?
>   line 653: at_inc_job_counter() leaves priority_group_list or
>   user list (PGR_subordinate_list) unsorted!

I also would assume that the job/task should not already be in the joblist. I will fix it.

> - sge_process_events.c, event_handles_default_scheduler()
>   line 1559: sgeE_JATASK_DEL
>   if ja_task is enrolled and the only task of the job -> job is
>   removed from lists.job_list
>   if ja_task is not enrolled but the only task of the job -> job is
>   not removed from lists.job_list ???

I will fix it. (==> Memory leak)

> - sge_job_schedd.c, split_job()
>   line 508: job=NULL;
>   line 525: remaining tasks -> lCopyElem(job) ==>> LERROR !
>   line 567: if (job) -> always false !
>   ??? What should be done with remaining tasks? Are they possible?

I don't think that remaining tasks are possible at the moment. If they occur, they should stay in the source list if they are not needed for scheduling decisions or if it is not necessary to generate scheduling messages for the "qstat -j" output. I will fix the error you found.

------- Additional comments from ernst Mon Apr 29 05:21:54 -0700 2002 -------

Started.

------- Additional comments from ernst Mon Apr 29 06:33:29 -0700 2002 -------

Fixed.

------- Additional comments from ernst Wed May 8 03:24:11 -0700 2002 -------

Review has been done by Andreas.
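For illustration of the sgeE_JATASK_DEL leak described above, here is a small self-contained sketch of the intended cleanup using plain linked lists; job_t, task_t, and the helper functions are hypothetical stand-ins, not the cull-based structures used by sge_schedd.

```c
/* Hypothetical sketch of the sgeE_JATASK_DEL cleanup: when the deleted array
 * task is the last task of a job, the job must be removed from the
 * scheduler's job list whether or not the task was enrolled.  All types and
 * helpers here are illustrative stand-ins for the real cull lists. */
#include <stdbool.h>
#include <stdlib.h>

typedef struct task task_t;
typedef struct job  job_t;

struct task { int id; task_t *next; };
struct job  { int id; task_t *enrolled_tasks; task_t *pending_tasks; job_t *next; };

static void task_remove(task_t **list, int task_id)
{
    for (; *list != NULL; list = &(*list)->next) {
        if ((*list)->id == task_id) {
            task_t *victim = *list;
            *list = victim->next;
            free(victim);
            return;
        }
    }
}

static void job_remove(job_t **job_list, int job_id)
{
    for (; *job_list != NULL; job_list = &(*job_list)->next) {
        if ((*job_list)->id == job_id) {
            job_t *victim = *job_list;
            *job_list = victim->next;
            free(victim);
            return;
        }
    }
}

void on_jatask_del(job_t **job_list, job_t *job, int task_id, bool enrolled)
{
    /* drop the task from whichever sub-list it lives in */
    task_remove(enrolled ? &job->enrolled_tasks : &job->pending_tasks, task_id);

    /* the reported leak: the original code removed the job only when the
     * deleted task was enrolled, so an unenrolled last task left the job
     * element behind in job_list */
    if (job->enrolled_tasks == NULL && job->pending_tasks == NULL)
        job_remove(job_list, job->id);
}
```

The point of the sketch is only the condition: removing the job element should key off "no tasks left", not off whether the deleted task happened to be enrolled.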
#867 | duplicate | IZ245: Need a naming convention for all resource bundles | rhierlmeier | |
Description:
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=245]

```
Issue #: 245              Platform: Sun                  Reporter: rhierlmeier (rhierlmeier)
Component: hedeby         OS: All                        Subcomponent: util
Version: current          CC: None defined               Status: NEW
Priority: P2              Resolution:                    Issue type: ENHANCEMENT
Target milestone: 1.0u5next                              Assigned to: marcingoldyn (marcingoldyn)
QA Contact: rhierlmeier   URL: *
Summary: Need a naming convention for all resource bundles
Status whiteboard:        Attachments:                   Issue 245 blocks:
Votes for issue 245:
Opened: Thu Nov 29 02:33:00 -0700 2007
```

We have a lot of "dead" messages in our resource bundles. We need a testsuite test (or an ant task) which finds unused messages.

------- Additional comments from crei Mon Dec 3 01:56:57 -0700 2007 -------

Already submitted. *** This issue has been marked as a duplicate of 208 ***

------- Additional comments from rhierlmeier Wed Feb 27 06:42:59 -0700 2008 -------

With the fix of this issue we should clean up the resource bundles. We need a naming convention for all resource bundles. We suggest that the name of the resource bundle is the name of the package without the com.sun.grid.grm prefix. We have to check what naming conventions the resource bundles have; maybe dots are not allowed in them. Example: service-impl.properties -> resource bundle for package com.sun.grid.grm.service.

------- Additional comments from rhierlmeier Thu Feb 28 02:47:26 -0700 2008 -------

Reassigned.

------- Additional comments from rhierlmeier Thu Feb 28 03:24:37 -0700 2008 -------

It's now a task.

------- Additional comments from crei Wed Aug 6 05:03:07 -0700 2008 -------

Dead-message detection has been filed as testsuite issue #220. The resource bundle naming convention remains a hedeby task.

------- Additional comments from crei Wed Aug 6 05:05:08 -0700 2008 -------

Renamed the issue summary and changed the subcomponent -- this is not a testsuite issue.

------- Additional comments from rhierlmeier Wed Aug 6 06:50:53 -0700 2008 -------

It's a cleanup issue for future releases (ENHANCEMENT). It should be in the infrastructure subcomponent.

------- Additional comments from rhierlmeier Wed Nov 25 07:21:10 -0700 2009 -------

Milestone changed.
Note: See TracQuery for help on using queries.