Custom Query (431 matches)

Results (106 - 108 of 431)

Ticket Resolution Summary Owner Reporter
#459 fixed IZ2379: manpage of qconf uses "fname" and "file", all should be "fname" reuti
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2379]

        Issue #:      2379             Platform:     Macintosh   Reporter: reuti (reuti)
       Component:     gridengine          OS:        All
     Subcomponent:    man              Version:      6.1u2          CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
          URL:
       * Summary:     manpage of qconf uses "fname" and "file", all should be "fname"
   Status whiteboard:
      Attachments:

     Issue 2379 blocks:
   Votes for issue 2379:


   Opened: Sun Sep 30 11:42:00 -0700 2007 
------------------------


There are entries in the manpage of qconf:

-Aconf file_list <add configurations>
-Ahgrp file <add host group config>
-Mhgrp file <modify host group config.>

better:

-Aconf fname_list <add configurations>
-Ahgrp fname <add host group configuration>
-Mhgrp fname <modify host group configuration>

(other entries also spell out "configuration" in full rather than abbreviating it).
#467 fixed IZ2393: qrsh -inherit should access the 'rsh_command'-cache more persevering brs
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2393]

        Issue #:      2393             Platform:     PC       Reporter: brs (brs)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      6.1u4       CC:    None defined
        Status:       STARTED          Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    roland (roland)
      QA Contact:     pollinger
          URL:
       * Summary:     qrsh -inherit should access the 'rsh_command'-cache more persevering
   Status whiteboard:
      Attachments:

     Issue 2393 blocks:
   Votes for issue 2393:


   Opened: Mon Oct 8 08:57:00 -0700 2007 
------------------------


This is an intermittent problem that occurs with greater frequency as you
increase the number of slots used in a parallel environment.  It seems to affect
only OpenMPI jobs.

Jobs would fail with the following OpenMPI messages for a fairly random
number of qrsh processes (sometimes 2 or 3 calls would fail, sometimes almost
half of them, with 128 processors in use):

[rcn-ib-0013.rc.usf.edu:32328] ERROR: A daemon on node rcn-ib-0003.rc.usf.edu failed to start as expected.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: There may be more information available from
[rcn-ib-0013.rc.usf.edu:32328] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: If the problem persists, please restart the
[rcn-ib-0013.rc.usf.edu:32328] ERROR: Grid Engine PE job
[rcn-ib-0013.rc.usf.edu:32328] ERROR: The daemon exited unexpectedly with status 255.

Then, a core would be dumped in the user's home directory which was from the
qrsh (or qsh) executable.  A quick backtrace with gdb revealed that qrsh was
failing on a call to execvp.  After some investigation, this call was found in
source/clients/qsh/qsh.c:789 (as of a cvs checkout from 10/08/2007).  Though I
am reporting this bug with 6.0u11, I also reproduced it at one point with 6.1u2.

I've gotten around this bug for the time being by LD_PRELOADing the following code:

/*
 * Place holder library to fix bug with qsh.c not correctly passing
 * arguments to execvp()
 */
#define _GNU_SOURCE             /* for RTLD_NEXT; must come before any include */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <sys/utsname.h>
#include <sys/types.h>
#include <dlfcn.h>

int execvp(const char *file, char *const argv[]){
        int pid, i, argc = 1;
        const char *file_n;
        char **args;

        static int (*func)(const char*, char *const *);
        struct utsname u_name;

        uname(&u_name);
        pid = (int)getpid();

        if (file == NULL){
                file_n = "/opt/sge-tools/rsh";
                fprintf (stderr, "[%s:%d] execvp(): had to modify null arguments\n",
                         u_name.nodename, pid);
        } else file_n = file;

        /* In the failing case argv[0] is NULL but argv[1..] are still valid,
         * so count the arguments starting at index 1. */
        while (argv[argc] != NULL) argc++;

        /* The copy needs its own storage plus room for the NULL terminator. */
        args = malloc((argc + 1) * sizeof(*args));
        if (args == NULL){
                errno = ENOMEM;
                return -1;
        }

        if (argv[0] == NULL) args[0] = "/opt/sge-tools/rsh";
        else args[0] = argv[0];
        for (i = 1; i < argc; i++) args[i] = argv[i];
        args[argc] = NULL;

        /* look up the real execvp() and call it with the repaired arguments */
        func = (int (*)(const char*, char *const *))dlsym(RTLD_NEXT, "execvp");

        return (func(file_n, args));
}

Basically, it ensures that the arguments passed to execvp are sane and modifies
them if they are not.  I also needed a wrapper for rsh (hence
/opt/sge-tools/rsh) so that the correct arch would be selected in $SGE_ROOT/utilbin.

Here's some general information:
qconf -sq rcnib.q
qname                 rcnib.q
hostlist              @rcnibNodes
seq_no                51
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            2
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich-p4 ompi-ddr mpich-schrod
rerun                 FALSE
slots                 4
tmpdir                /opt/sge/tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            operators
xuser_lists           NONE
subordinate_list      NONE
complex_values        i_ib=true,p_low=false,t_devel=false
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                2M
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

qconf -sp ompi-ddr
pe_name           ompi-ddr
slots             1024
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

qconf -sconf
global:
execd_spool_dir              /opt/sge/rc/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           admins@rc.usf.edu
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    10000-10500
qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind
rsh_command                  /opt/sge-tools/rsh
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 1

And some more detailed output.  With the following code LD_PRELOADed, a good
call to execvp looks like this:

[rcn-ib-0037.rc.usf.edu:8123] execvp(): /opt/sge-tools/rsh /opt/sge-tools/rsh -n
-p 34391 rcn-ib-0027.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0027/active_jobs/4268.1/1.rcn-ib-0027'
noshell

A bad call will look something like this:

[rcn-ib-0037.rc.usf.edu:8150] execvp(): (null) (null) -n -p 34537
rcn-ib-0028.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0028/active_jobs/4268.1/1.rcn-ib-0028'
noshell


Here's the code:

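#define _GNU_SOURCE             /* for RTLD_NEXT; must come before any include */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/types.h>
#include <dlfcn.h>

/* Print "(null)" for a NULL pointer instead of passing NULL to %s. */
#define STR(s) ((s) != NULL ? (s) : "(null)")

int execvp(const char *file, char *const argv[]){
        int pid, i = 1;
        static int (*func)(const char*, char *const *);
        struct utsname u_name;

        uname(&u_name);
        pid = (int)getpid();

        /* argv[0] is logged separately because it may be NULL in the bad case */
        fprintf (stderr, "[%s:%d] execvp(): %s %s ", u_name.nodename, pid,
                 STR(file), STR(argv[0]));
        while (argv[i] != NULL) {
                fprintf(stderr, "%s ", argv[i++]);
        }
        fprintf(stderr, "\n");

        /* look up the real execvp() and pass the call through unchanged */
        func = (int (*)(const char*, char *const *))dlsym(RTLD_NEXT, "execvp");

        return (func(file, argv));
}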

I can provide more information if required.

   ------- Additional comments from brs Mon Oct 8 13:02:57 -0700 2007 -------
Ok, the actual execvp is in source/clients/qrsh/qrsh_starter.c, line 728.
Should have looked more carefully.

   ------- Additional comments from brs Mon Oct 8 13:05:21 -0700 2007 -------
changed ticket summary to reflect correct source file

   ------- Additional comments from brs Sun Apr 6 10:20:19 -0700 2008 -------
This issue still exists in 6.1u4.  Who can look at this issue with me?

   ------- Additional comments from brs Sun Apr 6 10:28:49 -0700 2008 -------
Would like someone to look at this ticket.  Perhaps it was sent to the wrong person.

   ------- Additional comments from andreas Mon Apr 7 08:55:36 -0700 2008 -------
Moving this to execution subcomponent.

   ------- Additional comments from brs Mon Apr 28 11:53:04 -0700 2008 -------
So, after more investigation, I think I've narrowed the problem down.  In short,
using NFS (at least with Solaris 10 as the NFS server) doesn't work reliably.
It seems that multiple qrsh -inherit processes are making gdi requests to
qmaster for the rsh_command data.  I don't know how many gdi requests qmaster
can handle simultaneously from a single job, but it appears that some of the
qrsh -inherit processes spawned by mpirun can't read the data from
qrsh_client_cache and reach the gdi2_get_configuration line before
the first qrsh can write its data out.  As a result, a large-ish parallel job
makes tons of gdi requests to qmaster.  My guess is that some of these gdi
requests aren't handled properly, so some qrsh processes end up running execvp
without any command string.

Long story short: NFS server on Solaris 10 w/ SGE: bad.  Let me know what you think.

   ------- Additional comments from brs Mon Apr 28 11:53:54 -0700 2008 -------
For clarification, I meant NFS as a path for tmpdir.

   ------- Additional comments from brs Mon Apr 28 11:55:10 -0700 2008 -------
Also, it appears that some type of file locking would completely fix this issue
if the gdi hypothesis is true.
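
As a rough illustration of how a writer could keep readers from ever seeing a half-written cache file, here is a minimal sketch. Note that it uses write-to-temporary-file plus rename(2) rather than the file locking suggested above, and that it is not taken from the Grid Engine source: write_cache_atomically() and its parameters are hypothetical names. Whether rename behaves atomically on the NFS setup described in this ticket is an untested assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Grid Engine): write `command` as a single
 * line to dir/name by first writing a private temporary file and then
 * renaming it into place. Returns 0 on success, -1 on any failure. */
int write_cache_atomically(const char *dir, const char *name,
                           const char *command)
{
    char final_path[1024], tmp_path[1024];
    int n, ok;
    FILE *out;

    assert(dir != NULL && name != NULL && command != NULL);
    if (strlen(command) == 0)
        return -1;                      /* nothing useful to cache */

    n = snprintf(final_path, sizeof final_path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof final_path)
        return -1;                      /* path would be truncated */

    /* The pid makes the temporary name unique per writing process. */
    n = snprintf(tmp_path, sizeof tmp_path, "%s/%s.%ld.tmp",
                 dir, name, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof tmp_path)
        return -1;

    out = fopen(tmp_path, "w");
    if (out == NULL)
        return -1;

    ok = fprintf(out, "%s\n", command) > 0;
    if (fclose(out) != 0)               /* fclose flushes; check it too */
        ok = 0;

    /* Only a completely written file is ever moved to the final name. */
    if (!ok || rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

With this approach a reader either finds no cache file or a complete one, so the reader side needs no lock; it does not by itself stop several processes from issuing the GDI request before the first cache file appears.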

   ------- Additional comments from andreas Tue Apr 29 08:15:24 -0700 2008 -------
I can't evaluate the NFS issue, but I think the code in get_client_name() should
be more persistent in order to deal with such cases. As it is now, qrsh -inherit
tries only once to read the cache and then immediately does a GDI request as a
fallback. The GDI query is certainly needed as a fallback, but doing a couple of
retries is certainly not wrong if qrsh could stat(2) the cache file beforehand:

      if (tmpdir != NULL) {
         FILE *cache;

         /* build filename of cache file */
         sge_dstring_init(&cache_name_dstring, cache_name_buffer, SGE_PATH_MAX);
         cache_name = sge_dstring_sprintf(&cache_name_dstring, "%s/%s",
                                          tmpdir, QRSH_CLIENT_CACHE);
         cache = fopen(cache_name, "r");
         if (cache != NULL) {
            char cached_command[SGE_PATH_MAX];

            if (fgets(cached_command, SGE_PATH_MAX, cache) != NULL) {
               if (strcasecmp(cached_command, "builtin") == 0) {
                  g_new_interactive_job_support = true;
               }

               FCLOSE(cache);
               DPRINTF(("found cached client name: %s\n", cached_command));
               DEXIT;
               return strdup(cached_command);
            }
            FCLOSE(cache);
         }
      }
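
The retry idea described in the comment above could look roughly like the following. This is only a sketch under stated assumptions, not the fix that went into Grid Engine: read_cache_with_retry(), the attempt count and the delay are hypothetical, and the rule "retry only when stat(2) shows the file exists" is one reading of the suggestion.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper (not part of Grid Engine): read the first line of the
 * cache file into buf, retrying up to `attempts` times with `delay_ms`
 * milliseconds between tries. Returns 1 on success, 0 when the caller
 * should fall back to the GDI request. */
int read_cache_with_retry(const char *path, char *buf, size_t len,
                          int attempts, long delay_ms)
{
    struct stat st;
    int i;

    assert(path != NULL && buf != NULL && len > 0 && delay_ms >= 0);

    /* If the file does not exist at all, nobody has started writing the
     * cache yet, so retrying would only delay the GDI fallback. */
    if (stat(path, &st) != 0)
        return 0;

    for (i = 0; i < attempts; i++) {
        FILE *cache = fopen(path, "r");
        if (cache != NULL) {
            char *line = fgets(buf, (int)len, cache);
            fclose(cache);
            if (line != NULL && buf[0] != '\0' && buf[0] != '\n') {
                buf[strcspn(buf, "\n")] = '\0';  /* strip trailing newline */
                return 1;
            }
        }
        if (i + 1 < attempts) {
            /* The file exists but is still empty or unreadable: wait briefly. */
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return 0;
}
```

A caller would try something like read_cache_with_retry(cache_name, cached_command, SGE_PATH_MAX, 3, 100) first and issue the GDI request only when it returns 0; the values 3 and 100 are placeholders.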
#486 fixed IZ2482: sge_qquota not processed properly Dave Love <d.love@…> clarinet
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2482]

        Issue #:      2482             Platform:     PC       Reporter: clarinet (clarinet)
       Component:     gridengine          OS:        Linux
     Subcomponent:    clients          Version:      6.1u3       CC:    ernst
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    roland (roland)
      QA Contact:     roland
          URL:
       * Summary:     sge_qquota not processed properly
   Status whiteboard:
      Attachments:

     Issue 2482 blocks:
   Votes for issue 2482:


   Opened: Mon Feb 4 15:28:00 -0700 2008 
------------------------


Parameters in ~/.sge_qquota or the system-wide sge_qquota file are not processed
properly; the qquota command seems to ignore the first word in the file.

If ~/.sge_qquota contains "-u *", running qquota generates the following
error: 'error: ERROR! invalid option argument "*"'.

If ~/.sge_qquota is empty, running qquota generates the following
error: 'error: ERROR! invalid option argument "pwf"'.

   ------- Additional comments from clarinet Tue Feb 5 03:13:29 -0700 2008 -------
Isn't there a problem on line 470 of the qquota.c file? Using "++argv" seems fine
for real command line arguments but may fail when processing arguments read
from a file (where there is no argv[0]).

Also, I am not sure if qquota processes duplicate arguments (those that appear
in a file as well as on a command line) correctly.

   ------- Additional comments from andreas Tue Feb 5 03:52:52 -0700 2008 -------
The

   rp = ++argv;

looks strange in any case. With

   rp = argv++;

the code in sge_parse_cmdline_qquota() makes more sense.

Thanks for reporting!
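
To make the difference between the two forms concrete, here is a small standalone sketch. It is not the code from qquota.c; the two function names are hypothetical and only isolate the pointer arithmetic discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: the first element a parser looks at when it
 * starts with `rp = ++argv`. The increment happens first, so element 0 is
 * skipped, which is right only if element 0 is the program name. */
const char *first_seen_preincrement(char **argv)
{
    char **rp;

    assert(argv != NULL);
    rp = ++argv;
    return *rp;
}

/* The same with `rp = argv++`: rp receives the old value of argv, so
 * parsing starts at element 0. */
const char *first_seen_postincrement(char **argv)
{
    char **rp;

    assert(argv != NULL);
    rp = argv++;
    (void)argv;     /* the incremented copy is not needed in this sketch */
    return *rp;
}
```

For a vector built from a file containing "-u *", that is { "-u", "*", NULL }, the pre-increment form starts at "*", which is consistent with the 'invalid option argument "*"' error reported above, while the post-increment form starts at "-u".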