Custom Query (431 matches)

Results (10 - 12 of 431)

Ticket Resolution Summary Owner Reporter
#166 duplicate IZ1010: Job array lack means to get email notification for the total array andreas
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1010]

        Issue #:      1010             Platform:     All           Reporter: andreas (andreas)
       Component:     gridengine          OS:        All
     Subcomponent:    qmaster          Version:      6.0beta2         CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     ernst
          URL:
       * Summary:     Job array lack means to get email notification for the total array
   Status whiteboard:
      Attachments:

     Issue 1010 blocks:
   Votes for issue 1010:


   Opened: Fri Apr 30 05:15:00 -0700 2004 
------------------------


DESCRIPTION:
There is a need for a means to submit job arrays in a way that allows
email notification to be requested for the total job.
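
By way of illustration (the command below is an added sketch, not part
of the original report): an array submission along these lines results
in one notification mail per completed task, with no way to request a
single mail for the array as a whole.

   % qsub -t 1-100 -m e -M user@example.com jobscript.sh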

Workaround:
The following solution helps only in the last phase of the e-mail
delivery: from the e-mail daemon to the inbox. It doesn't reduce the
number of e-mails generated by SGE, nor the number of messages that
have to pass through any intermediate e-mail daemons.

So the solution is to use procmail for e-mail filtering. procmail is
set to run from .forward and is handed the message from the daemon
instead of writing it into the inbox. procmail has duplicate-mail
detection capabilities, explained in 'man procmailex'. You can set it
to check for the jobid and deliver only one message per jobid.
procmail keeps a small cache of already seen matching sequences, so
if the tasks take too long to complete while many other jobs finish,
it might allow more than one message per jobid; however, the cache
size is configurable, so you can tune it to obtain the best results.
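
As a rough illustration only (not part of the original report), the
duplicate-weeding recipe from 'man procmailex' looks like the one
below; the 8192-byte cache size is the tunable referred to above. As
written it keys on the Message-ID header, so deduplicating on the SGE
jobid instead would first require extracting the id from the message,
which is not shown here.

   # ~/.procmailrc -- consider a message delivered if its Message-ID
   # was already seen; cache file name and size (8192 bytes) are tunables
   :0 Wh: msgid.lock
   | formail -D 8192 msgid.cache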

   ------- Additional comments from sgrell Mon Dec 12 02:55:45 -0700 2005 -------
Changed the Subcomponent.

Stephan
#207 duplicate IZ1321: change default value for pe attribs start/stop_proc_args joga
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1321]

        Issue #:      1321             Platform:     All           Reporter: joga (joga)
       Component:     gridengine          OS:        All
     Subcomponent:    clients          Version:      current          CC:    None defined
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     roland
          URL:
       * Summary:     change default value for pe attribs start/stop_proc_args
   Status whiteboard:
      Attachments:

     Issue 1321 blocks:
   Votes for issue 1321:


   Opened: Mon Nov 8 08:17:00 -0700 2004 
------------------------


The default value for the start_proc_args and stop_proc_args
attributes of a parallel environment is currently /bin/true.

We should use NONE as the default value instead (a sketch follows
below), because
- /bin/true isn't available on all platforms (darwin)
- it implies starting a binary during the startup of a parallel job
  that has no effect on the job itself
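
For illustration only (the values below are an added sketch, not taken
from the issue): with NONE as the default, the relevant lines of a
parallel environment configuration, as printed by 'qconf -sp <pe_name>',
would look roughly like this (remaining attributes omitted):

   pe_name            example_pe
   slots              16
   start_proc_args    NONE
   stop_proc_args     NONE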

   ------- Additional comments from sgrell Mon Dec 12 02:42:08 -0700 2005 -------
Changed subcomponent.

Stephan

   ------- Additional comments from joga Tue May 16 23:56:47 -0700 2006 -------
Raising priority.

/bin/true is not available on all platforms (e.g. on darwin, it is /usr/bin/true).
Jobs submitted into a parallel environment created with the default settings
will fail.
#236 duplicate IZ1531: filtering with qhost broken olle
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1531]

   Issue #: 1531   Platform: Sun   Reporter: olle (olle)
   Component: gridengine   OS: Solaris
   Subcomponent: clients   Version: 6.0u3   CC: crei, uddeborg
   Status: VERIFIED   Priority: P3
   Resolution: FIXED   Issue type: DEFECT
     Target milestone: ---
   Assigned to: sgrell (sgrell)
   QA Contact: roland
   URL:
   * Summary: Resource filtering with qhost broken
   Status whiteboard:
   Attachments:
   Date/filename:                               Description:                                                                                          Submitted by:
   Wed Mar 30 08:10:00 -0700 2005: qhost.out    Output from 'qhost -l arch=xxx' (text/plain)                                                          olle
   Wed Mar 30 08:25:00 -0700 2005: qstat.out.gz Output from 'qstat -F' (application/x-gzip)                                                           olle
   Fri Apr 15 05:36:00 -0700 2005: 1            Output from qping (text/plain)                                                                        olle
   Wed Jun 8 23:43:00 -0700 2005: p             Having slept on it: Would not this be a simpler way to achieve the same effect in qhost? (text/plain) uddeborg
     Issue 1531 blocks:
   Votes for issue 1531:

   Opened: Tue Mar 29 11:24:00 -0700 2005 
------------------------


We are running qmaster on a lx24-x86 machine and exec hosts on a few different
platforms.

It appears that 'qhost -l <some resource>=<some value>' will never filter out
Solaris hosts, even if they don't match the resource requirement.

E.g., the following command lists our Solaris hosts, but no hosts of other
architectures:

elmira/users/olle[122]% qhost -l arch=xxx
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
cdg                     sol-sparc       2  0.03  640.0M  202.0M    2.0G   10.0M
devilslake              sol-sparc64     2  0.01    8.0G    1.5G   16.0G     0.0
elmira                  sol-sparc64     2  2.09    4.0G    1.2G    8.0G    1.0M
resistencia             sol-sparc64     2  0.01    8.0G    1.2G   16.0G    1.0M
sundance                sol-sparc64     2  0.01    4.0G  845.0M    4.0G     0.0
turin                   sol-sparc64     2  0.02    4.0G  919.0M    8.0G     0.0

   ------- Additional comments from sgrell Wed Mar 30 05:21:41 -0700 2005 -------
is already submitted.

Stephan

*** This issue has been marked as a duplicate of 1306 ***

   ------- Additional comments from olle Wed Mar 30 05:33:28 -0700 2005 -------
Is it really the same bug?

In our case the resource is properly defined both in the complex definition and
on the actual execution hosts.

It also works on other platforms than Solaris. Tested on HP-UX, Linux and AIX.

   ------- Additional comments from sgrell Wed Mar 30 06:56:27 -0700 2005 -------
Hi, it is probably not the same issue. I did a couple of tests and cannot
replicate the problem. I had maintrunk and u3 masters on sol-sparc64 and
lx26-x86. I executed the qhost command under solaris64 and lx26-x86. It works.

Could you post a "qstat -F"? I assume that it is a configuration issue.

my qhost output:

qhost -l arch=xxx
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -



Stephan

   ------- Additional comments from olle Wed Mar 30 07:28:10 -0700 2005 -------
Now this is getting really weird.

Running the same command again will now list our lx24-amd64 machines, but only
AMD Opterons, no Intel EM64T. The Solaris machines are now filtered out as they
should be.

Nothing in the configuration has changed since yesterday.

I will attach the output from today's 'qhost -l arch=xxx' and a 'qstat -F'.

I get a feeling that this might disappear if I restart qmaster.

   ------- Additional comments from olle Wed Mar 30 07:40:42 -0700 2005 -------
The EM64T machine was not included in the @allhost group, and thus not running
any queues, which might explain why it wasn't listed.

A stupid question. What file type should I use for plain text when creating an
attachment?

   ------- Additional comments from olle Wed Mar 30 08:10:40 -0700 2005 -------
Created an attachment (id=43)
Output from 'qhost -l arch=xxx'

   ------- Additional comments from olle Wed Mar 30 08:25:19 -0700 2005 -------
Created an attachment (id=44)
Output from 'qstat -F'

   ------- Additional comments from olle Thu Mar 31 11:57:39 -0700 2005 -------
I have restarted qmaster and the problem is gone.

Before the restart the original behaviour was back, i.e. the resource
filtering was broken on all Solaris hosts.

Shall we close this bug or keep it open with less priority?

At least the headline is misleading since it seems to be more related to the
qmaster, actually running on Linux, than the exec hosts running on Solaris. (I
will remove the Solaris part of the headline.)

   ------- Additional comments from olle Thu Apr 14 14:46:16 -0700 2005 -------
Sorry for the late comment, but the problem was back again within a few hours
after the restart of qmaster. It is definitely looping over the sets of hosts
of various architectures we have in the cluster.

An observation that could be relevant, but doesn't have to be, is that one of
the sge_qmaster threads is constantly running at 100% even when there is no
load on the batch system. It runs at just a few percent after a restart, but
once it has been up for a while it always goes up to 100% and never goes down
again until it is killed.

   ------- Additional comments from andreas Fri Apr 15 05:08:11 -0700 2005 -------
When the problem occurs next time please try running

 # $SGE_ROOT/bin/sol-sparc64/qping -f <qmaster-host> <qmaster-port> qmaster 1

e.g. the "info" line in qping -f output unveils threads that are blocked

info:  EDT: R (0.15) | TET: R (1.17) | MT: R (0.15) | SIGT: R (1004.63) | ok

In my case only the qmaster signal thread was inactive over a long time, which
is fine. The important information in that output is that all other threads
are alive.

   ------- Additional comments from olle Fri Apr 15 05:32:29 -0700 2005 -------
I can't see anything suspicious in the qping output, but I will attach it in
case someone else can get some information from it.

One other observation is that the qmaster always runs with one thread at 100%
even when no jobs are running. If the qmaster is restarted it runs with no
load for a while, but it always goes up to 100% again after some hours.

   ------- Additional comments from olle Fri Apr 15 05:36:41 -0700 2005 -------
Created an attachment (id=46)
Output from qping

   ------- Additional comments from olle Fri Apr 15 05:38:35 -0700 2005 -------
Output from top (u sgeadmin):

25729 sgeadmin  25   0  269M 269M  8220 R    24.4  6.9 21033m   1 sge_qmaster
25731 sgeadmin  15   0 27100  26M  1996 S     1.0  0.6  1293m   3 sge_schedd
25724 sgeadmin  15   0  269M 269M  8220 S     0.1  6.9 145:04   0 sge_qmaster
25727 sgeadmin  15   0  269M 269M  8220 S     0.1  6.9  98:59   0 sge_qmaster
25717 sgeadmin  25   0  269M 269M  8220 S     0.0  6.9   0:01   0 sge_qmaster
25728 sgeadmin  25   0  269M 269M  8220 S     0.0  6.9   0:00   0 sge_qmaster

   ------- Additional comments from andreas Fri Apr 15 05:50:23 -0700 2005 -------
How many execd's do you have in your cluster? The qping -f indicates
qmaster has up to 114 open connections. Have you checked whether you're
suffering from #1517?

   ------- Additional comments from olle Fri Apr 15 05:59:50 -0700 2005 -------
The qmaster is running on a RHEL3 machine which I think should have a limit of
1024 descriptors as default.

% qconf -sel|wc -l
    127

   ------- Additional comments from andreas Fri Apr 15 06:07:49 -0700 2005 -------
A lower limit might be effective.
Please do a

 # grep "qmaster will" $SGE_ROOT/default/spool/qmaster/messages


   ------- Additional comments from olle Fri Apr 15 06:18:51 -0700 2005 -------
This might be something.

I can look into the source code of course, but in case you already know how this
calculation is done or where to read about it, it would be interesting to share
the knowledge.

03/31/2005 10:20:14|qmaster|wake|I|qmaster will use max. 1004 file descriptors for communication
03/31/2005 10:20:14|qmaster|wake|I|qmaster will accept max. 99 dynamic event clients


   ------- Additional comments from andreas Fri Apr 15 06:23:48 -0700 2005 -------
Well. Actually this looks good. The number of available fd's should be
sufficient. The 99 limitation is of relevance only if you are heavily using
either "qsub -sync y" or libdrmaa.so. I assume this is not the case with your
cluster.

   ------- Additional comments from uddeborg Thu Jun 2 09:10:49 -0700 2005 -------
An update (from the same site as the original report):

We are now testing 6.0u5_rc1 as master and selected execution servers, and also
as client commands.

Qmaster is no longer spinning at 100%.  That issue seems to be resolved in a
different thread.  (We now have two very busy threads, but not at 100% of a CPU each.)

We still see this problem.  Exactly what hosts are shown when we ask for a
nonexistent architecture varies over time.  But we do get a list most of the time.

   ------- Additional comments from uddeborg Fri Jun 3 09:17:31 -0700 2005 -------
Adding some tracing information: Immediately after get_all_lists() in qhost.c,
the extra hosts already have the EH_tagged field set.  The selection later on
correctly matches only the relevant hosts.  But since tags are only set there,
not cleared, both the correctly matched and those "pretagged" are printed.

   ------- Additional comments from sgrell Sun Jun 5 23:48:11 -0700 2005 -------
Thanks for analysing this bug. The bug happens when verify_suitable_queues in
the qmaster is called. If qsub -w v and similar options are used for a pe job,
that method calls sge_select_parallel_environment, which tags the hosts that
are suitable. There is no code in the function verify_suitable_queues to clean
up these settings.

There are three possible solutions for this bug. One can
- clean the tag field right after sge_select_parallel_environment was called,
- change sge_select_parallel_environment to clean up after it is done, or
- clean the tag fields in the qhost call.

The call to sge_select_parallel_environment already cleans the tags before it
starts the processing. It might be a good idea to move that code so it cleans
up afterwards.

The following code will do it:

   /* clear the tag on every host so stale tags cannot leak into later output */
   lListElem *host = NULL;
   for_each(host, host_list) {
      lSetUlong(host, EH_tagged, 0);
   }

   ------- Additional comments from sgrell Mon Jun 6 01:07:29 -0700 2005 -------
Looking into it.

Stephan

   ------- Additional comments from sgrell Mon Jun 6 01:08:47 -0700 2005 -------
Fixed in maintrunk and for u5.

Stephan

   ------- Additional comments from uddeborg Wed Jun 8 09:25:46 -0700 2005 -------
I rebuilt (only) qhost from the u5 branch, and it seems to work fine now.

   ------- Additional comments from uddeborg Wed Jun 8 23:43:48 -0700 2005 -------
Created an attachment (id=63)
Having slept on it: Would not this be a simpler way to achieve the same effect in qhost?