Opened 14 years ago

Closed 9 years ago

#236 closed defect (duplicate)

IZ1531: filtering with qhost broken

Reported by: olle      Owned by:
Priority:    normal    Milestone:
Component:   sge       Version:  6.0u3
Severity:    minor     Keywords: Sun Solaris clients
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1531]

   Issue #: 1531   Platform: Sun   Reporter: olle (olle)
   Component: gridengine   OS: Solaris
   Subcomponent: clients   Version: 6.0u3   CC: crei, uddeborg
   Status: VERIFIED   Priority: P3
   Resolution: FIXED   Issue type: DEFECT
     Target milestone: ---
   Assigned to: sgrell (sgrell)
   QA Contact: roland
   URL:
   * Summary: Resource filtering with qhost broken
   Status whiteboard:
   Attachments:
   Date/filename:                               Description:                                                                                          Submitted by:
   Wed Mar 30 08:10:00 -0700 2005: qhost.out    Output from 'qhost -l arch=xxx' (text/plain)                                                          olle
   Wed Mar 30 08:25:00 -0700 2005: qstat.out.gz Output from 'qstat -F' (application/x-gzip)                                                           olle
   Fri Apr 15 05:36:00 -0700 2005: 1            Output from qping (text/plain)                                                                        olle
   Wed Jun 8 23:43:00 -0700 2005: p             Having slept on it: Would not this be a simpler way to achieve the same effect in qhost? (text/plain) uddeborg

   Opened: Tue Mar 29 11:24:00 -0700 2005 
------------------------


We are running qmaster on a lx24-x86 machine and exec hosts on a few different
platforms.

It appears that 'qhost -l <some resource>=<some value>' will never filter out
Solaris hosts, even if they don't match the resource requirement.

E.g., the following command lists our Solaris hosts, but no hosts of other
architectures:

elmira/users/olle[122]% qhost -l arch=xxx
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
cdg                     sol-sparc       2  0.03  640.0M  202.0M    2.0G   10.0M
devilslake              sol-sparc64     2  0.01    8.0G    1.5G   16.0G     0.0
elmira                  sol-sparc64     2  2.09    4.0G    1.2G    8.0G    1.0M
resistencia             sol-sparc64     2  0.01    8.0G    1.2G   16.0G    1.0M
sundance                sol-sparc64     2  0.01    4.0G  845.0M    4.0G     0.0
turin                   sol-sparc64     2  0.02    4.0G  919.0M    8.0G     0.0

   ------- Additional comments from sgrell Wed Mar 30 05:21:41 -0700 2005 -------
This issue is already submitted.

Stephan

*** This issue has been marked as a duplicate of 1306 ***

   ------- Additional comments from olle Wed Mar 30 05:33:28 -0700 2005 -------
Is it really the same bug?

In our case the resource is properly defined both in the complex definition and
on the actual execution hosts.

It also works on other platforms than Solaris. Tested on HP-UX, Linux and AIX.

   ------- Additional comments from sgrell Wed Mar 30 06:56:27 -0700 2005 -------
Hi, it is probably not the same issue. I did a couple of tests and cannot
replicate the problem. I had maintrunk and u3 masters on sol-sparc64 and
lx26-x86. I executed the qhost command under solaris64 and lx26-x86. It works.

Could you post a "qstat -F" output? I assume that it is a configuration issue.

my qhost output:

qhost -l arch=xxx
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -



Stephan

   ------- Additional comments from olle Wed Mar 30 07:28:10 -0700 2005 -------
Now this is getting really weird.

Running the same command again will now list our lx24-amd64 machines, but only
AMD Opterons, no Intel EM64T. The Solaris machines are now filtered out as they
should be.

Nothing in the configuration has changed since yesterday.

I will attach the output from today's 'qhost -l arch=xxx' and a 'qstat -F'.

I get a feeling that this might disappear if I restart qmaster.

   ------- Additional comments from olle Wed Mar 30 07:40:42 -0700 2005 -------
The EM64T machine was not included in the @allhost group, and thus not running
any queues, which might explain why it wasn't listed.

A stupid question. What file type should I use for plain text when creating an
attachment?

   ------- Additional comments from olle Wed Mar 30 08:10:40 -0700 2005 -------
Created an attachment (id=43)
Output from 'qhost -l arch=xxx'

   ------- Additional comments from olle Wed Mar 30 08:25:19 -0700 2005 -------
Created an attachment (id=44)
Output from 'qstat -F'

   ------- Additional comments from olle Thu Mar 31 11:57:39 -0700 2005 -------
I have restarted qmaster and the problem is gone.

Before the restart the original behaviour was back, i.e. the resource
filtering was broken on all Solaris hosts.

Shall we close this bug or keep it open with lower priority?

At least the headline is misleading since it seems to be more related to the
qmaster, actually running on Linux, than to the exec hosts running on Solaris.
(I will remove the Solaris part of the headline.)

   ------- Additional comments from olle Thu Apr 14 14:46:16 -0700 2005 -------
Sorry for the late comment, but the problem was back again within a few hours
after the restart of qmaster. It is definitely looping over the sets of hosts
of the various architectures we have in the cluster.

An observation that could be relevant, but doesn't have to be, is that one of
the sge_qmaster threads is constantly running at 100% even when there is no
load on the batch system. It runs at just a few percent after a restart, but
once it has been up for a while it always goes up to 100% and never comes down
again until killed.

   ------- Additional comments from andreas Fri Apr 15 05:08:11 -0700 2005 -------
When the problem occurs next time please try running

 # $SGE_ROOT/bin/sol-sparc64/qping -f <qmaster-host> <qmaster-port> qmaster 1

e.g. the "info" line in qping -f output unveils threads that are blocked

info:  EDT: R (0.15) | TET: R (1.17) | MT: R (0.15) | SIGT: R (1004.63) | ok

my case only qmaster signal thread was inactive over long time which is fine.
Important information in that output is all other threads are alive.

   ------- Additional comments from olle Fri Apr 15 05:32:29 -0700 2005 -------
I can't see anything suspicious in the qping output, but I will attach it in
case someone else can get some information from it.

One other observation is that the qmaster always runs with one thread at 100%
even when no jobs are running. If the qmaster is restarted it will run with no
load for a while, but then always goes up to 100% again after some hours.

   ------- Additional comments from olle Fri Apr 15 05:36:41 -0700 2005 -------
Created an attachment (id=46)
Output from qping

   ------- Additional comments from olle Fri Apr 15 05:38:35 -0700 2005 -------
Output from top (u sgeadmin):

25729 sgeadmin  25   0  269M 269M  8220 R    24.4  6.9 21033m   1 sge_qmaster
25731 sgeadmin  15   0 27100  26M  1996 S     1.0  0.6  1293m   3 sge_schedd
25724 sgeadmin  15   0  269M 269M  8220 S     0.1  6.9 145:04   0 sge_qmaster
25727 sgeadmin  15   0  269M 269M  8220 S     0.1  6.9  98:59   0 sge_qmaster
25717 sgeadmin  25   0  269M 269M  8220 S     0.0  6.9   0:01   0 sge_qmaster
25728 sgeadmin  25   0  269M 269M  8220 S     0.0  6.9   0:00   0 sge_qmaster

   ------- Additional comments from andreas Fri Apr 15 05:50:23 -0700 2005 -------
How many execds do you have in your cluster? The qping -f output indicates
that qmaster has up to 114 open connections. Have you checked whether you're
suffering from #1517?

   ------- Additional comments from olle Fri Apr 15 05:59:50 -0700 2005 -------
The qmaster is running on a RHEL3 machine, which I think has a default limit
of 1024 file descriptors.

% qconf -sel|wc -l
    127

   ------- Additional comments from andreas Fri Apr 15 06:07:49 -0700 2005 -------
A lower limit might be in effect.
Please do a

 # grep "qmaster will" $SGE_ROOT/default/spool/qmaster/messages


   ------- Additional comments from olle Fri Apr 15 06:18:51 -0700 2005 -------
This might be something.

I can look into the source code of course, but in case you already know how
this calculation is done, or where to read about it, it would be interesting
if you could share the knowledge.

03/31/2005 10:20:14|qmaster|wake|I|qmaster will use max. 1004 file descriptors
for communication
03/31/2005 10:20:14|qmaster|wake|I|qmaster will accept max. 99 dynamic event clients
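As a rough pointer (a hedged sketch, not the actual qmaster code): daemons
usually derive such a figure from the per-process descriptor limit reported by
getrlimit(RLIMIT_NOFILE), minus a small reserve for internal use; the
1004-of-1024 figure above would fit that pattern, but the real formula would
have to be confirmed in the SGE source. A self-contained check:

   #include <stdio.h>
   #include <sys/resource.h>

   int main(void)
   {
      struct rlimit rl;

      /* RLIMIT_NOFILE is the per-process limit on open file descriptors */
      if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
         perror("getrlimit");
         return 1;
      }

      /* the reserve of 20 descriptors is purely illustrative */
      printf("soft limit: %ld, left for communication (example): %ld\n",
             (long)rl.rlim_cur, (long)rl.rlim_cur - 20);
      return 0;
   }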


   ------- Additional comments from andreas Fri Apr 15 06:23:48 -0700 2005 -------
Well. Actually this looks good. The number of available fd's should be
sufficient. The 99 limitation is of relevance only if you are heavily using
either "qsub -sync y" or libdrmaa.so. I assume this is not the case with your
cluster.

   ------- Additional comments from uddeborg Thu Jun 2 09:10:49 -0700 2005 -------
An update (from the same site as the original report):

We are now testing 6.0u5_rc1 on the master and on selected execution servers,
and also for the client commands.

Qmaster is no longer spinning at 100%.  That issue seems to be resolved in a
different thread.  (We now have two very busy threads, but not 100% of a CPU each.)

We still see this problem.  Exactly what hosts are shown when we ask for a
nonexistent architecture varies over time.  But we do get a list most of the time.

   ------- Additional comments from uddeborg Fri Jun 3 09:17:31 -0700 2005 -------
Adding some tracing information: Immediately after get_all_lists() in qhost.c,
the extra hosts already have the EH_tagged field set. The selection later on
correctly matches only the relevant hosts. But since tags are only set there,
not cleared, both the correctly matched hosts and the "pretagged" ones are
printed.

   ------- Additional comments from sgrell Sun Jun 5 23:48:11 -0700 2005 -------
Thanks for analysing this bug. The bug happens when verify_suitable_queues() in
the qmaster is called. If qsub -w v and similar commands are called for a pe
job, that method calls sge_select_parallel_environment(), which tags the hosts
that are suitable. There is no code in verify_suitable_queues() to clean up
the tags afterwards.

There are three possible solutions for this bug. One can
- clean the tag field right after sge_select_parallel_environment() has been
  called,
- change sge_select_parallel_environment() to clean up after it is done, or
- clean the tag fields in the qhost call.

sge_select_parallel_environment() already cleans the tags before it starts
processing. It might be a good idea to move that code so it cleans up
afterwards instead.

The following code will do it:

   lListElem *host = NULL;

   /* clear the EH_tagged field on every exec host in the list */
   for_each(host, host_list) {
      lSetUlong(host, EH_tagged, 0);
   }
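For comparison, a hypothetical sketch of the second option, with the cleanup
moved after the call; the argument list of sge_select_parallel_environment()
is elided here, and the surrounding verify_suitable_queues() code is assumed,
not quoted from the real source:

   /* after the PE selection has tagged the suitable hosts ... */
   sge_select_parallel_environment(/* ... arguments elided ... */);

   /* ... run the same loop again, so no stale EH_tagged values survive
      into data later shown by read-only clients such as qhost */
   for_each(host, host_list) {
      lSetUlong(host, EH_tagged, 0);
   }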

   ------- Additional comments from sgrell Mon Jun 6 01:07:29 -0700 2005 -------
Looking into it.

Stephan

   ------- Additional comments from sgrell Mon Jun 6 01:08:47 -0700 2005 -------
Fixed in maintrunk and for u5.

Stephan

   ------- Additional comments from uddeborg Wed Jun 8 09:25:46 -0700 2005 -------
I rebuilt (only) qhost from the u5 branch, and it seems to work fine now.

   ------- Additional comments from uddeborg Wed Jun 8 23:43:48 -0700 2005 -------
Created an attachment (id=63)
Having slept on it: Would not this be a simpler way to achieve the same effect in qhost?

Change History (1)

comment:1 Changed 9 years ago by dlove

  • Resolution set to duplicate
  • Severity set to minor
  • Status changed from new to closed

IZ1306 is fixed.
