Opened 50 years ago

Last modified 9 years ago

#934 new task

IZ704: problems with cloud sim host when host resolves hostname to long one

Reported by: zwierzak Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u5
Severity: minor Keywords: patch Sun cloud_adapter
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=704]

        Issue #:      704                 Platform:     Sun         Reporter: zwierzak (zwierzak)
       Component:     hedeby                 OS:        All
     Subcomponent:    cloud_adapter       Version:      1.0u5          CC:    None defined
        Status:       STARTED             Priority:     P3
      Resolution:                        Issue type:    TASK
                                      Target milestone: 1.0u5next
      Assigned to:    torsten (torsten)
      QA Contact:     adoerr
          URL:
       * Summary:     problems with cloud sim host when host resolves hostname to long one
   Status whiteboard:
      Attachments:
                      Date/filename:                                Description:                         Submitted by:
                      Fri Nov 27 06:31:00 -0700 2009: simhost.patch untested proposed patch (text/plain) torsten


     Issue 704 blocks:


   Opened: Fri Nov 27 06:17:00 -0700 2009 
------------------------


   Description:

   When sim resources are added to the system, hedeby may set the
   resourceHostname property to the long host name (for example myhost.COM
   instead of myhost). The resource can then go into ERROR state when it is
   removed.

   > /sdmadm shist -r res#15
   > time_stamp              type                        service_name   resource             description
   > ------------------------------------------------------------------------------------------------------------------------------------------
   > 11/27/2009 12:19:08.794 RESOURCE_ADD                simcloud_max   sim1(res#15)
   > 11/27/2009 12:19:08.798 RESOURCE_ADDED              simcloud_max   sim1(res#15)         New virtual resource added
   > 11/27/2009 12:19:08.800 RESOURCE_PROPERTIES_CHANGED simcloud_max   sim1(res#15)         [[U:usage=inf->1]]
   > 11/27/2009 12:19:29.542 RESOURCE_REMOVE             simcloud_max   sim1(res#15)
   > 11/27/2009 12:19:29.546 RESOURCE_PROPERTIES_CHANGED simcloud_max   sim1(res#15)         [[U:annotation=Starting up virtual resource]]
   > 11/27/2009 12:19:29.556 RESOURCE_PROPERTIES_CHANGED simcloud_max   sim1(res#15)         [[U:annotation=Starting up simhost]]
   > 11/27/2009 12:19:31.216 RESOURCE_PROPERTIES_CHANGED simcloud_max   myhost.COM(res#15)   [[I:resourceHostname=myhost.COM], [I:simhost=true]]
   >                         RESOURCE_REMOVED            simcloud_max   myhost.COM(res#15)   Virtual resource successfully started
   > 11/27/2009 12:19:31.248 RESOURCE_ADD                spare_pool     myhost.COM(res#15)
   > 11/27/2009 12:19:31.251 RESOURCE_ADDED              spare_pool     myhost.COM(res#15)
   > 11/27/2009 12:19:31.257 RESOURCE_PROPERTIES_CHANGED spare_pool     myhost.COM(res#15)   [[U:usage=inf->1]]
   > 11/27/2009 12:22:00.213 RESOURCE_REMOVE             spare_pool     myhost.COM(res#15)
   > 11/27/2009 12:22:00.218 RESOURCE_REMOVED            spare_pool     myhost.COM(res#15)
   > 11/27/2009 12:22:00.288 RESOURCE_ADD                simcloud_max   myhost.COM(res#15)
   > 11/27/2009 12:22:00.297 RESOURCE_PROPERTIES_CHANGED simcloud_max   myhost.COM(res#15)   [[U:annotation=Shutting down resource]]
   > 11/27/2009 12:22:00.299 RESOURCE_PROPERTIES_CHANGED simcloud_max   myhost.COM(res#15)   [[U:usage=inf->1]]
   > 11/27/2009 12:22:00.301 RESOURCE_PROPERTIES_CHANGED simcloud_max   myhost.COM(res#15)   [[U:annotation=Shutting down simhost]]
   > 11/27/2009 12:22:00.738 RESOURCE_PROPERTIES_CHANGED simcloud_max   myhost.COM(res#15)   [[U:annotation=Step 'Shutting down simhost' failed (see 'Shutting_down_resource-2009-11-27_12:22:00-res#15.log')]]
   >                         RESOURCE_ERROR              simcloud_max   myhost.COM(res#15)   Step 'Shutting down simhost' failed (see 'Shutting_down_resource-2009-11-27_12:22:00-res#15.log')

   > cat /log/simcloud_max/Shutting_down_resource-2009-11-27_12:22:00-res#15.log
   > 11/27/2009 12:22:00|W|     stderr: after init_logging debug=3
   >                      |             Setting variable SPOOL_DIR=/spool/simcloud_max/gef
   >                      |             Setting variable RES_resourceHostname=myhost.COM
   >                      |             Setting variable RES_simhost=true
   >                      |             Setting variable SIMHOSTS_FILE_NAME=sdm_ec2.simhosts [optional variable set to default value]
   >                      |             Setting variable SIMHOST_SLEEP_TIME=0 [optional variable set to default value]
   >                      |             Initializing simhosts
   >                      |             Host 'myhost.COM' is not in simhosts file '/spool/simcloud_max/gef/sdm_ec2.simhosts'
   >                      |
   > 11/27/2009 12:22:00|E|Step 'Shutting down simhost': Exit code=FAILED_NO_UNDO

   > cat /spool/simcloud_max/gef/sdm_ec2.simhosts
   > myhost


   Evaluation:

   P3 task: the cloud sim host exists for testing purposes only and is not
   included in the release.

   Workaround:
   Edit the simhosts file so that its entry matches the value of the
   resourceHostname resource property.
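
   The workaround can be sketched as a small shell helper. This is only an
   illustration: the function name fix_simhost_entry and its arguments are
   hypothetical and not part of hedeby; the file path and host names in the
   usage line are taken from the log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of hedeby): replace the short entry in a
# simhosts file with the long name that SDM stored in resourceHostname.
fix_simhost_entry() {
    simhosts_file=$1   # e.g. /spool/simcloud_max/gef/sdm_ec2.simhosts
    short_name=$2      # entry written at startup, e.g. myhost
    long_name=$3       # value of resourceHostname, e.g. myhost.COM

    tmp_file=$simhosts_file.new
    # Rewrite only lines that are exactly the short name (one host per line).
    awk -v s="$short_name" -v l="$long_name" \
        '$0 == s { print l; next } { print }' \
        "$simhosts_file" > "$tmp_file" || return 1
    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$simhosts_file" || return 1
    rm -f "$tmp_file"
}

# Usage (values from the log above):
# fix_simhost_entry /spool/simcloud_max/gef/sdm_ec2.simhosts myhost myhost.COM
```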

   Suggested fix:
   (by TB)
   As a solution, I suggest keeping the hostname exactly as it was stored in
   the simhosts file (and passed out of the script) in a separate resource
   property, which is later used to search the simhosts file again. This way
   we can be sure that SDM leaves the name alone and the script finds the
   resource again.
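
   A minimal sketch of the lookup side of this idea, not the attached patch.
   It assumes the extra property is called simhostName and reaches the script
   as RES_simhostName, following the RES_ prefix seen in the log above; both
   the property name and the lookup_name helper are hypothetical.

```shell
#!/bin/sh
# Hypothetical: RES_simhostName is an assumed extra resource property that
# holds the name exactly as it was written to the simhosts file.
lookup_name() {
    # Prefer the stored original name; fall back to resourceHostname
    # for resources that do not carry the extra property.
    echo "${RES_simhostName:-$RES_resourceHostname}"
}

RES_resourceHostname=myhost.COM   # rewritten by SDM to the long name
RES_simhostName=myhost            # untouched copy of the simhosts entry
name=$(lookup_name)
echo "searching simhosts file for '$name'"
```

   The shutdown script would then pass this name to is_simhost instead of
   $RES_resourceHostname.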

   Analysis:

   The problem is that the entry is added to the simhosts file by the script
   "gef_startup_simhost.sh":

   > for res in `ypcat hosts | awk '{print $2}'` ; do
   >    if is_simhost $res ; then
   >       trace "simhost '$res' already used, trying next ..."
   >       continue
   >    fi
   >
   >    trace "picked simhost '$res'"
   >    echo $res >> $SIMHOSTS_FILE

   ypcat hosts returns the short name, while the resourceHostname property is
   set to the short or the long name, depending on how the host resolves
   hostnames.

   When the sim resource is removed, the script "gef_shutdown_simhost.sh"
   takes resourceHostname and looks it up in the simhosts file, so a long
   name does not match the short entry:

   > # check if simhost is known
   > #    => remove host from SIMHOSTS_FILE
   > if is_simhost $RES_resourceHostname ; then
   >    tmp_file=$SIMHOSTS_FILE.new
   >    # copy to preserve file permission

   The is_simhost procedure:

   > is_simhost()
   > {
   >    if [ ! -n "$1" ]; then
   >       fatal "Internal error: is_simhost() called without required parameter"
   >    fi
   >
   >    # grep in SIMHOSTS_FILE (one host per line)
   >    simhost_pattern="^$1\$"
   >    grep -s "$simhost_pattern" $SIMHOSTS_FILE > /dev/null 2>&1
   >
   >    # return exit status of grep to indicate success/failure
   >    return $?
   > }
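
   The mismatch can be reproduced in isolation. The sketch below uses a
   temporary file in place of the real simhosts file and a shortened
   is_simhost with the same anchored grep pattern; the host names are the
   ones from the log above, everything else is illustrative.

```shell
#!/bin/sh
# Reproduction sketch: a temporary file stands in for the simhosts file.
SIMHOSTS_FILE=$(mktemp)
echo "myhost" > "$SIMHOSTS_FILE"   # short name, as written at startup

is_simhost() {
    # Same anchored whole-line match as in the excerpt above.
    grep -s "^$1\$" "$SIMHOSTS_FILE" > /dev/null 2>&1
}

if is_simhost "myhost"; then short_result=found; else short_result=missing; fi
if is_simhost "myhost.COM"; then long_result=found; else long_result=missing; fi

echo "short name: $short_result"
echo "long name:  $long_result"
rm -f "$SIMHOSTS_FILE"
```

   The short name is found and the long name is not, which is the condition
   that makes the shutdown step fail.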

   Test:

   The testsuite test hedeby_cloud_max_hosts detects the issue when the
   hedeby host resolves hostnames to long names.

   ETC: 4PD
               ------- Additional comments from torsten Fri Nov 27 06:30:40 -0700 2009 -------
   ETC is too high, in my opinion. Can be done in 1 PD and just needs to be tested
   with existing test in correct environment (which does resolving to long hostnames).
               ------- Additional comments from torsten Fri Nov 27 06:31:24 -0700 2009 -------
   Created an attachment (id=104)
   untested proposed patch

               ------- Additional comments from zwierzak Fri Nov 27 06:43:48 -0700 2009 -------
   ETC depends on who fixes the bug and when, who is available for review, and
   so on. The ETC should be reviewed by the engineer who changes the issue
   state to STARTED and is really working on it. If someone is so great that
   he thinks he can fix issues in 0.5 or 1 PD while sticking to the
   development process, he can change the estimate then.
               ------- Additional comments from torsten Fri Nov 27 07:13:21 -0700 2009 -------
   Changed milestone to 1.0u5next.

   Rys, I agree that the ETC needs to be reviewed by whoever is working on the issue.

   However, this is a two line fix in two files (see attachment). We have an
   existing test where the problem came up given a specific environment. What needs
   to be done is to run the test in this environment with and without the fix,
   prepare a review paper, checkin and close the issue.

   I just don't see how this should add up to 4 PD. By all means, I don't want to
   discuss whether this is 0.5 PD or 1 PD or even 1.5 PD. But if I read 4 PD and
   see the issue, I think that I must have missed something ...
               ------- Additional comments from torsten Fri Feb 5 00:35:06 -0700 2010 -------
   started

Attachments (1)

104 (3.0 KB) - added by trac 9 years ago.


Change History (2)

Changed 9 years ago by trac

comment:1 Changed 9 years ago by dlove

  • Keywords patch added; removed
  • Severity set to minor