Opened 11 years ago

Last modified 9 years ago

#922 new defect

IZ661: Resources located in the GE adapter that become unavailable are not reported correctly and cause error at service restart

Reported by: afisch Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u3_Beta-1
Severity: Keywords: Sun gridengine_adapter
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=661]

   Issue #:           661
   Summary:           Resources located in the GE adapter that become unavailable are not reported correctly and cause error at service restart
   Component:         hedeby
   Subcomponent:      gridengine_adapter
   Version:           1.0u3_Beta-1
   Platform:          Sun
   OS:                All
   Status:            NEW
   Resolution:
   Priority:          P3
   Issue type:        DEFECT
   Target milestone:  1.0u5next
   Reporter:          afisch (afisch)
   Assigned to:       torsten (torsten)
   QA Contact:        rhierlmeier
   CC:                None defined
   URL:
   Status whiteboard:
   Attachments:




   Opened: Tue May 19 04:48:00 -0700 2009 
------------------------


   Description

   The problem happens with resources located in the GE adapter that have an SGE
   execd installed and are assigned as exec hosts. If the resource crashes (machine
   becomes resolvable), SGE stops reporting load values and the queue instance
   state goes to "au".
   However, nothing seems to be reported to the GE adapter at that moment. The
   resource remains in assigned state. Only after 10 minutes does the resource go
   into the expected error state. Moving the crashed resource within those 10
   minutes sets the resource into error state with the following annotation:

      step 'Install execd' failed: Script copy_sge_root_to_cloud.sh failed with
   status 2 (for details see log file /var/spool/sdm/sdmcloud/log/geadapter/xxx.log)

   From this point on, the service cannot be restarted. At startup, the following
   errors appear in rp_vm-0.log:

     Service geadapter: Starting Grid Engine service
     Host xxx is not resolvable
     Internal error during startup: resource is not a host resource
     Componentgeadapter: Error in startup procedure: Service geadapter: Unexpected
   error in state transition UnknownStateHandler[UNKNOWN] ->
   StartingStateHandler[STARTING]: Internal error during startup: resource is not a
   host resource


   Be aware that this issue happens for any host resource. However, the problem
   can be reproduced more easily with cloud resources, as they are usually
   exclusively owned by a single user.

   Evaluation:
   The problem is considered a P3 bug. The problem can only arise if resources "get
   lost". The delayed error state problem then causes a 10 minute freeze of all
   resource states. After that, the service cannot be restarted, so it is not
   possible to recover from this problem on the SDM side. Removing spooling data
   does not help because the SGE cluster continues to report the existence of the
   resource. However, a workaround exists: manually remove the crashed host from
   the SGE cluster.

   Fix/Workaround
   Fix:
   The service should directly report that the resource is unavailable if the SGE
   cluster does.
   The service should be able to start even if a resource in the Grid Engine
   cluster is lost. If a resource is lost, the service should skip it, start up
   anyway, and report the problem in the logs.

   Workaround:
   The startup of the geadapter fails because SGE still refers to the lost host:
   0) The qstat command shows the lost host in all.q:
     qstat -f
   1) Remove all jobs that are now pending in all.q@<host>:
     qdel -f <job_id>
   2) Remove the host from the host group:
     qconf -mhgrp @allhosts (if it is the only host, set the hosts to "none")
   3) Remove the host from the all.q configuration:
     qconf -mq all.q
   or
     qconf -purge queue '*' all.q@<hostname>
   4) Remove the exec host from the exec host list:
     qconf -de <host name>
   5) Now run sdmadm update_component -c <ge_adapter name>

   Analysis:
   The delayed error state problem is rooted in the JGDI communication. The load
   values for the exec host reported from JGDI are processed in HostImpl.update().
   Before this is done, the host is checked to see whether it has become static. For
   this purpose the executor has to be reached. If the resource is unavailable, there
   is a timeout of 10 minutes (see "How to test") before the system realizes that the
   executor is not reachable anymore. The thread that executes these update events
   is stuck for these 10 minutes, so no updates happen on any resource during this
   time.

   The timeout is caused by a JMXConnectorFactory.connect(url) call in
   ConnectionProxy (separate issue, see link). However, this problem can be fixed
   by restructuring the event processing. The static test in HostImpl.update()
   should only be done if load values have been reported. If no load
   values are reported, the host should be set to static without contacting the
   corresponding managed host.
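
   The proposed reordering could look roughly like the following sketch. It is not
   the actual HostImpl code: the class name, the method resolveStatic and the
   remoteStaticCheck parameter are hypothetical stand-ins, and the real update()
   surely carries more state. The only point shown is that the (potentially
   blocking) remote check is not invoked when no load values were reported.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the real HostImpl: all names are assumptions.
final class HostUpdateSketch {

    /**
     * Decides whether a host should be treated as static.
     *
     * @param loadValues        load values reported for the exec host (may be empty)
     * @param remoteStaticCheck check that contacts the managed host; may block
     *                          for a long time if the host is unreachable
     */
    static boolean resolveStatic(Map<String, String> loadValues,
                                 BooleanSupplier remoteStaticCheck) {
        if (loadValues == null || loadValues.isEmpty()) {
            // No load report: the execd is presumably gone, so the host is
            // set to static without contacting it.
            return true;
        }
        // Load values were reported, so the host is expected to be reachable
        // and the remote check can be performed.
        return remoteStaticCheck.getAsBoolean();
    }
}
```

   With this ordering, an update event for a lost host returns immediately and the
   update thread stays free for the other resources.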


   By default, it takes some time before SGE detects a lost resource and reports it.
   The error can be reproduced faster if the execd parameters load_report_time and
   max_unheard are lowered via "qconf -mconf", e.g.:
   load_report_time             00:00:20
   max_unheard                  00:00:30
   Be aware that max_unheard has to be bigger than load_report_time.
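
   The constraint between the two values can be checked mechanically before they are
   entered. The following is a minimal sketch under the assumption that both values
   are given in the HH:MM:SS form shown above; the class and method names are
   made up for illustration and are not part of SGE or SDM.

```java
import java.time.Duration;

// Hypothetical helper, not part of SGE or SDM.
final class LoadReportSettingsSketch {

    /** Parses a value in HH:MM:SS form, e.g. "00:00:20". */
    static Duration parseHms(String value) {
        String[] parts = value.trim().split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected HH:MM:SS, got: " + value);
        }
        return Duration.ofHours(Long.parseLong(parts[0]))
                .plusMinutes(Long.parseLong(parts[1]))
                .plusSeconds(Long.parseLong(parts[2]));
    }

    /** Returns true if max_unheard is strictly bigger than load_report_time. */
    static boolean isConsistent(String loadReportTime, String maxUnheard) {
        return parseHms(maxUnheard).compareTo(parseHms(loadReportTime)) > 0;
    }
}
```

   For the values above, isConsistent("00:00:20", "00:00:30") holds, while swapping
   the two arguments does not.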

   The service startup problem is rooted in
   GeServiceAdapterImpl.getAvailableResourceAdapters(232). For the unresolvable
   exec host, a Hostname object is created that throws an IllegalStateException.
   This exception should be turned into a log message and the host should be
   ignored (continue). The resource will then appear to the service as not reported.
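
   A minimal sketch of this "log and continue" handling is shown below. It does not
   reproduce GeServiceAdapterImpl: the class name, the method collectResolvable and
   the toHost function (standing in for the Hostname construction) are assumptions
   made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical sketch, not the real GeServiceAdapterImpl: all names are assumptions.
final class SkipUnresolvableHostsSketch {

    private static final Logger LOG =
            Logger.getLogger(SkipUnresolvableHostsSketch.class.getName());

    /**
     * Converts exec host names into host objects and skips every host for which
     * the conversion throws an IllegalStateException (e.g. not resolvable).
     */
    static <H> List<H> collectResolvable(List<String> hostNames,
                                         Function<String, H> toHost) {
        List<H> hosts = new ArrayList<>();
        for (String name : hostNames) {
            try {
                hosts.add(toHost.apply(name));
            } catch (IllegalStateException ex) {
                // Report the lost host in the log and carry on with the next
                // one instead of aborting the whole startup.
                LOG.warning("Skipping exec host '" + name + "': " + ex.getMessage());
            }
        }
        return hosts;
    }
}
```

   The skipped host is simply missing from the returned list, which matches the
   intended behaviour that the resource appears to the service as not reported.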

   How to test:
   The reproducibility of the timeout problem depends very much on the way the
   unreachable state is set up. If the resource is a cloud host, the following
   scenarios create similar problems but result in different symptoms (tested
   on OpenSolaris):

   1) Remove a running cloud host from the /etc/hosts file, but leave the VPN client
   and the ec2 instance running. ==> no 10min timeout problem
   2) Stop the VPN client on the master host, but do not remove the cloud host from
   the /etc/hosts file and leave the ec2 instance running. ==> no 10min timeout problem
   3) Shut down the ec2 instance, but do not remove the cloud host from the /etc/hosts
   file and leave the VPN client running. ==> described problem!!

   It seems that it is the VPN client that is not giving up on the host.
   The service startup problem can be reproduced by scenario 1).

   As it might be difficult to establish these kinds of error scenarios via JUnit
   or testsuite tests, a well-documented manual test is needed that describes
   both error cases.



   ETC:
   3PD
               ------- Additional comments from afisch Tue May 19 08:19:51 -0700 2009 -------
   Mistake in description:
   The sentence "..If the resource crashes (machine becomes resolvable).." should
   be: "..If the resource crashes (machine goes down and becomes unreachable).."
               ------- Additional comments from afisch Tue May 19 08:29:16 -0700 2009 -------
   related to issue 663
               ------- Additional comments from rhierlmeier Wed Nov 25 07:21:11 -0700 2009 -------
   Milestone changed
               ------- Additional comments from torsten Thu Nov 26 08:26:52 -0700 2009 -------
   changed subcomponent to GE adapter.
               ------- Additional comments from rhierlmeier Fri Nov 27 05:16:57 -0700 2009 -------
   With the introduction of the bound resource concept, the problem described in the
   issue has been partially solved. The resource now correctly goes into ERROR state
   if the connection to ec2 breaks down. However, it takes more than 5 minutes until
   the GEAdapter detects it, because the long timeout for the static host check still
   exists. This check is not necessary at all if the execd is not available.

   Once the connection to ec2 host has recovered the resource goes back into
   assigned state.

Change History (0)
