Opened 12 years ago

Last modified 10 years ago

#784 new defect

IZ3246: sge_execd dies within a couple hours

Reported by: opoplawski
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.2u5
Severity: minor
Keywords: PC Windows execution


[Imported from gridengine issuezilla]

Issue #: 3246
Summary: sge_execd dies within a couple hours
Component: gridengine
Subcomponent: execution
Version: 6.2u5
Platform: PC
OS: Windows Vista
Status: NEW
Resolution:
Priority: P1
Issue type: DEFECT
Target milestone: ---
Reporter: opoplawski (opoplawski)
Assigned to: pollinger (pollinger)
QA Contact: pollinger
CC: None defined
Status whiteboard:

Issue 3246 blocks:
Votes for issue 3246:

   Opened: Thu Mar 4 15:23:00 -0700 2010 

I'm running 6.2u5 on Windows Server 2008 R2 64-bit.  I had some trouble
at first getting load info, but I discovered that if I recompile
sge_execd from source, I'm able to get it to run, start the
loadsensor, and report loads.  However, it dies at random.  Running in "dl
5" mode I get the following output:

119388   3713         main     SENDING LOAD AND REPORTS
119389   3713         main      REPORT_LOAD
119390   3713         main --> execd_add_load_report() {
119391   3713         main --> sge_build_load_report() {
119392   3713         main --> sge_get_loadavg() {
119393   3713         main --> sge_get_pids() {
119394   3713         main --> sge_peopen() {
119395   3713         main <-- sge_peopen() ../libs/uti/sge_stdio.c 287 }

Change History (3)

comment:1 Changed 11 years ago by dlove

  • Priority changed from highest to high
  • Severity set to minor

comment:2 Changed 10 years ago by aylee

I've built Windows binaries from the 2011-06-04 sources and see the same error on Windows Server 2008 R2 (64-bit). Interestingly, my 32-bit Windows Server 2008 setup has so far run stably with the very same binaries.

During operation I've occasionally seen sge_execd print:

error: "sge_peopen()" failed: "Resource temporarily unavailable"
error: can't get processes from ps command

but execd kept running...
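"Resource temporarily unavailable" is the standard message for EAGAIN, which a process-spawning call can return when the system is briefly out of resources. As a rough illustration of how a caller might ride out such a transient failure, here is a minimal sketch. It assumes plain POSIX popen() as a stand-in for sge_peopen(), whose internals are not shown in this ticket; popen_retry and its one-second pause are hypothetical and not part of Grid Engine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * popen_retry: hypothetical helper, NOT part of Grid Engine.
 * Calls popen() and, if it fails with EAGAIN, sleeps and tries again,
 * up to max_tries attempts in total. Any other error is reported at once.
 */
FILE *popen_retry(const char *cmd, const char *mode, int max_tries)
{
    assert(cmd != NULL && mode != NULL && max_tries > 0);

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        errno = 0;
        FILE *fp = popen(cmd, mode);
        if (fp != NULL) {
            return fp;              /* child started, pipe is open */
        }
        if (errno != EAGAIN || attempt == max_tries) {
            fprintf(stderr, "popen(\"%s\") failed: %s\n",
                    cmd, strerror(errno));
            return NULL;            /* permanent error or out of attempts */
        }
        sleep(1);                   /* fixed pause before the next attempt */
    }
    return NULL;                    /* not reached; keeps compilers quiet */
}
```

A caller would use it like popen(), for example popen_retry("ps -ef", "r", 3), and close the stream with pclose(). This only covers the recoverable error shown above; it would not by itself explain or prevent the hard crash.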

In the crash case, with dl 5, I get something like this:

 23290   1999         main <-- ptf_update_job_usage() ../daemons/execd/ptf.c 1624 }
 23291   1999         main --> clean_up_old_jobs() {
 23292   1999         main --> sge_get_pids() {
 23293   1999         main --> sge_peopen() {
 23294   1999         main <-- sge_peopen() ../libs/uti/sge_stdio.c 310 }

and, just as the original submitter stated, a "Killed" on the console.

comment:3 Changed 10 years ago by aylee

I drilled down on this error and found that in my case it always seemed to happen when daemons/execd/load_avg.c calls

svc_running = sge_get_pids(pids, 1, "SGE_Helper_Service.exe", PSCMD);

Also, the problem was far more likely to appear when the machine was under heavy load.

In our setup all Windows nodes have the helper service installed, so as a hotfix I'm hard-coding svc_running = 1; and have seen no crash since.

Obviously that isn't a general fix...
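For sites that want the same workaround without hard-coding it for everyone, the hotfix could be made opt-in. The following is only a sketch under stated assumptions: lookup_helper_service() is a hypothetical stand-in for the sge_get_pids() call quoted above, and the environment variable EXAMPLE_ASSUME_HELPER_SERVICE is invented for illustration and is not a real Grid Engine setting.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the real process lookup (the sge_get_pids()
 * call in load_avg.c). Here it simply reports "no matching process".
 */
static int lookup_helper_service(void)
{
    return 0;
}

/*
 * Returns 1 if the helper service should be treated as running.
 * If the (invented) variable EXAMPLE_ASSUME_HELPER_SERVICE is set to "1",
 * the process lookup is skipped entirely, which mirrors the hotfix.
 * Otherwise the normal lookup decides.
 */
int helper_service_running(void)
{
    const char *flag = getenv("EXAMPLE_ASSUME_HELPER_SERVICE");

    if (flag != NULL && strcmp(flag, "1") == 0) {
        return 1;                       /* opt-in: skip the lookup */
    }
    return lookup_helper_service() > 0; /* default: ask the system */
}
```

Nodes known to have the helper service installed would set the variable; all others keep the existing behaviour. Like the original hotfix, this sidesteps the crash rather than fixing its cause.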
