#517 new defect (opened 11 years ago, last modified 9 years ago)

IZ2570: shepherd dies after a sge_execd restart

Reported by:  goncalo    Owned by:
Priority:     normal     Milestone:
Component:    sge        Version:   6.1u3
Severity:                Keywords:  execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2570]

        Issue #:           2570          Platform:    Other    Reporter: goncalo (goncalo)
        Component:         gridengine    OS:          All
        Subcomponent:      execution     Version:     6.1u3    CC:       None defined
        Status:            NEW           Priority:    P3
        Resolution:                      Issue type:  DEFECT
                                         Target milestone: ---
        Assigned to:       pollinger (pollinger)
        QA Contact:        pollinger
        URL:
      * Summary:           shepherd dies after a sge_execd restart
        Status whiteboard:
        Attachments:

        Issue 2570 blocks:
        Votes for issue 2570:

   Opened: Thu May 8 03:35:00 -0700 2008 
------------------------


We observe the following behaviour in SLC4:

1) A recent change in our fabric management system restarted sge_execd on all
queue instances, and afterwards we were getting messages like:

05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory

2) I don't know if execd has some mechanism to recover running jobs after a
controlled shutdown... If it has, it is not working properly, at least in our
configuration: the running jobs apparently disappear from SGE, because the
shepherd dies, but the processes started by the shepherd keep running on the
machine... Looking at the process tree (pstree -Gap) after the controlled
execd shutdown, you would see the user script hanging directly from "init,1":

init,1
│─sh,21479 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333897
│ └─bootstrap.k2181,21813 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt
/home/cms067/.globus/job/ce02.lip.pt/22189.1210069337/x509_up ...
│ ├─bootstrap.k2181,21817 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
│ ├─bootstrap.k2181,21974 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
│ └─sh,22013 -c...
│ └─jobwrapper,22014 /opt/lcg/libexec/jobwrapper
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
│ └─globus-tmp.lfli,22015
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
│ └─globus-tmp.lfli,22134
/home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
│ └─time,22135 -p perl -e...
│ └─sh,22137 -c ...
│ └─jobExecutor,22138 2
│ ├─ch_tool,22149
│ │ └─programExecutor,22152 1
│ │ ├─BossRuntime_cra,22158 ./BossRuntime_crabjob
│ │ ├─CMSSW.sh,22162 CMSSW.sh 2 ...
│ │ │ └─cmsRun,22330 -j crab_fjr.xml -p pset.cfg
│ │ ├─programExecutor,22159 1
│ │ ├─tee,22160 /tmp//BossTeePipe-crabjob22152
│ │ └─tee,22161 /tmp//BossTeePipe-crabjob22152
│ └─dbUpdator,22144 1_2_1 746b7b06-ab68-4ae2-a9fe-0c7a94c6cf42 RTConfig.clad
│ └─sleep,31466 30
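The reparenting to "init,1" above is standard POSIX orphan handling: when the
shepherd exits, its still-running children are adopted by pid 1. A minimal
sketch demonstrating the effect (assumes a classic init with no child
subreaper, as on SLC4):

import os
import time

pid = os.fork()
if pid == 0:                  # child: outlives its parent, like the job script above
    time.sleep(1)             # give the parent time to exit
    print("ppid after parent exit:", os.getppid())   # adopted by init -> 1
    os._exit(0)
os._exit(0)                   # parent exits at once, orphaning the child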

3) The problem (bug?) that I think may exist is connected to:
a) If execd has some mechanism to recover running jobs across a restart, it
does not seem to be working properly...
b) If execd does not have such a recovery mechanism, then on a controlled
shutdown it should kill all active jobs (and their child processes) properly.
This does not seem to be the case either (see the sketch after this list).
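For case b), a clean shutdown would have to signal the whole job tree, not
just the shepherd. A minimal sketch of what such group-wide cleanup could look
like (an assumption about a possible fix, not execd's actual code;
kill_job_group and the example pgid are hypothetical):

import os
import signal

def kill_job_group(job_pgid: int) -> None:
    """Send SIGKILL to every process in the job's process group."""
    try:
        os.killpg(job_pgid, signal.SIGKILL)   # kill(2) with a negative pid
    except ProcessLookupError:
        pass                                  # the group is already gone

# e.g. kill_job_group(16986), the job's process group in the ps output below

Note this only reaches processes still in the job's group; anything that has
detached into its own session or group needs extra bookkeeping (see point 4).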

4) Compare with the following example... This is the process tree for a normal job...

init,1
├─sge_execd,9328
│ ├─sge_shepherd,16984 -bg
│ │ └─sh,16986 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
│ │ └─bootstrap.P1737,17375 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
│ │ ├─bootstrap.P1737,17384 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
│ │ ├─bootstrap.P1737,17558 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
│ │ └─sh,17653 -c...
│ │ └─jobwrapper,17656 /opt/lcg/libexec/jobwrapper
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
│ │ └─globus-tmp.lfli,17658
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
│ │ └─globus-tmp.lfli,18122
/home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
│ │ └─time,18123 -p perl -e...
│ │ └─sh,18125 -c ...
│ │ └─jobExecutor,18127 15
│ │ ├─ch_tool,18144
│ │ │ └─programExecutor,18147 1
│ │ │ ├─BossRuntime_cra,18155 ./BossRuntime_crabjob
│ │ │ ├─CMSSW.sh,18159 CMSSW.sh 15 ...
│ │ │ │ └─cmsRun,18441 -j crab_fjr.xml -p pset.cfg
│ │ │ ├─programExecutor,18156 1
│ │ │ ├─tee,18157 /tmp//BossTeePipe-crabjob18147
│ │ │ └─tee,18158 /tmp//BossTeePipe-crabjob18147
│ │ └─dbUpdator,18137 1_15_1 6a4a2a4d-a0af-4f88-a8c2-d810a4a6bfa6 RTConfig.clad
│ │ └─sleep,30576 30


If you look at the different process IDs, you see that sge_execd and the
shepherd share the same SID:

PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
1 9328 9328 1801 ? -1 S 0 16:10 /usr/local/sge/V61u3/bin/lx26-x86/sge_execd
9328 16984 16984 1801 ? -1 S 0 0:00 sge_shepherd-333815 -bg
9328 21478 21478 1801 ? -1 S 0 0:00 sge_shepherd-333897 -bg

but the shepherd does not share the same SID with the user job...

[root@lflip32 ~]# ps -jxa | grep /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
16984 16986 16986 16986 ? -1 S 2067 0:00 -sh /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815

Can this be the reason for the situation described in 3 b)?
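One more data point from the ps output: the user job leads its own session
(its SID, 16986, equals its pid), which is exactly what a setsid() call in the
shepherd would produce. A minimal sketch of that effect (plain POSIX session
semantics, not shepherd code):

import os

pid = os.fork()
if pid == 0:                          # the "job" side
    os.setsid()                       # become leader of a brand-new session
    print("job sid:", os.getsid(0))   # == own pid, detached from the parent's session
    os._exit(0)
os.waitpid(pid, 0)                    # the "shepherd" side
print("parent sid:", os.getsid(0))    # unchanged, inherited from whoever started us

Because the job is in its own session, signals scoped to execd's session or
process group can never reach it, so if the shepherd dies without explicitly
killing the job tree, nothing else cleans it up, which matches the orphaned
tree seen in point 2.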

   ------- Additional comments from goncalo Mon Jun 23 09:17:46 -0700 2008 -------
Hi,
I see no activity on this issue.
Has anyone else confirmed this bug? I think it is VERY important to solve it...
Cheers
Gonçalo
