Opened 13 years ago
Last modified 10 years ago
#517 new defect
IZ2570: shepherd dies after a sge_execd restart
Reported by: | goncalo | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.1u3 |
Severity: | | Keywords: | execution |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2570]
Issue #: 2570
Platform: Other
Reporter: goncalo (goncalo)
Component: gridengine
OS: All
Subcomponent: execution
Version: 6.1u3
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
* Summary: shepherd dies after a sge_execd restart
Status whiteboard:
Attachments:
Issue 2570 blocks:
Votes for issue 2570:

Opened: Thu May 8 03:35:00 -0700 2008
------------------------

We observe the following behaviour in SLC4:

1) A recent change in our fabric management system was restarting sge_execd in all queue instances, and afterwards we were getting messages like:

    05/02/2008 18:51:36|execd|lflip19|I|starting up GE 6.1u3 (lx26-x86)
    05/02/2008 18:51:36|execd|lflip19|I|successfully started PDC and PTF
    05/02/2008 18:51:36|execd|lflip19|I|checking for old jobs
    05/02/2008 18:51:36|execd|lflip19|I|found directory of job "active_jobs/330153.1"
    05/02/2008 18:51:36|execd|lflip19|I|shepherd for job active_jobs/330153.1 has pid "16717" and is not alive
    05/02/2008 18:51:36|execd|lflip19|E|abnormal termination of shepherd for job 330153.1: "exit_status" file is empty
    05/02/2008 18:51:36|execd|lflip19|E|can't open usage file "active_jobs/330153.1/usage" for job 330153.1: No such file or directory

2) I don't know if execd has some mechanism to recover jobs that were running before the controlled shutdown. If it has, it is not working properly, at least in our configuration: the running jobs disappear from SGE (at least apparently), because the shepherd dies but the processes started by the shepherd continue on the machine. Looking at the process tree (pstree -Gap) after the execd controlled shutdown, you can see the user script starting directly from "init,1":

    init,1
      ├─sh,21479 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333897
      │   └─bootstrap.k2181,21813 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt /home/cms067/.globus/job/ce02.lip.pt/22189.1210069337/x509_up ...
      │       ├─bootstrap.k2181,21817 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
      │       ├─bootstrap.k2181,21974 -w /tmp/bootstrap.k21810 /home/cms067/ ce02.lip.pt ...
      │       └─sh,22013 -c...
      │           └─jobwrapper,22014 /opt/lcg/libexec/jobwrapper /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
      │               └─globus-tmp.lfli,22015 /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
      │                   └─globus-tmp.lfli,22134 /home/cms067/globus-tmp.lflip32.21813.0/globus-tmp.lflip32.21813.2...
      │                       └─time,22135 -p perl -e...
      │                           └─sh,22137 -c ...
      │                               └─jobExecutor,22138 2
      │                                   ├─ch_tool,22149
      │                                   │   └─programExecutor,22152 1
      │                                   │       ├─BossRuntime_cra,22158 ./BossRuntime_crabjob
      │                                   │       ├─CMSSW.sh,22162 CMSSW.sh 2 ...
      │                                   │       │   └─cmsRun,22330 -j crab_fjr.xml -p pset.cfg
      │                                   │       ├─programExecutor,22159 1
      │                                   │       ├─tee,22160 /tmp//BossTeePipe-crabjob22152
      │                                   │       └─tee,22161 /tmp//BossTeePipe-crabjob22152
      │                                   └─dbUpdator,22144 1_2_1 746b7b06-ab68-4ae2-a9fe-0c7a94c6cf42 RTConfig.clad
      │                                       └─sleep,31466 30

3) The problem (bug?) that I think may exist is related to the following:

a) If execd has some mechanism to recover running jobs after a restart, it doesn't seem to be working properly.

b) If execd does not have a mechanism to recover running jobs, then before a controlled shutdown it should properly kill all active jobs (and their child processes). This doesn't seem to be the case either.

4) Check the following example. This is the process tree for a normal job:

    init,1
      ├─sge_execd,9328
      │   ├─sge_shepherd,16984 -bg
      │   │   └─sh,16986 /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
      │   │       └─bootstrap.P1737,17375 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
      │   │           ├─bootstrap.P1737,17384 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
      │   │           ├─bootstrap.P1737,17558 -w /tmp/bootstrap.P17371 /home/cms067/ ce02.lip.pt ...
      │   │           └─sh,17653 -c...
      │   │               └─jobwrapper,17656 /opt/lcg/libexec/jobwrapper /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
      │   │                   └─globus-tmp.lfli,17658 /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
      │   │                       └─globus-tmp.lfli,18122 /home/cms067/globus-tmp.lflip32.17375.0/globus-tmp.lflip32.17375.2...
      │   │                           └─time,18123 -p perl -e...
      │   │                               └─sh,18125 -c ...
      │   │                                   └─jobExecutor,18127 15
      │   │                                       ├─ch_tool,18144
      │   │                                       │   └─programExecutor,18147 1
      │   │                                       │       ├─BossRuntime_cra,18155 ./BossRuntime_crabjob
      │   │                                       │       ├─CMSSW.sh,18159 CMSSW.sh 15 ...
      │   │                                       │       │   └─cmsRun,18441 -j crab_fjr.xml -p pset.cfg
      │   │                                       │       ├─programExecutor,18156 1
      │   │                                       │       ├─tee,18157 /tmp//BossTeePipe-crabjob18147
      │   │                                       │       └─tee,18158 /tmp//BossTeePipe-crabjob18147
      │   │                                       └─dbUpdator,18137 1_15_1 6a4a2a4d-a0af-4f88-a8c2-d810a4a6bfa6 RTConfig.clad
      │   │                                           └─sleep,30576 30

If you look at the process IDs, you see that sge_execd and the shepherds share the same SID:

     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
        1  9328  9328  1801 ?           -1 S        0  16:10 /usr/local/sge/V61u3/bin/lx26-x86/sge_execd
     9328 16984 16984  1801 ?           -1 S        0   0:00 sge_shepherd-333815 -bg
     9328 21478 21478  1801 ?           -1 S        0   0:00 sge_shepherd-333897 -bg

but the shepherd does not share its SID with the user job:

    [root@lflip32 ~]# ps -jxa | grep /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815
    16984 16986 16986 16986 ?           -1 S     2067   0:00 -sh /usr/local/sge/V61u3/default/spool/lflip32/job_scripts/333815

Can this be the reason for the situation described in 3 b)?

------- Additional comments from goncalo Mon Jun 23 09:17:46 -0700 2008 -------

Hi,

I see no activity on this issue. Has anyone else confirmed this bug? I think it is VERY important to solve it.

Cheers
Gonçalo
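
To illustrate the session question raised in point 4, here is a minimal Python sketch. It is not taken from the SGE sources, and the function names (`spawn_child_in_new_session`, `describe`, `demo`) are hypothetical. It only shows the generic POSIX behaviour visible in the `ps -j` output: a child that calls `setsid()` becomes the leader of its own session and process group, so its PGID and SID differ from its parent's. If a shutdown path signalled only the daemon's own process group or session (an assumption, the report does not confirm how execd or the shepherd signal jobs), such a child would not receive the signal and would be reparented to init once its parent exits.

```python
import os
import signal
import time


def spawn_child_in_new_session():
    """Fork a child that calls setsid(), then return the child's PID.

    The child stays alive (sleeping) until somebody kills it.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a new session and a new process group.
        try:
            os.close(ready_r)
            os.setsid()
            os.write(ready_w, b"x")  # tell the parent that setsid() is done
            os.close(ready_w)
            while True:
                time.sleep(60)
        finally:
            # Reached only if something above raised; never return into
            # the caller's code from the child.
            os._exit(1)
    # Parent: wait until the child has moved into its own session.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    return pid


def describe(pid):
    """Return PID, PGID and SID of a process (the columns shown by `ps -j`)."""
    return {"pid": pid, "pgid": os.getpgid(pid), "sid": os.getsid(pid)}


def demo():
    """Compare the parent's IDs with those of a child in its own session."""
    parent = describe(os.getpid())
    child_pid = spawn_child_in_new_session()
    child = describe(child_pid)
    # The child has to be signalled by its own PID (or its own process
    # group); it is no longer a member of the parent's group or session.
    os.kill(child_pid, signal.SIGKILL)
    os.waitpid(child_pid, 0)
    return parent, child


if __name__ == "__main__":
    parent_ids, child_ids = demo()
    print("parent:", parent_ids)
    print("child: ", child_ids)
```

The child's `pgid` and `sid` both equal its own PID, which is the same pattern as the `-sh .../job_scripts/333815` line above (PID, PGID and SID all 16986).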