[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Jan 24 09:01:07 GMT 2005


Hello,

I just worked on a similar problem and was able to narrow it down
to a spooling issue on the execd side. The execd spool dir had been
changed from NFS spooling to local spooling, which left an
incomplete set of directories in the spool dir.
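
A rough way to check a host for this, as a sketch only: the script below
looks for the subdirectories an execd usually keeps in its per-host spool
dir. The names in EXPECTED_SUBDIRS (active_jobs, jobs, job_scripts) are my
assumption about a typical layout, and missing_spool_dirs is just a helper
name I made up. The spool location itself can be looked up as
execd_spool_dir in the output of "qconf -sconf".

```python
import os
import sys

# Assumption: subdirectories normally present under <execd_spool_dir>/<host>.
# Adjust this tuple to match your installation.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def missing_spool_dirs(host_spool_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that do not exist under host_spool_dir."""
    return [
        name
        for name in expected
        if not os.path.isdir(os.path.join(host_spool_dir, name))
    ]


if __name__ == "__main__":
    # Usage: python check_spool.py <spool dir of host> [<spool dir of host> ...]
    for path in sys.argv[1:]:
        missing = missing_spool_dirs(path)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(path, status)
```

Run it on the exec host itself (or against the NFS path) with the per-host
spool directory as argument. An empty "missing" list only says the
directories exist, not that their contents are consistent.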

I do not know whether your problem is triggered by the same issue or
whether I just got lucky and did not run into the t-state problem again.
The spool dir issue is number 103.

Is your problem related to 103?

We also found a bug in file staging. If the file does not exist, the job
will disappear and an email will be sent. Do you use file staging for
your jobs?

Cheers,
Stephan

Christian Bolliger wrote:

>Hello
>Sorry for bringing up a problem again. Using SGE 6.0u3, I previously 
>thought that the problem was linked to the filehandle problem in 6.0u2.
>
>Jobs in our Myrinet section tend to disappear in the starting phase 
>(it seems that gbit mpich jobs are also affected). They are put into 
>'t' state and then quit without any output (users call it the 't' problem). 
>Jobs using more CPUs are more likely to disappear.
>It is not limited to specific exec hosts. It seems to be some kind of 
>race condition.
>
>This problem really hinders production, some users are demanding PBS :( .
>
>Many thanks for helping
>Christian
>
>PS: I will also open an issue, but there might be other users with this 
>problem
>
>Logs:
>messages:01/21/2005 18:32:11|qmaster|master1|W|job 42972.1 failed on 
>host node0072a.mbit.mh.hpc.unizh.ch in recognising job because: execd 
>doesn't know this job
>messages:01/21/2005 18:32:14|qmaster|master1|E|execd 
>node0072a.mbit.mh.hpc.unizh.ch reports running state for job 
>(42972.1/master) in queue "long-myri.q at node0072a.mbit.mh.hpc.unizh.ch" 
>while job is in state 65536
>messages:01/21/2005 
>18:34:14|qmaster|master1|E|execd at node0072a.mbit.mh.hpc.unizh.ch reports 
>running job (42972.1/master) in queue 
>"long-myri.q at node0072a.mbit.mh.hpc.unizh.ch" that was not supposed to be 
>there - killing
>
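
To see whether the failures really are spread over all exec hosts, the
qmaster messages file can be scanned for this warning. A minimal sketch,
assuming the message wording is exactly as in the lines quoted above; the
function names unwrap and unknown_job_failures are made up for this
example. It also tolerates the ">" quoting and line wrapping of a mail.

```python
import re
from collections import Counter

# Matches the warning text as it appears in the quoted log lines.
PATTERN = re.compile(
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) "
    r"in recognising job because: execd doesn't know this job"
)


def unwrap(text):
    """Drop '>' quote prefixes and join all lines into one space-separated string."""
    parts = [line.lstrip(">").strip() for line in text.splitlines()]
    return " ".join(p for p in parts if p)


def unknown_job_failures(text):
    """Return (job_id, host) pairs for every 'execd doesn't know this job' warning."""
    return [(m.group("job"), m.group("host")) for m in PATTERN.finditer(unwrap(text))]


def failures_per_host(text):
    """Count the warnings per exec host."""
    return Counter(host for _, host in unknown_job_failures(text))
```

Feeding it the contents of the messages file and printing
failures_per_host(...).most_common() would show whether a few hosts stand
out or whether the warnings are evenly distributed.
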
>qacct -j
>==============================================================
>qname        long-myri.q        
>hostname     UNKNOWN            
>group        UNKNOWN            
>owner        UNKNOWN            
>project      id                 
>department   id                 
>jobname      test-long-4-11     
>jobnumber    42972              
>taskid       undefined
>account      sge                
>priority     0                  
>qsub_time    Thu Jan  1 01:00:00 1970
>start_time   -/-
>end_time     -/-
>granted_pe   mpich-gm           
>slots        4                  
>failed       21  : in recognising job
>exit_status  0                  
>ru_wallclock 0           
>ru_utime     0           
>ru_stime     0           
>ru_maxrss    0                  
>ru_ixrss     0                  
>ru_ismrss    0                  
>ru_idrss     0                  
>ru_isrss     0                  
>ru_minflt    0                  
>ru_majflt    0                  
>ru_nswap     0                  
>ru_inblock   0                  
>ru_oublock   0                  
>ru_msgsnd    0                  
>ru_msgrcv    0                  
>ru_nsignals  0                  
>ru_nvcsw     0                  
>ru_nivcsw    0                  
>cpu          0           
>mem          0.000            
>io           0.000            
>iow          0.000            
>maxvmem      0.000
>
>qacct -j of an identical job that ran:
>chribo at master1:~/mpich-test> qacct -j 42971
>==============================================================
>qname        long-myri.q        
>hostname     node0117a.mbit.mh.hpc.unizh.ch
>group        i2702              
>owner        chribo             
>project      id                 
>department   id                 
>jobname      test-long-4-11     
>jobnumber    42971              
>taskid       undefined
>account      sge                
>priority     0                  
>qsub_time    Fri Jan 21 18:26:21 2005
>start_time   Fri Jan 21 18:30:00 2005
>end_time     Fri Jan 21 18:30:09 2005
>granted_pe   mpich-gm           
>slots        4                  
>failed       0   
>exit_status  0                  
>ru_wallclock 9           
>ru_utime     0           
>ru_stime     0           
>ru_maxrss    0                  
>ru_ixrss     0                  
>ru_ismrss    0                  
>ru_idrss     0                  
>ru_isrss     0                  
>ru_minflt    20339              
>ru_majflt    0                  
>ru_nswap     0                  
>ru_inblock   0                  
>ru_oublock   0                  
>ru_msgsnd    0                  
>ru_msgrcv    0                  
>ru_nsignals  0                  
>ru_nvcsw     661                
>ru_nivcsw    202                
>cpu          0           
>mem          0.001            
>io           0.000            
>iow          0.000            
>maxvmem      142.223M
>
>  
>
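
The two accounting records above differ mainly in the "failed" field (21
versus 0). To pick such records out of a longer "qacct -j" listing, something
like the following could be used. It is only a sketch: it assumes the plain
"key value" layout shown in the quoted output, with a line of "=" characters
starting each record, and parse_qacct / failed_records are names invented
here.

```python
def parse_qacct(text):
    """Split 'qacct -j' output into one dict per record.

    A line starting with '=====' opens a new record; every following
    non-empty line is split at the first space into key and value.
    Leading '>' quote characters from a mail are ignored.
    """
    records, current = [], None
    for raw in text.splitlines():
        line = raw.lstrip(">").strip()
        if line.startswith("====="):
            current = {}
            records.append(current)
        elif current is not None and line:
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return records


def failed_records(records):
    """Return the records whose 'failed' field starts with a non-zero code."""
    return [r for r in records if (r.get("failed") or "0").split()[0] != "0"]
```

For the two records quoted above, failed_records would keep only job 42972.
Text between records, such as a shell prompt, ends up as a stray key in the
preceding record and does not affect the "failed" check.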

