[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Christian Bolliger christian.bolliger at id.unizh.ch
Fri Jan 21 20:28:04 GMT 2005



Hello,
Sorry for bringing up this problem again. I previously thought that it 
was linked to the filehandle problem in 6.0u2, but we are now using 
SGE 6.0u3 and it still occurs.

Jobs in our Myrinet section tend to disappear during the starting phase 
(Gbit MPICH jobs seem to be affected as well). They are put into the 
't' state and then quit without any output (users call it the 't' problem). 
Jobs using more CPUs are more likely to disappear.
The problem is not limited to specific exec hosts. It seems to be some 
kind of race condition.

This problem really hinders production; some users are demanding PBS :(

Many thanks for your help,
Christian

PS: I will also open an issue, but there might be other users with this 
problem.

Logs:
messages:01/21/2005 18:32:11|qmaster|master1|W|job 42972.1 failed on host node0072a.mbit.mh.hpc.unizh.ch in recognising job because: execd doesn't know this job
messages:01/21/2005 18:32:14|qmaster|master1|E|execd node0072a.mbit.mh.hpc.unizh.ch reports running state for job (42972.1/master) in queue "long-myri.q@node0072a.mbit.mh.hpc.unizh.ch" while job is in state 65536
messages:01/21/2005 18:34:14|qmaster|master1|E|execd@node0072a.mbit.mh.hpc.unizh.ch reports running job (42972.1/master) in queue "long-myri.q@node0072a.mbit.mh.hpc.unizh.ch" that was not supposed to be there - killing
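
For anyone who wants to pull all entries for one job out of the qmaster
messages file, here is a minimal Python sketch. It assumes that each record
sits on a single line in the file, in the pipe-separated layout seen in the
excerpt above (timestamp|daemon|host|level|text), optionally with the
"messages:" filename prefix that grep adds. The function names are my own
and are not part of SGE.

```python
import re

# Matches "job 42972.1 ..." as well as "job (42972.1/master) ...".
JOB_RE = re.compile(r"job \(?(\d+)\.(\d+)")

def parse_message_line(line, prefix="messages:"):
    """Split one qmaster log record into its five pipe-separated fields."""
    line = line.strip()
    if line.startswith(prefix):          # filename prefix added by grep
        line = line[len(prefix):]
    parts = line.split("|", 4)           # the text itself may contain "|"
    if len(parts) != 5:
        return None                      # not a record in the assumed layout
    timestamp, daemon, host, level, text = parts
    match = JOB_RE.search(text)
    job = (int(match.group(1)), int(match.group(2))) if match else None
    return {"time": timestamp, "daemon": daemon, "host": host,
            "level": level, "text": text, "job": job}

def entries_for_job(lines, job_id):
    """Return the parsed records whose text mentions the given job number."""
    parsed = (parse_message_line(l) for l in lines)
    return [p for p in parsed if p and p["job"] and p["job"][0] == job_id]
```

Calling entries_for_job(open("messages"), 42972) would then collect the
records for the failed job, so that the W and E entries can be read in order.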

qacct -j
==============================================================
qname        long-myri.q        
hostname     UNKNOWN            
group        UNKNOWN            
owner        UNKNOWN            
project      id                 
department   id                 
jobname      test-long-4-11     
jobnumber    42972              
taskid       undefined
account      sge                
priority     0                  
qsub_time    Thu Jan  1 01:00:00 1970
start_time   -/-
end_time     -/-
granted_pe   mpich-gm           
slots        4                  
failed       21  : in recognising job
exit_status  0                  
ru_wallclock 0           
ru_utime     0           
ru_stime     0           
ru_maxrss    0                  
ru_ixrss     0                  
ru_ismrss    0                  
ru_idrss     0                  
ru_isrss     0                  
ru_minflt    0                  
ru_majflt    0                  
ru_nswap     0                  
ru_inblock   0                  
ru_oublock   0                  
ru_msgsnd    0                  
ru_msgrcv    0                  
ru_nsignals  0                  
ru_nvcsw     0                  
ru_nivcsw    0                  
cpu          0           
mem          0.000            
io           0.000            
iow          0.000            
maxvmem      0.000

qacct -j of an identical job that ran:
chribo@master1:~/mpich-test> qacct -j 42971
==============================================================
qname        long-myri.q        
hostname     node0117a.mbit.mh.hpc.unizh.ch
group        i2702              
owner        chribo             
project      id                 
department   id                 
jobname      test-long-4-11     
jobnumber    42971              
taskid       undefined
account      sge                
priority     0                  
qsub_time    Fri Jan 21 18:26:21 2005
start_time   Fri Jan 21 18:30:00 2005
end_time     Fri Jan 21 18:30:09 2005
granted_pe   mpich-gm           
slots        4                  
failed       0   
exit_status  0                  
ru_wallclock 9           
ru_utime     0           
ru_stime     0           
ru_maxrss    0                  
ru_ixrss     0                  
ru_ismrss    0                  
ru_idrss     0                  
ru_isrss     0                  
ru_minflt    20339              
ru_majflt    0                  
ru_nswap     0                  
ru_inblock   0                  
ru_oublock   0                  
ru_msgsnd    0                  
ru_msgrcv    0                  
ru_nsignals  0                  
ru_nvcsw     661                
ru_nivcsw    202                
cpu          0           
mem          0.001            
io           0.000            
iow          0.000            
maxvmem      142.223M
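
To compare accounting records like the two above programmatically, a small
Python sketch follows. It assumes the raw "qacct -j" output layout shown
here (a line of "=" characters before each record, then one "name value"
pair per line, with no shell prompt lines mixed in). The function names are
hypothetical helpers of my own, not SGE tools.

```python
def parse_qacct(output):
    """Parse raw 'qacct -j' text into a list of {field: value} dicts."""
    records = []
    current = None
    for line in output.splitlines():
        if line.startswith("====="):     # separator opens a new record
            current = {}
            records.append(current)
            continue
        if current is None or not line.strip():
            continue                     # skip text before the first record
        parts = line.split(None, 1)      # field name, then the rest
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return records

def failed_code(record):
    """Return the numeric part of the 'failed' field (0 if absent)."""
    parts = record.get("failed", "0").split()
    return int(parts[0]) if parts else 0

def failed_records(records):
    """Keep only the records with a non-zero 'failed' code."""
    return [r for r in records if failed_code(r) != 0]
```

Applied to the two records in this mail, failed_records() would keep only
job 42972, whose "failed" field starts with 21.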

-- 
=============================================================================
Christian Bolliger                 
IT Services                      | http://www.id.unizh.ch/
Central Systems / HPC            | http://www.matterhorn.unizh.ch/
University of Zuerich            | E-Mail: christian.bolliger at id.unizh.ch
Winterthurerstr. 190             | Tel: +41 (0)1 63 56775
CH-8057 Zuerich; Switzerland     | Fax: +41 (0)1 63 54505
Mime/S CA:                https://www.ca.unizh.ch/client/



