[GE users] execd doesn't know this job (disappearing jobs, 't' problem)

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Tue Jan 25 12:46:39 GMT 2005


Hi Christian,

I had a look at the messages files and the accounting file, and it looks to me
like you have an execd spool issue. You might have gotten there differently
than described in issue 103... but your jobs are disappearing because there
are spool problems:

Example:
01/23/2005 20:29:31|execd|node0034a|E|can't find active jobs directory "active_jobs/46964.1" for reaping job 46964
01/23/2005 20:29:31|execd|node0034a|E|ERROR: unlinking "jobs/00/0004/6964.1": No such file or directory
01/23/2005 20:29:31|execd|node0034a|E|can not remove file job spool file: jobs/00/0004/6964.1
01/23/2005 20:29:31|execd|node0034a|E|can't remove directory "active_jobs/46964.1": opendir(active_jobs/46964.1) failed: No such file or directory
01/23/2005 20:29:31|execd|node0034a|E|ja-task "46964.1" is unknown - reporting it to qmaster

and the accounting file has incomplete data for this job.
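
If it is of any use, here is a rough sketch (my own helper, not part of SGE)
of how one might scan an execd messages file for exactly these spool related
errors. The default path below is only an example and depends on where your
local execd spool dir lives:

    #!/usr/bin/env python
    # Hypothetical helper, not part of SGE: scan an execd messages file for
    # the spool related error messages quoted above.
    import re
    import sys

    # Example path only -- adjust it to your local execd spool dir layout.
    messages = sys.argv[1] if len(sys.argv) > 1 else "/var/spool/sge/node0034a/messages"

    patterns = [
        r"can't find active jobs directory",
        r"ERROR: unlinking",
        r"can not remove file job spool",
        r"can't remove directory",
        r"is unknown - reporting it to qmaster",
    ]

    with open(messages) as f:
        for line in f:
            if any(re.search(p, line) for p in patterns):
                print(line.rstrip())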

I was able to fix the problem with the following steps (a small sketch of
them as a script follows below):

- shutting down the execd
- making sure that the following dirs exist in the execd spool dir:
    - active_jobs
    - job_scripts
    - jobs
- making sure that the admin_user has read and write permissions for
  those dirs
- restarting the execd.
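
In case it helps, here is a minimal sketch of those steps as a script. It is
only an illustration, not an official tool: it assumes the execd on the host
has already been shut down, and the spool path and admin user below are
placeholders that you would have to adapt to your setup.

    #!/usr/bin/env python
    # Hypothetical sketch of the recovery steps above -- not part of SGE.
    # Assumes the execd on this host has already been shut down.
    import os
    import pwd

    spool_dir = "/var/spool/sge/node0034a"   # placeholder: your execd spool dir
    admin_user = "sgeadmin"                  # placeholder: your admin_user

    pw = pwd.getpwnam(admin_user)

    for name in ("active_jobs", "job_scripts", "jobs"):
        path = os.path.join(spool_dir, name)
        if not os.path.isdir(path):
            print("creating missing dir: %s" % path)
            os.makedirs(path)
        # make sure the admin user owns the dir and can read, write and search it
        os.chown(path, pw.pw_uid, pw.pw_gid)
        os.chmod(path, 0o755)

    print("done - you can restart the execd on this host now")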

In the other case I was working on, it was usually the job_scripts dir that
was missing, sometimes the active_jobs dir.

Can you please check your local spool dirs? I hope that
fixes your problem.

Cheers,
Stephan



Christian Bolliger wrote:

>As it seems, it came with u2. But u1 was only in test use with a limited
>number of nodes.
>
>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>>Did you have the problems right from the beginning or did it come with
>>u2? Did you use u1?
>>
>>
>>Have you configured the admin email? Whenever something important
>>goes wrong in the execd, it tries to send an email.
>>
>>Stephan
>>
>>Christian Bolliger wrote:
>>
>>>Sorry, I forgot that: it is lx26-amd64 (SuSE 9.2).
>>>The binaries were compiled by us.
>>>      
>>>
>>Hm.. now I know of 2 grids using lx2?-amd64 with this problem...
>>
>>>Christian Bolliger
>>>
>>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>
>>>>Christian Bolliger wrote:
>>>>
>>>>>Hello Stephan
>>>>>
>>>>>We have a cluster with 256 nodes: 128 nodes with Myrinet, 128 nodes
>>>>>with gbit. The bug is mainly relevant on Myrinet; it also occurs in
>>>>>the gbit queue, but there we have mostly sequential jobs.
>>>>>SGE setup:
>>>>>- NFS-reduced setup with local executables and local spool, but a
>>>>>distributed configuration (for the shadow masters).
>>>>>- Using Berkeley DB as the spooling mechanism on a dedicated server
>>>>>(single point of failure).
>>>>
>>>>Which operating systems are you using?
>>>>- We only saw the problem on lx24-amd64. But that might be because only
>>>>those machines were used for the large parallel jobs.
>>>>
>>>>>The files will be sent directly; they are too big for the list.
>>>>
>>>>Thank you very much. I hope I will find something.
>>>>
>>>>Stephan
>>>>
>>>>>Christian
>>>>>
>>>>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>
>>>>>>Christian Bolliger wrote:
>>>>>>
>>>>>>>In that case it is not the same problem as 103 (I overlooked
>>>>>>>something). It is a statistical problem; it just appears
>>>>>>>sometimes, more or less equally distributed over all hosts.
>>>>>>>The problem of creating $SGE_ROOT/$SGE_CELL/spool/$hostname has
>>>>>>>been solved by our installation procedure.
>>>>>>>
>>>>>>That means that this bug is still in there. I was hoping that
>>>>>>it is just the spool dirs. It looks as if it is not only that.
>>>>>>Do you have some more information for me? How is your grid set
>>>>>>up? What are the operating systems that you are using? Is this
>>>>>>problem operating-system specific?
>>>>>>
>>>>>>Can you send me:
>>>>>>- your acct file
>>>>>>- the qmaster messages file
>>>>>>- one or two execd messages files from hosts on which that
>>>>>>problem happened?
>>>>>>
>>>>>>Thank you very much.
>>>>>>Stephan
>>>>>>
>>>>>>>Thanks
>>>>>>>Christian
>>>>>>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>>
>>>>>>>>Christian Bolliger wrote:
>>>>>>>>
>>>>>>>>>Hello Stephan
>>>>>>>>>Could be the same problem. I still have the execds on 6.0u2; I
>>>>>>>>>will update them to 6.0u3 and test again.
>>>>>>>>>
>>>>>>>>This is not fixed in u3. A simple check of the spool dirs and
>>>>>>>>creating the missing dirs will do it.
>>>>>>>>
>>>>>>>>Cheers,
>>>>>>>>Stephan
>>>>>>>>
>>>>>>>>>Thanks
>>>>>>>>>Christian
>>>>>>>>>
>>>>>>>>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>>>>>>>>
>>>>>>>>>>Hello,
>>>>>>>>>>
>>>>>>>>>>I just worked on a similar problem and was able to narrow it down
>>>>>>>>>>to a spooling issue on the execd side. The execd spool dir got
>>>>>>>>>>changed from NFS spooling to local spooling. This generated an
>>>>>>>>>>incomplete set of dirs in the spool dir.
>>>>>>>>>>
>>>>>>>>>>I do not know if your problem is triggered by the same issue, or
>>>>>>>>>>if I just got lucky and did not run into the t-state problem
>>>>>>>>>>again. The spool dir issue is number 103.
>>>>>>>>>>
>>>>>>>>>>Is your problem related to 103?
>>>>>>>>>>
>>>>>>>>>>We also found a bug in file staging. If the file does not exist,
>>>>>>>>>>a job will disappear and an email will be sent. Do you use file
>>>>>>>>>>staging for your jobs?
>>>>>>>>>>
>>>>>>>>>>Cheers,
>>>>>>>>>>Stephan
>>>>>>>>>>
>>>>>>>>>>Christian Bolliger wrote:
>>>>>>>>>>
>>>>>>>>>>>Hello
>>>>>>>>>>>Sorry for bringing up a problem again. We are using SGE 6.0u3; I
>>>>>>>>>>>previously thought that the problem was linked to the file handle
>>>>>>>>>>>problem in 6.0u2.
>>>>>>>>>>>
>>>>>>>>>>>Jobs in our Myrinet section tend to disappear in the starting
>>>>>>>>>>>phase (it seems that gbit mpich jobs are also affected). They
>>>>>>>>>>>are taken into the 't' state and then quit without any output
>>>>>>>>>>>(users call it the 't' problem). Jobs using more CPUs are more
>>>>>>>>>>>likely to disappear. It is not limited to specific exec hosts.
>>>>>>>>>>>It seems to be a kind of race condition.
>>>>>>>>>>>
>>>>>>>>>>>This problem really hinders production; some users are demanding
>>>>>>>>>>>PBS :(.
>>>>>>>>>>>
>>>>>>>>>>>Many thanks for helping
>>>>>>>>>>>Christian