[GE users] some MPI jobs disappear

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Tue May 10 16:27:36 BST 2005


Hello,

There is an issue with u3 and u2 that can cause PE jobs to vanish. Is
issue 1430 what you are experiencing?

Cheers,
Stephan

Steven Ruby wrote:

>I'm having the same issue. I am using an automounted NFS spool dir, though,
>but I don't see any unmounts of the automount at the time of or before
>the errors.
>
>
>sr
>
>--------
>"Give me an army of West Point graduates and i'll win a battle. Give me
>a handful of Texas Aggies and i'll win the war."
>        -- Gen. George S. Patton
>
>-----Original Message-----
>From: Karen Brazier [mailto:karen.brazier at durham.ac.uk] 
>Sent: Tuesday, May 10, 2005 9:56 AM
>To: users at gridengine.sunsource.net
>Subject: [GE users] some MPI jobs disappear
>
>Hi Reuti,
>
>Yes, the spool directories are local to the nodes.  The active_jobs
>directory is owned by user and group 'codine' (sorry!) and has 755
>permissions.  I've had no reports that non-MPI jobs are affected, and
>even MPI jobs are only affected sometimes.  The startmpi.sh script from
>before is identical to the one I have now.  The only non-standard bits
>I found during the upgrade were in the start-up script in /etc/init.d,
>to handle failover to a standby master.
>
>Many thanks,
>Karen
>
>On Tue, 10 May 2005, Reuti wrote:
>
>>Hi Karen,
>>
>>do you have a local spool directory on the nodes, or is it central on a
>>NFS server? Only MPI jobs are affected? Had you modified the startmpi.sh
>>before the upgrade?
>>
>>CU - Reuti
>>
>>
>>Karen Brazier wrote:
>>
>>>Hi,
>>>
>>>I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
>>>leak in schedd is cured but now some MPI jobs are failing to start and
>>>others produce an error message on exit.
>>>
>>>A summary of symptoms for the jobs that fail is:
>>>
>>>. startmpi.sh can't read (or can't find) the pe_hostfile
>>>. jobs fail with error "21 : in recognising job"
>>>. qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
>>>  and has 'UNKNOWN' for hostname, group and owner
>>>. SGE messages file reports, e.g.:
>>>05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
>>>node087.beowulf.cluster in recognising job because: execd doesn't know
>>>this job
>>>05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
>>>reports running state for job (18560.1/master) in queue
>>>"ether.q at node087.beowulf.cluster" while job is in state 65536
>>>05/09/2005 09:19:43|qmaster|hamilton|E|execd at node087.beowulf.cluster
>>>reports running job (18560.1/master) in queue
>>>"ether.q at node087.beowulf.cluster" that was not supposed to be there -
>>>killing
>>>
>>>Other MPI jobs complete, but produce error "100 : assumedly after job" and
>>>the messages file states:
>>>05/09/2005 15:28:01|qmaster|hamilton|E|tightly
>>>integrated parallel task 18578.1 task 1.node025 failed - killing job
>>>
>>>
>>>Can anyone point me towards the problem?
>>>
>>>Many thanks,
>>>Karen
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
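
To make it easier to see which jobs are hit, the qmaster messages entries
quoted above can be grouped by job id. The following is only a rough sketch,
not part of Grid Engine: it assumes each entry sits on a single line in the
real file (the mail client wrapped them here) and that the layout is
"date time|component|host|level|text", as inferred from the quoted lines.
The function name group_by_job and the two regular expressions are
hypothetical helpers made up for this illustration.

```python
import re
from collections import defaultdict

# Layout inferred from the quoted entries (an assumption, not a documented
# format): MM/DD/YYYY HH:MM:SS|component|host|level|text
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<component>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<text>.*)$"
)

# Job ids show up as "job 18560.1", "job (18560.1/master)" or "task 18578.1".
JOB_RE = re.compile(r"(?:job|task) \(?(\d+\.\d+)")


def group_by_job(lines, levels=("W", "E")):
    """Group warning/error entries by the first job id found in their text."""
    grouped = defaultdict(list)
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None or entry.group("level") not in levels:
            continue  # not a log entry, or not a level we care about
        job = JOB_RE.search(entry.group("text"))
        if job is None:
            continue  # entry does not mention a job id
        grouped[job.group(1)].append(
            (entry.group("date"), entry.group("time"),
             entry.group("level"), entry.group("text"))
        )
    return dict(grouped)


# Entries copied from the thread, re-joined onto single lines.
sample = [
    "05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host "
    "node087.beowulf.cluster in recognising job because: execd doesn't "
    "know this job",
    "05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster "
    "reports running state for job (18560.1/master) in queue "
    '"ether.q at node087.beowulf.cluster" while job is in state 65536',
    "05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel "
    "task 18578.1 task 1.node025 failed - killing job",
]

for job_id, entries in sorted(group_by_job(sample).items()):
    print(job_id, len(entries), "entries")
```

To run it against a real file, pass the open file object instead of the
sample list, e.g. group_by_job(open(path)), where path is wherever your
qmaster messages file lives (site-specific). The resulting job ids can then
be compared with the qacct output for the same jobs.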



