[GE users] some MPI jobs disappear

Steven Ruby steven.ruby at wni.com
Tue May 10 16:32:47 BST 2005


Just FYI, this is a Sun V20z N1GEu3 cluster with Myrinet on RHEL 3.0.


sr

--------
"Give me an army of West Point graduates and i'll win a battle. Give me
a handful of Texas Aggies and i'll win the war."
        -- Gen. George S. Patton

-----Original Message-----
From: Stephan Grell - Sun Germany - SSG - Software Engineer
[mailto:stephan.grell at sun.com] 
Sent: Tuesday, May 10, 2005 10:28 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] some MPI jobs disappear

Hello,

There is an issue with u3 and u2 which can lead to vanished PE jobs. Is
issue 1430 what you are experiencing?

Cheers,
Stephan

Steven Ruby wrote:

>I'm having the same issue. I am using an automounted NFS spool dir,
>though, and I don't see any unmounts of the automount at or before the
>time of the errors.
>
>
>sr
>
>
>-----Original Message-----
>From: Karen Brazier [mailto:karen.brazier at durham.ac.uk] 
>Sent: Tuesday, May 10, 2005 9:56 AM
>To: users at gridengine.sunsource.net
>Subject: [GE users] some MPI jobs disappear
>
>Hi Reuti,
>
>Yes, the spool directories are local to the nodes.  The active_jobs
>directory is owned by user and group 'codine' (sorry!) and has 755
>permissions.  I've had no reports that non-MPI jobs are affected, and
>even MPI jobs are only affected sometimes.  The startmpi.sh script from
>before is identical to the one I have now.  The only non-standard bits
>I found during the upgrade were in the start-up script in /etc/init.d,
>to handle failover to a standby master.
>
>Many thanks,
>Karen
>
>On Tue, 10 May 2005, Reuti wrote:
>
>>Hi Karen,
>>
>>Do you have a local spool directory on the nodes, or is it central on an
>>NFS server? Are only MPI jobs affected? Had you modified startmpi.sh
>>before the upgrade?
>>
>>CU - Reuti
>>
>>
>>Karen Brazier wrote:
>>
>>>Hi,
>>>
>>>I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
>>>leak in schedd is cured but now some MPI jobs are failing to start and
>>>others produce an error message on exit.
>>>
>>>A summary of symptoms for the jobs that fail is:
>>>
>>>. startmpi.sh can't read (or can't find) the pe_hostfile
>>>. jobs fail with error "21 : in recognising job"
>>>. qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
>>>  and has 'UNKNOWN' for hostname, group and owner
>>>. SGE messages file reports, e.g:
>>>05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
>>>node087.beowulf.cluster in recognising job because: execd doesn't know
>>>this job
>>>05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
>>>reports running state for job (18560.1/master) in queue
>>>"ether.q at node087.beowulf.cluster" while job is in state 65536
>>>05/09/2005 09:19:43|qmaster|hamilton|E|execd at node087.beowulf.cluster
>>>reports running job (18560.1/master) in queue
>>>"ether.q at node087.beowulf.cluster" that was not supposed to be there -
>>>killing
>>>
>>>Other MPI jobs complete, but produce error "100 : assumedly after job"
>>>and the messages file states:
>>>05/09/2005 15:28:01|qmaster|hamilton|E|tightly
>>>integrated parallel task 18578.1 task 1.node025 failed - killing job
>>>
>>>
>>>Can anyone point me towards the problem?
>>>
>>>Many thanks,
>>>Karen
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>
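For anyone chasing the same symptoms: the qmaster messages excerpts quoted
above appear to be pipe-delimited (timestamp|daemon|host|level|message).
Here is a minimal sketch, in Python, for pulling every entry that mentions
one job id out of such a file. The field layout is inferred only from the
excerpts in this thread, the names Entry, parse_line and entries_for_job
are made up for illustration, and it assumes each entry sits on a single
line in the real file (the wrapping above comes from the mail clients).

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Entry(NamedTuple):
    """One messages-file entry; layout assumed from the quoted excerpts."""
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split a line into its five assumed fields, or return None."""
    # Split on the first four '|' only, so a '|' inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


def entries_for_job(lines: Iterable[str], job_id: str) -> List[Entry]:
    """Return the parsed entries whose message mentions job_id."""
    # Match the id as a whole token: "18560.1" must not match "18560.10"
    # or "118560.1".
    pattern = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and pattern.search(entry.message):
            hits.append(entry)
    return hits


# Entries copied from this thread, rejoined onto single lines.
SAMPLE = [
    "05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host "
    "node087.beowulf.cluster in recognising job because: execd doesn't "
    "know this job",
    "05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster "
    "reports running state for job (18560.1/master) in queue "
    "\"ether.q at node087.beowulf.cluster\" while job is in state 65536",
    "05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel "
    "task 18578.1 task 1.node025 failed - killing job",
]

if __name__ == "__main__":
    for entry in entries_for_job(SAMPLE, "18560.1"):
        print(entry.timestamp, entry.level, entry.message)
```

To run it against a real cluster, pass an open file handle for your own
qmaster messages file instead of SAMPLE. Lines that do not fit the assumed
five-field layout are skipped rather than treated as errors.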


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net





