[GE users] some MPI jobs disappear

Steven Ruby steven.ruby at wni.com
Tue May 10 16:15:00 BST 2005


I'm having this same issue. I am using an automounted NFS spool directory,
though, and I don't see any unmounts of the automount at the time of or
before the errors.
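
For anyone who wants to cross-check the same thing, something along
these lines should show whether autofs expiry coincides with the
failures (the log path, grep patterns and mount point are only
examples -- adjust them to your own setup):

  # on an execution host that lost a job; paths are examples only
  mount | grep -i spool                  # is the NFS spool mounted right now?
  grep -i autofs /var/log/messages | grep -iE 'expire|umount'
  # then compare any expiry/unmount timestamps against the failure time
  # recorded in the qmaster messages file
  # (typically $SGE_ROOT/$SGE_CELL/spool/qmaster/messages)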


sr

--------
"Give me an army of West Point graduates and i'll win a battle. Give me
a handful of Texas Aggies and i'll win the war."
        -- Gen. George S. Patton

-----Original Message-----
From: Karen Brazier [mailto:karen.brazier at durham.ac.uk] 
Sent: Tuesday, May 10, 2005 9:56 AM
To: users at gridengine.sunsource.net
Subject: [GE users] some MPI jobs disappear

Hi Reuti,

Yes, the spool directories are local to the nodes.  The active_jobs
directory is owned by user and group 'codine' (sorry!) and has 755
permissions.  I've had no reports that non-MPI jobs are affected, and
even MPI jobs are only affected sometimes.  The startmpi.sh script from
before is identical to the one I have now.  The only non-standard bits
I found during the upgrade were in the start-up script in /etc/init.d,
to handle failover to a standby master.
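
(For completeness, this is the kind of check I mean on a node -- the
spool path and the job/task id below are placeholders for the local
spool directory and an actual job:

  ls -ld /var/spool/sge/$(hostname)/active_jobs
  ls -l  /var/spool/sge/$(hostname)/active_jobs/<jobid>.<taskid>
  cat    /var/spool/sge/$(hostname)/active_jobs/<jobid>.<taskid>/pe_hostfile

If the pe_hostfile is missing or unreadable at that point, startmpi.sh
fails in the way described in the quoted message below.)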

Many thanks,
Karen

On Tue, 10 May 2005, Reuti wrote:

> Hi Karen,
>
> do you have a local spool directory on the nodes, or is it central on an
> NFS server? Only MPI jobs are affected? Had you modified the startmpi.sh
> before the upgrade?
>
> CU - Reuti
>
>
> Karen Brazier wrote:
> > Hi,
> >
> > I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
> > leak in schedd is cured but now some MPI jobs are failing to start and
> > others produce an error message on exit.
> >
> > A summary of symptoms for the jobs that fail is:
> >
> > . startmpi.sh can't read (or can't find) the pe_hostfile
> > . jobs fail with error "21 : in recognising job"
> > . qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
> >   and has 'UNKNOWN' for hostname, group and owner
> > . SGE messages file reports, e.g.:
> > 05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
> > node087.beowulf.cluster in recognising job because: execd doesn't know
> > this job
> > 05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
> > reports running state for job (18560.1/master) in queue
> > "ether.q at node087.beowulf.cluster" while job is in state 65536
> > 05/09/2005 09:19:43|qmaster|hamilton|E|execd at node087.beowulf.cluster
> > reports running job (18560.1/master) in queue
> > "ether.q at node087.beowulf.cluster" that was not supposed to be there
-
> > killing
> >
> > Other MPI jobs complete, but produce error "100 : assumedly after job" and
> > the messages file states:  05/09/2005 15:28:01|qmaster|hamilton|E|tightly
> > integrated parallel task 18578.1 task 1.node025 failed - killing job
> >
> >
> > Can anyone point me towards the problem?
> >
> > Many thanks,
> > Karen
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



