[GE users] some MPI jobs disappear

Karen Brazier karen.brazier at durham.ac.uk
Tue May 10 16:52:36 BST 2005


YES, that looks very familiar.  Thanks!  I'll take a look at the
workaround and get back to you if I can't fix it.

Thanks again,
--Karen
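
P.S. For anyone else chasing these symptoms, here is a rough sketch
(Python, not part of SGE) for pulling the affected job ids out of the
qmaster messages file. It assumes each entry is a single line with five
pipe-separated fields, as in the excerpts quoted below, and the phrases
it searches for are copied from those excerpts. The names parse_line and
suspect_jobs are made up for illustration.

```python
import re
from collections import defaultdict

# Matches "job 18560.1" or "job (18560.1/master)" in the message text.
JOB_ID = re.compile(r"job \(?(\d+\.\d+)")

# Phrases copied from the log excerpts in this thread (assumption: other
# SGE versions may word these messages differently).
PHRASES = (
    "execd doesn't know this job",
    "that was not supposed to be there",
    "while job is in state",
)


def parse_line(line):
    """Split one entry into (timestamp, daemon, host, level, text).

    Returns None if the line does not have five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(parts)


def suspect_jobs(lines):
    """Map job id -> list of (timestamp, level, text) for matching entries."""
    hits = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, _daemon, _host, level, text = parsed
        if not any(phrase in text for phrase in PHRASES):
            continue
        match = JOB_ID.search(text)
        if match:
            hits[match.group(1)].append((timestamp, level, text))
    return dict(hits)
```

To use it, open the qmaster messages file for your cell (the location
depends on your installation) and pass the file object to suspect_jobs;
the keys of the result are the job ids worth checking with qacct.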

On Tue, 10 May 2005, Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

> Hello,
>
> there is an issue with u2 and u3 which can lead to vanished PE jobs. Is
> issue 1430 what you are experiencing?
>
> Cheers,
> Stephan
>
> Steven Ruby wrote:
>
> >I'm having this same issue, although I am using an automounted NFS spool
> >dir. I don't see any unmounts of the automount at or before the time of
> >the errors.
> >
> >
> >sr
> >
> >--------
> >"Give me an army of West Point graduates and i'll win a battle. Give me
> >a handful of Texas Aggies and i'll win the war."
> >        -- Gen. George S. Patton
> >
> >-----Original Message-----
> >From: Karen Brazier [mailto:karen.brazier at durham.ac.uk]
> >Sent: Tuesday, May 10, 2005 9:56 AM
> >To: users at gridengine.sunsource.net
> >Subject: [GE users] some MPI jobs disappear
> >
> >Hi Reuti,
> >
> >Yes, the spool directories are local to the nodes.  The active_jobs
> >directory is owned by user and group 'codine' (sorry!) and has 755
> >permissions.  I've had no reports that non-MPI jobs are affected, and
> >even MPI jobs are only affected sometimes.  The startmpi.sh script from
> >before is identical to the one I have now.  The only non-standard bits
> >I found during the upgrade were in the start-up script in /etc/init.d, to
> >handle failover to a standby master.
> >
> >Many thanks,
> >Karen
> >
> >On Tue, 10 May 2005, Reuti wrote:
> >
> >>Hi Karen,
> >>
> >>do you have a local spool directory on the nodes, or is it central on a
> >>NFS server? Only MPI jobs are affected? Had you modified the startmpi.sh
> >>before the upgrade?
> >>
> >>CU - Reuti
> >>
> >>
> >>Karen Brazier wrote:
> >>
> >>
> >>>Hi,
> >>>
> >>>I've recently upgraded from 6.0u1 to 6.0u3, which means that the memory
> >>>leak in schedd is cured but now some MPI jobs are failing to start and
> >>>others produce an error message on exit.
> >>>
> >>>A summary of symptoms for the jobs that fail is:
> >>>
> >>>. startmpi.sh can't read (or can't find) the pe_hostfile
> >>>. jobs fail with error "21 : in recognising job"
> >>>. qacct has very little info: it gives qsub time as 1 Jan 01:00:00 1970
> >>>  and has 'UNKNOWN' for hostname, group and owner
> >>>. SGE messages file reports, e.g:
> >>>05/09/2005 09:18:57|qmaster|hamilton|W|job 18560.1 failed on host
> >>>node087.beowulf.cluster in recognising job because: execd doesn't know
> >>>this job
> >>>05/09/2005 09:19:03|qmaster|hamilton|E|execd node087.beowulf.cluster
> >>>reports running state for job (18560.1/master) in queue
> >>>"ether.q at node087.beowulf.cluster" while job is in state 65536
> >>>05/09/2005 09:19:43|qmaster|hamilton|E|execd at node087.beowulf.cluster
> >>>reports running job (18560.1/master) in queue
> >>>"ether.q at node087.beowulf.cluster" that was not supposed to be there -
> >>>killing
> >>>
> >>>Other MPI jobs complete, but produce error "100 : assumedly after job" and
> >>>the messages file states:
> >>>05/09/2005 15:28:01|qmaster|hamilton|E|tightly integrated parallel task
> >>>18578.1 task 1.node025 failed - killing job
> >>>
> >>>Can anyone point me towards the problem?
> >>>
> >>>Many thanks,
> >>>Karen
> >>>
> >>>---------------------------------------------------------------------
> >>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>
> >
>
>




