I'm seeing endless streams of messages like this in the execd message files on some of our nodes:

Mon Apr  3 08:50:31 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:51:13 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:51:55 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:52:37 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf

The message repeats at least once per minute and will go on for
weeks if there is no intervention.  This doesn't prevent jobs from
running, so it can go unnoticed except that the message files
become very large.

I can stop these messages with a 'rcsge softstop -execd' followed by a
'rcsge start -execd' on the problem node(s).
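
When several nodes are affected I do this with a small shell loop along
these lines (the node names are placeholders, and this assumes root ssh
access to the execution hosts):

for node in n157 n158 n197; do
    # restarting the execd is what stops the repeating messages
    ssh root@$node 'rcsge softstop -execd && rcsge start -execd'
done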


One circumstance I've found that will trigger these unending messages
is when the rank 0 process of an mpich program aborts without calling
mpi_finalize.  When the job's slave processes notice that the master
process is gone, the slaves abort too, and often (but not always)
these messages start filling the slave node message files.

Below is a simple program and script that usually triggers this here
(I say "usually" because sometimes it runs without causing the
problem, sometimes just a few slave nodes show the problem, and
sometimes all the slave nodes do).

The program is a slight modification of one of the Fortran example
programs that comes with mpich.  The modification adds a
30 second sleep when the input value is negative - just to make sure
there is some time separation between job initiation and termination.

c**********************************************************************
c   pi.f - compute pi by integrating f(x) = 4/(1 + x**2)     
c     
c   Each node: 
c    1) receives the number of rectangles used in the approximation.
c    2) calculates the areas of its rectangles.
c    3) Synchronizes for a global summation.
c   Node 0 prints the result.
c
c  Variables:
c
c    pi  the calculated result
c    n   number of points of integration.  
c    x           midpoint of each rectangle's interval
c    f           function to integrate
c    sum,pi      area of rectangles
c    tmp         temporary scratch space for global summation
c    i           do loop index
c****************************************************************************
      program main

      include 'mpif.h'

      double precision  PI25DT
      parameter        (PI25DT = 3.141592653589793238462643d0)

      double precision  mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, rc
c                                 function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, 'Process ', myid, ' of ', numprocs, ' is alive'

      sizetype   = 1
      sumtype    = 2
      
 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
      endif
      
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)

c                                 check for quit signal
      if ( n .eq. 0 ) goto 30
      if ( n .lt. 0 ) then
         call sleep(30)           ! 30 secs
         go to 10
      endif

c                                 calculate the interval size
      h = 1.0d0/n

      sum  = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum

c                                 collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $     MPI_COMM_WORLD,ierr)

c                                 node 0 prints the answer.
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format('  pi is approximately: ', F18.16,
     +          '  Error is: ', F18.16)
      endif

      goto 10

 30   call MPI_FINALIZE(rc)
      stop
      end
------
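
For anyone trying to reproduce this, I build it with the MPICH Fortran
wrapper, roughly as below (the wrapper name and flags depend on the
local MPICH install, so treat this as a sketch):

# compile the example; the output name pi3 matches the script below
mpif77 -o pi3 pi.f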

The SGE script to run this is:

#!/bin/bash

#$ -N pi3
#$ -cwd
#$ -pe mpich 4
#$ -v COMMD_PORT
#$ -V

mpirun -np $NSLOTS -machinefile $TMPDIR/machines pi3 <<end
100
-1
100
end
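
The job is submitted in the usual way (assuming the script above is
saved as pi3.sh; the file name is just for illustration):

# submit to SGE; the #$ directives above request the mpich PE with 4 slots
qsub pi3.sh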

When the rank 0 process runs out of input (the 'end' line only
terminates the here-document), the next read on stdin fails with an
end-of-file error and the process aborts without calling MPI_FINALIZE.

The job starts on node 173 and activates slaves on nodes 173, 158, 157
and 197.

Here's what appears in the output file:

/sharedsys/gridengine/bin/glinux/qrsh -inherit -nostdin
n157.cluster.chem.dal.ca /home/pmacinnis/misc/mpich/rmlog/pi3
n173.cluster.chem.dal.ca 47003 \-p4amslave \-p4yourname
n157.cluster.chem.dal.ca \-p4rmrank 2
/sharedsys/gridengine/bin/glinux/qrsh -inherit -nostdin
n197.cluster.chem.dal.ca /home/pmacinnis/misc/mpich/rmlog/pi3
n173.cluster.chem.dal.ca 47003 \-p4amslave \-p4yourname
n197.cluster.chem.dal.ca \-p4rmrank 3
 Process            0  of            4  is alive
Enter the number of intervals: (0 quits)
 Process            1  of            4  is alive
 Process            2  of            4  is alive
 Process            3  of            4  is alive
  pi is approximately: 3.1416009869231249  Error is: 0.0000083333333318
Enter the number of intervals: (0 quits)
Enter the number of intervals: (0 quits)
  pi is approximately: 3.1416009869231249  Error is: 0.0000083333333318
Enter the number of intervals: (0 quits)
p1_29286:  p4_error: net_recv read:  probable EOF on socket: 1
p3_1712:  p4_error: net_recv read:  probable EOF on socket: 1
p2_15447:  p4_error: net_recv read:  probable EOF on socket: 1
bm_list_4708: (34.752555) wakeup_slave: unable to interrupt slave 0 pid
4707
bm_list_4708: (34.752637) wakeup_slave: unable to interrupt slave 0 pid
4707
bm_list_4708: (34.752671) wakeup_slave: unable to interrupt slave 0 pid
4707

Here's the error output file:

forrtl: severe (24): end-of-file during read, unit 5, file stdin
Image              PC        Routine            Line        Source
pi3                080BA6E8  Unknown               Unknown  Unknown
pi3                080BA1E0  Unknown               Unknown  Unknown
pi3                080B8DB1  Unknown               Unknown  Unknown
pi3                0808A464  Unknown               Unknown  Unknown
pi3                0808A907  Unknown               Unknown  Unknown
pi3                0807C59C  Unknown               Unknown  Unknown
pi3                0804B346  Unknown               Unknown  Unknown
pi3                0804B1BD  Unknown               Unknown  Unknown
libc.so.6          40085A47  Unknown               Unknown  Unknown
pi3                0804B051  Unknown               Unknown  Unknown
:can't open environment file: No such file or directory:can't open
environment file: No such file or directory:can't open environment file:
No such file or directory

Here's the message file for slave node 158:

Mon Apr  3 08:48:25 2006|execd|n158|I|SIGNAL jid: 18001 jatask: 1 signal:
KILL
Mon Apr  3 08:48:25 2006|execd|n158|E|can't remove directory
"active_jobs/18001.1": rmdir(active_jobs/18001.1/1.n158) failed: Directory
not empty
Mon Apr  3 08:48:27 2006|execd|n158|E|mailer exited with exit status = 11
Mon Apr  3 08:49:07 2006|execd|n158|E|acknowledge for unknown job
18001.1/master
Mon Apr  3 08:49:07 2006|execd|n158|E|incorrect config file for job
18001.1
Mon Apr  3 08:49:07 2006|execd|n158|E|ERROR: unlinking
"jobs/00/0001/8001.1": No such file or directory
Mon Apr  3 08:49:07 2006|execd|n158|E|can not remove job spool file:
jobs/00/0001/8001.1
Mon Apr  3 08:49:07 2006|execd|n158|E|can't remove directory
"active_jobs/18001.1": opendir(active_jobs/18001.1) failed: No such file
or directory
Mon Apr  3 08:49:07 2006|execd|n158|E|ja-task "18001.1" is unknown -
reporting it to qmaster
Mon Apr  3 08:49:49 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:49:49 2006|execd|n158|E|acknowledge for unknown job
18001.1/master
Mon Apr  3 08:49:49 2006|execd|n158|E|can't find active jobs directory
"active_jobs/18001.1" for reaping job 18001
Mon Apr  3 08:49:49 2006|execd|n158|E|ERROR: unlinking
"jobs/00/0001/8001.1": No such file or directory
Mon Apr  3 08:49:49 2006|execd|n158|E|can not remove job spool file:
jobs/00/0001/8001.1
Mon Apr  3 08:49:49 2006|execd|n158|E|can't remove directory
"active_jobs/18001.1": opendir(active_jobs/18001.1) failed: No such file
or directory
Mon Apr  3 08:50:31 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:51:13 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:51:55 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
Mon Apr  3 08:52:37 2006|execd|n158|E|could not find job report for job
18001.1 task 1.n158 contained in job usage from ptf
 ...

Does anyone else have this problem or know a solution?

Paul