[GE users] PE Slots Problem

Brian R. Smith brs@usf.edu
Thu Aug 9 14:52:36 BST 2007



I sent this message out while the mailing lists were down, so I'm sending 
it again.  I'd also like to add that I did make sure any error states on 
hosts offering the PE mpi.p4 were cleared, in case anyone is wondering.

...

Hi all,

We're on GridEngine 6.0-u8 (yeah, I know, but we'll be upgrading to 6.1
in the next couple of weeks).  It's been fairly trouble-free, but I've
just run into an interesting problem.  Perhaps someone can shed some
light.

A user has submitted an 8-processor job to an 8-way Opteron box.  The
queue for this box has been configured to support the parallel
environment mpi.p4 as we see here:

[root@host ~]# qconf -sq smp.8.q | grep pe_list
pe_list               mpi.shm mpi.p4 ompi.tcp ompi openmp

Also, the PE itself is configured like so:

[root@host ~]# qconf -sp mpi.p4
pe_name           mpi.p4
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile  \
                  /usr/local/priv/mpi/bin/mpirun
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

We can see from qstat -r that the PE is only being requested by this
particular job:

[root@host ~]# qstat -r | grep mpi.p4
       Requested PE:     mpi.p4 8

and that only 8 slots are being requested.

When this job is submitted, it sits in the queue and qstat -j reports

[root@host ~]# qstat -j
==============================================================
job_number:                 41428
exec_file:                  job_scripts/41428
submission_time:            Thu Aug  2 11:16:25 2007
...
cannot run in PE "mpi.p4" because it only offers 0 slots

After blowing through a bunch of other queues with lower seq_no, it hits
smp.8.q.  There are enough slots on the queue to satisfy the job, but
the scheduler claims that there aren't enough provided by the PE.  I was
tempted to look for some global value for PE slots but that seems a bit
ridiculous.  Has anyone seen this before?  Is this a bug that was
corrected in a later release?  Did I miss something obvious?
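As a sanity check on the queue side, free slots can be tallied from qstat -f
output with a small awk filter.  This is just a sketch: the node name in the
here-doc is hypothetical sample data, and it assumes the 6.0-style slots
column of the form used/total; in practice you'd pipe in real output from
`qstat -f -q smp.8.q`:

```shell
# Sketch: report queue instances with at least 8 free slots.
# The here-doc holds hypothetical sample data; in practice pipe in
# the output of: qstat -f -q smp.8.q
awk 'NF >= 3 && $3 ~ /^[0-9]+\/[0-9]+$/ {
         split($3, s, "/")          # s[1] = used, s[2] = total
         free = s[2] - s[1]
         if (free >= 8) print $1, "has", free, "free slots"
     }' <<'EOF'
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
smp.8.q@node01.example.com     BIP   0/8       0.12     lx24-amd64
EOF
```

If that reports 8 free slots while the scheduler still says the PE offers 0,
the bottleneck is on the PE side rather than the queue side.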

In case you are curious, my complete PE requests, in this case, are the
following:

[root@irce qmaster]# qstat -r | grep PE
       Requested PE:     ompi.ib 1
       Granted PE:       ompi.ib 1
       Requested PE:     ompi.ib 4
       Granted PE:       ompi.ib 4
       Requested PE:     ompi.ib 40
       Granted PE:       ompi.ib 40
       Requested PE:     ompi.mx 10
       Granted PE:       ompi.mx 10
       Requested PE:     mpi.mx 4
       Granted PE:       mpi.mx 4
       Requested PE:     ompi.mx 14
       Granted PE:       ompi.mx 14
       Requested PE:     ompi.mx 8
       Granted PE:       ompi.mx 8
       Requested PE:     ompi.mx 14
       Requested PE:     ompi.mx 12
       Requested PE:     ompi.mx 12
       Requested PE:     ompi.mx 12
       Requested PE:     ompi.mx 12
       Requested PE:     ompi.mx 12
       Requested PE:     ompi.mx 14
       Requested PE:     mpi.p4 8
       Requested PE:     ompi.mx 10
       Requested PE:     ompi.tcp 8
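For reference, the granted slots per PE can be summed from output like the
above with a quick awk one-liner (a sketch; the here-doc just carries a
couple of sample lines in the same format, and real use would pipe in
`qstat -r` directly):

```shell
# Sketch: sum the slots currently granted per PE from `qstat -r` output.
# The here-doc is sample data; in practice: qstat -r | awk '...'
awk '/Granted PE:/ { used[$3] += $4 }
     END { for (pe in used) print pe, used[pe] }' <<'EOF'
       Granted PE:       ompi.ib 40
       Granted PE:       ompi.ib 4
       Granted PE:       ompi.mx 10
EOF
```

Against the list above, this kind of tally shows no granted mpi.p4 slots at
all, which is what makes the "offers 0 slots" message so puzzling given the
PE's slots value of 999.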

I'd appreciate any suggestions!

Thanks,
Brian Smith

-- 
Brian R. Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Mobile Phone: +1 813 230-3441
Organization URL: http://rc.usf.edu


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@gridengine.sunsource.net
For additional commands, e-mail: users-help@gridengine.sunsource.net



