[GE users] Infiniband does not work with SGE+MPI on Linux

vaclam1 at fel.cvut.cz
Thu Dec 18 12:38:11 GMT 2008


Hi,

I have a problem running parallel programs that use MPI over InfiniBand.
When I run a parallel program through SGE over InfiniBand, the
program either never finishes or runs very, very slowly!

For example:
-----------------------------------------------------------------
frontend$> cat parallel_job.sh
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -e .
#$ -o .
#$ -V

INFINIBAND="true"

MY_PARALLEL_PROGRAM="./example1"

if [ "${INFINIBAND}" = "true" ]
then
   # InfiniBand
   # The same happens with:
   # mpirun --mca btl openib,self -np $NSLOTS ${MY_PARALLEL_PROGRAM}
   mpirun -np $NSLOTS ${MY_PARALLEL_PROGRAM}
else
   # Ethernet
   mpirun --mca btl tcp,self -np $NSLOTS ${MY_PARALLEL_PROGRAM}
fi


frontend$> qsub -pe ompi N -q vip.q parallel_job.sh
-----------------------------------------------------------------
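
To see from the job output which transport is really selected inside
the batch job, I can add BTL verbosity to the mpirun line. This is only
a sketch -- btl_base_verbose is a standard Open MPI MCA parameter, and
the level 30 is just an example value:
-----------------------------------------------------------------
# Same job script, but force the openib BTL and print BTL selection
# details into the job's output/error files:
mpirun --mca btl openib,self --mca btl_base_verbose 30 \
       -np $NSLOTS ${MY_PARALLEL_PROGRAM}
-----------------------------------------------------------------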


If N >= X, the parallel program never finishes. Some processes are
waiting at a barrier for messages. The messages are sent to the
waiting processes, but they are never delivered (the sending process
really does send them).

If N < X, the parallel program runs, but very, very slowly.

The value of the threshold X depends on the specific parallel program.
The problem occurs only when the messages are sent over InfiniBand.



When I run parallel programs through SGE over Ethernet, they run fine.
When I run parallel programs over InfiniBand directly from the command
line with Open MPI, they also run fine and fast.

For example:
-----------------------------------------------------------------
frontend$> mpirun --hostfile hostfile -np 12 ./example1

frontend$> cat hostfile
node-003 slots=4 max-slots=4
node-005 slots=4 max-slots=4
node-008 slots=4 max-slots=4
node-010 slots=4 max-slots=4
node-012 slots=4 max-slots=4
node-014 slots=4 max-slots=4

-----------------------------------------------------------------
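
Because the same binaries behave differently only when started through
SGE, I want to compare the environment that sge_execd gives to the
processes with my interactive shell -- in particular the locked-memory
limit, since execd_params sets H_MEMORYLOCKED=unlimited (see the
configuration below). A sketch of the check (check_limits.sh is just a
name I made up):
-----------------------------------------------------------------
frontend$> cat check_limits.sh
#!/bin/sh
# print the locked-memory limit as seen by a process started by SGE
ulimit -l

frontend$> qsub -q vip.q -o . -e . check_limits.sh
frontend$> ssh node-003 'ulimit -l'        # the same limit interactively
-----------------------------------------------------------------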

We have:

SGE 6.2

6 nodes with 4 cores each (Dual-Core AMD Opteron(tm) Processor 2218)
InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
(Tavor compatibility mode) (rev a0)
SuSE SLES 10

Open MPI 1.2.7
(configured with: ./configure --prefix=/home/openmpi-1.2.7/ \
                      --enable-mpirun-prefix-by-default \
                      --enable-mpi-threads --with-threads \
                      --with-loadleveler=/opt/ibmll/LoadL/full/ \
                      --with-openib=/usr/lib64/
)

OFED 1.3 (Cisco_OFED-1.3-fcs.sles10.iso)
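
To double-check what this Open MPI build actually contains, I can look
at ompi_info (just a sketch; I grep for the openib BTL and for the
gridengine support components):
-----------------------------------------------------------------
frontend$> ompi_info | grep openib       # is the openib BTL present?
frontend$> ompi_info | grep gridengine   # are the SGE/gridengine components present?
-----------------------------------------------------------------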

The configuration of the SGE is the following:

===============================================================================
frontend$> qconf -sconf
#global:
execd_spool_dir              /opt/sge6_2/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       /opt/sge6_2/default/common/prolog.sh
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           sgeadmin at star.star
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 H_MEMORYLOCKED=unlimited
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20004
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0

===============================================================================

(all nodes are the same)
frontend$> qconf -sconf node-003
#node-003.star:
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm

===============================================================================

frontend$> cat /opt/sge6_2/default/common/prolog.sh
#!/bin/sh

# Exit code returned when the slot check fails
# (the Czech identifiers below: CHYBOVY_KOD = "error code", FRONTA = "queue")
CHYBOVY_KOD_EXIT=100

ORIGINAL_ECHO="/bin/echo"

# queue names
FRONTA_SERIAL=serial.q
FRONTA_FAST=fast.q
FRONTA_SHORT=short.q
FRONTA_STANDARD=standard.q
FRONTA_LONG=long.q

# maximum number of slots allowed per queue
MAX_CPU_SERIAL=1
MAX_CPU_FAST=4
MAX_CPU_SHORT=8
MAX_CPU_STANDARD=4
MAX_CPU_LONG=12

error_cpus()
{
   $ORIGINAL_ECHO ""
   $ORIGINAL_ECHO "ERROR: You set bad values of slots (cpu, process) !"
   $ORIGINAL_ECHO ""
   $ORIGINAL_ECHO "Max. slots for queue is:"
   $ORIGINAL_ECHO "(queue) fast:     max. slots = ${MAX_CPU_FAST}"
   $ORIGINAL_ECHO "(queue) short:    max. slots = ${MAX_CPU_SHORT}"
   $ORIGINAL_ECHO "(queue) standard: max. slots = ${MAX_CPU_STANDARD}"
   $ORIGINAL_ECHO "(queue) long:     max. slots = ${MAX_CPU_LONG}"
   $ORIGINAL_ECHO ""

   exit ${CHYBOVY_KOD_EXIT}
}

case "${QUEUE}" in
      "${FRONTA_SERIAL}")
          if [ "${NSLOTS}" -lt "1" ] || [ "${NSLOTS}" -gt "${MAX_CPU_SERIAL}" ]
          then
               error_cpus
          fi
          ;;
....
      *)
          ;;
esac

===============================================================================

frontend$> qconf -shgrp @allhosts
group_name @allhosts
hostlist node-014.star node-012.star node-010.star node-008.star \
         node-005.star node-003.star

===============================================================================

frontend$> qconf -sp ompi
pe_name            ompi
slots              24
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

===============================================================================
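
Since control_slaves is TRUE, Open MPI should start its remote daemons
through SGE's "qrsh -inherit" on the hosts listed in $PE_HOSTFILE. A
small test of that path from inside a parallel job (a sketch only;
test_qrsh.sh is just a name I made up):
-----------------------------------------------------------------
frontend$> cat test_qrsh.sh
#!/bin/sh
# show the hosts/slots granted by the ompi PE
cat $PE_HOSTFILE
# try the same remote-start mechanism that tight integration uses
HOST=`head -1 $PE_HOSTFILE | cut -d" " -f1`
qrsh -inherit $HOST hostname

frontend$> qsub -pe ompi 8 -q vip.q test_qrsh.sh
-----------------------------------------------------------------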

(all queues are the same; they differ only in h_rt, seq_no and user_lists)
frontend$> qconf -sq vip.q
qname                 vip.q
hostlist              @allhosts
seq_no                5
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               ompi
rerun                 FALSE
slots                 2,[node-012.star=4],[node-008.star=4],[node-003.star=4], \
                      [node-014.star=4],[node-010.star=4],[node-005.star=4]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            vip_users
xuser_lists           NONE
subordinate_list      2slots_per_host.q=3
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY

===============================================================================

frontend$> qconf -ssconf
algorithm                         default
schedule_interval                 0:0:5
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  3
flush_finish_sec                  3
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         400000
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY

===============================================================================

frontend$> qconf -srqs max_slots_per_host
{
    name         max_slots_per_host
    description  "Na kazdem hostu mohou bezet maximalne 4 procesy"
    enabled      TRUE
    limit        hosts {*} to slots=04
}

===============================================================================

Any suggestions or ideas?

Milan
