[GE users] Recurring problem with SGE 6.0u10 & OpenMPI

flengyel flengyel at gc.cuny.edu
Mon Jul 6 02:23:31 BST 2009


    I have a recurring problem with OpenMPI and SGE 6.0u10. The simple test job shown below
    ran successfully in April; now it does not run, and I get the following messages:

    Jobs can not run because they have no access to pe
        22943,    22979,    22983,    22952,    22980,    22981

    Jobs can not run because available slots combined under PE are not in range of job
        22943,    22979,    22983,    22952,    22980,    22981

    The parallel execution environment is defined as follows:

    qconf -sp mpich
    pe_name           mpich
    slots             999
    user_lists        Research
    xuser_lists       NONE
    start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
    allocation_rule   $fill_up
    control_slaves    TRUE
    job_is_first_task FALSE
    urgency_slots     min

    My account is in the Research ACL.
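
For anyone who wants to double-check the same thing, here is a minimal sketch of how the membership could be verified from the output of `qconf -su Research`. The helper name `acl_has_user` is hypothetical, and the layout it parses (a line beginning with `entries` followed by a comma-separated list of names) is an assumption about the output format. It does not resolve `@group` entries or entries continued across lines.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given user name appears in the
# "entries" field of `qconf -su <acl>` output read from stdin.
# Assumption: the field is one line of the form
#   entries user1,user2,...
# Group entries (@group) and backslash-continued lines are not handled.
acl_has_user() {
    local user=$1
    awk -v u="$user" '
        $1 == "entries" {
            sub(/^entries[ \t]+/, "")   # drop the field name
            gsub(/[ \t]/, "")           # drop stray whitespace
            n = split($0, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == u) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a live cluster (requires qconf in PATH):
#   qconf -su Research | acl_has_user "$USER" && echo "listed in Research"
```

The function only reads text from stdin, so it can be tried against saved `qconf` output without touching the cluster.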

    The job submitted was

    #$ -S /bin/bash
    #$ -q x86_64.q
    #$ -N sum
    #$ -pe mpich 2
    #$ -cwd

    export PATH=/home/nept/apps64/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
    mpirun --mca oob_tcp_listen_mode listen_thread --prefix \
    /home/nept/apps64/openmpi -v -np $NSLOTS -machinefile $TMPDIR/machines ./sum

    It ran in April -- now I'm getting the above messages.

    Why would the job have no access to the parallel environment? How is access
    determined? What needs to be done so that the jobs have access? The PE is listed in the
    pe_list of the queue specified above:

    $ qconf -sq x86_64.q
    qname                 x86_64.q
    hostlist              @coreduos
    seq_no                0
    load_thresholds       np_load_avg=4.0
    suspend_thresholds    NONE
    nsuspend              1
    suspend_interval      00:05:00
    priority              0
    min_cpu_interval      00:05:00
    processors            UNDEFINED
    qtype                 BATCH INTERACTIVE
    ckpt_list             NONE
    pe_list               gauss mpich namd
    rerun                 FALSE
    slots                 4
    tmpdir                /tmp
    shell                 /bin/bash
    prolog                NONE
    epilog                NONE
    shell_start_mode      posix_compliant
    starter_method        NONE
    suspend_method        NONE
    resume_method         NONE
    terminate_method      NONE
    notify                00:00:60
    owner_list            NONE
    user_lists            Research deadlineusers
    xuser_lists           NONE
    subordinate_list      NONE
    complex_values        NONE
    projects              NONE
    xprojects             NONE
    calendar              NONE
    initial_state         default
    s_rt                  INFINITY
    h_rt                  INFINITY
    s_cpu                 INFINITY
    h_cpu                 INFINITY
    s_fsize               INFINITY
    h_fsize               INFINITY
    s_data                INFINITY
    h_data                INFINITY
    s_stack               INFINITY
    h_stack               INFINITY
    s_core                INFINITY
    h_core                INFINITY
    s_rss                 INFINITY
    h_rss                 INFINITY
    s_vmem                INFINITY
    h_vmem                INFINITY

Many thanks,

