[GE users] Rmpi under SGE

arnuschky arne.brutschy at ulb.ac.be
Fri Dec 17 17:12:35 GMT 2010

On Fri, 2010-12-17 at 15:02 +0100, reuti wrote:
> Am 17.12.2010 um 12:58 schrieb arnuschky:
> > Ah. My previous message was slightly premature: Rmpi jobs with > 20 slaves
> > still fail (even with Reuti's fixes):
> > 
> >        $ cat test-mpi-17942.e3480568
> >        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> >        error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> You are now using the plain -builtin- startup method? Does it happen on all hosts for such a job?

Yes, here's my current config:
        $ qconf -sconf
        execd_spool_dir              /opt/gridengine/default/spool
        mailer                       /bin/mail
        xterm                        /usr/bin/X11/xterm
        load_sensor                  none
        prolog                       none
        epilog                       none
        shell_start_mode             posix_compliant
        login_shells                 sh,ksh,csh,tcsh
        min_uid                      0
        min_gid                      0
        user_lists                   none
        xuser_lists                  none
        projects                     none
        xprojects                    none
        enforce_project              false
        enforce_user                 auto
        load_report_time             00:00:40
        max_unheard                  00:05:00
        reschedule_unknown           00:00:00
        loglevel                     log_warning
        administrator_mail           root at headnode
        set_token_cmd                none
        pag_cmd                      none
        token_extend_time            none
        shepherd_cmd                 none
        qmaster_params               none
        execd_params                 H_MEMORYLOCKED=infinity
        reporting_params             accounting=true reporting=false \
                                     flush_time=00:00:15 joblog=false sharelog=00:00:00
        finished_jobs                100
        gid_range                    20000-20100
        max_aj_instances             2000
        max_aj_tasks                 75000
        max_u_jobs                   35192
        max_jobs                     25000
        auto_user_oticket            0
        auto_user_fshare             0
        auto_user_default_project    none
        auto_user_delete_time        86400
        delegated_file_staging       false
        qlogin_command               builtin
        qlogin_daemon                builtin
        rlogin_command               builtin
        rlogin_daemon                builtin
        rsh_command                  builtin
        rsh_daemon                   builtin
        reprioritize                 0
        jsv_url                      none
        jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
        $ qconf -sp mpich_fu
        pe_name            mpich_fu
        slots              128
        user_lists         NONE
        xuser_lists        NONE
        start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
        stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
        allocation_rule    $fill_up
        control_slaves     TRUE
        job_is_first_task  FALSE
        urgency_slots      min
        accounting_summary FALSE
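
(For reference, the $pe_hostfile that startmpi.sh -catch_rsh consumes has one
line per host: hostname, granted slots, queue instance, processor range. A
minimal sketch of expanding it into an MPICH-style machinefile, one line per
slot; the function name is illustrative, not part of SGE:)

```shell
# Sketch: expand an SGE pe_hostfile (columns: host slots queue processors)
# into an MPICH-style machinefile, repeating each host once per granted slot.
# The helper name is hypothetical; only the pe_hostfile column layout is SGE's.
expand_pe_hostfile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Typical use inside a PE start script:
#   expand_pe_hostfile "$PE_HOSTFILE" > "$TMPDIR/machines"
```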

> Maybe it's something special on some nodes and would for some hosts happen with fewer slots too.

I don't think that the nodes differ: I reinstalled all of them yesterday,
and I tested separately on 2 different generations of nodes (2x2 cores and
2x4 cores per node). The problem just seems to become more likely the more
slots (and thus nodes) I use. Yet within one generation the nodes are
identical, and they all sit on a single switch.
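
(One way to narrow this down would be to probe every allocated host over the
same remote-startup path before mpirun runs. A hedged sketch; the helper name
is hypothetical, and the remote command is a parameter so the loop itself can
be exercised outside a cluster. Under the builtin method it would normally be
"qrsh -inherit":)

```shell
# Hypothetical diagnostic: run a trivial command on each host listed in a
# pe_hostfile and report which ones fail to answer. rcmd would typically be
# "qrsh -inherit" on an SGE execution host; it is passed in here so the loop
# can be tested with a stand-in. Returns non-zero if any host fails.
check_pe_hosts() {
    rcmd=$1; hostfile=$2; rc=0
    while read -r host slots rest; do
        if $rcmd "$host" true 2>/dev/null; then
            echo "$host ok"
        else
            echo "$host FAILED"
            rc=1
        fi
    done < "$hostfile"
    return $rc
}
```

Running this from inside a parallel job (e.g. `check_pe_hosts "qrsh -inherit"
"$PE_HOSTFILE"`) should show whether the 60-second timeouts cluster on
particular hosts or strike at random.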




More information about the gridengine-users mailing list