[GE users] shepherd problem

Philippe Caussignac philippe.caussignac at epfl.ch
Fri Mar 16 17:23:15 GMT 2007


Hello,

I have a cluster of 6 dual-socket, dual-core Woodcrest nodes, with Myrinet MX, 
SUSE 10.2 and SGE 6.09.
I use the tight integration of Myrinet mpich-mx (compiled with the rsh 
command option) through $SGE_ROOT/mpi/startmpi.sh and 
$SGE_ROOT/mpi/stopmpi.sh.

When I submit jobs with 8 processors, everything is OK. Jobs with 
9-11 processors sometimes work and sometimes fail. Jobs with 12 or 
more processors never work.

The error message in the SGE error log is:

error:
cannot get connection to "shepherd" at host "node06"
error:
cannot get connection to "shepherd" at host "node02"
error:
cannot get connection to "shepherd" at host "node03"
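
To see which nodes are affected across several runs, the failures could be 
tallied per host. The following is only a minimal sketch in Python, assuming 
the log keeps the two-line format shown above; the function name 
failing_hosts and the sample text are illustrative and not part of SGE.

```python
import re
from collections import Counter

# Matches the message line shown above; the "error:" prefix sits on its
# own line in the log and is simply skipped by the search.
SHEPHERD_ERROR = re.compile(
    r'cannot get connection to "shepherd" at host "([^"]+)"'
)

def failing_hosts(log_text):
    """Count shepherd connection failures per host name (hypothetical helper)."""
    return Counter(SHEPHERD_ERROR.findall(log_text))

# Sample text copied from the error log quoted in this message.
sample = '''error:
cannot get connection to "shepherd" at host "node06"
error:
cannot get connection to "shepherd" at host "node02"
error:
cannot get connection to "shepherd" at host "node03"
'''

for host, count in sorted(failing_hosts(sample).items()):
    print(host, count)
```

Run against the full error log, this would show whether the failures 
concentrate on particular nodes or are spread over all of them.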

I have no idea what to do, except install SGE 5.3, which works perfectly on 
another Myrinet cluster.

-- 

                            Philippe Caussignac
                            EPFL FSB IMB LCVMM (Station 8)
                            CH-1015 LAUSANNE (Switzerland)
                            email: Philippe.Caussignac at epfl.ch
                            Phone: (41) 21 693 25 78
                            Fax:   (41) 21 693 55 30
