[GE users] MPICH2 tight integration
reuti at staff.uni-marburg.de
Thu Aug 13 16:17:01 BST 2009
On 13.08.2009 at 16:22, skylar2 wrote:
> reuti wrote:
>> On 13.08.2009 at 01:44, skylar2 wrote:
>>> Has anyone gotten the MPICH2 tight integration working? I'm
>>> following the stuff in the wiki:
>>> I installed the mpich2_mpd package in $SGE_ROOT and set up my
>>> environment like so:
>>> pe_name mpich2_mpd
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /net/gs/vol3/software/sge/mpich2_mpd/startmpich2.sh \
>>> -catch_rsh $pe_hostfile \
>>> stop_proc_args /net/gs/vol3/software/sge/mpich2_mpd/stopmpich2.sh \
>>> -catch_rsh \
>>> allocation_rule $round_robin
>>> control_slaves TRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
>>> When I try to run a job it fails during the mpd boot up:
>> Was your first test already with such a high number of nodes? Does it
>> work with only 4/8/16... slots on 1/2/4 nodes?
>> a) Are all nodes the same, or could some of them have a firewall?
> They are all the same, and on a private network so no firewall. I also
> verified that I could mpdboot and mpirun outside SGE, and that
> worked fine.
>> b) It may be that the startup sequence just takes too long and you
>> have to increase the "sleep" time in startmpich2.sh. But then the
>> startup would be really long.
> I doubled the SLEEPTIME and it still didn't work even with only 24
> slots, so I don't think that was the issue.
What confuses me is that in your last email I saw:
startmpich2.sh: got only 8 of 23 nodes, aborting
Looks like it's working on some nodes. Did you get slots from
different queues, i.e. is the PE bound to more than one queue?
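
One way to check this from inside a job is to group the granted slots by queue. Below is a minimal Python sketch, assuming the usual pe_hostfile layout of one line per host with host name, slot count, queue instance (queue@host) and processor range. The function name and the sample hosts and queues are made up for illustration and are not taken from your cluster.

```python
from collections import defaultdict


def summarize_pe_hostfile(text):
    """Return {queue: (number_of_hosts, total_slots)} for a pe_hostfile.

    Assumes each non-empty line looks like:
        <host> <slots> <queue>@<host> <processor range>
    """
    slots_per_queue = defaultdict(int)
    hosts_per_queue = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, slots, queue_instance = fields[0], int(fields[1]), fields[2]
        # "all.q@node01" -> "all.q"
        queue = queue_instance.split("@", 1)[0]
        slots_per_queue[queue] += slots
        hosts_per_queue[queue].add(host)
    return {
        queue: (len(hosts_per_queue[queue]), slots_per_queue[queue])
        for queue in slots_per_queue
    }


# Hypothetical sample; in a real job, read the file named by $PE_HOSTFILE.
sample = """\
node01 4 all.q@node01 UNDEFINED
node02 4 all.q@node02 UNDEFINED
node02 2 extra.q@node02 UNDEFINED
"""

summary = summarize_pe_hostfile(sample)
for queue, (hosts, slots) in sorted(summary.items()):
    print(f"{queue}: {hosts} host(s), {slots} slot(s)")
```

If more than one queue shows up in the summary, the job got slots from different queues, and the same host may then appear on more than one line of the hostfile.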