[GE users] Intel MPI 3.1 tight integration

reuti reuti at staff.uni-marburg.de
Wed Nov 5 19:55:17 GMT 2008


I got the MPICH2 mpd working, but it's still far from a Tight
Integration for now. Another hint: mpdboot will need the argument
--totalnum=$NHOSTS, otherwise I always get only a daemon on the node
of the job script and nothing on the slaves.
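
For reference, a minimal job-script fragment showing where this fits
(an untested sketch; the wrapper path and program name are
placeholders, and the hostfile is assumed to be written by the PE's
start_proc_args):

# hypothetical job-script fragment
mpdboot --totalnum=$NHOSTS -f $TMP/mpd.hosts -r /path/to/qrsh-wrapper
mpiexec -n $NSLOTS ./my_mpi_program
mpdallexit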

But although the additional group id is attached to the daemons, no
one is counting their usage, as the shepherd has already exited. So
the ru_* values are wrong, and cpu, mem and io are missing as well. I
started the mpdboot in start_proc_args, so the counting shepherd is
also missing on the master node of the parallel job. But even when
mpdboot is started in the job script, the slave nodes still have no
one counting the usage or aborting the job in case of a qdel.

I'll try to prevent the daemons from daemonizing (just patching
mpd.py and removing the -d therein). This means something like the
helper program from my tight PVM integration is necessary, which
will fork off qrsh - even in the local case - and keep the shepherd
waiting.
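
The rough idea as an untested sketch (paths are placeholders, and the
real helper would also have to deal with the rsh-style flags, as in
Daniel's wrapper below): an rsh substitute for mpdboot that keeps the
qrsh call in the foreground, so a shepherd stays around per daemon
for accounting and for killing it on qdel:

#!/bin/sh
# hypothetical rsh replacement: run the (patched, non-daemonizing)
# mpd via qrsh -inherit in the foreground - even for the local node -
# so that a shepherd keeps waiting on it
host=$1; shift
exec qrsh -inherit $host "$@"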

To be continued...

-- Reuti


On 04.11.2008 at 18:48, Reuti wrote:

> Here, for the impatient, is how to get the integration working with
> several daemons, even per user:
>
> export MPD_CON_EXT="sge_$JOB_ID.$TASK_ID"
>
> before the mpdboot. The qrsh call will also need this variable, so
> the qrsh wrapper must also get the -V switch to supply it to the
> slave nodes.
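>
> (As a sketch, assuming the -V switch alone is enough to forward the
> variable, the wrapper would end in something like
>
>    exec qrsh -inherit -V $host "$@"
>
> so that MPD_CON_EXT is set when the remote mpd is started.)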
>
> I don't know whether this also applies to Intel MPI. I'll
> investigate further and update the Howto when I come to a clean
> solution.
>
> -- Reuti
>
>
> On 04.11.2008 at 17:19, Reuti wrote:
>
>> Hi Daniel,
>>
>> On 03.11.2008 at 15:42, Daniel Templeton wrote:
>>
>>> The mpd daemons do daemonize.  That means that the qrsh -inherit  
>>> returns before the actual work gets done, but shouldn't SGE pick  
>>> up the usage anyway using the GID?
>>
>> you are right, the GID is indeed still attached to the mpd, and
>> when ENABLE_ADDGRP_KILL is set, the daemon is also gone after a job.
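>>
>> (For anyone trying this: ENABLE_ADDGRP_KILL goes into the
>> execd_params of the cluster configuration, i.e. something like
>>
>>    execd_params    ENABLE_ADDGRP_KILL=TRUE
>>
>> set via qconf -mconf.)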
>>
>> But there are some limitations:
>>
>> a) besides ru_wallclock, all ru_* entries in the accounting  
>> records are missing, as the process was detached from the shepherd.
>>
>> b) with two jobs of the same user on a node, the second job will
>> kill the first job's mpd.py instances but leave the binaries
>> running, when one mpd.py is started by mpdboot per job.
>>
>> There is only one entry in /tmp for each user. In LAM/MPI, e.g.,
>> they added special strings containing "sge" and the $JOB_ID to get
>> dedicated directories per job, when they discover that they are
>> running under SGE and need a daemon per job.
>>
>> c) As a result of b), the first started job can only have one
>> mpirun/mpiexec, as its mpd.py is gone. Further tasks would be
>> created as children of the second job's mpd.py, giving wrong
>> accounting. Furthermore, the ending second job will also remove the
>> task of the second step of the first job.
>>
>> What would be necessary is a dedicated port per mpd.py, so that one
>> can connect to the right mpd.py of this job, and an mpd.py which
>> does not fork into daemon land.
>>
>> For me, these are still too many limitations to include this
>> startup method in the Howto. I can try to ask the MPICH(2) team and
>> Intel whether they could supply any solution.
>>
>> -- Reuti
>>
>>
>>>   I readily admit that SGE PEs are not my strong suit.
>>>
>>> There is a switch to make the mpd daemons not daemonize, but then  
>>> you have to do some dancing around how to let mpdboot run  
>>> multiple qrsh -inherit calls in the background and still be able  
>>> to read the first line of input from them (the port number)  
>>> without having input buffering get in the way.
>>>
>>> Daniel
>>>
>>> Reuti wrote:
>>>> On 03.11.2008 at 14:54, Daniel Templeton wrote:
>>>>
>>>>> Actually, I've done a tight integration, and it's pretty easy.   
>>>>> The mpdboot command takes a -r parameter that gives the name of  
>>>>> the "rsh" to execute.  Just create a script that strips out the  
>>>>> -x and -n from the arguments and runs qrsh -inherit instead of  
>>>>> rsh, and pass that script to mpdboot with -r.  (You may also  
>>>>> want to shortcut out the Python version check...)  You'll also  
>>>>> need a PE starter that creates an appropriate machines file.
>>>>
>>>> In contrast to MPICH(2), the mpd daemons are no longer forking
>>>> into daemon land? Besides this, I found the creation of more and
>>>> more process groups by the Python script in MPICH(2) to be the
>>>> handicap.
>>>>
>>>> Is it also working with two jobs of the same user on a node?
>>>>
>>>> No shutdown necessary?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> My scripts below should work with Intel MPI 3.1 or 3.2.
>>>>>
>>>>> Daniel
>>>>>
>>>>> % cat startpe.sh
>>>>> #!/bin/sh
>>>>>
>>>>> # build an mpd.hosts file in the job's temp directory, with one
>>>>> # line per granted slot, from the $PE_HOSTFILE provided by SGE
>>>>> hfile=$TMP/mpd.hosts
>>>>> touch $hfile
>>>>>
>>>>> cat $PE_HOSTFILE | while read line; do
>>>>>  # first field: host name (shortened), second field: slot count
>>>>>  host=`echo $line | cut -d' ' -f1 | cut -d'.' -f1`
>>>>>  cores=`echo $line | cut -d' ' -f2`
>>>>>
>>>>>  # repeat the host name once per slot
>>>>>  while [ $cores -gt 0 ]; do
>>>>>    echo $host >> $hfile
>>>>>    cores=`expr $cores - 1`
>>>>>  done
>>>>> done
>>>>>
>>>>> exit 0
>>>>> % cat qrsh-inherit.pl
>>>>> #!/usr/bin/perl
>>>>>
>>>>> # Shortcircuit python version check
>>>>> if (grep /^\s*-x\s*$/, @ARGV) {
>>>>>  print "2.4\n";
>>>>>  exit 0;
>>>>> }
>>>>>
>>>>> # Strip out -n and -x
>>>>> @ARGV = grep !/^\s*-[nx]\s*$/, @ARGV;
>>>>>
>>>>> exec "qrsh", "-inherit", @ARGV;
>>>>>
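>>>>>
>>>>> One possible way to wire these in on the SGE side (a sketch only:
>>>>> the PE name, slot count, allocation rule and paths are
>>>>> placeholders; control_slaves TRUE is what allows qrsh -inherit):
>>>>>
>>>>> pe_name            impi
>>>>> slots              999
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /path/to/startpe.sh
>>>>> stop_proc_args     /bin/true
>>>>> allocation_rule    $round_robin
>>>>> control_slaves     TRUE
>>>>> job_is_first_task  FALSE
>>>>>
>>>>> In the job script, mpdboot is then pointed at the wrapper via
>>>>> -r /path/to/qrsh-inherit.pl and at $TMP/mpd.hosts via -f.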
>>>>>
>>>>> Daniel De Marco wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to integrate Intel MPI with gridengine. From what I
>>>>>> found in the list archives, it seems tight integration is
>>>>>> impossible. What about loose integration, did anyone try it?
>>>>>> Any comments/pointers?
>>>>>>
>>>>>> Thanks, Daniel.
>>>>>
>>>>
>>>
>>
>
