[GE users] mpich2_smpd - sge - solaris

Yann JOBIC jobic at polytech.univ-mrs.fr
Thu Sep 4 08:46:24 BST 2008


Reuti wrote:
> Hi,
>
> On 03.09.2008 at 17:20, Yann JOBIC wrote:
>
>> Reuti wrote:
>>> Hi,
>>>
>>> On 03.09.2008 at 14:03, Yann JOBIC wrote:
>>>
>>>> I followed Reuti's great howto for a tight integration of MPICH2 
>>>> and SGE:
>>>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html 
>>>>
>>>>
>>>> I'm using Solaris 10 on x86 and SPARC. The jobs are correctly 
>>>> spawned on 4 nodes.
>>>>
>>>> Here is the process tree on one of the nodes:
>>>> Sara06-jobic% ptree 21175
>>>> 455   /opt/sge/bin/sol-amd64/sge_execd
>>>>  21169 sge_shepherd-2576 -bg
>>>>    21170 /opt/sge/utilbin/sol-amd64/rshd -l
>>>>      21171 /opt/sge/utilbin/sol-amd64/qrsh_starter /opt/sge/Huit/spool/Sara06/active_jobs/
>>>>        21172 tcsh -c /opt/lib/mpich2/bin/smpd -port 22576 -d 0
>>>>          21173 /opt/lib/mpich2/bin/smpd -port 22576 -d 0
>>>>            21174 /opt/lib/mpich2/bin/smpd -port 22576 -d 0
>>>>              21175 /home/jobic/sge/exemple/./hello
>>>>
>>>> However, once the job has finished, smpd is still running:
>>>> Sara06-jobic% ptree 21171
>>>> 455   /opt/sge/bin/sol-amd64/sge_execd
>>>>  21169 sge_shepherd-2576 -bg
>>>>    21170 /opt/sge/utilbin/sol-amd64/rshd -l
>>>>      21171 /opt/sge/utilbin/sol-amd64/qrsh_starter /opt/sge/Huit/spool/Sara06/active_jobs/
>>>>        21172 tcsh -c /opt/lib/mpich2/bin/smpd -port 22576 -d 0
>>>>          21173 /opt/lib/mpich2/bin/smpd -port 22576 -d 0
>>>>
>>>> With qdel, I can delete them:
>>>>  2576  4 mpich2_test               jobic      09/03/2008 13:38:00 Huit       (stalled)
>>>> It just takes some time.
>>>
>>> Do you mean the job script is halted in some way? When it finishes, 
>>> SGE should call the stop_proc_args defined for the PE to shut down 
>>> the daemons. Did you also define stop_proc_args as outlined in the 
>>> howto?
>>>
>>> -- Reuti
>>>
>> Thanks for the fast answer.
>>
>> I defined this for the PE:
>> homard-jobic% qconf -sp mpich2_smpd
>> pe_name           mpich2_smpd
>> slots             56
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /opt/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile \
>>                   /opt/lib/mpich2
>> stop_proc_args    /opt/sge/mpich2_smpd/stopmpich2.sh -catch_rsh /opt/lib/mpich2
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> I found this line in the error file:
>> /opt/sge/mpich2_smpd/stopmpich2.sh: line 126: tac: command not found
>
> tac just prints a file in reverse line order (cat <-> tac). The stop 
> script uses it because the daemon on the master node of the parallel 
> job should be shut down last.
>
> You could install it from: http://directory.fsf.org/project/textutils/ 
> I hope it will compile on Solaris.
>
> There is a nice Howto at IBM about these tools: 
> http://www.ibm.com/developerworks/edu/l-dw-linux-gnutex-i.html (you 
> have to register, but it's free).
>
> -- Reuti
>
>
>> That is probably the cause. How can I fix it?
>>
>> Many thanks,
>>
>> Yann
>>
Hi,


I fixed the problem by replacing tac with tail -r.

Here are my modifications to your scripts, if you're interested:

 > diff startmpich2.sh  startmpich2.sh.ori
1c1
< #!/bin/bash -f
---
 > #!/bin/sh -f
 > diff  stopmpich2.sh  stopmpich2.sh.ori
1c1
< #!/bin/bash -f
---
 > #!/bin/sh -f
120c120
<       for host in `tail -r $machines | uniq`; do
---
 >       for host in `tac $machines | uniq`; do


On Solaris, sh is not bash ;-)
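
For anyone who wants the same stop script to run on both Linux and 
Solaris, here is a minimal sketch of a portable stand-in for tac. The 
function name reverse_lines is hypothetical and is not part of the howto 
scripts. It assumes the script runs under bash (as in the modified 
shebang above) and that awk is available as a last resort; I have only 
reasoned about it, not tested it on every platform.

```shell
#!/bin/bash
# reverse_lines FILE
# Print FILE with its lines in reverse order, using whichever tool exists.
# (Hypothetical helper, not part of the original startmpich2.sh/stopmpich2.sh.)
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        # GNU textutils/coreutils is installed
        tac "$1"
    elif tail -r "$1" >/dev/null 2>&1; then
        # Solaris and BSD tail support -r (reverse)
        tail -r "$1"
    else
        # Fallback: buffer all lines in awk and print them last to first
        awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }' "$1"
    fi
}
```

With such a helper, the loop in stopmpich2.sh could be written as 
for host in `reverse_lines $machines | uniq`; do ... so that the master 
node, listed first in the machines file, is still handled last.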

I also modified aimk to handle the sol-amd64 architecture. Here are my 
changes:
 > diff aimk aimk.ori
168,179d167
< case sol-amd64:
<         set CC   = "gcc"
<    if ( $CC == gcc ) then
<       set CFLAGS   = "-DSOLARIS -O -Wall -Wstrict-prototypes -Werror $DEBUG_FLAG $CFLAGS"
<    else
<       set CFLAGS = "-DSOLARIS -O $DEBUG_FLAG $CFLAGS"
<    endif
<    set LIBS = "$LIBS -lsocket -lnsl"        
<         set STATIC   = "-nostartfiles"
<    breaksw
<

Many thanks for your help and for your howto,

Yann


-- 
___________________________

Yann JOBIC
HPC engineer
Polytech Marseille DME
IUSTI-CNRS UMR 6595
Technopole de Chateau Gombert
5 rue Enrico Fermi
13453 Marseille cedex 13
Tel : (33) 4 91 10 69 39
  or  (33) 4 91 10 69 43
Fax : (33) 4 91 10 69 69 
