[GE users] tight integration problem

Jean-Paul Minet minet at cism.ucl.ac.be
Fri Jan 27 10:11:25 GMT 2006


Reuti,

Some things do work (qrsh wrapper), others don't (qstat -ext and qdel; please 
see below)

>> Maybe we should re-install mpich with the proper rsh (i.e. without  
>> the full path)... still this wouldn't explain why the rsh wrapping  
>> works under certain conditions with the same user binary code.
>>
> 
> Well, this would mean also to recompile - or at least relink - your  
> application, as the .a libs are already in the executable. - Reuti

I reconfigured and recompiled mpich and relinked the application... and it 
still doesn't work.

Digging into the mpich sources, we found that setting P4_RSHCOMMAND changes the 
application's behavior *at run time*... and this (i.e. setting 
P4_RSHCOMMAND="rsh" in ~/bin/mpirun or in the SGE submit script) indeed 
works ;-)  So we now have on the SGE master:

root      5612  5589  0 09:14 ?        00:00:00 bash /var/spool/sge/lmexec-64/job_scripts/2488
root      5615  5612  0 09:14 ?        00:00:00 /bin/sh /usr/local/mpich-eth-intel/bin/mpirun -np 2 -machinefile /tmp/2488.1.all.q/machines abinip_eth
root      5699  5615  0 09:14 ?        00:00:57 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5615 -p4wd /home
root      5700  5699  0 09:14 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5615 -p4wd /home
root      5701  5699  0 09:14 ?        00:00:00 /gridware/sge/bin/lx24-amd64/qrsh -inherit -nostdin lmexec-61 /home/pan/minet/abinit/parallel_eth/abinip_et
root      5709  5701  0 09:14 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rsh -n -p 32937 lmexec-61 exec '/gridware/sge/utilbin/lx24-amd64/qrsh_star

and on the SGE slave:

sgeadmin  3397  3469  0 09:18 ?        00:00:00 sge_shepherd-2488 -bg
root      3398  3397  0 09:18 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/rshd -l
root      3399  3398  0 09:18 ?        00:00:00 /gridware/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/lmexec-61/active_jobs/2488.1/1.lmexec-61
root      3400  3399 99 09:18 ?        00:01:03 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-64 32844 -p4amslave -p4yourname lmexec-61 -p4rmrank 1
root      3401  3400  0 09:18 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth lmexec-64 32844 -p4amslave -p4yourname lmexec-61 -p4rmrank 1
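
For reference, the run-time workaround mentioned above (P4_RSHCOMMAND="rsh" in 
the SGE submit script) might look roughly like the sketch below. The mpirun 
path, the "mpich" PE name and the binary name are taken from the output above; 
the use of $NSLOTS and $TMPDIR, the echo lines and the guard around mpirun are 
assumptions added for illustration, and the sketch assumes the PE start 
procedure has put the rsh wrapper first in PATH.

```shell
#!/bin/sh
#$ -pe mpich 2
#$ -cwd

# Make the ch_p4 device start remote processes with plain "rsh" (looked up
# via PATH) instead of the full path compiled into the application.
# Assumption: inside the job, PATH resolves "rsh" to the SGE rsh wrapper.
P4_RSHCOMMAND=rsh
export P4_RSHCOMMAND

# Illustration only: show what the job will actually use.
echo "P4_RSHCOMMAND=$P4_RSHCOMMAND"
echo "rsh resolves to: $(command -v rsh || echo 'not found')"

# Assumption: NSLOTS and TMPDIR are set by SGE for the running job.
MPIRUN=/usr/local/mpich-eth-intel/bin/mpirun
if [ -x "$MPIRUN" ]; then
    "$MPIRUN" -np "${NSLOTS:-2}" -machinefile "$TMPDIR/machines" abinip_eth
else
    echo "mpirun not found at $MPIRUN" >&2
fi
```

Exporting the variable matters here: the setting is read by the application 
at run time, so it has to be in the environment that mpirun passes on.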

Now, the qstat -j usage line is updated with proper values:

lemaitre /home/pan/minet/abinit/parallel_eth # qstat -j 2488
...
parallel environment:  mpich range: 2
usage    1:                 cpu=00:10:08, mem=169.88418 GBs, io=0.00000, vmem=671.281M, maxvmem=671.309M
scheduling info:            queue instance "all.q@lmexec-66" dropped because it is full
...

but qstat -ext reports wrong values:

2488 0.02271 0.02271 Test_abini root         NA               defaultdep r 0:00:06:17 105.69577 0.00000 11289     0     0    27 11261 0.01  all.q@lmexec-64                     2

Now, issuing a qdel on this running job properly stops the slave processes, but 
the master node is left with a defunct qrsh process:

root      5699     1 99 09:14 ?        00:13:11 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5615 -p4wd /home
root      5700  5699  0 09:14 ?        00:00:00 /home/pan/minet/abinit/parallel_eth/abinip_eth -p4pg /home/pan/minet/abinit/parallel_eth/PI5615 -p4wd /home
root      5701  5699  0 09:14 ?        00:00:00 [qrsh] <defunct>
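
A quick way to look for such leftovers on a node after a qdel is to list only 
the zombie entries together with their parent PIDs. This is a minimal sketch 
using plain ps and awk, nothing SGE-specific, and the column selection is just 
one possible choice:

```shell
# Print the ps header plus every process whose state starts with "Z" (zombie).
# The PPID column shows which parent has not yet reaped the defunct child.
ps -eo pid,ppid,stat,args | awk 'NR == 1 || $3 ~ /^Z/'
```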

Do you have any idea where this comes from?  mpich?

jp
