[GE users] allocation or info leaking

Reuti reuti at staff.uni-marburg.de
Wed May 18 17:44:10 BST 2005


Hi,

if Charm++ is using MPI instead of its own built-in communication, 
charmrun seems to be only a script that calls mpirun. Unfortunately, a 
hostfile is neither mentioned nor honored anywhere, nor is the "-np" 
option of mpirun set to $NSLOTS.

If you already have a tight MPI integration into SGE: can you try to 
start your program with mpirun instead (with the usual mpirun 
parameters), as charmrun is not of much use in such a setup, I think.
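
For example, with the usual MPICH tight integration the PE start script 
writes the list of granted nodes to $TMPDIR/machines, so the job script 
could call mpirun directly, roughly like this (the namd2 path is just a 
placeholder for your installation):

   #!/bin/sh
   #$ -pe mpich 8
   #$ -cwd
   # $NSLOTS and $TMPDIR/machines are provided by SGE's mpich PE setup
   mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
       /path/to/namd2 data.namd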

Cheers - Reuti


lukacm at pdx.edu wrote:
> Hello,
> 
> This concerns SGE 5.3.
> While using NAMD (which uses charmrun compiled against some selected MPI), the
> following situation occurs. Assume I request 8 parallel slots using qsub:
> qsub -pe mpich 8 namd_submit.sh data.namd.
> After the program is started, the output of qstat gives something like this:
> ----------------------------------------------------------------------------
> compute-0-0.2q       BIP   0/2       0.93     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-0.q        BIP   0/1       0.93     lx24-amd64
> ----------------------------------------------------------------------------
> compute-0-0.qq       BIP   0/1       0.93     lx24-amd64
> ----------------------------------------------------------------------------
> compute-0-1.2q       BIP   0/2       0.89     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-1.q        BIP   0/1       0.89     lx24-amd64
> ----------------------------------------------------------------------------
> compute-0-1.qq       BIP   0/1       0.89     lx24-amd64
> ----------------------------------------------------------------------------
> compute-0-2.2q       BIP   0/2       0.84     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-2.q        BIP   1/1       0.84     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-2.qq       BIP   1/1       0.84     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-3.2q       BIP   0/2       0.00     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-3.q        BIP   1/1       0.00     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-3.qq       BIP   1/1       0.00     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 MASTER
>             0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-4.2q       BIP   0/2       0.00     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-4.q        BIP   1/1       0.00     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-4.qq       BIP   1/1       0.00     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-5.2q       BIP   0/2       0.80     lx24-amd64 d
> ----------------------------------------------------------------------------
> compute-0-5.q        BIP   1/1       0.80     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> ----------------------------------------------------------------------------
> compute-0-5.qq       BIP   1/1       0.80     lx24-amd64
>     131     0 NAMD_test  lukacm       r     05/18/2005 09:01:18 SLAVE
> 
> 
> However, if I do
> cluster-fork ps -u lukacm, I obtain the following:
> compute-0-0:
>   PID TTY          TIME CMD
> 18176 ?        00:00:00 sh
> 18177 ?        00:03:36 namd2
> 18186 ?        00:00:00 sh
> 18187 ?        00:03:34 namd2
> 18228 ?        00:00:00 sshd
> 18230 ?        00:00:00 ps
> compute-0-1:
>   PID TTY          TIME CMD
> 19034 ?        00:00:00 sh
> 19035 ?        00:03:35 namd2
> 19036 ?        00:00:00 sh
> 19038 ?        00:03:39 namd2
> 19086 ?        00:00:00 sshd
> 19088 ?        00:00:00 ps
> compute-0-2:
>   PID TTY          TIME CMD
> 23690 ?        00:00:00 sh
> 23692 ?        00:03:40 namd2
> 23698 ?        00:00:00 sh
> 23699 ?        00:03:38 namd2
> 23730 ?        00:00:00 sshd
> 23732 ?        00:00:00 ps
> compute-0-3:
>   PID TTY          TIME CMD
>  3147 ?        00:00:00 csh
>  3173 ?        00:00:00 charmrun
>  3224 ?        00:00:00 sshd
>  3226 ?        00:00:00 ps
> compute-0-4:
>   PID TTY          TIME CMD
>   971 ?        00:00:00 sshd
>   973 ?        00:00:00 ps
> compute-0-5:
>   PID TTY          TIME CMD
> 15898 ?        00:00:00 sh
> 15899 ?        00:00:00 sh
> 15900 ?        00:03:40 namd2
> 15901 ?        00:03:39 namd2
> 15946 ?        00:00:00 sshd
> 15948 ?        00:00:00 ps
> 
> This means that SGE is starting some processes, but they are not located where
> SGE shows they are. For example, the node compute-0-4 shows that SGE started
> some processes on it, but in reality those processes are somewhere else. Is it
> a problem with the tight integration, or is it a NAMD-specific problem? Is
> anyone else having this issue?
> 
> thanks
> 
> martin
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



