[GE users] dropped because it is full

Hugo Darío Barrera hbarrera at iciq.es
Tue Jun 27 12:59:47 BST 2006


Hi,

this is getting better.
I have set up hosts tekla001 to tekla018.
Every job I run starts at tekla001, even when I select it to start at tekla010.
For example, I deleted every reference to hosts tekla001 and tekla002, so no
hostgroup, no host and no cluster queue refers to those hosts any more. If I do
a qstat -f, neither tekla001 nor tekla002 is listed.
Then I run the job, and if I look at the current master node (of the job), I see
a process running at tekla001 and tekla002 too (besides the other nodes).
So as I see it, there must be some bad reference in the DB, and I will try
another fresh install.

Anyway, if someone has any clue, I'll appreciate it.
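
A possible cause, stated as an assumption and not a confirmed diagnosis: the job script quoted below passes a fixed -machinefile /scratch/machines to mpirun, so mpirun appears to start ranks on the first hosts listed in that file regardless of which hosts the scheduler granted. Grid Engine exports $PE_HOSTFILE, $NSLOTS and $TMPDIR to parallel jobs, and each line of $PE_HOSTFILE normally has the form "host slots queue processor-range". The sketch below builds a per-job machinefile from it. The function name pe_hosts_to_machinefile is hypothetical, and whether the PE "big2" already provides start/stop procedures that do this is not known from the thread.

```shell
#!/bin/bash
#$ -N a
# pe request
#$ -pe big2 8

# Hypothetical helper: expand a Grid Engine PE host file into an
# MPICH-style machinefile, printing each host once per granted slot.
# Assumes input lines of the form: <host> <slots> <queue> <processor range>
pe_hosts_to_machinefile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Only launch when running inside a parallel Grid Engine job,
# i.e. when the scheduler has provided a readable host file.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    machines="${TMPDIR:-/tmp}/machines"
    pe_hosts_to_machinefile "$PE_HOSTFILE" > "$machines"

    cd /home/mante070/Co1 || exit 1
    # Use the granted slot count and the granted hosts, not a static list
    /usr/bin/mpirun -np "$NSLOTS" -machinefile "$machines" /usr/local/bin/vasp
fi
```

With this layout the host list follows whatever the scheduler assigns for each run, so removing tekla001 and tekla002 from the queue configuration would also keep them out of the machinefile.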


On Tuesday 27 June 2006 10:50, Hugo Darío Barrera wrote:
> Hi,
>
> thanks for the answer.
>
> Yes, I have PEs called "short" and "big1".
>
> I'm running this simple script:
>
> #!/bin/bash
>
> #$ -N a
> #
> # pe request
> #$ -pe big2 8
>
>
> cd /home/mante070/Co1
> /usr/bin/mpirun -np 8 -machinefile /scratch/machines /usr/local/bin/vasp
>
>
> In /scratch/machines I have the names of all nodes.
>
> now I have:
>
> tekla001 to tekla018
>
> I sent a job to run on nodes tekla011 to tekla018. qstat -t shows:
>
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla011          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla012          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla013          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla014          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla015          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla016          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla017          MASTER
>                                                                   big2@tekla017          SLAVE
>      22 0.55500 a          mante        r     06/27/2006 10:36:56 big2@tekla018          SLAVE
>
>
> If I go to the "Master" node (tekla017 in this case) and run
> ps aux | grep vasp, I get:
>
> mante     3697  0.0  0.0   3108  1588 ?        SN   10:35   0:00 /bin/sh /usr/bin/mpirun -np 8 -machinefile /scratch/machines /usr/local/bin/vasp
> mante     3903 90.5  8.4 214640 175344 ?       RNs  10:35   6:46 /usr/local/bin/vasp -p4pg /home/mante070/Co1/PI3697 -p4wd /home/mante070/Co1
> mante     3904  0.0  0.1  23800  3772 ?        SN   10:35   0:00 /usr/local/bin/vasp -p4pg /home/mante070/Co1/PI3697 -p4wd /home/mante070/Co1
> mante     3905  0.0  0.1   5740  2332 ?        SN   10:35   0:00 /usr/bin/ssh tekla001. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla001. \-p4rmrank 1
> mante     3906  0.0  0.1   5740  2336 ?        SN   10:35   0:00 /usr/bin/ssh tekla002. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla002. \-p4rmrank 2
> mante     3907  0.0  0.1   5744  2336 ?        SN   10:35   0:00 /usr/bin/ssh tekla003. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla003. \-p4rmrank 3
> mante     3908  0.0  0.1   5740  2332 ?        SN   10:35   0:00 /usr/bin/ssh tekla004. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla004. \-p4rmrank 4
> mante     3909  0.0  0.1   5744  2336 ?        SN   10:35   0:00 /usr/bin/ssh tekla005. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla005. \-p4rmrank 5
> mante     3910  0.0  0.1   5740  2332 ?        SN   10:35   0:00 /usr/bin/ssh tekla006. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla006. \-p4rmrank 6
> mante     3911  0.0  0.1   5740  2332 ?        SN   10:35   0:00 /usr/bin/ssh tekla007. -l mante -n /usr/local/bin/vasp tekla017 41547 \-p4amslave \-p4yourname tekla007. \-p4rmrank 7
>
>
> So it's really taking the first nodes instead of the nodes that are
> configured in the hostgroup configuration.
>
>
> Btw, i have a fresh install of sge.
>
> Tnx
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
