[GE users] Problems with multiple network interfaces

reuti reuti at staff.uni-marburg.de
Sat Aug 8 14:51:26 BST 2009


Hi,

On 08.08.2009 at 11:57, jordib wrote:

> I have a problem with a cluster whose nodes have dual network interfaces.
> The execution nodes have 2 NICs: the first one (eth0) is for
> shared data and the second (eth1) is for compute.

I would suggest changing the order. If I understand you correctly,
you want to use eth0 for NFS, and eth1 for MPI and SGE. This would
mean instructing the various MPI libraries not to use the nodes' main
interface but the other one (and for both directions of the
communication). This needs a special setup, which varies from
library to library.
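
As one illustration of such a library-specific setup, here is a minimal sketch that assumes Open MPI over TCP (the original post does not say which MPI library is in use). Open MPI reads MCA parameters from environment variables with the OMPI_MCA_ prefix, and btl_tcp_if_include restricts its TCP traffic to the named interface. The program name in the comment is hypothetical.

```shell
# Assumption: Open MPI with the TCP transport; other MPI libraries
# use different, library-specific settings.
# Restrict MPI point-to-point TCP traffic to the compute interface.
export OMPI_MCA_btl_tcp_if_include=eth1

# The same parameter can be given per run instead, e.g.:
#   mpirun --mca btl_tcp_if_include eth1 -np 4 ./my_mpi_program
# (./my_mpi_program is a placeholder name)
```

The variable has to be set on every node taking part in the job, so that both directions of the communication use eth1.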


> We have 1 node that acts as the sgemaster and 2 more that are submit
> hosts. All these nodes have only one NIC, and they use static routes
> to reach the execution hosts.
>
> I've followed the instructions at: http://gridengine.sunsource.net/howto/multi_intrfcs.html
>
> My host_aliases contains only the execution nodes:
> ciqtc01-1       g1node1
> ciqtc01-10      g1node10
> ciqtc01-11      g1node11
> ciqtc01-12      g1node12
> ....

Is the host_aliases file shared across the cluster, in /usr/sge or the like?
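
To check what a given name maps to, a small sketch like the following can be run on the master and on a node. It assumes the usual location $SGE_ROOT/$SGE_CELL/common/host_aliases (falling back to /usr/sge and the cell name "default") and the usual format, where the first name on each line is the primary hostname and the rest are its aliases. The helper name resolve_alias is made up for this example.

```shell
# Assumed default location; adjust SGE_ROOT / SGE_CELL for your installation.
ALIASES="${SGE_ROOT:-/usr/sge}/${SGE_CELL:-default}/common/host_aliases"

# Hypothetical helper: print the primary hostname for a given name.
#   $1 = path to a host_aliases file, $2 = hostname or alias to look up
# If the name is not listed on any line, it is printed unchanged.
resolve_alias() {
    awk -v name="$2" '
        /^#/ { next }                      # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == name) { print $1; hit = 1; exit }
        }
        END { if (!hit) print name }
    ' "$1"
}
```

With the entries quoted above, `resolve_alias "$ALIASES" g1node1` prints ciqtc01-1. If the master and the execution nodes give different answers, the file is probably not the same everywhere.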


> The first column is for eth1 (compute) and the second is for
> eth0 (data).
> /etc/hosts includes all these hostnames, each with its own IP.
>
> If I run qstat -f, SGE shows this output:
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@g1node1                  BIP   0/0/1          -NA-     -NA-          au
> ---------------------------------------------------------------------------------
>
> If I run qhost, the aliases appear:
> sgemaster:~# qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> ciqtc01-1               lx24-amd64      4  0.00    7.8G  472.8M    7.4G     0.0
> ciqtc01-10              lx24-amd64      4  0.00    7.8G   71.8M    7.4G     0.0

This looks correct, as ciqtc01-1 is the one you want to use.

Which hostnames are in your hostgroup? I would assume it must be
ciqtc01-1, with the queues created for this alias.
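
One way to check this is to look at the hostlist lines printed by qconf -sq for the queue and by qconf -shgrp for the host group. The sketch below is only a convenience filter over that output; the helper name hostlist_mentions is invented here, and it does not follow hostlists that wrap onto continuation lines.

```shell
# Hypothetical helper: succeed if a "hostlist" line read from stdin
# names the given host or host group.
#   $1 = hostname or @hostgroup to look for
# Limitation: continuation lines (ending in a backslash) are not joined.
hostlist_mentions() {
    awk -v want="$1" '
        $1 == "hostlist" {
            gsub(/,/, " ")                 # tolerate comma-separated lists
            for (i = 2; i <= NF; i++)
                if ($i == want) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a machine with SGE installed (queue name taken from the post;
# @allhosts is an assumed host group name):
#   qconf -sq interactive.q | hostlist_mentions ciqtc01-1 && echo "listed"
#   qconf -shgrp @allhosts  | hostlist_mentions ciqtc01-1 && echo "listed"
```

If neither the queue nor any host group it references lists ciqtc01-1, that would fit the "unknown queue" rejection quoted below.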

-- Reuti

>
> If I look at the node configuration, the alias appears:
> sgemaster:~# qconf -se g1node1 | grep hostname
> hostname              ciqtc01-1
>
> But when I try to submit a job, it is rejected because the alias
> is unknown to SGE at this level:
> sgemaster:~# qrsh -q interactive.q@g1node1
> Job was rejected because job requests unknown queue "interactive.q@ciqtc01-1"
>
> Any suggestion?
>
> Many thanks,
>
>    Jordi
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211475
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211497
