[GE users] Frontend as exec host

reuti reuti at staff.uni-marburg.de
Mon Apr 12 10:19:10 BST 2010


Hi,

On 09.04.2010 at 00:52, dasf wrote:

> Ladies and Gentlemen,
> 
> I am trying to add the frontend as an execution host but I am having some weird problems.

Is the frontend the login and/or file server for the complete cluster? This is often a matter of performance: when you make this machine an exec host, interactive sessions and/or file-server operation will be slowed down. But anyway: this looks like sge_execd gets suspended or killed as soon as a job starts. Is there anything in /var/log/messages, such as oom-killer entries? Does it also happen when only one job is running on the frontend? Maybe you can define fewer slots than the number of cores installed in this machine.
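
The two checks could be scripted roughly as follows. This is only a sketch: the function names `check_oom` and `unreachable_queues` are made up for illustration, the log path is the usual syslog location and may differ on your distribution, and the search patterns assume the kernel's usual OOM wording.

```shell
#!/bin/sh
# Hypothetical helper: print kernel OOM-killer entries from a syslog file.
# Defaults to /var/log/messages; pass another file as the first argument.
check_oom() {
    log="${1:-/var/log/messages}"
    # Both patterns are matched case-insensitively.
    grep -iE 'oom-killer|out of memory' "$log"
}

# Hypothetical helper: read `qstat -f` output on stdin and print the queue
# instances whose load_avg column is -NA-, i.e. whose execd is not reporting.
# Queue lines are recognized by the "@" in the first field; job lines start
# with a job number and are skipped.
unreachable_queues() {
    awk '$1 ~ /@/ && $4 == "-NA-" { print $1 }'
}
```

Typical use would be `check_oom` as root on the frontend, and `qstat -f | unreachable_queues` while the jobs are being dispatched. If the frontend shows up there, the slot count for it can be lowered by editing the queue with `qconf -mq all.q`.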

-- Reuti

> I added it as many suggest on the web, i.e., by running ./exec_host on the frontend and accepting all questions. Everything looked fine, as below. The cluster.local was added, I see the load, slots available, etc...
> 
> But as soon as I add jobs to fill up the entire cluster, problems appear. First, no jobs start to run on the frontend. Second, the load_avg changes to -NA-.
> 
> Any clue on how to fix it?
> 
> Thanks!
> 
> Demetrio
> 
> 
> [demetrio@cluster ~]$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@cluster.local            BIP   0/0/8          0.01     lx26-amd64    
> ---------------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4/4          -NA-     lx26-amd64    au
>      2 0.55500 script.cmd demetrio     r     04/07/2010 16:45:38     1        
>      4 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
>      5 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
>      3 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64    
> 
> 
> 
> 
> 
> qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@cluster.local            BIP   0/0/8          -NA-     lx26-amd64    au
> ---------------------------------------------------------------------------------
> all.q@compute-0-0.local        BIP   0/4/4          -NA-     lx26-amd64    au
>      2 0.55500 script.cmd demetrio     r     04/07/2010 16:45:38     1        
>      4 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
>      5 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
>      3 0.55500 script.cmd demetrio     r     04/07/2010 16:48:38     1        
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BIP   0/16/16        4.98     lx26-amd64    
>     19 0.55500 script.cmd demetrio     r     04/08/2010 19:42:10     1        
>     20 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     21 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     22 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     23 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     24 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     25 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     26 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     27 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     28 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     29 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     30 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     31 0.55500 script.cmd demetrio     r     04/08/2010 19:42:25     1        
>     32 0.55500 script.cmd demetrio     r     04/08/2010 19:42:55     1        
>     33 0.55500 script.cmd demetrio     r     04/08/2010 19:42:55     1        
>     34 0.55500 script.cmd demetrio     r     04/08/2010 19:42:55     1        
> 
> ############################################################################
> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     35 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:33     1        
>     36 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:34     1        
>     37 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:34     1        
>     38 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:35     1        
>     39 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:35     1        
>     40 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:35     1        
>     41 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:36     1        
>     42 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:36     1        
>     43 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:37     1        
>     44 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:37     1        
>     45 0.55500 script.cmd demetrio     qw    04/08/2010 19:42:38     1        
> [demetrio@cluster ~]$
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252756
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253108



