[GE users] au state

Chris Dagdigian dag at sonsorol.org
Wed May 11 16:28:46 BST 2005



When Grid Engine starts up, it writes the hostname of the qmaster
server (as it understands it) to this file:

$SGE_ROOT/<cell>/common/act_qmaster

When compute nodes start up they read the act_qmaster file to learn  
which host they need to connect to.
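
To see which name the nodes will be told to contact, you can simply read that file. A minimal Python sketch, assuming the cell is called "default" (that name and the helper name read_act_qmaster are illustrative assumptions, not something from your setup):

```python
import os


def read_act_qmaster(sge_root, cell="default"):
    """Return the qmaster hostname stored in <sge_root>/<cell>/common/act_qmaster."""
    path = os.path.join(sge_root, cell, "common", "act_qmaster")
    with open(path) as fh:
        # The file is expected to hold a single hostname; strip the newline.
        return fh.read().strip()
```

On the cluster this would be called as read_act_qmaster(os.environ["SGE_ROOT"]); if it returns "marvin.local", that is the name every compute node has to be able to resolve.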

There are probably a few issues here. The biggest one is that your
compute nodes can't figure out how to get to a machine named
"marvin.local".

This would be caused by not having DNS configured for this hostname  
or the compute nodes not having a valid entry for marvin.local in  
their /etc/hosts file.
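
A quick way to test this from a compute node is to ask the system resolver directly. The sketch below uses Python's standard socket module; the helper name resolve is just an illustration, and it only shows what the operating system resolver returns, which may differ from what Grid Engine itself decides:

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated addresses the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The resolver (DNS, /etc/hosts, ...) has no answer for this name.
        return []
    # Each entry ends with a sockaddr tuple whose first field is the address.
    return sorted({info[4][0] for info in infos})
```

If resolve("marvin.local") comes back as an empty list on a node, that node has neither a DNS answer nor an /etc/hosts entry for the qmaster name.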

You can fix this in DNS, in /etc/hosts, or by changing the FQDN of
your qmaster host. If none of these can be changed easily, there is a
mechanism within SGE called "host_aliases" whereby you can remap
"marvin.local" to a hostname that you know the nodes can resolve.
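
To illustrate the idea behind host_aliases, here is a small sketch of the lookup. It assumes the format described in the sge_host_aliases man page, where each line holds a primary hostname followed by its aliases; the function name primary_name and the hostname "marvin.example.org" are hypothetical:

```python
def primary_name(aliases_text, hostname):
    """Map hostname to its primary name using host_aliases-style text."""
    for line in aliases_text.splitlines():
        names = line.split()
        if not names:
            continue  # skip blank lines
        if hostname in names:
            # The first name on the line is treated as the primary name.
            return names[0]
    # Names that appear on no line are left unchanged.
    return hostname
```

With the single line "marvin.example.org marvin.local marvin", primary_name(text, "marvin.local") gives "marvin.example.org". The real file is read by Grid Engine itself, so this only models the mapping and is not a replacement for it.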

Good debug tools can be found in $SGE_ROOT/utilbin/<arch>/ -- in that
directory are small utilities (gethostname, gethostbyname and
gethostbyaddr, for example) that will let you see exactly what Grid
Engine thinks about hostname resolution and lookups.

If I'm totally wrong about this being a hostname/resolution issue,
here are some other possibilities:

(1) You have a firewall blocking port 535 (probably not the case if
you had SGE working previously).

(2) Equally possible is that sge_qmaster is not actually running on
host marvin.local or had some sort of fatal startup problem. This can
happen if previous SGE daemons did not exit cleanly.
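
Both possibilities can be narrowed down by trying a plain TCP connection from a compute node to the port named in the error message (535). A minimal sketch with Python's socket module; the helper name can_connect and the 3 second timeout are arbitrary choices:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

If can_connect("marvin.local", 535) is False, either nothing is listening there (daemon not running), something in between is dropping the traffic, or the name does not resolve. A True result only proves that something accepts connections on that port, not that it is a healthy SGE daemon.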

-Chris

On May 11, 2005, at 11:11 AM, Wheeler, Dr M.D. wrote:

> i just rebooted my machines and now i get the error:
> unable to contact qmaster via "marvin.local" commd using port 535  
> (service "sge_commd")
>
> help please.....
>
> Martyn
>
> ----------------------------------------------
> Dr. Martyn D. Wheeler
> Department of Chemistry
> University of Leicester
> University Road
> Leicester, LE1 7RH, UK.
> Tel (office): +44 (0)116 252 3985
> Tel (lab):    +44 (0)116 252 2115
> Fax:          +44 (0)116 252 3789
> Email:        martyn.wheeler at le.ac.uk
> http://www.le.ac.uk/chemistry/staff/mdw10.html
>
>
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 11 May 2005 16:04
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] au state
>>
>>
>> What is:
>>
>> qalter -w v 1932
>>
>> saying? - Reuti
>>
>> Wheeler, Dr M.D. wrote:
>>
>>> here's some more output
>>>
>>>
>>> # qconf -sp molpro
>>> pe_name           molpro
>>> queue_list        compute-0-0.q compute-0-2.q
>>> slots             999
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /home/software/scripts/startmolpro.sh -catch_rsh $pe_hostfile
>>> stop_proc_args    /home/software/scripts/stopmolpro.sh
>>> allocation_rule   $fill_up
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>>
>>>
>>> # qstat -r
>>> job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
>>> ---------------------------------------------------------------------------------------------
>>>    1929     0 HCl_H2O_2+ victorm      r     05/11/2005 11:35:38 compute-0- MASTER
>>>        Full jobname:     HCl_H2O_2+SP
>>>        Master queue:     compute-0-0.q
>>>        Requested PE:     molpro 2
>>>        Granted PE:       molpro 2
>>>        Hard Resources:   h_rt=500:00:00
>>>                          virtual_free=2900M
>>>                          h_fsize=30G
>>>                          arch=lx24-amd64
>>>             0 HCl_H2O_2+ victorm      r     05/11/2005 11:35:38 compute-0- SLAVE
>>>             0 HCl_H2O_2+ victorm      r     05/11/2005 11:35:38 compute-0- SLAVE
>>
>>>    1931     0 formal     nantakorn    qw    05/11/2005 13:08:15
>>>        Full jobname:     formal
>>>        Requested PE:     molpro 2
>>>        Hard Resources:   h_rt=500:00:00
>>>                          virtual_free=2900M
>>>                          h_fsize=30G
>>>                          arch=lx24-amd64
>>>    1932     0 p          nantakorn    qw    05/11/2005 13:38:07
>>>        Full jobname:     p
>>>        Requested PE:     molpro 2
>>>        Hard Resources:   h_rt=500:00:00
>>>                          virtual_free=2900M
>>>                          h_fsize=30G
>>>                          arch=lx24-amd64
>>>
>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: 11 May 2005 13:57
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] au state
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Does the requested PE also have compute-0-2.q in its list? How many
>>>> slots did you request?
>>>>
>>>> CU - Reuti
>>>>
>>>>
>>>> Wheeler, Dr M.D. wrote:
>>>>
>>>>
>>>>> compute0-0, compute0-1, and compute0-2, are identical, so i figure it
>>>>> should bypass c0-1 and move onto c0-2
>>>>
>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Robert Griffiths
>>>>>> [mailto:Robert.Griffiths at mitsubishi-sec-intl.com]
>>>>>> Sent: 11 May 2005 13:40
>>>>>> To: 'users at gridengine.sunsource.net'
>>>>>> Subject: RE: [GE users] au state
>>>>>>
>>>>>>
>>>>>> Well, it looks like the scheduler knows about compute-0-1.q, and
>>>>>> you've filled up compute-0-0.q.
>>>>>>
>>>>>> You must now look at how job 1931 was submitted - does it request
>>>>>> any resources which are available only on compute-0-0 or
>>>>>> compute-0-1? It could be that it's requesting something which
>>>>>> *doesn't exist* on compute-0-2 or your 32-bit node, hence it can't
>>>>>> be scheduled.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Wheeler, Dr M.D. [mailto:mdw10 at leicester.ac.uk]
>>>>>> Sent: 11 May 2005 13:36
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: RE: [GE users] au state
>>>>>>
>>>>>>
>>>>>> # qstat -j
>>>>>> scheduling info:            queue "compute-0-1.q" dropped because it is temporarily not available
>>>>>>                             queue "compute-0-0.q" dropped because it is full
>>>>>>
>>>>>> Jobs cannot run because resources requested are not available for parallel job
>>>>>>       1931
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



