No subject


Wed Jan 12 20:38:46 GMT 2011


On the exechost itself:
 [dolbersen@sge-alfrodull-076 ~]$ qping -info sge-qmaster-01.eng.atg.nw.net 5000 qmaster 1
 08/27/2008 06:51:08:
 SIRM version:             0.1
 SIRM message id:          1
 start time:               08/18/2008 13:35:48 (1219091748)
 run time [s]:             753320
 messages in read buffer:  0
 messages in write buffer: 0
 nr. of connected clients: 142
 status:                   0
 info:                     TET: R (2.82) | EDT: R (0.01) | SIGT: R (753320.13) | MT(1): R (0.06) | MT(2): R (0.01) | OK
 Monitor:                  disabled
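
A minimal sketch, in Python, of how the output above could be checked mechanically. It splits the "info:" field into its per-thread entries and confirms that the reported run time matches the gap between the start time and the timestamp at the top of the output. The field layout (name, a state letter, a number in parentheses) is inferred from this single sample; what the letters and numbers mean is not stated in this thread, so the parser only extracts them. The function name parse_info is hypothetical.

```python
import re
from datetime import datetime

# Values copied verbatim from the qping -info output above.
INFO = ("TET: R (2.82) | EDT: R (0.01) | SIGT: R (753320.13) | "
        "MT(1): R (0.06) | MT(2): R (0.01) | OK")
REPORT_TIME = "08/27/2008 06:51:08"
START_TIME = "08/18/2008 13:35:48"


def parse_info(info):
    """Split the 'info:' field into ({thread: (state, number)}, trailing_status).

    Assumption: every entry except the last looks like 'NAME: S (1.23)'.
    """
    threads = {}
    status = None
    for part in (p.strip() for p in info.split("|")):
        m = re.fullmatch(r"(\S+):\s+(\S+)\s+\(([\d.]+)\)", part)
        if m:
            threads[m.group(1)] = (m.group(2), float(m.group(3)))
        else:
            # Anything that is not a thread entry is kept as the overall status.
            status = part
    return threads, status


threads, status = parse_info(INFO)

fmt = "%m/%d/%Y %H:%M:%S"
elapsed = int((datetime.strptime(REPORT_TIME, fmt)
               - datetime.strptime(START_TIME, fmt)).total_seconds())

print(threads)
print(status, elapsed)
```

With the sample above, elapsed comes out at 753320 seconds, the same value qping prints as "run time [s]", so the qmaster had been up continuously since 08/18/2008 when the query was answered.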

I've since had to kill -9 execd and restart it. Doing so brings things up right away.

Whatever the problem is, it's very hard to catch!

-- 
David Olbersen

-----Original Message-----
From: Christian Reissmann [mailto:Christian.Reissmann at Sun.COM]
Sent: Wed 8/27/2008 12:23 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Help with 6.1u3 and execd acting funny
 
Hi David,

Please also try an SGE qping from your execd host that is not able to connect:

qping YOUR_MASTER_HOST YOUR_MASTER_PORT qmaster 1

You can also use the qping -info option (see man qping).

Regards,

Christian


P.S.: The qping -dump option on your qmaster host might also be helpful!
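
For hosts where neither qping nor telnet is handy, the plain TCP part of these checks can be sketched in Python as follows. This only tests whether a TCP connection to the qmaster port can be opened from the exec host; it does not speak the Grid Engine protocol, so it is not a substitute for qping. The function name can_connect and the 5-second default timeout are assumptions made for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened, else False."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run on an affected exec host, can_connect("sge-qmaster-01.eng.atg.nw.net", 5000) returning False would point at the network or name resolution rather than at execd itself; True says only that the port is reachable.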


Reuti wrote:
> Hi David,
> 
> On 26.08.2008 at 21:15, David Olbersen wrote:
> 
>> Reuti,
>>
>> Sorry about that. What I meant was that the binaries are on NFS. When 
>> I upgraded the cluster I killed all the exec daemons, then ran 
>> 'inst_execd' on each node using the new SGE_ROOT to upgrade them. So 
>> all the init scripts are updated.
>>
>> We've been using 5000 and 5001 for SGE in the old and new system.
>> Checking the new qmaster I see that I configured it properly and it's 
>> listening on port 5000, so I don't think it's that.
>>
>> This is all pretty tricky for me to diagnose; we reworked the network 
>> *and* upgraded the cluster at the same time. It's hard to point 
>> fingers :)
> 
> Can the systems ping each other, and does a
> 
> telnet <your_qmaster> 5000
> 
> connect from the exec node to the qmaster? Did the names of the 
> machines change?
> 
> -- Reuti
> 
> 
>> -- 
>> David Olbersen
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tue 8/26/2008 12:06 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Help with 6.1u3 and execd acting funny
>>
>> On 26.08.2008 at 21:01, David Olbersen wrote:
>>
>>> Yes, the daemons are running. All the binaries got updated since they
>>> live in a NFS filesystem.
>>
>> Is /etc/init.d on NFS as well? The scripts that start sgeexecd are
>> not run from the shared $SGE_ROOT (they exist there in default/common,
>> but are usually copied to /etc/init.d during installation).
>>
>>> All the machines are configured the same, but they don't all behave
>>> the same.
>>> The only hint I get from the machines that are having issues is:
>>>
>>> 08/26/2008 11:53:42|execd|sge-alfrodull-011|W|can't register at
>>> "qmaster": unable to send message to qmaster using port 5000 on host
>>> "sge-qmaster-01.eng.atg.nw.net": got messa
>>
>> Is port 5000 the one you want to use for SGE?
>>
>> -- Reuti
>>
>>> Which is annoying because I don't get to see what the rest of the
>>> message is.
>>>
>>> Has anybody run into this before? It's starting to become a problem!
>>>
>>> -- 
>>> David Olbersen
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, August 26, 2008 1:07 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Help with 6.1u3 and execd acting funny
>>>
>>> Hi,
>>>
>>> On 25.08.2008 at 22:02, David Olbersen wrote:
>>>
>>>> This weekend I upgraded our cluster from 6.0u8 to 6.1u3 (because
>>>> 6.1u4 hadn't been released when I started working on the upgrade).
>>>>
>>>> Now that everything's been cut over I'm seeing strange behaviour out
>>>> of execd; when I run qhost, I get entries that don't show the
>>>> load/memuse/swapus columns, they just show -'s instead.
>>>
>>> Did you check that the daemons are running? Were the scripts in
>>> /etc/init.d also replaced? Maybe the port settings for sge_qmaster
>>> and sge_execd changed, as they may have been hardcoded before on
>>> some machines.
>>>
>>> -- Reuti
>>>
>>>> In the past this meant that execd had crashed on those exec hosts and
>>>> the machines needed to be dealt with. That appears to be the case now
>>>> but I'm not sure what's wrong; the machines haven't changed aside from
>>>> having the new execd app installed.
>>>>
>>>> Any suggestions or starting points for debugging this?
>>>>
>>>> David Olbersen
>>>>
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court (Amtsgericht Muenchen): HRB 161028
Managing directors: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the supervisory board: Martin Haering
