[GE users] this is a new one: "! lGetHost(): got NULL element for EH_name !"

Daniel Templeton Dan.Templeton at Sun.COM
Tue Oct 5 08:14:01 BST 2004


That sounds like a bug.  No DNS mismatch should cause the qmaster to abort.
The first step is to turn on debugging output before starting the qmaster
(source util/dl.csh; dl 2).  If it crashes with debugging on, send us the
last couple hundred lines of output.
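
To pull just that tail out of a saved debug log, a Python sketch along
these lines might help.  It assumes the qmaster's debugging output has
already been redirected to a file.  The names last_lines, tail_of_file and
qmaster_debug.out are made up for illustration and are not part of Grid
Engine.

```python
from collections import deque


def last_lines(lines, count=200):
    """Return the final `count` lines from any iterable of lines."""
    # A deque with maxlen drops its oldest entry on each append once full,
    # so at most `count` lines are held in memory at any time.
    return list(deque(lines, maxlen=count))


def tail_of_file(path, count=200):
    """Read the file at `path` and return its last `count` lines as one string."""
    # errors="replace" keeps reading even if the log contains odd bytes.
    with open(path, errors="replace") as log:
        return "".join(last_lines(log, count))
```

Calling print(tail_of_file("qmaster_debug.out")) would print the last 200
lines, ready to paste into a reply.  Where it is available, tail -n 200
does the same job.
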
What would be even more helpful is if you could start the qmaster up in 
a debugger (with debugging turned on, as above) and print out a stack 
trace when it aborts.  This, of course, requires that you use a qmaster 
that was compiled with debugging enabled.
While I can't say I've seen this exact issue before, it certainly sounds 
familiar.  I can guess right about where it's happening.  The good news 
is that these kinds of issues are usually very easy to find and fix.

Daniel

Chris Dagdigian wrote:

>
> Anyone see this error message as a reason for qmaster failing to start:
>
> "!! lGetHost(): got NULL element for EH_name !!"
>
> Is this caused by the usual hostname, DNS, resolver mismatch issues?
>
>
> Background:
>
> A Large Apple Xserve cluster in which we are experimenting with using 
> BerkeleyDB spooling + shadow master failover capability by writing the 
> spool files to an Apple XSAN disk volume that is shared between 4 
> multihomed hosts capable of acting as SGE qmaster/shadow_master.
>
> The XSAN code is unreleased beta straight from Apple engineering with 
> several bugfixes we needed for other reasons -- it caused a system 
> panic today under load that wiped out the acting qmaster. I'm not sure 
> if the remaining systems were able to read/write to the XSAN volume at 
> the time.
>
> My job is to find out why after all the nodes were brought up again, 
> SGE qmaster refuses to start on any head node due to the error message 
> above.
>
> My take is that the SGE problem has nothing to do with spooling or SAN 
> stuff. I'm thinking that some hostname oddness crept in that only 
> bit us once the head nodes were bounced.
>
> Anyone see this error before? A quick search through the list archives 
> did not reveal much.
>
> Regards,
> Chris
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

-- 
*******************************************************
*          Daniel Templeton   ERGB01 x60220           *
*         Staff Engineer, Sun N1 Grid Engine          *
*******************************************************
*    "Camera one closes in, the soundtrack starts,    *
*     The scene begins.  You're playing you now."     *
*                -Josh Joplin Group, "Camera One"     *
*******************************************************


