[GE users] this is a new one: "! lGetHost(): got NULL element for EH_name !"

Chris Dagdigian dag at sonsorol.org
Mon Oct 4 22:14:31 BST 2004


Has anyone seen this error message as a reason for qmaster failing to start?

"!! lGetHost(): got NULL element for EH_name !!"

Is this caused by the usual hostname, DNS, or resolver mismatch issues?


Background:

We have a large Apple Xserve cluster on which we are experimenting with 
BerkeleyDB spooling + shadow master failover capability by writing the 
spool files to an Apple XSAN disk volume that is shared between 4 
multihomed hosts capable of acting as SGE qmaster/shadow_master.

The XSAN code is unreleased beta straight from Apple engineering with 
several bugfixes we needed for other reasons -- it caused a system panic 
today under load that wiped out the acting qmaster. I'm not sure if the 
remaining systems were able to read/write to the XSAN volume at the time.

My job is to find out why, after all the nodes were brought up again, SGE 
qmaster refuses to start on any head node with the error message above.

My take is that the SGE problem has nothing to do with spooling or SAN 
stuff. I'm thinking that some hostname oddness crept in that only bit 
us once the head nodes were bounced.
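
One way to test the hostname theory is a forward/reverse lookup check run 
on each head node. The sketch below is illustrative only: check_host is a 
hypothetical helper, it uses Python's standard socket module, and it is 
not how SGE itself resolves names. It assumes that comparing short 
(unqualified) names is good enough to spot a multihomed host answering 
under the wrong interface name.

```python
import socket
import sys


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that name -> address -> name round-trips to the same host.

    Returns (ok, detail). The forward/reverse parameters exist only so the
    lookups can be swapped out for testing; they default to the system resolver.
    """
    try:
        addr = forward(name)
    except OSError as exc:  # socket.gaierror and socket.herror subclass OSError
        return False, f"{name}: forward lookup failed ({exc})"
    try:
        primary, aliases, _addrs = reverse(addr)
    except OSError as exc:
        return False, f"{name}: reverse lookup of {addr} failed ({exc})"

    # Compare on the short (unqualified) name, case-insensitively.
    wanted = name.lower().split(".")[0]
    found = {n.lower().split(".")[0] for n in [primary, *aliases]}
    if wanted in found:
        return True, f"{name}: {addr} -> {primary} (consistent)"
    return False, f"{name}: {addr} -> {primary} (MISMATCH)"


if __name__ == "__main__":
    # Host names are given on the command line; with none, check this machine.
    for host in sys.argv[1:] or [socket.gethostname()]:
        ok, detail = check_host(host)
        print("OK  " if ok else "FAIL", detail)
```

Running this with the same list of head node names on every head node 
should show whether any of them resolves a name to an interface that maps 
back to a different host name.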

Has anyone seen this error before? A quick search through the list 
archives did not reveal much.

Regards,
Chris
