[GE users] yet another commlib error

Olesen, Mark Mark.Olesen at arvinmeritor.com
Tue Dec 6 12:02:42 GMT 2005


This sounds like some problems that I had with 6.0u2 + classic spooling
- even after a clean system shutdown.

If you try to start sge_qmaster directly, or under strace, you might get
more hints. In our case, the qmaster always failed right after reporting
'read XX entries in XX seconds'.
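
For what it's worth, a rough sketch of what I mean by starting it under
strace. The assumptions here are mine, not something checked on your
installation: SGE_ROOT is set, $SGE_ROOT/util/arch reports the
architecture directory, and your version honors SGE_ND to keep the daemon
in the foreground (if it does not, drop it and rely on strace -f). The
function names and the trace file location are just placeholders.

```shell
#!/bin/sh
# Sketch only: run sge_qmaster under strace to see where startup stops.
# Assumes SGE_ROOT is set and $SGE_ROOT/util/arch exists (assumptions).

# Print the expected path of the qmaster binary for a given root and arch.
qmaster_path() {
    printf '%s/bin/%s/sge_qmaster\n' "$1" "$2"
}

# Run qmaster under strace; the trace is written to the file named in $1.
trace_qmaster() {
    trace_file=${1:-/tmp/qmaster.strace}
    arch=$("$SGE_ROOT/util/arch") || return 1
    bin=$(qmaster_path "$SGE_ROOT" "$arch")
    # -f follows child processes, -o writes the trace to a file.
    # SGE_ND is assumed to stop the daemon from backgrounding itself.
    SGE_ND=1 strace -f -o "$trace_file" "$bin"
}

# Example, left commented out so that sourcing this file runs nothing:
# trace_qmaster /tmp/qmaster.strace
# tail -n 50 /tmp/qmaster.strace
```

The last system calls in the trace (typically open/read on files under the
spool directory) should show which spooled object was being read when the
qmaster gave up.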

The desperate and drastic solution was to remove all entries from the
spool/qmaster/jobs/.. directory.
It appeared that some of the mpich jobs might have been corrupt.
I don't know if this problem has been fixed yet, because stress testing the
system would mean that I have to start killing user jobs again if problems
crop up.  For some reason, this sort of action isn't too popular ;)
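
If you do go down that road, moving the spooled jobs aside is gentler than
deleting them outright. A minimal sketch, assuming classic spooling and
that the qmaster is stopped; the function name and the backup naming are
my own invention, and the spool path is whatever your cell uses:

```shell
#!/bin/sh
# Sketch only: set the spooled jobs aside instead of removing them.
# $1 = qmaster spool directory, e.g. $SGE_ROOT/default/spool/qmaster
# Run this as the SGE admin user with sge_qmaster stopped, so that the
# new empty jobs directory ends up with the right ownership.
set_aside_jobs() {
    spool=$1
    backup="$spool/jobs.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$spool/jobs" ]; then
        echo "no jobs directory in $spool" >&2
        return 1
    fi
    # Keep the old entries for later inspection, then start with an
    # empty jobs directory.
    mv "$spool/jobs" "$backup" || return 1
    mkdir "$spool/jobs" || return 1
    echo "$backup"
}

# Example, left commented out so that sourcing this file changes nothing:
# set_aside_jobs "$SGE_ROOT/default/spool/qmaster"
```

The jobs that were spooled are still lost as far as the qmaster is
concerned, but the files remain available if someone wants to find out
which entry was corrupt.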

/mark

Dr. Mark Olesen
Principal Engineer Thermofluids Analysis
ArvinMeritor Light Vehicle Systems
ArvinMeritor Emissions Technologies GmbH
Biberbachstr. 9
D-86154 Augsburg, GERMANY 

> -----Original Message-----
> From: Christian Reissmann [mailto:Christian.Reissmann at Sun.COM]
> Sent: Tuesday, December 06, 2005 11:44 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] yet another commlib error
> 
> Hello Michael,
> 
> This seems to be less a "commlib error" than a "qmaster doesn't start
> up" error.
> 
> Is qmaster running after starting up?
> 
> Your configuration may be corrupt. Did you shut down the qmaster before
> removing the file systems?
> 
> 
> Regards,
> 
> Christian
> 
> 
> Michael Green wrote:
> >SLES9SP1
> >N1GE 6U6
> >1 master <-NAT-> 8 nodes
> >$SGE_ROOT=/srv/N1GE on physical shared file system (GPFS) on IBM FASTt700 SAN.
> >
> >Yesterday I had IBM staff over here servicing the storage. I cleanly
> >unmounted file systems and shut down all machines before they put
> >their hands on it.
> >
> >After they finished I booted the systems; everything went without a
> >hitch, except that SGE refused to start.
> >
> >On the master:
> ><code>
> >gene1:/srv/N1GE/default/spool/qmaster # /etc/init.d/sgemaster start
> >   starting sge_qmaster
> >
> >sge_qmaster didn't start!
> >Please check the messages file
> >
> >   starting sge_schedd
> >error: commlib error: can't connect to service (Connection refused)
> >error: getting configuration: unable to contact qmaster using port 536
> >on host "gene1.weizmann.ac.il"
> >error: can't get configuration from qmaster -- backgrounding
> ></code>
> >
> ><log>
> >gene1:/srv/N1GE/default/spool/qmaster # tail -f messages
> >12/06/2005 10:24:54|qmaster|gene1|E|missing configuration attribute "hostname"
> >12/06/2005 10:24:54|qmaster|gene1|E|cannot recreate queue all.q from disk because of unknown host g1.biocl.weizmann.ac.il
> >12/06/2005 10:24:54|qmaster|gene1|I|read job database with 1 entries in 0 seconds
> >12/06/2005 10:24:54|qmaster|gene1|E|cqueue_list_locate_qinstance("all.q@g3.biocl.weizmann.ac.il"): cqueue == NULL("all.q", "g3.biocl.weizmann.ac.il", 1, 0)
> >12/06/2005 10:24:54|qmaster|gene1|E|can't find queue "all.q@g3.biocl.weizmann.ac.il" referenced in job 27
> ></log>
> >
> >qmaster complains about a missing "hostname" attribute, but which file
> >contains it? Grepping the default/ directory reveals quite a few
> >files containing 'hostname'.
> >Also, does the line with 'cqueue_list_locate_qinstance' mean that it
> >checks the cqueues/all.q file?
> >
> >Please help!
> >--
> >Warm regards,
> >Michael Green
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
> 
> 
