[GE users] rcmd: socket: Cannot assign requested address

John_Tai John_Tai at smics.com
Mon Jan 17 01:08:40 GMT 2005


I did shut down the qmaster daemon, and that is when I found I couldn't restart it at all:

root at dsfileserver: /etc/init.d # ./sgemaster
   starting sge_qmaster

sge_qmaster didn't start!
Please check the messages file

   starting sge_schedd
error: getting configuration: unable to contact qmaster using port 5098 on host "dsfileserver"
can't get configuration from qmaster -- waiting ...
can't get configuration from qmaster -- waiting ...
can't get configuration from qmaster -- waiting ...
error: can't get configuration from qmaster -- backgrounding
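
For anyone hitting the same "unable to contact qmaster using port 5098" message, a quick way to see whether anything is listening on that port at all is a plain TCP connect. This is only a rough sketch: the helper name port_is_open and the 3-second timeout are illustrative choices of mine, not part of SGE, and a successful connect only shows that some process accepts connections there, not that it is a healthy sge_qmaster.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for illustration; it says nothing about which
    program is listening on the port.
    """
    try:
        # create_connection resolves the host name and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with the host and port from the error message above:
#     port_is_open("dsfileserver", 5098)
```

If this returns False while sge_qmaster is supposed to be running, the scheduler's "can't get configuration" lines are just a consequence of the qmaster not being up.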

Messages file had this:

01/13/2005 15:39:20|qmaster|dsfileserver|E|couldn't set rpc server in database environment: (-30993) DB_NOSERVER: Fatal error, no RPC server
01/13/2005 15:39:20|qmaster|dsfileserver|E|startup of rule "default rule" in context "berkeleydb spooling" failed
01/13/2005 15:39:20|qmaster|dsfileserver|C|setup failed
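
The lines above are pipe-separated, so filtering a long messages file down to the serious entries is easy to script. A minimal sketch, assuming the five fields are timestamp, component, host, severity, and message text (those field names are my own labels, inferred only from the three lines shown, and treating E and C as the serious levels is likewise an assumption):

```python
from typing import Iterable, List, NamedTuple, Optional


class MessageEntry(NamedTuple):
    # Field names are assumed labels, inferred from the sample lines only.
    timestamp: str
    component: str
    host: str
    severity: str
    text: str


def parse_line(line: str) -> Optional[MessageEntry]:
    """Split one messages-file line into its five pipe-separated fields."""
    # maxsplit=4 keeps any further '|' characters inside the message text.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected layout; skip it
    return MessageEntry(*parts)


def serious_entries(lines: Iterable[str], levels=("E", "C")) -> List[MessageEntry]:
    """Return only the entries whose severity is in `levels`."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.severity in levels:
            entries.append(entry)
    return entries


# The three lines quoted above, used as sample input.
SAMPLE = [
    '01/13/2005 15:39:20|qmaster|dsfileserver|E|couldn\'t set rpc server in '
    'database environment: (-30993) DB_NOSERVER: Fatal error, no RPC server',
    '01/13/2005 15:39:20|qmaster|dsfileserver|E|startup of rule "default rule" '
    'in context "berkeleydb spooling" failed',
    '01/13/2005 15:39:20|qmaster|dsfileserver|C|setup failed',
]

if __name__ == "__main__":
    for e in serious_entries(SAMPLE):
        print(e.timestamp, e.severity, e.text)
```

To run it against a real file, pass an open file object instead of SAMPLE; the location of the messages file depends on the installation.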

So I tried to restart the BDB, but its daemon then occupied 20-30% of the CPU, which I had never seen before. I assumed the BDB daemon was hanging (where are the BDB messages?).

qmaster could not restart, so I reinstalled with classic db, and it worked. I didn't know what else to do, and I had 100 users waiting for me.

Thanks again for trying to figure this out.
John

-----Original Message-----
From: Andreas Haas [ mailto:Andreas.Haas at Sun.COM]
Sent: Saturday, January 15, 2005 12:26 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] rcmd: socket: Cannot assign requested address


On Fri, 14 Jan 2005, John_Tai wrote:

> No, I've always used the BDB. And I only switched to the classic yesterday because of this error.
>
> If the BDB was not working correctly, could that have affected the job execution?

John,

it shouldn't. Also it is very unlikely since the error

   failed before writing exit_status:can't read usage file for job 61013.1

occurred in the execution daemon, which does not even link BDB dynamically ...

But who am I to rule this out? The interesting thing is that 6.0u1 BDB
spooling worked for quite some time at first before it broke your system.

Have you tried shutting down and restarting your master node using the following?

   # rcsge stop
   # rcsge start

Andreas
