[GE users] Errors After DNS Domain Change
swaltner at mac.com
Thu Apr 29 13:56:48 BST 2010
I was finally able to restore functionality on our SGE cluster.
The solution was to manually edit the /etc/hosts file on the qmaster host and add entries for both the new and old DNS names of the two execution hosts. Once I did this, I was able to run the "qconf -mq all.q" command and edit the seq_no and slots parameters to drop the old FQDN from the config file. Once this was done, everything returned to normal. Before we shut down our next site that will go through a similar configuration change, I'll try updating the SGE configuration ahead of time. Maybe that will ease the pain of changing DNS domain names.
I would count this as a bug in SGE since the software wouldn't let me correct the configuration issue because there was a configuration issue that needed to be corrected....
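For anyone facing the same change, here is a minimal sketch of generating the temporary /etc/hosts lines described above. The IP address is a placeholder from the documentation range, not our real address, and putting the new FQDN first is only my assumption about a sensible ordering; the helper name hosts_entries is made up for this example.

```python
def hosts_entries(hosts, new_domain, old_domain):
    """Build /etc/hosts lines that list the new FQDN, the old FQDN,
    and the short name for every execution host.

    hosts maps a short host name to its (new) IP address.
    """
    lines = []
    for short_name, ip in sorted(hosts.items()):
        names = [
            f"{short_name}.{new_domain}",  # new FQDN first (assumed ordering)
            f"{short_name}.{old_domain}",  # old FQDN still stored in the SGE config
            short_name,
        ]
        lines.append(f"{ip}\t{' '.join(names)}")
    return lines


if __name__ == "__main__":
    # Placeholder address; substitute the real IP of each execution host.
    example = {"bdcgrid003": "192.0.2.13"}
    for line in hosts_entries(example, "lsi.com", "in.lsil.com"):
        print(line)
```

The printed lines would be appended to /etc/hosts on the qmaster host only, and removed again once the old FQDN is gone from the queue configuration.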
On Apr 27, 2010, at 3:55 PM, swaltner wrote:
> This last weekend, our India site changed their DNS zone from in.lsil.com to lsi.com. At the same time, the servers also were shipped to a new building and in the process, the IP address and other network settings changed. We've made similar changes at another site, so most issues were taken care of right away. However, this was the first site that used SGE and our Linux execution hosts are still off-line.
> Our SGE clusters are all set up to ignore the DNS domain name, so I thought this wouldn't cause too many issues. However, I think what's happened is that when the execution hosts were originally installed, various config settings were stored with the full hostname. The only customization to SGE when installing the execution host is to use a local spooling directory. Every other option is the default option.
> Initially, I noticed that our "qhost -q" would report that our Linux execution hosts would go into the au state occasionally, then return to a normal state, then go back to the au state. This seemed to happen on a cycle of a couple minutes. The qmaster messages file contains lots of errors like:
> 04/25/2010 06:40:04|qmaster|saps|E|can't resolve hostname "bdcgrid003.in.lsil.com"
> 04/25/2010 06:40:04|qmaster|saps|E|no execd known on host bdcgrid003.in.lsil.com to send conf notification
> 04/25/2010 06:40:04|qmaster|saps|E|can't notify exec host "bdcgrid003.lsi.com" of new conf
> 04/25/2010 06:40:44|qmaster|saps|E|can't resolve hostname "bdcgrid003.in.lsil.com"
> 04/25/2010 06:40:44|qmaster|saps|E|no execd known on host bdcgrid003.in.lsil.com to send conf notification
> 04/25/2010 06:40:44|qmaster|saps|E|can't notify exec host "bdcgrid003.lsi.com" of new conf
> 04/25/2010 06:41:24|qmaster|saps|E|can't resolve hostname "bdcgrid003.in.lsil.com"
> 04/25/2010 06:41:24|qmaster|saps|E|no execd known on host bdcgrid003.in.lsil.com to send conf notification
> 04/25/2010 06:41:24|qmaster|saps|E|can't notify exec host "bdcgrid003.lsi.com" of new conf
> Seeing that there were still references to in.lsil.com in the database, I tried updating those references by running commands like "qconf -mq all.q". One of the places where full domain names appeared was the seq_no parameter:
> seq_no 0,[bdcgrid001=20],[bdcgrid002=20], \
> Unfortunately, when I tried to save this config file, I got an error saying that bdcgrid003.in.lsil.com wasn't a valid hostname, even though removing that name was exactly the change I had made. The error was (and still is):
> error: unable to resolve host "bdcgrid003.in.lsil.com"
> unable to resolve host "bdcgrid003.in.lsil.com"
> Since that wasn't working properly, I ran "./inst_sge -ux -host bdcgrid003" to uninstall the execution daemon on the execution host, then reinstalled it. After the reinstall, the execution host at least stays in a normal state instead of going to the au state in qhost. However, once I try to submit a job to the Linux hosts, they go into an Error state. The messages log now reports:
> 04/26/2010 12:23:24|qmaster|saps|W|job 519008.1 failed on host bdcgrid003.lsi.com general assumedly before job because: can't create directory active_jobs/519008.1: No such file or directory
> 04/26/2010 12:23:24|qmaster|saps|W|rescheduling job 519008.1
> 04/26/2010 12:23:24|qmaster|saps|E|queue all.q marked QERROR as result of job 519008's failure at host bdcgrid003.lsi.com
> I still see some references to in.lsil.com in the SGE configs, so I suspect the qmaster is trying to copy files to the old hostname, which is failing, yielding the errors we are seeing. When I try to correct these using commands like the "qconf -mq all.q" command above, it errors when I save the changes, and doesn't update the configuration.
> Is there anything I can do besides renaming all of our execution hosts? I'd really like to avoid doing that.
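To see which stale names the qmaster is still trying to resolve, the messages file quoted above can be scanned with a short script. This is only a sketch: it assumes the pipe-separated layout visible in the quoted log lines (timestamp, daemon, host, level, message), and the function name unresolved_hosts is hypothetical.

```python
import re
from collections import Counter

# Matches the message text seen in the quoted qmaster log lines.
UNRESOLVED = re.compile(r'can\'t resolve hostname "([^"]+)"')


def unresolved_hosts(lines):
    """Count how often each hostname appears in "can't resolve hostname" errors."""
    counts = Counter()
    for line in lines:
        # Expected fields: timestamp|daemon|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # skip lines that do not follow the assumed layout
        match = UNRESOLVED.search(fields[4])
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '04/25/2010 06:40:04|qmaster|saps|E|can\'t resolve hostname "bdcgrid003.in.lsil.com"',
        '04/25/2010 06:40:04|qmaster|saps|E|can\'t notify exec host "bdcgrid003.lsi.com" of new conf',
    ]
    for name, count in unresolved_hosts(sample).items():
        print(name, count)
```

Each name it reports is a candidate for a temporary /etc/hosts entry or for cleanup in the queue configuration.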