[GE users] qlogin put node in Error state

gg3796 gg3796 at yahoo.com
Wed Oct 13 16:52:41 BST 2010



Correct. SELinux is disabled.

Last night the server basically stopped working, and I could not even restart it; it was complaining about some host name error. I'm struggling to bring it back up.

Thanks
Babar

________________________________
From: reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Sent: Wed, October 13, 2010 6:48:31 AM
Subject: Re: [GE users] qlogin put node in Error state

On 12.10.2010 at 21:30, gg3796 wrote:

> The disk is not full; I had suspected that as well. Permissions are also the same as before moving to RHEL5. Our domain name was changed; do you think that could cause this issue?

As long as normal `qsub`-ed jobs are running, this shouldn't be a problem. With the -builtin- startup method in particular, there is no need to touch /etc/hosts.equiv.

You could of course try to fall back to rsh or ssh and check whether this would work.

Apart from any firewall: is SELinux enabled?

-- Reuti


>
> Regards,
> babar
>
> From: reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Tue, October 12, 2010 12:08:15 PM
> Subject: Re: [GE users] qlogin put node in Error state
>
> Hi,
>
> is "loglevel" set to "log_info" in SGE's configuration? For now I would assume a permission problem or a full disk.
>
> -- Reuti
>
> On 12.10.2010 at 20:21, gg3796 <gg3796 at yahoo.com> wrote:
>
>> Hi Reuti:
>>
>> Thanks for your response:
>>
>> Here are the messages I see in the spool/qmaster/messages file:
>>
>> 10/12/2010 09:03:59|worker|gm-cal|W|job 4482163.1 failed on host c8-1.netlogicmicro.com general before job because: 10/12/2010 09:03:58 [511:30522]: startup of qrsh job failed:
>> 10/12/2010 09:03:59|worker|gm-ca|E|queue pd.q marked QERROR as result of job 4482163's failure at host c8-1.netlogicmicro.com
>>
>> Here are the messages from the exec host's spool messages file:
>>
>>
>> 10/12/2010 09:03:59|  main|c8-1|E|shepherd of job 4482163.1 exited with exit status = 11
>>
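The lines quoted above appear to follow a pipe-delimited layout: timestamp, a thread or component name, a host, a one-letter severity, and free text. As a minimal sketch, assuming that layout holds for the whole messages file, the Python below splits such lines and keeps only the `E` entries. The field names and the helpers `parse_message_line` and `errors_only` are hypothetical names chosen for illustration; they are not part of SGE.

```python
from typing import Iterable, List, NamedTuple


class MessageLine(NamedTuple):
    # Field names are assumptions inferred from the quoted log lines.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> MessageLine:
    """Split one messages-file line into its five pipe-delimited fields."""
    # maxsplit=4 keeps any further '|' characters inside the message text.
    timestamp, thread, host, level, text = line.split("|", 4)
    return MessageLine(
        timestamp.strip(), thread.strip(), host.strip(), level.strip(), text.strip()
    )


def errors_only(lines: Iterable[str]) -> List[MessageLine]:
    """Return the parsed lines whose severity field is 'E'."""
    parsed = (parse_message_line(l) for l in lines if l.count("|") >= 4)
    return [m for m in parsed if m.level == "E"]


# Sample lines copied from the messages quoted in this thread.
sample = [
    "10/12/2010 09:03:59|worker|gm-cal|W|job 4482163.1 failed on host "
    "c8-1.netlogicmicro.com general before job because: 10/12/2010 09:03:58 "
    "[511:30522]: startup of qrsh job failed:",
    "10/12/2010 09:03:59|worker|gm-ca|E|queue pd.q marked QERROR as result of "
    "job 4482163's failure at host c8-1.netlogicmicro.com",
    "10/12/2010 09:03:59|  main|c8-1|E|shepherd of job 4482163.1 exited with "
    "exit status = 11",
]

for entry in errors_only(sample):
    print(entry.timestamp, entry.host, entry.text)
```

Running the same filter over both the qmaster and the exec host messages files would put the `E` entries from the two sides next to each other by timestamp, which may make it easier to spot an additional hint around the failed qrsh startup.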
>>
>> From: reuti <reuti at staff.uni-marburg.de>
>> To: users at gridengine.sunsource.net
>> Sent: Tue, October 12, 2010 1:13:30 AM
>> Subject: Re: [GE users] qlogin put node in Error state
>>
>> On 11.10.2010 at 20:04, gg3796 wrote:
>>
>> > Thanks Reuti:
>> >
>> > I am using builtin:
>> >
>> > qlogin_command              builtin
>> > qlogin_daemon               builtin
>> > rlogin_command              builtin
>> > rlogin_daemon               builtin
>> > rsh_command                 builtin
>> > rsh_daemon                  builtin
>> >
>> > The local configuration for the hosts doesn't contain anything related, only the following 3 lines:
>> > mailer                      /bin/mail
>> > xterm                       /usr/bin/xterm
>> > execd_spool_dir             /var/sge/6.2u3/california/spool/
>>
>> Fine.
>>
>>
>> > I can ssh to the hosts without any problem. It was all working well until I upgraded all submit and execution hosts to RHEL 5.4. One thing I would like to mention is that the SGE master is still running RHEL 4.x. Do you think that may be the problem?
>>
>> Do you see any additional hint in the messages file of the qmaster and/or the involved nodes?
>>
>> -- Reuti
>>
>>
>> > Regards,
>> > Babar
>> >
>> >
>> > From: reuti <reuti at staff.uni-marburg.de>
>> > To: users at gridengine.sunsource.net
>> > Sent: Mon, October 11, 2010 2:19:50 AM
>> > Subject: Re: [GE users] qlogin put node in Error state
>> >
>> > Hi,
>> >
>> > On 09.10.2010 at 05:02, gg3796 wrote:
>> >
>> > > I am running 6.2u3. Since we upgraded our desktops and servers to RHEL 5.4, qlogin puts the exec host into the E state.
>> >
>> > what is your startup method for `qlogin` (`qconf -sconf` and/or the local configuration of each exechost)? I would assume that "telnetd" or "telnet" isn't installed and that you are not using -builtin-. NB: "telnetd" can stay disabled in /etc/xinetd.d/telnetd, as SGE will start its own instance of `telnetd`.
>> >
>> > -- Reuti
>> >
>> >
>> >
>> > > The only message I see in the exec host spool messages file is:
>> > > 10/08/2010 19:49:35|  main|cluster-1|E|shepherd of job 4456333.1 exited with exit status = 11
>> > >
>> > >
>> > > The job status email has following lines in it:
>> > >
>> > >
>> > > Job 4456333 caused action: Queue "pd.q at cluster-1.xyz.com" set to ERROR
>> > >
>> > > User = babar
>> > >
>> > > Queue = pd.q at c8-1.xyz.com
>> > >
>> > > Start Time = <unknown>
>> > >
>> > > End Time = <unknown>
>> > >
>> > > failed before job:10/08/2010 19:49:34 [511:4487]: startup of qrsh job failed:
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > Thanks,
>> > >
>> > > Babar
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>> > ------------------------------------------------------
>> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286475
>> >
>> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> >
>> >
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286556
>>
>>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286876
