[GE users] qlogin put node in Error state

gg3796 gg3796 at yahoo.com
Tue Oct 12 20:30:10 BST 2010



The disk is not full; I had suspected that as well. Permissions are also the same as before moving to RHEL5. Our domain name was changed. Do you think that could cause this issue?

Regards,
babar

________________________________
From: reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Sent: Tue, October 12, 2010 12:08:15 PM
Subject: Re: [GE users] qlogin put node in Error state

Hi,

Is "loglevel" set to "log_info" in SGE's configuration? For now I would assume some permission problem or a full disk.

-- Reuti

On 12.10.2010 at 20:21, gg3796 <gg3796 at yahoo.com> wrote:

Hi Reuti:

Thanks for your response:

Here are the messages I see in the spool/qmaster/messages file:

10/12/2010 09:03:59|worker|gm-cal|W|job 4482163.1 failed on host c8-1.netlogicmicro.com general before job because: 10/12/2010 09:03:58 [511:30522]: startup of qrsh job failed:
10/12/2010 09:03:59|worker|gm-cal|E|queue pd.q marked QERROR as result of job 4482163's failure at host c8-1.netlogicmicro.com

Here is the message from the exec host's spool messages file:


10/12/2010 09:03:59|  main|c8-1|E|shepherd of job 4482163.1 exited with exit status = 11
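
Both messages files quoted above appear to use the same pipe-separated layout (timestamp, thread, host, one-letter level, text), so collecting every warning and error for one job can be scripted. The following is a minimal Python sketch and not part of SGE: the names `MessageLine`, `parse_message_line`, `entries_for_job` and `SAMPLE` are made up for illustration, and the field layout is inferred only from the lines in this thread.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Sequence


class MessageLine(NamedTuple):
    """One entry of an SGE messages file (layout assumed from this thread)."""
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split 'timestamp|thread|host|level|text'; return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    # The thread field is padded with spaces (e.g. '  main'), so strip each part.
    return MessageLine(*(p.strip() for p in parts))


def entries_for_job(lines: Iterable[str], job_id: int,
                    levels: Sequence[str] = ("E", "W")) -> List[MessageLine]:
    """Return entries at the given levels whose text mentions 'job <job_id>'."""
    # \b keeps job 4482163 from also matching a longer id such as 44821630.
    pattern = re.compile(r"\bjob\s+%d\b" % job_id)
    found = []
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and entry.level in levels and pattern.search(entry.text):
            found.append(entry)
    return found


# Lines copied from this thread (qmaster file first, then the exec host file).
SAMPLE = [
    "10/12/2010 09:03:59|worker|gm-cal|W|job 4482163.1 failed on host "
    "c8-1.netlogicmicro.com general before job because: 10/12/2010 09:03:58 "
    "[511:30522]: startup of qrsh job failed:",
    "10/12/2010 09:03:59|worker|gm-cal|E|queue pd.q marked QERROR as result of "
    "job 4482163's failure at host c8-1.netlogicmicro.com",
    "10/12/2010 09:03:59|  main|c8-1|E|shepherd of job 4482163.1 exited with "
    "exit status = 11",
]

if __name__ == "__main__":
    for entry in entries_for_job(SAMPLE, 4482163):
        print(entry.timestamp, entry.host, entry.level, entry.text)
```

In practice the lines would come from reading the qmaster and exec host messages files rather than from the hard-coded `SAMPLE` list.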


________________________________
From: reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Sent: Tue, October 12, 2010 1:13:30 AM
Subject: Re: [GE users] qlogin put node in Error state

On 11.10.2010 at 20:04, gg3796 wrote:

> Thanks Reuti:
>
> I am using builtin:
>
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
>
> The local_configuration for the hosts doesn't have anything related, only the following 3 lines:
> mailer                       /bin/mail
> xterm                        /usr/bin/xterm
> execd_spool_dir              /var/sge/6.2u3/california/spool/

Fine.
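
For anyone repeating this check on their own cluster, the six settings quoted above can be verified mechanically from the text that `qconf -sconf` prints. This is a minimal Python sketch under stated assumptions: `parse_conf`, `non_builtin` and `STARTUP_KEYS` are hypothetical helper names, the input is assumed to be simple "name value" lines as in the quote, and continuation lines are not handled.

```python
from typing import Dict, Optional

# The six interactive-startup parameters listed in the quoted configuration.
STARTUP_KEYS = (
    "qlogin_command", "qlogin_daemon",
    "rlogin_command", "rlogin_daemon",
    "rsh_command", "rsh_daemon",
)


def parse_conf(text: str) -> Dict[str, str]:
    """Turn 'name   value' lines into a dict; blank and '#' lines are skipped."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def non_builtin(conf: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the startup parameters that are missing (None) or not 'builtin'."""
    return {key: conf.get(key) for key in STARTUP_KEYS
            if conf.get(key) != "builtin"}


if __name__ == "__main__":
    quoted = """
    qlogin_command   builtin
    qlogin_daemon    builtin
    rlogin_command   builtin
    rlogin_daemon    builtin
    rsh_command      builtin
    rsh_daemon       builtin
    """
    # An empty dict means all six parameters are set to builtin.
    print(non_builtin(parse_conf(quoted)))
```

The same helper can be run on each host's local configuration text, since a local setting would take the place of the global one for that host.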


> I can ssh to the hosts without any problem. It was all working well until I upgraded all submit and execution hosts to RHEL5.4. One thing I would like to mention is that the SGE master is still running RHEL4.X. Do you think that may be the problem?

Do you see any additional hint in the messages file of the qmaster and/or the involved nodes?

-- Reuti


> Regards,
> Babar
>
>
> From: reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Mon, October 11, 2010 2:19:50 AM
> Subject: Re: [GE users] qlogin put node in Error state
>
> Hi,
>
> On 09.10.2010 at 05:02, gg3796 wrote:
>
> > I am running 6.2u3. Since we upgraded our desktops and servers to RHEL5.4, qlogin puts the exec host into the E state.
>
> What is your startup method for `qlogin` (`qconf -sconf` and/or the local configuration of each exechost)? I would assume that "telnetd" or "telnet" wasn't installed and you are not using -builtin-. NB: "telnetd" can stay disabled in /etc/xinetd.d/telnetd, as SGE will start its own instance of `telnetd`.
>
> -- Reuti
>
>
>
> > The only message I see in the exec host spool messages file is:
> > 10/08/2010 19:49:35|  main|cluster-1|E|shepherd of job 4456333.1 exited with exit status = 11
> >
> >
> > The job status email has the following lines in it:
> >
> >
> > Job 4456333 caused action: Queue "pd.q at cluster-1.xyz.com" set to ERROR
> >
> > User = babar
> >
> > Queue = pd.q at c8-1.xyz.com
> >
> > Start Time = <unknown>
> >
> > End Time = <unknown>
> >
> > failed before job:10/08/2010 19:49:34 [511:4487]: startup of qrsh job failed:
> >
> > Thanks,
> >
> > Babar
> >
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286475
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286556

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].




