[GE users] qlogin put node in Error state

reuti reuti at staff.uni-marburg.de
Wed Oct 13 14:48:31 BST 2010


On 12.10.2010 at 21:30, gg3796 wrote:

> The disk is not full; I had suspected that as well. Permissions are also the same as before moving to RHEL5. Our domain name was changed; do you think that could cause this issue?

As long as normal `qsub`-ed jobs are running, this shouldn't be a problem. Especially with the -builtin- startup method, there is no need to touch /etc/hosts.equiv.

You could of course try to fall back to rsh or ssh and check whether this would work.
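
A rough sketch of how to see what is currently in effect before changing anything with `qconf -mconf`. The helper name `show_startup_methods` is made up for illustration; it only filters the relevant entries out of `qconf -sconf` output. The ssh paths in the comments are assumptions, so adjust them to your installation:

```shell
# Hypothetical helper: filter the interactive startup entries out of
# `qconf -sconf`-style output read from stdin.
show_startup_methods() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Typical use on a submit host (needs a working SGE installation):
#   qconf -sconf | show_startup_methods
#
# An ssh-based setup is often configured along these lines
# (assumed paths, changed with `qconf -mconf`):
#   rsh_command     /usr/bin/ssh
#   rsh_daemon      /usr/sbin/sshd -i
#   rlogin_command  /usr/bin/ssh
#   rlogin_daemon   /usr/sbin/sshd -i
```

For `qlogin` the command is called with host and port as arguments, so a small wrapper script around ssh is usually needed; it is left out of this sketch.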

Apart from any firewall: is SELinux in place?
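
A quick way to check this on one of the upgraded exec hosts, as a sketch: `getenforce` prints the current SELinux mode (Enforcing, Permissive or Disabled). The function name `selinux_mode` is only an illustration, and the audit log path in the comment is the usual default, which I assume applies to your RHEL5 hosts:

```shell
# Hypothetical helper: print the SELinux mode, or "unknown" when the
# SELinux tools are not installed or cannot report a mode.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce 2>/dev/null || echo "unknown"
    else
        echo "unknown"
    fi
}

selinux_mode

# If it reports "Enforcing", denials may show up in the audit log (as root):
#   grep 'avc:.*denied' /var/log/audit/audit.log
# For a test, SELinux can be set to permissive until the next reboot (as root):
#   setenforce 0
```

If `qlogin` works while SELinux is permissive, the policy is the likely culprit rather than SGE itself.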

-- Reuti


> 
> Regards,
> babar  
> 
> From: reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Tue, October 12, 2010 12:08:15 PM
> Subject: Re: [GE users] qlogin put node in Error state
> 
> Hi,
> 
> Is "loglevel" set to "log_info" in SGE's configuration? For now I would assume some permission problem or a full disk.
> 
> -- Reuti
> 
> On 12.10.2010 at 20:21, gg3796 <gg3796 at yahoo.com> wrote:
> 
>> Hi Reuti:
>>  
>> Thanks for your response:
>>  
>> Here is the message I see in the spool/qmaster/messages file:
>> 10/12/2010 09:03:59|worker|gm-cal|W|job 4482163.1 failed on host c8-1.netlogicmicro.com general before job because: 10/12/2010 09:03:58 [511:30522]: startup of qrsh job failed:
>> 10/12/2010 09:03:59|worker|gm-ca|E|queue pd.q marked QERROR as result of job 4482163's failure at host c8-1.netlogicmicro.com
>>  
>> Here is the message from the exec host's spool messages file:
>>  
>>  
>> 10/12/2010 09:03:59|  main|c8-1|E|shepherd of job 4482163.1 exited with exit status = 11
>> 
>> 
>> From: reuti <reuti at staff.uni-marburg.de>
>> To: users at gridengine.sunsource.net
>> Sent: Tue, October 12, 2010 1:13:30 AM
>> Subject: Re: [GE users] qlogin put node in Error state
>> 
>> On 11.10.2010 at 20:04, gg3796 wrote:
>> 
>> > Thanks Reuti:
>> >  
>> > I am using builtin:
>> >  
>> > qlogin_command               builtin
>> > qlogin_daemon                builtin
>> > rlogin_command               builtin
>> > rlogin_daemon                builtin
>> > rsh_command                  builtin
>> > rsh_daemon                   builtin
>> >  
>> > The local_configuration for the hosts doesn't have anything related, only the following 3 lines:
>> > mailer                       /bin/mail
>> > xterm                        /usr/bin/xterm
>> > execd_spool_dir              /var/sge/6.2u3/california/spool/
>> 
>> Fine.
>> 
>> 
>> > I can ssh to the hosts without any problem. It was all working well until I upgraded all submit and execution hosts to RHEL5.4. One thing I would like to mention is that SGEMASTER is still running RHEL4.X. Do you think that may be the problem?
>> 
>> Do you see any additional hint in the messages file of the qmaster and/or the involved nodes?
>> 
>> -- Reuti
>> 
>> 
>> > Regards,
>> > Babar
>> >  
>> > 
>> > From: reuti <reuti at staff.uni-marburg.de>
>> > To: users at gridengine.sunsource.net
>> > Sent: Mon, October 11, 2010 2:19:50 AM
>> > Subject: Re: [GE users] qlogin put node in Error state
>> > 
>> > Hi,
>> > 
>> > On 09.10.2010 at 05:02, gg3796 wrote:
>> > 
>> > > I am running 6.2u3. Since we upgraded our desktops and servers to RHEL5.4, qlogin puts the exec host into the E state.
>> > 
>> > What is your startup method for `qlogin` (`qconf -sconf` and/or the local configuration of each exechost)? I would assume that "telnetd" or "telnet" wasn't installed and you are not using -builtin-. NB: "telnetd" can stay disabled in /etc/xinetd.d/telnetd, as SGE will start its own instance of `telnetd`.
>> > 
>> > -- Reuti
>> > 
>> > 
>> > 
>> > > The only message I see in the exec host spool messages file is:
>> > > 10/08/2010 19:49:35|  main|cluster-1|E|shepherd of job 4456333.1 exited with exit status = 11
>> > >  
>> > >  
>> > > The job status email has following lines in it:
>> > >  
>> > >  
>> > > Job 4456333 caused action: Queue "pd.q at cluster-1.xyz.com" set to ERROR
>> > > 
>> > > User = babar
>> > > 
>> > > Queue = pd.q at c8-1.xyz.com
>> > > 
>> > > Start Time = <unknown>
>> > > 
>> > > End Time = <unknown>
>> > > 
>> > > failed before job:10/08/2010 19:49:34 [511:4487]: startup of qrsh job failed:
>> > > 
>> > >  
>> > >  
>> > >  
>> > >  
>> > > Thanks,
>> > > 
>> > > Babar
>> > > 
>> > >  
>> > >  
>> > > 
>> > >
>> > 
>> > ------------------------------------------------------
>> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286475
>> > 
>> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> > 
>> >
>> 
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286556
>> 
>> 
> 
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286876



