[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Fri Jun 29 18:30:46 BST 2007


>> ListenAddress 0.0.0.0
>> #ListenAddress ::

That was already the case: the ListenAddress :: line was already commented out in /etc/ssh/sshd_config.

On Jun 29, 2007, at 12:20 PM, Margaret Doll wrote:

> I rebooted the compute node. The job restarted, showed up in top,
> and seems to have completed successfully. There are no sge_shepherd
> processes left hanging around for the job.
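To confirm on a node that no shepherd processes are left behind, one could scan /proc for them. This is a minimal sketch, assuming the shepherd's command line starts with `sge_shepherd` (it often appears in `ps` as something like `sge_shepherd-<jobid>`; check the `ps` output on your own nodes). The function name `find_shepherds` is hypothetical.

```python
import os

def find_shepherds(proc_root="/proc"):
    """Return sorted (pid, cmdline) pairs for processes whose first
    argument starts with 'sge_shepherd' (assumed naming)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited meanwhile, or is not readable
        # cmdline is NUL-separated; kernel threads have an empty one
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if args and os.path.basename(args[0]).startswith("sge_shepherd"):
            found.append((int(entry), " ".join(args)))
    return sorted(found)
```

Calling `find_shepherds()` with no argument scans the live /proc on a Linux node; the `proc_root` parameter only exists so the scan can be pointed at a test directory.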
>
> I will try the change to sshd_config.
>
> I do not believe the system has been compromised, because some
> jobs complete successfully on the queues. This same job, which had
> problems, ran completely fine on a compute node when it was not
> submitted through qsub. We are behind a campus firewall, and I
> have /etc/hosts.allow restricted to just a couple of subnets.
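If logwatch ever does show an unexpected login, a quick check is whether the source address falls inside the subnets that hosts.allow is meant to permit. This is a minimal sketch using Python's standard `ipaddress` module. The subnets below are RFC 5737 documentation ranges standing in for the real ones, which the post does not give, and `is_allowed` is a hypothetical helper. It mirrors only the intent of hosts.allow, not its full pattern syntax.

```python
import ipaddress

# Hypothetical stand-ins for the real subnets listed in /etc/hosts.allow.
ALLOWED = ["192.0.2.0/24", "198.51.100.0/24"]

def is_allowed(addr, allowed=ALLOWED):
    """True if addr lies inside any of the allowed CIDR networks.

    An IPv6 address is never 'in' an IPv4 network, so it returns False.
    """
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(net) for net in allowed)
```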
>
> I look at the logwatch report for the cluster each morning and
> haven't seen any strange logins.
>
> There are strange messages:
>
> A total of 10 unidentified 'other' records logged
>   GET /411.d//etc.auto..net HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.passwd HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.group HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.auto..master HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.services HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.auto..share HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.auto..misc HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.shadow HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.auto..home HTTP/1.1 with response code(s) 36 200 responses
>   GET /411.d//etc.rpc HTTP/1.1 with response code(s) 36 200 responses
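These request paths look like encoded names of ordinary system files, which would fit the ROCKS 411 service distributing account and automount maps from the frontend over HTTP; that reading is an assumption, not something the post confirms. This is a minimal decoding sketch. The encoding rule (a single `.` stands for `/`, a doubled `..` for a literal `.`) is inferred only from the log lines above, and `decode_411` is a hypothetical name.

```python
PREFIX = "/411.d/"

def decode_411(url_path):
    """Map a 411.d request path back to a filesystem path.

    Assumed rule, inferred from the logged names only:
    '..' means a literal '.', and a single '.' means '/'.
    """
    if not url_path.startswith(PREFIX):
        raise ValueError("not a 411.d path: " + url_path)
    name = url_path[len(PREFIX):]
    # Protect literal dots with a placeholder, map single dots to '/',
    # then restore the literal dots.
    return name.replace("..", "\0").replace(".", "/").replace("\0", ".")
```

Applied to the ten logged paths, this yields /etc/auto.net, /etc/passwd, /etc/group, /etc/auto.master, /etc/services, /etc/auto.share, /etc/auto.misc, /etc/shadow, /etc/auto.home and /etc/rpc.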
>
>
> In addition, /var/log/messages contains the messages about ntp.
>
>
>
> I installed this system just last month using ROCKS 4.2.1 with the
> CentOS version that came with it.
>
>
> Linux compute-0-1.local 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux.
>
>
> On Jun 28, 2007, at 6:51 PM, John Hearns wrote:
>
>> Margaret Doll wrote:
>>> Jun 27 16:38:02 compute-0-1 rpc.statd[3329]: Caught signal 15, un-registering and exiting.
>> Errrr... could your code be hanging, waiting to do some I/O to an
>> NFS-mounted filesystem?
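One way to test the NFS hypothesis: a process blocked on an unresponsive NFS server typically sits in uninterruptible sleep, shown as state `D` in `ps` and in `/proc/<pid>/stat`. This is a minimal sketch that lists such processes. `D` state can have other causes too, so treat a hit as a lead rather than proof. The helper names are hypothetical.

```python
import os

def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    left, _, right = line.rpartition(")")
    pid_str, _, comm = left.partition(" (")
    state = right.split()[0]
    return int(pid_str), comm, state

def stuck_in_io(proc_root="/proc"):
    """List (pid, comm) for processes in state 'D' (uninterruptible sleep)."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```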
>>
>>
>>
>>> Jun 27 16:39:44 compute-0-1 sshd[3697]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
>> A quick bit of Googling suggests it is already bound to the IPv6
>> address. As you won't be using IPv6, the suggestion is to comment
>> out the IPv6 ListenAddress line in sshd_config:
>>
>> ListenAddress 0.0.0.0
>> #ListenAddress ::
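The effect described above can be reproduced without sshd. On Linux, an IPv6 wildcard socket that is not restricted to IPv6 (`IPV6_V6ONLY` off, the usual default when `net.ipv6.bindv6only` is 0) also accepts IPv4 connections, so a later bind to `0.0.0.0` on the same port fails with `EADDRINUSE`. This is a minimal sketch, assuming a Linux host with IPv6 support; it uses an ephemeral port rather than 22, and the function name is hypothetical.

```python
import errno
import socket

def ipv6_wildcard_blocks_ipv4(port=0):
    """Bind [::]:port dual-stack, then try 0.0.0.0 on the same port.

    Returns True if the IPv4 bind fails with EADDRINUSE, i.e. the IPv6
    listener already owns the port, as in the sshd log line.
    """
    v6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    v4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Explicitly allow IPv4-mapped traffic on the IPv6 socket.
        v6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        v6.bind(("::", port))
        v6.listen(1)
        chosen = v6.getsockname()[1]  # real port if 0 was requested
        try:
            v4.bind(("0.0.0.0", chosen))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        v4.close()
        v6.close()
```

With `IPV6_V6ONLY` set to 1 instead, the IPv4 bind would normally succeed, which is why restricting sshd to a single address family avoids the clash.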
>>
>> And why is sshd being started up at this time? It should only be
>> started at boot time.
>>
>> Has something acted to change the runlevel of this machine at 16:38?
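To see what else happened around 16:38, it can help to pull every syslog entry in a short window and look for other daemons stopping or starting. This is a minimal sketch for the classic `Mon DD HH:MM:SS host program[pid]: message` format of /var/log/messages. Since that format omits the year, the caller supplies it; the function names are hypothetical.

```python
import re
from datetime import datetime

# "Mon DD HH:MM:SS host program[pid]: message"; the [pid] part is optional.
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<prog>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)

def parse_syslog(line, year):
    """Return (timestamp, host, program, pid, message) or None."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    pid = int(m["pid"]) if m["pid"] else None
    return stamp, m["host"], m["prog"], pid, m["msg"]

def in_window(lines, year, start, end):
    """Yield parsed entries whose timestamp lies in [start, end]."""
    for line in lines:
        entry = parse_syslog(line, year)
        if entry is not None and start <= entry[0] <= end:
            yield entry
```

For example, `in_window(open("/var/log/messages"), 2007, datetime(2007, 6, 27, 16, 35), datetime(2007, 6, 27, 16, 45))` would list the entries logged between 16:35 and 16:45 on June 27.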
>>
>> Which distribution and kernel are these machines running?
>> I would advise updating to the latest kernel available for this
>> distribution, and to the latest NFS packages.
>>
>> Also, I really hate to say this, and am opening myself up to a bit
>> of ridicule, but is there any possibility that these machines have
>> been compromised?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
