[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Fri Jun 29 17:20:05 BST 2007


I rebooted the compute node; the job restarted, showed up in top, and seems to have completed successfully. There are no sge_shepherd processes hanging around associated with the job.

I will try the change to sshd_config.
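
For reference, a minimal sketch of that change (standard RHEL/CentOS path; sshd needs a restart for it to take effect):

    # /etc/ssh/sshd_config -- bind only to IPv4
    ListenAddress 0.0.0.0
    #ListenAddress ::

    # then restart the daemon
    service sshd restart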

I do not believe that the system has been compromised, because some jobs complete successfully on the queues. The same job that had problems ran completely fine on a compute node when it was not submitted through qsub. We are behind a campus firewall, and I have /etc/hosts.allow restricted to just a couple of subnets.
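
One way to narrow this down is to submit a minimal wrapper for the same binary and pin it to the suspect node (a sketch; test.sh and my_job are hypothetical names, and all.q is assumed to be the default Rocks queue):

    #!/bin/bash
    #$ -cwd
    #$ -S /bin/bash
    # test.sh: run the same binary under SGE that works interactively
    hostname
    ./my_job

    # submit it to the one queue instance on the suspect node:
    qsub -q all.q@compute-0-1 test.sh

If this hangs while an interactive run on the same node completes, the problem is in the qsub/NFS path rather than the job itself.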

I look at the logwatch output for the cluster each morning and haven't seen any strange logins.

There are strange messages:

A total of 10 unidentified 'other' records logged
   GET /411.d//etc.auto..net HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.passwd HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.group HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.auto..master HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.services HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.auto..share HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.auto..misc HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.shadow HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.auto..home HTTP/1.1 with response code(s) 36 200 responses
   GET /411.d//etc.rpc HTTP/1.1 with response code(s) 36 200 responses
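
(If I read these right, they may just be the Rocks 411 service distributing the login and automount files to the compute nodes over HTTP, which would be expected on a Rocks cluster rather than a sign of trouble. One hedged way to check from a compute node, assuming the standard Rocks 411 client is present:

    # fetch one of the 411-managed files by hand from the frontend
    411get /etc/passwd

If that succeeds, the GETs above are most likely the nodes polling the frontend for updates.)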


and /var/log/messages contains the messages about ntp.
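
Something like this pulls those entries out for a closer look:

    grep -E 'ntpd|rpc.statd|sshd' /var/log/messages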



I just installed this system last month using ROCKS 4.2.1 with the CentOS version that shipped with it.


Linux compute-0-1.local 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
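
On John's suggestion below about updating: a minimal sketch for this CentOS 4 base, assuming yum is pointed at the distribution's update repositories:

    # see what newer packages are offered, then apply them
    yum list updates kernel nfs-utils
    yum update kernel nfs-utils

    # reboot so the node comes up on the new kernel
    reboot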


On Jun 28, 2007, at 6:51 PM, John Hearns wrote:

> Margaret Doll wrote:
>> Jun 27 16:38:02 compute-0-1 rpc.statd[3329]: Caught signal 15, un-registering and exiting.
> Errrr... your code is hanging waiting to do some I/O to an NFS-mounted filesystem?
>
>> Jun 27 16:39:44 compute-0-1 sshd[3697]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
> A quick bit of Googling suggests it is already bound to the IPv6 address. As you won't be using IPv6, the suggestion is to comment the IPv6 ListenAddress out of sshd_config:
>
> ListenAddress 0.0.0.0
> #ListenAddress ::
>
> And why is sshd being started up at this time? It should only be started at boot time.
>
> Has something acted to change the runlevel of this machine at 16:38?
>
> Which distribution and kernel are these machines running?
> I would advise updating to the latest kernel available for this distribution, and the latest NFS packages.
>
> Also I really hate to say this - and am opening myself up to a bit of ridicule - but is there any possibility these machines have been compromised?

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
