[GE users] sge_shepherd problems perhaps connected to nfs problems

Margaret Doll Margaret_Doll at brown.edu
Thu Jun 28 21:31:21 BST 2007


The jobs have not gained any CPU time.

root      4061     1  0 Jun27 ?        00:00:00 /bin/sh /opt/globus/sbin/grid-info-soft-register -log /opt/globus/var/grid-info-system.log -register -t mdsreg2 -h compute-0-1.local -p 2135 -period 600 -dn Mds-Vo-Op-name=register, Mds-Vo-name=site, o=grid -daemon -t ldap -h compute-0-1.local -p 2135 -ttl 1200 -r Mds-Vo-name=local, o=grid -T 20 -b ANONYM-ONLY -z 0 -m cachedump -period 30
root      8677  4061  0 16:20 ?        00:00:00  \_ sleep 600
sge       4228     1  0 Jun27 ?        00:04:13 /opt/gridengine/bin/lx26-amd64/sge_execd
sge       5094  4228  0 Jun27 ?        00:00:00  \_ sge_shepherd-549 -bg
mad       5095  5094  0 Jun27 ?        00:00:00      \_ -csh /opt/gridengine/default/spool/compute-0-1/job_scripts/549
mad       5168  5095 57 Jun27 ?        13:30:53          \_ /home/mad/jdoll/mad
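
For anyone who wants to confirm this without watching top, here is a minimal sketch that samples the cumulative CPU time of one process twice and compares the two values. The function names are hypothetical, the TIME format is assumed to be the [DD-]HH:MM:SS that "ps -o time=" prints on Linux, and the sampling interval must be long enough for a busy process to gain at least one second.

```shell
#!/bin/sh
# Convert a ps TIME value ([DD-]HH:MM:SS) into seconds.
# Prints nothing and returns non-zero on an unexpected format.
ps_time_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        { exit 1 }'
}

# Return 0 if the process gained CPU time during the interval,
# 1 if it did not, 2 if the process could not be sampled.
# Arguments: PID, interval in seconds (default 10).
cpu_is_advancing() {
    pid=$1
    interval=${2:-10}
    before=$(ps_time_to_seconds "$(ps -o time= -p "$pid")") || return 2
    sleep "$interval"
    after=$(ps_time_to_seconds "$(ps -o time= -p "$pid")") || return 2
    [ "$after" -gt "$before" ]
}
```

With the listing above, this would be called as "cpu_is_advancing 5168 30". A return status of 1 means the TIME column stood still over those 30 seconds.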


The only error messages on the compute node are numerous problems
with the ntpdate server and these:

Jun 27 16:38:02 compute-0-1 rpc.statd[3329]: Caught signal 15, un-registering and exiting.
Jun 27 16:39:40 compute-0-1 kernel: PCI: Cannot allocate resource region 0 of device 0000:00:08.0
Jun 27 16:39:41 compute-0-1 kernel: hw_random: RNG not detected
Jun 27 16:39:44 compute-0-1 sshd[3697]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Jun 27 17:01:03 compute-0-1 ntpdate[5087]: no server suitable for synchronization found
Jun 27 19:01:02 c


On Jun 28, 2007, at 12:18 PM, Fred Youhanaie wrote:

>
>
> Margaret Doll wrote:
>> I have a job that I started last night.
>> It no longer shows up in top, but it still appears in qstat -f.
>> ps -ef --forest | more
>> UID        PID  PPID  C STIME TTY          TIME CMD
>> sge       4228     1  0 Jun27 ?        00:02:53 /opt/gridengine/bin/lx26-amd64/sge_execd
>> sge       5094  4228  0 Jun27 ?        00:00:00  \_ sge_shepherd-549 -bg
>> mad       5095  5094  0 Jun27 ?        00:00:00      \_ -csh /opt/gridengine/default/spool/compute-0-1/job_scripts/549
>> mad       5168  5095 84 Jun27 ?        13:30:53          \_ /home/mad/user1/mad
>
>
> It looks like the script has been running and so far it has used  
> 13.5 hours of cpu time. Is the TIME column still increasing?

>
> qdel 549 should delete the job and the 3 processes should disappear.
>
> I think it is also worthwhile following John's advice and
> investigating the hanging df problems. Are there any NFS issues?
>
>
> Cheers
> f.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
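
The NFS question in the quoted reply could be checked along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the mount table is read from /proc/mounts, and the coreutils "timeout" command is assumed to be installed. Each NFS mount point is probed with a time limit so that a dead server cannot hang the check the way a plain df does. On kernels where a process blocked on a hard NFS mount cannot be killed, the probe itself may still block.

```shell
#!/bin/sh
# Print the mount points of all NFS filesystems (types nfs, nfs4, ...)
# listed in a mounts table. Argument: path to the table
# (default /proc/mounts).
nfs_mount_points() {
    awk '$3 ~ /^nfs/ { print $2 }' "${1:-/proc/mounts}"
}

# Probe every NFS mount point with a time limit and report the result.
# Argument: limit in seconds per mount (default 5).
check_nfs_mounts() {
    limit=${1:-5}
    nfs_mount_points | while read -r mount_point; do
        # stat -f asks the filesystem for its status, much like df does.
        if timeout -s KILL "$limit" stat -f "$mount_point" >/dev/null 2>&1; then
            echo "ok     $mount_point"
        else
            echo "hung?  $mount_point"
        fi
    done
}
```

Running "check_nfs_mounts 5" on the compute node prints one line per NFS mount. A "hung?" line marks a mount that either did not answer within 5 seconds or returned an error.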
