[GE users] RE: [SPAM] Re: [GE users] SGE6.1 error

Reuti reuti at staff.uni-marburg.de
Mon Aug 13 13:26:12 BST 2007


On 13.08.2007, at 11:16, John_Tai wrote:

> There isn't any cron job running on the exec hosts. Also, it happens
> on all my exec hosts (about 70), so I don't think the problem is in
> the exec hosts. Could it be a problem with the GE configuration or
> installation?
>
> Actually, let me correct my previous email. The jobs in GE are
> lost, so they are not in the qstat output. However, the actual
> processes are not terminated; they are still running on the exec host.

Exit code 11 is "Resource temporarily unavailable" - was there any  
change to the network/NFS-server with this upgrade?

One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
denied" is not the usual location of the pid file. For me it's in
/var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.

Where is your local SGE spool directory located - local on the nodes  
or in $SGE_ROOT?
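A minimal Python sketch of the check asked about here, assuming the execd spool location is the `execd_spool_dir` entry printed by `qconf -sconf` (one whitespace-separated "name value" pair per line) and that the per-job pid file follows the `<spool>/<node>/active_jobs/<job>.<task>/pid` layout described above. The function names, the sample configuration text and `/opt/sge` (standing in for `$SGE_ROOT`) are hypothetical. A host-specific configuration (`qconf -sconf <host>`) may override the global value, so the check should use whichever configuration applies to the node.

```python
import os
from typing import Optional


def execd_spool_dir(conf_text: str) -> Optional[str]:
    """Return the execd_spool_dir value from `qconf -sconf`-style text, if present."""
    for line in conf_text.splitlines():
        parts = line.split(None, 1)  # assumed "attribute   value" layout
        if len(parts) == 2 and parts[0] == "execd_spool_dir":
            return parts[1].strip()
    return None


def expected_pid_path(spool_dir: str, node: str, job_id: int, task_id: int) -> str:
    """Build <spool>/<node>/active_jobs/<job>.<task>/pid (layout as described in the mail)."""
    return os.path.join(spool_dir, node, "active_jobs", f"{job_id}.{task_id}", "pid")


def is_under(path: str, root: str) -> bool:
    """Lexical check that `path` lies inside `root`; symlinks are not resolved.

    Both arguments must be absolute paths.
    """
    path, root = os.path.normpath(path), os.path.normpath(root)
    return os.path.commonpath([path, root]) == root


if __name__ == "__main__":
    # Hypothetical sample of `qconf -sconf` output; /opt/sge stands in for $SGE_ROOT.
    sample = "execd_spool_dir              /var/spool/sge\nmailer                       /bin/mail\n"
    spool = execd_spool_dir(sample)
    print(expected_pid_path(spool, "dsl46", 950, 1))  # /var/spool/sge/dsl46/active_jobs/950.1/pid
    print("spool under SGE_ROOT:", is_under(spool, "/opt/sge"))  # False
    print("reported pid in spool:", is_under("/tmp/950.1.layout.q/pid", spool))  # False
```

A spool directory under `$SGE_ROOT` is typically on a shared filesystem, whereas one like `/var/spool/sge` is local to each node, which is the distinction the question is after.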

-- Reuti


> Thanks.
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Monday, August 13, 2007 4:53 PM
> To: users at gridengine.sunsource.net
> Subject: [SPAM] Re: [GE users] SGE6.1 error
> Importance: Low
>
>
> Hi,
>
> On 13.08.2007, at 09:56, John_Tai wrote:
>
>> I have recently installed 6.1, but every job is terminated after a
>> while.
>>
>> This is my job from qstat, started as
>> "qrsh -v eda=$cmd -cwd -now n icfb":
>>
>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q@dsl46
>>
>> Here is the message I get from the command line:
>>
>>     error: error reading returncode of remote command
>>
>> This is the qmaster messages:
>>
>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
>>
>> This is the exec host messages:
>>
>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
>>
>> Looking at the qmaster messages, it seems that this happens every
>> hour to the majority of jobs. It doesn't seem to be tied to a
>> particular user or exec host.
>>
>> I hope somebody can help me. I had been using 6.0u7-1 for a long time
>> without problems, but since I moved the qmaster to a new server and
>> installed the latest version, I keep getting this problem.
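To check the "every hour" observation rather than eyeball it, the failures can be bucketed by minute past the hour and by exec host. This is a sketch that assumes each qmaster messages entry is a single unwrapped line with the `date time|daemon|host|level|text` layout visible in the excerpt above, and that failures contain the text `job <id>.<task> failed on host <host>`; the function names are hypothetical.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the excerpt: "MM/DD/YYYY HH:MM:SS|daemon|host|level|text"
FAILED_RE = re.compile(r"job (\d+)\.(\d+) failed on host (\S+)")


def failures(lines):
    """Yield (timestamp, job_task, exec_host) for each qmaster 'failed on host' entry."""
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[1] != "qmaster":
            continue
        match = FAILED_RE.search(fields[4])
        if match is None:
            continue
        when = datetime.strptime(fields[0].strip(), "%m/%d/%Y %H:%M:%S")
        yield when, f"{match.group(1)}.{match.group(2)}", match.group(3)


def summarize(lines):
    """Count failures per minute past the hour and per exec host.

    One dominant minute bucket spread over many hosts would point to an
    hourly trigger rather than a single faulty node.
    """
    by_minute, by_host = Counter(), Counter()
    for when, _job, host in failures(lines):
        by_minute[when.minute] += 1
        by_host[host] += 1
    return by_minute, by_host


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python summarize_failures.py <path to qmaster messages file>
    with open(sys.argv[1]) as fh:
        minutes, hosts = summarize(fh)
    print("failures by minute past the hour:", sorted(minutes.items()))
    print("failures by exec host:", hosts.most_common(10))
```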
>
> If it happens exactly every hour: is there a cron job running that
> cleans /tmp? - Reuti
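If the failures do recur at the same minute of every hour, the cron question can be checked mechanically on each exec host. The sketch below scans crontab text (a user crontab, or `/etc/crontab`-style text with its extra user field when `system_format=True`) and reports entries whose command mentions `/tmp` or a tmp cleaner such as `tmpwatch` or `tmpreaper`. The keyword list and the function name are assumptions, not an exhaustive test.

```python
import re

# Lines such as "SHELL=/bin/sh" or "MAILTO=root" set variables; they are not jobs.
ENV_ASSIGN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")
# Hypothetical keyword list; extend it for the cleaners used on your hosts.
SUSPECTS = ("/tmp", "tmpwatch", "tmpreaper")


def tmp_cleaning_entries(crontab_text, system_format=False):
    """Return (schedule, command) pairs whose command looks like it touches /tmp.

    system_format=True handles /etc/crontab-style lines, which carry a user
    field between the schedule and the command.
    """
    hits = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ENV_ASSIGN.match(line):
            continue
        # "@hourly"-style shortcuts use one schedule field, classic lines use five.
        n_sched = 1 if line.startswith("@") else 5
        n_skip = n_sched + (1 if system_format else 0)
        parts = line.split(None, n_skip)
        if len(parts) <= n_skip:
            continue  # malformed: no command after the schedule
        schedule, command = " ".join(parts[:n_sched]), parts[n_skip]
        if any(word in command for word in SUSPECTS):
            hits.append((schedule, command))
    return hits
```

It only inspects the crontab lines themselves; scripts started via `run-parts` from directories such as `/etc/cron.hourly` would need to be checked separately.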
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
