[GE users] RE: [SPAM] Re: [GE users] SGE6.1 error

Reuti reuti at staff.uni-marburg.de
Tue Aug 14 10:37:31 BST 2007


On 14.08.2007 at 02:42, John_Tai wrote:

> There is nothing else under /tmp related to GE.
>
> Running jobs do have a directory under the spool dir and /tmp.  
> However, when it gets to 3 minutes past the hour, it just disappears.
>
> I didn't do any other local config, apart from the local spool.
>
> I am quite desperate actually, I might have to go back to 6.0.

Just an idea: is there any ulimit defined on the nodes when you
log in? Is sge_execd running without any limits, as real user root
and maybe some other effective user? - Reuti
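
The check suggested above could be sketched roughly as follows, assuming
Linux nodes whose kernel exposes /proc/<pid>/limits and /proc/<pid>/status.
The helper names (parse_limits, parse_uids, report) are made up for
illustration and are not part of Grid Engine.

```python
#!/usr/bin/env python3
"""Print the resource limits and real/effective UID of a process (Linux /proc)."""
import sys
from pathlib import Path


def parse_limits(text):
    """Turn the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:      # first line is the column header
        name = line[:25].strip()            # limit names are padded to 25 columns
        values = line[25:].split()          # soft, hard, and (sometimes) units
        if name and len(values) >= 2:
            limits[name] = (values[0], values[1])
    return limits


def parse_uids(text):
    """Return (real_uid, effective_uid) from the contents of /proc/<pid>/status."""
    for line in text.splitlines():
        if line.startswith("Uid:"):
            real, effective = line.split()[1:3]
            return int(real), int(effective)
    raise ValueError("no Uid: line found")


def report(pid="self"):
    """Build a short report for the given PID (assumes a Linux /proc filesystem)."""
    proc = Path("/proc") / str(pid)
    real, effective = parse_uids((proc / "status").read_text())
    lines = [f"real uid: {real}, effective uid: {effective}"]
    for name, (soft, hard) in parse_limits((proc / "limits").read_text()).items():
        if soft != "unlimited":             # only show limits that are actually set
            lines.append(f"{name}: soft={soft} hard={hard}")
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(report(sys.argv[1]))
```

Saved under a hypothetical name such as check_limits.py, it could be run as
root on an exec host with "python3 check_limits.py $(pgrep -o sge_execd)" and
the result compared with "ulimit -a" from a login shell on the same node.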


>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, August 14, 2007 7:30 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>
>
> On 13.08.2007 at 15:02, John_Tai wrote:
>
> So you mean the shepherd just quits. Is there anything else in /tmp,
> such as error output from the shepherd on the nodes?
>
>> There were no changes in the network (as far as I know) or NFS.
>>
>> The local spool is on the local disk, /data1/sge/spool, not in
>> $SGE_ROOT.
>
> Okay, when you look into /data1/sge/spool/<nodename>/active_jobs while
> a job is running, is there a directory for the job? And the same in
> /tmp, where the queue name is also added to the directory name?
>
> Do you have local configurations for the nodes defined?
>
>> Does the resource in the exit code refer to the /tmp dir, or could
>> it be any other resource?
>
> /tmp is also local - any symbolic link to /data/sge/spool?
>
> -- Reuti
>
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Mon 8/13/2007 20:26
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
>>
>> On 13.08.2007 at 11:16, John_Tai wrote:
>>
>>> There isn't any cronjob running on the exec hosts. Also, it happens
>>> on all my exec hosts (about 70), so I don't think the problem is in
>>> the exec hosts. I think it might be a problem with the GE config or
>>> install?
>>>
>>> Actually, let me correct my previous email. The jobs in GE are
>>> lost, so they no longer show up in qstat. However, the actual
>>> processes are not terminated; they are still running on the exec host.
>>
>> Exit code 11 is "Resource temporarily unavailable" - was there any
>> change to the network/NFS-server with this upgrade?
>>
>> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
>> denied" is not the usual location of the pid - for me it's in
>> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
>>
>> Where is your local SGE spool directory located - local on the nodes
>> or in $SGE_ROOT?
>>
>> -- Reuti
>>
>>
>>> Thanks.
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Monday, August 13, 2007 4:53 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>>> Importance: Low
>>>
>>>
>>> Hi,
>>>
>>> On 13.08.2007 at 09:56, John_Tai wrote:
>>>
>>>> I have recently installed 6.1, but every job is terminated after a
>>>> while.
>>>>
>>>> This is my job from qstat, started as
>>>> "qrsh -v eda=$cmd -cwd -now n icfb":
>>>>
>>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q@dsl46
>>>>
>>>> Here is the message I get from the command line:
>>>>
>>>>     error: error reading returncode of remote command
>>>>
>>>> This is the qmaster messages:
>>>>
>>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
>>>>
>>>> This is the exec host messages:
>>>>
>>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
>>>>
>>>> Looking at the qmaster messages, it seems that this happens every
>>>> hour to the majority of jobs. It doesn't seem to be bound by user
>>>> nor exec host.
>>>>
>>>> Hope somebody can help me. I had been using 6.0u7-1 for a long time
>>>> without problems, but now that I have changed the qmaster server and
>>>> installed the latest version, I keep getting this problem.
>>>
>>> If it's just every hour: is there a cronjob running that cleans
>>> /tmp? - Reuti
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net