[GE users] RE: [SPAM] Re: [GE users] SGE6.1 error

John_Tai John_Tai at smics.com
Tue Aug 14 01:42:42 BST 2007



There is nothing else under /tmp related to GE.

Running jobs do have a directory under the spool dir and under /tmp. However, once the job reaches about an hour and 3 minutes, it just disappears.
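
To pin down exactly when the directories vanish, so the time can be matched against the qmaster and execd messages, something like the following could be left running on an exec host. This is only a sketch: the function name, the polling interval and the example paths in the comment are assumptions for illustration, not anything provided by GE.

```python
import os
import time


def watch_until_gone(paths, interval=10.0, max_checks=1000,
                     sleep=time.sleep, clock=time.time):
    """Poll the given paths and return {path: timestamp} for each path
    at the first check where it was found missing."""
    gone = {}
    for _ in range(max_checks):
        for p in paths:
            if p not in gone and not os.path.exists(p):
                gone[p] = clock()
        if len(gone) == len(paths):
            break  # every watched path has disappeared
        sleep(interval)
    return gone


# Hypothetical usage, with the job 950.1 paths quoted in this thread:
# gone = watch_until_gone(["/tmp/950.1.layout.q",
#                          "/data1/sge/spool/dsl46/active_jobs/950.1"])
# for path, ts in gone.items():
#     print(path, time.ctime(ts))
```

The recorded times are only as precise as the polling interval, so a shorter interval gives a tighter match against the log timestamps.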

I didn't do any other local config, apart from the local spool. 

I am quite desperate actually, I might have to go back to 6.0. 


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Tuesday, August 14, 2007 7:30 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error


On 13.08.2007 at 15:02, John_Tai wrote:

So the shepherd just quits, you mean. Is there anything else in /tmp,
such as error output from the shepherd on the nodes?

> There were no changes in the network (as far as I know) or NFS.
>
> The local spool is in the local disk, /data1/sge/spool, not in the  
> $SGE_ROOT.

Okay, when you look into /data1/sge/spool/<nodename>/active_jobs while
a job is running, is there a directory for the job? And likewise in
/tmp, where the queue name is appended to the directory name?

Do you have local configurations for the nodes defined?

> Does the resource in the exit code refer to the /tmp dir? Or could
> it be any other resource?

/tmp is also local - any symbolic link to /data/sge/spool?
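
One way to answer the symbolic link question on a node is to resolve both paths and compare them. The sketch below uses only the Python standard library; the helper names are invented for illustration, and the paths in the example call are simply the ones mentioned in this thread.

```python
import os


def link_report(path):
    """Describe whether path is a symlink and where it finally resolves."""
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        "resolved": os.path.realpath(path),
    }


def shares_location(a, b):
    """True if a and b resolve to the same directory, or if one of them
    lies inside the other once all symlinks are resolved."""
    ra, rb = os.path.realpath(a), os.path.realpath(b)
    return os.path.commonpath([ra, rb]) in (ra, rb)


if __name__ == "__main__":
    # Paths as mentioned in this thread; adjust for the actual node.
    for p in ("/tmp", "/data1/sge/spool"):
        print(link_report(p))
    print("overlap:", shares_location("/tmp", "/data1/sge/spool"))
```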

-- Reuti


> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Mon 8/13/2007 20:26
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
>
> On 13.08.2007 at 11:16, John_Tai wrote:
>
>> There isn't any cronjob running on exec host. Also it happens on
>> all my exec hosts (about 70) so I don't think the problem is in the
>> exec hosts. I think it should be a problem with GE config or install?
>>
>> Actually, let me correct my previous email. The jobs in GE are
>> lost, so they no longer show up in qstat. However, the actual
>> processes are not terminated; they are still running on the exec host.
>
> Exit code 11 is "Resource temporarily unavailable" - was there any
> change to the network/NFS-server with this upgrade?
>
> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
> denied" is not the usual location of the pid - for me it's in
> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
>
> Where is your local SGE spool directory located - local on the nodes
> or in $SGE_ROOT?
>
> -- Reuti
>
>
>> Thanks.
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Monday, August 13, 2007 4:53 PM
>> To: users at gridengine.sunsource.net
>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>> Importance: Low
>>
>>
>> Hi,
>>
>> On 13.08.2007 at 09:56, John_Tai wrote:
>>
>>> I have recently installed 6.1, but every job is terminated after a
>>> while.
>>>
>>> This is my job from qstat, started as "qrsh -v eda=$cmd -cwd -now n
>>> icfb":
>>>
>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02
>>> layout.q at dsl46
>>>
>>> Here is the message I get from the command line:
>>>
>>>     error: error reading returncode of remote command
>>>
>>> This is the qmaster messages:
>>>
>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host
>>> dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]:
>>> can't open file /tmp/950.1.layout.q/pid: Permission denied
>>>
>>> This is the exec host messages:
>>>
>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited
>>> with exit status = 11
>>>
>>> Looking at the qmaster messages, it seems that this happens every
>>> hour to the majority of jobs. It doesn't seem to be tied to any
>>> particular user or exec host.
>>>
>>> Hope somebody can help me. I had been using 6.0u7-1 for a long time
>>> without problems, but now that I changed qmaster server and
>>> installed the latest version, I keep getting this problem.
>>
>> if it's just every hour: is there a cronjob for cleaning /tmp
>> running? - Reuti
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net