[GE users] RE: [SPAM] Re: [GE users] SGE6.1 error

John_Tai John_Tai at smics.com
Thu Aug 16 01:40:00 BST 2007



Yes, it always happens around the first 7 minutes of each hour. 
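To back this up with numbers, the failure times can be counted per minute of the hour straight from the qmaster messages file. This is only a minimal sketch: the function name `failure_minutes` and the sample lines are made up for illustration, and it assumes each entry sits on one line and starts with the `MM/DD/YYYY HH:MM:SS|` timestamp seen in the excerpts quoted in this thread.

```python
import re
from collections import Counter

# Leading "MM/DD/YYYY HH:MM:SS|" timestamp, as in the quoted qmaster messages.
STAMP = re.compile(r"^\s*\d{2}/\d{2}/\d{4} (\d{2}):(\d{2}):(\d{2})\|")


def failure_minutes(lines, needle="failed on host"):
    """Count lines containing `needle`, keyed by the minute of the hour."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line)
        if match and needle in line:
            counts[int(match.group(2))] += 1
    return counts


# Illustrative input only; in practice pass the opened messages file instead.
sample_lines = [
    "08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job",
    "08/13/2007 16:03:10|qmaster|dsls11|W|job 951.1 failed on host dsl47 general before job",
    "08/13/2007 16:20:00|qmaster|dsls11|I|some unrelated entry",
]

for minute, count in sorted(failure_minutes(sample_lines).items()):
    print(f"minute {minute:02d}: {count} failed job(s)")
```

If nearly all counts fall into the same few minutes, that points to something periodic on the hosts rather than to the jobs themselves.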


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Wednesday, August 15, 2007 9:14 PM
To: users at gridengine.sunsource.net
Subject: [SPAM] Re: [GE users] SGE6.1 error
Importance: Low


As you stated before: it's always happening at <hrs>:01:08 or so?  -  
Reuti

Am 15.08.2007 um 09:34 schrieb John_Tai:

> Maybe the following log can help:
>
> Job 5481 caused action: Job 5481 set to ERROR
>  User        = nellie
>  Queue       = sun.q at designserver
>  Host        = designserver
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before job:08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
> Shepherd trace:
> 08/15/2007 14:37:30 [999:966]: shepherd called with uid = 0, euid = 999
> 08/15/2007 14:37:30 [999:966]: starting up 6.1
> 08/15/2007 14:37:30 [999:966]: setpgid(966, 966) returned 0
> 08/15/2007 14:37:30 [999:966]: no prolog script to start
> 08/15/2007 14:37:30 [999:966]: forked "job" with pid 967
> 08/15/2007 14:37:30 [999:966]: child: job - pid: 967
> 08/15/2007 14:37:30 [999:967]: processing qlogin job
> 08/15/2007 14:37:30 [999:967]: pid=967 pgrp=967 sid=967 old pgrp=966 getlogin()=<no login set>
> 08/15/2007 14:37:30 [999:967]: reading passwd information for user 'root'
> 08/15/2007 14:37:30 [999:967]: setting limits
> 08/15/2007 14:37:30 [999:967]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
> 08/15/2007 14:37:30 [999:967]: setting environment
> 08/15/2007 14:37:30 [999:967]: Initializing error file
> 08/15/2007 14:37:30 [999:967]: switching to intermediate/target user
> 08/15/2007 14:37:30 [407:967]: closing all filedescriptors
> 08/15/2007 14:37:30 [407:967]: further messages are in "error" and "trace"
> 08/15/2007 14:37:30 [0:967]: now running with uid=0, euid=0
> 08/15/2007 14:37:30 [0:967]: start qlogin
> 08/15/2007 14:37:30 [0:967]: calling qlogin_starter(/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1, /home/sge/sge6.1/utilbin/sol-sparc64/rshd -l);
> 08/15/2007 14:37:30 [0:967]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/15/2007 14:37:30 [0:967]: using sfd 1
> 08/15/2007 14:37:30 [0:967]: bound to port 65302
> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - data = 0:65302:/home/sge/sge6.1/utilbin/sol-sparc64:/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1:designserver
> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - address = designserver:65301
> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - host = designserver, port = 65301
> 08/15/2007 14:37:30 [0:967]: waiting for connection.
> 08/15/2007 14:37:30 [0:967]: accepted connection on fd 2
> 08/15/2007 14:37:30 [0:967]: daemon to start: |/home/sge/sge6.1/utilbin/sol-sparc64/rshd -l|
> 08/15/2007 14:37:30 [999:970]: setosjobid: uid = 0, euid = 999
> 08/15/2007 15:01:08 [999:966]: wait3 returned -1
> 08/15/2007 15:01:08 [999:966]: mapped signal TSTP to signal KILL
> 08/15/2007 15:01:08 [999:966]: queued signal KILL
> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - data = 1:can't open file /tmp/5481.1.sun.q/pid: Permission denied
> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - address = designserver:65301
> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - host = designserver, port = 65301
> 08/15/2007 15:01:08 [999:966]: error connecting stream socket: Connection refused
>
> Shepherd error:
> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>
> Shepherd pe_hostfile:
> designserver 1 sun.q at designserver <NULL>
>
>
>
> -----Original Message-----
> From: John_Tai
> Sent: Wednesday, August 15, 2007 11:59 AM
> To: 'users at gridengine.sunsource.net'
> Subject: Re: [GE users] SGE6.1 error
>
>
> I checked with
>
> ps --User root
>
> and sge_execd is by root. Using
>
> ps -ef
>
> the user is sge.
>
> However in my old installation (6.0) the user was always root, even  
> with ps -ef.
>
> Could this be the cause of my problem?
>
> Was this changed from 6.0 to 6.1? Or is this decided during  
> installation?
>
>
>
>
> -----Original Message-----
> From: Rayson Ho [mailto:rayrayson at gmail.com]
> Sent: Wednesday, August 15, 2007 11:46 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>
>
> Check the real uid of sge_execd -- sge_execd switches its effective
> uid between root and the admin account during execution so that it can
> write to NFS directories. The manpage of ps(1) should tell you which
> argument you need to get the real uid... or you can always google for
> it...
>
> Rayson
>
>
>
>
> On 8/14/07, John_Tai <John_Tai at smics.com> wrote:
>> The sge_execd is running as the sge admin account (sge), which is  
>> different from my previous installation (sge6.0). Is this the  
>> cause? How do I revert it to start sge_execd and sge_qmaster as root?
>>
>> Do I have to re-install everything?
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, August 14, 2007 5:38 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>
>>
>> Am 14.08.2007 um 02:42 schrieb John_Tai:
>>
>>> There is nothing else under /tmp related to GE.
>>>
>>> Running jobs do have a directory under the spool dir and /tmp.
>>> However when the hour and 3 mins comes, it just disappears.
>>>
>>> I didn't do any other local config, apart from the local spool.
>>>
>>> I am quite desperate actually, I might have to go back to 6.0.
>>
>> Just an idea: is there any ulimit defined on the nodes when you
>> log in? Is sge_execd running without any limits as real user root
>> and maybe some other effective user? - Reuti
>>
>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, August 14, 2007 7:30 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>
>>>
>>> Am 13.08.2007 um 15:02 schrieb John_Tai:
>>>
>>> So the shepherd just quits, you mean. Is there anything in /tmp in
>>> addition as error output from the shepherd on the nodes?
>>>
>>>> There were no changes in the network (as far as I know) or NFS.
>>>>
>>>> The local spool is in the local disk, /data1/sge/spool, not in the
>>>> $SGE_ROOT.
>>>
>>> Okay, when you look into /data1/sge/spool/<nodename>/active_jobs with
>>> a running job, is there a directory for the job? Same with /tmp,
>>> where in addition the queue name is added?
>>>
>>> Do you have local configurations for the nodes defined?
>>>
>>>> The resource in the exit code, does it refer to the /tmp dir? Or it
>>>> could be any other resource?
>>>
>>> /tmp is also local - any symbolic link to /data/sge/spool?
>>>
>>> -- Reuti
>>>
>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Mon 8/13/2007 20:26
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
>>>>
>>>> Am 13.08.2007 um 11:16 schrieb John_Tai:
>>>>
>>>>> There isn't any cronjob running on the exec host. Also it happens on
>>>>> all my exec hosts (about 70) so I don't think the problem is in the
>>>>> exec hosts. I think it should be a problem with GE config or
>>>>> install?
>>>>>
>>>>> Actually, let me correct my previous email. The jobs in GE are
>>>>> lost, so they are not in the qstat output. However the actual
>>>>> processes are not terminated, they are still running on the exec host.
>>>>
>>>> Exit code 11 is "Resource temporarily unavailable" - was there any
>>>> change to the network/NFS-server with this upgrade?
>>>>
>>>> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
>>>> denied" is not the usual location of the pid - for me it's in
>>>> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
>>>>
>>>> Where is your local SGE spool directory located - local on the nodes
>>>> or in $SGE_ROOT?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Thanks.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: Monday, August 13, 2007 4:53 PM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>>>>> Importance: Low
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Am 13.08.2007 um 09:56 schrieb John_Tai:
>>>>>
>>>>>> I have recently installed 6.1, but every job is terminated after a
>>>>>> while.
>>>>>>
>>>>>> This is my job from qstat, started as "qrsh -v eda=$cmd -cwd -now n icfb":
>>>>>>
>>>>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q at dsl46
>>>>>>
>>>>>> Here is the message I get from the command line:
>>>>>>
>>>>>>     error: error reading returncode of remote command
>>>>>>
>>>>>> This is the qmaster messages:
>>>>>>
>>>>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
>>>>>>
>>>>>> This is the exec host messages:
>>>>>>
>>>>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
>>>>>>
>>>>>> Looking at the qmaster messages, it seems that this happens every
>>>>>> hour to the majority of jobs. It doesn't seem to be bound by user
>>>>>> nor exec host.
>>>>>>
>>>>>> Hope somebody can help me. I had been using 6.0u7-1 for a long time
>>>>>> without problems, but now that I changed qmaster server and
>>>>>> installed the latest version, I keep getting this problem.
>>>>>
>>>>> if it's just every hour: is there a cronjob for cleaning /tmp
>>>>> running? - Reuti
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>
>>>
>>
>
>
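On Rayson's point in the quoted thread about checking the real uid of sge_execd: `ruser` (real user) and `user` (effective user) are both POSIX `ps -o` format specifiers, so one listing can show the two side by side. What follows is a rough sketch, not a tested procedure for these hosts. The helper names `parse_ps_users` and `list_daemon_users` and the sample listing are invented for illustration, and whether a given `ps` prints a bare command name or a full path in the last column is an assumption handled by comparing the basename.

```python
import subprocess


def parse_ps_users(ps_output, command="sge_execd"):
    """Return (real_user, effective_user, pid) for processes named `command`.

    Expects the output of `ps -e -o ruser,user,pid,comm`, header included.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        # Compare the basename so a full path in the last column also matches.
        if len(fields) == 4 and fields[3].split("/")[-1] == command:
            rows.append((fields[0], fields[1], int(fields[2])))
    return rows


def list_daemon_users(command="sge_execd"):
    """Run ps on the local host and parse its output (needs a POSIX ps)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "ruser,user,pid,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_users(result.stdout, command)


# Made-up listing, only to show the expected shape of the input.
sample_listing = (
    "   RUSER     USER   PID COMMAND\n"
    "    root      sge   966 sge_execd\n"
    "    root     root     1 init\n"
)
print(parse_ps_users(sample_listing))
```

A row with different real and effective users would match Rayson's description of sge_execd switching its effective uid while keeping root as the real uid.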

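Since the shepherd error is "can't open file /tmp/5481.1.sun.q/pid: Permission denied", it may help to record what the per-job directory under /tmp looks like while a job is still running and again after the failure. The sketch below is one way to do that, with the caveat that `describe_job_tmpdir` is a hypothetical helper name and that the demo uses a throwaway temporary directory rather than a real job directory.

```python
import os
import stat
import tempfile


def describe_job_tmpdir(path):
    """Return owner, mode and writability of `path`, or None if it is gone."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return None
    # Check against the effective uid where the platform supports it.
    use_effective = os.access in os.supports_effective_ids
    return {
        "is_dir": stat.S_ISDIR(info.st_mode),
        "owner_uid": info.st_uid,
        "mode": stat.filemode(info.st_mode),
        "writable": os.access(path, os.W_OK | os.X_OK,
                              effective_ids=use_effective),
    }


# Demo on a temporary directory; for a real check pass a job directory such
# as /tmp/<job_id>.<task_id>.<queue> on the exec host.
with tempfile.TemporaryDirectory() as demo_dir:
    print(describe_job_tmpdir(demo_dir))
    print(describe_job_tmpdir(os.path.join(demo_dir, "missing")))
```

Run as the admin user (euid 999 in the trace), this would show whether the directory has vanished, which returns None, or still exists with an owner or mode that the shepherd cannot write to.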