[GE users] SGE6.1 error

John_Tai John_Tai at smics.com
Fri Aug 17 01:36:19 BST 2007


This log is from a SUN server, but all the Linux servers have the same problem.

Since 6.0 was working fine before on a different qmaster machine (PC, Intel), I assume it's either something to do with the new 6.1 or with the new qmaster machine (AMD server).

I wonder if upgrading to 6.1u2 would help at all. I've tried re-installing sge_execd and rebooting the qmaster. Next I guess I'll have to try upgrading or re-installing the qmaster, and changing machines again. Any other suggestions?
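
One way to check whether the failures really cluster in the first minutes of each hour is to bucket the qmaster messages by minute of the hour. Below is a minimal sketch in Python, assuming the messages file uses the pipe-separated layout seen in this thread (date time|component|host|level|text). The function name and the last two sample lines are made up for illustration; only the first sample line is taken from the actual log.

```python
from collections import Counter
from datetime import datetime


def failure_minutes(lines):
    """Count 'failed on host' records by the minute of the hour they were logged."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split("|", 4)
        if len(fields) != 5:
            continue  # not a complete record, e.g. a wrapped continuation line
        stamp, _component, _host, _level, text = fields
        if "failed on host" not in text:
            continue
        try:
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # timestamp not in the assumed format
        counts[when.minute] += 1
    return counts


sample = [
    # Taken from the qmaster messages quoted in this thread.
    "08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 "
    "general before job because: 08/13/2007 15:03:31 [999:20475]: "
    "can't open file /tmp/950.1.layout.q/pid: Permission denied",
    # Hypothetical lines in the same layout, for illustration only.
    "08/13/2007 16:01:10|qmaster|dsls11|W|job 0.1 failed on host examplehost (hypothetical)",
    "08/13/2007 16:30:00|qmaster|dsls11|I|some unrelated message (hypothetical)",
]

for minute, count in sorted(failure_minutes(sample).items()):
    print(f"minute {minute:02d}: {count} failure(s)")
```

To run it against a real file, pass the open file object instead of `sample`, for example `failure_minutes(open(path))` with `path` pointing at the qmaster messages file. If nearly all counts fall in the first few minutes, that points to something scheduled hourly.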




-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Thursday, August 16, 2007 8:33 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE6.1 error


Then I would assume that there is a reason why TSTP is generated.
By default, the shepherd will map this to a KILL.

As I don't see this signal generated on Linux: is there anyone who  
knows how and why this might be generated on a SUN to a terminal  
process?

-- Reuti


On 16.08.2007 at 02:40, John_Tai wrote:

> Yes, it always happens around the first 7 minutes of each hour.
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, August 15, 2007 9:14 PM
> To: users at gridengine.sunsource.net
> Subject: [SPAM] Re: [GE users] SGE6.1 error
> Importance: Low
>
>
> As you stated before: it's always happening at <hrs>:01:08 or so?  -
> Reuti
>
> On 15.08.2007 at 09:34, John_Tai wrote:
>
>> Maybe the following log can help:
>>
>> Job 5481 caused action: Job 5481 set to ERROR
>>  User        = nellie
>>  Queue       = sun.q@designserver
>>  Host        = designserver
>>  Start Time  = <unknown>
>>  End Time    = <unknown>
>> failed before job:08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>> Shepherd trace:
>> 08/15/2007 14:37:30 [999:966]: shepherd called with uid = 0, euid = 999
>> 08/15/2007 14:37:30 [999:966]: starting up 6.1
>> 08/15/2007 14:37:30 [999:966]: setpgid(966, 966) returned 0
>> 08/15/2007 14:37:30 [999:966]: no prolog script to start
>> 08/15/2007 14:37:30 [999:966]: forked "job" with pid 967
>> 08/15/2007 14:37:30 [999:966]: child: job - pid: 967
>> 08/15/2007 14:37:30 [999:967]: processing qlogin job
>> 08/15/2007 14:37:30 [999:967]: pid=967 pgrp=967 sid=967 old pgrp=966 getlogin()=<no login set>
>> 08/15/2007 14:37:30 [999:967]: reading passwd information for user 'root'
>> 08/15/2007 14:37:30 [999:967]: setting limits
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>> 08/15/2007 14:37:30 [999:967]: setting environment
>> 08/15/2007 14:37:30 [999:967]: Initializing error file
>> 08/15/2007 14:37:30 [999:967]: switching to intermediate/target user
>> 08/15/2007 14:37:30 [407:967]: closing all filedescriptors
>> 08/15/2007 14:37:30 [407:967]: further messages are in "error" and "trace"
>> 08/15/2007 14:37:30 [0:967]: now running with uid=0, euid=0
>> 08/15/2007 14:37:30 [0:967]: start qlogin
>> 08/15/2007 14:37:30 [0:967]: calling qlogin_starter(/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1, /home/sge/sge6.1/utilbin/sol-sparc64/rshd -l);
>> 08/15/2007 14:37:30 [0:967]: uid = 0, euid = 0, gid = 0, egid = 0
>> 08/15/2007 14:37:30 [0:967]: using sfd 1
>> 08/15/2007 14:37:30 [0:967]: bound to port 65302
>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - data = 0:65302:/home/sge/sge6.1/utilbin/sol-sparc64:/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1:designserver
>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - address = designserver:65301
>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - host = designserver, port = 65301
>> 08/15/2007 14:37:30 [0:967]: waiting for connection.
>> 08/15/2007 14:37:30 [0:967]: accepted connection on fd 2
>> 08/15/2007 14:37:30 [0:967]: daemon to start: |/home/sge/sge6.1/utilbin/sol-sparc64/rshd -l|
>> 08/15/2007 14:37:30 [999:970]: setosjobid: uid = 0, euid = 999
>> 08/15/2007 15:01:08 [999:966]: wait3 returned -1
>> 08/15/2007 15:01:08 [999:966]: mapped signal TSTP to signal KILL
>> 08/15/2007 15:01:08 [999:966]: queued signal KILL
>> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - data = 1:can't open file /tmp/5481.1.sun.q/pid: Permission denied
>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - address = designserver:65301
>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - host = designserver, port = 65301
>> 08/15/2007 15:01:08 [999:966]: error connecting stream socket: Connection refused
>>
>> Shepherd error:
>> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>>
>> Shepherd pe_hostfile:
>> designserver 1 sun.q@designserver <NULL>
>>
>>
>>
>> -----Original Message-----
>> From: John_Tai
>> Sent: Wednesday, August 15, 2007 11:59 AM
>> To: 'users at gridengine.sunsource.net'
>> Subject: Re: [GE users] SGE6.1 error
>>
>>
>> I checked with
>>
>> ps --User root
>>
>> and sge_execd is listed as running as root. Using
>>
>> ps -ef
>>
>> the user shown is sge.
>>
>> However in my old installation (6.0) the user was always root, even
>> with ps -ef.
>>
>> Could this be the cause of my problem?
>>
>> Was this changed from 6.0 to 6.1? Or is this decided during
>> installation?
>>
>>
>>
>>
>> -----Original Message-----
>> From: Rayson Ho [mailto:rayrayson at gmail.com]
>> Sent: Wednesday, August 15, 2007 11:46 AM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>
>>
>> Check the real uid of sge_execd -- sge_execd switches its effective
>> uid between root and the admin account during execution so that it can
>> write to NFS directories. The manpage of ps(1) should tell you which
>> argument you need to get the real uid... or you can always google for
>> it...
>>
>> Rayson
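
What Rayson describes can be checked on a Linux exec host by comparing the real and effective uid of the daemon. Below is a minimal sketch in Python, assuming a Linux /proc filesystem where /proc/<pid>/status has a "Uid:" line listing the real, effective, saved and filesystem uids in that order. It will not work as-is on Solaris. The function name and the sample text are made up for illustration; the uid 999 is the euid that appears in the shepherd trace.

```python
def real_and_effective_uid(status_text):
    """Return (real_uid, effective_uid) parsed from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            # Fields after the label: real, effective, saved set, filesystem.
            real, effective = line.split()[1:3]
            return int(real), int(effective)
    raise ValueError("no Uid: line found")


# Hypothetical excerpt: real uid root (0), effective uid the admin account (999).
example = "Name:\tsge_execd\nUid:\t0\t999\t0\t999\nGid:\t0\t999\t0\t999\n"
real, effective = real_and_effective_uid(example)
print(f"real uid = {real}, effective uid = {effective}")
```

For a live process, the text can be read with `open(f"/proc/{pid}/status").read()`, where `pid` is the process id of sge_execd. A real uid of 0 with a non-zero effective uid would match the behaviour described above.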
>>
>>
>>
>>
>> On 8/14/07, John_Tai <John_Tai at smics.com> wrote:
>>> The sge_execd is running as the sge admin account (sge), which is
>>> different from my previous installation (sge6.0). Is this the
>>> cause? How do I revert it to start sge_execd and sge_qmaster as root?
>>>
>>> Do I have to re-install everything?
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, August 14, 2007 5:38 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>
>>>
>>> On 14.08.2007 at 02:42, John_Tai wrote:
>>>
>>>> There is nothing else under /tmp related to GE.
>>>>
>>>> Running jobs do have a directory under the spool dir and /tmp.
>>>> However, at about 3 minutes past the hour, it just disappears.
>>>>
>>>> I didn't do any other local config, apart from the local spool.
>>>>
>>>> I am quite desperate actually, I might have to go back to 6.0.
>>>
>>> Just an idea: is there any ulimit defined on the nodes when you
>>> log in? Is sge_execd running without any limits, with real user root
>>> and maybe some other effective user? - Reuti
>>>
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Tuesday, August 14, 2007 7:30 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>>
>>>>
>>>> On 13.08.2007 at 15:02, John_Tai wrote:
>>>>
>>>> So you mean the shepherd just quits. Is there anything in /tmp in
>>>> addition, such as error output from the shepherd on the nodes?
>>>>
>>>>> There were no changes in the network (as far as I know) or NFS.
>>>>>
>>>>> The local spool is in the local disk, /data1/sge/spool, not in the
>>>>> $SGE_ROOT.
>>>>
>>>> Okay, when you look into /data1/sge/spool/<nodename>/active_jobs
>>>> with a running job, is there a directory for the job? Is it the same
>>>> with /tmp, where the queue name is added in addition?
>>>>
>>>> Do you have local configurations for the nodes defined?
>>>>
>>>>> Does the resource in the exit code refer to the /tmp dir? Or could
>>>>> it be any other resource?
>>>>
>>>> /tmp is also local - any symbolic link to /data/sge/spool?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: Mon 8/13/2007 20:26
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
>>>>>
>>>>> On 13.08.2007 at 11:16, John_Tai wrote:
>>>>>
>>>>>> There isn't any cronjob running on the exec hosts. Also, it happens on
>>>>>> all my exec hosts (about 70), so I don't think the problem is in the
>>>>>> exec hosts. I think it may be a problem with the GE config or install.
>>>>>>
>>>>>> Actually, let me correct my previous email. The jobs in GE are
>>>>>> lost, so they are not in qstat. However, the actual processes
>>>>>> are not terminated; they are still running on the exec host.
>>>>>
>>>>> Exit code 11 is "Resource temporarily unavailable" - was there any
>>>>> change to the network/NFS-server with this upgrade?
>>>>>
>>>>> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
>>>>> denied" is not the usual location of the pid - for me it's in
>>>>> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
>>>>>
>>>>> Where is your local SGE spool directory located - local on the
>>>>> nodes or in $SGE_ROOT?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>> Sent: Monday, August 13, 2007 4:53 PM
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>>>>>> Importance: Low
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 13.08.2007 at 09:56, John_Tai wrote:
>>>>>>
>>>>>>> I have recently installed 6.1, but every job is terminated
>>>>>>> after a while.
>>>>>>>
>>>>>>> This is my job from qstat, started as
>>>>>>> "qrsh -v eda=$cmd -cwd -now n icfb":
>>>>>>>
>>>>>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q@dsl46
>>>>>>>
>>>>>>> Here is the message I get from the command line:
>>>>>>>
>>>>>>>     error: error reading returncode of remote command
>>>>>>>
>>>>>>> This is the qmaster messages:
>>>>>>>
>>>>>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
>>>>>>>
>>>>>>> This is the exec host messages:
>>>>>>>
>>>>>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
>>>>>>>
>>>>>>> Looking at the qmaster messages, it seems that this happens every
>>>>>>> hour to the majority of jobs. It doesn't seem to be bound to any
>>>>>>> particular user or exec host.
>>>>>>>
>>>>>>> Hope somebody can help me. I had been using 6.0u7-1 for a long
>>>>>>> time without problems, but now that I have changed the qmaster
>>>>>>> server and installed the latest version, I keep getting this problem.
>>>>>>
>>>>>> If it's just every hour: is there a cronjob for cleaning /tmp
>>>>>> running? - Reuti
>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




