[GE users] SGE6.1 error

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Thu Aug 16 14:37:29 BST 2007


Could this be caused by suspend_threshold? By default, I think SIGSTOP is
sent to suspend the job.
Does adding -verbose to the qrsh submission give any hints?
regards,
~Ravi
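
In case it helps to narrow down when TSTP arrives, here is a minimal sketch
for pulling the signal mapping lines out of a shepherd trace. The sample lines
are copied from the trace quoted below. The function name
find_signal_mappings and the regular expression are my own assumptions about
the line format; they are not part of SGE.

```python
import re

# Assumed line format, based only on the trace quoted in this thread:
#   MM/DD/YYYY HH:MM:SS [uid:pid]: mapped signal X to signal Y
MAPPED = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: "
    r"mapped signal (\w+) to signal (\w+)"
)


def find_signal_mappings(lines):
    """Return (timestamp, pid, received, delivered) for each mapping line."""
    hits = []
    for line in lines:
        m = MAPPED.match(line.strip())
        if m:
            stamp, _uid, pid, received, delivered = m.groups()
            hits.append((stamp, int(pid), received, delivered))
    return hits


# Lines taken from the shepherd trace of job 5481 quoted in this thread.
sample = [
    "08/15/2007 14:37:30 [999:970]: setosjobid: uid = 0, euid = 999",
    "08/15/2007 15:01:08 [999:966]: wait3 returned -1",
    "08/15/2007 15:01:08 [999:966]: mapped signal TSTP to signal KILL",
    "08/15/2007 15:01:08 [999:966]: queued signal KILL",
]

for hit in find_signal_mappings(sample):
    print(hit)
```

Running this over the trace files of several failed jobs should show whether
it is always TSTP that arrives and always at the same offset into the hour.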

Reuti wrote:
> Then I would assume that there is a reason why TSTP is generated. The
> shepherd will map this to a KILL by default.
>
> As I don't see this signal generated on Linux: does anyone know how and
> why it might be sent to a terminal process on a Sun machine?
>
> -- Reuti
>
>
> On 16.08.2007 at 02:40, John_Tai wrote:
>
>> Yes, it always happens around the first 7 minutes of each hour.
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wednesday, August 15, 2007 9:14 PM
>> To: users at gridengine.sunsource.net
>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>> Importance: Low
>>
>>
>> As you stated before: it's always happening at <hrs>:01:08 or so?  -
>> Reuti
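
To check the "shortly after the full hour" pattern against more log entries, a
minimal sketch like the following could be used. It only computes how far past
the hour each failure happened. The two timestamps are the failure times that
appear in the quoted logs; the helper name minute_offsets is hypothetical.

```python
from datetime import datetime


def minute_offsets(stamps):
    """Return (minute, second) past the hour for 'MM/DD/YYYY HH:MM:SS' stamps."""
    out = []
    for s in stamps:
        t = datetime.strptime(s, "%m/%d/%Y %H:%M:%S")
        out.append((t.minute, t.second))
    return out


# The two failure times that appear in the logs quoted in this thread.
failures = ["08/13/2007 15:03:31", "08/15/2007 15:01:08"]
offsets = minute_offsets(failures)
print(offsets)

# True if every failure falls within the first 7 minutes of an hour.
print(all(minute < 7 for minute, _second in offsets))
```

With more timestamps taken from the qmaster messages file, the same check
would show whether the failures really cluster right after the full hour.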
>>
>> On 15.08.2007 at 09:34, John_Tai wrote:
>>
>>> Maybe the following log can help:
>>>
>>> Job 5481 caused action: Job 5481 set to ERROR
>>>  User        = nellie
>>>  Queue       = sun.q at designserver
>>>  Host        = designserver
>>>  Start Time  = <unknown>
>>>  End Time    = <unknown>
>>> failed before job:08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>>> Shepherd trace:
>>> 08/15/2007 14:37:30 [999:966]: shepherd called with uid = 0, euid = 999
>>> 08/15/2007 14:37:30 [999:966]: starting up 6.1
>>> 08/15/2007 14:37:30 [999:966]: setpgid(966, 966) returned 0
>>> 08/15/2007 14:37:30 [999:966]: no prolog script to start
>>> 08/15/2007 14:37:30 [999:966]: forked "job" with pid 967
>>> 08/15/2007 14:37:30 [999:966]: child: job - pid: 967
>>> 08/15/2007 14:37:30 [999:967]: processing qlogin job
>>> 08/15/2007 14:37:30 [999:967]: pid=967 pgrp=967 sid=967 old pgrp=966 getlogin()=<no login set>
>>> 08/15/2007 14:37:30 [999:967]: reading passwd information for user 'root'
>>> 08/15/2007 14:37:30 [999:967]: setting limits
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_CPU setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_FSIZE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_DATA setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_STACK setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_CORE setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: RLIMIT_VMEM setting: (soft 18446744073709551613 hard 18446744073709551613) resulting: (soft 18446744073709551613 hard 18446744073709551613)
>>> 08/15/2007 14:37:30 [999:967]: setting environment
>>> 08/15/2007 14:37:30 [999:967]: Initializing error file
>>> 08/15/2007 14:37:30 [999:967]: switching to intermediate/target user
>>> 08/15/2007 14:37:30 [407:967]: closing all filedescriptors
>>> 08/15/2007 14:37:30 [407:967]: further messages are in "error" and "trace"
>>> 08/15/2007 14:37:30 [0:967]: now running with uid=0, euid=0
>>> 08/15/2007 14:37:30 [0:967]: start qlogin
>>> 08/15/2007 14:37:30 [0:967]: calling qlogin_starter(/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1, /home/sge/sge6.1/utilbin/sol-sparc64/rshd -l);
>>> 08/15/2007 14:37:30 [0:967]: uid = 0, euid = 0, gid = 0, egid = 0
>>> 08/15/2007 14:37:30 [0:967]: using sfd 1
>>> 08/15/2007 14:37:30 [0:967]: bound to port 65302
>>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - data = 0:65302:/home/sge/sge6.1/utilbin/sol-sparc64:/home/sge/sge6.1/cell1/spool/designserver/active_jobs/5481.1:designserver
>>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - address = designserver:65301
>>> 08/15/2007 14:37:30 [0:967]: write_to_qrsh - host = designserver, port = 65301
>>> 08/15/2007 14:37:30 [0:967]: waiting for connection.
>>> 08/15/2007 14:37:30 [0:967]: accepted connection on fd 2
>>> 08/15/2007 14:37:30 [0:967]: daemon to start: |/home/sge/sge6.1/utilbin/sol-sparc64/rshd -l|
>>> 08/15/2007 14:37:30 [999:970]: setosjobid: uid = 0, euid = 999
>>> 08/15/2007 15:01:08 [999:966]: wait3 returned -1
>>> 08/15/2007 15:01:08 [999:966]: mapped signal TSTP to signal KILL
>>> 08/15/2007 15:01:08 [999:966]: queued signal KILL
>>> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - data = 1:can't open file /tmp/5481.1.sun.q/pid: Permission denied
>>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - address = designserver:65301
>>> 08/15/2007 15:01:08 [999:966]: write_to_qrsh - host = designserver, port = 65301
>>> 08/15/2007 15:01:08 [999:966]: error connecting stream socket: Connection refused
>>>
>>> Shepherd error:
>>> 08/15/2007 15:01:08 [999:966]: can't open file /tmp/5481.1.sun.q/pid: Permission denied
>>>
>>> Shepherd pe_hostfile:
>>> designserver 1 sun.q at designserver <NULL>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: John_Tai
>>> Sent: Wednesday, August 15, 2007 11:59 AM
>>> To: 'users at gridengine.sunsource.net'
>>> Subject: Re: [GE users] SGE6.1 error
>>>
>>>
>>> I checked with
>>>
>>> ps --User root
>>>
>>> and sge_execd is listed as running under root. With
>>>
>>> ps -ef
>>>
>>> the user shown is sge.
>>>
>>> However, in my old installation (6.0) the user was always root, even
>>> with ps -ef.
>>>
>>> Could this be the cause of my problem?
>>>
>>> Was this changed from 6.0 to 6.1? Or is this decided during
>>> installation?
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Rayson Ho [mailto:rayrayson at gmail.com]
>>> Sent: Wednesday, August 15, 2007 11:46 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>
>>>
>>> Check the real uid of sge_execd -- sge_execd switches its effective
>>> uid between root and the admin account during execution so that it can
>>> write to NFS directories. The manpage of ps(1) should tell you which
>>> argument you need to get the real uid... or you can always google for
>>> it...
>>>
>>> Rayson
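
To make the real/effective distinction visible in one listing, here is a
minimal sketch that asks ps for both columns. It assumes the ps on the exec
host accepts the POSIX-style "-o ruser= -o user= -o pid= -o comm=" options;
I have not verified this on every platform. The names parse_ps and
list_processes are hypothetical, and the sample line (including the PID) is
made up to match what John described, not real output.

```python
import subprocess


def parse_ps(text):
    """Parse lines of 'ruser user pid comm' into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4:
            ruser, euser, pid, comm = parts
            rows.append(
                {"ruser": ruser, "euser": euser, "pid": int(pid), "comm": comm}
            )
    return rows


def list_processes():
    """Run ps with POSIX-style -o columns; empty headers suppress the header."""
    cmd = ["ps", "-e", "-o", "ruser=", "-o", "user=", "-o", "pid=", "-o", "comm="]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)


# Hypothetical sample: real user root, effective user sge, invented PID.
sample = "root sge 1234 sge_execd\n"
for row in parse_ps(sample):
    if row["comm"].endswith("sge_execd"):
        print(row["ruser"], row["euser"])
```

On a live host, list_processes() would replace the sample text. If the real
user of sge_execd is root and only the effective user is sge, that matches the
behaviour Rayson describes.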
>>>
>>>
>>>
>>>
>>> On 8/14/07, John_Tai <John_Tai at smics.com> wrote:
>>>> The sge_execd is running as the sge admin account (sge), which is
>>>> different from my previous installation (sge6.0). Is this the
>>>> cause? How do I revert it to start sge_execd and sge_qmaster as root?
>>>>
>>>> Do I have to re-install everything?
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: Tuesday, August 14, 2007 5:38 PM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>>
>>>>
>>>> On 14.08.2007 at 02:42, John_Tai wrote:
>>>>
>>>>> There is nothing else under /tmp related to GE.
>>>>>
>>>>> Running jobs do have a directory under the spool dir and /tmp.
>>>>> However, at about three minutes past the hour, it just disappears.
>>>>>
>>>>> I didn't do any other local config, apart from the local spool.
>>>>>
>>>>> I am quite desperate actually, I might have to go back to 6.0.
>>>>
>>>> Just an idea: is there any ulimit defined on the nodes when you
>>>> log in? Is sge_execd running without any limits, as real user root
>>>> and possibly some other effective user? - Reuti
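
On the limits: the value 18446744073709551613 that the shepherd trace prints
for every RLIMIT is simply -3 when read as a signed 64-bit number. If I
remember the Solaris headers correctly, RLIM_INFINITY is defined as -3 in a
64-bit build, which would mean the job runs without limits; please verify that
against <sys/resource.h> on your hosts. The arithmetic itself can be checked
like this:

```python
# Soft and hard limit printed for each RLIMIT_* line in the shepherd trace.
value = 18446744073709551613

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer.
signed = value - 2**64 if value >= 2**63 else value
print(signed)
```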
>>>>
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: Tuesday, August 14, 2007 7:30 AM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>>>>>
>>>>>
>>>>> On 13.08.2007 at 15:02, John_Tai wrote:
>>>>>
>>>>> So you mean the shepherd just quits. Is there anything in /tmp in
>>>>> addition, as error output from the shepherd on the nodes?
>>>>>
>>>>>> There were no changes in the network (as far as I know) or NFS.
>>>>>>
>>>>>> The local spool is in the local disk, /data1/sge/spool, not in the
>>>>>> $SGE_ROOT.
>>>>>
>>>>> Okay, when you look into /data1/sge/spool/<nodename>/active_jobs
>>>>> with a running job, there is a directory for the job? Same with
>>>>> /tmp, where in addition the queuename is added?
>>>>>
>>>>> Do you have local configurations for the nodes defined?
>>>>>
>>>>>> The resource in the exit code, does it refer to the /tmp dir? Or it
>>>>>> could be any other resource?
>>>>>
>>>>> /tmp is also local - any symbolic link to /data/sge/spool?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>> Sent: Mon 8/13/2007 20:26
>>>>>> To: users at gridengine.sunsource.net
>>>>>> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
>>>>>>
>>>>>> On 13.08.2007 at 11:16, John_Tai wrote:
>>>>>>
>>>>>>> There isn't any cronjob running on the exec host. Also, it happens
>>>>>>> on all my exec hosts (about 70), so I don't think the problem is
>>>>>>> in the exec hosts. I think it should be a problem with the GE
>>>>>>> config or install?
>>>>>>>
>>>>>>> Actually, let me correct my previous email. The jobs in GE are
>>>>>>> lost, so they no longer appear in qstat. However, the actual
>>>>>>> processes are not terminated; they are still running on the exec host.
>>>>>>
>>>>>> Exit code 11 is "Resource temporarily unavailable" - was there any
>>>>>> change to the network/NFS-server with this upgrade?
>>>>>>
>>>>>> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
>>>>>> denied" is not the usual location of the pid - for me it's in
>>>>>> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
>>>>>>
>>>>>> Where is your local SGE spool directory located - local on the
>>>>>> nodes or in $SGE_ROOT?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>>>> Sent: Monday, August 13, 2007 4:53 PM
>>>>>>> To: users at gridengine.sunsource.net
>>>>>>> Subject: [SPAM] Re: [GE users] SGE6.1 error
>>>>>>> Importance: Low
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 13.08.2007 at 09:56, John_Tai wrote:
>>>>>>>
>>>>>>>> I have recently installed 6.1, but every job is terminated
>>>>>>>> after a while.
>>>>>>>>
>>>>>>>> This is my job from qstat, started as "qrsh -v eda=$cmd -cwd -now n icfb":
>>>>>>>>
>>>>>>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q at dsl46
>>>>>>>>
>>>>>>>> Here is the message I get from the command line:
>>>>>>>>
>>>>>>>>     error: error reading returncode of remote command
>>>>>>>>
>>>>>>>> This is the qmaster messages:
>>>>>>>>
>>>>>>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
>>>>>>>>
>>>>>>>> This is the exec host messages:
>>>>>>>>
>>>>>>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
>>>>>>>>
>>>>>>>> Looking at the qmaster messages, it seems that this happens every
>>>>>>>> hour to the majority of jobs. It doesn't seem to be tied to a
>>>>>>>> particular user or exec host.
>>>>>>>>
>>>>>>>> Hope somebody can help me. I had been using 6.0u7-1 for a long
>>>>>>>> time without problems, but now that I have changed the qmaster
>>>>>>>> server and installed the latest version, I keep getting this problem.
>>>>>>>
>>>>>>> If it's just every hour: is there a cronjob for cleaning /tmp
>>>>>>> running? - Reuti
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
