[GE users] SGE6.1 error

John_Tai John_Tai at smics.com
Wed Aug 15 04:59:04 BST 2007


I checked with 

ps --User root

and sge_execd is listed as running as root. Using

ps -ef

the user shown is sge.

However, in my old installation (6.0) the user was always root, even with ps -ef.

Could this be the cause of my problem?

Was this changed from 6.0 to 6.1? Or is this decided during installation? 




-----Original Message-----
From: Rayson Ho [mailto:rayrayson at gmail.com]
Sent: Wednesday, August 15, 2007 11:46 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error


Check the real uid of sge_execd -- sge_execd switches its effective
uid between root and the admin account during execution so that it can
write to NFS directories. The manpage of ps(1) should tell you which
argument you need to get the real uid... or you can always google for
it...
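
A rough sketch of that check, added for illustration and not part of the original reply: POSIX ps can print the real user (`ruser`) and the effective user (`user`) side by side. The helper name `uids_of` is hypothetical, and the sketch assumes a ps whose `comm` column shows the bare command name, as on Linux.

```shell
# uids_of NAME: read "pid ruser user comm" rows on stdin and report the
# real and effective user of each process whose command name is NAME.
# (uids_of is a hypothetical helper name, not an SGE or ps feature.)
uids_of() {
    awk -v name="$1" '$4 == name {
        printf "pid=%s real=%s effective=%s\n", $1, $2, $3
    }'
}
```

On an exec host it could be fed with `ps -e -o pid= -o ruser= -o user= -o comm= | uids_of sge_execd`. A real user of root combined with an effective user of the admin account would be consistent with the uid switching described above.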

Rayson




On 8/14/07, John_Tai <John_Tai at smics.com> wrote:
> The sge_execd is running as the sge admin account (sge), which is different from my previous installation (sge6.0). Is this the cause? How do I revert it to start sge_execd and sge_qmaster as root?
>
> Do I have to re-install everything?
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, August 14, 2007 5:38 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
>
>
> On 14.08.2007 at 02:42, John_Tai wrote:
>
> > There is nothing else under /tmp related to GE.
> >
> > Running jobs do have a directory under the spool dir and /tmp.
> > However, at 3 minutes past the hour, it just disappears.
> >
> > I didn't do any other local config, apart from the local spool.
> >
> > I am quite desperate actually, I might have to go back to 6.0.
>
> Just an idea: is there any ulimit defined on the nodes when you
> log in? Is sge_execd running without any limits, as real user root
> and possibly some other effective user? - Reuti
>
>
> >
> > -----Original Message-----
> > From: Reuti [mailto:reuti at staff.uni-marburg.de]
> > Sent: Tuesday, August 14, 2007 7:30 AM
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] RE: [SPAM] Re: [GE users] SGE6.1 error
> >
> >
> > On 13.08.2007 at 15:02, John_Tai wrote:
> >
> > So you mean the shepherd just quits. Is there anything in /tmp in
> > addition, such as error output from the shepherd on the nodes?
> >
> >> There were no changes in the network (as far as I know) or NFS.
> >>
> >> The local spool is in the local disk, /data1/sge/spool, not in the
> >> $SGE_ROOT.
> >
> > Okay, when you look into /data1/sge/spool/<nodename>/active_jobs while
> > a job is running, is there a directory for the job? And likewise in
> > /tmp, where the queue name is appended to the directory name?
> >
> > Do you have local configurations for the nodes defined?
> >
> >> The resource in the exit code, does it refer to the /tmp dir? Or
> >> could it be any other resource?
> >
> > /tmp is also local? Any symbolic link to /data1/sge/spool?
> >
> > -- Reuti
> >
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: Mon 8/13/2007 20:26
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] RE: [SPAM]  Re: [GE users] SGE6.1 error
> >>
> >> On 13.08.2007 at 11:16, John_Tai wrote:
> >>
> >>> There isn't any cronjob running on exec host. Also it happens on
> >>> all my exec hosts (about 70) so I don't think the problem is in the
> >>> exec hosts. I think it should be a problem with GE config or
> >>> install?
> >>>
> >>> Actually, let me correct my previous email. The jobs in GE are
> >>> lost, so they no longer appear in qstat. However, the actual
> >>> processes are not terminated; they are still running on the exec host.
> >>
> >> Exit code 11 is "Resource temporarily unavailable" - was there any
> >> change to the network/NFS-server with this upgrade?
> >>
> >> One thing I wonder about: "/tmp/950.1.layout.q/pid: Permission
> >> denied" is not the usual location of the pid - for me it's in
> >> /var/spool/sge/<node_name>/active_jobs/<job_id.task_id>/pid.
> >>
> >> Where is your local SGE spool directory located - local on the nodes
> >> or in $SGE_ROOT?
> >>
> >> -- Reuti
> >>
> >>
> >>> Thanks.
> >>>
> >>> -----Original Message-----
> >>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> >>> Sent: Monday, August 13, 2007 4:53 PM
> >>> To: users at gridengine.sunsource.net
> >>> Subject: [SPAM] Re: [GE users] SGE6.1 error
> >>> Importance: Low
> >>>
> >>>
> >>> Hi,
> >>>
> >>> On 13.08.2007 at 09:56, John_Tai wrote:
> >>>
> >>>> I have recently installed 6.1, but every job is terminated after a
> >>>> while.
> >>>>
> >>>> This is my job from qstat, started as "qrsh -v eda=$cmd -cwd -now n icfb":
> >>>>
> >>>>     950 0.55500 icfb       johnt        r     08/13/2007 14:48:02 layout.q@dsl46
> >>>>
> >>>> Here is the message I get from the command line:
> >>>>
> >>>>     error: error reading returncode of remote command
> >>>>
> >>>> This is the qmaster messages:
> >>>>
> >>>>     08/13/2007 15:03:34|qmaster|dsls11|W|job 950.1 failed on host dsl46 general before job because: 08/13/2007 15:03:31 [999:20475]: can't open file /tmp/950.1.layout.q/pid: Permission denied
> >>>>
> >>>> This is the exec host messages:
> >>>>
> >>>>     08/13/2007 15:03:31|execd|dsl46|E|shepherd of job 950.1 exited with exit status = 11
> >>>>
> >>>> Looking at the qmaster messages, it seems that this happens every
> >>>> hour to the majority of jobs. It doesn't seem to be tied to a
> >>>> particular user or exec host.
> >>>>
> >>>> Hope somebody can help me. I had been using 6.0u7-1 for a long time
> >>>> without problems, but now that I have changed the qmaster server and
> >>>> installed the latest version, I keep getting this problem.
> >>>
> >>> If it happens every hour: is there a cron job running that
> >>> cleans /tmp? - Reuti
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>> For additional commands, e-mail: users-help at gridengine.sunsource.net



