Custom Query (431 matches)

Results (34 - 36 of 431)

Ticket Resolution Summary Owner Reporter
#179 fixed IZ1061: Trace file does not get new data after chown over NFS uddeborg
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1061]

        Issue #:      1061             Platform:     All        Reporter: uddeborg (uddeborg)
       Component:     gridengine          OS:        All
     Subcomponent:    kernel           Version:      6.0beta2      CC:    None defined
        Status:       VERIFIED         Priority:     P3
      Resolution:     DUPLICATE       Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
          URL:
       * Summary:     Trace file does not get new data after chown over NFS
   Status whiteboard:
      Attachments:

     Issue 1061 blocks:
   Votes for issue 1061:


   Opened: Fri May 21 09:22:00 -0700 2004 
------------------------


In main in shepherd.c, the job's trace file is
first created, and a first line, in which shepherd
reports that it was called along with its uid and
euid, is written.  Then the file's owner is changed
to the user running the job.  After that several
more lines are written.  If the spool directory is
mounted over NFS, all of these subsequent writes fail.

I assume the reason is that, contrary to what the
comment in shepherd_trace_chown_intern says,
stateless NFS doesn't care whether you have a file
descriptor open.  Instead, each write is checked
for permission.  And after having changed back to
the euid of the SGE administrator, one no longer
has permission to write to this file.  (Whether the
write system call actually reports this seems to
depend somewhat on OS version and NFS flags.)
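
A minimal sketch of how such failing writes could at least be detected rather than silently lost. It is written in Python for brevity and is not the shepherd code; the helper name write_trace_line is hypothetical. It only assumes that a write refused for permission reasons surfaces as an error or a short write, which, as noted above, may depend on OS version and NFS flags.

```python
import os


def write_trace_line(fd: int, line: str) -> bool:
    """Write one trace line to fd.

    Returns True only if every byte was accepted by the OS.
    (Hypothetical helper, for illustration only.)
    """
    data = (line.rstrip("\n") + "\n").encode()
    try:
        while data:
            written = os.write(fd, data)
            if written == 0:
                # Nothing was accepted; give up instead of looping forever.
                return False
            data = data[written:]
    except OSError:
        # For example EACCES if the server rejects the write after the
        # file's owner has changed, or EBADF for an unusable descriptor.
        return False
    return True
```

A caller could react to a False result by reopening the trace file under the euid that now owns it; whether that is appropriate for shepherd is not established here.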

   ------- Additional comments from pollinger Wed May 26 01:53:06 -0700 2004 -------
The reason is, in fact, that NFS doesn't provide a proper way to
append to a file from two processes (even on the same host)
concurrently. So whenever a file handle is closed, whatever has been
written to the file from the other process since the file handle was
opened is overwritten.

This is documented in some creat(2) man pages (e.g. on Linux) and
applies to most NFS server implementations; an exception seems to be
the Irix 6.5 NFS server, which seems to handle appending correctly.

In our case this means, the output of the parent shepherd overwrites
the outputs of all child shepherds (which are forked to execute
prolog, pe_start, job, pe_stop and epilog).
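
To illustrate the append pattern under discussion, here is a minimal Python sketch (not the fix that was applied to shepherd; the helper name append_line is hypothetical). Each message is written by opening the file with O_APPEND, issuing one write for the whole line, and closing again. On a local filesystem the kernel places every such write at the current end of file. On NFS the client has to simulate appending, so this pattern narrows the window for lost lines but does not guarantee correctness.

```python
import os


def append_line(path: str, line: str) -> None:
    """Append one whole line to path using a short-lived O_APPEND descriptor.

    (Hypothetical helper, for illustration only.)
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # A single write per line keeps lines from different writers intact
        # on a local filesystem.
        os.write(fd, (line + "\n").encode())
    finally:
        os.close(fd)
```

Called alternately for a parent and a child writer against a local file, the lines end up in call order; over NFS the same calls may still lose data.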

This bug has already been reported (and fixed) as issue 1021.


*** This issue has been marked as a duplicate of 1021 ***

   ------- Additional comments from pollinger Wed May 26 01:55:37 -0700 2004 -------
Edit: This is not a duplicate of Issue 1021, it's a duplicate of Issue
1012.

   ------- Additional comments from uddeborg Wed May 26 08:23:43 -0700 2004 -------
Yes, that seems to be the same thing.  And your fix does indeed seem
to solve my problems.  (Including some consequential problems I had.)
#180 fixed IZ1074: remove conditional compilation ENABLE_NGC joga
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1074]

        Issue #:      1074             Platform:     All       Reporter: joga (joga)
       Component:     gridengine          OS:        All
     Subcomponent:    cleanup          Version:      current      CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
          URL:
       * Summary:     remove conditional compilation ENABLE_NGC
   Status whiteboard:
      Attachments:

     Issue 1074 blocks:
   Votes for issue 1074:


   Opened: Thu May 27 06:24:00 -0700 2004 
------------------------


Parts of the source code still contain 2 versions
of code, one for the new and one for the old
communication.

Remove the old communication code.
#185 fixed IZ1102: user belonging to too many groups sets all queues into error state wig
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1102]

        Issue #:      1102             Platform:     All           Reporter: wig (wig)
       Component:     gridengine          OS:        Solaris
     Subcomponent:    qmaster          Version:      5.3              CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     ernst
          URL:
       * Summary:     user belonging to too many groups sets all queues into error state
   Status whiteboard:
      Attachments:

     Issue 1102 blocks:
   Votes for issue 1102:


   Opened: Fri Jun 18 07:51:00 -0700 2004 
------------------------


Follow-up to the following discussion on the
users@gridengine mailing list:

Please change the behaviour for such jobs from
"set queue into error state and resubmit job
to another queue" to "set job into error
state".


> Date: Wed, 28 May 2003 10:42:24 +0200 (MEST)
> From: Andy Schwierskott <andy.schwierskott@sun.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
> Subject: Re: [GE users] Too many groups send queues go into (E)rror state
>
>
> Hi,
>
> > --- Jon  Kleinsmith <jon.kleinsmith@conexant.com> wrote:
> > > Looking in the messages log, I see two things that
> > > bother me:
> > >
> > > 1. the scheduler is trying to set uid and euid 0 on
> > > the job?
> >
> > Actually the shepherd, not the scheduler. Each job has
> > its own shepherd, which is used for
> > collecting/controlling the job.
> >
> > The message (uid=0, euid=0) tells you that SGE
> > couldn't add an additional group, but uid=0/euid=0 is
> > not what SGE is trying to set to, it is actually the
> > uid/euid of shepherd.
> >
> > > 1. Why does the scheduler need to add the uid/euid
> > > info for the job?
> >
> > The shepherd needs to add an additional group id to
> > the job for job control.
>
> There are two reasons why setting the additional group id for the job may
> fail:
>
>    - the root user on that machine already has (usually 16) add. group id's
>      This should be easy to fix for the admin. All jobs on that machine
>      would be affected in put the queue into error state.
>
>      (thinking about this I'm wondering if it could be possible to reduce
>       the number of add. group id's in the shepherd process to avoid this
>       source of failure, but I'm not sure if that couldn't cause other
>       unexpected problems???)
>
>    - the 'job user' has 16 add. group id's. While this is purely admin
>      related, it's not neccessarily in the responsibility of the SGE admin
>
>      I think in practice it should be relativly easy to reduce the number of
>      add. group id's of the users to 15, but I understand that an SGE admin
>      does not want that this blocks compute resources until this problem is
>      solved.
>
> > > 2. Why does this error "lock-up" my queues?
> >
> > So that the cluster admin knows that something wrong
> > is going on.
>
> We had a couple of times this problem report in the past. I see it would be
> better to put the job into error state if user is already in 16 add. groups
>
> Is anyone seeing a reason why it could be a problem putting the job into
> error state and not the queue in that case? If no one sees a problem we
> might file this is an enhancement issue.
>
> Andy
>
> > > 3. Is there a simple way of reseting the error state
> > > of the queues
> > > without losing scheduled/running jobs?
> >
> > As mentioned by other people, use qmod -c <queue>. You
> > can consider using a cron job to clear the error state
> > of the queues.
> >
> >  -Ron
> >
> > >
> > > Any info would be helpful.
> > >
> > > -Jon
> > >
> > >
> > > --
> > > Jon  Kleinsmith <jon.kleinsmith@conexant.com>
> > >
> > >
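
The requested behaviour can be sketched as a small decision function. This is a Python illustration of the reasoning in the quoted mail, not Grid Engine code: the names ErrorTarget and error_target are hypothetical, and the default limit of 16 additional group ids is taken from the "usually 16" in the mail and will differ between systems.

```python
from enum import Enum


class ErrorTarget(Enum):
    """What should be put into error state (hypothetical names)."""
    NONE = "none"
    JOB = "job"
    QUEUE = "queue"


def error_target(root_group_count: int,
                 user_group_count: int,
                 max_groups: int = 16) -> ErrorTarget:
    """Decide where to report a failure to add the additional group id.

    max_groups is an assumed system limit on additional group ids.
    """
    if root_group_count >= max_groups:
        # Every job on this machine would fail, so the queue is the problem.
        return ErrorTarget.QUEUE
    if user_group_count >= max_groups:
        # Only this user's jobs are affected; leave the queue usable.
        return ErrorTarget.JOB
    return ErrorTarget.NONE
```

With this split, a user who already belongs to 16 groups gets the job flagged, while the queue stays available to other users.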

   ------- Additional comments from sgrell Mon Dec 12 02:45:29 -0700 2005 -------
Changed subcomponent.

Stephan

   ------- Additional comments from sgrell Mon Dec 12 02:48:52 -0700 2005 -------
Changed subcomponent.

Stephan