[GE users] Re: deleting large numbers of jobs

tmac tmacmd at gmail.com
Fri May 9 11:22:18 BST 2008



Thanks for the explanation.

--tmac

On Fri, May 9, 2008 at 3:34 AM, Roland Dittel <Roland.Dittel at sun.com> wrote:
> Hi tmac,
>
> for GE 6.2 I analyzed the hotspots when deleting jobs, and this is what I
> found:
>
> 1) The time to delete a job increases with the number of pending jobs in
> the cluster and the number of queue instances. The reason is the message
> list for schedd_job_info. Every message in the qstat -j output is one list
> element, and below that element the ids of the jobs the message applies to
> are stored as references. At job deletion time qmaster has to loop over the
> whole list of messages and, for each one, loop over all references to
> remove the right one. This does not scale, so for 6.2 I added hash access
> to the reference ids, which greatly reduced the job deletion time in large
> clusters. Sadly I don't remember the exact numbers.
>
> To verify this you can disable schedd_job_info in the scheduler config and
> then delete your jobs.
>
> 2) The job script and the job itself need to be removed from the database.
> How long this takes depends on whether you use berkeleydb or classic
> spooling, and on whether you spool to local storage or to an NFS share.
> The faster your access to the storage, the faster you can delete the jobs.
>
> If disabling schedd_job_info doesn't help in your case, the spooling time
> described in this point may be what is affecting you.
>
> 3) With 6.1u3 we introduced the parameters gdi_timeout and gdi_retries to
> tune how long clients wait before the GDI error appears. But that is more
> a workaround than a real solution.
>
> I hope this helps
> Roland
>
>
>
> tmac wrote:
> > Re-posting.
> >
> > The delete is of an array job with many components.
> >
> > During the delete, it is as if the master chokes.
> > Any new qsubs fail (the infamous GDI message).
> > Again, it is only deleting 100-300 jobs. Really, not a whole lot of them.
> >
> > Is there any way to find out further what is happening and why?
> >
> > Is there any way to increase the timeout before the GDI message appears?
> >
> > thanks
> >
> > On Thu, Apr 24, 2008 at 10:49 AM, tmac <tmacmd at gmail.com> wrote:
> >
> > > SGE 6.0u7 all around
> > > Master/shadows RHEL4u2
> > > BDB via RPC on Solaris 10
> > >
> > > When we try to delete a large number of jobs (with large being more
> > > than *just* a couple hundred)
> > > the master stops responding. Sometimes it comes back, sometimes not.
> > >
> > > This morning, we deleted 330+ array jobs. The master hung. We waited 4
> > > minutes and qstat/qmon was still not responding.
> > > The master itself seemed OK.
> > >
> > > The service was restarted on the master/slaves.
> > >
> > > Anyone have any idea as to what might be going on?
> > >
> > > --
> > > --tmac
> > >
> > > RedHat Certified Engineer #804006984323821 (RHEL4)
> > > RedHat Certified Engineer #805007643429572 (RHEL5)
> > >
> > > Principal Consultant
> > >
> > >
> >
> >
> >
> >
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Roland Dittel               Tel: +49 (0)941 3075-275 (x60275)
> Software Engineering        Fax: +49 (0)941 3075-222 (x60222)
> Sun Microsystems GmbH
> Dr.-Leo-Ritter-Str. 7       mailto:roland.dittel at sun.com
> D-93049 Regensburg          http://www.sun.com/gridware
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Registered Office / Sitz der Gesellschaft:
>  Sun Microsystems GmbH
>  Sonnenallee 1
>  D-85551 Kirchheim-Heimstetten
>  Germany
> Commercial register of the Local Court of Munich /
> Handelsregistereintrag Amtsgericht Muenchen:
>  HRB 161028
> Managing Directors / Geschaeftsfuehrer:
>  Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Chairman of the Supervisory Board / Vorsitzender des Aufsichtsrates
>  Martin Haering
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
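
For anyone reading this in the archive: the effect Roland describes in point 1
can be illustrated with a small sketch. This is not qmaster code. It is a
hypothetical Python model in which each scheduler message holds the ids of the
jobs it applies to, first as a plain list and then as a set (standing in for
the "hash access" added in 6.2). The names build_messages, remove_job_linear
and remove_job_hashed are made up for the illustration.

```python
# Hypothetical model of the schedd_job_info message list; not qmaster code.

def build_messages(n_messages, n_jobs, container):
    """Return n_messages reference containers, each holding job ids 0..n_jobs-1."""
    return [container(range(n_jobs)) for _ in range(n_messages)]


def remove_job_linear(messages, job_id):
    """List-backed references: scan each message's list until job_id is found.

    Returns the number of id comparisons performed.
    """
    comparisons = 0
    for refs in messages:
        for index, ref in enumerate(refs):
            comparisons += 1
            if ref == job_id:
                del refs[index]
                break
    return comparisons


def remove_job_hashed(messages, job_id):
    """Set-backed references: one hash lookup per message.

    Returns the number of lookups performed.
    """
    lookups = 0
    for refs in messages:
        lookups += 1
        refs.discard(job_id)  # no error if the id is absent
    return lookups


if __name__ == "__main__":
    n_messages, n_jobs = 50, 1000
    victim = n_jobs - 1  # last id: worst case for the linear scan
    linear = remove_job_linear(build_messages(n_messages, n_jobs, list), victim)
    hashed = remove_job_hashed(build_messages(n_messages, n_jobs, set), victim)
    print("list-backed comparisons:", linear)
    print("set-backed lookups:     ", hashed)
```

With 50 messages and 1000 pending jobs, removing the last job id costs 50,000
comparisons in the list-backed model and 50 lookups in the set-backed one. The
real numbers in qmaster will differ, but the sketch shows why the cost of a
single delete grows with the number of pending jobs when the references are
scanned linearly.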



-- 
--tmac

RedHat Certified Engineer #804006984323821 (RHEL4)
RedHat Certified Engineer #805007643429572 (RHEL5)

Principal Consultant
