[GE users] Re: deleting large numbers of jobs

Roland Dittel Roland.Dittel at Sun.COM
Fri May 9 08:34:10 BST 2008



Hi tmac,

For GE 6.2 I analyzed the hotspots in job deletion, and this is what I found:

1) The time needed to delete a job increases with the number of pending
jobs in the cluster and the number of queue instances. The reason for
this is the message list for schedd_job_info. Every message in the
qstat -j output is one list element, and below this element are stored
the ids of the jobs to which the message applies. At job deletion time
qmaster has to loop over the whole list of messages and, for each
message, over all references to remove the right one. This does not
scale, so for 6.2 I've added hash access to the reference id, which
reduced the job deletion time in large clusters considerably. Sadly I
don't remember the exact numbers.

To verify this you can disable schedd_job_info in the scheduler config 
and then delete your jobs.
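
To illustrate why the lookup in 1) scales badly, here is a minimal
Python sketch. It is a toy model only, not the qmaster code: the class
names, the step counter, and the idea of a job-id-to-messages index are
assumptions made for illustration, and the real hash access in 6.2 may
be organised differently.

```python
from collections import defaultdict


class ScannedMessageList:
    """Toy model: each message holds a plain list of job id references."""

    def __init__(self):
        self.messages = {}  # message text -> list of job ids

    def add(self, message, job_id):
        self.messages.setdefault(message, []).append(job_id)

    def remove_job(self, job_id):
        """Visit every message and scan its references; return steps taken."""
        steps = 0
        for refs in self.messages.values():
            for i, ref in enumerate(refs):
                steps += 1
                if ref == job_id:
                    del refs[i]
                    break
        return steps


class HashedMessageList:
    """Toy model with a hash index from job id to its messages (assumed design)."""

    def __init__(self):
        self.messages = defaultdict(set)  # message text -> set of job ids
        self.by_job = defaultdict(set)    # job id -> set of message texts

    def add(self, message, job_id):
        self.messages[message].add(job_id)
        self.by_job[job_id].add(message)

    def remove_job(self, job_id):
        """Touch only the messages that reference this job; return steps taken."""
        steps = 0
        for message in self.by_job.pop(job_id, ()):
            self.messages[message].discard(job_id)
            steps += 1
        return steps


def build(cls, n_messages, n_jobs):
    """Fill a message list in which every message applies to every job."""
    mlist = cls()
    for m in range(n_messages):
        for job_id in range(n_jobs):
            mlist.add(f"message {m}", job_id)
    return mlist
```

In this model the scanned variant does work proportional to the number
of messages times the number of pending jobs for every single deletion,
while the hashed variant only touches the messages that actually
reference the deleted job.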

2) The job script and the job itself need to be removed from the
database. How long this takes depends on whether you use berkeleydb or
classic spooling and whether you spool to local storage or to an NFS
share. The faster your access to the storage, the faster you can delete
the jobs.

If disabling schedd_job_info doesn't help in your case, you might be
hitting this point.
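
For a rough feeling of how much the storage in 2) matters, a small
Python sketch like the following can compare file removal latency
between two directories, for example one on local disk and one on the
NFS share. It does not reproduce what qmaster does when spooling
(berkeleydb spooling in particular behaves differently); the function
name, file prefix, and default sizes are made up for illustration.

```python
import os
import tempfile
import time


def time_unlink(directory, count=200, size=4096):
    """Create `count` small files in `directory`, then time their removal.

    Returns the mean number of seconds per unlink call.
    """
    payload = b"x" * size
    paths = []
    for _ in range(count):
        fd, path = tempfile.mkstemp(dir=directory, prefix="spooltest_")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make sure the file really reached storage
        paths.append(path)

    start = time.perf_counter()
    for path in paths:
        os.unlink(path)
    return (time.perf_counter() - start) / count
```

Calling time_unlink() once with a directory on local storage and once
with a directory on the shared filesystem gives two numbers to compare.
A large gap suggests that the spooling location contributes to slow
deletes.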

3) Regarding the GDI timeout: with 6.1u3 we introduced the parameters
gdi_timeout and gdi_retries to tune this behaviour. But that is more a
workaround than a real solution.

I hope this helps
Roland

tmac wrote:
> Re-posting.
> 
> The delete is of an array job with many components.
> 
> During the delete, it is as if the master chokes.
> any new qsub's fail (the infamous GDI message)
> Again, it is only deleting 100-300 jobs. Really, not a whole lot of them.
> 
> Is there any way to find out further what is happening and why?
> 
> Is there any way to increase the timeout before the GDI message appears?
> 
> thanks
> 
> On Thu, Apr 24, 2008 at 10:49 AM, tmac <tmacmd at gmail.com> wrote:
>> SGE 6.0u7 all around
>> Master/shadows RHEL4u2
>> BDB via RPC on Solaris 10
>>
>> When we try to delete a large number of jobs (with large being more
>> than *just* a couple hundred)
>> the master stops responding. Sometimes it comes back, sometimes not.
>>
>> This morning, we deleted 330+ array jobs. The master hung. We waited 4
>> minutes and qstat/qmon was still not responding.
>> The master itself seemed OK.
>>
>> The service was restarted on the master/slaves.
>>
>> Anyone have any idea as to what might be going on?
>>
>> --
>> --tmac
>>
>> RedHat Certified Engineer #804006984323821 (RHEL4)
>> RedHat Certified Engineer #805007643429572 (RHEL5)
>>
>> Principal Consultant
>>
> 
> 
> 


-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Roland Dittel               Tel: +49 (0)941 3075-275 (x60275)
Software Engineering        Fax: +49 (0)941 3075-222 (x60222)
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7       mailto:roland.dittel at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Registered Office / Sitz der Gesellschaft:
   Sun Microsystems GmbH
   Sonnenallee 1
   D-85551 Kirchheim-Heimstetten
   Germany
Commercial register of the Local Court of Munich /
Handelsregistereintrag Amtsgericht Muenchen:
   HRB 161028
Managing Directors / Geschaeftsfuehrer:
   Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board / Vorsitzender des Aufsichtsrates
   Martin Haering
