[GE users] Many jobs seem to lead to dropped callbacks

reuti reuti at staff.uni-marburg.de
Fri Dec 11 13:07:25 GMT 2009


On 10.12.2009 at 18:35, richmaes wrote:

> I don't think this is an SGE issue, but rather a large number of  
> jobs versus limited resources kind of thing.
>
> The scenario is that I have a Tcl script (actually the Altera
> Quartus DSE Tcl scripts, made up of hundreds of files) that launches
> FPGA synthesis jobs.
>
> In general, based on your exploration settings, it creates anywhere
> from 1 to 1200 directories, each one for what will eventually be
> an SGE job.

What does the vendor say about the maximum number of jobs the
master-job can receive messages from at a time?

Would it help to throttle the scheduled jobs, e.g. by submitting them
as an array job with the -tc option, or by limiting them in a queue
with some kind of RQS (resource quota set)?
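As a rough illustration of the array-job variant: a minimal sketch that builds a throttled submission. The script name run_point.sh and the cap of 100 concurrent tasks are assumptions made for the example (the post only says that a 500-job run works), and it assumes a qsub that supports -tc.

```shell
#!/bin/sh
# Sketch only: build a throttled array-job submission for the exploration runs.
# Hypothetical names: run_point.sh is an assumed per-point job script.

# Print the qsub command line for an array job.
#   $1 = total number of tasks
#   $2 = maximum number of tasks allowed to run at the same time
#   $3 = job script
# -t selects the task range, -tc caps the concurrently running tasks,
# -cwd runs each task in the submission directory.
build_submit_cmd() {
    total=$1
    max_running=$2
    script=$3
    echo "qsub -cwd -t 1-${total} -tc ${max_running} ${script}"
}

# 1000 points, at most 100 running at once (the cap is an arbitrary assumption).
cmd=$(build_submit_cmd 1000 100 run_point.sh)
echo "$cmd"

# To actually submit, run the printed command on a submit host.
```

Inside the job script, each task can read $SGE_TASK_ID to pick its own directory. Whether the Quartus DSE scripts can be made to submit a single array job instead of individual jobs is not known from this thread.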

Is there one master-job per user and/or per job, or do
they run all the time?

-- Reuti


>
> Each directory is populated with initial data that is approximately  
> 1.5MB in size.
>
> Once a job is finished, the job directory has typically grown to  
> 500MB.
>
> The scripts then copy the data off to a repository location and
> perform some report file analysis.  They then append the results
> files to a log file, which by the time we are done is huge as well.
>
> Control of the copy and analysis seems to be based on callbacks to a
> master script.
>
> Here is the issue: if I set my settings to do a 500-job run,
> everything works as expected.  If I set up a 1000-point run, then
> based on the debug output, I don't believe the master script is
> catching callbacks... or the children aren't really sending them.
> In any case, all jobs do complete, and the children's debug logs
> indicate they are sending messages; however, none of the copy to
> repository or report analysis occurs.
>
> Does anyone know how the number of jobs, or maybe the amount of
> data, relates to successful callback and vwait operation?
>
> I think there must be a limit I should be looking at.  I just
> raised my maxproc from 64000 and my descriptors from 4096.  There was
> no change in the number of jobs that could complete before the
> callback mechanism stopped working.
>
>
> Here are my current limits
> [waxgridqm.ciena.com]-> ~ 101> limit
> cputime      unlimited
> filesize     unlimited
> datasize     unlimited
> stacksize    10240 kbytes
> coredumpsize 0 kbytes
> memoryuse    unlimited
> vmemoryuse   unlimited
> descriptors  16384
> memorylocked 32 kbytes
> maxproc      147456
>
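The limits quoted above are those of the interactive login shell. Limits are inherited per process, so the running master script may see different values. A minimal sketch for checking the process itself, assuming a Linux host with /proc mounted (and a kernel that provides /proc/PID/limits); the helper names are made up for the example:

```shell
#!/bin/sh
# Sketch only: inspect the descriptor usage and limit of a running process.
# Assumes Linux with /proc; helper names are hypothetical.

# Print the number of open file descriptors of process $1,
# or "unknown" if /proc has no entry for it.
count_open_fds() {
    pid=$1
    if [ -d "/proc/${pid}/fd" ]; then
        ls "/proc/${pid}/fd" | wc -l | tr -d ' '
    else
        echo "unknown"
    fi
}

# Print the "Max open files" line of process $1, if the kernel exposes it.
show_fd_limit() {
    pid=$1
    if [ -r "/proc/${pid}/limits" ]; then
        grep 'Max open files' "/proc/${pid}/limits"
    else
        echo "limits not available for pid ${pid}"
    fi
}

# Demonstrated on the current shell; substitute the PID of the master script.
pid=$$
echo "open descriptors: $(count_open_fds "$pid")"
show_fd_limit "$pid"
```

Watching the descriptor count of the master process while a 1000-point run is in progress would show whether it approaches a limit around the point where callbacks stop.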
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232655
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232786
