[GE users] Many jobs seem to lead to dropped callbacks
rmaes at ciena.com
Thu Dec 10 17:35:18 GMT 2009
I don't think this is an SGE issue, but rather a large number of jobs versus limited resources kind of thing.
The scenario is that I have a Tcl script (it's actually the Altera Quartus DSE Tcl scripts, made up of hundreds of files) that launches FPGA synthesis jobs.
In general, it works by creating anywhere from 1 to 1200 directories, depending on your exploration settings, one for each of what will eventually be an SGE job.
Each directory is populated with initial data that is approximately 1.5MB in size.
Once a job is finished, the job directory has typically grown to 500MB.
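To give a rough sense of scale, here is a back-of-envelope sketch of the total data volume for a few run sizes. It assumes every job starts at the 1.5MB figure and grows to the typical 500MB figure above; the function name `total_gb` and the use of decimal gigabytes are just illustrative choices.

```python
def total_gb(jobs, mb_per_job):
    """Total size in decimal GB for `jobs` directories of `mb_per_job` MB each."""
    return jobs * mb_per_job / 1000


if __name__ == "__main__":
    # Run sizes mentioned in this post: 500 (works), 1000 (fails), 1200 (maximum).
    for jobs in (500, 1000, 1200):
        initial = total_gb(jobs, 1.5)    # assumed initial size per job directory
        finished = total_gb(jobs, 500)   # assumed typical finished size per job
        print(f"{jobs:5d} jobs: {initial:.2f} GB initial, {finished:.0f} GB finished")
```

So a 1000 point run would be moving on the order of 500 GB to the repository if every job reaches the typical size.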
The scripts then copy the data off to a repository location and perform some report file analysis. They then append the results files to a log file, which is huge as well by the time we are done.
Control of the copy and analysis seems to be based on callbacks to a master script.
Here is the issue: if I set my settings to do a 500-job run, everything works as expected. If I set up a 1000-point run, then based on the debug output I don't believe the master script is catching callbacks, or else the children aren't really sending them. In any case all jobs do complete, and the children's debug logs indicate they are sending messages; however, none of the copy to repository or report analysis occurs.
Does anyone know how the number of jobs, or maybe the amount of data, relates to successful callback and vwait operation?
I think there must be a limit I should be looking at. I did just raise my maxproc from 64000 and my descriptors from 4096, but there was no change in the number of jobs that could complete before the callback mechanism stopped working.
Here are my current limits:
[waxgridqm.ciena.com]-> ~ 101> limit
stacksize 10240 kbytes
coredumpsize 0 kbytes
memorylocked 32 kbytes
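Since limits are per-process and inherited from the parent, the values the shell reports are not necessarily the ones the master script runs with. As a cross-check, here is a minimal sketch that prints the soft and hard limits seen by the process that runs it. It uses Python's standard `resource` module rather than anything from SGE or Quartus, and the labels on the left are only my mapping onto the csh `limit` names.

```python
import resource

# csh `limit` label -> resource-module constant name (labels are illustrative only).
LIMITS = {
    "descriptors": "RLIMIT_NOFILE",
    "maxproc": "RLIMIT_NPROC",
    "stacksize": "RLIMIT_STACK",
    "coredumpsize": "RLIMIT_CORE",
    "memorylocked": "RLIMIT_MEMLOCK",
}


def fmt(value):
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits available on this platform."""
    out = {}
    for label, const in LIMITS.items():
        rid = getattr(resource, const, None)
        if rid is None:
            continue  # constant not defined on this platform
        soft, hard = resource.getrlimit(rid)
        out[label] = (fmt(soft), fmt(hard))
    return out


if __name__ == "__main__":
    # Sizes are reported in bytes here, not kbytes as in the csh output.
    for label, (soft, hard) in current_limits().items():
        print(f"{label:14s} soft={soft} hard={hard}")
```

Running this from the same environment that launches the master script should show whether the raised maxproc and descriptors values actually reach that process.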