[GE users] Fwd: subnode with empty slots but jobs in queue

reuti reuti at staff.uni-marburg.de
Tue Dec 7 18:39:12 GMT 2010

On 06.12.2010 at 20:29, jlforrest wrote:

> On 12/6/2010 11:02 AM, reuti wrote:
>> When the local spool directory exists after the reboot, the
>> execd would inform the qmaster about the failed jobs. When there is
>> no information on the node about the last running jobs, the execd
>> won't tell anything to the qmaster, and on its own it's waiting for
>> the jobs to reappear.
> I was thinking about this. I wonder if this
> is the right thing to do. If the actual
> contents of the local spool directory are
> empty, or different than what the qmaster
> expects, then what point is there for the
> qmaster to continue to think that the
> jobs exist, or will ever come back?
> In other words, shouldn't the contents
> of the local spool directory determine
> the qmaster's conception of reality?

I remember that this was discussed on the list before. I'm not sure what the final conclusion was, and I can't find any issue entered for it. The idea was a one-time check when the execd comes up again: compare what should be there from the qmaster's point of view with what the execd actually finds in its local (spool) directory.
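To illustrate the kind of one-time check meant here, a rough sketch follows. This is not how SGE behaves today, and it is not SGE code: the function name `reconcile_jobs` and the assumption that the spool directory holds one entry per job, named by its job ID, are made up for the example.

```python
import os
import tempfile


def reconcile_jobs(expected_job_ids, spool_dir):
    """Compare the jobs the qmaster expects on a node with the local spool contents.

    Assumption (hypothetical layout): spool_dir contains one entry per job,
    named by the job ID. A missing spool_dir is treated as an empty one,
    which is what a reinstalled node would look like.
    """
    try:
        found = set(os.listdir(spool_dir))
    except FileNotFoundError:
        found = set()
    expected = set(expected_job_ids)
    return {
        "confirmed": sorted(expected & found),   # known to both sides
        "lost": sorted(expected - found),        # qmaster expects it, node has no trace
        "unexpected": sorted(found - expected),  # on the node, unknown to the qmaster
    }


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for the local spool.
    with tempfile.TemporaryDirectory() as spool:
        os.mkdir(os.path.join(spool, "101"))
        print(reconcile_jobs(["101", "103"], spool))
```

With such a result, the jobs in "lost" are the ones the qmaster could stop waiting for instead of expecting them to reappear.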

SGE was just not designed to handle reinstalled nodes in combination with local spool directories.

This is not only an issue for the jobs listed in `qstat`, but also for any tightly integrated tasks which were started by `qrsh -inherit`. If there was one on the node, the complete job is invalid. If there was none (as your parallel jobs have serial and parallel steps), then you may be lucky and the job won't notice the reinstallation of the node at all.
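The rule for tightly integrated jobs can be written down as a tiny sketch. The function name `job_may_survive` and the idea of having a list of hosts with running tasks at hand are assumptions for illustration only; nothing like this is provided by SGE.

```python
def job_may_survive(hosts_with_running_tasks, reinstalled_host):
    """Return True if no task of the job was running on the reinstalled host.

    hosts_with_running_tasks: hostnames where tasks started by
    `qrsh -inherit` were running at the time of the reinstallation
    (hypothetical input, assumed to be known from elsewhere).
    """
    return reinstalled_host not in set(hosts_with_running_tasks)
```

A result of True only means the job may be lucky; False means the complete job has to be considered invalid.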

Feel free to enter an issue about it.

More complicated would be this issue with persistent scratch directories on the nodes, which I suggested:


If the local persistent scratch directory is gone, something happened to the node...

-- Reuti

> -- 
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=302559
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

