[GE users] Aborted jobs mystery

ffoertter ffoertter at gmail.com
Tue Aug 31 19:23:14 BST 2010

Hi Everyone, 

So here is my scenario:
	SGE: 6.2u3
	Distro: SL 5.4
	I have some jobs that take ~48h to complete.  They're CPU-intensive, not maxed out on memory.
	On my highmem nodes they run for about a day, producing output, then they just die and the compute node reboots
	We've had issues with the 10GbE cards before in other, identical (lower-mem) nodes
	Hardware checks show the failure is in some of the 10GbE cards
	Job death coincides with machine reboot
	All nodes have NAS directly mounted

Qmaster says:
Exit Status      = -1
Signal           = unknown signal
failed before writing exit_status because:
can't read usage file for job 1470.1

So I'm naturally assuming it's yet another 10GbE card dying on us... the usage file can't be read because, well, the node has lost its connection to our NAS, and that's where the file lives.

But another developer here thinks the jobs are being killed because they go a long stretch without producing output, and somehow qmaster decides they've hung...
So he runs the same job in another queue, but with these environment tweaks (quoting his email):

i) Added "UNSET TMOUT"  to submit.sh;
ii) Created .ssh/config in the home directory;
iii) Edited ~/.ssh/config and added
serveraliveinterval 60
serveralivecountmax 10
iv) Modified the permissions on ~/.ssh/config
$ chmod 700 ~/.ssh/config

And wouldn't you know it, the same job runs to completion in the general queue.
I haven't tested it in the highmem queue yet (still waiting on him).

But this solution shouldn't make any sense because:
1) it's not an interactive job, it's being qsub'ed
2) isn't TMOUT unset by default in almost all distros?  (I've checked the obvious places like /etc/profile; it's not set anywhere in our distro)
3) serveraliveinterval and serveralivecountmax apply to an ssh login to the head node, but that shouldn't carry over once the job is dispatched to a compute node.  The ssh session is between the user (running qsub) and qmaster, but the shepherd takes over once the job lands on the compute node... right?  Once it's running, the execution daemon owns the process (running under the submitter's uid), so these settings shouldn't matter: the shepherd talks to sge_execd, so where is the ssh session in that picture?  Qmaster would only declare the job dead if it can no longer talk to the execution host, not because it lost contact with the job itself.
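On point 2, a quick way to double-check on both the submit and execution hosts (the file list below is just the usual suspects, not exhaustive):

```shell
# No grep output means no profile script sets TMOUT, in which case an
# "unset TMOUT" line in submit.sh is a no-op.
grep -Hn 'TMOUT' /etc/profile /etc/profile.d/*.sh /etc/bashrc \
    ~/.bashrc ~/.bash_profile 2>/dev/null
echo "TMOUT is currently: ${TMOUT:-<unset>}"
```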

Is my logic correct, and if not, where am I going wrong?
The fact that the machine dies even though memory usage isn't high tells me this is hardware-related... not a sleeping shepherd :)  Any other ideas?
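For what it's worth, point 3 above can be checked directly on a compute node while a job is running: a qsub'd job's process tree should hang off sge_execd and sge_shepherd, with no sshd anywhere in its ancestry (generic ps invocation below; adjust the daemon names if your install differs):

```shell
# For a batch job you'd expect to see the chain
#   sge_execd -> sge_shepherd -> <job script>
# with no sshd in it, so ~/.ssh/config keepalives shouldn't come into play.
ps -eo pid,ppid,comm | awk 'NR==1 || /sge_execd|sge_shepherd|sshd/'
```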


