[GE users] Aborted jobs mystery

reuti reuti at staff.uni-marburg.de
Tue Aug 31 20:32:53 BST 2010


On 31.08.2010 at 20:23, ffoertter wrote:

> So here is my scenario:
> 	SGE: 6.2u3
> 	Distro: SL 5.4
> 	I have some jobs that take ~48h to complete.  It's CPU intensive, not maxed on mem.
> 	On my highmem nodes, they run for about a day producing output, then they just die, and the compute node reboots

Are they dying on their own, or are they being killed by your Linux shutdown script just before the node reboots? Do your nodes reboot this way (i.e. with a proper shutdown first), so that you have something in your /var/log/messages, or is it more like pressing "reset"?
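
A rough sketch of how one could tell the two cases apart on the node. The syslog path and the message patterns are assumptions for a stock syslog setup, and count_clean_shutdowns is just a helper name made up here:

```shell
# Count syslog lines that an orderly shutdown would typically leave behind.
# The two patterns are assumptions; adjust them to what your syslog writes.
count_clean_shutdowns() {
    grep -ciE 'shutting down for system|exiting on signal 15' "$1"
}

# wtmp view: after a hard reset there is a "reboot" entry
# without a "shutdown" entry recorded before it.
if command -v last >/dev/null 2>&1; then
    last -x | grep -E '^(shutdown|reboot)' | head -n 20 || true
fi

# Assumed default syslog location on the compute node.
if [ -r /var/log/messages ]; then
    count_clean_shutdowns /var/log/messages || true
fi
```

A count of 0 together with fresh "reboot" entries would point to a reset-like event rather than a proper shutdown.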

> 	We've had issues with the 10GbE cards before in other identical (lower mem) nodes
> 	Hardware checks show failure in some 10GbE cards
> 	Job death coincides with machine reboot
> 	All nodes have NAS directly mounted
> Qmaster says:
> Exit Status      = -1
> Signal           = unknown signal
> failed before writing exit_status because:
> can't read usage file for job 1470.1
> So I'm naturally assuming it's yet another 10GbE card dying on us... I can't read the usage file because, well, it's lost connection to our NAS, and that's where it's at.

What's on the NAS? Spool directory of SGE, / directory, /home, /var, all?

If you have local disks, I'd put SGE's spool directory on the local disk, in something like /var/spool/sge.
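
To see where the execd currently spools and whether that path lives on the NAS, something like the following could be used. The host name node12 and the helper fs_type are placeholders, not anything from your setup:

```shell
# Print the filesystem type backing a path ("nfs" would mean it is on the NAS).
fs_type() {
    df -PT "$1" | awk 'NR==2 {print $2}'
}

if command -v qconf >/dev/null 2>&1; then
    # Global setting first, then the host-local override (node12 is a placeholder).
    qconf -sconf | grep execd_spool_dir || true
    qconf -sconf node12 | grep execd_spool_dir || true
fi

# Check the suggested local spool location, if it exists on this node.
if [ -d /var/spool/sge ]; then
    fs_type /var/spool/sge
fi
```

The host-local value can be edited with "qconf -mconf node12"; as far as I know the affected execd has to be restarted afterwards for a changed execd_spool_dir to take effect.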


> But another developer here says he thinks the jobs are being killed because they spend a while not outputting and somehow qmaster believes it hung...

No. Unless you defined a time limit for the job, SGE won't kill it because of no output. In fact, one would first have to define what "no output" means.
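
Whether a time limit is in effect can be checked roughly like this. The queue name highmem.q is a placeholder and runtime_limits is a helper name made up for this sketch; the job number is the one from your qmaster message:

```shell
# Keep only the runtime-limit lines from "qconf -sq" output read on stdin.
runtime_limits() {
    grep -E '^(s_rt|h_rt)[[:space:]]'
}

if command -v qconf >/dev/null 2>&1; then
    # highmem.q is a placeholder queue name; INFINITY means no limit is set.
    qconf -sq highmem.q | runtime_limits || true
fi

if command -v qacct >/dev/null 2>&1; then
    # Accounting record of the failed job, if one was written.
    qacct -j 1470 | grep -E '^(failed|exit_status|ru_wallclock)' || true
fi
```

A limit could also have been requested at submission time with "-l h_rt=...", so the submit script is worth a look too.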

> So in another queue he runs it but modifies these env settings (quoting his email):
> i) Added "UNSET TMOUT"  to submit.sh;
> ii) Created .ssh/config in the home directory;
> iii) Edited ~/.ssh/config and added
> serveraliveinterval 60
> serveralivecountmax 10
> iv) Modified the permissions on ~/.ssh/config
> $ chmod 700 ~/.ssh/config

This would only apply if you have a parallel job and have changed SGE's configuration to use SSH instead of its own default startup method for slave tasks.
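
A quick way to see which startup method is configured, as a sketch (remote_startup is a helper name made up here). On 6.2 I would expect these entries to read "builtin" unless someone changed them; an entry pointing to ssh would mean the SSH settings could matter for slave tasks:

```shell
# Keep only the remote-startup entries from "qconf -sconf" output on stdin.
remote_startup() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

if command -v qconf >/dev/null 2>&1; then
    # Global cluster configuration; host-local overrides are not shown here.
    qconf -sconf | remote_startup || true
fi
```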

If it's hardware, there can be many sources: I once saw a faulty power supply with which a dual-CPU machine could run one job at a time without problems for days, but sending a second job there rebooted the node instantly.

-- Reuti

> And wouldn't you know it, the same job runs to completion, in the general queue.
> I haven't tested the same job in the highmem queue (still waiting on him)
> But this solution shouldn't make any sense because:
> 1) it's not an interactive job, it's being qsub'ed
> 2) isn't the TMOUT default UNSET in almost all distros?  (I've checked in the obvious places /etc/profile, nowhere set in our distro)
> 3) serveraliveinterval and serveralivecountmax act on an ssh login to the head node, but that doesn't carry over once the job is sent to the compute node.  The shell here is between the user (qsub) and qmaster, but the shepherd takes over once it's on the compute node...right?  So once it's running, isn't the execution daemon the owner now (taking the submitter's uid)? These envvars shouldn't matter when the job begins to run, because the shepherd talks to sge_execd, so where is the shell?  Qmaster would only declare the job dead if it can no longer talk to the host, not directly to the job.
> Is my logic correct and if not, where am I going wrong?
> The fact that machine dies despite the memory not being high tells me this is hardware related... not a sleeping shepherd :)  Any other ideas?
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=278547
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

