Opened 7 years ago

Last modified 7 years ago

#1440 new defect

execd: zero-length spool files

Reported by: m.c.dixon@…
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.2
Severity: minor
Keywords:
Cc:

Description

This isn't going to be the best problem report, but I'll do my best.

It would seem that, in some circumstances, files in an execd's spool directory can become zero-length:

# find /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062 | xargs ls -ld
drwxr-xr-x 2 sge_prod sge_prod 4096 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062
-rw-r--r-- 1 sge_prod sge_prod 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/config
-rw-r--r-- 1 sge_prod sge_prod 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/environment
-rw-r--r-- 1 foobar users 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/error
-rw-r--r-- 1 foobar users 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/exit_status
-rw-r--r-- 1 sge_prod sge_prod 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/pe_hostfile
-rw-r--r-- 1 sge_prod sge_prod 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/pid
-rw-r--r-- 1 foobar users 0 Nov 10 05:09 /var/spool/sge_prod/g9s3n4/active_jobs/2213.3062/trace

We are running with local spools on ext4 (CentOS 6.3) and the compute node unexpectedly rebooted, so this isn't totally surprising.

The job continues to show as running in qstat, and the execd messages file contains repeated entries like:

11/10/2012 05:19:51| main|g9s3n4|W|can't read pid from pid file "active_jobs/2213.3062/pid" of shepherd for job active_jobs/2213.3062

Attempting to kill the job just puts it into state "dr", but it never goes away. A "qdel -f" gets rid of it in qstat, but doesn't clear up the files on disk. Restarting the execd doesn't clean things up either.

I guess there may be a couple of things here:

1) "fsync" doesn't appear in the gridengine source (apart from when closing files on IRIX systems). Would calling fsync() at the appropriate points minimise the chance of this happening, or would it make central spool dirs perform even worse?

2) Would it be nice for the execd to recover from this situation? I can see that a corrupted spool would make this difficult.
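
For illustration only - this isn't the gridengine code, just a generic sketch of the pattern meant in point 1 (the helper name and paths are invented): write each spool file under a temporary name, fsync() it, rename() it into place, then fsync() the directory, so a crash leaves either the old contents or the complete new file rather than a zero-length one.

/* Generic write-temp/fsync/rename helper, not taken from the gridengine
 * source.  The name and paths are invented and error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_spool_file(const char *dir, const char *name,
                            const char *data, size_t len)
{
    char tmp[4096], final[4096];
    int fd, dirfd;

    snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
    snprintf(final, sizeof(final), "%s/%s", dir, name);

    /* Write the new contents to a temporary file and flush it to disk. */
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* Atomically replace the old file with the fully written one. */
    if (rename(tmp, final) != 0) {
        unlink(tmp);
        return -1;
    }

    /* Sync the directory so the rename itself survives a crash. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }
    return 0;
}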

Mark
--


Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


Change History (7)

comment:1 Changed 7 years ago by wish

On 12 November 2012 13:14, Mark Dixon <m.c.dixon@…> wrote:

...

I guess there may be a couple of things here:

1) "fsync" doesn't appear in the gridengine source (apart from when
closing files on IRIX systems). Would calling fsync() at the appropriate
points minimise the chance of this happening, or would it make central
spool dirs perform even worse?

I'm not sure there would be much point. Unless I'm missing something, the only circumstance where this should happen is following an unexpected reboot/crash. Following a reboot there should be no valid jobs on the node, so you should be able to clean up the host's spooldir unconditionally from an init script run before the sgeexecd init script. That potentially leaves ghost jobs as far as the scheduler/qmaster is concerned, though.

William

comment:2 Changed 7 years ago by m.c.dixon@…

On Mon, 12 Nov 2012, William Hay wrote:
...

I'm not sure there would be much point. Unless I'm missing something, the only circumstance where this should happen is following an unexpected reboot/crash. Following a reboot there should be no valid jobs on the node, so you should be able to clean up the host's spooldir unconditionally from an init script run before the sgeexecd init script. That potentially leaves ghost jobs as far as the scheduler/qmaster is concerned, though.

...

Maybe; however, there may be information in there that it should try harder to keep - for example, I recall that end-of-job usage data is passed from the shepherd to the execd through a file in that spool. On the other hand, it could just have been bad timing.

In any case, there is an argument for the execd itself detecting and recovering from bad spool entries, rather than pushing the burden onto the start scripts.

Cheers,

Mark

comment:3 Changed 7 years ago by dlove

Mark Dixon <m.c.dixon@…> writes:

We are running with local spools on ext4 (CentOS6.3) and the compute
node unexpectedly rebooted, so this isn't totally unexpected.

Doesn't that depend on how it's journalled? I'm a bit surprised they're
all zero-length, and this must be quite rare -- I don't think I've ever
seen it on our horribly unreliable hardware.

Attempting to kill the job just puts it into state dr but never
goes. A "qdel -f" gets rid of it in qstat, but doesn't clear up the
files on disk. Restarting the execd doesn't clean things up either.

I think there's a general problem with deleting jobs in some cases; I can't properly remember when a job should be blown away, errors notwithstanding.

I guess there may be a couple of things here:

1) "fsync" doesn't appear in the gridengine source (apart from when
closing files on IRIX systems). Would calling fsync() at the
appropriate points minimise the chance of this happening, or would it
make central spool dirs perform even worse?

I wouldn't have thought it was worth considering fsync there, particularly for a spool on NFS, though a shared-everything setup currently works fine for us.

2) Would it be nice for the execd to recover from this situation? I
can see that a corrupted spool would make this difficult.

I suppose the thing to do would be to bail out and delete the job/directory if it can't read the pid and the file is older than a certain amount. See the comment in examine_job_task_from_file for why it's not necessarily an error.
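
Something along these lines, purely as a sketch (none of this is existing execd code; the function name and the ten-minute grace period are invented, and the real logic would presumably sit around examine_job_task_from_file): only give up on the entry when the pid can't be read and the pid file is older than the threshold.

/* Sketch of the suggested recovery, not existing execd code: if the
 * shepherd pid file for an active job can't be read and it is older than
 * some grace period, treat the job as lost and remove its active_jobs
 * entry instead of warning forever.  Names and the 10-minute threshold
 * are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

#define PID_GRACE_SECONDS (10 * 60)

/* Returns 1 if the entry was cleaned up, 0 if it should be left alone. */
static int maybe_clean_stale_task(const char *task_dir)
{
    char pid_path[4096], cmd[4200];
    struct stat st;
    FILE *fp;
    int pid = 0;

    snprintf(pid_path, sizeof(pid_path), "%s/pid", task_dir);

    fp = fopen(pid_path, "r");
    if (fp != NULL) {
        if (fscanf(fp, "%d", &pid) == 1 && pid > 0) {
            fclose(fp);
            return 0;           /* pid readable: nothing to do */
        }
        fclose(fp);
    }

    /* Unreadable or empty pid file: only give up once it is old enough,
     * since a freshly started shepherd may not have written it yet. */
    if (stat(pid_path, &st) == 0 &&
        time(NULL) - st.st_mtime < PID_GRACE_SECONDS)
        return 0;

    fprintf(stderr, "removing stale active_jobs entry %s\n", task_dir);
    /* A real implementation would remove the tree itself and tell the
     * qmaster the job is gone; system("rm -rf") is just shorthand here. */
    snprintf(cmd, sizeof(cmd), "rm -rf '%s'", task_dir);
    return system(cmd) == 0;
}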

comment:4 Changed 7 years ago by dlove

William Hay <w.hay@…> writes:

Following a reboot there should be no valid jobs on the node
so you should be able to cleanup the host's spooldir unconditionally from
an init script run before the sgeexecd init script. Potentially leaves
ghost jobs as far as the scheduler/qmaster is concerned though.

Isn't that the point? Execd should report the situation, and it should
then be tidied up.

comment:5 Changed 7 years ago by m.c.dixon@…

On Tue, 13 Nov 2012, Dave Love wrote:

Mark Dixon <m.c.dixon@…> writes:

We are running with local spools on ext4 (CentOS6.3) and the compute
node unexpectedly rebooted, so this isn't totally unexpected.

Doesn't that depend on how it's journalled? I'm a bit surprised they're
all zero-length, and this must be quite rare -- I don't think I've ever
seen it on our horribly unreliable hardware.

You're right about the journalling but I was thinking that, where possible, applications ought to be coded to do the right thing with the default settings.

I just got a job on my virtual test cluster to deliberately trigger a kernel panic, and it didn't generate any zero-length files. Perhaps we were just very unlucky - someone ran a task array requiring insane disk I/O to the parallel storage, which in turn seemed to cause the reboots. This affected 8 tasks out of ~128K.

...

I suppose the thing to do would be to bail out and delete the
job/directory if it can't read the pid when the file is older than a
certain amount. See the comment in examine_job_task_from_file for why
it's not necessarily an error.

That was the sort of thing I had in mind :)

Thanks for the pointer to examine_job_task_from_file...

All the best,

Mark

comment:6 Changed 7 years ago by dlove

Mark Dixon <m.c.dixon@…> writes:

You're right about the journalling but I was thinking that, where
possible, applications ought to be coded to do the right thing with
the default settings.

It was just an incidental comment on what you'd expect with ext4, as it sounded like a slightly jaundiced view, doubtless for good reason.

It's clearly an SGE bug, but I guess it's triggered rarely, so it seems of "patch welcome" priority (modulo Liverpool users, obviously :-/) compared with everything else.

comment:7 Changed 7 years ago by m.c.dixon@…

On Fri, 16 Nov 2012, Dave Love wrote:
...

It's clearly an SGE bug, but I guess it's triggered rarely, so seems of
"patch welcome" priority (modulo Liverpool users, obviously :-/)
compared with everything else.

...

Naturally :)

Mark
