Opened 6 years ago

Closed 6 years ago

#1560 closed defect (fixed)

Project usage sometimes doubles after qmaster restart

Reported by: markdixon
Owned by: Dave Love <…>
Priority: normal
Milestone:
Component: sge
Version: 8.1.8
Severity: minor
Keywords:


Just found a fun interaction between these two fixes:

  • #1549 (Project usage is not saved across qmaster restarts)
  • #1551 (Spool not flushed at qmaster exit)

Project usage *does* sometimes make it into the project spool object, ready to be read back at the next qmaster restart, for example when the spool is flushed at qmaster exit.

Taken together, the two fixes mean that project usage is doubled across an ordinary qmaster restart, rather than zeroed.
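
The interaction can be illustrated with a toy model. This is only a sketch, not the qmaster code: it assumes that the #1549 fix rebuilds a project's usage at startup by adding up the per-user contributions, and that this sum lands on top of whatever the project spool object already held. The names `restart_usage`, `spooled_project_usage` and `user_contributions` are hypothetical.

```python
def restart_usage(spooled_project_usage, user_contributions, rebuild_from_users):
    """Return the project usage held in memory after a simulated restart.

    spooled_project_usage: usage read back from the project spool object
        (0.0 if the spool never received the usage).
    user_contributions: per-user usage charged to this project, as spooled
        in the user objects.
    rebuild_from_users: True models the ASSUMED behaviour of the #1549 fix.
    """
    usage = spooled_project_usage
    if rebuild_from_users:
        # Assumption: the rebuilt sum is added to the spooled value.
        usage += sum(user_contributions.values())
    return usage


users = {"alice": 30.0, "bob": 70.0}  # true project usage is 100.0

# Neither fix: the spool holds no usage and nothing is rebuilt.
zeroed = restart_usage(0.0, users, rebuild_from_users=False)

# #1549 alone: usage is rebuilt from the user objects.
rebuilt = restart_usage(0.0, users, rebuild_from_users=True)

# #1549 plus #1551: the flushed spool already holds 100.0, which is counted again.
doubled = restart_usage(100.0, users, rebuild_from_users=True)
```

Under these assumptions the three cases give zeroed, correct and doubled usage respectively, which matches the behaviour reported here.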

We could:

1 Revert #1549 (but then project usage will be lost if the qmaster crashes).
2 Revert #1549 and periodically flush project objects to disk, as is done for user objects.
3 Do not flush project usage at qmaster exit (but it might end up in the spool in other ways, e.g. when the project definition is modified).
4 Stop storing usage in the project spool objects completely.

Sorry for this pain.

Change History (2)

comment:1 Changed 6 years ago by markdixon

(2nd attempt to write this comment, 1st got swallowed by my web browser)

I'm pretty certain that the fix to #1549 is wrong and should be reverted. Sorry.

  • A user's contribution to a project's usage exists in the user object in memory and spool
  • The total project's usage exists in the project object in memory only

I thought this was redundancy helping the qmaster run quickly. But it isn't: it means that users can be deleted without the project being prematurely forgiven by the sharetree for that user's usage. Which is probably the right answer.
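
A small model may make the point clearer. It is a sketch under the assumptions stated in the two bullets above, not the actual data structures: the class `Accounting` and its methods are hypothetical names, and real usage records carry more than a single number.

```python
class Accounting:
    """Toy model: per-user contributions plus an independent project total."""

    def __init__(self):
        self.user_usage = {}     # {user: {project: usage}}, kept in the user objects
        self.project_usage = {}  # {project: usage}, kept in the project objects

    def charge(self, user, project, amount):
        # Record the usage in both places.
        per_project = self.user_usage.setdefault(user, {})
        per_project[project] = per_project.get(project, 0.0) + amount
        self.project_usage[project] = self.project_usage.get(project, 0.0) + amount

    def delete_user(self, user):
        # Only the user object goes away; the project total is untouched.
        self.user_usage.pop(user, None)

    def sum_of_users(self, project):
        # What the project total would be if it were derived from user objects.
        return sum(p.get(project, 0.0) for p in self.user_usage.values())


acct = Accounting()
acct.charge("alice", "proj", 30.0)
acct.charge("bob", "proj", 70.0)
acct.delete_user("bob")

total_kept = acct.project_usage["proj"]       # still includes bob's usage
total_from_users = acct.sum_of_users("proj")  # bob's usage has been forgotten
```

Because the project total is stored separately, deleting `bob` does not reduce it. Deriving the total from the remaining user objects would forgive his usage, which is why the project total cannot simply be rebuilt from user objects.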

Drat! I had hoped it was an artifact from when the single userprj object type was split into separate user and project types years ago. Tidying all that up would have made the code much simpler.

#1549 should be restated: "bug: the spool is not storing project usage in project objects when the scheduler orders it to"

The code does try to do this. I think the problem is in a mutex called Follow_Control which coordinates the (normally 2) worker threads and prevents the user/project objects from being written to too often. This problem probably dates from the userprj object type split.
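
For readers unfamiliar with the pattern, here is a generic sketch of a lock-guarded write throttle. It is not the Follow_Control code and does not claim to reproduce the bug; the class `ThrottledSpool`, its `min_interval` parameter and the key names are all hypothetical. It only shows how a throttle of this kind can silently drop a write request that arrives too soon after the previous one.

```python
import threading


class ThrottledSpool:
    """Accept at most one write per key every `min_interval` seconds."""

    def __init__(self, min_interval):
        self._lock = threading.Lock()  # serialises the worker threads
        self._min_interval = min_interval
        self._last_write = {}          # key -> time of last accepted write
        self.written = []              # stand-in for the on-disk spool

    def request_write(self, key, value, now):
        """Return True if the write was spooled, False if it was throttled."""
        with self._lock:
            last = self._last_write.get(key)
            if last is not None and now - last < self._min_interval:
                return False           # too soon: the request is dropped
            self._last_write[key] = now
            self.written.append((key, value))
            return True


spool = ThrottledSpool(min_interval=60.0)
first = spool.request_write("project:proj", 100.0, now=0.0)
second = spool.request_write("project:proj", 120.0, now=10.0)
third = spool.request_write("project:proj", 130.0, now=60.0)
```

In this sketch the second request is discarded rather than deferred, so its value reaches the spool only if a later request carries it. Any bookkeeping error in such a throttle would show up as usage that the scheduler ordered to be stored but that never got written.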

Note: there is another known problem with Follow_Control - #1554

comment:2 Changed 6 years ago by Dave Love <…>

  • Owner set to Dave Love <…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4844/sge:

Fix #1560: revert "Fix #1549 sge_calc_tickets - preserve project usage after qmaster restart"
