[GE users] ARCo dbwriter not deleting outdated records

jana jana.olivova at sun.com
Mon Nov 23 13:15:03 GMT 2009



Hi,

The limit on deletion was added because of the issue http://gridengine.sunsource.net/issues/show_bug.cgi?id=2284

The 60 seconds is the interval at which dbwriter parses the reporting file.
The 1 hour is the interval at which the derived values and the deletion rules are run.
The deletion rules depend on the derived values: if the derived values have not been computed, the raw values will not be deleted.
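
For reference, a deletion rule in dbwriter.xml looks roughly like the sketch below. This is from memory, so treat the element and attribute names as an assumption and compare against the dbwriter.xml shipped with your installation:

    <delete scope="host_values" time_range="day" time_amount="7">
        <sub_scope>cpu</sub_scope>
    </delete>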

It is possible that, with the database going offline every day and this amount of data, something prevents the derived values from being computed, and thus the deletion rules do not run either.

Or it is possible that the data is being deleted, just not fast enough (because of the 4-hour stop every day) to keep up with the rate of insertion.

You can see which indexes are configured on each table in the dbdefinition.xml file. You can of course drop indexes and create your own to optimize performance, and you can perform any other database optimization that you consider useful, like an index rebuild.
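
For example, if the delete rules filter on hv_time_start (the column used in the query further down in this thread), an index like the following might avoid the full table scans; the index name here is only an illustration, and you should verify the actual delete condition on your database first:

    CREATE INDEX hv_time_start_idx ON sge_host_values (hv_time_start);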

You can of course perform the deletion manually; I'd suggest stopping the dbwriter for that, although it should not be necessary as long as you do not try to delete records from a parent table that might still be needed by child records. In that case you would, of course, receive a foreign key violation.

Also, before deleting the raw values, I would check that the derived values have been created. And because of the rollback segments (see the issue above), I would not try to delete everything in one sweep.
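
A minimal sketch of such a batched manual deletion, assuming Oracle, a 7-day retention period, and that hv_time_start is the relevant time column (adjust the batch size to what your rollback segments can handle):

    BEGIN
        LOOP
            -- delete at most 10000 outdated rows per round
            DELETE FROM sge_host_values
             WHERE hv_time_start < SYSDATE - 7
               AND ROWNUM <= 10000;
            EXIT WHEN SQL%ROWCOUNT = 0;
            COMMIT;  -- commit each batch to keep rollback segments small
        END LOOP;
        COMMIT;
    END;
    /

If the derived values live in the same table, add a condition that restricts the delete to the raw variables you have already verified as aggregated.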

There might be other optimizations of the delete query itself that would speed it up, maybe adding an ORDER BY clause, but you have to test on your specific database whether it has any impact; in principle the query optimizer identifies the optimal execution plan regardless of the query syntax.

Also, to limit the database size, make sure that you are only collecting data that you really need. If you are not interested in job_log values, you can set that to false via qconf -mconf, and you can also restrict which host values are collected by specifying report_variables.
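
As a sketch of where those settings live (parameter names as I remember them from 6.2; verify against the sge_conf and host_conf man pages):

    # qconf -mconf   (global cluster configuration)
    reporting_params  accounting=true reporting=true \
                      flush_time=00:00:15 joblog=false sharelog=00:00:00

    # qconf -me <hostname>   (execution host configuration)
    report_variables  cpu,mem_free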

Regards,

Jana Olivova


On 11/23/09 11:18, rafaarco wrote:

Hello again,

We are still having some issues with the deletion of outdated values.
Our database still goes offline for a few hours (about 4) every day. We
wish this didn't happen, but we can't control it.

Right now the latest complete, successful deletion took place on October
17th, i.e., dbwriter has gone more than a month without finishing a
whole delete task. And most of the data in the sge_host_values table is
configured to be kept for only 7 days.

We increased the log level and checked that the deletion thread was
actually working. However, each SQL DELETE statement removes only 500
rows at a time (and there are more than 5 million rows), and it looks
like data is being inserted faster than it is being deleted.

Last weekend we were also able to keep the database online all the time,
but the delete process still did not finish. I am also worried about the
continuous mode and the interval it is configured with: the dbwriter
configuration file says the default interval is 60 seconds, while the
ARCo documentation says 1 hour. I assumed that if the delete process has
not completed within the configured interval, no additional threads are
started and the thread performing the DELETEs is not aborted, but now I
am not sure, after seeing that 48 hours wasn't enough for the delete
process to finish.

Besides, the table indexes do not seem to be very useful, since every
time a DELETE is issued, a full scan of the table is performed, and this
slows down the process a lot. Would an index rebuild be useful? Are
there any indexes optimized for deletion from sge_host_values?

So, in short, I would like to know whether it is normal that deleting
the outdated records from the database takes such a long time and, if
so, whether it would be possible to delete these records manually (for
instance, by executing this DELETE query with no rownum condition)
without corrupting the database or the dbwriter application itself. I
would also like to be sure that the delete thread is not aborted every
60 seconds (or minutes) when it has not completed.

In conclusion, how can the process be sped up so that it finishes in
less than 24 hours? Any solutions or workarounds will be welcome.

Thank you in advance,

Rafa

P.S.: We are still using ARCo 6.2u3 and Oracle 9. The changelog for
6.2u4 didn't indicate any differences regarding this topic, so we didn't
update.

On Mon, 2009-08-31 at 11:33 +0200, rafaarco wrote:


Hi again Jana,

On Mon, 2009-08-31 at 10:44 +0200, jana wrote:


Hi,

The deletion should also start as normal, but it depends on how much
data is in the reporting file. The dbwriter takes the timestamp from the
lines of the reporting file. So if, for example, the dbwriter was not
inserting for a few days and there is a lot of data that needs to be
processed, it may take a while before the deletion kicks in, as the
timestamp of the data being inserted is not old enough. It does not
delete the data by the calendar date, but by the timestamp that is
written in the reporting file. So data being inserted today might have a
timestamp 2 or 3 days old, and therefore it is not yet time to delete
anything, because the oldest data in the database is not 7 days older
than this timestamp. Hope that makes sense.


I see. If I understand correctly, at the moment the deletion task was
scheduled (2:11 am) the connection to the database was broken. Then it
looks like dbwriter didn't check that it hadn't deleted any records, and
removed the old records from the reporting file. So the next time the
deletion task was scheduled and run, nothing in the reporting file was
old enough, so it didn't delete anything in the database (there are,
however, no traces that deletion ran again after the connection problem;
maybe it did run, maybe it didn't). It's just an idea; maybe the reality
is completely different.



The other time the derived values and deletion rules are run is at the
startup of dbwriter. Also, the deletion is not run until the derived
values are computed for the given period.

So it is hard to say: there could be a bug, but it is also possible
that the deletion rules would have been performed again even if you had
not restarted it.


Well, dbwriter was running for two weeks performing no deletions at all.
It did the first deletion one day after I restarted it. But who knows...



What concerns me more is why the connection is failing.


Ah, don't worry about that. It's not a dbwriter problem. The database
goes offline every day for a while at 2 am to be backed up.

Thank you,

Rafa



Jana

On 08/31/09 09:52, rafaarco wrote:


Hi,

Thanks for the advice, Jana. I sort of know what happened now.
dbwriter.log shows this:

13/08/2009 13:55:12|grgen002|.ReportingDBWriter.initLogging|I|Starting
up dbwriter (Version 6.2u3) ---------------------------
13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Connection
to db jdbc:oracle:thin:@150.214.20.61:1711:e
13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Found
database model version 10
13/08/2009 13:55:12|grgen002|tingDBWriter.getDbWriterConfig|I|
calculation
file /usr/local/apps/n1ge6_2_ib/dbwriter/database/oracle/dbwriter.xml
has changed, reread it
13/08/2009 13:55:12|grgen002|er.file.FileParser.processFile|I|Renaming
reporting  to reporting.processing
13/08/2009 13:55:12|grgen002|iter.file.FileParser.parseFile|W|0 lines
marked as erroneous, these will be skipped
13/08/2009 13:55:13|grgen002|ngDBWriter$StatisticThread.run|I|Next
statistic calculation will be done at 13/08/09 14:55
13/08/2009 13:55:37|grgen002|rtingDBWriter.logEventDuration|I|
calculating derived values took 0 hours 0 minutes
13/08/2009 13:55:38|grgen002|rtingDBWriter.logEventDuration|I|deleting
outdated values took 0 hours 0 minutes

[...]

15/08/2009 01:11:23|grgen002|rtingDBWriter.logEventDuration|I|
calculating derived values took 0 hours 0 minutes
15/08/2009 01:11:30|grgen002|rtingDBWriter.logEventDuration|I|deleting
outdated values took 0 hours 0 minutes
15/08/2009 01:11:30|grgen002|BWriter$DerivedValueThread.run|I|Next
regular task (derived values and delete) will be done at 15/08/09 2:11

Everything was fine up to this point. Nothing is deleted since nothing
is old enough, and the deletion task runs every hour. But then:

15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|Database.closeFailed
com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
    com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:354)
    com.sun.grid.reporting.dbwriter.db.Database.executeQuery(Database.java:568)
    com.sun.grid.reporting.dbwriter.db.Database.test(Database.java:334)
    com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1129)
    com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
Caused by java.sql.SQLException: No hay más datos para leer del socket
[no more data to read from the socket]
    oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
    oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
    oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
    oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
    oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
    oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
    oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
    oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
    com.sun.grid.reporting.dbwriter.db.Database$ConnectionProxy.close(Database.java:721)
    com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:352)
15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|Database.closeFailed
com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
    com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
    com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:486)
    com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1133)
    com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
Caused by java.sql.SQLException: No hay más datos para leer del socket
[no more data to read from the socket]
    oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
    oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
    oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
    oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
    oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
    oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
    oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
    oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
    com.sun.grid.reporting.dbwriter.db.Database$ConnectionProxy.close(Database.java:721)
    com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:483)

15/08/2009 02:01:50|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
Error - Trying to reconnect

[...]

15/08/2009 02:10:52|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
Error - Trying to reconnect
15/08/2009 02:11:00|grgen002|BWriter$DerivedValueThread.run|E|Unknown error: com.sun.grid.reporting.dbwriter.ReportingException: Can not connect to database jdbc:oracle:thin:@150.214.20.61:1711:e: ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory

com.sun.grid.reporting.dbwriter.ReportingException: Can not connect to database jdbc:oracle:thin:@150.214.20.61:1711:e: ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
    com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
    com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:387)
    com.sun.grid.reporting.dbwriter.ReportingDBWriter$DerivedValueThread.run(ReportingDBWriter.java:833)
Caused by java.sql.SQLException: ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
    oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
    oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
    oracle.jdbc.ttc7.O3log.receive1st(O3log.java:410)
    oracle.jdbc.ttc7.TTC7Protocol.logon(TTC7Protocol.java:260)
    oracle.jdbc.driver.OracleConnection.<init>(OracleConnection.java:371)
    oracle.jdbc.driver.OracleDriver.getConnectionInstance(OracleDriver.java:551)
    oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:351)
    java.sql.DriverManager.getConnection(DriverManager.java:582)
    java.sql.DriverManager.getConnection(DriverManager.java:154)
    com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:381)
15/08/2009 02:11:12|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
Error - Trying to reconnect

[...]

15/08/2009 02:14:33|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
Error - Trying to reconnect
15/08/2009 02:14:53|grgen002|er.file.FileParser.processFile|I|Renaming
reporting  to reporting.processing
15/08/2009 02:14:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
marked as erroneous, these will be skipped
15/08/2009 02:15:01|grgen002|iter.file.FileParser.parseFile|I|Deleting
file reporting.processing
15/08/2009 02:15:01|grgen002|le.FileParser.createStatistics|I|Processed
3.765 lines in 7,65s (492,03 lines/s)
15/08/2009 02:15:53|grgen002|er.file.FileParser.processFile|I|Renaming
reporting  to reporting.processing
15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
marked as erroneous, these will be skipped
15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
file reporting.processing
15/08/2009 02:15:53|grgen002|le.FileParser.createStatistics|I|Processed
372 lines in 0,52s (718,15 lines/s)
15/08/2009 02:16:53|grgen002|er.file.FileParser.processFile|I|Renaming
reporting  to reporting.processing
15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
marked as erroneous, these will be skipped
15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
file reporting.processing
15/08/2009 02:16:53|grgen002|le.FileParser.createStatistics|I|Processed
172 lines in 0,35s (491,43 lines/s)
15/08/2009 02:17:53|grgen002|er.file.FileParser.processFile|I|Renaming
reporting  to reporting.processing
15/08/2009 02:17:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
marked as erroneous, these will be skipped
15/08/2009 02:17:55|grgen002|iter.file.FileParser.parseFile|I|Deleting
file reporting.processing

[...]

So the Oracle database wasn't available for a while and dbwriter tried
to reconnect. It succeeded and the accounting took place again normally.
However, the module that deletes the outdated values never ran again
(or, at least, it failed to delete the outdated values and write the log
entry). Shouldn't deletion recover just as the rest of dbwriter does?

After seeing that, I tried restarting dbwriter. At first it seemed to be
useless, because after some hours no deletion had been performed. But I
checked the database again and now the minimum value was:

2009-08-24 01:00:00.0

I checked the log again and there it was:

30/08/2009 19:08:31|grgen002|rtingDBWriter.logEventDuration|I|deleting
outdated values took 3 hours 19 minutes
30/08/2009 19:09:21|grgen002|rtingDBWriter.logEventDuration|I|
calculating derived values took 0 hours 0 minutes

I restarted dbwriter on Saturday, and the first deletion occurred on
Sunday at 7 pm. That's too much time, isn't it?

Now the connection has failed again at 2 am as usual and the deletion
module is not being run. So as a workaround I guess we will have to
restart the process every day until the deletion module recovers
properly, as the other modules of dbwriter do.

Thanks again, and I hope there is a solution for this that doesn't
require a restart.

Regards,

Rafa

On Fri, 2009-08-28 at 15:20 +0200, jana wrote:



Hi,

First, check whether this is the case with all the tables. Then check
whether the dbwriter log ($SGE_ROOT/$SGE_CELL/spool/dbwriter/dbwriter.log)
contains lines starting with 'deleting outdated values'. Also, check the
log for any exceptions.

And of course, check again in a few hours whether the min value is really still the same.

Regards,

Jana

rafaarco wrote:



I forgot: we're using SGE 6.2, ARCo 6.2u3 (this combination works
flawlessly except for the outdated records issue) and Oracle 9i.

Regards,

Rafa

On Fri, 2009-08-28 at 14:14 +0200, Rafael Arco Arredondo wrote:




Hello,

We have an issue with outdated records in the ARCo database. Our
configuration is the default one: raw host values are kept for 7 days
and dbwriter runs in continuous mode (i.e., deletion should happen once
an hour). However, the query

SELECT MIN(hv_time_start) FROM sge_host_values WHERE hv_variable = 'cpu'

returns

2009-08-13 11:39:37.0

And the table sge_host_values keeps growing.

Since today is August 28th, I expected this minimum value to be
2009-08-21. So it seems that old rows are not being deleted.

Any ideas?

Thank you in advance,

Rafa




