[GE users] ARCo dbwriter not deleting outdated records

rafaarco rafaarco at ugr.es
Mon Nov 23 10:18:52 GMT 2009


    [ The following text is in the "utf-8" character set. ]
    [ Your display is set for the "ISO-8859-10" character set.  ]
    [ Some characters may be displayed incorrectly. ]

Hello again,

We are still having some issues with the deletion of outdated values.
Our database still goes offline for a few hours (about 4) every day. We
wish this didn't happen, but we can't control it.

Right now the latest complete, successful deletion took place on October
17th, i.e., dbwriter has been more than one month without finishing a
whole delete task. And most of the data in the sge_host_values table is
configured to be kept only for 7 days.

We increased the log level and checked that the deletion thread was
actually working. However, it only deletes 500 rows at a time (and there
are more than 5 million rows) -in a single SQL DELETE statement-, and it
looks like it's inserting data faster than it's deleting it.

Last weekend we were also able to have the database online all the time,
but nevertheless the delete process did not finish. I am also worried
about the continuous mode and the interval it is configured with.
dbwriter configuration file says the default interval is 60 seconds
-ARCo documentation says 1 hour-. I guess that if the delete process has
not completed in the configured interval, no additional threads are
started and the current thread performing the DELETEs is not aborted,
but now I am not sure after seeing that 48 hours wasn't enough for the
delete process to successfully finish.

Besides, the table indexes seem not to be very useful, since every time
a DELETE is issued, a full scan of the table is performed. And this is
slowing down the process really a lot. Would an index rebuild be useful?
Are there any indexes optimized for the deletion of sge_host_values?

So, putting it short, I would like to know if it's normal that it takes
such a long time to delete the outdated records from the database and,
if so, if it would be possible to manually delete these records (for
instance, executing this DELETE query with no rownum condition) without
corrupting the database or the dbwriter application itself. I would also
like to be sure that the delete thread is not aborted every 60 seconds
-or minutes- when it has not completed.

In conclusion, how can the process be sped up so that it can finish in
less than 24 hours? Any solutions or workarounds will be welcome.

Thank you in advance,

Rafa

P.S.: We are still using ARCo 6.2u3 and Oracle 9. Changelog for 6.2u4
didn't indicate any differences regarding this topic so we didn't update
it.

El lun, 31-08-2009 a las 11:33 +0200, rafaarco escribió:
> Hi again Jana,
>
> El lun, 31-08-2009 a las 10:44 +0200, jana escribió:
> > Hi,
> >
> > The deletion should also start as normally, but it depends on how much
> > data is in the reporting file. The dbwriter takes the timestamp from the
> > line of the reporting file. So if for example the dbwriter was not
> > inserting for a few days and there is lot of data that needs to be
> > processed, it may take a while before the deletion kicks in, as the the
> > timestamp of the data being inserted is not old enough. It does not
> > delete the data by the calendar date, but by the timestamp that is
> > written in the reporting file. So, data being inserted today, might have
> > timestamp 2 or 3 days old and therefore it is not a time to delete
> > anything because the oldest data in the database is not 7 days alder
> > then this timestamp.  Hope that makes sense.
>
> I see. If I understand correctly, at the moment the deletion task was
> scheduled (2:11 am) the connection to the database was broken. Then it
> looks like dbwriter didn't check it hadn't deleted any records, and
> removed the old records from the reporting file. So the next time the
> deletion task was scheduled and run, nothing was old enough in the
> reporting file so it didn't delete anything in the database (however,
> the are no traces that deletion ran again after the connection problem,
> maybe it did run, maybe it didn't). It's just an idea, maybe the reality
> is completely different.
>
> > The other way when derived values and deletion rules are run is always
> > at the start up of dbwriter. Also, the deletion is not run until the
> > derived values are computed for the given period.
> >
> > So, it is hard to say, there could be a bug, but it is also possible
> > that the deletion rules would be performed again even if you did not
> > restart it.
>
> Well, dbwriter was running for two weeks performing no deletions at all.
> It did the first deletion one day after I restarted it. But who knows...
>
> > What concerns me more is why is the connection failing.
>
> Ah, don't worry about that. It's not a dbwriter problem. The database
> goes offline every day for a while at 2 am to be backed up.
>
> Thank you,
>
> Rafa
>
> >
> > Jana
> >
> > On 08/31/09 09:52, rafaarco wrote:
> > > Hi,
> > >
> > > Thanks for the advice, Jana. I sort of know what happened now.
> > > dbwriter.log shows this:
> > >
> > > 13/08/2009 13:55:12|grgen002|.ReportingDBWriter.initLogging|I|Starting
> > > up dbwriter (Version 6.2u3) ---------------------------
> > > 13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Connection
> > > to db jdbc:oracle:thin:@150.214.20.61:1711:e
> > > 13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Found
> > > database model version 10
> > > 13/08/2009 13:55:12|grgen002|tingDBWriter.getDbWriterConfig|I|
> > > calculation
> > > file /usr/local/apps/n1ge6_2_ib/dbwriter/database/oracle/dbwriter.xml
> > > has changed, reread it
> > > 13/08/2009 13:55:12|grgen002|er.file.FileParser.processFile|I|Renaming
> > > reporting  to reporting.processing
> > > 13/08/2009 13:55:12|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > > marked as erroneous, these will be skipped
> > > 13/08/2009 13:55:13|grgen002|ngDBWriter$StatisticThread.run|I|Next
> > > statistic calculation will be done at 13/08/09 14:55
> > > 13/08/2009 13:55:37|grgen002|rtingDBWriter.logEventDuration|I|
> > > calculating derived values took 0 hours 0 minutes
> > > 13/08/2009 13:55:38|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > > outdated values took 0 hours 0 minutes
> > >
> > > [...]
> > >
> > > 15/08/2009 01:11:23|grgen002|rtingDBWriter.logEventDuration|I|
> > > calculating derived values took 0 hours 0 minutes
> > > 15/08/2009 01:11:30|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > > outdated values took 0 hours 0 minutes
> > > 15/08/2009 01:11:30|grgen002|BWriter$DerivedValueThread.run|I|Next
> > > regular task (derived values and delete) will be done at 15/08/09 2:11
> > >
> > > Everything OK until here. Nothing is deleted since nothing is old
> > > enough, and the deletion task runs every hour. But then:
> > >
> > > 15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|
> > > Database.closeFailed
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:354)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.executeQuery(Database.java:568)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.test(Database.java:334)
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1129)
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
> > >                               Caused by java.sql.SQLException: No hay
> > > más datos para leer del socket
> > >
> > > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> > >
> > > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
> > >
> > > oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
> > >
> > > oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
> > >
> > > oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
> > >
> > > oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
> > >
> > > oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
> > >
> > > oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database
> > > $ConnectionProxy.close(Database.java:721)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:352)
> > > 15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|
> > > Database.closeFailed
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:486)
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1133)
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
> > >                               Caused by java.sql.SQLException: No hay
> > > más datos para leer del socket
> > >
> > > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> > >
> > > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
> > >
> > > oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
> > >
> > > oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
> > >
> > > oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
> > >
> > > oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
> > >
> > > oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
> > >
> > > oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database
> > > $ConnectionProxy.close(Database.java:721)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:483)
> > >
> > > 15/08/2009 02:01:50|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > > Error - Trying to reconnect
> > >
> > > [...]
> > >
> > > 15/08/2009 02:10:52|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > > Error - Trying to reconnect
> > > 15/08/2009 02:11:00|grgen002|BWriter$DerivedValueThread.run|E|Unknown
> > > error: com.sun.grid.reporting.dbwriter.ReportingException: Can not
> > > connect to database jdbc:oracle:thin:@15
> > > 0.214.20.61:1711:e: ORA-01034: ORACLE not available
> > > ORA-27101: shared memory realm does not exist
> > > SVR4 Error: 2: No such file or directory
> > >
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingException: Can not connect to
> > > database jdbc:oracle:thin:@150.214.20.61:1711:e: ORA-01034: ORACLE not
> > > avail
> > > able
> > > ORA-27101: shared memory realm does not exist
> > > SVR4 Error: 2: No such file or directory
> > >
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:387)
> > >
> > > com.sun.grid.reporting.dbwriter.ReportingDBWriter
> > > $DerivedValueThread.run(ReportingDBWriter.java:833)
> > >                               Caused by java.sql.SQLException:
> > > ORA-01034: ORACLE not available
> > > ORA-27101: shared memory realm does not exist
> > > SVR4 Error: 2: No such file or directory
> > >
> > >
> > > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> > >
> > > oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
> > >
> > > oracle.jdbc.ttc7.O3log.receive1st(O3log.java:410)
> > >
> > > oracle.jdbc.ttc7.TTC7Protocol.logon(TTC7Protocol.java:260)
> > >
> > > oracle.jdbc.driver.OracleConnection.<init>(OracleConnection.java:371)
> > >
> > > oracle.jdbc.driver.OracleDriver.getConnectionInstance(OracleDriver.java:551)
> > >
> > > oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:351)
> > >
> > > java.sql.DriverManager.getConnection(DriverManager.java:582)
> > >
> > > java.sql.DriverManager.getConnection(DriverManager.java:154)
> > >
> > > com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:381)
> > > 15/08/2009 02:11:12|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > > Error - Trying to reconnect
> > >
> > > [...]
> > >
> > > 15/08/2009 02:14:33|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > > Error - Trying to reconnect
> > > 15/08/2009 02:14:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > > reporting  to reporting.processing
> > > 15/08/2009 02:14:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > > marked as erroneous, these will be skipped
> > > 15/08/2009 02:15:01|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > > file reporting.processing
> > > 15/08/2009 02:15:01|grgen002|le.FileParser.createStatistics|I|Processed
> > > 3.765 lines in 7,65s (492,03 lines/s)
> > > 15/08/2009 02:15:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > > reporting  to reporting.processing
> > > 15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > > marked as erroneous, these will be skipped
> > > 15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > > file reporting.processing
> > > 15/08/2009 02:15:53|grgen002|le.FileParser.createStatistics|I|Processed
> > > 372 lines in 0,52s (718,15 lines/s)
> > > 15/08/2009 02:16:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > > reporting  to reporting.processing
> > > 15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > > marked as erroneous, these will be skipped
> > > 15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > > file reporting.processing
> > > 15/08/2009 02:16:53|grgen002|le.FileParser.createStatistics|I|Processed
> > > 172 lines in 0,35s (491,43 lines/s)
> > > 15/08/2009 02:17:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > > reporting  to reporting.processing
> > > 15/08/2009 02:17:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > > marked as erroneous, these will be skipped
> > > 15/08/2009 02:17:55|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > > file reporting.processing
> > >
> > > [...]
> > >
> > > So Oracle database wasn't available for a while and dbwriter tried to
> > > reconnect. It succeeded and the accounting took place again normally.
> > > However, the module that deletes the outdated values never ran again
> > > (or, at least, it failed to delete the outdated values and write the log
> > > entry). Shouldn't deletion get back as well as the rest of dbwriter
> > > does?
> > >
> > > After seeing that, I tried restarting dbwriter. At first it seemed to be
> > > useless, because after some hours no deletion had been performed. But I
> > > checked to database again and now the minimum value was:
> > >
> > > 2009-08-24 01:00:00.0
> > >
> > > I checked the log again and there it was:
> > >
> > > 30/08/2009 19:08:31|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > > outdated values took 3 hours 19 minutes
> > > 30/08/2009 19:09:21|grgen002|rtingDBWriter.logEventDuration|I|
> > > calculating derived values took 0 hours 0 minutes
> > >
> > > I restarted dbwriter on Saturday, and the first deletion occurred on
> > > Sunday at 7 pm. This too much time, isn't it?
> > >
> > > Now the connection failed again at 2 am as usual and the deletion module
> > > is not being run. So as a workaround I guess we will have to restart the
> > > process every day until the deletion module properly recovers as the
> > > other module of dbwriter do.
> > >
> > > Thanks again and I hope there is a solution for this that doesn't need a
> > > restart.
> > >
> > > Regards,
> > >
> > > Rafa
> > >
> > > El vie, 28-08-2009 a las 15:20 +0200, jana escribió:
> > >
> > >> Hi,
> > >>
> > >> First, check if this is the case with all the tables. Then check if in
> > >> the dbwriter.log ($SGE_ROOT/$SGE_CELL/spool/dbwriter/dbwriter.log) are
> > >> lines starting: 'deleting outdated values '. Also, check the log for any
> > >> exceptions.
> > >>
> > >> And of course, check in few hours if the min value is really still the same.
> > >>
> > >> Regards,
> > >>
> > >> Jana
> > >>
> > >> rafaarco wrote:
> > >>
> > >>> I forgot: we're using SGE 6.2, ARCo 6.2u3 (this combination works
> > >>> flawlessly with the exception of the outdated records) and Oracle 9i.
> > >>>
> > >>> Regards,
> > >>>
> > >>> Rafa
> > >>>
> > >>> El vie, 28-08-2009 a las 14:14 +0200, Rafael Arco Arredondo escribió:
> > >>>
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> We have an issue with outdated records in ARCo database. Our
> > >>>> configuration is the default one: raw host values are kept for 7 days
> > >>>> and dbwriter runs in continuous mode (i.e., deletion should happen once
> > >>>> in an hour). However, the query
> > >>>>
> > >>>> SELECT MIN(hv_time_start) FROM sge_host_values WHERE hv_variable = 'cpu'
> > >>>>
> > >>>> returns
> > >>>>
> > >>>> 2009-08-13 11:39:37.0
> > >>>>
> > >>>> And the table sge_host_values keeps growing.
> > >>>>
> > >>>> Since today is August 28th, I expected this minimum value to be
> > >>>> 2009-08-21. So it's seems that old rows are not deleted.
> > >>>>
> > >>>> Any ideas?
> > >>>>
> > >>>> Thank you in advance,
> > >>>>
> > >>>> Rafa
> > >>>>
> > >>>>
> > >>>>
> > >> ------------------------------------------------------
> > >> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214744
> > >>
> > >> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> > >>
> > HI
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215123
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


--
Rafael Arco Arredondo
Centro de Servicios de Informática y Redes de Comunicaciones
Campus de Fuentenueva - Edificio Mecenas
Universidad de Granada
E-18071 Granada Spain
Tel: +34 958 241010   Ext:31114   E-mail: rafaarco at ugr.es

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=228717

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list