[GE users] ARCo dbwriter not deleting outdated records

rafaarco rafaarco at ugr.es
Mon Aug 31 10:33:03 BST 2009


    [ The following text is in the "utf-8" character set. ]
    [ Your display is set for the "ISO-8859-10" character set.  ]
    [ Some characters may be displayed incorrectly. ]

Hi again Jana,

El lun, 31-08-2009 a las 10:44 +0200, jana escribió:
> Hi,
>
> The deletion should also start as normally, but it depends on how much
> data is in the reporting file. The dbwriter takes the timestamp from the
> line of the reporting file. So if for example the dbwriter was not
> inserting for a few days and there is lot of data that needs to be
> processed, it may take a while before the deletion kicks in, as the the
> timestamp of the data being inserted is not old enough. It does not
> delete the data by the calendar date, but by the timestamp that is
> written in the reporting file. So, data being inserted today, might have
> timestamp 2 or 3 days old and therefore it is not a time to delete
> anything because the oldest data in the database is not 7 days alder
> then this timestamp.  Hope that makes sense.

I see. If I understand correctly, at the moment the deletion task was
scheduled (2:11 am) the connection to the database was broken. Then it
looks like dbwriter didn't check it hadn't deleted any records, and
removed the old records from the reporting file. So the next time the
deletion task was scheduled and run, nothing was old enough in the
reporting file so it didn't delete anything in the database (however,
the are no traces that deletion ran again after the connection problem,
maybe it did run, maybe it didn't). It's just an idea, maybe the reality
is completely different.

> The other way when derived values and deletion rules are run is always
> at the start up of dbwriter. Also, the deletion is not run until the
> derived values are computed for the given period.
>
> So, it is hard to say, there could be a bug, but it is also possible
> that the deletion rules would be performed again even if you did not
> restart it.

Well, dbwriter was running for two weeks performing no deletions at all.
It did the first deletion one day after I restarted it. But who knows...

> What concerns me more is why is the connection failing.

Ah, don't worry about that. It's not a dbwriter problem. The database
goes offline every day for a while at 2 am to be backed up.

Thank you,

Rafa

>
> Jana
>
> On 08/31/09 09:52, rafaarco wrote:
> > Hi,
> >
> > Thanks for the advice, Jana. I sort of know what happened now.
> > dbwriter.log shows this:
> >
> > 13/08/2009 13:55:12|grgen002|.ReportingDBWriter.initLogging|I|Starting
> > up dbwriter (Version 6.2u3) ---------------------------
> > 13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Connection
> > to db jdbc:oracle:thin:@150.214.20.61:1711:e
> > 13/08/2009 13:55:12|grgen002|r.ReportingDBWriter.initialize|I|Found
> > database model version 10
> > 13/08/2009 13:55:12|grgen002|tingDBWriter.getDbWriterConfig|I|
> > calculation
> > file /usr/local/apps/n1ge6_2_ib/dbwriter/database/oracle/dbwriter.xml
> > has changed, reread it
> > 13/08/2009 13:55:12|grgen002|er.file.FileParser.processFile|I|Renaming
> > reporting  to reporting.processing
> > 13/08/2009 13:55:12|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > marked as erroneous, these will be skipped
> > 13/08/2009 13:55:13|grgen002|ngDBWriter$StatisticThread.run|I|Next
> > statistic calculation will be done at 13/08/09 14:55
> > 13/08/2009 13:55:37|grgen002|rtingDBWriter.logEventDuration|I|
> > calculating derived values took 0 hours 0 minutes
> > 13/08/2009 13:55:38|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > outdated values took 0 hours 0 minutes
> >
> > [...]
> >
> > 15/08/2009 01:11:23|grgen002|rtingDBWriter.logEventDuration|I|
> > calculating derived values took 0 hours 0 minutes
> > 15/08/2009 01:11:30|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > outdated values took 0 hours 0 minutes
> > 15/08/2009 01:11:30|grgen002|BWriter$DerivedValueThread.run|I|Next
> > regular task (derived values and delete) will be done at 15/08/09 2:11
> >
> > Everything OK until here. Nothing is deleted since nothing is old
> > enough, and the deletion task runs every hour. But then:
> >
> > 15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|
> > Database.closeFailed
> >
> > com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
> >
> > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:354)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.executeQuery(Database.java:568)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.test(Database.java:334)
> >
> > com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1129)
> >
> > com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
> >                               Caused by java.sql.SQLException: No hay
> > más datos para leer del socket
> >
> > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> >
> > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
> >
> > oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
> >
> > oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
> >
> > oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
> >
> > oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
> >
> > oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
> >
> > oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
> >
> > com.sun.grid.reporting.dbwriter.db.Database
> > $ConnectionProxy.close(Database.java:721)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:352)
> > 15/08/2009 02:01:50|grgen002|bwriter.ReportingException.log|E|
> > Database.closeFailed
> >
> > com.sun.grid.reporting.dbwriter.ReportingException: Database.closeFailed
> >
> > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:486)
> >
> > com.sun.grid.reporting.dbwriter.ReportingDBWriter.mainLoop(ReportingDBWriter.java:1133)
> >
> > com.sun.grid.reporting.dbwriter.ReportingDBWriter.run(ReportingDBWriter.java:1279)
> >                               Caused by java.sql.SQLException: No hay
> > más datos para leer del socket
> >
> > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> >
> > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
> >
> > oracle.jdbc.dbaccess.DBError.check_error(DBError.java:1160)
> >
> > oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:962)
> >
> > oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:894)
> >
> > oracle.jdbc.ttc7.Ocommoncall.receive(Ocommoncall.java:106)
> >
> > oracle.jdbc.ttc7.TTC7Protocol.logoff(TTC7Protocol.java:396)
> >
> > oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:1524)
> >
> > com.sun.grid.reporting.dbwriter.db.Database
> > $ConnectionProxy.close(Database.java:721)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.closeAll(Database.java:483)
> >
> > 15/08/2009 02:01:50|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > Error - Trying to reconnect
> >
> > [...]
> >
> > 15/08/2009 02:10:52|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > Error - Trying to reconnect
> > 15/08/2009 02:11:00|grgen002|BWriter$DerivedValueThread.run|E|Unknown
> > error: com.sun.grid.reporting.dbwriter.ReportingException: Can not
> > connect to database jdbc:oracle:thin:@15
> > 0.214.20.61:1711:e: ORA-01034: ORACLE not available
> > ORA-27101: shared memory realm does not exist
> > SVR4 Error: 2: No such file or directory
> >
> >
> > com.sun.grid.reporting.dbwriter.ReportingException: Can not connect to
> > database jdbc:oracle:thin:@150.214.20.61:1711:e: ORA-01034: ORACLE not
> > avail
> > able
> > ORA-27101: shared memory realm does not exist
> > SVR4 Error: 2: No such file or directory
> >
> >
> > com.sun.grid.reporting.dbwriter.db.Database.createSQLError(Database.java:359)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:387)
> >
> > com.sun.grid.reporting.dbwriter.ReportingDBWriter
> > $DerivedValueThread.run(ReportingDBWriter.java:833)
> >                               Caused by java.sql.SQLException:
> > ORA-01034: ORACLE not available
> > ORA-27101: shared memory realm does not exist
> > SVR4 Error: 2: No such file or directory
> >
> >
> > oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> >
> > oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
> >
> > oracle.jdbc.ttc7.O3log.receive1st(O3log.java:410)
> >
> > oracle.jdbc.ttc7.TTC7Protocol.logon(TTC7Protocol.java:260)
> >
> > oracle.jdbc.driver.OracleConnection.<init>(OracleConnection.java:371)
> >
> > oracle.jdbc.driver.OracleDriver.getConnectionInstance(OracleDriver.java:551)
> >
> > oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:351)
> >
> > java.sql.DriverManager.getConnection(DriverManager.java:582)
> >
> > java.sql.DriverManager.getConnection(DriverManager.java:154)
> >
> > com.sun.grid.reporting.dbwriter.db.Database.getConnection(Database.java:381)
> > 15/08/2009 02:11:12|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > Error - Trying to reconnect
> >
> > [...]
> >
> > 15/08/2009 02:14:33|grgen002|ter.ReportingDBWriter.mainLoop|W|Connection
> > Error - Trying to reconnect
> > 15/08/2009 02:14:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > reporting  to reporting.processing
> > 15/08/2009 02:14:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > marked as erroneous, these will be skipped
> > 15/08/2009 02:15:01|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > file reporting.processing
> > 15/08/2009 02:15:01|grgen002|le.FileParser.createStatistics|I|Processed
> > 3.765 lines in 7,65s (492,03 lines/s)
> > 15/08/2009 02:15:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > reporting  to reporting.processing
> > 15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > marked as erroneous, these will be skipped
> > 15/08/2009 02:15:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > file reporting.processing
> > 15/08/2009 02:15:53|grgen002|le.FileParser.createStatistics|I|Processed
> > 372 lines in 0,52s (718,15 lines/s)
> > 15/08/2009 02:16:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > reporting  to reporting.processing
> > 15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > marked as erroneous, these will be skipped
> > 15/08/2009 02:16:53|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > file reporting.processing
> > 15/08/2009 02:16:53|grgen002|le.FileParser.createStatistics|I|Processed
> > 172 lines in 0,35s (491,43 lines/s)
> > 15/08/2009 02:17:53|grgen002|er.file.FileParser.processFile|I|Renaming
> > reporting  to reporting.processing
> > 15/08/2009 02:17:53|grgen002|iter.file.FileParser.parseFile|W|0 lines
> > marked as erroneous, these will be skipped
> > 15/08/2009 02:17:55|grgen002|iter.file.FileParser.parseFile|I|Deleting
> > file reporting.processing
> >
> > [...]
> >
> > So Oracle database wasn't available for a while and dbwriter tried to
> > reconnect. It succeeded and the accounting took place again normally.
> > However, the module that deletes the outdated values never ran again
> > (or, at least, it failed to delete the outdated values and write the log
> > entry). Shouldn't deletion get back as well as the rest of dbwriter
> > does?
> >
> > After seeing that, I tried restarting dbwriter. At first it seemed to be
> > useless, because after some hours no deletion had been performed. But I
> > checked to database again and now the minimum value was:
> >
> > 2009-08-24 01:00:00.0
> >
> > I checked the log again and there it was:
> >
> > 30/08/2009 19:08:31|grgen002|rtingDBWriter.logEventDuration|I|deleting
> > outdated values took 3 hours 19 minutes
> > 30/08/2009 19:09:21|grgen002|rtingDBWriter.logEventDuration|I|
> > calculating derived values took 0 hours 0 minutes
> >
> > I restarted dbwriter on Saturday, and the first deletion occurred on
> > Sunday at 7 pm. This too much time, isn't it?
> >
> > Now the connection failed again at 2 am as usual and the deletion module
> > is not being run. So as a workaround I guess we will have to restart the
> > process every day until the deletion module properly recovers as the
> > other module of dbwriter do.
> >
> > Thanks again and I hope there is a solution for this that doesn't need a
> > restart.
> >
> > Regards,
> >
> > Rafa
> >
> > El vie, 28-08-2009 a las 15:20 +0200, jana escribió:
> >
> >> Hi,
> >>
> >> First, check if this is the case with all the tables. Then check if in
> >> the dbwriter.log ($SGE_ROOT/$SGE_CELL/spool/dbwriter/dbwriter.log) are
> >> lines starting: 'deleting outdated values '. Also, check the log for any
> >> exceptions.
> >>
> >> And of course, check in few hours if the min value is really still the same.
> >>
> >> Regards,
> >>
> >> Jana
> >>
> >> rafaarco wrote:
> >>
> >>> I forgot: we're using SGE 6.2, ARCo 6.2u3 (this combination works
> >>> flawlessly with the exception of the outdated records) and Oracle 9i.
> >>>
> >>> Regards,
> >>>
> >>> Rafa
> >>>
> >>> El vie, 28-08-2009 a las 14:14 +0200, Rafael Arco Arredondo escribió:
> >>>
> >>>
> >>>> Hello,
> >>>>
> >>>> We have an issue with outdated records in ARCo database. Our
> >>>> configuration is the default one: raw host values are kept for 7 days
> >>>> and dbwriter runs in continuous mode (i.e., deletion should happen once
> >>>> in an hour). However, the query
> >>>>
> >>>> SELECT MIN(hv_time_start) FROM sge_host_values WHERE hv_variable = 'cpu'
> >>>>
> >>>> returns
> >>>>
> >>>> 2009-08-13 11:39:37.0
> >>>>
> >>>> And the table sge_host_values keeps growing.
> >>>>
> >>>> Since today is August 28th, I expected this minimum value to be
> >>>> 2009-08-21. So it's seems that old rows are not deleted.
> >>>>
> >>>> Any ideas?
> >>>>
> >>>> Thank you in advance,
> >>>>
> >>>> Rafa
> >>>>
> >>>>
> >>>>
> >> ------------------------------------------------------
> >> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214744
> >>
> >> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >>
> HI
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215123
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
--
Rafael Arco Arredondo
Centro de Servicios de Informática y Redes de Comunicaciones
Universidad de Granada

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215129

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list