[GE users] Sun Grid Engine 6.2 - ARCO dbwriter issue

Karen Magee magee at mayo.edu
Tue Sep 9 16:19:51 BST 2008


THANK YOU... A combination of your suggestions below finally allowed me
to "catch up".

-- the old cluster was/is totally shut down

-- I shut down dbwriter and removed the old data by hand.
   The bulk DELETE command took only 6 minutes... then the
   processing took about 1.5 hours to catch up on the last month
   of good data... no job data lost, which is the most important thing
   to us :-)
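
A minimal sketch of what a one-statement bulk delete might look like,
assuming the same predicate that dbwriter uses in the debug log quoted
further down (an hv_time_end cutoff plus the four hv_variable names). The
exact command that was run is not shown in this message. The helper name
bulk_delete_host_values and the in-memory SQLite database standing in for
the MySQL arco schema are illustrative assumptions only.

```python
import sqlite3


def bulk_delete_host_values(conn, cutoff, variables):
    """Delete every outdated row in a single statement.

    Returns the number of rows removed. This replaces the
    "SELECT ... limit 500" followed by "DELETE ... WHERE hv_id IN (...)"
    loop with one DELETE that carries the predicate itself.
    """
    placeholders = ", ".join("?" for _ in variables)
    sql = (
        "DELETE FROM sge_host_values "
        "WHERE hv_time_end < ? "
        f"AND hv_variable IN ({placeholders})"
    )
    cur = conn.execute(sql, [cutoff, *variables])
    conn.commit()
    return cur.rowcount


# Stand-in database (assumption): only the three columns that appear in
# the logged queries are modelled, with timestamps stored as ISO strings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sge_host_values ("
    "hv_id INTEGER PRIMARY KEY, hv_variable TEXT, hv_time_end TEXT)"
)
conn.executemany(
    "INSERT INTO sge_host_values VALUES (?, ?, ?)",
    [
        (1, "cpu", "2008-08-01 10:00:00"),        # outdated, matches
        (2, "mem_free", "2008-08-14 23:00:00"),   # outdated, matches
        (3, "cpu", "2008-09-01 10:00:00"),        # recent, kept
        (4, "other_var", "2008-08-01 10:00:00"),  # variable not listed, kept
    ],
)

deleted = bulk_delete_host_values(
    conn,
    "2008-08-15 11:00:00",
    ["np_load_avg", "cpu", "mem_free", "virtual_free"],
)
remaining = [r[0] for r in conn.execute(
    "SELECT hv_id FROM sge_host_values ORDER BY hv_id")]
```

Against a real MySQL server the parameter markers would depend on the
driver (many use %s rather than ?), and dbwriter should be stopped first,
as described above.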

Thank you for your help...You helped me understand more about how
the process works and how to intercede.
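
For anyone following the same steps, the "watch ch_line increase" check
from the quoted instructions below can be sketched as a small polling
helper. This is only an illustration: the function names, the polling
interval, and the single-column SQLite table standing in for the real
sge_checkpoint table are assumptions, and only the ch_line column is taken
from the thread.

```python
import sqlite3
import time


def read_ch_line(conn):
    """Return the current ch_line value, or None if the table is empty."""
    row = conn.execute("SELECT ch_line FROM sge_checkpoint").fetchone()
    return None if row is None else row[0]


def sample_checkpoint(conn, polls=3, interval=60.0, sleep=time.sleep):
    """Read ch_line `polls` times, waiting `interval` seconds in between."""
    samples = []
    for i in range(polls):
        samples.append(read_ch_line(conn))
        if i < polls - 1:
            sleep(interval)
    return samples


def is_advancing(samples):
    """True if the last reading is larger than the first one."""
    values = [s for s in samples if s is not None]
    return len(values) >= 2 and values[-1] > values[0]


# Stand-in database (assumption): one row with a ch_line counter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sge_checkpoint (ch_line INTEGER)")
conn.execute("INSERT INTO sge_checkpoint VALUES (100)")


def fake_sleep(_seconds):
    # Simulates dbwriter processing 500 more lines between polls.
    conn.execute("UPDATE sge_checkpoint SET ch_line = ch_line + 500")


samples = sample_checkpoint(conn, polls=3, interval=60.0, sleep=fake_sleep)
advancing = is_advancing(samples)
```

If the readings stay flat over several intervals, that would suggest
dbwriter is not yet inserting lines from the reporting file.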

--
Karen


On Mon, Sep 08, 2008 at 09:57:32AM +0200, Jana Olivova wrote:
> Hi Karen,
> 
> I am not sure why it is taking that long to delete the data; you
> obviously don't have too much old data in the database. Since you have
> done the cloned cluster upgrade, did you leave the old cluster running?
> You found that the (old) dbwriter was not running, which may
> mean that the process was stopped or that it was just not inserting into
> the database because there was a connection error to the database. Did
> you actually stop the old dbwriter before installing the new one?
> There cannot be 2 dbwriter processes writing into the same database
> schema. Do you have the end of the old dbwriter log file?
> 
> At this point I would suggest the following:
> 
> 1. on dbwriter host, source the cluster settings.sh (or .csh) file
> 2. stop the dbwriter:
> 3. comment out or delete (probably save them somewhere else for later)
> the deletion rules in
> $SGE_ROOT/dbwriter/database/mysql/dbwriter.xml. Do not delete the file;
> just remove the deletion rules. You can always perform the deletion
> of the outdated values yourself later.
> 4. If you do not need the old reporting.processing file, since it
> contains just the host values, which are generated even if there are no
> jobs running, just delete it.
> 5. change the debug level of dbwriter back to INFO (an increased debug
> level will also slow down dbwriter).
> 6. start dbwriter.
> 7. You can monitor whether dbwriter is inserting lines by querying the
> sge_checkpoint table (select * from sge_checkpoint); the value (ch_line)
> should be increasing.
> 
> Hope that helps.
> 
> Jana
> 
> On 09/05/08 22:38, Karen Magee wrote:
> >See inline ..
> >-----------------
> >On Fri, Sep 05, 2008 at 07:00:17PM +0200, Jana Olivova wrote:
> >  
> >>>>Is there anything useful in the dbwriter log file? 
> >>>>($SGE_ROOT/$SGE_CELL/spool/dbwriter/dbwriter.log), any errors, 
> >>>>exceptions?
> >>>>
> >>>>   
> >>>>        
> >>>unfortunately, no...just what looks to me to be a normal successful
> >>>startup...
> >>>
> >>>04/09/2008 
> >>>11:14:06|dnode0-bkp.mayo.edu|.ReportingDBWriter.initLogging|I|Starting 
> >>>up dbwriter (Version 6.2) ---------------------------
> >>>04/09/2008 
> >>>11:14:06|dnode0-bkp.mayo.edu|r.ReportingDBWriter.initialize|I|Connection 
> >>>to db jdbc:mysql://rcfclusterdb.mayo.edu:3306/arco
> >>>04/09/2008 
> >>>11:14:06|dnode0-bkp.mayo.edu|r.ReportingDBWriter.initialize|I|Found 
> >>>database model version 8
> >>>04/09/2008 
> >>>11:14:07|dnode0-bkp.mayo.edu|er.file.FileParser.processFile|I|Renaming 
> >>>reporting  to reporting.processing
> >>>04/09/2008 
> >>>11:14:07|dnode0-bkp.mayo.edu|iter.file.FileParser.parseFile|W|0 lines 
> >>>marked as erroneous, these will be skipped
> >>>04/09/2008 
> >>>11:14:07|dnode0-bkp.mayo.edu|tingDBWriter.getDbWriterConfig|I|calculation 
> >>>file /home/sge6_2/dbwriter/database/mysql/dbwriter.xml has changed, 
> >>>reread it
> >>>04/09/2008 
> >>>11:14:13|dnode0-bkp.mayo.edu|ngDBWriter$StatisticThread.run|I|Next 
> >>>statistic calculation will be done at 9/4/08 12:14 PM
> >>>04/09/2008 
> >>>11:14:31|dnode0-bkp.mayo.edu|rtingDBWriter.logEventDuration|I|calculating 
> >>>derived values took 0 hours 0 minutes
> >>> 
> >>>      
> >>Is this the end of the log, or is there more and you just copied a
> >>snippet? There should also be lines in the log file about the deletion
> >>time: 'deleting outdated values took X hours X minutes'. Are these
> >>messages there? You can also check the Performance Query in the ARCo
> >>web console, which shows the same information. Are there any
> >>lines in the log that say 'processed X   lines in X minutes' ?
> >>    
> >
> >..that is the entire log ...
> >The dbwriter query yields no fields with 'processed X   lines in X minutes'
> >
> >  
> >>> 
> >>>      
> >>>>How have you figured out that it is taking a long time to remove the 
> >>>>old  records?
> >>>>
> >>>>   
> >>>>        
> >>>just a guess from looking at the size (count) of the sge_host_values table.
> >>>It's decreasing...and the select command that PHP MySQL sees has that
> >>>table
> >>>in it..After running overnight, we've seen a drop of 287,000 records in
> >>>the sge_host_values record count...but I haven't seen any of the "new"
> >>>data from jobs that have run in the last week or so show up yet..
> >>> 
> >>>      
> >>This is not an indication that anything is wrong. In the deletion rules
> >>file the deletion for some host_values is set to 7 days. So, if the
> >>dbwriter was not running for some 25 days, it would delete a
> >>lot of records at once after restart. See:
> >>http://wikis.sun.com/display/GridEngine/Derived+Values+and+Deletion+Rules.
> >>Is there no new data being inserted in *any* tables? Check the sge_job
> >>table.
> >>    
> >>> 
> >>>      
> >
> >Nothing going into the sge_job table - but I shouldn't expect it if
> >it's still doing the cleanup before it starts looking at the reporting
> >file...
> >
> >right?
> >
> >  
> >>>-rw-r--r--  1 sgeadmin sgeadmin  27109033 Sep  5 09:50 reporting
> >>>-rw-r--r--  1 sgeadmin sgeadmin 360673112 Sep  4 11:13 
> >>>reporting.processing
> >>>
> >>>[root@dnode0 common]# wc -l reporting*
> >>>  235247 reporting
> >>> 3138846 reporting.processing
> >>>
> >>>I'm concerned that I'm in a vicious cycle with the reporting file 
> >>>growing and
> >>>the reporting.processing file not finishing up.....and when it does we'll
> >>>be back to the same thing again..
> >>> 
> >>>      
> >>Hmm, if the log that you have shown me is the whole log, then it looks
> >>like dbwriter is stuck somewhere. Try stopping the dbwriter, increasing
> >>the debug level in the dbwriter.conf file, and starting again; see if
> >>there is anything else in the log.
> >>
> >>    
> >I've restarted with debugging...it's working on stuff...this query over and
> >over...about 1.5 minutes apiece..
> >
> >05/09/2008 
> >15:33:07|dnode0-bkp.mayo.edu|riter.db.Database.executeQuery|D|Execute sql: 
> >SELECT hv_id FROM sge_host_values WHERE hv_time_end < {ts '2008-08-15 
> >11:00:00.0'} AND hv_variable IN ('np_load_avg', 'cpu', 'mem_free', 
> >'virtual_free') limit 500
> >05/09/2008 
> >15:34:35|dnode0-bkp.mayo.edu|iter.db.Database.executeUpdate|D|Execute sql: 
> >DELETE FROM sge_host_values WHERE hv_id IN 
> >(115390991,115390993,115390997,115391000,115391004,115391006,115391010,115391012,115391016,115391020,115391024,115391028,115391032,115391037,115391041,115391043,115391047,115391050,115391054,115391057,115391061,115391063,115391067,115391070,115391074,115391076,115391080,115391082,115391086,115391088,115391092,115391097,115391101,115391103,115391107,115391109,115391113,115391115,115391119,115391121,115391125,115391127,115391131,115391135,115391139,115391141,115391145,115391147,115391151,115391153,115391157,115391160,115391164,115391166,115391170,115391174,115391178,115391180,115391184,115391187,115391191,115391194,115391198,115391200,115391204,115391206,115391210,115391213,115391217,115391219,115391223,115391225,115391229,115391233,115391237,115391240,115391244,115391246,115391250,115391253,115391257,115391260,115391264,115391266,115391270,115391272,115391276,115391278,115391282,115391284,115391288,115391290,115391294,115391297,115391301,115391303,115391307,115391309,115391313,115391316,115391320,115391322,11
> >5391326,115391328,115391332,115391334,115391338,115391340,115391344,115391346,115391350,115391353,115391355,115391357,115391361,115391363,115391367,115391369,115391373,115391375,115391379,115391381,115391385,115391387,115391391,115391393,115391397,115391399,115391403,115391405,115391409,115391412,115391416,115391418,115391422,115391424,115391428,115391430,115391434,115391437,115391441,115391443,115391447,115391449,115391453,115391455,115391459,115391461,115391465,115391467,115391471,115391473,115391477,115391479,115391483,115391485,115391489,115391491,115391495,115391497,115391501,115391503,115391507,115391510,115391514,115391516,115391520,115391526,115391530,115391534,115391538,115391541,115391545,115391547,115391551,115391554,115391558,115391561,115391565,115391567,115391571,115391574,115391578,115391580,115391584,115391586,115391590,115391592,115391596,115391601,115391605,115391607,115391611,115391613,115391617,115391619,115391623,115391625,115391629,115391631,115391635,115391639,115391643,115391645,11539
> >1649,115391651,115391655,115391657,115391661,115391664,115391668,115391671,115391675,115391677,115391681,115391683,115391687,115391691,115391695,115391697,115391701,115391703,115391707,115391710,115391714,115391716,115391720,115391723,115391727,115391729,115391733,115391737,115391741,115391744,115391748,115391750,115391754,115391756,115391760,115391763,115391767,115391770,115391774,115391776,115391780,115391782,115391786,115391789,115391793,115391795,115391799,115391801,115391805,115391807,115391811,115391813,115391817,115391820,115391824,115391826,115391830,115391832,115391836,115391838,115391842,115391844,115391848,115391850,115391854,115391858,115391860,115391861,115391865,115391867,115391871,115391873,115391877,115391879,115391883,115391885,115391889,115391891,115391895,115391897,115391901,115391903,115391907,115391909,115391913,115391916,115391920,115391922,115391926,115391928,115391932,115391934,115391938,115391941,115391945,115391947,115391951,115391953,115391957,115391959,115391963,115391965,11539196
> >9,115391971,115391975,115391977,115391981,115391983,115391987,115391989,115391993,115391995,115391999,115392001,115392005,115392007,115392011,115392014,115392018,115392020,115392024,115392030,115392034,115392038,115392042,115392045,115392049,115392051,115392055,115392059,115392063,115392065,115392069,115392071,115392075,115392077,115392081,115392084,115392088,115392090,115392094,115392096,115392100,115392104,115392108,115392111,115392115,115392117,115392121,115392123,115392127,115392130,115392134,115392136,115392140,115392143,115392147,115392149,115392153,115392155,115392159,115392161,115392165,115392169,115392173,115392175,115392179,115392181,115392185,115392187,115392191,115392195,115392199,115392201,115392205,115392207,115392211,115392214,115392218,115392220,115392224,115392227,115392231,115392233,115392237,115392241,115392245,115392248,115392252,115392255,115392259,115392261,115392265,115392267,115392271,115392274,115392278,115392280,115392284,115392286,115392290,115392293,115392297,115392299,115392303,1
> >15392305,115392309,115392311,115392315,115392317,115392321,115392324,115392328,115392330,115392334,115392336,115392340,115392342,115392346,115392348,115392352,115392354,115392358,115392362,115392364,115392365,115392369,115392371,115392375,115392377,115392381,115392383,115392387,115392389,115392393,115392395,115392399,115392401,115392405,115392407,115392411,115392413,115392417,115392419,115392423,115392425,115392429,115392432,115392436,115392438,115392442,115392445,115392449,115392451,115392455,115392457,115392461,115392463,115392467,115392469,115392473,115392475,115392479,115392481,115392485,115392487,115392491,115392493,115392497,115392499,115392503,115392505,115392509,115392512,115392516,115392518,115392522,115392524,115392528,115392534,115392538,115392543,115392547,115392549,115392553,115392555,115392559,115392563,115392567,115392570,115392574,115392576,115392580,115392582,115392586,115392588,0)
> >05/09/2008 
> >15:34:35|dnode0-bkp.mayo.edu|ng.dbwriter.db.Database.commit|D|Thread 
> >derived commits Connection 2 
> >(null@jdbc:mysql://rcfclusterdb.mayo.edu:3306/arco)
> >05/09/2008 
> >15:34:35|dnode0-bkp.mayo.edu|riter.db.Database.executeQuery|D|Execute sql: 
> >SELECT hv_id FROM sge_host_values WHERE hv_time_end < {ts '2008-08-15 
> >11:00:00.0'} AND hv_variable IN ('np_load_avg', 'cpu', 'mem_free', 
> >'virtual_free') limit 500
> >
> >
> >
> >
> >
> >  
> 

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
-------
Karen Magee                 Unix Systems Coordinator - RCF
Mayo Clinic                 Internet: magee at mayo.edu
200 1st St SW               Phone: (507) 284-1806
Rochester, MN 55905
