[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Sun Aug 26 04:37:08 BST 2007



Hi,

Where should I look for the core dump? It looks to me like none is produced,
although just before starting the daemon I change the core limit to 1GB.
Maybe I am not looking in the right place.
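
A few checks that usually explain a missing core file, as a minimal sketch.
It assumes a bash-like shell; describe_core_limit is a hypothetical helper
(not part of Grid Engine), and the core_pattern file may not exist on older
kernels.

```shell
#!/bin/sh
# Hypothetical helper: turn a value printed by `ulimit -c` into a readable
# message. The block size behind the number depends on the shell.
describe_core_limit() {
    case "$1" in
        unlimited) echo "core dumps: no size limit" ;;
        0)         echo "core dumps: disabled" ;;
        *)         echo "core dumps: limited to $1 blocks" ;;
    esac
}

# The limit is per process and inherited, so check it in the same shell
# that is about to start the daemon.
describe_core_limit "$(ulimit -c)"

# If the kernel exposes a core pattern, it decides the core file name and
# location; otherwise the core normally lands in the current working
# directory of the crashing process.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

A daemon usually changes its working directory after start, so the
directory it runs in (possibly the qmaster spool directory) may be worth
checking. Linux can also suppress core dumps for a process that changes
its user ID after start, which may apply if the daemon is started as root.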

If all is quiet I'll try to play with it again during the next maintenance.

Iwona

Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> according to the gdb info your qmaster died in function
> double_print_to_dstring() from a Bus error. Usually this means invalid
> memory was accessed. I tried
> to find a point in qmaster source code where double_print_to_dstring()
> is called with possibly invalid arguments, but so far I couldn't.
>
> Do you still have the core dump? Unfortunately the gdb info below
> shows merely that SIGBUS was raised in double_print_to_dstring() of thread
> #9, but it does not reveal the full stack trace of this thread. The gdb
> commands to get this should be something like
>
>    # thread 9
>    # where
>
> Regards,
> Andreas
>
>
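
The two gdb commands above can also be run non-interactively against a
core file. This is only a sketch: write_gdb_commands is a hypothetical
helper, and BINARY and CORE are placeholders that must point at the
sge_qmaster binary and the core it produced. "thread apply all bt" prints
the stack of every thread, which includes thread 9.

```shell
#!/bin/sh
# Hypothetical helper: write a gdb command file that lists the threads and
# prints a backtrace for each of them.
write_gdb_commands() {
    cat > "$1" <<'EOF'
info threads
thread apply all bt
EOF
}

cmdfile="${TMPDIR:-/tmp}/qmaster-bt.$$"
write_gdb_commands "$cmdfile"

# BINARY and CORE are placeholders; nothing is run unless both are set.
if [ -n "${BINARY:-}" ] && [ -n "${CORE:-}" ]; then
    gdb -batch -x "$cmdfile" "$BINARY" "$CORE"
fi

rm -f "$cmdfile"
```

Redirecting the output to a file makes it easy to attach the full trace
to a reply on the list.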
> On Wed, 15 Aug 2007, Iwona Sakrejda wrote:
>
>> So today I had my maintenance. I set reporting to true (qconf -mconf)
>> and that killed the master. Then I tried to restart it the way you
>> suggested and it would not start - a screen dump follows.
>> So I started it once more with gdb and it crashed again - some gdb
>> info is appended too.
>> Then I edited the configuration file by hand, changed reporting to
>> true, and was able to start it. But it has already died a few times
>> during the last hour.
>>
>> This is 6.0u11 on RHEL3.
>>
>> Could you suggest next debugging steps?
>>
>> Thanks a lot,
>>
>> Iwona
>>
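
To confirm which reporting setting is active without opening the editor,
the global configuration can be printed and filtered. A rough sketch:
reporting_flag is a hypothetical helper, and it assumes the
reporting_params entry sits on a single line of the qconf -sconf output.

```shell
#!/bin/sh
# Hypothetical helper: read configuration text on stdin and print the value
# of the reporting= flag found on the reporting_params line, if any.
reporting_flag() {
    awk '$1 == "reporting_params" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^reporting=/) { sub(/^reporting=/, "", $i); print $i }
         }'
}

# With a running qmaster the global configuration can be piped in directly.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf global | reporting_flag
fi
```

The same helper can be pointed at a saved copy of the configuration, which
is useful when qmaster itself will not start.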
>> [root at pc2533 root]# ps -elf|grep sge
>> 0 S root     11962 11562  0  75   0    -  1191 -      14:58 pts/3    
>> 00:00:00 grep sge
>> [root at pc2533 root]# export SGE_ND=""
>> [root at pc2533 root]# echo $SGE_ND
>>
>> [root at pc2533 root]#  /common/sge/6.0u4/bin/lx24-x86/sge_qmaster
>> Reading in complex attributes.
>> Reading in execution hosts.
>> Reading in administrative hosts.
>> Reading in submit hosts.
>> Reading in host group entries:
>>       Host group entries for group "@allhosts".
>>       Host group entries for group "@xeon04".
>>       Host group entries for group "@athlon03".
>>       Host group entries for group "@athlon02".
>>       Host group entries for group "@star".
>>       Host group entries for group "@kamland".
>>       Host group entries for group "@test".
>>       Host group entries for group "@intel01".
>>       Host group entries for group "@opteron05".
>>       Host group entries for group "@express".
>>       Host group entries for group "@debug".
>> Reading in usersets:
>>       Userset "defaultdepartment".
>>       Userset "deadlineusers".
>>       Userset "star".
>>       Userset "alice".
>>       Userset "atlas".
>>       Userset "snfactry".
>>       Userset "deepsrch".
>>       Userset "e871".
>>       Userset "kamland".
>>       Userset "sno".
>>       Userset "cdf".
>>       Userset "e896".
>>       Userset "other".
>>       Userset "admin".
>>       Userset "icecube".
>>       Userset "majorana".
>>       Userset "euso".
>>       Userset "astrogfs".
>>       Userset "staradmin".
>>       Userset "rhicthry".
>>       Userset "starprod".
>>       Userset "kamlanda".
>>       Userset "suspended".
>>       Userset "imcg".
>>       Userset "snap".
>>       Userset "starspinprod".
>>       Userset "emcal".
>> Reading in queues:
>>       Queue "all.q".
>>       Queue "starprod.q".
>>       Queue "test.q".
>>       Queue "adm.q".
>>       Queue "debug.q".
>>       Queue "big.q".
>> Reading in parallel environments:
>>       PE "make".
>>       PE "lam_loose_qrsh".
>>       PE "lammpi".
>>       PE "lam_tight_qrsh".
>>       PE "mpi".
>>       PE "simple".
>> Reading in Master_Job_List.
>> ........................
>>
>> read job database with 2390 entries in 78 seconds
>> Reading in users:
>>       User "lma".
>>       User "aya".
>>       User "dujx".
>>       User "yanwen".
>>       User "hamblen".
>>       User "shossain".
>>       User "danielx".
>>       User "guardi".
>>       User "affolder".
>>       User "kushner".
>>       User "labbe".
>>       User "lane".
>>       User "qattan".
>>       User "akio".
>>       User "dereke".
>>       User "ryd".
>>       User "rvogel".
>>       User "alexst".
>>       User "cardo".
>>       User "enoki".
>>       User "jarguin".
>>       User "cambell6".
>>       User "conesa1".
>>       User "kipnis".
>>       User "kisiel".
>>       User "kewu".
>>       User "jrkonzer".
>>       User "zimm".
>>       User "cerri".
>>       User "keith".
>>       User "yuchen".
>>       User "junmin".
>>       User "kadota".
>>       User "asim".
>>       User "witt".
>>       User "whitney".
>>       User "jrgordon".
>>       User "weiming".
>>       User "vineeth".
>>       User "chee".
>>       User "tjsymons".
>>       User "tompkins".
>>       User "ullrich".
>>       User "uscms01".
>>       User "tierney".
>>       User "balewski".
>>       User "terryh".
>>       User "sumbera".
>>       User "jklay".
>>       User "connolly".
>>       User "cormier".
>>       User "battagl".
>>       User "srini".
>>       User "solomey".
>>       User "baumgart".
>>       User "jhthomas".
>>       User "bclee".
>>       User "soliver".
>>       User "croy".
>>       User "shreyas".
>>       User "sirena".
>>       User "cwhite".
>>       User "jedynak".
>>       User "dahl".
>>       User "benedos".
>>       User "sahal".
>>       User "daues".
>>       User "romero".
>>       User "rodenm".
>>       User "relyea".
>>       User "rexwg".
>>       User "rmfarber".
>>       User "rwg".
>>       User "javiera".
>>       User "rajkumar".
>>       User "decowski".
>>       User "ivdgl".
>>       User "bravina".
>>       User "dhale".
>>       User "ijohnson".
>>       User "nxu".
>>       User "pfachini".
>>       User "piotr".
>>       User "planinic".
>>       User "porter".
>>       User "didenko".
>>       User "perry".
>>       User "pavetter".
>>       User "pavlinov".
>>       User "omall".
>>       User "pandola".
>>       User "dkettler".
>>       User "nurcan".
>>       User "nordberg".
>>       User "nataliak".
>>       User "neha".
>>       User "noblath".
>>       User "msd".
>>       User "nan".
>>       User "mgmarino".
>>       User "milford".
>>       User "milne".
>>       User "misawa".
>>       User "mcguigan".
>>       User "mcmc".
>>       User "markoff".
>>       User "maya".
>>       User "fgabler".
>>       User "maleyton".
>>       User "may".
>>       User "fisyak".
>>       User "glma".
>>       User "goldman".
>>       User "ma3d".
>>       User "lou".
>>       User "gene".
>>       User "lyu".
>>       User "gans".
>>       User "gelor".
>>       User "ealbin".
>>       User "llhsu".
>>       User "lianjunj".
>>       User "adler".
>>       User "leecl".
>>       User "admarino".
>>       User "bihonger".
>>       User "sss".
>>       User "bweaver".
>>       User "lecompte".
>>       User "lansdell".
>>       User "lauer".
>>       User "agupta".
>>       User "aihong".
>>       User "bseilhan".
>>       User "kkrueger".
>>       User "calaf".
>>       User "langley".
>>       User "lbetev".
>>       User "chenjy".
>>       User "mcsuarez".
>>       User "dipo".
>>       User "canon".
>>       User "kollegge".
>>       User "lapointe".
>>       User "carither".
>>       User "zdrazil".
>>       User "silvermy".
>>       User "kfushimi".
>>       User "lgreiner".
>>       User "cebra".
>>       User "alimvl".
>>       User "kelly".
>>       User "yzchu".
>>       User "zawisza".
>>       User "ysmirnov".
>>       User "doyen".
>>       User "yangj".
>>       User "ycoadou".
>>       User "ypang".
>>       User "chadm".
>>       User "jwebb".
>>       User "jzulr".
>>       User "jshalf".
>>       User "willson".
>>       User "awetzler".
>>       User "verdier".
>>       User "ayoung".
>>       User "joshi".
>>       User "tjoubert".
>>       User "johnbrow".
>>       User "jonaytac".
>>       User "cmironov".
>>       User "stradlin".
>>       User "jmuelmen".
>>       User "jodi".
>>       User "barnby".
>>       User "sritchey".
>>       User "costanzo".
>>       User "batygov".
>>       User "baudot".
>>       User "cristina".
>>       User "beamer".
>>       User "jhfu".
>>       User "shester".
>>       User "shigaki".
>>       User "bedaque".
>>       User "seluzhen".
>>       User "jenant".
>>       User "belaga".
>>       User "belaurik".
>>       User "sdss".
>>       User "belt".
>>       User "jed".
>>       User "sanshiro".
>>       User "sarblyth".
>>       User "saulys".
>>       User "schaffer".
>>       User "jecc".
>>       User "hha".
>>       User "rojo".
>>       User "rscalzo".
>>       User "rthomas".
>>       User "dbarnes".
>>       User "bigdeli".
>>       User "jberger".
>>       User "boercher".
>>       User "raw".
>>       User "randrup".
>>       User "jacobsen".
>>       User "raines".
>>       User "deph".
>>       User "quarrie".
>>       User "huovinen".
>>       User "bstone".
>>       User "dywue".
>>       User "hpark".
>>       User "nws".
>>       User "dietel".
>>       User "pawan".
>>       User "hjiang".
>>       User "hgritter".
>>       User "dmsteven".
>>       User "msearle".
>>       User "mshupe".
>>       User "murat".
>>       User "helbing".
>>       User "dschmier".
>>       User "half".
>>       User "hamed".
>>       User "gxrai".
>>       User "eleanor".
>>       User "mccauley".
>>       User "mckinny".
>>       User "meidm".
>>       User "faivre".
>>       User "mbotje".
>>       User "jcs".
>>       User "hma".
>>       User "macross".
>>       User "fu".
>>       User "gowdy".
>>       User "macl".
>>       User "fqwang".
>>       User "glanzman".
>>       User "fujikawa".
>>       User "fvhale".
>>       User "gfg".
>>       User "lsc01".
>>       User "geno".
>>       User "gas".
>>       User "passmore".
>>       User "yisun".
>>       User "geurts".
>>       User "yfzhang".
>>       User "tpb".
>>       User "miu".
>>       User "mhluk".
>>       User "gprior".
>>       User "spitz".
>>       User "kurca".
>>       User "koschke".
>>       User "fpaige".
>>       User "markert".
>>       User "sakuma".
>>       User "martina".
>>       User "bockjoo".
>>       User "lulc".
>>       User "manderso".
>>       User "marcel".
>>       User "mvl".
>>       User "aart".
>>       User "luehring".
>>       User "ydc".
>>       User "dunlop".
>>       User "earl".
>>       User "einsweil".
>>       User "weizhou".
>>       User "dhevang".
>>       User "mecoving".
>>       User "gweber".
>>       User "ernst".
>>       User "estienne".
>>       User "trenk".
>>       User "guojilin".
>>       User "rbf".
>>       User "feldmann".
>>       User "fergie".
>>       User "mcnp".
>>       User "grodid".
>>       User "pmf".
>>       User "brubaker".
>>       User "munhoz".
>>       User "mwhite".
>>       User "mswanger".
>>       User "molnarl".
>>       User "busenitz".
>>       User "hippolyt".
>>       User "molnard".
>>       User "horsley".
>>       User "djschleg".
>>       User "dleonard".
>>       User "herston".
>>       User "drabinow".
>>       User "millane".
>>       User "mischke".
>>       User "mjfisher".
>>       User "dannytb".
>>       User "bhaag".
>>       User "davidk".
>>       User "jcfree".
>>       User "ogilvie".
>>       User "billmei".
>>       User "obuncic".
>>       User "jaym".
>>       User "janik".
>>       User "nystrand".
>>       User "nugent".
>>       User "jacobs".
>>       User "nikolai".
>>       User "nagaslae".
>>       User "dhazen".
>>       User "ibhadju".
>>       User "ricaud".
>>       User "rjm".
>>       User "rkowen".
>>       User "rhenning".
>>       User "rcabrera".
>>       User "rellen".
>>       User "rfatemi".
>>       User "rgareus".
>>       User "pruneau".
>>       User "jgma".
>>       User "petrchal".
>>       User "orejudos".
>>       User "pclarke".
>>       User "olga".
>>       User "opspdsf".
>>       User "jin".
>>       User "bergevin".
>>       User "antai".
>>       User "spitzer".
>>       User "chaber".
>>       User "arcarter".
>>       User "smithj4".
>>       User "sixie".
>>       User "chaoz".
>>       User "siegrist".
>>       User "awes".
>>       User "chunhuih".
>>       User "sdazeley".
>>       User "scottc".
>>       User "barannik".
>>       User "saraf".
>>       User "rpicha".
>>       User "russcher".
>>       User "tgoodale".
>>       User "thenry".
>>       User "kopytin".
>>       User "ktlesko".
>>       User "kunz".
>>       User "carcassi".
>>       User "tbutler".
>>       User "cardenas".
>>       User "tbanks".
>>       User "tanya".
>>       User "kocevski".
>>       User "catalin".
>>       User "amonett".
>>       User "stergar".
>>       User "dougr".
>>       User "srikumar".
>>       User "xzb".
>>       User "bleicher".
>>       User "aarond".
>>       User "aarose".
>>       User "dimac".
>>       User "bmonreal".
>>       User "wehle".
>>       User "wenaus".
>>       User "ward".
>>       User "caines".
>>       User "vacavant".
>>       User "trattner".
>>       User "tuntsfaa".
>>       User "umatov".
>>       User "uscms02".
>>       User "wwoodvas".
>>       User "xjd".
>>       User "msun".
>>       User "howley".
>>       User "hhuang".
>>       User "moed".
>>       User "mmoura".
>>       User "dmitry".
>>       User "mlgreen".
>>       User "dougsim".
>>       User "downum".
>>       User "helge".
>>       User "drescher".
>>       User "hatake".
>>       User "hallin".
>>       User "mhorner".
>>       User "haibin".
>>       User "betya".
>>       User "ojacobsen".
>>       User "bielcik".
>>       User "ofine".
>>       User "ogreben".
>>       User "bonachea".
>>       User "jakeking".
>>       User "brandonp".
>>       User "nilsen".
>>       User "deisher".
>>       User "nickb".
>>       User "nielsenj".
>>       User "brdraney".
>>       User "nattrass".
>>       User "hypercp".
>>       User "mustapha".
>>       User "jgreid".
>>       User "potekhin".
>>       User "beckmann".
>>       User "bedanga".
>>       User "jfoster".
>>       User "pibero".
>>       User "poon".
>>       User "jelena".
>>       User "panitkin".
>>       User "jedraper".
>>       User "okorokov".
>>       User "reb".
>>       User "dang".
>>       User "beringer".
>>       User "jdanders".
>>       User "okada".
>>       User "azriel".
>>       User "joong".
>>       User "bagwell".
>>       User "classen".
>>       User "scherzer".
>>       User "schutz".
>>       User "cmauger".
>>       User "jkephart".
>>       User "rmiquel".
>>       User "romosan".
>>       User "ruda".
>>       User "bartelt".
>>       User "rhodes".
>>       User "jillings".
>>       User "cperkins".
>>       User "renault".
>>       User "kaneta".
>>       User "kareem".
>>       User "kdawson".
>>       User "sjbailey".
>>       User "skluth".
>>       User "sliwa".
>>       User "soneale".
>>       User "spadafor".
>>       User "chajecki".
>>       User "atwong".
>>       User "charles".
>>       User "shabetai".
>>       User "dtliu".
>>       User "sferrell".
>>       User "sguertin".
>>       User "cherney".
>>       User "vogt".
>>       User "vdmolen".
>>       User "kurnadi".
>>       User "tofr".
>>       User "tatsuno".
>>       User "allen".
>>       User "kkarr".
>>       User "stokstad".
>>       User "supriya".
>>       User "szeto".
>>       User "amsgc5".
>>       User "steiner".
>>       User "kerasha".
>>       User "stardb".
>>       User "keefer".
>>       User "speltz".
>>       User "liuls".
>>       User "abha".
>>       User "wjdong".
>>       User "liubo".
>>       User "westfall".
>>       User "xin".
>>       User "wayneh".
>>       User "wbaird".
>>       User "lbland".
>>       User "cadler".
>>       User "vernet".
>>       User "vkoch".
>>       User "wes".
>>       User "blyth".
>>       User "alandav".
>>       User "kmontag".
>>       User "rderradi".
>>       User "matteo".
>>       User "dlamenti".
>>       User "u16301".
>>       User "markp".
>>       User "alexis3".
>>       User "fsimon".
>>       User "yoshiu".
>>       User "zarzhit".
>>       User "zhliu".
>>       User "fyodor".
>>       User "ynara".
>>       User "luis".
>>       User "xzcai".
>>       User "loken".
>>       User "lsadler".
>>       User "mheffner".
>>       User "emit0".
>>       User "emorris".
>>       User "schaefer".
>>       User "bombara".
>>       User "mcvady".
>>       User "mmeijer".
>>       User "mnorman".
>>       User "kyba".
>>       User "greatkei".
>>       User "hai".
>>       User "wuyf".
>>       User "mauri".
>>       User "atang".
>>       User "nrl".
>>       User "cyberman".
>>       User "jmonroe".
>>       User "gaillard".
>>       User "mlisa".
>>       User "gaudiche".
>>       User "mkaplan".
>>       User "rmaruyam".
>>       User "xuyichun".
>>       User "mira".
>>       User "nastone".
>>       User "nayla".
>>       User "gorbunov".
>>       User "nancy".
>>       User "fliu".
>>       User "golling".
>>       User "mucci".
>>       User "mweber".
>>       User "fross".
>>       User "gidal".
>>       User "ftaylor".
>>       User "gedanken".
>>       User "mstewart".
>>       User "fwh".
>>       User "mmiller".
>>       User "msar".
>>       User "pck".
>>       User "hardtke".
>>       User "oldi".
>>       User "putschke".
>>       User "canonrs".
>>       User "nilanthi".
>>       User "oana".
>>       User "ofisyak".
>>       User "engelage".
>>       User "greiman".
>>       User "nieuwhzn".
>>       User "nikas".
>>       User "fcp".
>>       User "fegray".
>>       User "nevski".
>>       User "gpdf".
>>       User "gregoire".
>>       User "hoo".
>>       User "dlesage".
>>       User "rajeshn".
>>       User "hgray".
>>       User "hhholmes".
>>       User "pilcher".
>>       User "pollney".
>>       User "hcfang".
>>       User "partlan".
>>       User "peitzma".
>>       User "pharvey".
>>       User "dskinner".
>>       User "dsmith".
>>       User "osiegrist".
>>       User "parsons".
>>       User "e871code".
>>       User "olson".
>>       User "ivanshin".
>>       User "sakamil".
>>       User "sasmith".
>>       User "sethzenz".
>>       User "btev".
>>       User "ianh".
>>       User "buncic".
>>       User "rknop".
>>       User "rreddy".
>>       User "ruanlj".
>>       User "dinofm".
>>       User "djengh".
>>       User "rayd".
>>       User "rclee".
>>       User "rcwells".
>>       User "resconi".
>>       User "struck".
>>       User "debasish".
>>       User "sosebee".
>>       User "starofl".
>>       User "snelling".
>>       User "sorensen".
>>       User "brant".
>>       User "smckee".
>>       User "jason".
>>       User "jasondet".
>>       User "shirley".
>>       User "shjang".
>>       User "sjoelin".
>>       User "iwona".
>>       User "brijesh".
>>       User "severini".
>>       User "berryhil".
>>       User "dart".
>>       User "tdonnell".
>>       User "jkiryluk".
>>       User "jiafei".
>>       User "subhasis".
>>       User "svl".
>>       User "swing".
>>       User "sychan".
>>       User "szarwas".
>>       User "taluc".
>>       User "tdavis".
>>       User "tjt".
>>       User "jeromel".
>>       User "suaide".
>>       User "jdodd".
>>       User "stone".
>>       User "justin".
>>       User "jvirzi".
>>       User "bekele".
>>       User "czhong".
>>       User "timser".
>>       User "d3c724".
>>       User "julery".
>>       User "dtyu".
>>       User "timmins".
>>       User "josephf".
>>       User "tgutierr".
>>       User "therese".
>>       User "timh".
>>       User "JLA550".
>>       User "jmeyer".
>>       User "jnovotny".
>>       User "kkowalik".
>>       User "cottrell".
>>       User "kdatta".
>>       User "kenss".
>>       User "kazumi".
>>       User "voloshin".
>>       User "kammel".
>>       User "karl".
>>       User "kaushikd".
>>       User "vlmrz".
>>       User "beberger".
>>       User "u70004".
>>       User "kabana".
>>       User "ctday".
>>       User "tipton".
>>       User "tull".
>>       User "baiyt".
>>       User "xuw".
>>       User "yakushev".
>>       User "za".
>>       User "kowalski".
>>       User "klaush".
>>       User "xichen".
>>       User "barden".
>>       User "cmouser".
>>       User "barish".
>>       User "wlav".
>>       User "kfornaz".
>>       User "wieman".
>>       User "khodinov".
>>       User "khudek".
>>       User "wcs".
>>       User "wuj".
>>       User "kjr".
>>       User "liq".
>>       User "aknospe".
>>       User "soltz".
>>       User "druss".
>>       User "leggett".
>>       User "kvetter".
>>       User "aragon".
>>       User "jorrell".
>>       User "lanou".
>>       User "lasiuk".
>>       User "zdjurcic".
>>       User "lauss".
>>       User "lblsrb".
>>       User "bachacou".
>>       User "ciocio".
>>       User "yepes".
>>       User "zberecki".
>>       User "ghoulam".
>>       User "kazuhiro".
>>       User "llope".
>>       User "kechech".
>>       User "lixh".
>>       User "arie".
>>       User "kapitan".
>>       User "aroy".
>>       User "liuzx".
>>       User "artthurs".
>>       User "shakoori".
>>       User "runge".
>>       User "koheik".
>>       User "starreco".
>>       User "levesj".
>>       User "mcosent".
>>       User "peterlos".
>>       User "wangxb".
>>       User "dkoetke".
>>       User "xwq1985".
>>       User "xinghua".
>>       User "alai".
>>       User "amol".
>>       User "cbum".
>>       User "threefay".
>>       User "cdfsoft".
>>       User "longacre".
>>       User "nbarkas".
>>       User "voeckler".
>>       User "sudhir".
>>       User "testpsff".
>>       User "bliao".
>>       User "marino".
>>       User "markh".
>>       User "marsiske".
>>       User "greenc".
>>       User "littlejo".
>>       User "rosheck".
>>       User "marco".
>>       User "aldering".
>>       User "lys".
>>       User "mavrekh".
>>       User "bongard".
>>       User "maguire".
>>       User "alvarez".
>>       User "dmeyers".
>>       User "amako".
>>       User "posk".
>>       User "hew".
>>       User "bland".
>>       User "mheinz".
>>       User "mhoemmen".
>>       User "bnorman".
>>       User "mercedes".
>>       User "mgadost".
>>       User "mgarcia".
>>       User "aconley".
>>       User "bobw".
>>       User "mendi".
>>       User "meissner".
>>       User "mattheww".
>>       User "mayes".
>>       User "afleming".
>>       User "agibson".
>>       User "cadman".
>>       User "aalseth".
>>       User "mendonca".
>>       User "zbtang".
>>       User "binet".
>>       User "bystersk".
>>       User "jhpalice".
>>       User "butter".
>>       User "vmg".
>>       User "gopalb".
>>       User "morsch".
>>       User "moyse".
>>       User "mng".
>>       User "gcosmo".
>>       User "gabriel".
>>       User "mlam".
>>       User "ekw".
>>       User "ely".
>>       User "guangqin".
>>       User "griem".
>>       User "nita".
>>       User "finch".
>>       User "okikawa".
>>       User "heeger".
>>       User "hdliu".
>>       User "draper".
>>       User "ojha".
>>       User "drkent".
>>       User "hazama".
>>       User "ofgabler".
>>       User "ofretiere".
>>       User "dthein".
>>       User "odyniec".
>>       User "hanna".
>>       User "ikelley".
>>       User "pater".
>>       User "pinkenbu".
>>       User "pastor".
>>       User "djordan".
>>       User "hlong".
>>       User "hongyu".
>>       User "omargetis".
>>       User "brandste".
>>       User "jbielcik".
>>       User "predrag".
>>       User "jasonk".
>>       User "jbk".
>>       User "brent".
>>       User "ppching".
>>       User "dhbailey".
>>       User "dibari".
>>       User "jinhui".
>>       User "rdolan".
>>       User "jiaxu".
>>       User "jingbo".
>>       User "du".
>>       User "big".
>>       User "dbest".
>>       User "jdodge".
>>       User "dclayton".
>>       User "bozek".
>>       User "qjliu".
>>       User "schweda".
>>       User "julio".
>>       User "salur".
>>       User "sandro".
>>       User "sarah".
>>       User "johnj".
>>       User "sakrejda".
>>       User "jla550".
>>       User "reichhol".
>>       User "rubind".
>>       User "sabh".
>>       User "slhuang".
>>       User "crawford".
>>       User "shimansk".
>>       User "slblyth".
>>       User "crivelli".
>>       User "shichijo".
>>       User "sevahsen".
>>       User "shapiro".
>>       User "currat".
>>       User "schuelke".
>>       User "klausk".
>>       User "tmai".
>>       User "tpavel".
>>       User "tinad".
>>       User "kgarg".
>>       User "kirill".
>>       User "smrenna".
>>       User "wleight".
>>       User "wdeng".
>>       User "lacunza".
>>       User "vanyashi".
>>       User "wbetts".
>>       User "laue".
>>       User "u7142".
>>       User "u767".
>>       User "usatlas1".
>>       User "kurts".
>>       User "trent".
>>       User "xliu".
>>       User "xylin".
>>       User "chafik".
>>       User "lesko".
>>       User "wurzel".
>>       User "lelchuk".
>>       User "dongx".
>>       User "sowinski".
>>       User "dkonerd".
>>       User "amueller".
>>       User "lockman".
>>       User "andr".
>>       User "dipak".
>>       User "akim".
>>       User "alansill".
>>       User "madaras".
>>       User "magestro".
>>       User "kadel".
>>       User "adair".
>>       User "schansen".
>>       User "herrera".
>>       User "bchkim".
>>       User "matis".
>>       User "romano".
>>       User "elnimr".
>>       User "davej".
>>       User "weigand".
>>       User "garand".
>>       User "gos".
>>       User "mreddick".
>>       User "skoby".
>>       User "reitzner".
>>       User "rquick".
>>       User "tdss".
>>       User "betan".
>>       User "wiggy13".
>>       User "flierl".
>>       User "mstoufer".
>>       User "mt".
>>       User "ter".
>>       User "fretiere".
>>       User "mora".
>>       User "mrkallen".
>>       User "gibbo".
>>       User "galtieri".
>>       User "mjchen".
>>       User "objy".
>>       User "groysman".
>>       User "guillian".
>>       User "nurit".
>>       User "nystrom".
>>       User "greiner".
>>       User "nstone".
>>       User "goupell".
>>       User "nlfarr".
>>       User "fine".
>>       User "nbeckett".
>>       User "okreylos".
>>       User "dpturner".
>>       User "drjohn".
>>       User "harsh".
>>       User "ioji".
>>       User "petar".
>>       User "hshan".
>>       User "htp".
>>       User "igv".
>>       User "dimarcom".
>>       User "opachich".
>>       User "hjort".
>>       User "heng".
>>       User "deboni".
>>       User "lorenzo".
>>       User "qhxu".
>>       User "japar".
>>       User "prindle".
>>       User "jay".
>>       User "plujan".
>>       User "pjones".
>>       User "rdiaz".
>>       User "jgwacker".
>>       User "raghu".
>>       User "dbury".
>>       User "ragerber".
>>       User "jcarter".
>>       User "seng".
>>       User "jseger".
>>       User "dandwyer".
>>       User "robbins".
>>       User "rzep".
>>       User "stavrop".
>>       User "suire".
>>       User "skmandal".
>>       User "jygabler".
>>       User "jw_lee".
>>       User "saroka".
>>       User "kocolosk".
>>       User "tradke".
>>       User "tpmccaul".
>>       User "tjwalker".
>>       User "cnepali".
>>       User "tcase".
>>       User "tim88899".
>>       User "spencer".
>>       User "ssoff".
>>       User "weaver".
>>       User "kvtsang".
>>       User "turcotte".
>>       User "kramer".
>>       User "clifford".
>>       User "yury".
>>       User "lluvia".
>>       User "ykkim".
>>       User "youngil".
>>       User "dpaul".
>>       User "xnwang".
>>       User "ashmansk".
>>       User "drew".
>>       User "atartir".
>>       User "lehocka".
>>       User "luyan".
>>       User "ikuro".
>>       User "mdunford".
>>       User "lundqvis".
>>       User "zgarrett".
>>       User "losecco".
>>       User "lmark".
>>       User "lmp".
>>       User "luk".
>>       User "cgrant".
>>       User "apiepke".
>>       User "akbar".
>>       User "majdi".
>>       User "malon".
>>       User "staszak".
>>       User "akorn".
>>       User "canson".
>>       User "mahsa".
>>       User "clendvai".
>>       User "lwinslow".
>>       User "bryleung".
>>       User "jianglai".
>>       User "parag".
>>       User "bzhangtx".
>>       User "andream".
>>       User "margetis".
>>       User "ailea".
>>       User "calderon".
>> Reading in projects:
>>       Project "admin".
>>       Project "alice".
>>       Project "astrogfs".
>>       Project "atlas".
>>       Project "cdf".
>>       Project "deepsrch".
>>       Project "e871".
>>       Project "e895".
>>       Project "e896".
>>       Project "euso".
>>       Project "icecube".
>>       Project "kamland".
>>       Project "majorana".
>>       Project "other".
>>       Project "rhicthry".
>>       Project "snfactry".
>>       Project "sno".
>>       Project "star".
>>       Project "imcg".
>>       Project "starspinprod".
>>       Project "emcal".
>> qmaster hard descriptor limit is set to 8192
>> qmaster soft descriptor limit is set to 8192
>> qmaster will use max. 8172 file descriptors for communication
>> qmaster will accept max. 99 dynamic event clients
>> starting up GE 6.0u11 (lx24-x86)
>> Bus error
>>
>>
>> [New Thread -1313866832 (LWP 12258)]
>> [New Thread -1324356688 (LWP 12259)]
>> [New Thread -1334846544 (LWP 12260)]
>>
>> Program received signal SIGBUS, Bus error.
>> [Switching to Thread -1324356688 (LWP 12259)]
>> 0x0812f450 in double_print_to_dstring ()
>> (gdb) (gdb) info threads
>> 10 Thread -1334846544 (LWP 12260)  0xb75adebd in 
>> pthread_rwlock_wrlock ()
>>  from /lib/tls/libpthread.so.0
>> * 9 Thread -1324356688 (LWP 12259)  0x0812f450 in 
>> double_print_to_dstring ()
>> 8 Thread -1313866832 (LWP 12258)  0xb75b1c84 in sigwait () from 
>> /lib/tls/libpthread.so.0
>> 7 Thread -1301283920 (LWP 12257)  0xb75ae59b in 
>> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>  from /lib/tls/libpthread.so.0
>> 6 Thread -1265353808 (LWP 12181)  0xb75ae59b in 
>> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>  from /lib/tls/libpthread.so.0
>> 5 Thread -1254863952 (LWP 12180)  0xb75ae59b in 
>> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>  from /lib/tls/libpthread.so.0
>> 4 Thread -1244374096 (LWP 12179)  0xb7544077 in ___newselect_nocancel ()
>>  from /lib/tls/libc.so.6
>> 3 Thread -1233884240 (LWP 12178)  0xb75ae59b in 
>> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>  from /lib/tls/libpthread.so.0
>> 2 Thread -1223394384 (LWP 12177)  0xb75ae59b in 
>> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>  from /lib/tls/libpthread.so.0
>> 1 Thread -1220095328 (LWP 12168)  0xb75acd58 in pthread_join ()
>>  from /lib/tls/libpthread.so.0
>> Cannot access memory at address 0x812f450
>>
>>
>>
>> Andreas.Haas at Sun.COM wrote:
>>> Hi Iwona,
>>>
>>> watching memory consumption patterns of daemons can be like reading
>>> tea leaves. Since
>>>
>>>    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187
>>>
>>> was fixed for 6.0u11 I have not heard of anything that sounds like a
>>> memory leak, and Andrea's memory consumption records show that qmaster
>>> was already free of memory leaks before 6.0u11.
>>>
>>> Below you say
>>>
>>>   "I cannot enable reporting either. When I try those daemons
>>>    (the master and the scheduler) crash right away too."
>>>
>>> Are you referring here to reporting(5), or is it the outcome of
>>> running the daemons undaemonized as I suggested?
>>>
>>> Regards,
>>> Andreas
>>>
>>>
>>> On Tue, 31 Jul 2007, Iwona Sakrejda wrote:
>>>
>>>> Hi,
>>>>
>>>> Nobody picked up on this thread and today both the master and the 
>>>> scheduling
>>>> daemon are 0.5GB each. Is that normal? They have not crashed since 
>>>> 07/27,
>>>> but even if the load goes down they never shrink, they just grow 
>>>> slower.
>>>> That looks to me like a memory leak, but I am not sure how to approach
>>>> debugging of this problem.
>>>>
>>>> I can schedule a maintenance period and try debugging, but would like to
>>>> have a better plan of what and how to debug.
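
One way to turn the "is this a leak?" question into data is to log the resident size of sge_qmaster and sge_schedd at fixed intervals and compare the curve with the number of jobs in the system. What follows is only a minimal sketch, assuming a Linux host where /proc/<pid>/status contains a VmRSS line and a current Python 3 is available; the function names, the 600-second interval and the sample count are illustrative choices and not part of Grid Engine.

```python
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     51200 kB"; the second field is the number.
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the current resident set size of a process in kB (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vmrss_kb(handle.read())


def watch(pid, interval_s=600, samples=6):
    """Print timestamped RSS readings and return them as (timestamp, kB) pairs."""
    readings = []
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        rss = read_rss_kb(pid)
        print(stamp, rss, "kB")
        readings.append((stamp, rss))
        if i + 1 < samples:
            time.sleep(interval_s)
    return readings
```

Calling watch() with the PID of each daemon and recording the qstat job count alongside it would show whether the size keeps growing when the job count falls, which is the pattern described above.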
>>>>
>>>>
>>>> Thank You,
>>>>
>>>> Iwona
>>>>
>>>> Iwona Sakrejda wrote:
>>>>> Since my qmaster and the scheduler daemons toppled over lately for
>>>>> "no good reason" I started watching their size. I started them ~27h
>>>>> ago and they were at ~50MB each. Now they both tripled in size.
>>>>>
>>>>> When I started there were about 4k jobs in the system. Now there are
>>>>> about 9k. But during the last 27h the number of jobs would sometimes 
>>>>> decrease,
>>>>> while the daemons are slowly but steadily growing. I have only serial
>>>>> jobs, about 450 running at any time on ~230 hosts; the rest are 
>>>>> pending.
>>>>>
>>>>> I run 6.0u11 on RHEL3.
>>>>>
>>>>> Is that growth normal or should it be a reason for concern?
>>>>> Does anybody run a comparable configuration and load?
>>>>> I cannot enable reporting either. When I try those daemons
>>>>> (the master and the scheduler) crash right away too.
>>>>> I enabled core dumping so I hope to have more info next time
>>>>> the system crashes.
>>>>>
>>>>> Thank You,
>>>>>
>>>>> Iwona
>>>>>
>>>>>
>>>>> Andreas.Haas at Sun.COM wrote:
>>>>>> Hi Iwona,
>>>>>>
>>>>>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Andreas.Haas at Sun.COM wrote:
>>>>>>>> Hi Iwona,
>>>>>>>>
>>>>>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this 
>>>>>>>>> morning my qmaster started dying.
>>>>>>>>
>>>>>>>> You did this as foreseen?
>>>>>>>>
>>>>>>>>    http://gridengine.sunsource.net/install60patch.txt
>>>>>>> Yes, all went through ok, no problems encountered during the 
>>>>>>> upgrade.
>>>>>>> I was very happy about that.
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> When I look at the logs I see messages:
>>>>>>>>>
>>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set to 8192
>>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set to 8192
>>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file descriptors for communication
>>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic event clients
>>>>>>>>
>>>>>>>> That is fine. It says qmaster has enough file descriptors 
>>>>>>>> available.
>>>>>>> My cluster consists of ~250 nodes, 2 CPUs each. We run 1 job per 
>>>>>>> CPU.
>>>>>>> We routinely have a few thousand jobs pending and at peak it 
>>>>>>> goes up to ~15k.
>>>>>>> I am not sure what file descriptors and dynamic event clients are 
>>>>>>> used for....
>>>>>>
>>>>>> Dynamic event clients are only needed for DRMAA clients and when
>>>>>>
>>>>>>    qsub -sync y
>>>>>>
>>>>>> is used. Usually the default of 99 is ample. The same is true 
>>>>>> of the 8192 file descriptors. If you estimate 1 file descriptor 
>>>>>> for each node, you still have 8192-250 spare fds for client 
>>>>>> commands connecting to qmaster. So one can safely exclude this as 
>>>>>> the root of your qmaster problem.
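
The headroom estimate above can be written out as a small calculation. This is a rough sketch and not how qmaster itself accounts for descriptors: the one-descriptor-per-host figure is the estimate from the reply above, the 20 held back is simply the difference between the 8192 limit and the 8172 "for communication" in the log, and the function names are made up for illustration.

```python
import resource


def spare_descriptors(limit, exec_hosts, reserved=20):
    """Estimate descriptors left for client commands.

    limit      -- descriptor limit reported by qmaster (8192 in the log)
    exec_hosts -- number of execution hosts, assuming one descriptor each
    reserved   -- descriptors not used for communication; 8192 - 8172 = 20
                  is taken from the log lines, the reason for it is assumed
    """
    return limit - reserved - exec_hosts


def current_soft_limit():
    """Return the soft open-file limit of the calling process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

For the cluster described here, spare_descriptors(8192, 250) gives 7922, so even several thousand simultaneous client connections would fit under the limit.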
>>>>>>
>>>>>>>>
>>>>>>>>> Other than that nothing special.
>>>>>>>>>
>>>>>>>>> Also when I restart the qmaster I get messages:
>>>>>>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>>>>>  starting sge_qmaster
>>>>>>>>>  starting sge_schedd
>>>>>>>>> daemonize error: timeout while waiting for daemonize state
>>>>>>>>
>>>>>>>> That means the scheduler is having some problem during start-up. 
>>>>>>>> From the message one cannot say what is causing the problem, 
>>>>>>>> but it could be due to qmaster in turn having problems.
>>>>>>> I am restarting them after the crash when the cluster is fully 
>>>>>>> loaded. Is it possible that it just needs more time to re-read all
>>>>>>> the info about running and pending jobs?
>>>>>>
>>>>>> Actually this I would rule out.
>>>>>>
>>>>>>> Where would the scheduler print any messages about problems it 
>>>>>>> is having?
>>>>>>
>>>>>> To investigate the problem I suggest you launch qmaster and 
>>>>>> scheduler separately as binaries rather than using the sgemaster 
>>>>>> script. All you need is two root shells with the Grid Engine 
>>>>>> environment (settings.{sh|csh}) set.
>>>>>>
>>>>>> Then you do this:
>>>>>>
>>>>>>    # setenv SGE_ND
>>>>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>>>>>
>>>>>> If you see that everything went well with qmaster start-up (e.g. test 
>>>>>> whether qhost gets you reasonable output), you continue with 
>>>>>> launching the scheduler from the other shell:
>>>>>>
>>>>>>    # setenv SGE_ND
>>>>>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>>>>>
>>>>>> but my expectation is that qmaster will already report some problem 
>>>>>> and exit.
>>>>>> Normally qmaster should not exit with SGE_ND in the environment, as it 
>>>>>> prevents daemonizing.
>>>>>>
>>>>>>>>>  starting sge_shadowd
>>>>>>>>> error: getting configuration: failed receiving gdi request
>>>>>>>>
>>>>>>>> Another indication of a crashed or sick qmaster.
>>>>>>>>
>>>>>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>>>>>
>>>>>>>>> How bad is any of that, could crashes be related to it?
>>>>>>>>
>>>>>>>> Very likely.
>>>>>>>>
>>>>>>>>> I am running on RHEL3.
>>>>>>>>
>>>>>>>> Have you tried some other OS?
>>>>>>> We will be upgrading shortly but at this time I have no choice, 
>>>>>>> I have to keep the cluster
>>>>>>> running with the OS I have.
>>>>>>>
>>>>>>> Yesterday I gathered some more empirical evidence about the 
>>>>>>> crashes - might be just
>>>>>>> a coincidence. The story is long and related to a filesystem we 
>>>>>>> are using (GPFS) but here is the part related to SGE.
>>>>>>
>>>>>> Actually I'm not aware of any problem with GPFS, but it could be 
>>>>>> related.
>>>>>> Is qmaster spooling located on the GPFS volume? Are you using 
>>>>>> classic or BDB spooling?
>>>>>>
>>>>>>
>>>>>>> Sometimes on the client host the filesystem daemons get killed 
>>>>>>> and that leaves the SGE processes on the client defunct - still 
>>>>>>> there, but master cannot communicate with them. qdel will not 
>>>>>>> dispose of the user's job, the load is not reported.
>>>>>>> The easiest is to just reboot the node - it does not happen very 
>>>>>>> often,
>>>>>>> just a few nodes per day at most.
>>>>>>>
>>>>>>> But even if I reboot the node, the client will not start 
>>>>>>> properly unless I clean the local spool directory. I did not 
>>>>>>> figure out which files are interfering, but if I delete the 
>>>>>>> whole local spool, the directory gets recreated and everybody 
>>>>>>> is ok, so that's what I have been doing. Reboot, delete the 
>>>>>>> local spool subdirectory, restart the SGE client.
>>>>>>
>>>>>> Usually there are no problems with execution nodes if local 
>>>>>> spooling is used. Ugh!
>>>>>>
>>>>>>
>>>>>>> Yesterday I decided to streamline my procedure and delete that 
>>>>>>> local
>>>>>>> spool directory, before I reboot the node. The moment I delete 
>>>>>>> that local
>>>>>>> spool, the master that runs on a different host crashes right away.
>>>>>>>
>>>>>>> I managed to crash it a few times, then I went to my old procedure
>>>>>>> - first reboot, then remove the local scratch and all has been 
>>>>>>> running well.
>>>>>>>
>>>>>>> (the startup messages about problems are still there, but once 
>>>>>>> started SGE runs well and
>>>>>>> I do not see any other problems).
>>>>>>
>>>>>> Bah, Ugh, Igitt!!! Well, it sounds as if it were a good idea to 
>>>>>> move away
>>>>>> from GPFS ... at least for SGE spooling. Can't you switch to a 
>>>>>> more conventional FS for that purpose?
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>> --------------------------------------------------------------------- 
>>>>>>
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>
>>>
>>> http://gridengine.info/
>>>
>>> Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551 
>>> Kirchheim-Heimstetten
>>> Munich District Court: HRB 161028
>>> Managing directors: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
>>> Chairman of the Supervisory Board: Martin Haering
>>>
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



