[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Thu Aug 16 02:31:42 BST 2007


So today I had my maintenance window. I set reporting to true (qconf -mconf)
and that killed the master. Then I tried to restart it the way you suggested
and it would not start; a screen dump follows.
So I started it once more under gdb and it crashed again; some gdb info
is appended too.
Then I edited the configuration file by hand and changed the reporting
to true
and I was able to start it. But it has already died a few times during the
last hour.

This is 6.0u11 on RHEL3.

Could you suggest next debugging steps?

Thanks a lot,

Iwona

[root@pc2533 root]# ps -elf|grep sge
0 S root     11962 11562  0  75   0    -  1191 -      14:58 pts/3    00:00:00 grep sge
[root@pc2533 root]# export SGE_ND=""
[root@pc2533 root]# echo $SGE_ND

[root@pc2533 root]# /common/sge/6.0u4/bin/lx24-x86/sge_qmaster
Reading in complex attributes.
Reading in execution hosts.
Reading in administrative hosts.
Reading in submit hosts.
Reading in host group entries:
        Host group entries for group "@allhosts".
        Host group entries for group "@xeon04".
        Host group entries for group "@athlon03".
        Host group entries for group "@athlon02".
        Host group entries for group "@star".
        Host group entries for group "@kamland".
        Host group entries for group "@test".
        Host group entries for group "@intel01".
        Host group entries for group "@opteron05".
        Host group entries for group "@express".
        Host group entries for group "@debug".
Reading in usersets:
        Userset "defaultdepartment".
        Userset "deadlineusers".
        Userset "star".
        Userset "alice".
        Userset "atlas".
        Userset "snfactry".
        Userset "deepsrch".
        Userset "e871".
        Userset "kamland".
        Userset "sno".
        Userset "cdf".
        Userset "e896".
        Userset "other".
        Userset "admin".
        Userset "icecube".
        Userset "majorana".
        Userset "euso".
        Userset "astrogfs".
        Userset "staradmin".
        Userset "rhicthry".
        Userset "starprod".
        Userset "kamlanda".
        Userset "suspended".
        Userset "imcg".
        Userset "snap".
        Userset "starspinprod".
        Userset "emcal".
Reading in queues:
        Queue "all.q".
        Queue "starprod.q".
        Queue "test.q".
        Queue "adm.q".
        Queue "debug.q".
        Queue "big.q".
Reading in parallel environments:
        PE "make".
        PE "lam_loose_qrsh".
        PE "lammpi".
        PE "lam_tight_qrsh".
        PE "mpi".
        PE "simple".
Reading in Master_Job_List.
........................

read job database with 2390 entries in 78 seconds
Reading in users:
        User "lma".
        User "aya".
        User "dujx".
        User "yanwen".
        User "hamblen".
        User "shossain".
        User "danielx".
        User "guardi".
        User "affolder".
        User "kushner".
        User "labbe".
        User "lane".
        User "qattan".
        User "akio".
        User "dereke".
        User "ryd".
        User "rvogel".
        User "alexst".
        User "cardo".
        User "enoki".
        User "jarguin".
        User "cambell6".
        User "conesa1".
        User "kipnis".
        User "kisiel".
        User "kewu".
        User "jrkonzer".
        User "zimm".
        User "cerri".
        User "keith".
        User "yuchen".
        User "junmin".
        User "kadota".
        User "asim".
        User "witt".
        User "whitney".
        User "jrgordon".
        User "weiming".
        User "vineeth".
        User "chee".
        User "tjsymons".
        User "tompkins".
        User "ullrich".
        User "uscms01".
        User "tierney".
        User "balewski".
        User "terryh".
        User "sumbera".
        User "jklay".
        User "connolly".
        User "cormier".
        User "battagl".
        User "srini".
        User "solomey".
        User "baumgart".
        User "jhthomas".
        User "bclee".
        User "soliver".
        User "croy".
        User "shreyas".
        User "sirena".
        User "cwhite".
        User "jedynak".
        User "dahl".
        User "benedos".
        User "sahal".
        User "daues".
        User "romero".
        User "rodenm".
        User "relyea".
        User "rexwg".
        User "rmfarber".
        User "rwg".
        User "javiera".
        User "rajkumar".
        User "decowski".
        User "ivdgl".
        User "bravina".
        User "dhale".
        User "ijohnson".
        User "nxu".
        User "pfachini".
        User "piotr".
        User "planinic".
        User "porter".
        User "didenko".
        User "perry".
        User "pavetter".
        User "pavlinov".
        User "omall".
        User "pandola".
        User "dkettler".
        User "nurcan".
        User "nordberg".
        User "nataliak".
        User "neha".
        User "noblath".
        User "msd".
        User "nan".
        User "mgmarino".
        User "milford".
        User "milne".
        User "misawa".
        User "mcguigan".
        User "mcmc".
        User "markoff".
        User "maya".
        User "fgabler".
        User "maleyton".
        User "may".
        User "fisyak".
        User "glma".
        User "goldman".
        User "ma3d".
        User "lou".
        User "gene".
        User "lyu".
        User "gans".
        User "gelor".
        User "ealbin".
        User "llhsu".
        User "lianjunj".
        User "adler".
        User "leecl".
        User "admarino".
        User "bihonger".
        User "sss".
        User "bweaver".
        User "lecompte".
        User "lansdell".
        User "lauer".
        User "agupta".
        User "aihong".
        User "bseilhan".
        User "kkrueger".
        User "calaf".
        User "langley".
        User "lbetev".
        User "chenjy".
        User "mcsuarez".
        User "dipo".
        User "canon".
        User "kollegge".
        User "lapointe".
        User "carither".
        User "zdrazil".
        User "silvermy".
        User "kfushimi".
        User "lgreiner".
        User "cebra".
        User "alimvl".
        User "kelly".
        User "yzchu".
        User "zawisza".
        User "ysmirnov".
        User "doyen".
        User "yangj".
        User "ycoadou".
        User "ypang".
        User "chadm".
        User "jwebb".
        User "jzulr".
        User "jshalf".
        User "willson".
        User "awetzler".
        User "verdier".
        User "ayoung".
        User "joshi".
        User "tjoubert".
        User "johnbrow".
        User "jonaytac".
        User "cmironov".
        User "stradlin".
        User "jmuelmen".
        User "jodi".
        User "barnby".
        User "sritchey".
        User "costanzo".
        User "batygov".
        User "baudot".
        User "cristina".
        User "beamer".
        User "jhfu".
        User "shester".
        User "shigaki".
        User "bedaque".
        User "seluzhen".
        User "jenant".
        User "belaga".
        User "belaurik".
        User "sdss".
        User "belt".
        User "jed".
        User "sanshiro".
        User "sarblyth".
        User "saulys".
        User "schaffer".
        User "jecc".
        User "hha".
        User "rojo".
        User "rscalzo".
        User "rthomas".
        User "dbarnes".
        User "bigdeli".
        User "jberger".
        User "boercher".
        User "raw".
        User "randrup".
        User "jacobsen".
        User "raines".
        User "deph".
        User "quarrie".
        User "huovinen".
        User "bstone".
        User "dywue".
        User "hpark".
        User "nws".
        User "dietel".
        User "pawan".
        User "hjiang".
        User "hgritter".
        User "dmsteven".
        User "msearle".
        User "mshupe".
        User "murat".
        User "helbing".
        User "dschmier".
        User "half".
        User "hamed".
        User "gxrai".
        User "eleanor".
        User "mccauley".
        User "mckinny".
        User "meidm".
        User "faivre".
        User "mbotje".
        User "jcs".
        User "hma".
        User "macross".
        User "fu".
        User "gowdy".
        User "macl".
        User "fqwang".
        User "glanzman".
        User "fujikawa".
        User "fvhale".
        User "gfg".
        User "lsc01".
        User "geno".
        User "gas".
        User "passmore".
        User "yisun".
        User "geurts".
        User "yfzhang".
        User "tpb".
        User "miu".
        User "mhluk".
        User "gprior".
        User "spitz".
        User "kurca".
        User "koschke".
        User "fpaige".
        User "markert".
        User "sakuma".
        User "martina".
        User "bockjoo".
        User "lulc".
        User "manderso".
        User "marcel".
        User "mvl".
        User "aart".
        User "luehring".
        User "ydc".
        User "dunlop".
        User "earl".
        User "einsweil".
        User "weizhou".
        User "dhevang".
        User "mecoving".
        User "gweber".
        User "ernst".
        User "estienne".
        User "trenk".
        User "guojilin".
        User "rbf".
        User "feldmann".
        User "fergie".
        User "mcnp".
        User "grodid".
        User "pmf".
        User "brubaker".
        User "munhoz".
        User "mwhite".
        User "mswanger".
        User "molnarl".
        User "busenitz".
        User "hippolyt".
        User "molnard".
        User "horsley".
        User "djschleg".
        User "dleonard".
        User "herston".
        User "drabinow".
        User "millane".
        User "mischke".
        User "mjfisher".
        User "dannytb".
        User "bhaag".
        User "davidk".
        User "jcfree".
        User "ogilvie".
        User "billmei".
        User "obuncic".
        User "jaym".
        User "janik".
        User "nystrand".
        User "nugent".
        User "jacobs".
        User "nikolai".
        User "nagaslae".
        User "dhazen".
        User "ibhadju".
        User "ricaud".
        User "rjm".
        User "rkowen".
        User "rhenning".
        User "rcabrera".
        User "rellen".
        User "rfatemi".
        User "rgareus".
        User "pruneau".
        User "jgma".
        User "petrchal".
        User "orejudos".
        User "pclarke".
        User "olga".
        User "opspdsf".
        User "jin".
        User "bergevin".
        User "antai".
        User "spitzer".
        User "chaber".
        User "arcarter".
        User "smithj4".
        User "sixie".
        User "chaoz".
        User "siegrist".
        User "awes".
        User "chunhuih".
        User "sdazeley".
        User "scottc".
        User "barannik".
        User "saraf".
        User "rpicha".
        User "russcher".
        User "tgoodale".
        User "thenry".
        User "kopytin".
        User "ktlesko".
        User "kunz".
        User "carcassi".
        User "tbutler".
        User "cardenas".
        User "tbanks".
        User "tanya".
        User "kocevski".
        User "catalin".
        User "amonett".
        User "stergar".
        User "dougr".
        User "srikumar".
        User "xzb".
        User "bleicher".
        User "aarond".
        User "aarose".
        User "dimac".
        User "bmonreal".
        User "wehle".
        User "wenaus".
        User "ward".
        User "caines".
        User "vacavant".
        User "trattner".
        User "tuntsfaa".
        User "umatov".
        User "uscms02".
        User "wwoodvas".
        User "xjd".
        User "msun".
        User "howley".
        User "hhuang".
        User "moed".
        User "mmoura".
        User "dmitry".
        User "mlgreen".
        User "dougsim".
        User "downum".
        User "helge".
        User "drescher".
        User "hatake".
        User "hallin".
        User "mhorner".
        User "haibin".
        User "betya".
        User "ojacobsen".
        User "bielcik".
        User "ofine".
        User "ogreben".
        User "bonachea".
        User "jakeking".
        User "brandonp".
        User "nilsen".
        User "deisher".
        User "nickb".
        User "nielsenj".
        User "brdraney".
        User "nattrass".
        User "hypercp".
        User "mustapha".
        User "jgreid".
        User "potekhin".
        User "beckmann".
        User "bedanga".
        User "jfoster".
        User "pibero".
        User "poon".
        User "jelena".
        User "panitkin".
        User "jedraper".
        User "okorokov".
        User "reb".
        User "dang".
        User "beringer".
        User "jdanders".
        User "okada".
        User "azriel".
        User "joong".
        User "bagwell".
        User "classen".
        User "scherzer".
        User "schutz".
        User "cmauger".
        User "jkephart".
        User "rmiquel".
        User "romosan".
        User "ruda".
        User "bartelt".
        User "rhodes".
        User "jillings".
        User "cperkins".
        User "renault".
        User "kaneta".
        User "kareem".
        User "kdawson".
        User "sjbailey".
        User "skluth".
        User "sliwa".
        User "soneale".
        User "spadafor".
        User "chajecki".
        User "atwong".
        User "charles".
        User "shabetai".
        User "dtliu".
        User "sferrell".
        User "sguertin".
        User "cherney".
        User "vogt".
        User "vdmolen".
        User "kurnadi".
        User "tofr".
        User "tatsuno".
        User "allen".
        User "kkarr".
        User "stokstad".
        User "supriya".
        User "szeto".
        User "amsgc5".
        User "steiner".
        User "kerasha".
        User "stardb".
        User "keefer".
        User "speltz".
        User "liuls".
        User "abha".
        User "wjdong".
        User "liubo".
        User "westfall".
        User "xin".
        User "wayneh".
        User "wbaird".
        User "lbland".
        User "cadler".
        User "vernet".
        User "vkoch".
        User "wes".
        User "blyth".
        User "alandav".
        User "kmontag".
        User "rderradi".
        User "matteo".
        User "dlamenti".
        User "u16301".
        User "markp".
        User "alexis3".
        User "fsimon".
        User "yoshiu".
        User "zarzhit".
        User "zhliu".
        User "fyodor".
        User "ynara".
        User "luis".
        User "xzcai".
        User "loken".
        User "lsadler".
        User "mheffner".
        User "emit0".
        User "emorris".
        User "schaefer".
        User "bombara".
        User "mcvady".
        User "mmeijer".
        User "mnorman".
        User "kyba".
        User "greatkei".
        User "hai".
        User "wuyf".
        User "mauri".
        User "atang".
        User "nrl".
        User "cyberman".
        User "jmonroe".
        User "gaillard".
        User "mlisa".
        User "gaudiche".
        User "mkaplan".
        User "rmaruyam".
        User "xuyichun".
        User "mira".
        User "nastone".
        User "nayla".
        User "gorbunov".
        User "nancy".
        User "fliu".
        User "golling".
        User "mucci".
        User "mweber".
        User "fross".
        User "gidal".
        User "ftaylor".
        User "gedanken".
        User "mstewart".
        User "fwh".
        User "mmiller".
        User "msar".
        User "pck".
        User "hardtke".
        User "oldi".
        User "putschke".
        User "canonrs".
        User "nilanthi".
        User "oana".
        User "ofisyak".
        User "engelage".
        User "greiman".
        User "nieuwhzn".
        User "nikas".
        User "fcp".
        User "fegray".
        User "nevski".
        User "gpdf".
        User "gregoire".
        User "hoo".
        User "dlesage".
        User "rajeshn".
        User "hgray".
        User "hhholmes".
        User "pilcher".
        User "pollney".
        User "hcfang".
        User "partlan".
        User "peitzma".
        User "pharvey".
        User "dskinner".
        User "dsmith".
        User "osiegrist".
        User "parsons".
        User "e871code".
        User "olson".
        User "ivanshin".
        User "sakamil".
        User "sasmith".
        User "sethzenz".
        User "btev".
        User "ianh".
        User "buncic".
        User "rknop".
        User "rreddy".
        User "ruanlj".
        User "dinofm".
        User "djengh".
        User "rayd".
        User "rclee".
        User "rcwells".
        User "resconi".
        User "struck".
        User "debasish".
        User "sosebee".
        User "starofl".
        User "snelling".
        User "sorensen".
        User "brant".
        User "smckee".
        User "jason".
        User "jasondet".
        User "shirley".
        User "shjang".
        User "sjoelin".
        User "iwona".
        User "brijesh".
        User "severini".
        User "berryhil".
        User "dart".
        User "tdonnell".
        User "jkiryluk".
        User "jiafei".
        User "subhasis".
        User "svl".
        User "swing".
        User "sychan".
        User "szarwas".
        User "taluc".
        User "tdavis".
        User "tjt".
        User "jeromel".
        User "suaide".
        User "jdodd".
        User "stone".
        User "justin".
        User "jvirzi".
        User "bekele".
        User "czhong".
        User "timser".
        User "d3c724".
        User "julery".
        User "dtyu".
        User "timmins".
        User "josephf".
        User "tgutierr".
        User "therese".
        User "timh".
        User "JLA550".
        User "jmeyer".
        User "jnovotny".
        User "kkowalik".
        User "cottrell".
        User "kdatta".
        User "kenss".
        User "kazumi".
        User "voloshin".
        User "kammel".
        User "karl".
        User "kaushikd".
        User "vlmrz".
        User "beberger".
        User "u70004".
        User "kabana".
        User "ctday".
        User "tipton".
        User "tull".
        User "baiyt".
        User "xuw".
        User "yakushev".
        User "za".
        User "kowalski".
        User "klaush".
        User "xichen".
        User "barden".
        User "cmouser".
        User "barish".
        User "wlav".
        User "kfornaz".
        User "wieman".
        User "khodinov".
        User "khudek".
        User "wcs".
        User "wuj".
        User "kjr".
        User "liq".
        User "aknospe".
        User "soltz".
        User "druss".
        User "leggett".
        User "kvetter".
        User "aragon".
        User "jorrell".
        User "lanou".
        User "lasiuk".
        User "zdjurcic".
        User "lauss".
        User "lblsrb".
        User "bachacou".
        User "ciocio".
        User "yepes".
        User "zberecki".
        User "ghoulam".
        User "kazuhiro".
        User "llope".
        User "kechech".
        User "lixh".
        User "arie".
        User "kapitan".
        User "aroy".
        User "liuzx".
        User "artthurs".
        User "shakoori".
        User "runge".
        User "koheik".
        User "starreco".
        User "levesj".
        User "mcosent".
        User "peterlos".
        User "wangxb".
        User "dkoetke".
        User "xwq1985".
        User "xinghua".
        User "alai".
        User "amol".
        User "cbum".
        User "threefay".
        User "cdfsoft".
        User "longacre".
        User "nbarkas".
        User "voeckler".
        User "sudhir".
        User "testpsff".
        User "bliao".
        User "marino".
        User "markh".
        User "marsiske".
        User "greenc".
        User "littlejo".
        User "rosheck".
        User "marco".
        User "aldering".
        User "lys".
        User "mavrekh".
        User "bongard".
        User "maguire".
        User "alvarez".
        User "dmeyers".
        User "amako".
        User "posk".
        User "hew".
        User "bland".
        User "mheinz".
        User "mhoemmen".
        User "bnorman".
        User "mercedes".
        User "mgadost".
        User "mgarcia".
        User "aconley".
        User "bobw".
        User "mendi".
        User "meissner".
        User "mattheww".
        User "mayes".
        User "afleming".
        User "agibson".
        User "cadman".
        User "aalseth".
        User "mendonca".
        User "zbtang".
        User "binet".
        User "bystersk".
        User "jhpalice".
        User "butter".
        User "vmg".
        User "gopalb".
        User "morsch".
        User "moyse".
        User "mng".
        User "gcosmo".
        User "gabriel".
        User "mlam".
        User "ekw".
        User "ely".
        User "guangqin".
        User "griem".
        User "nita".
        User "finch".
        User "okikawa".
        User "heeger".
        User "hdliu".
        User "draper".
        User "ojha".
        User "drkent".
        User "hazama".
        User "ofgabler".
        User "ofretiere".
        User "dthein".
        User "odyniec".
        User "hanna".
        User "ikelley".
        User "pater".
        User "pinkenbu".
        User "pastor".
        User "djordan".
        User "hlong".
        User "hongyu".
        User "omargetis".
        User "brandste".
        User "jbielcik".
        User "predrag".
        User "jasonk".
        User "jbk".
        User "brent".
        User "ppching".
        User "dhbailey".
        User "dibari".
        User "jinhui".
        User "rdolan".
        User "jiaxu".
        User "jingbo".
        User "du".
        User "big".
        User "dbest".
        User "jdodge".
        User "dclayton".
        User "bozek".
        User "qjliu".
        User "schweda".
        User "julio".
        User "salur".
        User "sandro".
        User "sarah".
        User "johnj".
        User "sakrejda".
        User "jla550".
        User "reichhol".
        User "rubind".
        User "sabh".
        User "slhuang".
        User "crawford".
        User "shimansk".
        User "slblyth".
        User "crivelli".
        User "shichijo".
        User "sevahsen".
        User "shapiro".
        User "currat".
        User "schuelke".
        User "klausk".
        User "tmai".
        User "tpavel".
        User "tinad".
        User "kgarg".
        User "kirill".
        User "smrenna".
        User "wleight".
        User "wdeng".
        User "lacunza".
        User "vanyashi".
        User "wbetts".
        User "laue".
        User "u7142".
        User "u767".
        User "usatlas1".
        User "kurts".
        User "trent".
        User "xliu".
        User "xylin".
        User "chafik".
        User "lesko".
        User "wurzel".
        User "lelchuk".
        User "dongx".
        User "sowinski".
        User "dkonerd".
        User "amueller".
        User "lockman".
        User "andr".
        User "dipak".
        User "akim".
        User "alansill".
        User "madaras".
        User "magestro".
        User "kadel".
        User "adair".
        User "schansen".
        User "herrera".
        User "bchkim".
        User "matis".
        User "romano".
        User "elnimr".
        User "davej".
        User "weigand".
        User "garand".
        User "gos".
        User "mreddick".
        User "skoby".
        User "reitzner".
        User "rquick".
        User "tdss".
        User "betan".
        User "wiggy13".
        User "flierl".
        User "mstoufer".
        User "mt".
        User "ter".
        User "fretiere".
        User "mora".
        User "mrkallen".
        User "gibbo".
        User "galtieri".
        User "mjchen".
        User "objy".
        User "groysman".
        User "guillian".
        User "nurit".
        User "nystrom".
        User "greiner".
        User "nstone".
        User "goupell".
        User "nlfarr".
        User "fine".
        User "nbeckett".
        User "okreylos".
        User "dpturner".
        User "drjohn".
        User "harsh".
        User "ioji".
        User "petar".
        User "hshan".
        User "htp".
        User "igv".
        User "dimarcom".
        User "opachich".
        User "hjort".
        User "heng".
        User "deboni".
        User "lorenzo".
        User "qhxu".
        User "japar".
        User "prindle".
        User "jay".
        User "plujan".
        User "pjones".
        User "rdiaz".
        User "jgwacker".
        User "raghu".
        User "dbury".
        User "ragerber".
        User "jcarter".
        User "seng".
        User "jseger".
        User "dandwyer".
        User "robbins".
        User "rzep".
        User "stavrop".
        User "suire".
        User "skmandal".
        User "jygabler".
        User "jw_lee".
        User "saroka".
        User "kocolosk".
        User "tradke".
        User "tpmccaul".
        User "tjwalker".
        User "cnepali".
        User "tcase".
        User "tim88899".
        User "spencer".
        User "ssoff".
        User "weaver".
        User "kvtsang".
        User "turcotte".
        User "kramer".
        User "clifford".
        User "yury".
        User "lluvia".
        User "ykkim".
        User "youngil".
        User "dpaul".
        User "xnwang".
        User "ashmansk".
        User "drew".
        User "atartir".
        User "lehocka".
        User "luyan".
        User "ikuro".
        User "mdunford".
        User "lundqvis".
        User "zgarrett".
        User "losecco".
        User "lmark".
        User "lmp".
        User "luk".
        User "cgrant".
        User "apiepke".
        User "akbar".
        User "majdi".
        User "malon".
        User "staszak".
        User "akorn".
        User "canson".
        User "mahsa".
        User "clendvai".
        User "lwinslow".
        User "bryleung".
        User "jianglai".
        User "parag".
        User "bzhangtx".
        User "andream".
        User "margetis".
        User "ailea".
        User "calderon".
Reading in projects:
        Project "admin".
        Project "alice".
        Project "astrogfs".
        Project "atlas".
        Project "cdf".
        Project "deepsrch".
        Project "e871".
        Project "e895".
        Project "e896".
        Project "euso".
        Project "icecube".
        Project "kamland".
        Project "majorana".
        Project "other".
        Project "rhicthry".
        Project "snfactry".
        Project "sno".
        Project "star".
        Project "imcg".
        Project "starspinprod".
        Project "emcal".
qmaster hard descriptor limit is set to 8192
qmaster soft descriptor limit is set to 8192
qmaster will use max. 8172 file descriptors for communication
qmaster will accept max. 99 dynamic event clients
starting up GE 6.0u11 (lx24-x86)
Bus error
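
The descriptor lines in the startup output above can be sanity-checked against the cluster size. This is only a rough sketch of the arithmetic: it assumes one descriptor per execution host (the estimate used in the quoted thread, not a measured value) and takes ~250 nodes as the cluster size. The variable names are made up for illustration.

```shell
#!/bin/sh
# Figures copied from the qmaster startup output and the quoted thread.
hard_limit=8192   # "qmaster hard descriptor limit is set to 8192"
comm_max=8172     # "qmaster will use max. 8172 file descriptors for communication"
nodes=250         # approximate cluster size (~250 nodes)

# Assumption: one descriptor per execution host, as estimated in the thread.
spare=$((hard_limit - nodes))
# Difference between the hard limit and what qmaster uses for communication.
non_comm=$((hard_limit - comm_max))

echo "descriptors left for client commands: $spare"
echo "descriptors not used for communication: $non_comm"
```

With these numbers the headroom is several thousand descriptors, which fits the view that descriptor exhaustion is unlikely to be behind the crashes.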


[New Thread -1313866832 (LWP 12258)]
[New Thread -1324356688 (LWP 12259)]
[New Thread -1334846544 (LWP 12260)]

Program received signal SIGBUS, Bus error.
[Switching to Thread -1324356688 (LWP 12259)]
0x0812f450 in double_print_to_dstring ()
(gdb) 
(gdb) info threads
  10 Thread -1334846544 (LWP 12260)  0xb75adebd in pthread_rwlock_wrlock ()
   from /lib/tls/libpthread.so.0
* 9 Thread -1324356688 (LWP 12259)  0x0812f450 in double_print_to_dstring ()
  8 Thread -1313866832 (LWP 12258)  0xb75b1c84 in sigwait () from /lib/tls/libpthread.so.0
  7 Thread -1301283920 (LWP 12257)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
  6 Thread -1265353808 (LWP 12181)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
  5 Thread -1254863952 (LWP 12180)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
  4 Thread -1244374096 (LWP 12179)  0xb7544077 in ___newselect_nocancel ()
   from /lib/tls/libc.so.6
  3 Thread -1233884240 (LWP 12178)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
  2 Thread -1223394384 (LWP 12177)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
  1 Thread -1220095328 (LWP 12168)  0xb75acd58 in pthread_join ()
   from /lib/tls/libpthread.so.0
Cannot access memory at address 0x812f450
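
One possible next step would be to capture a backtrace of every thread at the moment of the SIGBUS, since "info threads" only shows the innermost frame. The following is a minimal sketch, not a tested procedure for this installation: it assumes gdb is installed, reuses the binary path from the session above, and writes to two hypothetical file names under /tmp. It relies only on standard gdb features (batch mode, a command file, "bt" and "thread apply all bt").

```shell
#!/bin/sh
# Sketch: run qmaster under gdb in batch mode and log backtraces of all threads.
# QMASTER is the path used in the session above; CMDS and LOG are hypothetical names.
QMASTER=/common/sge/6.0u4/bin/lx24-x86/sge_qmaster
CMDS=/tmp/qmaster-gdb.cmds
LOG=/tmp/qmaster-gdb.log

# gdb commands: start the program, then dump state once it stops on a signal.
cat > "$CMDS" <<'EOF'
run
bt
info threads
thread apply all bt
EOF

# Same setting as in the session above, so qmaster stays in the foreground.
export SGE_ND=""

if command -v gdb >/dev/null 2>&1 && [ -x "$QMASTER" ]; then
    gdb -batch -x "$CMDS" "$QMASTER" > "$LOG" 2>&1
    echo "gdb output written to $LOG"
else
    echo "gdb or $QMASTER not available; commands left in $CMDS"
fi
```

The resulting log could then be attached to a follow-up message in place of a hand-copied screen dump.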



Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> Watching memory consumption patterns of daemons can be like reading
> tea leaves. Since
>
>    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187
>
> was fixed for 6.0u11 I have not heard of anything that sounds like a
> memory leak, and Andrea's memory consumption records show that qmaster
> was already free of memory leaks before 6.0u11.
>
> Below you say
>
>   "I cannot enable reporting either. When I try those daemons
>    (the master and the scheduler) crash right away too."
>
> Are you referring here to reporting(5), or is this the outcome of
> running the daemons undaemonized as I suggested?
>
> Regards,
> Andreas
>
>
> On Tue, 31 Jul 2007, Iwona Sakrejda wrote:
>
>> Hi,
>>
>> Nobody picked up on this thread and today both the master and the
>> scheduling daemon are 0.5GB each. Is that normal? They have not
>> crashed since 07/27, but even if the load goes down they never shrink,
>> they just grow more slowly.
>> That looks to me like a memory leak, but I am not sure how to approach
>> debugging this problem.
>>
>> I can schedule a maintenance period and try debugging, but would like to
>> have a better plan of what to debug and how.
>>
>>
>> Thank You,
>>
>> Iwona
>>
>> Iwona Sakrejda wrote:
>>> Since my qmaster and the scheduler daemons toppled over lately for
>>> "no good reason" I started watching their size. I started them ~27h
>>> ago and they were at ~50MB each. Now they have both tripled in size.
>>>
>>> When I started there were about 4k jobs in the system. Now there are
>>> about 9k. But during the last 27h the number of jobs would sometimes
>>> decrease, and the daemons are still slowly but steadily growing. I
>>> have only serial jobs, about 450 running at any time on ~230 hosts;
>>> the rest are pending.
>>>
>>> I run 6.0u11 on RHEL3.
>>>
>>> Is that growth normal or should it be a reason for concern?
>>> Does anybody run a comparable configuration and load?
>>> I cannot enable reporting either. When I try, those daemons
>>> (the master and the scheduler) crash right away too.
>>> I enabled core dumping so I hope to have more info next time
>>> the system crashes.
>>>
>>> Thank You,
>>>
>>> Iwona
>>>
>>>
>>> Andreas.Haas at Sun.COM wrote:
>>>> Hi Iwona,
>>>>
>>>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>>>
>>>>>
>>>>>
>>>>> Andreas.Haas at Sun.COM wrote:
>>>>>> Hi Iwona,
>>>>>>
>>>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning 
>>>>>>> my qmaster started dying.
>>>>>>
>>>>>> You did this as foreseen?
>>>>>>
>>>>>>    http://gridengine.sunsource.net/install60patch.txt
>>>>> Yes, all went through ok, no problems encountered during the upgrade.
>>>>> I was very happy about that.
>>>>
>>>> Ok.
>>>>
>>>>>>
>>>>>>
>>>>>>> When I look at the logs I see messages:
>>>>>>>
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set to 8192
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set to 8192
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file descriptors for communication
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic event clients
>>>>>>
>>>>>> That is fine. It says qmaster got enough file descriptors available.
>>>>> My cluster consists of ~250 nodes, 2 CPUs each. We run 1 job per CPU.
>>>>> We routinely have a few thousand jobs pending and at peak it goes 
>>>>> up to ~15k.
>>>>> I am not sure what file descriptors and dynamic events are used 
>>>>> for....
>>>>
>>>> Dynamic event clients are only needed for DRMAA clients and when
>>>>
>>>>    qsub -sync y
>>>>
>>>> is used. Usually the default of 99 is ample. The same is true 
>>>> of the 8192 file descriptors. If you estimate 1 file descriptor 
>>>> for each node you still have 8192-250 spare fd's for client 
>>>> commands connecting to qmaster. So this one can safely be excluded 
>>>> as the root of your qmaster problem.
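
As a rough check of the estimate above, the headroom can be computed from the figures in the log. This is only a sketch under the same assumption made in the thread (one descriptor per execution node); the function name fd_headroom is made up for illustration.

```shell
#!/bin/sh
# fd_headroom LIMIT NODES: descriptors left over for client commands,
# assuming (as estimated in the thread) one descriptor per node.
fd_headroom() {
    echo $(( $1 - $2 ))
}

fd_headroom 8192 250   # descriptor limit from the log, ~250 nodes
fd_headroom 8172 250   # the 8172 descriptors qmaster reports for communication
```

The two calls print 7942 and 7922, so either way several thousand descriptors remain for client connections.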
>>>>
>>>>>>
>>>>>>> Other than that nothing special.
>>>>>>>
>>>>>>> Also when I restart the qmaster I get messages:
>>>>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>>>  starting sge_qmaster
>>>>>>>  starting sge_schedd
>>>>>>> daemonize error: timeout while waiting for daemonize state
>>>>>>
>>>>>> That means the scheduler is having some problem during start-up. From 
>>>>>> the message one cannot say what is causing the problem, but it 
>>>>>> could be due to qmaster in turn having problems.
>>>>> I am restarting them after the crash when the cluster is fully 
>>>>> loaded. Is it possible that it just needs more time to re-read all
>>>>> the info about running and pending jobs?
>>>>
>>>> Actually this I would rule out.
>>>>
>>>>> Where would the scheduler print any messages about problems it is 
>>>>> having?
>>>>
>>>> To investigate the problem I suggest you launch qmaster and 
>>>> scheduler separately as binaries rather than using the sgemaster 
>>>> script. All you need is two root shells with the Grid Engine 
>>>> environment (settings.{sh|csh}) set.
>>>>
>>>> Then you do this:
>>>>
>>>>    # setenv SGE_ND
>>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>>>
>>>> If you see everything went well with qmaster start-up (e.g. test 
>>>> whether qhost gets you reasonable output) you continue with 
>>>> launching the scheduler from the other shell:
>>>>
>>>>    # setenv SGE_ND
>>>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>>>
>>>> but my expectation is that qmaster will already report some problem 
>>>> and exit.
>>>> Normally qmaster may not exit with SGE_ND in the environment as it 
>>>> prevents daemonizing.
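
The commands above use csh syntax (setenv). For a Bourne-type root shell, a possible equivalent is sketched below. The wrapper name run_foreground is hypothetical, and the default architecture lx24-x86 is simply the one used in this thread; SGE_ND is set to an empty value, as in the setenv lines above.

```shell
#!/bin/sh
# run_foreground DAEMON [ARCH]: start a Grid Engine daemon with SGE_ND
# present (empty) in its environment so that it does not daemonize.
# Hypothetical wrapper; SGE_ROOT must already be set (e.g. via settings.sh).
run_foreground() {
    daemon=$1
    arch=${2:-lx24-x86}
    bin="$SGE_ROOT/bin/$arch/$daemon"
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin" >&2
        return 1
    fi
    # The assignment prefix applies only to this one command.
    SGE_ND= "$bin"
}
```

In one root shell this would be called as `run_foreground sge_qmaster`, and, once qhost gives reasonable output, as `run_foreground sge_schedd` in the second shell.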
>>>>
>>>>>>>  starting sge_shadowd
>>>>>>> error: getting configuration: failed receiving gdi request
>>>>>>
>>>>>> Next indication for a crashed or sick qmaster.
>>>>>>
>>>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>>>
>>>>>>> How bad is any of that, could crashes be related to it?
>>>>>>
>>>>>> Very likely.
>>>>>>
>>>>>>> I am running on RHEL3 .
>>>>>>
>>>>>> Have you tried some other OS?
>>>>> We will be upgrading shortly but at this time I have no choice, I 
>>>>> have to keep the cluster
>>>>> running with the OS I have.
>>>>>
>>>>> Yesterday I gathered some more empirical evidence about the 
>>>>> crashes - might be just
>>>>> a coincidence. The story is long and related to a filesystem we 
>>>>> are using (GPFS) but here is the part related to SGE.
>>>>
>>>> Actually I'm not aware of any problem with GPFS, but it could be 
>>>> related.
>>>> Is qmaster spooling located on the GPFS volume? Are you using 
>>>> classic or BDB spooling?
>>>>
>>>>
>>>>> Sometimes on the client host the filesystem daemons get killed and 
>>>>> that leaves the SGE processes on the client defunct - still there, 
>>>>> but master cannot communicate with them. qdel will not dispose of 
>>>>> the user's job, the load is not reported.
>>>>> The easiest is to just reboot the node - it does not happen very 
>>>>> often,
>>>>> just a few nodes per day at most.
>>>>>
>>>>> But even if I reboot the node, the client will not start properly 
>>>>> unless I clean the local spool directory. I did not figure out 
>>>>> which files are interfering, but if I delete the whole local 
>>>>> spool,  the directory gets recreated and everybody is ok, so 
>>>>> that's what I have been doing. Reboot, delete the local spool 
>>>>> subdirectory, restart the SGE client.
>>>>
>>>> Usually there are no problems with execution nodes if local 
>>>> spooling is used. Ugh!
>>>>
>>>>
>>>>> Yesterday I decided to streamline my procedure and delete that local
>>>>> spool directory, before I reboot the node. The moment I delete 
>>>>> that local
>>>>> spool, the master that runs on a different host crashes right away.
>>>>>
>>>>> I managed to crash it a few times, then I went to my old procedure
>>>>> - first reboot, then remove the local scratch and all has been 
>>>>> running well.
>>>>>
>>>>> (the startup messages about problems are still there, but once 
>>>>> started SGE runs well and
>>>>> I do not see any other problems).
>>>>
>>>> Bah, Ugh, Igitt!!! Well, it sounds as if it were a good idea to 
>>>> move away
>>>> from GPFS ... at least for SGE spooling. Can't you switch to a more 
>>>> conventional FS for that purpose?
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
>>
>>
>
> http://gridengine.info/
>
> Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551 
> Kirchheim-Heimstetten
> Munich District Court: HRB 161028
> Managing directors: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
> Chairman of the supervisory board: Martin Haering
>

