[GE users] qmaster dying again....

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Mon Aug 20 17:30:11 BST 2007


Hi Iwona,

According to the gdb info, your qmaster died from a bus error (SIGBUS) in the
function double_print_to_dstring(). Usually this means invalid memory was
accessed. I tried to find a place in the qmaster source code where
double_print_to_dstring() is called with possibly invalid arguments, but so
far I couldn't.

Do you still have the core dump? Unfortunately the gdb info below shows
merely that SIGBUS was raised in double_print_to_dstring() of thread #9, but
it does not reveal the full stack trace of this thread. The gdb commands to
get this must be something like

    # thread 9
    # where
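
If the core file is still around, a non-interactive run along the following
lines should dump the stacks of all threads in one go. This is only a sketch:
the helper name dump_backtraces and the output file are placeholders of mine,
not part of Grid Engine. Thread numbers in a core file need not match those of
the live session, hence "thread apply all bt" rather than "thread 9".

```shell
#!/bin/sh
# Sketch: write a backtrace of every thread in a core file to a text file.
# dump_backtraces is a hypothetical helper name, not part of Grid Engine.
# usage: dump_backtraces <sge_qmaster binary> <core file> <output file>
dump_backtraces() {
    bin=$1 core=$2 out=$3
    # Refuse to run if the core file is missing or unreadable.
    if [ ! -r "$core" ]; then
        echo "cannot read core file: $core" >&2
        return 1
    fi
    # -batch makes gdb exit once the -ex command has run.
    gdb -batch -ex 'thread apply all bt' "$bin" "$core" > "$out" 2>&1
}
```

Call it with the same sge_qmaster binary that produced the core, the core file,
and an output file name of your choice, then send the resulting text file. If
the binary carries no debug information the trace will show function names
only, which should still be enough to see who called
double_print_to_dstring().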

Regards,
Andreas


On Wed, 15 Aug 2007, Iwona Sakrejda wrote:

> So today I had my maintenance. I set reporting to true (qconf -mconf)
> and that killed the master. Then I tried to restart it the way you suggested
> and it would not start - a screen dump follows.
> So I started it once more with gdb and it crashed again - some gdb info is 
> appended too.
> Then I edited the configuration file by hand and changed the reporting to
> true, and I was able to start it. But it has already died a few times during
> the last hour.
>
> This is 6.0u11 on RHEL3.
>
> Could you suggest next debugging steps?
>
> Thanks a lot,
>
> Iwona
>
> [root at pc2533 root]# ps -elf|grep sge
> 0 S root     11962 11562  0  75   0    -  1191 -      14:58 pts/3    00:00:00 
> grep sge
> [root at pc2533 root]# export SGE_ND=""
> [root at pc2533 root]# echo $SGE_ND
>
> [root at pc2533 root]#  /common/sge/6.0u4/bin/lx24-x86/sge_qmaster
> Reading in complex attributes.
> Reading in execution hosts.
> Reading in administrative hosts.
> Reading in submit hosts.
> Reading in host group entries:
>       Host group entries for group "@allhosts".
>       Host group entries for group "@xeon04".
>       Host group entries for group "@athlon03".
>       Host group entries for group "@athlon02".
>       Host group entries for group "@star".
>       Host group entries for group "@kamland".
>       Host group entries for group "@test".
>       Host group entries for group "@intel01".
>       Host group entries for group "@opteron05".
>       Host group entries for group "@express".
>       Host group entries for group "@debug".
> Reading in usersets:
>       Userset "defaultdepartment".
>       Userset "deadlineusers".
>       Userset "star".
>       Userset "alice".
>       Userset "atlas".
>       Userset "snfactry".
>       Userset "deepsrch".
>       Userset "e871".
>       Userset "kamland".
>       Userset "sno".
>       Userset "cdf".
>       Userset "e896".
>       Userset "other".
>       Userset "admin".
>       Userset "icecube".
>       Userset "majorana".
>       Userset "euso".
>       Userset "astrogfs".
>       Userset "staradmin".
>       Userset "rhicthry".
>       Userset "starprod".
>       Userset "kamlanda".
>       Userset "suspended".
>       Userset "imcg".
>       Userset "snap".
>       Userset "starspinprod".
>       Userset "emcal".
> Reading in queues:
>       Queue "all.q".
>       Queue "starprod.q".
>       Queue "test.q".
>       Queue "adm.q".
>       Queue "debug.q".
>       Queue "big.q".
> Reading in parallel environments:
>       PE "make".
>       PE "lam_loose_qrsh".
>       PE "lammpi".
>       PE "lam_tight_qrsh".
>       PE "mpi".
>       PE "simple".
> Reading in Master_Job_List.
> ........................
>
> read job database with 2390 entries in 78 seconds
> Reading in users:
>       User "lma".
>       User "aya".
>       User "dujx".
>       User "yanwen".
>       User "hamblen".
>       User "shossain".
>       User "danielx".
>       User "guardi".
>       User "affolder".
>       User "kushner".
>       User "labbe".
>       User "lane".
>       User "qattan".
>       User "akio".
>       User "dereke".
>       User "ryd".
>       User "rvogel".
>       User "alexst".
>       User "cardo".
>       User "enoki".
>       User "jarguin".
>       User "cambell6".
>       User "conesa1".
>       User "kipnis".
>       User "kisiel".
>       User "kewu".
>       User "jrkonzer".
>       User "zimm".
>       User "cerri".
>       User "keith".
>       User "yuchen".
>       User "junmin".
>       User "kadota".
>       User "asim".
>       User "witt".
>       User "whitney".
>       User "jrgordon".
>       User "weiming".
>       User "vineeth".
>       User "chee".
>       User "tjsymons".
>       User "tompkins".
>       User "ullrich".
>       User "uscms01".
>       User "tierney".
>       User "balewski".
>       User "terryh".
>       User "sumbera".
>       User "jklay".
>       User "connolly".
>       User "cormier".
>       User "battagl".
>       User "srini".
>       User "solomey".
>       User "baumgart".
>       User "jhthomas".
>       User "bclee".
>       User "soliver".
>       User "croy".
>       User "shreyas".
>       User "sirena".
>       User "cwhite".
>       User "jedynak".
>       User "dahl".
>       User "benedos".
>       User "sahal".
>       User "daues".
>       User "romero".
>       User "rodenm".
>       User "relyea".
>       User "rexwg".
>       User "rmfarber".
>       User "rwg".
>       User "javiera".
>       User "rajkumar".
>       User "decowski".
>       User "ivdgl".
>       User "bravina".
>       User "dhale".
>       User "ijohnson".
>       User "nxu".
>       User "pfachini".
>       User "piotr".
>       User "planinic".
>       User "porter".
>       User "didenko".
>       User "perry".
>       User "pavetter".
>       User "pavlinov".
>       User "omall".
>       User "pandola".
>       User "dkettler".
>       User "nurcan".
>       User "nordberg".
>       User "nataliak".
>       User "neha".
>       User "noblath".
>       User "msd".
>       User "nan".
>       User "mgmarino".
>       User "milford".
>       User "milne".
>       User "misawa".
>       User "mcguigan".
>       User "mcmc".
>       User "markoff".
>       User "maya".
>       User "fgabler".
>       User "maleyton".
>       User "may".
>       User "fisyak".
>       User "glma".
>       User "goldman".
>       User "ma3d".
>       User "lou".
>       User "gene".
>       User "lyu".
>       User "gans".
>       User "gelor".
>       User "ealbin".
>       User "llhsu".
>       User "lianjunj".
>       User "adler".
>       User "leecl".
>       User "admarino".
>       User "bihonger".
>       User "sss".
>       User "bweaver".
>       User "lecompte".
>       User "lansdell".
>       User "lauer".
>       User "agupta".
>       User "aihong".
>       User "bseilhan".
>       User "kkrueger".
>       User "calaf".
>       User "langley".
>       User "lbetev".
>       User "chenjy".
>       User "mcsuarez".
>       User "dipo".
>       User "canon".
>       User "kollegge".
>       User "lapointe".
>       User "carither".
>       User "zdrazil".
>       User "silvermy".
>       User "kfushimi".
>       User "lgreiner".
>       User "cebra".
>       User "alimvl".
>       User "kelly".
>       User "yzchu".
>       User "zawisza".
>       User "ysmirnov".
>       User "doyen".
>       User "yangj".
>       User "ycoadou".
>       User "ypang".
>       User "chadm".
>       User "jwebb".
>       User "jzulr".
>       User "jshalf".
>       User "willson".
>       User "awetzler".
>       User "verdier".
>       User "ayoung".
>       User "joshi".
>       User "tjoubert".
>       User "johnbrow".
>       User "jonaytac".
>       User "cmironov".
>       User "stradlin".
>       User "jmuelmen".
>       User "jodi".
>       User "barnby".
>       User "sritchey".
>       User "costanzo".
>       User "batygov".
>       User "baudot".
>       User "cristina".
>       User "beamer".
>       User "jhfu".
>       User "shester".
>       User "shigaki".
>       User "bedaque".
>       User "seluzhen".
>       User "jenant".
>       User "belaga".
>       User "belaurik".
>       User "sdss".
>       User "belt".
>       User "jed".
>       User "sanshiro".
>       User "sarblyth".
>       User "saulys".
>       User "schaffer".
>       User "jecc".
>       User "hha".
>       User "rojo".
>       User "rscalzo".
>       User "rthomas".
>       User "dbarnes".
>       User "bigdeli".
>       User "jberger".
>       User "boercher".
>       User "raw".
>       User "randrup".
>       User "jacobsen".
>       User "raines".
>       User "deph".
>       User "quarrie".
>       User "huovinen".
>       User "bstone".
>       User "dywue".
>       User "hpark".
>       User "nws".
>       User "dietel".
>       User "pawan".
>       User "hjiang".
>       User "hgritter".
>       User "dmsteven".
>       User "msearle".
>       User "mshupe".
>       User "murat".
>       User "helbing".
>       User "dschmier".
>       User "half".
>       User "hamed".
>       User "gxrai".
>       User "eleanor".
>       User "mccauley".
>       User "mckinny".
>       User "meidm".
>       User "faivre".
>       User "mbotje".
>       User "jcs".
>       User "hma".
>       User "macross".
>       User "fu".
>       User "gowdy".
>       User "macl".
>       User "fqwang".
>       User "glanzman".
>       User "fujikawa".
>       User "fvhale".
>       User "gfg".
>       User "lsc01".
>       User "geno".
>       User "gas".
>       User "passmore".
>       User "yisun".
>       User "geurts".
>       User "yfzhang".
>       User "tpb".
>       User "miu".
>       User "mhluk".
>       User "gprior".
>       User "spitz".
>       User "kurca".
>       User "koschke".
>       User "fpaige".
>       User "markert".
>       User "sakuma".
>       User "martina".
>       User "bockjoo".
>       User "lulc".
>       User "manderso".
>       User "marcel".
>       User "mvl".
>       User "aart".
>       User "luehring".
>       User "ydc".
>       User "dunlop".
>       User "earl".
>       User "einsweil".
>       User "weizhou".
>       User "dhevang".
>       User "mecoving".
>       User "gweber".
>       User "ernst".
>       User "estienne".
>       User "trenk".
>       User "guojilin".
>       User "rbf".
>       User "feldmann".
>       User "fergie".
>       User "mcnp".
>       User "grodid".
>       User "pmf".
>       User "brubaker".
>       User "munhoz".
>       User "mwhite".
>       User "mswanger".
>       User "molnarl".
>       User "busenitz".
>       User "hippolyt".
>       User "molnard".
>       User "horsley".
>       User "djschleg".
>       User "dleonard".
>       User "herston".
>       User "drabinow".
>       User "millane".
>       User "mischke".
>       User "mjfisher".
>       User "dannytb".
>       User "bhaag".
>       User "davidk".
>       User "jcfree".
>       User "ogilvie".
>       User "billmei".
>       User "obuncic".
>       User "jaym".
>       User "janik".
>       User "nystrand".
>       User "nugent".
>       User "jacobs".
>       User "nikolai".
>       User "nagaslae".
>       User "dhazen".
>       User "ibhadju".
>       User "ricaud".
>       User "rjm".
>       User "rkowen".
>       User "rhenning".
>       User "rcabrera".
>       User "rellen".
>       User "rfatemi".
>       User "rgareus".
>       User "pruneau".
>       User "jgma".
>       User "petrchal".
>       User "orejudos".
>       User "pclarke".
>       User "olga".
>       User "opspdsf".
>       User "jin".
>       User "bergevin".
>       User "antai".
>       User "spitzer".
>       User "chaber".
>       User "arcarter".
>       User "smithj4".
>       User "sixie".
>       User "chaoz".
>       User "siegrist".
>       User "awes".
>       User "chunhuih".
>       User "sdazeley".
>       User "scottc".
>       User "barannik".
>       User "saraf".
>       User "rpicha".
>       User "russcher".
>       User "tgoodale".
>       User "thenry".
>       User "kopytin".
>       User "ktlesko".
>       User "kunz".
>       User "carcassi".
>       User "tbutler".
>       User "cardenas".
>       User "tbanks".
>       User "tanya".
>       User "kocevski".
>       User "catalin".
>       User "amonett".
>       User "stergar".
>       User "dougr".
>       User "srikumar".
>       User "xzb".
>       User "bleicher".
>       User "aarond".
>       User "aarose".
>       User "dimac".
>       User "bmonreal".
>       User "wehle".
>       User "wenaus".
>       User "ward".
>       User "caines".
>       User "vacavant".
>       User "trattner".
>       User "tuntsfaa".
>       User "umatov".
>       User "uscms02".
>       User "wwoodvas".
>       User "xjd".
>       User "msun".
>       User "howley".
>       User "hhuang".
>       User "moed".
>       User "mmoura".
>       User "dmitry".
>       User "mlgreen".
>       User "dougsim".
>       User "downum".
>       User "helge".
>       User "drescher".
>       User "hatake".
>       User "hallin".
>       User "mhorner".
>       User "haibin".
>       User "betya".
>       User "ojacobsen".
>       User "bielcik".
>       User "ofine".
>       User "ogreben".
>       User "bonachea".
>       User "jakeking".
>       User "brandonp".
>       User "nilsen".
>       User "deisher".
>       User "nickb".
>       User "nielsenj".
>       User "brdraney".
>       User "nattrass".
>       User "hypercp".
>       User "mustapha".
>       User "jgreid".
>       User "potekhin".
>       User "beckmann".
>       User "bedanga".
>       User "jfoster".
>       User "pibero".
>       User "poon".
>       User "jelena".
>       User "panitkin".
>       User "jedraper".
>       User "okorokov".
>       User "reb".
>       User "dang".
>       User "beringer".
>       User "jdanders".
>       User "okada".
>       User "azriel".
>       User "joong".
>       User "bagwell".
>       User "classen".
>       User "scherzer".
>       User "schutz".
>       User "cmauger".
>       User "jkephart".
>       User "rmiquel".
>       User "romosan".
>       User "ruda".
>       User "bartelt".
>       User "rhodes".
>       User "jillings".
>       User "cperkins".
>       User "renault".
>       User "kaneta".
>       User "kareem".
>       User "kdawson".
>       User "sjbailey".
>       User "skluth".
>       User "sliwa".
>       User "soneale".
>       User "spadafor".
>       User "chajecki".
>       User "atwong".
>       User "charles".
>       User "shabetai".
>       User "dtliu".
>       User "sferrell".
>       User "sguertin".
>       User "cherney".
>       User "vogt".
>       User "vdmolen".
>       User "kurnadi".
>       User "tofr".
>       User "tatsuno".
>       User "allen".
>       User "kkarr".
>       User "stokstad".
>       User "supriya".
>       User "szeto".
>       User "amsgc5".
>       User "steiner".
>       User "kerasha".
>       User "stardb".
>       User "keefer".
>       User "speltz".
>       User "liuls".
>       User "abha".
>       User "wjdong".
>       User "liubo".
>       User "westfall".
>       User "xin".
>       User "wayneh".
>       User "wbaird".
>       User "lbland".
>       User "cadler".
>       User "vernet".
>       User "vkoch".
>       User "wes".
>       User "blyth".
>       User "alandav".
>       User "kmontag".
>       User "rderradi".
>       User "matteo".
>       User "dlamenti".
>       User "u16301".
>       User "markp".
>       User "alexis3".
>       User "fsimon".
>       User "yoshiu".
>       User "zarzhit".
>       User "zhliu".
>       User "fyodor".
>       User "ynara".
>       User "luis".
>       User "xzcai".
>       User "loken".
>       User "lsadler".
>       User "mheffner".
>       User "emit0".
>       User "emorris".
>       User "schaefer".
>       User "bombara".
>       User "mcvady".
>       User "mmeijer".
>       User "mnorman".
>       User "kyba".
>       User "greatkei".
>       User "hai".
>       User "wuyf".
>       User "mauri".
>       User "atang".
>       User "nrl".
>       User "cyberman".
>       User "jmonroe".
>       User "gaillard".
>       User "mlisa".
>       User "gaudiche".
>       User "mkaplan".
>       User "rmaruyam".
>       User "xuyichun".
>       User "mira".
>       User "nastone".
>       User "nayla".
>       User "gorbunov".
>       User "nancy".
>       User "fliu".
>       User "golling".
>       User "mucci".
>       User "mweber".
>       User "fross".
>       User "gidal".
>       User "ftaylor".
>       User "gedanken".
>       User "mstewart".
>       User "fwh".
>       User "mmiller".
>       User "msar".
>       User "pck".
>       User "hardtke".
>       User "oldi".
>       User "putschke".
>       User "canonrs".
>       User "nilanthi".
>       User "oana".
>       User "ofisyak".
>       User "engelage".
>       User "greiman".
>       User "nieuwhzn".
>       User "nikas".
>       User "fcp".
>       User "fegray".
>       User "nevski".
>       User "gpdf".
>       User "gregoire".
>       User "hoo".
>       User "dlesage".
>       User "rajeshn".
>       User "hgray".
>       User "hhholmes".
>       User "pilcher".
>       User "pollney".
>       User "hcfang".
>       User "partlan".
>       User "peitzma".
>       User "pharvey".
>       User "dskinner".
>       User "dsmith".
>       User "osiegrist".
>       User "parsons".
>       User "e871code".
>       User "olson".
>       User "ivanshin".
>       User "sakamil".
>       User "sasmith".
>       User "sethzenz".
>       User "btev".
>       User "ianh".
>       User "buncic".
>       User "rknop".
>       User "rreddy".
>       User "ruanlj".
>       User "dinofm".
>       User "djengh".
>       User "rayd".
>       User "rclee".
>       User "rcwells".
>       User "resconi".
>       User "struck".
>       User "debasish".
>       User "sosebee".
>       User "starofl".
>       User "snelling".
>       User "sorensen".
>       User "brant".
>       User "smckee".
>       User "jason".
>       User "jasondet".
>       User "shirley".
>       User "shjang".
>       User "sjoelin".
>       User "iwona".
>       User "brijesh".
>       User "severini".
>       User "berryhil".
>       User "dart".
>       User "tdonnell".
>       User "jkiryluk".
>       User "jiafei".
>       User "subhasis".
>       User "svl".
>       User "swing".
>       User "sychan".
>       User "szarwas".
>       User "taluc".
>       User "tdavis".
>       User "tjt".
>       User "jeromel".
>       User "suaide".
>       User "jdodd".
>       User "stone".
>       User "justin".
>       User "jvirzi".
>       User "bekele".
>       User "czhong".
>       User "timser".
>       User "d3c724".
>       User "julery".
>       User "dtyu".
>       User "timmins".
>       User "josephf".
>       User "tgutierr".
>       User "therese".
>       User "timh".
>       User "JLA550".
>       User "jmeyer".
>       User "jnovotny".
>       User "kkowalik".
>       User "cottrell".
>       User "kdatta".
>       User "kenss".
>       User "kazumi".
>       User "voloshin".
>       User "kammel".
>       User "karl".
>       User "kaushikd".
>       User "vlmrz".
>       User "beberger".
>       User "u70004".
>       User "kabana".
>       User "ctday".
>       User "tipton".
>       User "tull".
>       User "baiyt".
>       User "xuw".
>       User "yakushev".
>       User "za".
>       User "kowalski".
>       User "klaush".
>       User "xichen".
>       User "barden".
>       User "cmouser".
>       User "barish".
>       User "wlav".
>       User "kfornaz".
>       User "wieman".
>       User "khodinov".
>       User "khudek".
>       User "wcs".
>       User "wuj".
>       User "kjr".
>       User "liq".
>       User "aknospe".
>       User "soltz".
>       User "druss".
>       User "leggett".
>       User "kvetter".
>       User "aragon".
>       User "jorrell".
>       User "lanou".
>       User "lasiuk".
>       User "zdjurcic".
>       User "lauss".
>       User "lblsrb".
>       User "bachacou".
>       User "ciocio".
>       User "yepes".
>       User "zberecki".
>       User "ghoulam".
>       User "kazuhiro".
>       User "llope".
>       User "kechech".
>       User "lixh".
>       User "arie".
>       User "kapitan".
>       User "aroy".
>       User "liuzx".
>       User "artthurs".
>       User "shakoori".
>       User "runge".
>       User "koheik".
>       User "starreco".
>       User "levesj".
>       User "mcosent".
>       User "peterlos".
>       User "wangxb".
>       User "dkoetke".
>       User "xwq1985".
>       User "xinghua".
>       User "alai".
>       User "amol".
>       User "cbum".
>       User "threefay".
>       User "cdfsoft".
>       User "longacre".
>       User "nbarkas".
>       User "voeckler".
>       User "sudhir".
>       User "testpsff".
>       User "bliao".
>       User "marino".
>       User "markh".
>       User "marsiske".
>       User "greenc".
>       User "littlejo".
>       User "rosheck".
>       User "marco".
>       User "aldering".
>       User "lys".
>       User "mavrekh".
>       User "bongard".
>       User "maguire".
>       User "alvarez".
>       User "dmeyers".
>       User "amako".
>       User "posk".
>       User "hew".
>       User "bland".
>       User "mheinz".
>       User "mhoemmen".
>       User "bnorman".
>       User "mercedes".
>       User "mgadost".
>       User "mgarcia".
>       User "aconley".
>       User "bobw".
>       User "mendi".
>       User "meissner".
>       User "mattheww".
>       User "mayes".
>       User "afleming".
>       User "agibson".
>       User "cadman".
>       User "aalseth".
>       User "mendonca".
>       User "zbtang".
>       User "binet".
>       User "bystersk".
>       User "jhpalice".
>       User "butter".
>       User "vmg".
>       User "gopalb".
>       User "morsch".
>       User "moyse".
>       User "mng".
>       User "gcosmo".
>       User "gabriel".
>       User "mlam".
>       User "ekw".
>       User "ely".
>       User "guangqin".
>       User "griem".
>       User "nita".
>       User "finch".
>       User "okikawa".
>       User "heeger".
>       User "hdliu".
>       User "draper".
>       User "ojha".
>       User "drkent".
>       User "hazama".
>       User "ofgabler".
>       User "ofretiere".
>       User "dthein".
>       User "odyniec".
>       User "hanna".
>       User "ikelley".
>       User "pater".
>       User "pinkenbu".
>       User "pastor".
>       User "djordan".
>       User "hlong".
>       User "hongyu".
>       User "omargetis".
>       User "brandste".
>       User "jbielcik".
>       User "predrag".
>       User "jasonk".
>       User "jbk".
>       User "brent".
>       User "ppching".
>       User "dhbailey".
>       User "dibari".
>       User "jinhui".
>       User "rdolan".
>       User "jiaxu".
>       User "jingbo".
>       User "du".
>       User "big".
>       User "dbest".
>       User "jdodge".
>       User "dclayton".
>       User "bozek".
>       User "qjliu".
>       User "schweda".
>       User "julio".
>       User "salur".
>       User "sandro".
>       User "sarah".
>       User "johnj".
>       User "sakrejda".
>       User "jla550".
>       User "reichhol".
>       User "rubind".
>       User "sabh".
>       User "slhuang".
>       User "crawford".
>       User "shimansk".
>       User "slblyth".
>       User "crivelli".
>       User "shichijo".
>       User "sevahsen".
>       User "shapiro".
>       User "currat".
>       User "schuelke".
>       User "klausk".
>       User "tmai".
>       User "tpavel".
>       User "tinad".
>       User "kgarg".
>       User "kirill".
>       User "smrenna".
>       User "wleight".
>       User "wdeng".
>       User "lacunza".
>       User "vanyashi".
>       User "wbetts".
>       User "laue".
>       User "u7142".
>       User "u767".
>       User "usatlas1".
>       User "kurts".
>       User "trent".
>       User "xliu".
>       User "xylin".
>       User "chafik".
>       User "lesko".
>       User "wurzel".
>       User "lelchuk".
>       User "dongx".
>       User "sowinski".
>       User "dkonerd".
>       User "amueller".
>       User "lockman".
>       User "andr".
>       User "dipak".
>       User "akim".
>       User "alansill".
>       User "madaras".
>       User "magestro".
>       User "kadel".
>       User "adair".
>       User "schansen".
>       User "herrera".
>       User "bchkim".
>       User "matis".
>       User "romano".
>       User "elnimr".
>       User "davej".
>       User "weigand".
>       User "garand".
>       User "gos".
>       User "mreddick".
>       User "skoby".
>       User "reitzner".
>       User "rquick".
>       User "tdss".
>       User "betan".
>       User "wiggy13".
>       User "flierl".
>       User "mstoufer".
>       User "mt".
>       User "ter".
>       User "fretiere".
>       User "mora".
>       User "mrkallen".
>       User "gibbo".
>       User "galtieri".
>       User "mjchen".
>       User "objy".
>       User "groysman".
>       User "guillian".
>       User "nurit".
>       User "nystrom".
>       User "greiner".
>       User "nstone".
>       User "goupell".
>       User "nlfarr".
>       User "fine".
>       User "nbeckett".
>       User "okreylos".
>       User "dpturner".
>       User "drjohn".
>       User "harsh".
>       User "ioji".
>       User "petar".
>       User "hshan".
>       User "htp".
>       User "igv".
>       User "dimarcom".
>       User "opachich".
>       User "hjort".
>       User "heng".
>       User "deboni".
>       User "lorenzo".
>       User "qhxu".
>       User "japar".
>       User "prindle".
>       User "jay".
>       User "plujan".
>       User "pjones".
>       User "rdiaz".
>       User "jgwacker".
>       User "raghu".
>       User "dbury".
>       User "ragerber".
>       User "jcarter".
>       User "seng".
>       User "jseger".
>       User "dandwyer".
>       User "robbins".
>       User "rzep".
>       User "stavrop".
>       User "suire".
>       User "skmandal".
>       User "jygabler".
>       User "jw_lee".
>       User "saroka".
>       User "kocolosk".
>       User "tradke".
>       User "tpmccaul".
>       User "tjwalker".
>       User "cnepali".
>       User "tcase".
>       User "tim88899".
>       User "spencer".
>       User "ssoff".
>       User "weaver".
>       User "kvtsang".
>       User "turcotte".
>       User "kramer".
>       User "clifford".
>       User "yury".
>       User "lluvia".
>       User "ykkim".
>       User "youngil".
>       User "dpaul".
>       User "xnwang".
>       User "ashmansk".
>       User "drew".
>       User "atartir".
>       User "lehocka".
>       User "luyan".
>       User "ikuro".
>       User "mdunford".
>       User "lundqvis".
>       User "zgarrett".
>       User "losecco".
>       User "lmark".
>       User "lmp".
>       User "luk".
>       User "cgrant".
>       User "apiepke".
>       User "akbar".
>       User "majdi".
>       User "malon".
>       User "staszak".
>       User "akorn".
>       User "canson".
>       User "mahsa".
>       User "clendvai".
>       User "lwinslow".
>       User "bryleung".
>       User "jianglai".
>       User "parag".
>       User "bzhangtx".
>       User "andream".
>       User "margetis".
>       User "ailea".
>       User "calderon".
> Reading in projects:
>       Project "admin".
>       Project "alice".
>       Project "astrogfs".
>       Project "atlas".
>       Project "cdf".
>       Project "deepsrch".
>       Project "e871".
>       Project "e895".
>       Project "e896".
>       Project "euso".
>       Project "icecube".
>       Project "kamland".
>       Project "majorana".
>       Project "other".
>       Project "rhicthry".
>       Project "snfactry".
>       Project "sno".
>       Project "star".
>       Project "imcg".
>       Project "starspinprod".
>       Project "emcal".
> qmaster hard descriptor limit is set to 8192
> qmaster soft descriptor limit is set to 8192
> qmaster will use max. 8172 file descriptors for communication
> qmaster will accept max. 99 dynamic event clients
> starting up GE 6.0u11 (lx24-x86)
> Bus error
>
>
> [New Thread -1313866832 (LWP 12258)]
> [New Thread -1324356688 (LWP 12259)]
> [New Thread -1334846544 (LWP 12260)]
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread -1324356688 (LWP 12259)]
> 0x0812f450 in double_print_to_dstring ()
> (gdb) (gdb) info threads
> 10 Thread -1334846544 (LWP 12260)  0xb75adebd in pthread_rwlock_wrlock ()
>  from /lib/tls/libpthread.so.0
> * 9 Thread -1324356688 (LWP 12259)  0x0812f450 in double_print_to_dstring ()
> 8 Thread -1313866832 (LWP 12258)  0xb75b1c84 in sigwait () from 
> /lib/tls/libpthread.so.0
> 7 Thread -1301283920 (LWP 12257)  0xb75ae59b in 
> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>  from /lib/tls/libpthread.so.0
> 6 Thread -1265353808 (LWP 12181)  0xb75ae59b in 
> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>  from /lib/tls/libpthread.so.0
> 5 Thread -1254863952 (LWP 12180)  0xb75ae59b in 
> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>  from /lib/tls/libpthread.so.0
> 4 Thread -1244374096 (LWP 12179)  0xb7544077 in ___newselect_nocancel ()
>  from /lib/tls/libc.so.6
> 3 Thread -1233884240 (LWP 12178)  0xb75ae59b in 
> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>  from /lib/tls/libpthread.so.0
> 2 Thread -1223394384 (LWP 12177)  0xb75ae59b in 
> pthread_cond_timedwait@@GLIBC_2.3.2 ()
>  from /lib/tls/libpthread.so.0
> 1 Thread -1220095328 (LWP 12168)  0xb75acd58 in pthread_join ()
>  from /lib/tls/libpthread.so.0
> Cannot access memory at address 0x812f450
>
>
>
> Andreas.Haas at Sun.COM wrote:
>> Hi Iwona,
>> 
>> watching memory consumption patterns of daemons can be like reading tea
>> leaves. Since
>>
>>    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187
>> 
>> was fixed for 6.0u11, I have not heard of anything that sounds like a memory
>> leak, and Andrea's memory consumption records show that qmaster was already
>> free of memory leaks before 6.0u11.
>> 
>> Below you say
>>
>>   "I cannot enable reporting either. When I try those daemons
>>    (the master and the scheduler) crash right away too."
>> 
>> Are you referring here to reporting(5), or is it the outcome of running the
>> daemons undaemonized as I suggested?
>> 
>> Regards,
>> Andreas
>> 
>> 
>> On Tue, 31 Jul 2007, Iwona Sakrejda wrote:
>> 
>>> Hi,
>>> 
>>> Nobody picked up on this thread and today both the master and the 
>>> scheduling
>>> daemon are 0.5GB each. Is that normal? They have not crashed since 07/27,
>>> but even if the load goes down they never shrink, they just grow slower.
>>> That looks to me like a memory leak, but I am not sure how to approach
>>> debugging of this problem.
>>> 
>>> I can schedule maintenance period and try debugging, but would like to
>>> have a better plan of what and how to debug.
>>> 
>>> 
>>> Thank You,
>>> 
>>> Iwona
>>> 
>>> Iwona Sakrejda wrote:
>>>> Since my qmaster and the scheduler daemons toppled over lately for
>>>> "no good reason" I started watching their size. I started them ~27h
>>>> ago and they were at ~50MB each. Now they both tripled in size.
>>>> 
>>>> When I started there were about 4k jobs in the system. Now there are
>>>> about 9k. But during last 27h the number of jobs would sometimes decrease
>>>> and the daemons are slowly but steadily growing. I have only serial
>>>> jobs, about 450 running at any time on ~230 hosts, the rest is pending.
>>>> 
>>>> I run 6.0u11 on RHEL3.
>>>> 
>>>> Is that growth normal or should it be a reason for concern?
>>>> Does anybody run a comparable configuration and load?
>>>> I cannot enable reporting either. When I try those daemons
>>>> (the master and the scheduler) crash right away too.
>>>> I enabled core dumping so I hope to have more info next time
>>>> the system crashes.
>>>> 
>>>> Thank You,
>>>> 
>>>> Iwona
>>>> 
>>>> 
>>>> Andreas.Haas at Sun.COM wrote:
>>>>> Hi Iwona,
>>>>> 
>>>>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Andreas.Haas at Sun.COM wrote:
>>>>>>> Hi Iwona,
>>>>>>> 
>>>>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my 
>>>>>>>> qmaster started dying.
>>>>>>> 
>>>>>>> You did this as foreseen?
>>>>>>>
>>>>>>>    http://gridengine.sunsource.net/install60patch.txt
>>>>>> Yes, all went through ok, no problems encountered during the upgrade.
>>>>>> I was very happy about that.
>>>>> 
>>>>> Ok.
>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> When I look at the logs I see messages:
>>>>>>>> 
>>>>>>>> 7/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set to 8192
>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set to 8192
>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file descriptors for communication
>>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic event clients
>>>>>>> 
>>>>>>> That is fine. It says qmaster has enough file descriptors available.
>>>>>> My cluster consists of ~250 nodes with 2 CPUs each. We run 1 job per CPU.
>>>>>> We routinely have a few thousand jobs pending, and at peak it goes up to ~15k.
>>>>>> I am not sure what file descriptors and dynamic events are used for....
>>>>> 
>>>>> Dynamic event clients are only needed for DRMAA clients and when
>>>>>
>>>>>    qsub -sync y
>>>>> 
>>>>> is used. Usually the default of 99 is ample. The same is true of
>>>>> the 8192 file descriptors. If you estimate 1 file descriptor for each
>>>>> node, you still have 8192-250 spare fds for client commands connecting
>>>>> to qmaster. So one can safely exclude this as the root of your qmaster
>>>>> problem.
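
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in the message (one descriptor per execution node, the remainder free for client commands); the function name spare_fds is hypothetical, and the real per-node usage may differ.

```shell
#!/bin/sh
# Rough file descriptor budget for qmaster, following the estimate above.
# Assumption: one descriptor per execution node; spare_fds is a hypothetical name.
spare_fds() {
    limit=$1   # descriptor limit reported in the qmaster messages file
    nodes=$2   # number of execution nodes
    echo $((limit - nodes))
}

# The figures quoted in this thread: 8192 descriptors, ~250 nodes.
spare_fds 8192 250
```

With the thread's numbers this leaves 7942 descriptors for client connections.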
>>>>> 
>>>>>>> 
>>>>>>>> Other than that nothing special.
>>>>>>>> 
>>>>>>>> Also when I restart the qmaster I get messages:
>>>>>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>>>>  starting sge_qmaster
>>>>>>>>  starting sge_schedd
>>>>>>>> daemonize error: timeout while waiting for daemonize state
>>>>>>> 
>>>>>>> That means the scheduler is having some problem during start-up. From the
>>>>>>> message one cannot say what is causing the problem, but it could be
>>>>>>> due to qmaster in turn having problems.
>>>>>> I am restarting them after the crash, when the cluster is fully loaded.
>>>>>> Is it possible that it just needs more time to re-read all
>>>>>> the info about running and pending jobs?
>>>>> 
>>>>> Actually, I would rule this out.
>>>>> 
>>>>>> Where would the scheduler print any messages about problems it is 
>>>>>> having?
>>>>> 
>>>>> To investigate the problem, I suggest you launch qmaster and scheduler
>>>>> separately as binaries rather than using the sgemaster script. All you need
>>>>> is two root shells with the Grid Engine environment (settings.{sh|csh})
>>>>> set.
>>>>> 
>>>>> Then you do this:
>>>>>
>>>>>    # setenv SGE_ND
>>>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>>>> 
>>>>> If you see that everything went well with qmaster start-up (e.g. test
>>>>> whether qhost gets you reasonable output), you continue by launching
>>>>> the scheduler from the other shell:
>>>>>
>>>>>    # setenv SGE_ND
>>>>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>>>> 
>>>>> but my expectation is that qmaster will already report some problem and exit.
>>>>> Normally qmaster should not exit with SGE_ND in the environment, as it prevents
>>>>> daemonizing.
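
For Bourne-type shells, the csh commands above might be wrapped as follows. This is a sketch under assumptions: the function name run_undaemonized is hypothetical, SGE_ROOT is expected to be set by settings.sh, lx24-x86 is the architecture directory used in this thread, and SGE_ND is set to an empty value as in the csh form.

```shell
#!/bin/sh
# Start a Grid Engine daemon in the foreground so that start-up errors
# appear on the terminal. run_undaemonized is a hypothetical helper name.
run_undaemonized() {
    daemon=$1              # sge_qmaster or sge_schedd
    arch=${2:-lx24-x86}    # architecture directory used in this thread
    if [ -z "${SGE_ROOT:-}" ]; then
        echo "SGE_ROOT is not set; source settings.sh first" >&2
        return 1
    fi
    bin="$SGE_ROOT/bin/$arch/$daemon"
    if [ ! -x "$bin" ]; then
        echo "not found or not executable: $bin" >&2
        return 1
    fi
    # Put SGE_ND into the daemon's environment only, with an empty value.
    SGE_ND="" "$bin"
}
```

Running run_undaemonized sge_qmaster in one root shell and, once qhost answers, run_undaemonized sge_schedd in another mirrors the two-shell procedure described above.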
>>>>>
>>>>>>>>  starting sge_shadowd
>>>>>>>> error: getting configuration: failed receiving gdi request
>>>>>>> 
>>>>>>> Another indication of a crashed or sick qmaster.
>>>>>>>
>>>>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>>>> 
>>>>>>>> How bad is any of that, could crashes be related to it?
>>>>>>> 
>>>>>>> Very likely.
>>>>>>> 
>>>>>>>> I am running on RHEL3 .
>>>>>>> 
>>>>>>> Have you tried some other OS?
>>>>>> We will be upgrading shortly, but at this time I have no choice; I have
>>>>>> to keep the cluster running with the OS I have.
>>>>>> 
>>>>>> Yesterday I gathered some more empirical evidence about the crashes; it
>>>>>> might be just a coincidence. The story is long and related to a filesystem
>>>>>> we are using (GPFS), but here is the part related to SGE.
>>>>> 
>>>>> Actually I'm not aware of any problem with GPFS, but it could be related.
>>>>> Is qmaster spooling located on the GPFS volume? Are you using classic or
>>>>> BDB spooling?
>>>>> 
>>>>> 
>>>>>> Sometimes on the client host the filesystem daemons get killed, and that
>>>>>> leaves the SGE processes on the client defunct: still there, but the
>>>>>> master cannot communicate with them. qdel will not dispose of the
>>>>>> user's job, and the load is not reported.
>>>>>> The easiest fix is to just reboot the node. It does not happen very often,
>>>>>> just a few nodes per day at most.
>>>>>> 
>>>>>> But even if I reboot the node, the client will not start properly 
>>>>>> unless I clean the local spool directory. I did not figure out which 
>>>>>> files are interfering, but if I delete the whole local spool,  the 
>>>>>> directory gets recreated and everybody is ok, so that's what I have 
>>>>>> been doing. Reboot, delete the local spool subdirectory, restart the 
>>>>>> SGE client.
>>>>> 
>>>>> Usually there are no problems with execution nodes if local spooling is 
>>>>> used. Ugh!
>>>>> 
>>>>> 
>>>>>> Yesterday I decided to streamline my procedure and delete that local
>>>>>> spool directory before I reboot the node. The moment I delete that local
>>>>>> spool, the master that runs on a different host crashes right away.
>>>>>> 
>>>>>> I managed to crash it a few times, then I went back to my old procedure
>>>>>> (first reboot, then remove the local scratch), and all has been running well.
>>>>>> 
>>>>>> (The startup messages about problems are still there, but once started
>>>>>> SGE runs well and I do not see any other problems.)
>>>>> 
>>>>> Bah, Ugh, Igitt!!! Well, it sounds as if it would be a good idea to move
>>>>> away from GPFS ... at least for SGE spooling. Can't you switch to a more
>>>>> conventional FS for that purpose?
>>>>> 
>>>>> Regards,
>>>>> Andreas
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>> 
>>> 
>>> 
>>> 
>> 
>> http://gridengine.info/
>> 
>> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 
>> Kirchheim-Heimstetten
>> Amtsgericht Muenchen: HRB 161028
>> Geschaeftsfuehrer: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
>> Vorsitzender des Aufsichtsrates: Martin Haering
>> 
>
>
>

http://gridengine.info/





