[GE users] qmaster SEGVs

mhanby mhanby at uab.edu
Tue May 4 16:39:54 BST 2010


We generally run a mix of PE and non-PE jobs; some of the PEs are tightly integrated, while others are not.

I don't have any /tmp/qmaster_* files on the qmaster node.

Here are the contents of the messages file from the last two times it crashed and restarted (5h 10m between crashes in this case):

04/16/2010 18:09:32|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
04/16/2010 18:09:32|  main|cluster1|I|starting up GE 6.2u5 (lx26-amd64)
04/16/2010 18:09:32|  main|cluster1|I|2 worker threads are enabled
04/16/2010 18:09:32|  main|cluster1|I|2 listener threads are enabled
04/16/2010 18:09:32|  main|cluster1|I|scheduler has been started
04/16/2010 18:09:32|  main|cluster1|I|qmaster startup took 0 seconds
04/16/2010 18:09:32|schedu|cluster1|I|"scheduler" registers as event client with id 1 event delivery interval 10
04/16/2010 18:09:32|schedu|cluster1|I|sge at cluster1.local added "scheduler" to event client list
.....
04/16/2010 23:19:22|worker|cluster1|I|task 1.compute-0-4 at compute-0-4.local of job 236202.1 finished
04/16/2010 23:19:24|worker|cluster1|I|task 1.compute-0-8 at compute-0-8.local of job 236202.1 finished
04/16/2010 23:19:24|worker|cluster1|I|task 1.compute-0-13 at compute-0-13.local of job 236202.1 finished
04/16/2010 23:19:24|worker|cluster1|I|task 1.compute-1-2 at compute-1-2.local of job 236202.1 finished
04/16/2010 23:19:25|worker|cluster1|I|task 1.compute-1-1 at compute-1-1.local of job 236202.1 finished
04/16/2010 23:19:26|worker|cluster1|I|task 1.compute-0-7 at compute-0-7.local of job 236202.1 finished
04/16/2010 23:19:26|worker|cluster1|I|task 1.compute-0-1 at compute-0-1.local of job 236202.1 finished
04/16/2010 23:30:10|  main|cluster1|I|using "/opt/gridengine/default/spool" for execd_spool_dir
04/16/2010 23:30:10|  main|cluster1|I|using "/bin/mail" for mailer
04/16/2010 23:30:10|  main|cluster1|I|using "/usr/bin/X11/xterm" for xterm
04/16/2010 23:30:10|  main|cluster1|I|using "none" for load_sensor
04/16/2010 23:30:10|  main|cluster1|I|using "none" for prolog
04/16/2010 23:30:10|  main|cluster1|I|using "none" for epilog
04/16/2010 23:30:10|  main|cluster1|I|using "posix_compliant" for shell_start_mode
04/16/2010 23:30:10|  main|cluster1|I|using "bash,sh,ksh,csh,tcsh" for login_shells
04/16/2010 23:30:10|  main|cluster1|I|using "0" for min_uid
04/16/2010 23:30:10|  main|cluster1|I|using "0" for min_gid
04/16/2010 23:30:10|  main|cluster1|I|using "20000-20100" for gid_range
04/16/2010 23:30:10|  main|cluster1|I|using "00:00:40" for load_report_time
04/16/2010 23:30:10|  main|cluster1|I|using "false" for enforce_project
04/16/2010 23:30:10|  main|cluster1|I|using "auto" for enforce_user
04/16/2010 23:30:10|  main|cluster1|I|using "00:05:00" for max_unheard
04/16/2010 23:30:10|  main|cluster1|I|using "log_info" for loglevel
04/16/2010 23:30:10|  main|cluster1|I|using "none" for administrator_mail
04/16/2010 23:30:10|  main|cluster1|I|using "none" for set_token_cmd
04/16/2010 23:30:10|  main|cluster1|I|using "none" for pag_cmd
04/16/2010 23:30:10|  main|cluster1|I|using "none" for token_extend_time
04/16/2010 23:30:10|  main|cluster1|I|using "none" for shepherd_cmd
04/16/2010 23:30:10|  main|cluster1|I|using "none" for qmaster_params
04/16/2010 23:30:10|  main|cluster1|I|using "H_MEMORYLOCKED=infinity" for execd_params
04/16/2010 23:30:10|  main|cluster1|I|using "accounting=true reporting=true flush_time=00:00:15 joblog=true sharelog=00:00:00" for reporting_params
04/16/2010 23:30:10|  main|cluster1|I|using "100" for finished_jobs
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for qlogin_daemon
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for qlogin_command
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for rsh_daemon
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for rsh_command
04/16/2010 23:30:10|  main|cluster1|I|using "none" for jsv_url
04/16/2010 23:30:10|  main|cluster1|I|using "ac,h,i,e,o,j,M,N,p,w" for jsv_allowed_mod
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for rlogin_daemon
04/16/2010 23:30:10|  main|cluster1|I|using "builtin" for rlogin_command
04/16/2010 23:30:10|  main|cluster1|I|using "00:00:00" for reschedule_unknown
04/16/2010 23:30:10|  main|cluster1|I|using "2000" for max_aj_instances
04/16/2010 23:30:10|  main|cluster1|I|using "75000" for max_aj_tasks
04/16/2010 23:30:10|  main|cluster1|I|using "0" for max_u_jobs
04/16/2010 23:30:10|  main|cluster1|I|using "0" for max_jobs
04/16/2010 23:30:10|  main|cluster1|I|using "0" for max_advance_reservations
04/16/2010 23:30:10|  main|cluster1|I|using "0" for reprioritize
04/16/2010 23:30:10|  main|cluster1|I|using "0" for auto_user_oticket
04/16/2010 23:30:10|  main|cluster1|I|using "100" for auto_user_fshare
04/16/2010 23:30:10|  main|cluster1|I|using "none" for auto_user_default_project
04/16/2010 23:30:10|  main|cluster1|I|using "86400" for auto_user_delete_time
04/16/2010 23:30:10|  main|cluster1|I|using "false" for delegated_file_staging
04/16/2010 23:30:10|  main|cluster1|I|using "" for libjvm_path
04/16/2010 23:30:10|  main|cluster1|I|using "" for additional_jvm_args
04/16/2010 23:30:10|  main|cluster1|I|read job database with 876 entries in 0 seconds
04/16/2010 23:30:10|  main|cluster1|W|removing reference to no longer existing job 236227 of user "gcampos"
04/16/2010 23:30:10|  main|cluster1|W|removing reference to no longer existing job 236213 of user "sandeepk"
04/16/2010 23:30:10|  main|cluster1|E|error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for reading: No such file or directory
04/16/2010 23:30:10|  main|cluster1|I|max dynamic event clients is set to 99
04/16/2010 23:30:10|  main|cluster1|I|qmaster hard descriptor limit is set to 8192
04/16/2010 23:30:10|  main|cluster1|I|qmaster soft descriptor limit is set to 8192
04/16/2010 23:30:10|  main|cluster1|I|qmaster will use max. 8172 file descriptors for communication
04/16/2010 23:30:10|  main|cluster1|I|qmaster will accept max. 99 dynamic event clients
04/16/2010 23:30:10|  main|cluster1|I|starting up GE 6.2u5 (lx26-amd64)
04/16/2010 23:30:10|  main|cluster1|I|2 worker threads are enabled
04/16/2010 23:30:10|  main|cluster1|I|2 listener threads are enabled
04/16/2010 23:30:10|  main|cluster1|I|scheduler has been started
04/16/2010 23:30:10|  main|cluster1|I|qmaster startup took 0 seconds
04/16/2010 23:30:10|schedu|cluster1|I|"scheduler" registers as event client with id 1 event delivery interval 10
04/16/2010 23:30:10|schedu|cluster1|I|sge at cluster1.local added "scheduler" to event client list
04/16/2010 23:30:10|schedu|cluster1|I|using "default" as algorithm
04/16/2010 23:30:10|schedu|cluster1|I|using "0:0:10" for schedule_interval
04/16/2010 23:30:10|schedu|cluster1|I|using "0:7:30" for load_adjustment_decay_time
04/16/2010 23:30:10|schedu|cluster1|I|using "np_load_avg" for load_formula
04/16/2010 23:30:10|schedu|cluster1|I|using "true" for schedd_job_info
04/16/2010 23:30:10|schedu|cluster1|I|using param: "none"
04/16/2010 23:30:10|schedu|cluster1|I|using "0:0:0" for reprioritize_interval
04/16/2010 23:30:10|schedu|cluster1|I|using "cpu=1,mem=0,io=0" for usage_weight_list
04/16/2010 23:30:10|schedu|cluster1|I|using "none" for halflife_decay_list
04/16/2010 23:30:10|schedu|cluster1|I|using "OFS" for policy_hierarchy
04/16/2010 23:30:10|schedu|cluster1|I|using "np_load_avg=0.50" for job_load_adjustments
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for maxujobs
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for queue_sort_method
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for flush_submit_sec
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for flush_finish_sec
04/16/2010 23:30:10|schedu|cluster1|I|using 168 for halftime
04/16/2010 23:30:10|schedu|cluster1|I|using 5 for compensation_factor
04/16/2010 23:30:10|schedu|cluster1|I|using 0.25 for weight_user
04/16/2010 23:30:10|schedu|cluster1|I|using 0.25 for weight_project
04/16/2010 23:30:10|schedu|cluster1|I|using 0.25 for weight_department
04/16/2010 23:30:10|schedu|cluster1|I|using 0.25 for weight_job
04/16/2010 23:30:10|schedu|cluster1|I|using 10000 for weight_tickets_functional
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for weight_tickets_share
04/16/2010 23:30:10|schedu|cluster1|I|using 1 for share_override_tickets
04/16/2010 23:30:10|schedu|cluster1|I|using 1 for share_functional_shares
04/16/2010 23:30:10|schedu|cluster1|I|using 200 for max_functional_jobs_to_schedule
04/16/2010 23:30:10|schedu|cluster1|I|using 1 for report_pjob_tickets
04/16/2010 23:30:10|schedu|cluster1|I|using 50 for max_pending_tasks_per_job
04/16/2010 23:30:10|schedu|cluster1|I|using 0.01 for weight_ticket
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for weight_waiting_time
04/16/2010 23:30:10|schedu|cluster1|I|using 3.6e+06 for weight_deadline
04/16/2010 23:30:10|schedu|cluster1|I|using 0.1 for weight_urgency
04/16/2010 23:30:10|schedu|cluster1|I|using 1 for weight_priority
04/16/2010 23:30:10|schedu|cluster1|I|using 0 for max_reservation
04/16/2010 23:30:33|worker|cluster1|I|execd on compute-1-4.local registered
04/16/2010 23:30:34|worker|cluster1|I|execd on compute-0-2.local registered


-----Original Message-----
From: andy [mailto:andy.schwierskott at sun.com]
Sent: Tuesday, May 04, 2010 9:26 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] qmaster SEGVs

Hi,

Do you have PE jobs running when this happens? Are they tightly integrated
parallel jobs?

What messages do you see in the qmaster messages file (or in
/tmp/qmaster_messages.<pid>)?

Andy



On Tue, 4 May 2010, mhanby wrote:

> I haven't found any solution. My SEGVs started under 6.2u4 and continued after upgrading to 6.2u5.
>
> For me, it always seems to happen following a reboot. After several crashes, qmaster stabilizes for a while (days or weeks) before the crashes start again.
>
> My workaround is to use Nagios and event handlers to start it back up if it isn't running.
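>
> (A rough sketch of the event-handler logic, assuming a default
> /opt/gridengine install with the stock sgemaster startup script; the
> paths and the standalone-script form are assumptions for illustration,
> not our exact Nagios configuration:)
>
> #!/usr/bin/env python
> # Hypothetical Nagios-style event handler: if sge_qmaster is not
> # running, try to restart it via the sgemaster script. The path below
> # is an assumption based on a default /opt/gridengine installation.
> import subprocess
> import sys
>
> SGEMASTER = "/opt/gridengine/default/common/sgemaster"
>
> def qmaster_running():
>     # pgrep exits 0 when at least one matching process exists
>     return subprocess.call(["pgrep", "-x", "sge_qmaster"]) == 0
>
> if not qmaster_running():
>     subprocess.call([SGEMASTER, "start"])
>     # exit nonzero if the restart did not take, so the alert persists
>     sys.exit(0 if qmaster_running() else 1)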
>
> -----Original Message-----
> From: heywood [mailto:heywood at cshl.edu]
> Sent: Monday, May 03, 2010 12:51 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qmaster SEGVs
>
> We rebooted the node running qmaster, and we are now also getting qmaster
> crashes. I see in the archive that there is another thread, "sgemaster keeps
> crashing 6.2u4" from February, which apparently describes the same issue. After
> a number of crashes I got qmaster to keep running (for now!).
>
> We are running 6.2u5 with RHEL4.
>
> I guess there is no solution/resolution?
>
> Todd
>
>
> sge_qmaster[5851]: segfault at 0000000000000080 rip 00000039fa470560 rsp 000000004780aa38 error 4
> sge_qmaster[6163]: segfault at 0000000000000080 rip 00000039fa470560 rsp 000000004780aa38 error 4
> sge_qmaster[6573]: segfault at 0000000000000000 rip 00000000005bf6c7 rsp 0000000047809ec0 error 4
>
> On 3/17/10 12:14 PM, "abrookfield" <a.brookfield at sheffield.ac.uk> wrote:
>
> > I'm also having problems with qmaster SEGVs in 6.2u5, running on RHEL5,
> > x86_64.
> >
> > Crashes seem to be correlated with users deleting jobs, particularly (but not
> > exclusively) OpenMPI parallel jobs that have been running for 'a while'.
> > Other than updating to u5, we've not made any config changes to our setup.
> >
