Custom Query (431 matches)

Results (94 - 96 of 431)

Ticket Resolution Summary Owner Reporter
#1444 fixed qsub -c r not documented in qsub manpage Dave Love <d.love@…> wish
Description

The -c option to qsub/qalter can take an 'r' flag as part of the occasion specifier, which governs whether a checkpointing job can be rerun (as with the checkpointing environment's 'when' parameter). This is not documented, which means the default value can easily be overwritten by accident.

#1443 fixed execd crash sending mail Dave Love <d.love@…> dlove
Description

Happening frequently with a particular set of jobs:

Program terminated with signal 7, Bus error.
#0  0x00002b223c9c2018 in __deregister_frame_info () from /lib64/libgcc_s.so.1
(gdb) bt
#0  0x00002b223c9c2018 in __deregister_frame_info () from /lib64/libgcc_s.so.1
#1  0x00002b223b935a44 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00002b223cbf8515 in exit () from /lib64/libc.so.6
#3  0x000000000045b322 in sge_send_mail (progid=<value optimized out>, 
    mailer=0x2b223baca500 "/bin/mail", user=0x2b223baca200 "***", 
    host=0x2b223baca210 "***", subj=<value optimized out>, 
    buf=0x2b223babf800 "Job 166647 (chr1chunk19) Started\n User       = nmirza\n Queue      = serial\n Host       = node165\n Start Time = 01/01/2013 15:46:44", mailer_has_subj_line=1) at ../daemons/common/mail.c:288
#4  0x000000000045b9dd in cull_mail (progid=15, 
    user_list=<value optimized out>, 
    subj=0x2b223babf400 "Job 166647 (chr1chunk19) Started", 
    buf=0x2b223babf800 "Job 166647 (chr1chunk19) Started\n User       = nmirza\n Queue      = serial\n Host       = node165\n Start Time = 01/01/2013 15:46:44", mail_type=0x5617a1 "job start") at ../daemons/common/mail.c:113
#5  0x00000000004344b5 in sge_exec_job (ctx=0x2b223ba36000, 
    jep=<value optimized out>, jatep=0x2b223bac8700, petep=0x0, 
    err_str=0x7fffa3a30a30 "\200\207\254;\"+", err_length=256)
    at ../daemons/execd/exec_job.c:1781
#1441 fixed SoGE 8.1.2 qmaster segfault problem Dave Love <d.love@…> Andreas.Loong@…
Description

I couldn't get the debug binaries to run properly; however, another user sent me the output below. I hope it's enough:

From: baf035 baf035@…
Sent: 2 November 2012 14:34
To: Loong, Andreas
Subject: Re: [gridengine users] SoGE 8.1.2 segfault problem

Hi,

I can validate described behaviour: SoGE compiled with -debug, qmaster server system:

cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1
~# uname -r
2.6.32.59-0.7-xen

SGE_ND=1 gdb -batch -ex run -ex 'bt full' sge_qmaster | tee sge_master_gdb2.log


Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: zypper install -C "debuginfo(build-id)=c1807b5762068e6c5f4a6a0ed48d9d4469965351"
Missing separate debuginfo for /usr/lib64/libssl.so.0.9.8
Try: zypper install -C "debuginfo(build-id)=d18ef9c9ddb90ed79b550ba6399c00874bc86345"
Missing separate debuginfo for /usr/lib64/libcrypto.so.0.9.8
Try: zypper install -C "debuginfo(build-id)=abcd98fb64029fea0fc96116be5f178a429e63d5"
Missing separate debuginfo for /lib64/libdl.so.2
Try: zypper install -C "debuginfo(build-id)=f607b21f9a513c99bba9539050c01236d19bf22b"
Missing separate debuginfo for /lib64/libm.so.6
Try: zypper install -C "debuginfo(build-id)=4e9fa1a2c1141fc0123a142783efd044c40bdaaf"
Missing separate debuginfo for /lib64/libpthread.so.0
Try: zypper install -C "debuginfo(build-id)=341d7c595fd2db49df98b8a6ae2c319f46b43c5b"
Missing separate debuginfo for /lib64/libc.so.6
Try: zypper install -C "debuginfo(build-id)=9e0264386fde8570b215fd4c32465fdda3c1c996"
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /lib64/libz.so.1
Try: zypper install -C "debuginfo(build-id)=4c05d1eb180f9c02b81a0c559c813dada91e0ca4"
[New Thread 0x7ffff69ff700 (LWP 31685)]
[New Thread 0x7ffff61fe700 (LWP 31686)]
[New Thread 0x7ffff59fd700 (LWP 31687)]
[New Thread 0x7ffff51fc700 (LWP 31688)]
Reading in Master_Job_List.
read job database with 0 entries in 0 seconds
error: error opening file "/jms/spool/i005/sge_spool/qmaster/./sharetree" for reading: No such file or directory
nr of dynamic event clients exceeds max file descriptor limit, setting MAX_DYN_EC=979
qmaster hard descriptor limit is set to 8192
qmaster soft descriptor limit is set to 1024
qmaster will use max. 1004 file descriptors for communication
qmaster will accept max. 979 dynamic event clients
starting up SGE 8.1.3pre (lx-amd64)
Q:1, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3, PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:277, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3, PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:279, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3, PR:9, RQS:0, AR:0, S:nd:0/lf:0


..

:281, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3, PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:281, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, C
[New Thread 0x7ffff41ff700 (LWP 31691)]
[New Thread 0x7ffff39fe700 (LWP 31692)]
[New Thread 0x7ffff31fd700 (LWP 31693)]
[New Thread 0x7ffff29fc700 (LWP 31694)]
[New Thread 0x7ffff21fb700 (LWP 31695)]
[New Thread 0x7ffff19fa700 (LWP 31696)]
[New Thread 0x7ffff11f9700 (LWP 31697)]
[New Thread 0x7ffff09f8700 (LWP 31698)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff19fa700 (LWP 31696)]
0x00000000004995c6 in do_gdi_packet (monitor=<optimized out>, aMsg=<optimized out>, answer_list=<optimized out>, ctx=<optimized out>) at ../daemons/qmaster/sge_qmaster_process_message.c:195
195 ../daemons/qmaster/sge_qmaster_process_message.c: No such file or directory.
        in ../daemons/qmaster/sge_qmaster_process_message.c

#0 0x00000000004995c6 in do_gdi_packet (monitor=<optimized out>, aMsg=<optimized out>, answer_list=<optimized out>, ctx=<optimized out>) at ../daemons/qmaster/sge_qmaster_process_message.c:195

        packet = 0x0
        local_ret = false
        SGE_FUNC = "do_gdi_packet"

#1 sge_qmaster_process_message (ctx=0x7ffff42b6800, monitor=<optimized out>) at ../daemons/qmaster/sge_qmaster_process_message.c:158

        res = <optimized out>
        msg = {snd_host = "sget5.hpc.domain.com", '\000' <repeats 40 times>, snd_name = "qstat", '\000' <repeats 58 times>, snd_id = 637, tag = 2, request_mid = 2, buf = {head_ptr = 0x7fffefc17800 "", cur_ptr = 0x7fffefc1783e "", mem_size = 1241, bytes_used = 62, just_count = 0, version = 268566528}}
        SGE_FUNC = "sge_qmaster_process_message"

#2 0x000000000042fb74 in sge_listener_main (arg=<optimized out>) at ../daemons/qmaster/sge_thread_listener.c:169

        thread_config = 0x7ffff4693de0
        monitor = {thread_name = 0x7ffff42a7360 "listener000", monitor_time = 0, log_monitor_mes = false, output_line1 = 0x7ffff42ab1a0, output_line2 = 0x7ffff42ab1c0, work_line = 0x7ffff42ab1c0, pos = 6, now = {tv_sec = 0, tv_usec = 0}, output = false, message_in_count = 0, message_out_count = 0, idle = 0, wait = 0, ext_type = LIS_EXT, ext_data = 0x7ffff42a7370, ext_data_size = 16, ext_output = 0x5c8ac0 <ext_lis_output>}
        ctx = 0x7ffff42b6800
        next_prof_output = 0
        SGE_FUNC = "sge_listener_main"

#3 0x00007ffff717b6a6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007ffff6eeaf7d in clone () from /lib64/libc.so.6
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.


sge_qmaster died randomly, without significant load on the server:

[2012-10-31 18:42:03] sge_qmaster[8181]: segfault at 68 ip 00000000004995c6 sp 00007f40acbf9c50 error 6 in sge_qmaster[400000+267000]
[2012-10-31 20:35:03] sge_qmaster[8895]: segfault at 68 ip 00000000004995c6 sp 00007fb8b8dfac50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 04:50:03] sge_qmaster[13225]: segfault at 68 ip 00000000004995c6 sp 00007f62cd4f9c50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 08:56:04] sge_qmaster[15224]: segfault at 68 ip 00000000004995c6 sp 00007f9b727f9c50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 10:56:02] sge_qmaster[16860]: segfault at 68 ip 00000000004995c6 sp 00007f0c183f9c50 error 6 in sge_qmaster[400000+267000]
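
For anyone triaging similar reports, here is a minimal sketch of how the kernel segfault lines in this report can be checked against the gdb backtrace. It parses the five log lines quoted in the report and confirms that every crash has the same instruction pointer (0x4995c6, the address gdb shows for frame #0 in do_gdi_packet) and the same fault address (0x68). The helper name parse_segfaults and the regular expression are illustrative assumptions, not part of Grid Engine. On x86, error code 6 means a user-mode write to a non-present page; together with packet = 0x0 in frame #0, this would be consistent with a write through a NULL pointer at member offset 0x68, though that reading is a hypothesis and not something the report confirms.

```python
import re

# Kernel log lines as quoted in this report.
LOG = """\
[2012-10-31 18:42:03] sge_qmaster[8181]: segfault at 68 ip 00000000004995c6 sp 00007f40acbf9c50 error 6 in sge_qmaster[400000+267000]
[2012-10-31 20:35:03] sge_qmaster[8895]: segfault at 68 ip 00000000004995c6 sp 00007fb8b8dfac50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 04:50:03] sge_qmaster[13225]: segfault at 68 ip 00000000004995c6 sp 00007f62cd4f9c50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 08:56:04] sge_qmaster[15224]: segfault at 68 ip 00000000004995c6 sp 00007f9b727f9c50 error 6 in sge_qmaster[400000+267000]
[2012-11-01 10:56:02] sge_qmaster[16860]: segfault at 68 ip 00000000004995c6 sp 00007f0c183f9c50 error 6 in sge_qmaster[400000+267000]
"""

# Hypothetical pattern for the kernel's "segfault at ..." message format.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<image>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfaults(text):
    """Return one dict per segfault line, with hex fields converted to int."""
    records = []
    for m in SEGFAULT_RE.finditer(text):
        ip = int(m["ip"], 16)
        base = int(m["base"], 16)
        records.append({
            "addr": int(m["addr"], 16),   # faulting memory address
            "ip": ip,                     # instruction pointer at the fault
            "err": int(m["err"]),         # page-fault error code
            "image": m["image"],          # mapped binary containing ip
            "offset": ip - base,          # ip relative to the mapping base
        })
    return records


records = parse_segfaults(LOG)
distinct_ips = {hex(r["ip"]) for r in records}
distinct_addrs = {hex(r["addr"]) for r in records}
print(len(records), "crashes; ip values:", distinct_ips, "fault addresses:", distinct_addrs)
```

A single distinct ip value across all crashes suggests one faulting instruction, which is why a single backtrace is likely representative of all five events.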

Bye

BaF035


