[GE users] schedd hangs with infinite loop :-((

Andy Schwierskott andy.schwierskott at sun.com
Fri Apr 2 09:38:15 BST 2004


Christian,

We assume that this might be a glibc 2.3.2 related problem. In our test
cases (same glibc version from a SuSE distribution, but on an Opteron
machine) the same thing happens when there are about 200 or more jobs in the
system:

   - either busy looping
   - or crash

in the same code section. This is a section where an extremely large number
of very small free() calls are made on previously malloc()'ed areas. Could it
be that the malloc()/free() memory management of this glibc version is
broken? We didn't find a bug in the code. If someone would volunteer to
look at the affected code, we'll be happy to provide further information
on the "dev" mailing list.
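
For anyone who wants to exercise the allocator outside of Grid Engine, here
is a minimal sketch of the allocation pattern described above: many very
small malloc()'ed blocks that are then released one by one with free(). It
is not the scheduler code. The struct, the function name and the block size
are hypothetical and chosen only for illustration, and there is no guarantee
that it triggers the behaviour we saw.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a small per-job record; not an SGE structure. */
struct small_node {
    struct small_node *next;
    char payload[8];
};

/*
 * Allocate `count` small nodes into a singly linked list, then free them
 * one at a time. Returns the number of nodes freed, or -1 if malloc() fails.
 */
long alloc_and_free_small(long count)
{
    struct small_node *head = NULL;
    long i;
    long freed = 0;

    assert(count >= 0);

    for (i = 0; i < count; i++) {
        struct small_node *n = malloc(sizeof *n);
        if (n == NULL) {
            /* Out of memory: release what was allocated so far and give up. */
            while (head != NULL) {
                struct small_node *next = head->next;
                free(head);
                head = next;
            }
            return -1;
        }
        memset(n->payload, 0, sizeof n->payload);
        n->next = head;
        head = n;
    }

    /* The interesting part: a long run of very small free() calls. */
    while (head != NULL) {
        struct small_node *next = head->next;
        free(head);
        head = next;
        freed++;
    }
    return freed;
}
```

Calling alloc_and_free_small() in a loop from a small main() with a large
count gives a rough stress test. A hang or crash inside free() on one glibc
version but not on another would at least be a hint worth reporting.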

Since the code itself is quite inefficient, it was easy for us to change it,
and we have not experienced this problem since. You are the first user to
report the same problem after us.

I assume that any site that uses the functional policy and has more than
about 200 jobs in the system will experience the same problem with the
scheduler.

As a workaround I suggest you install the previous glibc version
again.
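
To see which glibc a machine is actually running, something along these
lines may help. It is only a sketch: gnu_get_libc_version() is a glibc
extension, the helper names are made up for illustration, and treating
exactly "2.3.2" as suspect is our assumption from the cases seen so far, not
a confirmed list of affected versions.

```c
#include <stdio.h>
#include <string.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>
#endif

/* Version suspected in this thread; an assumption, not a confirmed bug list. */
#define SUSPECT_GLIBC "2.3.2"

/* Returns 1 if `version` is exactly the suspected version string, else 0. */
int is_suspect_glibc(const char *version)
{
    return version != NULL && strcmp(version, SUSPECT_GLIBC) == 0;
}

/* Returns the running glibc version string, or NULL when not built on glibc. */
const char *runtime_glibc_version(void)
{
#ifdef __GLIBC__
    return gnu_get_libc_version();
#else
    return NULL;
#endif
}
```

A small main() can print runtime_glibc_version() and warn when
is_suspect_glibc() returns 1, which is handy for checking all scheduler
hosts after a downgrade.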

I don't know how best to report the potential problem back to the
Linux glibc maintainers. If anyone has a suggestion, please let me know.

Andy

> Hello!
>
> I've got news on this issue - since the scheduler was hanging again, I
> tried to restart it. This time it crashed. Fortunately I've set the
> debug level to 1, so I've got a log and a core dump.
>
> This is the backtrace:
>
> $ gdb bin/glinux/sge_schedd default/spool/qmaster/schedd/core.7041
> GNU gdb 5.3-debian
> This GDB was configured as "i386-linux"...
> Core was generated by `sge_schedd'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /lib/libm.so.6...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> #0  0x0807ba34 in free_fcategories ()
> (gdb) bt
> #0  0x0807ba34 in free_fcategories ()
> #1  0x0807f135 in sge_calc_tickets ()
> #2  0x08080f59 in sge_scheduler ()
> #3  0x0804e7a3 in dispatch_jobs ()
> #4  0x0804df06 in scheduler ()
> #5  0x0805168a in event_handler_default_scheduler ()
> #6  0x0804a86a in main ()
> #7  0x4005adc6 in __libc_start_main () from /lib/libc.so.6
> (gdb)
>
> The log file is attached.
>
> I hope this can help to identify the cause...
>
> Regards
>   Christian
>
>


Regards,
Andy Schwierskott

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel:     +49 941 3075-200  (x60200)
Sun Grid Engine Engineering Support: +49 941 3075-250  (x60250)
Sun Microsystems GmbH       Fax:     +49 941 3075-222  (x60222)
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
