[GE users] schedd hangs with infinite loop :-((

Christian Kauhaus ckauhaus at informatik.uni-jena.de
Thu Apr 1 13:35:20 BST 2004

Andy Schwierskott <andy.schwierskott at sun.com>:
>   - did you recently upgrade your glibc version

We upgraded the Debian libc6 package from 2.3.2.ds1-9 to 2.3.2.ds1-11
about one week ago. But everything ran fine for that week, with roughly
1500 jobs scheduled.

>   - or did you move the master machine to this new machine

No, the master has been on the same host all the time.

>   - or did you begin to use functional tickets

We have been using functional tickets for a while and never had any
problems with 5.3p5.

>Please send your scheduler config (qconf -ssconf) as well.

# qconf -ssconf
algorithm                  default
schedule_interval          00:00:30
maxujobs                   30
queue_sort_method          seqno
user_sort                  true
job_load_adjustments       np_load_avg=0.9
load_adjustment_decay_time 0:02:00
load_formula               np_load_avg*100000+swap_rate
schedd_job_info            true
sgeee_schedule_interval    00:02:30
halftime                   168
usage_weight_list          cpu=0.5,mem=0.25,io=0.25
compensation_factor        5
weight_user                0.2
weight_project             0.2
weight_jobclass            0.2
weight_department          0.2
weight_job                 0.2
weight_tickets_functional  10000
weight_tickets_share       100000
weight_tickets_deadline    10000
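
To illustrate how the two terms of the load_formula above combine, here is
a small sketch. The host names and numbers are made up for illustration and
are not measurements from our cluster; swap_rate is assumed to be in
bytes/sec.

```python
# Sketch of load_formula = np_load_avg*100000 + swap_rate.
# All input values below are hypothetical.

def load_value(np_load_avg, swap_rate):
    """Evaluate the load formula for one host (swap_rate in bytes/sec)."""
    return np_load_avg * 100000 + swap_rate


# A lightly loaded host that is swapping ...
swapping_host = load_value(np_load_avg=0.25, swap_rate=50000)
# ... ends up with the same value as a busier host that does not swap.
busy_host = load_value(np_load_avg=0.75, swap_rate=0)

print(swapping_host, busy_host)
```

With the factor 100000, a difference of 0.5 in np_load_avg weighs as much
as 50000 bytes/sec of swap traffic.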

The complex value 'swap_rate' comes from a custom load sensor script,
since the built-in load sensor does not seem to work on arch glinux. It
is measured in bytes/sec. We need this because some of our machines tend
to run short on memory due to interactive usage.
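
For readers who have not written a load sensor: the following is a minimal
sketch of what such a script can look like. It is not our actual script. It
assumes the usual load sensor protocol (execd writes a line to the sensor's
stdin for each report, the sensor answers with a begin/end block, and
"quit" ends it). It also assumes that /proc/vmstat with the pswpin/pswpout
counters is available and that the name returned by gethostname() is the
one SGE knows the host by. The function names are made up.

```python
import os
import socket
import time


def read_swap_pages(path="/proc/vmstat"):
    """Return pages swapped in plus out since boot, or 0 if unavailable."""
    total = 0
    try:
        with open(path) as f:
            for line in f:
                key, _, value = line.partition(" ")
                if key in ("pswpin", "pswpout"):
                    total += int(value)
    except OSError:
        pass  # assumption: no /proc/vmstat means "report 0"
    return total


def swap_rate(prev_pages, cur_pages, seconds, page_size):
    """Convert a page-count delta into bytes/sec."""
    if seconds <= 0:
        return 0
    return int((cur_pages - prev_pages) * page_size / seconds)


def format_report(host, value):
    """Build one load report block for the 'swap_rate' complex."""
    return ["begin", f"{host}:swap_rate:{value}", "end"]


def run(stdin, stdout, clock=time.monotonic, read_pages=read_swap_pages):
    """Answer every input line with a report until 'quit' or EOF."""
    host = socket.gethostname()
    page_size = os.sysconf("SC_PAGE_SIZE")
    prev_pages, prev_time = read_pages(), clock()
    for line in stdin:
        if line.strip() == "quit":
            break
        cur_pages, cur_time = read_pages(), clock()
        rate = swap_rate(prev_pages, cur_pages, cur_time - prev_time, page_size)
        prev_pages, prev_time = cur_pages, cur_time
        stdout.write("\n".join(format_report(host, rate)) + "\n")
        stdout.flush()  # execd reads from a pipe, so flush each report
```

To try it as a real sensor, one would call run(sys.stdin, sys.stdout) at the
end of the script (after "import sys") and point the load_sensor setting of
the host configuration at it.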

It is also worth noting that I actually got sge_schedd running again by
removing all jobs from the directories
$SGE_ROOT/default/spool/qmaster/jobs and
$SGE_ROOT/default/spool/qmaster/job_scripts. Of course I got some angry


