[GE users] sched_job_info problems

skylar2 skylar2 at u.washington.edu
Thu Aug 6 14:47:30 BST 2009

reuti wrote:
> Hi,
> On 06.08.2009 at 00:48, skylar2 wrote:
>> We're running into problems with GE 6.1u6 where having schedd_job_info
>> enabled makes sge_schedd eat up all the RAM on the system when we have
>> users submit jobs with PE slot range requests. This appears to be an
>> issue that is reported here and supposedly fixed:
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187
>> Disabling schedd_job_info makes the problem go away, but our users
>> depend on schedd_job_info output to debug their jobs. Does anyone know
>> of a workaround for this problem?
> What do you mean by "debug"? Usually it is users new to a cluster who
> request too many resources and then want to investigate why their
> jobs aren't starting. One test could be:
> $ qalter -w v <jobid>
> This can be used instead to check whether there are any suitable
> queues at all, or with "-w p" to get a reply under the current load.

I ran across that qsub/qalter trick right after I sent my email. It
looks like it'll work for us, but we'll just have to re-educate our users.
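
For anyone wanting to hand this to users, here is a minimal sketch of the check wrapped in a shell function. The function name check_jobs is hypothetical and not part of Grid Engine; the only real command it calls is "qalter -w <mode> <jobid>" as quoted above, with the mode ("v" or "p") passed as the first argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): run the qalter
# validation discussed above on one or more pending jobs.
#   $1      validation mode: "v" (any suitable queues at all?) or
#           "p" (reply under the current load), as described above
#   $2...   job IDs to check
check_jobs() {
    mode="$1"
    shift
    for jobid in "$@"; do
        echo "== job $jobid (-w $mode) =="
        # qalter prints its verdict; with -w v the job is only validated
        qalter -w "$mode" "$jobid"
    done
}
```

A user would then run something like "check_jobs v 12345 12346" (job IDs made up for illustration) and read the validation messages in place of the "qstat -j" scheduling info.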

> In special cases I turn schedd_job_info on for a few minutes to
> investigate and then turn it off again.

I tried this, but by the time you get useful "qstat -j" output, memory
usage has already ballooned, and the only way to get it back down is to
restart the scheduler.
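
For completeness, a rough sketch of how the on/off toggle could be scripted. The function name set_schedd_job_info is hypothetical; it only rewrites the schedd_job_info line in scheduler configuration text read from stdin. The qconf calls in the comments (-ssconf to show and -Msconf to load the scheduler configuration) are my assumption about how you would apply it, so check them against your local qconf man page. It does not avoid the restart described above.

```shell
#!/bin/sh
# Hypothetical helper: read scheduler configuration text on stdin and
# print it with the schedd_job_info value replaced by $1.
# Column alignment after the parameter name is preserved.
set_schedd_job_info() {
    value="$1"   # e.g. "true" or "false"
    sed -E "s/^(schedd_job_info[[:space:]]+).*/\1${value}/"
}

# Possible usage (assumed, needs manager privileges):
#   qconf -ssconf | set_schedd_job_info false > /tmp/sconf.$$
#   qconf -Msconf /tmp/sconf.$$
```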

-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S048, (206)-685-7354
-- University of Washington School of Medicine

