[GE users] No scheduler registered at qmaster

Sofia Bassil sofia.bassil at fra.se
Mon Oct 27 15:03:32 GMT 2008



Reuti wrote:
> On 23.10.2008 at 16:50, Sofia Bassil wrote:
> <snip>
>>>> The jobs in qw state are array jobs, although no new jobs are 
>>>> accepted, whether array or non-array.
>>>
>>> Do you mean you get an error when issuing qsub?
>> No error. Non-array jobs seem to be submittable and runnable as long 
>> as no array jobs are ahead in the queue. Array jobs end up in qw 
>> state, or one starts but not all tasks of it, even though there are 
>> cores available. Eventually the cores are unallocated even for the 
>> tasks that started and the entire array job ends up in qw state.
>
> How many tasks are waiting? There is a setting in the scheduler 
> configuration "max_pending_tasks_per_job" for the maximum number of 
> tasks of an array job which will be scheduled at once. Another entry, 
> "max_aj_instances" in SGE's configuration, controls the number of 
> running tasks of an array job.
>
> -- Reuti
I removed the pending tasks to get the waiting non-array jobs running, 
so I don't really know. I changed max_aj_instances to 0 (no limit). 
The only error I found in the qstat -j output when I first started 
troubleshooting was that max_aj_instances was set too low. That's why I 
removed the limit entirely, assuming the cluster would show it if it got 
overloaded, but it seems to be coping so far, at least load-wise on the 
machines. max_pending_tasks_per_job was set to 50, but I just increased 
it to 75. How can I see the effect of max_pending_tasks_per_job in the 
qstat output?

//Sofia
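
As a rough illustration of how the two limits discussed above might
interact, here is a toy Python model. It is not SGE code and does not
reproduce the real scheduler: the function name startable_tasks, its
parameters, and the assumption that each task needs exactly one free
slot are all made up for this sketch. It only encodes what the thread
states: max_pending_tasks_per_job caps how many tasks of one array job
are scheduled at once, max_aj_instances caps how many of its tasks run
at the same time, and 0 means no limit for max_aj_instances.

```python
def startable_tasks(pending, running, free_slots,
                    max_aj_instances, max_pending_tasks_per_job):
    """Toy estimate of how many tasks of ONE array job could start in a
    single scheduling run. Hypothetical helper, not part of SGE."""
    # Only this many pending tasks of the job are considered at once.
    considered = min(pending, max_pending_tasks_per_job)

    # Cap on simultaneously running tasks of the job; 0 means no limit.
    if max_aj_instances == 0:
        headroom = considered
    else:
        headroom = max(0, max_aj_instances - running)

    # Assumption: one free slot (core) per task.
    return min(considered, headroom, free_slots)


# Values from the thread: no instance limit, pending-task limit 50, then 75.
# The job size (200 tasks) and free slot count (100) are invented.
before = startable_tasks(200, 0, 100, 0, 50)
after = startable_tasks(200, 0, 100, 0, 75)
```

Under these assumptions, raising max_pending_tasks_per_job would only
change how many tasks of a job can start per scheduling run. A low
max_aj_instances would instead hold tasks in qw no matter how many
cores are free.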
>
>
>>>> There are non-array jobs running that seem to work just fine 
>>>> however. They are running through all my restarts of qmaster and 
>>>> those cores remain allocated. When I restart the qmaster, the top 
>>>> array job in qw state starts allocating cores, but there always 
>>>> remains one line in the output of 'qstat -u \*' in qw state for 
>>>> that job, and there is no change in the status of the other waiting 
>>>> jobs. When the cores are unallocated again, all array jobs are in 
>>>> qw state.
>>>
>>> When at least one task of the array job is not running, you will 
>>> have this line there.
>>>
>>>> When I stop and start the qmaster and scheduler, I get a "commlib 
>>>> error: got read error (closing "sgemaster/qconf/2")" or 
>>>> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other 
>>>> errors in connection with the restart (nothing in the schedd_runlog 
>>>> or /tmp or on the nodes). After a while I get "acknowledge timeout 
>>>> after 1200 seconds for event client (schedd:1) on host "sgemaster"".
>>>> 'qstat -j <jobid>' shows nothing in particular at first and later 
>>>> it shows the "error: can't unpack gdi request" "error: error 
>>>> unpacking gdi request: bad argument" "failed receiving gdi request" 
>>>> messages plus the job_number...script_file information.
>>>
>>> One of the reasons could be a version mix in the cluster.
>> The night before last the sge_schedd died with a segfault error. 
>> There is nothing wrong with any hardware as far as can be seen. I 
>> have been stracing the process and turned on sar, but so far I can't 
>> say I have seen anything particularly strange. I have removed the 
>> offending array jobs and now things seem much more stable. I have a 
>> reboot of the machine scheduled for tomorrow though.
>>
>> //Sofia
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
>
>




