[GE users] job requests unknown resource

Adrian Lang lang at fhi-berlin.mpg.de
Wed Feb 7 15:31:09 GMT 2007


On 05.02.2007, at 16:10, Reuti wrote:


> Quoting Adrian Lang <lang at fhi-berlin.mpg.de>:
>
>
>>
>>
>>> On 02.02.2007, at 22:58, Adrian Lang wrote:
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> On 02.02.2007, at 21:28, Adrian Lang wrote:
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi Adrian,
>>>>>>>
>>>>>>> On 02.02.2007, at 14:39, Adrian Lang wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Dear gridengine users,
>>>>>>>>
>>>>>>>> I have run into a rather interesting problem: qstat prints the
>>>>>>>> following information (the same for all other execution hosts):
>>>>>>>> ---
>>>>>>>> hard resource_list:         h_fsize=1500M,h_cpu=5400
>>>>>>>> soft resource_list:         own=TRUE
>>>>>>>> [...]
>>>>>>>> scheduling info:
>>>>>>>> [...]
>>>>>>>>                           (-l h_cpu=5400,h_fsize=1500M) cannot run
>>>>>>>> in queue instance "fhiforeign.q@compute-1-10" because job requests
>>>>>>>> unknown resource (h_fsize)
>>>>>>>> [...]
>>>>>>>> ---
>>>>>>>>
>>>>>>>>
>>>>>>>> qconf -sc prints:
>>>>>>>> ---
>>>>>>>> #name     shortcut  type    relop  requestable  consumable  default  urgency
>>>>>>>> #---------------------------------------------------------------------------
>>>>>>>> [...]
>>>>>>>> h_fsize   h_fsize   MEMORY  <=     FORCED       YES         0        0
>>>>>>>> [...]
>>>>>>>> ---
>>>>>>>>
>>>>>>>
>>>>>>> You attached it also in the definition of the exec host with an
>>>>>>> initial value set there (as it's consumable)?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Actually, the h_fsize host value is supplied by a custom load
>>>>>> sensor, so I cannot add a default value, can I?
>>>>>>
>>>>>
>>>>> Don't mix the two things up:
>>>>>
>>>>> - A default value (for the consumption) is set in the complex
>>>>> definition. As h_fsize is FORCED, you don't need one.
>>>>>
>>>>> - You defined h_fsize as consumable, so from which initial amount
>>>>> should it be consumed? This is set in the exec host definition in
>>>>> "complex_values".
>>>>>
>>>>> You can either remove the consumable attribute or add the initial
>>>>> amount in the exec host definition. It is indeed possible to use a
>>>>> complex as a consumable and a load sensor at the same time. The
>>>>> lower of the calculated/reported values is used.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>
>>>> I meant "initial value", not "default value", you're right.
>>>> However, an initial value wouldn't be useful. I would expect SGE to
>>>> subtract the actually declared consumption from the reported value,
>>>> like
>>>>
>>>> 100G (reported h_fsize for the host) - 60G (requested h_fsize for
>>>> job 1 running on the host) - 20G (requested h_fsize for job 2
>>>> running on the host) = 20G remaining for pending jobs.
>>>>
>>>
>>> No, this will not happen. The reported value and the consumed value
>>> are two different things; the only combination is to take the lower
>>> of the two. Why would you like to implement such a behavior? I
>>> could only think of differently sized disks installed in different
>>> machines, for which you would like to automate setting the correct
>>> value. But then you could use a short script to set the correct
>>> value (according to the installed disk) from the shell, like:
>>>
>>> qconf -mattr exechost complex_values h_fsize=`ssh node01 geth_fsize` node01
>>>
>>
>> The jobs may leave files in the temp dir, so the free space on the
>> temp drive is not predictable.
>>
>
> How is your cluster set up? If the files are left on the nodes, they
> seem not to be written to the $TMPDIR directory, as this would be
> removed. What about copying the files back to the submit machine
> for further processing? Then the space on the nodes would be fixed,
> and the users could access the files more easily.
>
> You could copy them back in the job script or a queue epilog.
>

It's not really a problem, you see. The users are allowed to leave
files after completion of the job. Actually, we are running fine with
the current setup, even if it doesn't work as expected. However, the
problem mentioned in the first post is really annoying: some of my
users are not able to get any jobs running at the moment. Help would
be appreciated ;-)
---
Adrian Lang (PP&B)
eMail: lang at fhi-berlin.mpg.de
Tel: (+49 30) (8413) 4277
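
For readers following the thread, the rule Reuti describes (the consumable is counted down from the initial amount in "complex_values", and the lower of that remainder and the load sensor report is used) can be sketched in plain shell. This is only a model of the description above, not SGE's scheduler code. The function names, the whole-gigabyte units, and the load sensor reading of 15 are assumptions made for illustration; node01 and the 100G/60G/20G figures come from the thread. The qconf call is printed, not executed.

```shell
#!/bin/sh
# Sketch only: models the rule described in the thread, not SGE internals.
# All amounts are whole gigabytes to keep the arithmetic simple (assumption).

# remaining_consumable INITIAL REQUEST...
# Subtracts each running job's request from the initial amount that would
# be set in the exec host's complex_values.
remaining_consumable() {
    remaining=$1
    shift
    for request in "$@"; do
        remaining=$((remaining - request))
    done
    echo "$remaining"
}

# effective_capacity REMAINING REPORTED
# Prints the lower of the consumable bookkeeping and the load sensor report.
effective_capacity() {
    if [ "$1" -le "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# set_initial_cmd HOST VALUE
# Prints (does not run) the qconf call that would set the initial amount.
set_initial_cmd() {
    echo "qconf -mattr exechost complex_values h_fsize=$2 $1"
}

# Figures from the thread: 100G initial, jobs requesting 60G and 20G.
left=$(remaining_consumable 100 60 20)

# 15 is a made-up load sensor reading, lower because of leftover files.
effective_capacity "$left" 15

# Dry run of setting the initial amount on node01.
set_initial_cmd node01 100G
```

With these numbers the consumable remainder is 20G, while the hypothetical sensor reading of 15G is lower and would therefore be the limit applied to pending jobs.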

