[GE users] SGE6 does not backfill

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Wed Apr 13 10:54:09 BST 2005


Hello,

we know the problem, but have not been able to replicate it ourselves.
What kind of hardware are you using? How big is your grid?

It could be a file descriptor limit problem. During startup, the
qmaster logs the number of file descriptors it will use. This number
has to be larger than the number of execds plus the average number of
connected clients.
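
A rough way to sanity-check this is to compare the descriptor limit
against the expected number of connections. The Python sketch below is
only illustrative: the function name and the grid figures are made up,
and it reads the limit of the current process, so it would have to be
run as the same user and from the same shell that starts the qmaster.
The number the qmaster logs at startup remains the authoritative one.

```python
import resource


def fd_headroom(fd_limit, execd_count, avg_clients):
    """Return how many descriptors remain after execds and clients connect.

    A result of zero or less means the limit is too small for this grid.
    """
    return fd_limit - (execd_count + avg_clients)


if __name__ == "__main__":
    # Soft limit of the current process, not of a running qmaster.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Hypothetical grid size, replace with your own numbers.
    execds = 200
    clients = 50

    if soft_limit == resource.RLIM_INFINITY:
        print("no soft limit on file descriptors")
    else:
        spare = fd_headroom(soft_limit, execds, clients)
        print("soft limit:", soft_limit, "spare descriptors:", spare)
```

If the spare count comes out at zero or below, the limit would need to
be raised before the qmaster is started.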

When a job vanishes, it has usually taken a very long time to start
up, in one previously reported case 2 minutes. Somehow the qmaster
sends a signal to one of the execds that has not received the job yet
and therefore reports it as unknown. The qmaster then thinks that the
job has failed and removes it.

One of the possible signals could be the reprioritization. Is it
turned off in your case?

You can use qping -dump to monitor the communication of the qmaster
with its execds. Could you try to capture a failing PE job?

Cheers,
Stephan

Juha Jäykkä wrote:

>>Another problem surfaced, though: 
>>
>>The parallel jobs NEVER run! They transfer to the exec hosts fine (go
>>from state "qw" to state "t"), but then they vanish without leaving a
>>trace ANYWHERE! I can never see them in state "r". What's up here? Is
>>some change in my config required?
>>    
>>
>
>Ok, this is the reason:
>
>04/13/2005 12:30:49|qmaster|topaasi|W|job 259.1 failed on host compute-0-0.local in recognizing job because: execd doesn't know this job
>
>How do I fix it? It appears SOME parallel jobs work, but some do not.
>Strange. I got two parallel jobs to run fine but the rest (some dozen or
>so) did not. All of them gave this same error. (It's in
>$SGE_ROOT/default/spool/qmaster/messages, by the way.)
>  
>
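
To see which jobs and hosts are affected, the messages file can be
scanned for this warning. The Python sketch below assumes the layout
seen in the quoted line (five fields separated by "|", one entry per
line); the function name and the second sample line are made up for
illustration.

```python
import re
from collections import Counter

# Matches the warning text quoted above and captures job id and host.
FAILURE = re.compile(
    r"job (\S+) failed on host (\S+) .*execd doesn't know this job"
)


def find_unknown_job_failures(lines):
    """Return (timestamp, job_id, host) for every matching warning."""
    hits = []
    for line in lines:
        # Assumed layout: timestamp|daemon|host|level|message
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # skip lines that do not fit the layout
        timestamp, _daemon, _host, _level, text = parts
        match = FAILURE.search(text)
        if match:
            hits.append((timestamp, match.group(1), match.group(2)))
    return hits


sample = [
    "04/13/2005 12:30:49|qmaster|topaasi|W|job 259.1 failed on host "
    "compute-0-0.local in recognizing job because: "
    "execd doesn't know this job",
    # Made-up placeholder standing in for any other entry.
    "04/13/2005 12:31:02|qmaster|topaasi|I|placeholder for another entry",
]

failures = find_unknown_job_failures(sample)
per_host = Counter(host for _ts, _job, host in failures)
print(failures)
print(per_host)
```

To run it against the real file, open
$SGE_ROOT/default/spool/qmaster/messages and pass the file object
instead of the sample list. If the failures pile up on a few hosts,
those are the execds to watch with qping.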


