[GE users] SGE6 does not backfill

Juha Jäykkä juhaj at iki.fi
Wed Apr 13 11:53:52 BST 2005



> We know the problem, but were not able to replicate it ourselves. What
> kind of hardware are you using? How big is your grid?

Twelve 2-way amd64 nodes plus a 2-way front end (which belongs to the
cluster queue, but has slots=0). The front end is the only submit host.
The nodes are all HP DL145s (identical, except that one has a hard disc
from a different manufacturer, since the original died) and the front end
is an HP DL585.

> It could be a file descriptor limit problem. During the startup of the
> qmaster, the qmaster logs the number of file descriptors it will use.
> This number has to be larger than the number of execds + average
> connected clients.

Hmm... can I check this number somewhere and increase it to see what
happens?
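
As a rough sanity check of the rule quoted above (descriptors must exceed
execds + average connected clients), something like the following Python
sketch could be used. It is only an illustration: fd_headroom is a
hypothetical helper, the client count of 20 is an assumed figure, and the
limit it reads is that of the process running the script, which matches
the qmaster's limit only if qmaster is started from the same environment.
The number qmaster itself logs at startup remains the authoritative one.

```python
import resource


def fd_headroom(num_execds, avg_clients, soft_limit=None):
    """Return how many descriptors remain after execds + clients.

    soft_limit defaults to the soft RLIMIT_NOFILE of this process
    (an assumption: qmaster may run with a different limit).
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return float("inf")
    return soft_limit - (num_execds + avg_clients)


if __name__ == "__main__":
    # 12 execution nodes as in this cluster; 20 clients is an assumed value.
    spare = fd_headroom(num_execds=12, avg_clients=20)
    if spare <= 0:
        print("descriptor limit looks too small, spare =", spare)
    else:
        print("descriptor limit looks sufficient, spare =", spare)
```

A negative or zero result would suggest raising the limit (for example
with ulimit -n in the shell that starts qmaster) before testing again.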

Note that with SGE6.0u3 this *never* occurred during the couple of weeks
we used it, so it is most likely related to some change between u3 and
last Monday's CVS version (which is what I used).

> In the case of a vanishing job, it usually takes very long for that
> job to start up. In a previously reported case, 2 min. Somehow the
> qmaster sent a signal to one of the execds, which had not gotten
> the job yet and therefore reported it as unknown. The qmaster
> then thinks that the job has failed and removes it.

I did not notice this previously, but obviously the job goes to the exec
host, yet for some reason qmaster thinks it did not. First, it thinks
execd does not know of the job and then, after half a minute, it thinks
that the job is on the exec host, though it is not supposed to be (since
the execd did not know of the job just 30 secs earlier!):

04/13/2005 13:18:52|qmaster|topaasi|W|job 299.1 failed on host compute-0-11.local in recognizing job
 because: execd doesn't know this job
04/13/2005 13:19:24|qmaster|topaasi|E|execd@compute-0-11.local reports running job (299.1/master) in
 queue "all.q@compute-0-11.local" that was not supposed to be there - killing
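
To find further occurrences of this pattern in the qmaster messages file,
a small Python sketch along these lines might help. It assumes the
pipe-separated layout visible in the two entries above (timestamp,
component, host, level, message) and treats any line that does not start
a new entry as a continuation of the previous one. The function names and
regular expressions are illustrative only, and the sample lines are
modelled on the entries above.

```python
import re
from datetime import datetime

JOB_FAILED = re.compile(r"job (\d+\.\d+) failed on host (\S+)")
JOB_UNEXPECTED = re.compile(r"reports running job \((\d+\.\d+)/")


def parse_entries(lines):
    """Return [timestamp, level, message] lists, joining wrapped lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        if len(parts) == 5:
            try:
                ts = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
            except ValueError:
                ts = None
            if ts is not None:
                entries.append([ts, parts[3], parts[4]])
                continue
        # Not a new entry: append to the previous message, if any.
        if entries and line.strip():
            entries[-1][2] += " " + line.strip()
    return entries


def find_vanished(entries):
    """Return (job_id, seconds) for jobs first unknown, then killed."""
    unknown = {}  # job id -> time of the "doesn't know this job" warning
    hits = []
    for ts, _level, msg in entries:
        m = JOB_FAILED.search(msg)
        if m and "execd doesn't know this job" in msg:
            unknown[m.group(1)] = ts
            continue
        m = JOB_UNEXPECTED.search(msg)
        if m and m.group(1) in unknown:
            delay = (ts - unknown[m.group(1)]).total_seconds()
            hits.append((m.group(1), delay))
    return hits


SAMPLE = [
    "04/13/2005 13:18:52|qmaster|topaasi|W|job 299.1 failed on host "
    "compute-0-11.local in recognizing job",
    " because: execd doesn't know this job",
    "04/13/2005 13:19:24|qmaster|topaasi|E|execd@compute-0-11.local "
    "reports running job (299.1/master) in",
    ' queue "all.q@compute-0-11.local" that was not supposed to be '
    "there - killing",
]

if __name__ == "__main__":
    for job_id, delay in find_vanished(parse_entries(SAMPLE)):
        print(job_id, "reported running", delay, "s after being unknown")
```

For a real run, the lines of the qmaster messages file would be passed to
parse_entries instead of SAMPLE.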


On a side note, I get this after a successful LAM job. Is it serious?

04/13/2005 13:21:21|qmaster|topaasi|W|job 297.1 failed on host compute-0-1.local in pestop because: 
04/13/2005 13:21:21 [400:12861]: exit_status of pe_stop = 2


> One of the possible signals could be the reprioritization. Is it
> turned off in your case?

Yes, it is off. Though it WAS NOT off previously... should I change it
back?

If the job goes to unknown state, could I not use reschedule_unknown to
reschedule it after it fails?
 
> You can use qping -dump to monitor that communication of the
> qmaster with its execd. Could you try to capture a failing pe job?

You mean qping -dump topaasi.local 537 execd 1, where 537 is the port my
qmaster sits on?

-- 
		 -----------------------------------------------
		| Juha Jäykkä, juolja at utu.fi			|
		| home: http://www.utu.fi/~juolja/		|
		 -----------------------------------------------
