[GE users] problems with sending task to execd

Thomas Neumann neumann at exasol.com
Fri May 20 15:03:06 BST 2005



Hello !

While running a job I get problems with the execd on the first machine 
of the job. The scenario is the following:

The job runs on multiple nodes, using bash with 4 slots as the 
parallel environment. At first everything looks fine: all nodes execute 
their commands via qrsh -inherit. But after approximately 30 minutes 
(always at the same part of the job script) the following message appears:

error executing task of job xxx: failed sending task to 
execd at xxx.xxx.xxx.xxx: can't find connection

The machine is different each time, as the script does not use the whole 
cluster, but the common factor is that the IP always belongs to the first 
node in the machine list for the job. After the message appears, the 
script hangs. When I run interactively and execute the commands of the 
job manually, the situation is nearly the same, but the job does not 
hang. I tried repeating the command that caused the problem and it 
ran through successfully.

There are also some other job scripts which sporadically report the same 
error. I have not yet found a smaller scenario that reproduces the error, 
but while looking for one I found out that the message, if it appears, 
only appears for a couple of seconds. The only other situation in which I 
saw the message (and it only worked once) was running the following 
command in a one-node job:

while true; do qrsh -inherit xxx "hostname"; sleep 1; done
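
Since the error only shows up for a few seconds, a variant of the loop that records when each failure happens may make it easier to correlate with other logs. The following is only a minimal sketch: the function name probe, its argument order, and the log format are made up for illustration, and it assumes that qrsh returns a non-zero exit status when the task cannot be sent.

```shell
#!/bin/sh
# probe (hypothetical helper): run a command repeatedly and log every failure.
# usage: probe <count> <interval-seconds> <logfile> <command> [args...]
probe() {
  count=$1
  interval=$2
  logfile=$3
  shift 3

  i=0
  failures=0
  while [ "$i" -lt "$count" ]; do
    # Capture stdout and stderr together so the error text ends up in the log.
    output=$("$@" 2>&1) && status=0 || status=$?
    if [ "$status" -ne 0 ]; then
      # One log entry per failure: timestamp, exit status, command output.
      printf '%s exit=%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$output" >> "$logfile"
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    sleep "$interval"
  done

  # Print the number of failed attempts.
  echo "$failures"
}

# Example call, with xxx standing for the host name as in the loop above:
# probe 3600 1 /tmp/qrsh-probe.log qrsh -inherit xxx "hostname"
```

The timestamps in the log file can then be compared with the messages file of the execd on the affected host.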

I know that this error message appears if the machine has no running 
execd or the IP address does not exist. As described, the error 
disappeared after some seconds. Unfortunately I did not manage to check 
everything on the machine while the error was occurring, but a few 
seconds after the error there was nothing unusual: the machine responded 
in time, was reachable by name and IP, and in a pstree I saw the execd 
running, containing the shepherds of all jobs on this machine. I checked 
the messages file in the machine's directory for errors, but found nothing. 
After interrupting and repeating the command, the error did not appear again.

Does anybody have an idea what could be the reason for such behaviour, 
and/or can somebody tell me whether there are other situations which cause 
this error message?

Thanks
    Thomas


