[GE users] ptf complains: Job does not exist

Thiep Duong Thiep.Duong at am.necel.com
Sat Aug 12 01:47:06 BST 2006


Hi Reuti,

The user sits at a PC and connects via RealVNC (client 3.3.7) to
'svnc3', a Sun machine running Solaris 9.
The svnc3 host is a submit host, and the qsh command is issued
from there.

From the same submit host, the user can get a session from one queue
but not the other.  After many tries the user can sometimes get
a session on some system in the same queue, but not consistently.

Is there anything suspicious about VNC or .Xauthority that might
prevent it from working?

What I have noticed is that when things are working, the
xterm session comes up within 3 seconds.  When we have to
wait 5-6 seconds, it doesn't work anymore.
Our network is a mixture of 100 Mbit/s and 1000 Mbit/s links, and we
don't think there is any network problem.

It is a strange problem, but not an isolated one.  We have seen it
with more than one user.

Thanks.

Thiep

Reuti wrote:
> Hi,
> 
> On 11.08.2006 at 04:28, Thiep Duong wrote:
> 
>> I am getting the above messages, and the last discussion on this
>> issue was sometime in Aug/2005 without any conclusion/resolution.
>>
>> Let me try to describe it again:
>> I have more than one user who uses a VNC server window to submit
>> an interactive job:
>>     /opt/app/SGE/6.0/bin/sol-sparc64/qsh -q solaris16G.q@scblad02
>>
>> The message the user gets back is either:
>>
>> waiting for interactive job to be scheduled ...
>> Your interactive job 27807 has been successfully scheduled.
>> Or
>> waiting for interactive job to be scheduled ...
>> Could not start interactive job.
>>
>>
>> No xterm window comes up. It looks like no resource is found.
>> We added the -now no switch so that we can see what's going on:
>> qstat shows a job in qw state (job-id 27807), and I can run
>> qstat -j 27807 for 3-5 seconds, then the job is just gone.
>>
>> The user actually gets an email telling him that his job has completed.
>>
>> It's not a DISPLAY issue -- the user can open qsh using another queue
> 
> is it working in one queue, but not the other on the same machine?
> 
> I don't get your configuration: your user is sitting at his PC, making a 
> VNC connection to the master node of your cluster, and issuing a 
> qsh command there?
> 
> -- Reuti
> 
>>
>> Nothing found in spool/qmaster/messages
>>
>> Looking at the common/accounting file, the job seems to have finished:
>> solaris16G.q:scblad02:ccds:zeke:INTERACTIVE:27807:sge:0
>> :1155234047:1155234047:1155234047:0:1:0:0:0:0.000000:0
>> :0:0:0:0:0:0:0.000000:0:0:0:0:0:0:eagle:zeke:NONE
>> :1:0:0.000000:0.000000:0.000000:-U zeke -q solaris16G.q@scblad02
>> -l num_proc=1 -soft -l group=scdc -I y -P eagle:0.000000:NONE:0.000000
>>
>>
>> Looking in spool/scblad02 (execution host)
>> 08/10/2006 11:12:59|execd|scblad02|W|reaping job "27801" ptf 
>> complains: Job does not exist
>> 08/10/2006 11:19:30|execd|scblad02|W|reaping job "27805" ptf 
>> complains: Job does not exist
>> 08/10/2006 11:20:47|execd|scblad02|W|reaping job "27807" ptf 
>> complains: Job does not exist
>>
>> It's not the execution host, because other users can open and run
>> jobs on the same queue --
>>
>> What can we do to debug further?  I am using the 6.0u7 release.
>>
>> Thanks in advance.
>>
>> Thiep
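
For reading the accounting record quoted above, here is a minimal Python sketch. It is only an illustration: the field order is the one I recall from accounting(5) for the 6.0 series and should be checked against the local man page, only the leading fields of the record are used, and the helper names (FIELDS, parse_accounting, to_utc) are made up for this example.

```python
from datetime import datetime, timezone

# Leading fields of an SGE accounting record. ASSUMPTION: this order is
# taken from accounting(5) as remembered; verify it on your installation.
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]

# First 14 fields of the record for job 27807 quoted above.
# The remaining fields are left out on purpose.
RECORD = ("solaris16G.q:scblad02:ccds:zeke:INTERACTIVE:27807:sge:0"
          ":1155234047:1155234047:1155234047:0:1:0")


def parse_accounting(line):
    """Split a colon-separated record and label its leading fields."""
    return dict(zip(FIELDS, line.strip().split(":")))


def to_utc(epoch):
    """Convert an epoch-seconds string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc)


job = parse_accounting(RECORD)
# Time spent waiting in the queue and time spent running, in seconds.
wait_seconds = int(job["start_time"]) - int(job["submission_time"])
run_seconds = int(job["end_time"]) - int(job["start_time"])

if __name__ == "__main__":
    print("job", job["job_number"], "on", job["hostname"])
    print("ended (UTC):", to_utc(job["end_time"]).isoformat())
    print("wait:", wait_seconds, "s, run:", run_seconds, "s")
    print("failed:", job["failed"], "exit_status:", job["exit_status"])
```

If the field order assumed here is right, submission, start and end all carry the same timestamp, and the record shows failed 0 with exit_status 1. That would be consistent with the job being dispatched and then exiting at once, rather than never being scheduled. The end time converts to 18:20:47 UTC on 10 Aug 2006, which lines up with the 11:20:47 execd entry for job 27807 if the execution host logs in a UTC-7 local time (an assumption on my part).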
