[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Reuti reuti at staff.uni-marburg.de
Wed Jul 16 16:53:46 BST 2008


On 16.07.2008 at 01:47, Bevan C. Bennett wrote:

> Bevan C. Bennett wrote:
>> Reuti wrote:
>>> Hi,
>>>
>>> On 15.07.2008 at 01:45, Bevan C. Bennett wrote:
>>>
>>>> I'm seeing a lot of intermittent failures on my grid. They always look
>>>> like the following and end up marking the host that tried to run them as
>>>> error state. Left alone they'll mark the whole system as error and bring
>>>> everything to a crashing halt.
>>>>
>>>> Does anyone have any idea what might cause these?
>>>>
>>>> Does anyone have any similar issues?
>>>>
>>> are the spool directories global on NFS or similar, and could it be a
>>> timing problem? Could you also have them local, in /var/spool/sge or so?
>>
>> They are global on NFS currently. We keep our system times synced
>> tightly, but the master has been experiencing increased latency at times.
>>
>> I'll see about starting to convert the servers over.
>
> Been moving through the slow process of converting every server to a
> local spool directory, but noticed this this afternoon. europium was
> converted to a local spool this morning (note references to
> /var/spool/sge) yet still popped up with the same error.

AFAICS the error is different. The old one was "connection refused"
because "job_pid" inside .../<nodename>/active_jobs/<job-id>
couldn't be written - this now seems to be created.

And now it's "no such file or directory", but in /scratch.
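
To see where the time goes before that error appears, the shepherd trace can be read mechanically. The following is only a rough sketch, assuming the trace lines have the "date time [id:pid]: message" shape shown in the trace quoted below. The names parse_trace and largest_gap are made up for this example, and the first number in the brackets is assumed (from the trace, not from any documentation) to be the effective uid.

```python
import re
from datetime import datetime

# Matches shepherd trace lines such as:
#   07/15/2008 16:40:01 [5143:9026]: wait3 returned -1
TRACE_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)

def parse_trace(lines):
    """Return (timestamp, euid, pid, message) tuples for well-formed lines."""
    entries = []
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match is None:
            continue  # continuation of a wrapped line, or unrelated text
        stamp, euid, pid, message = match.groups()
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        entries.append((when, int(euid), int(pid), message))
    return entries

def largest_gap(entries):
    """Return (seconds, before, after) for the longest pause between entries."""
    best = None
    for before, after in zip(entries, entries[1:]):
        seconds = (after[0] - before[0]).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, before, after)
    return best

# Three lines copied from the trace in this thread.
sample = [
    "07/15/2008 16:39:43 [0:9027]: waiting for connection.",
    "07/15/2008 16:40:01 [5143:9026]: wait3 returned -1",
    "07/15/2008 16:40:01 [5143:9026]: mapped signal TSTP to signal KILL",
]
seconds, before, after = largest_gap(parse_trace(sample))
print(seconds, before[3], "->", after[3])
# prints: 18.0 waiting for connection. -> wait3 returned -1
```

On the trace from this job it shows an 18 second pause between "waiting for connection." and "wait3 returned -1", right before the pid file in /scratch can't be opened.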

Is /scratch local and writable by everyone?
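
The "writable by everyone" part could be checked on each node with something like the sketch below. It assumes Python is available on the exec hosts, and describe_tmpdir is just a name chosen for this example. It only looks at the permission bits and can't tell whether the directory is on a local disk or on NFS.

```python
import os
import stat

def describe_tmpdir(path):
    """Report whether path is a directory that every user can write to."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return {"exists": False, "is_dir": False,
                "world_writable": False, "sticky": False}
    mode = info.st_mode
    return {
        "exists": True,
        "is_dir": stat.S_ISDIR(mode),
        # "other" write bit: any user may create entries in the directory
        "world_writable": bool(mode & stat.S_IWOTH),
        # sticky bit, as usually set on /tmp (mode 1777)
        "sticky": bool(mode & stat.S_ISVTX),
    }

if __name__ == "__main__":
    # /scratch is the tmp directory named in this thread.
    print(describe_tmpdir("/scratch"))
```

A directory set up like /tmp (mode 1777) should report is_dir, world_writable and sticky all as True.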

As the job was submitted as root: is ssh by root allowed?

-- Reuti


> "/scratch" is a local drive on all servers and is set as the "tmp
> directory" for all queues. Could the execd somehow be deleting this tmp
> jobdir before the shepherd gets to it? Could it be getting created too late?
>
> Job 1345595 caused action: Queue "all.q at europium.internal.avlsi.com" set to ERROR
>  User        = gcohn
>  Queue       = all.q at europium.internal.avlsi.com
>  Host        = europium.internal.avlsi.com
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before job:07/15/2008 16:40:01 [5143:9026]: can't open file
> /scratch/1345595.1.all.q/pid: No such file or direc
> Shepherd trace:
> 07/15/2008 16:39:43 [5143:9026]: shepherd called with uid = 0, euid = 5143
> 07/15/2008 16:39:43 [5143:9026]: starting up 6.1u2
> 07/15/2008 16:39:43 [5143:9026]: setpgid(9026, 9026) returned 0
> 07/15/2008 16:39:43 [5143:9026]: no prolog script to start
> 07/15/2008 16:39:43 [5143:9027]: processing qlogin job
> 07/15/2008 16:39:43 [5143:9027]: pid=9027 pgrp=9027 sid=9027 old
> pgrp=9026 getlogin()=<no login set>
> 07/15/2008 16:39:43 [5143:9027]: reading passwd information for user 'root'
> 07/15/2008 16:39:43 [5143:9026]: forked "job" with pid 9027
> 07/15/2008 16:39:43 [5143:9026]: child: job - pid: 9027
> 07/15/2008 16:39:43 [5143:9027]: setosjobid: uid = 0, euid = 5143
> 07/15/2008 16:39:43 [5143:9027]: setting limits
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_CPU setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_DATA setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_CORE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_RSS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 07/15/2008 16:39:43 [5143:9027]: setting environment
> 07/15/2008 16:39:43 [5143:9027]: Initializing error file
> 07/15/2008 16:39:43 [5143:9027]: switching to intermediate/target user
> 07/15/2008 16:39:43 [9182:9027]: closing all filedescriptors
> 07/15/2008 16:39:43 [9182:9027]: further messages are in "error" and "trace"
> 07/15/2008 16:39:43 [0:9027]: now running with uid=0, euid=0
> 07/15/2008 16:39:43 [0:9027]: start qlogin
> 07/15/2008 16:39:43 [0:9027]: calling
> qlogin_starter(/var/spool/sge/europium/active_jobs/1345595.1,
> /usr/sbin/sshd-grid -i);
> 07/15/2008 16:39:43 [0:9027]: uid = 0, euid = 0, gid = 0, egid = 0
> 07/15/2008 16:39:43 [0:9027]: using sfd 1
> 07/15/2008 16:39:43 [0:9027]: bound to port 37325
> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - data = 0:37325:/usr/local/grid-6.0/utilbin/lx24-amd64:/var/spool/sge/europium/active_jobs/1345595.1:europium.internal.avlsi.com
> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - address = charlemagne:55825
> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - host = charlemagne, port = 55825
> 07/15/2008 16:39:43 [0:9027]: waiting for connection.
> 07/15/2008 16:40:01 [5143:9026]: wait3 returned -1
> 07/15/2008 16:40:01 [5143:9026]: mapped signal TSTP to signal KILL
> 07/15/2008 16:40:01 [5143:9026]: queued signal KILL
> 07/15/2008 16:40:01 [5143:9026]: can't open file
> /scratch/1345595.1.all.q/pid: No such file or directory
> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - data = 1:can't open
> file /scratch/1345595.1.all.q/pid: No such file or directory
> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - address = charlemagne:55825
> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - host = charlemagne, port = 55825
>
> Shepherd error:
> 07/15/2008 16:40:01 [5143:9026]: can't open file
> /scratch/1345595.1.all.q/pid: No such file or directory
>
> Shepherd pe_hostfile:
> europium.internal.avlsi.com 1 all.q at europium.internal.avlsi.com <NULL>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net





