[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Bevan C. Bennett bevan at fulcrummicro.com
Wed Jul 16 18:14:15 BST 2008



Reuti wrote:
> On 16.07.2008 at 01:47, Bevan C. Bennett wrote:
> 
>> Bevan C. Bennett wrote:
>>> Reuti wrote:
>>>> Hi,
>>>>
>>>> On 15.07.2008 at 01:45, Bevan C. Bennett wrote:
>>>>
>>>>> I'm seeing a lot of intermittent failures on my grid. They always look
>>>>> like the following and end up putting the host that tried to run them
>>>>> into an error state. Left alone, they'll eventually put the whole
>>>>> system into an error state and bring everything to a crashing halt.
>>>>>
>>>>> Does anyone have any idea what might cause these?
>>>>>
>>>>> Does anyone have any similar issues?
>>>>>
>>>> are the spool directories global on NFS or similar, and could it be a
>>>> timing problem? Can you make them local, e.g. in /var/spool/sge?
>>>
>>> They are global on NFS currently. We keep our system times synced
>>> tightly, but the master has been experiencing increased latency at
>>> times.
>>>
>>> I'll see about starting to convert the servers over.
>>
>> Been moving through the slow process of converting every server to a
>> local spool directory, but I noticed the following this afternoon.
>> europium was converted to a local spool this morning (note the references
>> to /var/spool/sge), yet it still popped up with the same error.
> 
> AFAICS the error is different. The old one was "connection refused"
> because "job_pid" inside .../<nodename>/active_jobs/<job-id> couldn't be
> written - this one now seems to be created.

They all had the "can't open /scratch/whatever/pid" message, but you're
right: the earlier failures also showed additional (and different) errors
before that point.
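
For anyone following along, the per-host spool change is roughly the
following (only a sketch: the admin account, exact paths, and how
sge_execd gets restarted are all site-specific; /var/spool/sge is just the
path from my setup):

    # on the exec host, as root: create the local spool area
    mkdir -p /var/spool/sge
    chown -R sgeadmin /var/spool/sge   # or whatever the SGE admin user is

    # point this host's execd at it; the host-local configuration
    # overrides the global execd_spool_dir
    qconf -mconf europium
        # in the editor, set:
        execd_spool_dir   /var/spool/sge

    # then restart sge_execd on that host so it starts spooling locally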

> And now it's "no such file or directory", but in /scratch.
> 
> The /scratch is local and writable by everyone?

Yes and yes.
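
(For reference, the sort of check that answers both questions on an exec
host - a rough sketch, nothing clever:)

    # is /scratch really a local filesystem?
    df -T /scratch          # Type column should be ext3/xfs/etc., not nfs

    # is it world-writable with the sticky bit, like /tmp?
    ls -ld /scratch         # expect drwxrwxrwt (mode 1777)

    # and is it really what the queue hands out as tmpdir?
    qconf -sq all.q | grep tmpdir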

> As the job was submitted as root: ssh by root is allowed?

The job was not submitted by root; it was submitted by gcohn:
"User        = gcohn"
I don't know why it would be reading root's limits, but root certainly
can't ssh around without its password.
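
If I had to guess - and this is only a guess from reading the trace below,
not the code - the root references are the shepherd doing its normal work
for an interactive job: it stays root long enough to set limits and launch
the configured login daemon, which then authenticates the real user. The
wiring for that lives in the cluster configuration, roughly:

    # show how interactive jobs are transported on this cluster
    qconf -sconf | egrep 'qlogin|rlogin|rsh'
    # given the /usr/sbin/sshd-grid path in the trace, I'd expect
    # something like:
    #   qlogin_daemon    /usr/sbin/sshd-grid -i
    #   rsh_daemon       /usr/sbin/sshd-grid -i
    #   rsh_command      /usr/bin/ssh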


I'll watch the error logs for a few days and see what happens... it may
be that this does indeed help and the last job got caught "in between"
or something.
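
In the meantime, to keep a stray failure from taking queue instances out of
service, the error state can be cleared by hand once the host looks sane
again - roughly this, assuming 6.1's qmod/qstat behave as I remember:

    # show which queue instances are in error, and why
    qstat -f -explain E

    # clear the error state on the affected instance
    qmod -c all.q@europium.internal.avlsi.com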

>> "/scratch" is a local drive on all servers and is set as the "tmp
>> directory" for all queues. Could the execd somehow be deleting this tmp
>> jobdir before the shepherd gets to it? Could it be getting created too
>> late?
>>
>> Job 1345595 caused action: Queue "all.q@europium.internal.avlsi.com" set
>> to ERROR
>>  User        = gcohn
>>  Queue       = all.q@europium.internal.avlsi.com
>>  Host        = europium.internal.avlsi.com
>>  Start Time  = <unknown>
>>  End Time    = <unknown>
>> failed before job:07/15/2008 16:40:01 [5143:9026]: can't open file
>> /scratch/1345595.1.all.q/pid: No such file or direc
>> Shepherd trace:
>> 07/15/2008 16:39:43 [5143:9026]: shepherd called with uid = 0, euid =
>> 5143
>> 07/15/2008 16:39:43 [5143:9026]: starting up 6.1u2
>> 07/15/2008 16:39:43 [5143:9026]: setpgid(9026, 9026) returned 0
>> 07/15/2008 16:39:43 [5143:9026]: no prolog script to start
>> 07/15/2008 16:39:43 [5143:9027]: processing qlogin job
>> 07/15/2008 16:39:43 [5143:9027]: pid=9027 pgrp=9027 sid=9027 old
>> pgrp=9026 getlogin()=<no login set>
>> 07/15/2008 16:39:43 [5143:9027]: reading passwd information for user
>> 'root'
>> 07/15/2008 16:39:43 [5143:9026]: forked "job" with pid 9027
>> 07/15/2008 16:39:43 [5143:9026]: child: job - pid: 9027
>> 07/15/2008 16:39:43 [5143:9027]: setosjobid: uid = 0, euid = 5143
>> 07/15/2008 16:39:43 [5143:9027]: setting limits
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_CPU setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_FSIZE setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_DATA setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_STACK setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_CORE setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: RLIMIT_RSS setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 07/15/2008 16:39:43 [5143:9027]: setting environment
>> 07/15/2008 16:39:43 [5143:9027]: Initializing error file
>> 07/15/2008 16:39:43 [5143:9027]: switching to intermediate/target user
>> 07/15/2008 16:39:43 [9182:9027]: closing all filedescriptors
>> 07/15/2008 16:39:43 [9182:9027]: further messages are in "error" and
>> "trace"
>> 07/15/2008 16:39:43 [0:9027]: now running with uid=0, euid=0
>> 07/15/2008 16:39:43 [0:9027]: start qlogin
>> 07/15/2008 16:39:43 [0:9027]: calling
>> qlogin_starter(/var/spool/sge/europium/active_jobs/1345595.1,
>> /usr/sbin/sshd-grid -i);
>> 07/15/2008 16:39:43 [0:9027]: uid = 0, euid = 0, gid = 0, egid = 0
>> 07/15/2008 16:39:43 [0:9027]: using sfd 1
>> 07/15/2008 16:39:43 [0:9027]: bound to port 37325
>> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - data =
>> 0:37325:/usr/local/grid-6.0/utilbin/lx24-amd64:/var/spool/sge/europium/active_jobs/1345595.1:europium.internal.avlsi.com
>>
>> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - address = charlemagne:55825
>> 07/15/2008 16:39:43 [0:9027]: write_to_qrsh - host = charlemagne, port =
>> 55825
>> 07/15/2008 16:39:43 [0:9027]: waiting for connection.
>> 07/15/2008 16:40:01 [5143:9026]: wait3 returned -1
>> 07/15/2008 16:40:01 [5143:9026]: mapped signal TSTP to signal KILL
>> 07/15/2008 16:40:01 [5143:9026]: queued signal KILL
>> 07/15/2008 16:40:01 [5143:9026]: can't open file
>> /scratch/1345595.1.all.q/pid: No such file or directory
>> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - data = 1:can't open
>> file /scratch/1345595.1.all.q/pid: No such file or directory
>> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - address =
>> charlemagne:55825
>> 07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - host = charlemagne,
>> port = 55825
>>
>> Shepherd error:
>> 07/15/2008 16:40:01 [5143:9026]: can't open file
>> /scratch/1345595.1.all.q/pid: No such file or directory
>>
>> Shepherd pe_hostfile:
>> europium.internal.avlsi.com 1 all.q@europium.internal.avlsi.com <NULL>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



