[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Bevan C. Bennett bevan at fulcrummicro.com
Wed Jul 16 00:47:04 BST 2008



Bevan C. Bennett wrote:
> Reuti wrote:
>> Hi,
>>
>> On 15.07.2008 at 01:45, Bevan C. Bennett wrote:
>>
>>> I'm seeing a lot of intermittent failures on my grid. They always look
>>> like the following and end up marking the host that tried to run them as
>>> error state. Left alone they'll mark the whole system as error and bring
>>> everything to a crashing halt.
>>>
>>> Does anyone have any idea what might cause these?
>>>
>>> Does anyone have any similar issues?
>>>
>> Are the spool directories global on NFS or similar, and could this be a
>> timing problem? Could you also make them local, in /var/spool/sge or so?
> 
> They are global on NFS currently. We keep our system times synced
> tightly, but the master has been experiencing increased latency at times.
> 
> I'll see about starting to convert the servers over.

I've been moving through the slow process of converting every server to
a local spool directory, but I noticed the following this afternoon.
europium was converted to a local spool this morning (note the
references to /var/spool/sge), yet it still popped up with the same error.

"/scratch" is a local drive on all servers and is set as the "tmp
directory" for all queues. Could the execd somehow be deleting this tmp
jobdir before the shepherd gets to it? Could it be getting created too late?
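
To see whether the job directory is ever created, and when it
disappears, something like the following could be left running on an
affected host. This is only a sketch in Python: the function names
(snapshot, diff_snapshots, watch) and the 0.2 second polling interval
are my own assumptions, not anything Grid Engine provides, and polling
can miss a directory that lives for less than one interval.

```python
import os
import time


def snapshot(tmp_root):
    """Return the set of directory names currently under tmp_root."""
    try:
        with os.scandir(tmp_root) as entries:
            return {e.name for e in entries if e.is_dir()}
    except FileNotFoundError:
        # tmp_root itself is missing; treat it as empty.
        return set()


def diff_snapshots(before, after):
    """Return (created, removed) directory names between two snapshots."""
    return sorted(after - before), sorted(before - after)


def watch(tmp_root="/scratch", interval=0.2, rounds=None):
    """Poll tmp_root and print a timestamped line for every change.

    rounds=None polls forever; a number limits the loop.
    """
    seen = snapshot(tmp_root)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        current = snapshot(tmp_root)
        created, removed = diff_snapshots(seen, current)
        # Same timestamp layout as the shepherd trace, for easy comparison.
        stamp = time.strftime("%m/%d/%Y %H:%M:%S")
        for name in created:
            print(f"{stamp} created {name}")
        for name in removed:
            print(f"{stamp} removed {name}")
        seen = current
        done += 1
```

Running watch("/scratch") in a terminal and comparing its output with
the shepherd trace for a failed job should show whether a directory
such as 1345595.1.all.q was removed before the failure, or never
appeared at all.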

Job 1345595 caused action: Queue "all.q at europium.internal.avlsi.com" set
to ERROR
 User        = gcohn
 Queue       = all.q at europium.internal.avlsi.com
 Host        = europium.internal.avlsi.com
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:07/15/2008 16:40:01 [5143:9026]: can't open file
/scratch/1345595.1.all.q/pid: No such file or direc
Shepherd trace:
07/15/2008 16:39:43 [5143:9026]: shepherd called with uid = 0, euid = 5143
07/15/2008 16:39:43 [5143:9026]: starting up 6.1u2
07/15/2008 16:39:43 [5143:9026]: setpgid(9026, 9026) returned 0
07/15/2008 16:39:43 [5143:9026]: no prolog script to start
07/15/2008 16:39:43 [5143:9027]: processing qlogin job
07/15/2008 16:39:43 [5143:9027]: pid=9027 pgrp=9027 sid=9027 old
pgrp=9026 getlogin()=<no login set>
07/15/2008 16:39:43 [5143:9027]: reading passwd information for user 'root'
07/15/2008 16:39:43 [5143:9026]: forked "job" with pid 9027
07/15/2008 16:39:43 [5143:9026]: child: job - pid: 9027
07/15/2008 16:39:43 [5143:9027]: setosjobid: uid = 0, euid = 5143
07/15/2008 16:39:43 [5143:9027]: setting limits
07/15/2008 16:39:43 [5143:9027]: RLIMIT_CPU setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_FSIZE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_DATA setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_STACK setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_CORE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: RLIMIT_RSS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
07/15/2008 16:39:43 [5143:9027]: setting environment
07/15/2008 16:39:43 [5143:9027]: Initializing error file
07/15/2008 16:39:43 [5143:9027]: switching to intermediate/target user
07/15/2008 16:39:43 [9182:9027]: closing all filedescriptors
07/15/2008 16:39:43 [9182:9027]: further messages are in "error" and "trace"
07/15/2008 16:39:43 [0:9027]: now running with uid=0, euid=0
07/15/2008 16:39:43 [0:9027]: start qlogin
07/15/2008 16:39:43 [0:9027]: calling
qlogin_starter(/var/spool/sge/europium/active_jobs/1345595.1,
/usr/sbin/sshd-grid -i);
07/15/2008 16:39:43 [0:9027]: uid = 0, euid = 0, gid = 0, egid = 0
07/15/2008 16:39:43 [0:9027]: using sfd 1
07/15/2008 16:39:43 [0:9027]: bound to port 37325
07/15/2008 16:39:43 [0:9027]: write_to_qrsh - data =
0:37325:/usr/local/grid-6.0/utilbin/lx24-amd64:/var/spool/sge/europium/active_jobs/1345595.1:europium.internal.avlsi.com
07/15/2008 16:39:43 [0:9027]: write_to_qrsh - address = charlemagne:55825
07/15/2008 16:39:43 [0:9027]: write_to_qrsh - host = charlemagne, port =
55825
07/15/2008 16:39:43 [0:9027]: waiting for connection.
07/15/2008 16:40:01 [5143:9026]: wait3 returned -1
07/15/2008 16:40:01 [5143:9026]: mapped signal TSTP to signal KILL
07/15/2008 16:40:01 [5143:9026]: queued signal KILL
07/15/2008 16:40:01 [5143:9026]: can't open file
/scratch/1345595.1.all.q/pid: No such file or directory
07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - data = 1:can't open
file /scratch/1345595.1.all.q/pid: No such file or directory
07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - address = charlemagne:55825
07/15/2008 16:40:01 [5143:9026]: write_to_qrsh - host = charlemagne,
port = 55825

Shepherd error:
07/15/2008 16:40:01 [5143:9026]: can't open file
/scratch/1345595.1.all.q/pid: No such file or directory

Shepherd pe_hostfile:
europium.internal.avlsi.com 1 all.q at europium.internal.avlsi.com <NULL>
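
For comparing several of these reports, the pauses in a shepherd trace
can be pulled out mechanically. The following is a rough Python sketch
rather than a Grid Engine tool: parse_trace and largest_gap are
hypothetical names, and reading the two bracketed numbers as effective
uid and pid is my assumption from the trace above.

```python
import re
from datetime import datetime

# Start of a trace entry, e.g.
# "07/15/2008 16:39:43 [5143:9026]: starting up 6.1u2"
TRACE_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: ?(.*)$"
)


def parse_trace(lines):
    """Return a list of (timestamp, euid, pid, message) tuples.

    Lines without a timestamp are treated as wrapped continuations
    and appended to the previous message.
    """
    entries = []
    for line in lines:
        text = line.rstrip("\n")
        m = TRACE_LINE.match(text)
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            # Assumption: first number is the euid, second the pid.
            entries.append([ts, int(m.group(2)), int(m.group(3)), m.group(4)])
        elif entries and text.strip():
            entries[-1][3] = (entries[-1][3] + " " + text.strip()).strip()
    return [tuple(e) for e in entries]


def largest_gap(entries):
    """Return (seconds, entry_before, entry_after) for the longest pause,
    or None if there are fewer than two entries."""
    best = None
    for prev, cur in zip(entries, entries[1:]):
        gap = (cur[0] - prev[0]).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best
```

Fed only the lines under "Shepherd trace:" above, the longest pause is
the 18 seconds between "waiting for connection." at 16:39:43 and
"wait3 returned -1" at 16:40:01.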

