[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Bevan C. Bennett bevan at fulcrummicro.com
Tue Jul 15 19:45:42 BST 2008



Reuti wrote:
> Hi,
> 
> Am 15.07.2008 um 01:45 schrieb Bevan C. Bennett:
> 
>>
>> I'm seeing a lot of intermittent failures on my grid. They always look
>> like the following and leave the host that tried to run them in an
>> error state. Left alone, they'll eventually put the whole system into
>> error and bring everything to a crashing halt.
>>
>> Does anyone have any idea what might cause these?
>>
>> Does anyone have any similar issues?
>>
>> We've been band-aiding the situation by scanning the logs and clearing
>> the error states, but that's not maintainable.
>>
>> They all say roughly the same thing:
>>
>> Job 1157200 caused action: Job 1157200 set to ERROR
>>  User        = rozdag
>>  Queue       = all.q@ytterbium.internal.avlsi.com
>>  Host        = ytterbium.internal.avlsi.com
>>  Start Time  = <unknown>
>>  End Time    = <unknown>
>> failed before job:04/23/2008 14:08:01 [0:10801]: can't open file
>> job_pid: Permission denied
>> Shepherd trace:
>> 04/23/2008 14:08:01 [5143:10800]: shepherd called with uid = 0, euid =
>> 5143
>> 04/23/2008 14:08:01 [5143:10800]: starting up 6.1u2
>> 04/23/2008 14:08:01 [5143:10800]: setpgid(10800, 10800) returned 0
>> 04/23/2008 14:08:01 [5143:10800]: no prolog script to start
>> 04/23/2008 14:08:01 [5143:10801]: processing qlogin job
>> 04/23/2008 14:08:01 [5143:10801]: pid=10801 pgrp=10801 sid=10801 old
>> pgrp=10800 getlogin()=<no login set>
>> 04/23/2008 14:08:01 [5143:10801]: reading passwd information for user
>> 'root'
>> 04/23/2008 14:08:01 [5143:10801]: setosjobid: uid = 0, euid = 5143
>> 04/23/2008 14:08:01 [5143:10800]: forked "job" with pid 10801
>> 04/23/2008 14:08:01 [5143:10801]: setting limits
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_CPU setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_FSIZE setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_DATA setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_STACK setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_CORE setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_RSS setting: (soft
>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>> 18446744073709551615 hard 18446744073709551615)
>> 04/23/2008 14:08:01 [5143:10801]: setting environment
>> 04/23/2008 14:08:01 [5143:10801]: Initializing error file
>> 04/23/2008 14:08:01 [5143:10800]: child: job - pid: 10801
>> 04/23/2008 14:08:01 [5143:10801]: switching to intermediate/target user
>> 04/23/2008 14:08:01 [9086:10801]: closing all filedescriptors
>> 04/23/2008 14:08:01 [9086:10801]: further messages are in "error" and
>> "trace"
>> 04/23/2008 14:08:01 [0:10801]: now running with uid=0, euid=0
>> 04/23/2008 14:08:01 [0:10801]: start qlogin
>> 04/23/2008 14:08:01 [0:10801]: calling
>> qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/active_jobs/1157200.1,
>> /usr/sbin/sshd-grid -i);
> 
> are the spool directories global, on NFS or the like, and could it be a
> timing problem? Could you also put them on local disk, in /var/spool/sge
> or similar?

They are currently global on NFS. We keep our system clocks tightly
synced, but the master has been seeing increased latency at times.
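
One thing the trace makes me want to rule out is root squashing on the
export: the shepherd is running with uid=0 when the job_pid open fails
with Permission denied. I'll check along these lines (the export entry
shown is just a guess at our setup, not verbatim):

    # on the NFS server: does the export squash root?
    grep fulcrum /etc/exports
    #   /mnt/fulcrum/local/common  *(rw,no_root_squash,sync)   <- want this

    # on the execution host: can root actually write where the
    # shepherd tried to?
    touch /mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/test_pid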

I'll look into converting the servers over to local spooling.
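
In the meantime, the band-aiding I mentioned boils down to a cron'd
script along these lines (a rough sketch; the state column position and
the qmod -c behaviour are what I remember from the 6.1 docs, so
double-check before relying on it):

    #!/bin/sh
    # log why each queue instance is currently in error state
    qstat -f -explain E

    # clear the E state on every queue instance that shows it
    # (states are the 6th column of 'qstat -f' output in 6.1)
    for qi in `qstat -f | awk '$6 ~ /E/ {print $1}'`; do
        qmod -c "$qi"
    done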

> http://gridengine.sunsource.net/howto/nfsreduce.html
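
If I read that howto right, the per-host part of the conversion is
roughly the following (a sketch from the docs, not something I've run
yet; adjust the admin user and init script names to your install):

    # on the execution host: create the local spool area
    mkdir -p /var/spool/sge
    chown sgeadmin /var/spool/sge

    # point this host's execd at it (qconf opens an editor;
    # set execd_spool_dir to /var/spool/sge)
    qconf -mconf ytterbium

    # restart the execution daemon so it picks up the new spool dir
    /etc/init.d/sgeexecd stop
    /etc/init.d/sgeexecd start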
