ppc64le failed to reboot

Natanael Copa <ncopa@alpinelinux.org>

Message ID: <20180223122913.06140a73@ncopa-desktop.copa.dup.pw>
Sender timestamp: 1519385353
DKIM signature: missing

Hi,

I tried to reboot the ppc64le.alpinelinux.org machine because it was
not very responsive. It took 1253 seconds to shut it down (until ping
stopped responding), but it never came back again.

Can you help me take a look at what went wrong?

If you cannot do it today, then maybe on Monday?

Thanks!

-nc
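
The 1253-second figure above was taken by waiting for ping to stop answering. A minimal sketch of that kind of measurement follows. It assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds). The helper names `host_is_up` and `seconds_until_down` are hypothetical, and this is not necessarily how the number in this message was obtained.

```python
import subprocess
import time


def host_is_up(host):
    """Send one echo request with the system ping; True if it was answered.

    Assumes a Linux ping where -c is the packet count and -W is the
    reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def seconds_until_down(probe, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns False; return elapsed seconds.

    clock and sleep are injectable so the loop can be exercised without
    a real host or real waiting.
    """
    start = clock()
    while probe():
        sleep(interval)
    return clock() - start


# Example use (start it right after issuing the reboot):
#   elapsed = seconds_until_down(lambda: host_is_up("ppc64le.alpinelinux.org"))
#   print(f"host stopped answering after {elapsed:.0f} s")
```

The result is only accurate to within one polling interval plus the ping timeout, which is negligible against a shutdown that takes many minutes.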

William Pitcock <nenolod@dereferenced.org>

Message ID: <CA+T2pCEfxi9due6YQM69LN-ZHeN4h5mEK_O2+HF4A6jCcddX9A@mail.gmail.com>
In-Reply-To: <fa4e9eed-2067-3767-0685-dbe6d3a89bb8@br.ibm.com>
Sender timestamp: 1519396367
DKIM signature: missing

Hi,

On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl@br.ibm.com> wrote:
> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>> hi Natanael,
>>
>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>> Hi,
>>>
>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>> not very responsive. It took 1253 seconds to shut it down (until ping
>>> stopped responding), but it never came back again.
>>
>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>
>>> Can you help me take a look at what went wrong?
>>
>> As I cannot access the console, it is hard to understand what is going on,
>> so we will need to restart it and see if we can recover any logs. :-|
>>
>>> If you cannot do it today, then maybe on Monday?
>>
>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>> to get it online as soon as possible.
>
> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
>
>
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine

We have been observing this OPAL-related problem on that machine for a
few weeks.  I suspect there is a problem with either the hardware or
the system firmware.

William

Breno Leitao <brenohl@br.ibm.com>

Message ID: <8036c91d-4b8f-6071-d923-8b1ca2dd11c6@br.ibm.com>
In-Reply-To: <20180223122913.06140a73@ncopa-desktop.copa.dup.pw>
Sender timestamp: 1519389379
DKIM signature: missing

hi Natanael,

On 02/23/2018 08:29 AM, Natanael Copa wrote:
> Hi,
> 
> I tried to reboot the ppc64le.alpinelinux.org machine because it was
> not very responsive. It took 1253 seconds to shut it down (until ping
> stopped responding), but it never came back again.

Yes, that is weird. I am not even able to access the IPMI for this machine.

> Can you help me take a look at what went wrong?

As I cannot access the console, it is hard to understand what is going on,
so we will need to restart it and see if we can recover any logs. :-|

> If you cannot do it today, then maybe on Monday?

Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
to get it online as soon as possible.

Breno Leitao <brenohl@br.ibm.com>

Message ID: <fa4e9eed-2067-3767-0685-dbe6d3a89bb8@br.ibm.com>
In-Reply-To: <8036c91d-4b8f-6071-d923-8b1ca2dd11c6@br.ibm.com>
Sender timestamp: 1519392343
DKIM signature: missing

On 02/23/2018 09:36 AM, Breno Leitao wrote:
> hi Natanael,
> 
> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>> Hi,
>>
>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>> not very responsive. It took 1253 seconds to shut it down (until ping
>> stopped responding), but it never came back again.
>
> Yes, that is weird. I am not even able to access the IPMI for this machine.
>
>> Can you help me take a look at what went wrong?
>
> As I cannot access the console, it is hard to understand what is going on,
> so we will need to restart it and see if we can recover any logs. :-|
>
>> If you cannot do it today, then maybe on Monday?
> 
> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
> to get it online as soon as possible.

The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
I am wondering if this is a physical problem or a kernel issue. We will need to investigate:


Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
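
For anyone triaging a trace like the one above, here is a minimal sketch that pulls the stack frames out of syslog-formatted lines so the call chain can be compared between occurrences. The regular expression is written against the frame layout seen in this log (`[stack pointer] [address] symbol+0xoffset/0xsize`) and may need adjusting for other kernels. The names `FRAME_RE`, `parse_frames` and `SAMPLE` are hypothetical.

```python
import re

# Matches stack frames such as:
#   "[c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c"
# Lines such as "--- interrupt: 901 at ..." or "LR = ..." carry no
# bracketed address pair and are skipped.
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\] \[(?P<pc>[0-9a-f]{16})\] "
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?P<unreliable> \(unreliable\))?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text, in log order."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "symbol": match.group("symbol"),
                "offset": int(match.group("offset"), 16),
                "size": int(match.group("size"), 16),
                # The kernel marks frames it could not verify as "(unreliable)".
                "reliable": match.group("unreliable") is None,
            })
    return frames


# Three lines copied from the log in this message.
SAMPLE = """\
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["symbol"], hex(frame["offset"]), frame["reliable"])
```

Listing only the symbol names makes it easy to see that the stalled task is `kopald` and that the chain runs through `opal_handle_events` and `opal_event_unmask`.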

Breno Leitao <brenohl@br.ibm.com>

Message ID: <8d3736f3-1db6-1bc5-bbfd-2ac620525236@br.ibm.com>
In-Reply-To: <CA+T2pCEfxi9due6YQM69LN-ZHeN4h5mEK_O2+HF4A6jCcddX9A@mail.gmail.com>
Sender timestamp: 1519399619
DKIM signature: missing

On 02/23/2018 11:32 AM, William Pitcock wrote:
> Hi,
> 
> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl@br.ibm.com> wrote:
>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>>> hi Natanael,
>>>
>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>>> Hi,
>>>>
>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>>> not very responsive. It took 1253 seconds to shut it down (until ping
>>>> stopped responding), but it never came back again.
>>>
>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>>
>>>> Can you help me take a look at what went wrong?
>>>
>>> As I cannot access the console, it is hard to understand what is going on,
>>> so we will need to restart it and see if we can recover any logs. :-|
>>>
>>>> If you cannot do it today, then maybe on Monday?
>>>
>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>>> to get it online as soon as possible.
>>
>> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
>> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
>>
>>
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
> 
> We have been observing this OPAL-related problem on that machine for a
> few weeks.  I suspect there is a problem with either the hardware or
> the system firmware.

Right. That is what we were talking about earlier today. We probably want to
migrate to the latest firmware and see if the problem still occurs.

We should be able to upgrade the firmware next week, so we will need to stop
the machine and do the upgrade.

Natanael Copa <ncopa@alpinelinux.org>

Message ID: <20180223153226.79a15143@ncopa-desktop.copa.dup.pw>
In-Reply-To: <fa4e9eed-2067-3767-0685-dbe6d3a89bb8@br.ibm.com>
Sender timestamp: 1519396346
DKIM signature: missing

On Fri, 23 Feb 2018 10:25:43 -0300
Breno Leitao <brenohl@br.ibm.com> wrote:

> On 02/23/2018 09:36 AM, Breno Leitao wrote:
> > hi Natanael,
> > 
> > On 02/23/2018 08:29 AM, Natanael Copa wrote:  
> >> Hi,
> >>
> >> I tried to reboot the ppc64le.alpinelinux.org machine because it was
> >> not very responsive. It took 1253 seconds to shut it down (until ping
> >> stopped responding), but it never came back again.
> >
> > Yes, that is weird. I am not even able to access the IPMI for this machine.
> >
> >> Can you help me take a look at what went wrong?
> >
> > As I cannot access the console, it is hard to understand what is going on,
> > so we will need to restart it and see if we can recover any logs. :-|
> >
> >> If you cannot do it today, then maybe on Monday?
> > 
> > Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
> > to get it online as soon as possible.  
> 
> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:

Thank you very much for bringing the machine back.

-nc

Mike Sullivan <mksully@us.ibm.com>

Message ID: <OF2F674318.652306AD-ON00258240.004A8B89-86258240.004A9C76@notes.na.collabserv.com>
In-Reply-To: <502867a9-9745-f3ca-b818-b54ca335a733@br.ibm.com>
Sender timestamp: 1519652096
DKIM signature: missing

Was there any load on the machine over the weekend? Or does the problem
appear even when idle?

Mike Sullivan
Linux Performance
Linux Technology Center - IBM Corporation
(512)286-9416, email: mksully@us.ibm.com

From:   Breno Leitao <brenohl@br.ibm.com>
To:     William Pitcock <nenolod@dereferenced.org>
Cc:     Natanael Copa <ncopa@alpinelinux.org>, 
alpine-infra@lists.alpinelinux.org, Rafael Sene <rpsene@br.ibm.com>, Mike 
Sullivan/Austin/IBM@IBMUS
Date:   02/26/2018 06:38 AM
Subject:        Re: ppc64le failed to reboot

Just an update on this topic

On 02/23/2018 12:26 PM, Breno Leitao wrote:
> We should be able to upgrade the firmware next week, so we will need to stop
> the machine and do the upgrade.

I am checking the machine and it didn't hit the problem during this weekend,
so the problem does not seem to be very frequent, and I want to wait a bit
longer to see if the problem shows up again.

Depending on whether the problem reappears, Guilherme and I will be upgrading the
machine firmware next Thursday in the morning.

Breno Leitao <brenohl@br.ibm.com>

Message ID: <502867a9-9745-f3ca-b818-b54ca335a733@br.ibm.com>
In-Reply-To: <8d3736f3-1db6-1bc5-bbfd-2ac620525236@br.ibm.com>
Sender timestamp: 1519648674
DKIM signature: missing

Just an update on this topic

On 02/23/2018 12:26 PM, Breno Leitao wrote:
> We should be able to upgrade the firmware next week, so we will need to stop
> the machine and do the upgrade.

I am checking the machine and it didn't hit the problem during this weekend,
so the problem does not seem to be very frequent, and I want to wait a bit
longer to see if the problem shows up again.

Depending on whether the problem reappears, Guilherme and I will be upgrading the
machine firmware next Thursday in the morning.

William Pitcock <nenolod@dereferenced.org>

Message ID: <CA+T2pCGuhms7cTZdnOd+98H0g_MyDypKJ0S_RyapQ_+kNJTh1w@mail.gmail.com>
In-Reply-To: <OF2F674318.652306AD-ON00258240.004A8B89-86258240.004A9C76@notes.na.collabserv.com>
Sender timestamp: 1519661445
DKIM signature: missing

Hi,

On Mon, Feb 26, 2018 at 7:34 AM, Mike Sullivan <mksully@us.ibm.com> wrote:
> Was there any load on the machine over the weekend? or does the problem
> appear even when idle?

On another POWER machine I observe similar behaviour when the machine
is under network load, but not enough to cause the driver to switch to
polling mode.  Accordingly, I was thinking that maybe it is being
triggered by rsync.

I can try to upgrade the firmware to latest on my machine and see if
it goes away.

William

William Pitcock <nenolod@dereferenced.org>

Message ID: <CA+T2pCFgOs1-xM5uVHbLn3t2FyknJ=UV6bsOAe06dUXE3r0zfw@mail.gmail.com>
In-Reply-To: <20180226171953.7740bdc4@ncopa-macbook.copa.dup.pw>
Sender timestamp: 1519669909
DKIM signature: missing

Hi,

For what it's worth, I upgraded the firmware on the SC812L machine I
have, and I can no longer reproduce this problem.  I even created a VM
with qemu -enable-kvm and nothing bad happened.
So, it is probably the firmware.

William

On Mon, Feb 26, 2018 at 10:19 AM, Natanael Copa <ncopa@alpinelinux.org> wrote:
> Hi,
>
> I think the problems started after we made the Jenkins qemu/kvm VM.
> Before that we had not had any issues, as far as I can remember.
>
> I have stopped the VM since we don't use it in production yet, but I
> can start it again if you want to reproduce it.
>
> -nc
>
> On Mon, 26 Feb 2018 07:34:56 -0600
> "Mike Sullivan" <mksully@us.ibm.com> wrote:
>
>> Was there any load on the machine over the weekend? or does the problem
>> appear even when idle?
>>
>> Mike Sullivan
>> Linux Performance
>> Linux Technology Center - IBM Corporation
>> (512)286-9416, email: mksully@us.ibm.com
>>
>>
>>
>>
>> From:   Breno Leitao <brenohl@br.ibm.com>
>> To:     William Pitcock <nenolod@dereferenced.org>
>> Cc:     Natanael Copa <ncopa@alpinelinux.org>,
>> alpine-infra@lists.alpinelinux.org, Rafael Sene <rpsene@br.ibm.com>, Mike
>> Sullivan/Austin/IBM@IBMUS
>> Date:   02/26/2018 06:38 AM
>> Subject:        Re: ppc64le failed to reboot
>>
>>
>>
>> Just an update on this topic
>>
>> On 02/23/2018 12:26 PM, Breno Leitao wrote:
>> > We should be able to upgrade the firmware next week, so we will need to stop
>> > the machine and do the upgrade.
>>
>> I am checking the machine and it didn't hit the problem during this weekend,
>> so the problem does not seem to be very frequent, and I want to wait a bit
>> longer to see if the problem shows up again.
>>
>> Depending on whether the problem reappears, Guilherme and I will be upgrading the
>> machine firmware next Thursday in the morning.
>>
>>
>>
>>
>

Natanael Copa <ncopa@alpinelinux.org>

Message ID: <20180226171953.7740bdc4@ncopa-macbook.copa.dup.pw>
In-Reply-To: <OF2F674318.652306AD-ON00258240.004A8B89-86258240.004A9C76@notes.na.collabserv.com>
Sender timestamp: 1519661993
DKIM signature: missing

Hi,

I think the problems started after we made the Jenkins qemu/kvm VM.
Before that we had not had any issues, as far as I can remember.

I have stopped the VM since we don't use it in production yet, but I
can start it again if you want to reproduce it.

-nc

On Mon, 26 Feb 2018 07:34:56 -0600
"Mike Sullivan" <mksully@us.ibm.com> wrote:

> Was there any load on the machine over the weekend? or does the problem 
> appear even when idle?
> 
> Mike Sullivan
> Linux Performance
> Linux Technology Center - IBM Corporation
> (512)286-9416, email: mksully@us.ibm.com
> 
> 
> 
> 
> From:   Breno Leitao <brenohl@br.ibm.com>
> To:     William Pitcock <nenolod@dereferenced.org>
> Cc:     Natanael Copa <ncopa@alpinelinux.org>, 
> alpine-infra@lists.alpinelinux.org, Rafael Sene <rpsene@br.ibm.com>, Mike 
> Sullivan/Austin/IBM@IBMUS
> Date:   02/26/2018 06:38 AM
> Subject:        Re: ppc64le failed to reboot
> 
> 
> 
> Just an update on this topic
> 
> On 02/23/2018 12:26 PM, Breno Leitao wrote:
> > We should be able to upgrade the firmware next week, so we will need to stop
> > the machine and do the upgrade.
>
> I am checking the machine and it didn't hit the problem during this weekend,
> so the problem does not seem to be very frequent, and I want to wait a bit
> longer to see if the problem shows up again.
>
> Depending on whether the problem reappears, Guilherme and I will be upgrading the
> machine firmware next Thursday in the morning.
> 
> 
> 
>

Mike Sullivan <mksully@us.ibm.com>

Message ID: <OF61CB825E.7F939162-ON00258242.0053025E-86258242.00531F41@notes.na.collabserv.com>
In-Reply-To: <502867a9-9745-f3ca-b818-b54ca335a733@br.ibm.com>
Sender timestamp: 1519830475
DKIM signature: missing

Was the firmware updated? Is the machine up and running and ready for the 
Alpine ppc64le builders to be restarted?

Mike Sullivan
Linux Performance
Linux Technology Center - IBM Corporation
(512)286-9416, email: mksully@us.ibm.com

From:   Breno Leitao <brenohl@br.ibm.com>
To:     William Pitcock <nenolod@dereferenced.org>
Cc:     Natanael Copa <ncopa@alpinelinux.org>, 
alpine-infra@lists.alpinelinux.org, Rafael Sene <rpsene@br.ibm.com>, Mike 
Sullivan/Austin/IBM@IBMUS
Date:   02/26/2018 06:38 AM
Subject:        Re: ppc64le failed to reboot

Just an update on this topic

On 02/23/2018 12:26 PM, Breno Leitao wrote:
> We should be able to upgrade the firmware next week, so we will need to stop
> the machine and do the upgrade.

I am checking the machine and it didn't hit the problem during this weekend,
so the problem does not seem to be very frequent, and I want to wait a bit
longer to see if the problem shows up again.

Depending on whether the problem reappears, Guilherme and I will be upgrading the
machine firmware next Thursday in the morning.
