alpine-infra

Re: ppc64le failed to reboot

From: Breno Leitao <brenohl_at_br.ibm.com>
Date: Fri, 23 Feb 2018 12:26:59 -0300

On 02/23/2018 11:32 AM, William Pitcock wrote:
> Hi,
>
> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl_at_br.ibm.com> wrote:
>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>>> hi Natanael,
>>>
>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>>> Hi,
>>>>
>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>>> not very responsive. It took 1253 seconds to shut it down (until ping
>>>> stopped responding), but it never came back again.
>>>
>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>>
>>>> Can you help me have a look at what went wrong?
>>>
>>> As I cannot access the console, it is hard to understand what is going on.
>>> We will need to restart it and see if we can recover any logs. :-|
>>>
>>>> If you cannot do it today, then maybe on monday?
>>>
>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>>> to get it online as soon as possible.
>>
>> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
>> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
>>
>>
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>> Feb 23 09:13:51 alpinebox kern.warn kernel: LR = arch_local_irq_restore+0x5c/0x80
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
>> Feb 23 09:13:51 alpinebox kern.warn kernel: LR = schedule_timeout+0xa4/0x3b4
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
>> Feb 23 09:14:05 alpinebox kern.err kernel: 16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
>> Feb 23 09:14:05 alpinebox kern.err kernel: (t=2100 jiffies g=92991 c=92990 q=6903)
>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G W 4.14.17-0-vanilla #1-Alpine
>
> We have been observing this OPAL-related problem on that machine for a
> few weeks. I suspect there is a problem with either the hardware or
> the system firmware.

Right. That is what we were talking about earlier today. We probably want to
migrate to the latest firmware and see if the problem persists.

We should be able to upgrade the firmware next week, so we will need to stop
the machine and do the upgrade.
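
For anyone triaging a log like the one quoted above, here is a minimal sketch of how the stack frames and RCU stall reports could be pulled out of syslog-formatted kernel messages, to check whether OPAL-related symbols keep showing up before and after the firmware upgrade. The names `summarize`, `FRAME_RE`, `STALL_RE` and `SAMPLE` are made up for this illustration, and the regular expressions assume the line layout seen in the excerpt; other kernels or syslog daemons may format these lines differently.

```python
import re

# Stack-frame lines in the excerpt look like:
#   [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
# Assumption: two bracketed hex addresses followed by symbol+offset/size.
FRAME_RE = re.compile(
    r"\[[0-9a-f]+\] \[[0-9a-f]+\] ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# RCU stall reports in the excerpt contain this phrase.
STALL_RE = re.compile(r"rcu_sched self-detected stall on CPU")


def summarize(lines):
    """Collect symbol names from stack frames and count RCU stall reports."""
    frames = []
    stalls = 0
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.group(1))
        elif STALL_RE.search(line):
            stalls += 1
    # Treat opal_* functions and the kopald thread as OPAL-related
    # (a heuristic for this sketch, not an official classification).
    opal_frames = [f for f in frames if f.startswith("opal_") or f == "kopald"]
    return {"frames": frames, "opal_frames": opal_frames, "stalls": stalls}


# A few lines copied from the log quoted in this thread.
SAMPLE = [
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] "
    "[c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)",
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] "
    "[c000000000112e88] unmask_irq+0x50/0x6c",
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] "
    "[c000000000071fd8] kopald+0x94/0xb4",
    "Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched "
    "self-detected stall on CPU",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running the same function over the full log before and after the upgrade would give a rough way to compare how often OPAL frames and stalls appear.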
Received on Fri Feb 23 2018 - 12:26:59 GMT