Mail archive
alpine-infra

Re: ppc64le failed to reboot

From: William Pitcock <nenolod_at_dereferenced.org>
Date: Fri, 23 Feb 2018 08:32:47 -0600

Hi,

On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl_at_br.ibm.com> wrote:
> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>> hi Natanael,
>>
>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>> Hi,
>>>
>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>> not very responsive. It took 1253 seconds to shut down (until ping
>>> stopped responding), but it never came back up.
>>
>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>
>>> Can you help me look into what went wrong?
>>
>> As I cannot access the console, it is hard to understand what is going on.
>> We will need to restart it and see if we can recover any logs. :-|
>>
>>> If you cannot do it today, then maybe on Monday?
>>
>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>> to get it online as soon as possible.
>
> The machine is back online (thanks, Rafael), and it seems to have hit a kernel issue.
> I am not sure whether the root cause is a hardware fault or a kernel bug. We will need to investigate:
>
>
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> Feb 23 09:13:51 alpinebox kern.warn kernel: LR = arch_local_irq_restore+0x5c/0x80
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
> Feb 23 09:13:51 alpinebox kern.warn kernel: LR = schedule_timeout+0xa4/0x3b4
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
> Feb 23 09:14:05 alpinebox kern.err kernel: 16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
> Feb 23 09:14:05 alpinebox kern.err kernel: (t=2100 jiffies g=92991 c=92990 q=6903)
> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G W 4.14.17-0-vanilla #1-Alpine

We have been observing this OPAL-related problem on that machine for a
few weeks. I suspect there is a problem with either the hardware or
the system firmware.
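
To check how often this recurs, something like the following could be run against the saved syslog. It is only a rough sketch: it assumes the log lines keep the format quoted above ("Mon DD HH:MM:SS host kern.level kernel: message"), and the function name `summarize` and the matching rules are illustrative, not an existing tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted log:
# "Feb 23 09:14:05 alpinebox kern.err kernel: <message>"
LINE_RE = re.compile(
    r"^(?P<date>[A-Z][a-z]{2}\s+\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kern\.(?P<level>\w+) kernel: (?P<msg>.*)$"
)

def summarize(lines):
    """Count RCU stall reports and opal_* stack frames per day.

    Returns two Counters keyed by the date string (e.g. "Feb 23").
    Lines that do not match the assumed format are skipped.
    """
    stalls = Counter()
    opal_frames = Counter()
    for raw in lines:
        # Drop mail quote markers ("> ") and the trailing newline, if any.
        line = raw.lstrip("> ").rstrip("\n")
        m = LINE_RE.match(line)
        if m is None:
            continue
        msg = m.group("msg")
        if "rcu_sched self-detected stall" in msg:
            stalls[m.group("date")] += 1
        # Match symbols such as opal_event_unmask, but not the kopald thread name.
        if re.search(r"\bopal_\w+", msg):
            opal_frames[m.group("date")] += 1
    return stalls, opal_frames

# Three lines copied from the log quoted above.
sample = [
    "> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] "
    "[c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)",
    "> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] "
    "[c000000000076264] opal_handle_events+0x90/0xa8",
    "> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched "
    "self-detected stall on CPU",
]
stalls, opal_frames = summarize(sample)
```

A per-day count that stays non-zero across kernel upgrades would point more toward hardware or firmware than toward a specific kernel version, though it would not prove it.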

William
Received on Fri Feb 23 2018 - 08:32:47 GMT