Mail archive
alpine-infra

Re: ppc64le failed to reboot

From: Breno Leitao <brenohl_at_br.ibm.com>
Date: Fri, 23 Feb 2018 10:25:43 -0300

On 02/23/2018 09:36 AM, Breno Leitao wrote:
> hi Natanael,
>
> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>> Hi,
>>
>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>> not very responsive. It took 1253 seconds to shut down (until ping
>> stopped responding), but it never came back up.
>
> Yes, that is weird. I am not even able to access the IPMI for this machine.
>
>> Can you help me have a look at what went wrong?
>
> As I cannot access the console, it is hard to understand what is going on.
> We will need to restart it and see if we can recover any logs. :-|
>
>> If you cannot do it today, then maybe on Monday?
>
> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
> to get it online as soon as possible.

The machine is back online (thanks, Rafael), and it seems that it hit a kernel issue.
I am wondering whether the underlying cause is a hardware problem or purely a kernel issue. We will need to investigate:


Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
Feb 23 09:13:51 alpinebox kern.warn kernel: LR = arch_local_irq_restore+0x5c/0x80
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
Feb 23 09:13:51 alpinebox kern.warn kernel: LR = schedule_timeout+0xa4/0x3b4
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
Feb 23 09:14:05 alpinebox kern.err kernel: 16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
Feb 23 09:14:05 alpinebox kern.err kernel: (t=2100 jiffies g=92991 c=92990 q=6903)
Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G W 4.14.17-0-vanilla #1-Alpine
Received on Fri Feb 23 2018 - 10:25:43 GMT