Re: ppc64le failed to reboot

From: Breno Leitao <brenohl_at_br.ibm.com>
Date: Fri, 23 Feb 2018 14:31:31 -0300

Hi Rafael,

I definitely can. Since this machine holds the project's private keys, I want
to be with Guilherme during the firmware upgrade.

What is the best time so we can do it together?

On 02/23/2018 02:30 PM, Rafael Peria de Sene wrote:
>
> We can schedule the firmware update for later next week. Guilherme (CC), the
> person who will work on it, will need access to the OS to execute it. Do we
> have a focal point who could help Guilherme during the process?
>
> Rafael Sene
> Staff Software Engineer
> SDK | Unicamp OpenPower Lab | Power Cloud Development
> IBM Systems
>
> -----------------------------------------------------------------------------
> Phone: +55 19 2132 4844 | +55 19 98153 9778
> E-mail: rpsene_at_br.ibm.com
> Rod. Jorn. Francisco Aguirre Proença - Chácaras Assay, Hortolândia - São
> Paulo | Brazil Zip Code: 13186-900
>
> -----Breno Henrique Leitao/Brazil/IBM_at_IBMBR wrote: -----
> To: William Pitcock <nenolod_at_dereferenced.org>
> From: Breno Henrique Leitao/Brazil/IBM_at_IBMBR
> Date: 02/23/2018 12:27
> Cc: Natanael Copa <ncopa_at_alpinelinux.org>,
> alpine-infra_at_lists.alpinelinux.org, Rafael Peria de
> Sene/Brazil/IBM_at_IBMBR, Mike Sullivan <mksully_at_us.ibm.com>
> Subject: Re: ppc64le failed to reboot
>
>
> On 02/23/2018 11:32 AM, William Pitcock wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl_at_br.ibm.com> wrote:
>>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>>>> hi Natanael,
>>>>
>>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>>>> Hi,
>>>>>
>>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>>>> not very responsive. It took 1253 seconds to shut down (until ping
>>>>> stopped responding), but it never came back up.
>>>>
>>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>>>
>>>>> Can you help me have a look what went wrong?
>>>>
>>>> As I cannot access the console, it is hard to understand what is going on.
>>>> We will need to restart it and see if we can recover any logs. :-|
>>>>
>>>>> If you cannot do it today, then maybe on monday?
>>>>
>>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>>>> to get it online as soon as possible.
>>>
>>> The machine is back online (thanks, Rafael), and it seems that it hit a
>>> kernel issue. I am wondering if this is a physical problem or a kernel
>>> issue. We will need to investigate:
>>>
>>>
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
>>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
>>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
>>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
>>
>> We have been observing this OPAL-related problem on that machine for a
>> few weeks.  I suspect there is a problem with either the hardware or
>> the system firmware.
>
> Right. That is what we were talking about earlier today. We probably want to
> migrate to the latest firmware and see if the problem persists.
>
> We should be able to upgrade the firmware next week, so we will need to stop
> the machine for the upgrade.
>
Received on Fri Feb 23 2018 - 14:31:31 GMT