Mail archive
alpine-infra

Re: ppc64le failed to reboot

From: Guilherme Tiaki Sato <guilhermetiaki_at_me.com>
Date: Fri, 23 Feb 2018 14:56:29 -0300

Hi, Breno

Can we work on this on Monday, at 4 p.m.?

--
Guilherme
On Fri, Feb 23, 2018 at 2:31 PM, Breno Leitao <brenohl_at_br.ibm.com> wrote:
> Hi Rafael,
>
> I definitely can. Since this machine has the project private keys, I want to
> be together with him during this firmware upgrade.
>
> What is the best time so we can do it together?
>
> On 02/23/2018 02:30 PM, Rafael Peria de Sene wrote:
> >
> > We can schedule the firmware update for later next week. Guilherme (CC), the
> > person who will work on it will need access to the OS to execute it. Do we
> > have a focal point that could help Guilherme during the process?
> >
> > *Rafael Sene*
> > *Staff Software Engineer*
> > *SDK | Unicamp OpenPower Lab | Power Cloud Development*
> > *IBM Systems*
> >
> > -----------------------------------------------------------------------------
> > *Phone: +55 19 2132 4844 | +55 19 98153 9778*
> > *E-mail: rpsene_at_br.ibm.com*
> > *Rod. Jorn. Francisco Aguirre Proença - Chácaras Assay, Hortolândia - São
> > Paulo | Brazil Zip Code: 13186-900*
> >
> > -----Breno Henrique Leitao/Brazil/IBM_at_IBMBR wrote: -----
> > To: William Pitcock <nenolod_at_dereferenced.org>
> > From: Breno Henrique Leitao/Brazil/IBM_at_IBMBR
> > Date: 02/23/2018 12:27
> > Cc: Natanael Copa <ncopa_at_alpinelinux.org>,
> > alpine-infra_at_lists.alpinelinux.org, Rafael Peria de
> > Sene/Brazil/IBM_at_IBMBR, Mike Sullivan <mksully_at_us.ibm.com>
> > Subject: Re: ppc64le failed to reboot
> >
> >
> > On 02/23/2018 11:32 AM, William Pitcock wrote:
> >> Hi,
> >>
> >> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl_at_br.ibm.com> wrote:
> >>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
> >>>> hi Natanael,
> >>>>
> >>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
> >>>>> not very responsive. It took 1253 seconds to shut it down (until ping
> >>>>> stopped responding), but it never came back again.
> >>>>
> >>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
> >>>>
> >>>>> Can you help me have a look at what went wrong?
> >>>>
> >>>> As I cannot access the console, it is hard to understand what is going on;
> >>>> we will need to restart it and see if we can recover any log. :-|
> >>>>
> >>>>> If you cannot do it today, then maybe on Monday?
> >>>>
> >>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
> >>>> to get it online as soon as possible.
> >>>
> >>> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
> >>> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
> >>>
> >>>
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
> >>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
> >>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
> >>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
> >>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
> >>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
> >>
> >> We have been observing this OPAL-related problem on that machine for a
> >> few weeks.  I suspect there is a problem with either the hardware or
> >> the system firmware.
> >
> > Right. That is what we were talking about earlier today. We probably want to
> > migrate to the latest firmware and see if the problem still occurs.
> >
> > We should be able to upgrade the firmware next week, so we will need to stop
> > the machine and do the upgrade.
> >
>
>
Received on Fri Feb 23 2018 - 14:56:29 GMT