~alpine/infra

Re: ppc64le failed to reboot

Breno Leitao <brenohl@br.ibm.com>
Message ID
<3063e09c-30b6-14f3-7cb3-a44e4846a5ae@br.ibm.com>
Sender timestamp
1519407091
Hi Rafael,

I definitely can. Since this machine has the project private keys, I want to
be present with Guilherme during this firmware upgrade.

What is the best time so we can do it together?

On 02/23/2018 02:30 PM, Rafael Peria de Sene wrote:
> 
> We can schedule the firmware update for later next week. Guilherme (CC), the
> person who will work on it, will need access to the OS to execute it. Do we
> have a focal point who could help Guilherme during the process?
> 
> *Rafael Sene*
> *Staff Software Engineer*
> *SDK | Unicamp OpenPower Lab | Power Cloud Development*
> *IBM Systems*
>
> -----------------------------------------------------------------------------
> *Phone: +55 19 2132 4844 | +55 19 98153 9778*
> *E-mail: rpsene@br.ibm.com*
> *Rod. Jorn. Francisco Aguirre Proença - Chácaras Assay, Hortolândia - São
> Paulo | Brazil Zip Code: 13186-900*
>
> -----Breno Henrique Leitao/Brazil/IBM@IBMBR wrote: -----
> To: William Pitcock <nenolod@dereferenced.org>
> From: Breno Henrique Leitao/Brazil/IBM@IBMBR
> Date: 02/23/2018 12:27
> Cc: Natanael Copa <ncopa@alpinelinux.org>,
> alpine-infra@lists.alpinelinux.org, Rafael Peria de
> Sene/Brazil/IBM@IBMBR, Mike Sullivan <mksully@us.ibm.com>
> Subject: Re: ppc64le failed to reboot
> 
> 
> On 02/23/2018 11:32 AM, William Pitcock wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl@br.ibm.com> wrote:
>>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>>>> hi Natanael,
>>>>
>>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>>>> Hi,
>>>>>
>>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>>>> not very responsive. It took 1253 seconds to shut down (until ping
>>>>> stopped responding), but it never came back up.
>>>>
>>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>>>
>>>>> Can you help me look into what went wrong?
>>>>
>>>> As I cannot access the console, it is hard to understand what is going on;
>>>> we will need to restart it and see if we can recover any logs. :-|
>>>>
>>>>> If you cannot do it today, then maybe on Monday?
>>>>
>>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>>>> to get it online as soon as possible.
>>>
>>> The machine is back online (Thanks Rafael) and it seems that it hit a
>>> kernel issue. I am wondering if this is a physical problem or a kernel
>>> issue. We will need to investigate:
>>>
>>>
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
>>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
>>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
>>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
>>
>> We have been observing this OPAL-related problem on that machine for a
>> few weeks.  I suspect there is a problem with either the hardware or
>> the system firmware.
> 
> Right. That is what we were talking about earlier today. We probably want to
> migrate to the latest firmware and see if the problem still occurs.
> 
> We should be able to upgrade the firmware next week, so we will need to stop
> the machine and do the upgrade.
> 
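
For anyone triaging the quoted trace, here is a minimal sketch of how the call chain could be pulled out of the syslog lines once the mail client's line wrapping is undone. The regular expression, the `parse_frames` helper, and the `SAMPLE` list are illustrative assumptions, not part of any kernel tooling; the pattern only assumes the "[stack address] [return address] symbol+0xoffset/0xsize" frame layout visible in the log.

```python
import re

# One backtrace frame as it appears in the quoted log:
# "[<stack addr>] [<return addr>] symbol+0xoffset/0xsize"
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\] \[(?P<pc>[0-9a-f]{16})\] "
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(lines):
    """Return (symbol, offset, size) for every line that holds a frame.

    Lines without a frame (interrupt markers, LR lines, instruction
    dumps) are skipped.
    """
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (
                    match.group("symbol"),
                    int(match.group("offset"), 16),
                    int(match.group("size"), 16),
                )
            )
    return frames


# A few unwrapped lines copied from the log in this thread.
SAMPLE = [
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] "
    "[c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)",
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] "
    "[c000000000112e88] unmask_irq+0x50/0x6c",
    "Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at "
    "replay_interrupt_return+0x0/0x4",
    "Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] "
    "[c000000000071fd8] kopald+0x94/0xb4",
]

if __name__ == "__main__":
    for symbol, offset, size in parse_frames(SAMPLE):
        print(f"{symbol} (offset {offset:#x} of {size:#x})")
```

Listing only the symbols makes it easier to see that the stall sits in the `kopald` thread and ends in `opal_event_unmask`, which fits the suspicion of an OPAL or firmware problem raised in the thread.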

Re: ppc64le failed to reboot

Guilherme Tiaki Sato <guilhermetiaki@me.com>
Message ID
<CAOM8kGb-mCd7cZCjLXOecSwKf=_XG=jDb3qvEsA9j=OhAKwbrw@mail.gmail.com>
In-Reply-To
<3063e09c-30b6-14f3-7cb3-a44e4846a5ae@br.ibm.com>
Sender timestamp
1519408589
Hi, Breno

Can we work on this on Monday at 4 p.m.?

--
Guilherme
