X-Original-To: alpine-infra@lists.alpinelinux.org
Received: from ms11p00im-qufv17110601.me.com (ms11p00im-qufv17110601.me.com [17.58.37.38]) by lists.alpinelinux.org (Postfix) with ESMTP id 928B05C5CDC for ; Fri, 23 Feb 2018 17:57:14 +0000 (GMT)
Received: from process-dkim-sign-daemon.ms11p00im-qufv17110601.me.com by ms11p00im-qufv17110601.me.com (Oracle Communications Messaging Server 8.0.1.2.20170607 64bit (built Jun 7 2017)) id <0P4M00O007056B00@ms11p00im-qufv17110601.me.com> for alpine-infra@lists.alpinelinux.org; Fri, 23 Feb 2018 17:57:14 +0000 (GMT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=me.com; s=04042017; t=1519408634; bh=Ma5lO1Waobt68xDJy7ZyaP5Ly3BcGqgsdNAWNpFtF1Y=; h=MIME-version:From:Date:Message-id:Subject:To:Content-type; b=MbmHzXj1sesVLlqsUyoRczOs3iuN+LPpVnBjmMxHyBHdM05AQixq7ME9lwB77CbMl xK4AgAec8BotJyYDmk7s3tyS9861Cs9ggqHhpRZzKa0rzXvtQW+KN5+T1qatF1x7Cm s8keNFkWff1k2QsF7w0SyG06AqnHlAtT5fx7vWO4/aPaBWBv3vnKTYPSVpdzOLmsti E6aHhxoLSSuciBtvSkjSGrhjnc8+UI8XHUcl3o5ugvE3aWuHBvWMAB/4I9+L1DUFn1 R2t17v70qxnVP9c/+jT7B99z7whwcgg5I3Ds02H6x7RiSVJ7T+dKylQS8awsnUxm81 KYd7TCAuA6IGw==
Received: from icloud.com ([127.0.0.1]) by ms11p00im-qufv17110601.me.com (Oracle Communications Messaging Server 8.0.1.2.20170607 64bit (built Jun 7 2017)) with ESMTPSA id <0P4M00B3J77AIJ30@ms11p00im-qufv17110601.me.com> for alpine-infra@lists.alpinelinux.org; Fri, 23 Feb 2018 17:57:13 +0000 (GMT)
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2018-02-23_06:,, signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 clxscore=1011 suspectscore=61 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1707230000 definitions=main-1802230220
Received: by mail-ot0-f177.google.com with SMTP id k7so7380041otj.0 for ; Fri, 23 Feb 2018 09:57:10 -0800 (PST)
X-Gm-Message-State: APf1xPCZxJXtuQcGdcTNklbBwqZaSmG3veN9+R92zbLO7OEJA5BmUgCU ySiMA208EKP8DX7PKCn+NevQJbbKOCms+BOgFZs=
X-Google-Smtp-Source: AG47ELto0aGDGQPDx0e1zZspHMz5pfp+nYGFk2I4VK/2TP7ZE0ernKLmcRW+8sfA5ziWJvoBs79y777XTi7n0gjM90U=
X-Received: by 10.157.35.61 with SMTP id j58mr1753545otb.355.1519408630234; Fri, 23 Feb 2018 09:57:10 -0800 (PST)
MIME-version: 1.0
Received: by 10.74.98.12 with HTTP; Fri, 23 Feb 2018 09:56:29 -0800 (PST)
In-reply-to: <3063e09c-30b6-14f3-7cb3-a44e4846a5ae@br.ibm.com>
References: <8d3736f3-1db6-1bc5-bbfd-2ac620525236@br.ibm.com> <20180223122913.06140a73@ncopa-desktop.copa.dup.pw> <8036c91d-4b8f-6071-d923-8b1ca2dd11c6@br.ibm.com> <3063e09c-30b6-14f3-7cb3-a44e4846a5ae@br.ibm.com>
From: Guilherme Tiaki Sato
Date: Fri, 23 Feb 2018 14:56:29 -0300
X-Gmail-Original-Message-ID:
Message-id:
Subject: Re: ppc64le failed to reboot
To: Breno Leitao
Cc: Rafael Peria de Sene , William Pitcock , Natanael Copa , alpine-infra@lists.alpinelinux.org, Mike Sullivan
Content-type: multipart/alternative; boundary=001a113dce30ac85d10565e4e370

--001a113dce30ac85d10565e4e370
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi, Breno

Can we work on this on Monday, at 4 p.m?

--
Guilherme

On Fri, Feb 23, 2018 at 2:31 PM, Breno Leitao <brenohl@br.ibm.com> wrote:

> Hi Rafael,
>
> I definitely can. Since this machine has the project private keys, I want to
> be together with him during this firmware upgrade.
>
> What is the best time so we can do it together?
>
> On 02/23/2018 02:30 PM, Rafael Peria de Sene wrote:
> >
> > We can schedule the firmware update for later next week. Guilherme (CC), the
> > person who will work on it will need access to the OS to execute it. Do we
> > have a focal point that could help Guilherme during the process?
> >
> > *Rafael Sene*
> > *Staff Software Engineer
> > *
> > *SDK | Unicamp OpenPower Lab | Power Cloud Development*
> > *IBM Systems*
> >
> > -----------------------------------------------------------------------------
> > *Phone: +55 19 2132 4844 | +55 19 98153 9778*
> > *E-mail: _rpsene@br.ibm.com_*
> > *Rod. Jorn. Francisco Aguirre Proença - Chácaras Assay, Hortolândia - São
> > Paulo | Brazil Zip Code: 13186-900*
> >
> > -----Breno Henrique Leitao/Brazil/IBM@IBMBR wrote: -----
> > To: William Pitcock <nenolod@dereferenced.org>
> > From: Breno Henrique Leitao/Brazil/IBM@IBMBR
> > Date: 02/23/2018 12:27
> > Cc: Natanael Copa <ncopa@alpinelinux.org>,
> > alpine-infra@lists.alpinelinux.org, Rafael Peria de
> > Sene/Brazil/IBM@IBMBR, Mike Sullivan <mksully@us.ibm.com>
> > Subject: Re: ppc64le failed to reboot
> >
> > On 02/23/2018 11:32 AM, William Pitcock wrote:
> >> Hi,
> >>
> >> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl@br.ibm.com> wrote:
> >>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
> >>>> hi Natanael,
> >>>>
> >>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
> >>>>> not very responsive. It took 1253 seconds to shut it down (til ping
> >>>>> stopped respond), but it never came back again.
> >>>>
> >>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
> >>>>
> >>>>> Can you help me have a look what went wrong?
> >>>>
> >>>> As I cannot access the console, it is hard to understand what is going one,
> >>>> we will need to restart it and see if we can recover any log. :-|
> >>>>
> >>>>> If you cannot do it today, then maybe on monday?
> >>>>
> >>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
> >>>> to get it online as soon as possible.
> >>>
> >>> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
> >>> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
> >>>
> >>>
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
> >>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
> >>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
> >>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
> >>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
> >>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
> >>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
> >>
> >> We have been observing this OPAL-related problem on that machine for a
> >> few weeks. I suspect there is a problem with either the hardware or
> >> the system firmware.
> >
> > Right. That is what we were talking earlier today. We probably want to
> > migrate to the latest firmware and see if the problem still continue.
> >
> > We should be able to upgrade the firmware next week, so, we will need to stop
> > the machine and do the upgrade.
> >

--001a113dce30ac85d10565e4e370
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi, Breno

Can we work on this on Monday, at 4 p.m?

--
Guilherme

On Fri, Feb 23, 2018 at 2:31 PM, Breno Leitao <brenohl@br.ibm.com> wrote:
Hi Rafael,

I definitely can. Since this machine has the project private keys, I want to
be together with him during this firmware upgrade.

What is the best time so we can do it together?

On 02/23/2018 02:30 PM, Rafael Peria de Sene wrote:
>
> We can schedule the firmware update for later next week. Guilherme (CC), the
> person who will work on it will need access to the OS to execute it. Do we
> have a focal point that could help Guilherme during the process?
>
> *Rafael Sene*
> *Staff Software Engineer
> *
> *SDK | Unicamp OpenPower Lab | Power Cloud Development*
> *IBM Systems*
>
>
> -----------------------------------------------------------------------------
> *Phone: +55 19 2132 4844 | +55 19 98153 9778*
> *E-mail: _rpsene@br.ibm.com_*
> *Rod. Jorn. Francisco Aguirre Proença - Chácaras Assay, Hortolândia - São
> Paulo | Brazil Zip Code: 13186-900*
>
>
>
>
>
>
> -----Breno Henrique Leitao/Brazil/IBM@IBMBR wrote: -----
> To: William Pitcock <nenolod@dereferenced.org>
> From: Breno Henrique Leitao/Brazil/IBM@IBMBR
> Date: 02/23/2018 12:27
> Cc: Natanael Copa <ncopa@alpinelinux.org>,
> alpine-infra@lists.alpinelinux.org, Rafael Peria de
> Sene/Brazil/IBM@IBMBR, Mike Sullivan <mksully@us.ibm.com>
> Subject: Re: ppc64le failed to reboot
>
>
> On 02/23/2018 11:32 AM, William Pitcock wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2018 at 7:25 AM, Breno Leitao <brenohl@br.ibm.com> wrote:
>>> On 02/23/2018 09:36 AM, Breno Leitao wrote:
>>>> hi Natanael,
>>>>
>>>> On 02/23/2018 08:29 AM, Natanael Copa wrote:
>>>>> Hi,
>>>>>
>>>>> I tried to reboot the ppc64le.alpinelinux.org machine because it was
>>>>> not very responsive. It took 1253 seconds to shut it down (til ping
>>>>> stopped respond), but it never came back again.
>>>>
>>>> Yes, that is weird. I am not even able to access the IPMI for this machine.
>>>>
>>>>> Can you help me have a look what went wrong?
>>>>
>>>> As I cannot access the console, it is hard to understand what is going one,
>>>> we will need to restart it and see if we can recover any log. :-|
>>>>
>>>>> If you cannot do it today, then maybe on monday?
>>>>
>>>> Fortunately Rafael is in the lab, and he is restarting the machine. Let's try
>>>> to get it online as soon as possible.
>>>
>>> The machine is back online (Thanks Rafael) and it seems that it hit a kernel issue.
>>> I am wondering if this is a physical problem or a kernel issue. We will need to investigate:
>>>
>>>
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f330] [c0000000000760c0] opal_event_unmask+0x90/0x9c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f370] [c000000000112e88] unmask_irq+0x50/0x6c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3a0] [c0000000001134e0] handle_level_irq+0x164/0x168
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3d0] [c00000000010d980] generic_handle_irq+0x34/0x54
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f3f0] [c000000000076264] opal_handle_events+0x90/0xa8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f440] [c000000000194dfc] irq_work_run_list+0x98/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f490] [c000000000194e54] irq_work_run+0x30/0x50
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f4c0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f510] [c00000000001dd90] timer_interrupt+0xa8/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f540] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = arch_local_irq_restore+0x5c/0x80
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f830] [c0000000000e5cc8] vtime_account_irq_enter+0x54/0x5c (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f850] [c000000000706b20] __do_softirq+0xe0/0x388
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f950] [c0000000000baff0] irq_exit+0x88/0xe0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f970] [c00000000001dd94] timer_interrupt+0xac/0xc0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5f9a0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: --- interrupt: 901 at init_timer_key+0x8/0xa0
>>> Feb 23 09:13:51 alpinebox kern.warn kernel:     LR = schedule_timeout+0xa4/0x3b4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fc90] [c00000000070544c] schedule_timeout+0x164/0x3b4 (unreliable)
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fd80] [c000000000071fd8] kopald+0x94/0xb4
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fdc0] [c0000000000d5cc8] kthread+0x164/0x16c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: [c000001fe4d5fe30] [c00000000000b6e0] ret_from_kernel_thread+0x5c/0x7c
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: Instruction dump:
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8
>>> Feb 23 09:13:51 alpinebox kern.warn kernel: 4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080
>>> Feb 23 09:14:05 alpinebox kern.err kernel: INFO: rcu_sched self-detected stall on CPU
>>> Feb 23 09:14:05 alpinebox kern.err kernel:      16-...: (1 GPs behind) idle=9d6/140000000000002/0 softirq=2956113/2956121 fqs=1049
>>> Feb 23 09:14:05 alpinebox kern.err kernel:       (t=2100 jiffies g=92991 c=92990 q=6903)
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: NMI backtrace for cpu 16
>>> Feb 23 09:14:05 alpinebox kern.warn kernel: CPU: 16 PID: 1205 Comm: kopald Tainted: G        W       4.14.17-0-vanilla #1-Alpine
>>
>> We have been observing this OPAL-related problem on that machine for a
>> few weeks. I suspect there is a problem with either the hardware or
>> the system firmware.
>
> Right. That is what we were talking earlier today. We probably want to
> migrate to the latest firmware and see if the problem still continue.
>
> We should be able to upgrade the firmware next week, so, we will need to stop
> the machine and do the upgrade.
>


--001a113dce30ac85d10565e4e370--