Re: Issue with ppc64le.alpinelinux.org

Natanael Copa <ncopa@alpinelinux.org>
Message ID: <20180222170437.12ed55fc@ncopa-desktop.copa.dup.pw>
Sender timestamp: 1519315477
Breno,

It looks like we have problems with the ppc64le.alpinelinux.org server
again.

I could upgrade it to 4.14.20, but I had a serious problem with
4.14.20, which locked up my work desktop[1][2], and I doubt that 4.14.20
has a fix for this issue.

It does look like it is related to OPAL, so I wonder if we should simply
stop the Jenkins VM for now. Jenkins CI is not yet in production. And
then we should probably reboot it again.

Alternatively we should ask some of the IBM (OPAL?) engineers to look
closer at this issue, since it affects the latest LTS kernel.

What do you think?

-nc

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=198861
[2]: https://www.spinics.net/lists/stable/msg217788.html


Watchdog CPU:0 Hard LOCKUP
Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay veth xt_CHECKSUM bridge stp llc ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS iptable_mangle nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_recent xt_conntrack nf_conntrack libcrc32c iptable_filter ip_tables x_tables tun kvm_hv kvm tg3 powernv_op_panel leds_powernv led_class powernv_rng rng_core xhci_pci xhci_hcd usbcore usb_common nls_base ipr libata
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.14.17-0-vanilla #1-Alpine
task: c000000000af4d80 task.stack: c000000000b30000
NIP:  c000000000706e78 LR: c0000000000760c0 CTR: 0000000030030740
REGS: c00000003ffffd80 TRAP: 0900   Tainted: G        W        (4.14.17-0-vanilla)
MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28000424  XER: 20000000
CFAR: c0000000000760c8 SOFTE: 0 
GPR00: c000000000112e88 c000000000b33480 c000000000b36f00 0000000000000001 
GPR04: c000000000a456b0 c000001fff5215b0 9000000000009033 0000000000000000 
GPR08: 0000000000000003 0000000000000000 0000000000000000 9000000000001003 
GPR12: c00000000006e224 c00000000fd00000 0000000000000000 c000000000b30000 
GPR16: c000000000a34c80 c000000000b62200 0000000000000000 c000000000b63b00 
GPR20: 000000000000000a 0000000000000001 c000000000b30000 0000000000200102 
GPR24: 000000010734f4d7 00044bec9affb432 c000000000b62200 0000000000000001 
GPR28: c000000000a45588 8000000000000e30 0000000000000010 c000001fde27ac00 
NIP [c000000000706e78] _restgpr0_31+0x8/0x10
LR [c0000000000760c0] opal_event_unmask+0x90/0x9c
Call Trace:
[c000000000b33480] [c000000000112e88] unmask_irq+0x50/0x6c (unreliable)
[c000000000b334b0] [c0000000001134e0] handle_level_irq+0x164/0x168
[c000000000b334e0] [c00000000010d980] generic_handle_irq+0x34/0x54
[c000000000b33500] [c000000000076264] opal_handle_events+0x90/0xa8
[c000000000b33550] [c000000000194dfc] irq_work_run_list+0x98/0xc0
[c000000000b335a0] [c000000000194e54] irq_work_run+0x30/0x50
[c000000000b335d0] [c00000000001d864] __timer_interrupt+0x50/0x1ec
[c000000000b33620] [c00000000001dd90] timer_interrupt+0xa8/0xc0
[c000000000b33650] [c00000000000bb3c] fast_exception_return+0x16c/0x190
--- interrupt: 901 at replay_interrupt_return+0x0/0x4
    LR = arch_local_irq_restore+0x5c/0x80
[c000000000b33940] [c0000000000e5cac] vtime_account_irq_enter+0x38/0x5c (unreliable)
[c000000000b33960] [c000000000706b20] __do_softirq+0xe0/0x388
[c000000000b33a60] [c0000000000baff0] irq_exit+0x88/0xe0
[c000000000b33a80] [c00000000001dd94] timer_interrupt+0xac/0xc0
[c000000000b33ab0] [c00000000000bb3c] fast_exception_return+0x16c/0x190
--- interrupt: 901 at replay_interrupt_return+0x0/0x4
    LR = arch_local_irq_restore+0x5c/0x80
[c000000000b33da0] [c000000000b33e20] init_thread_union+0x3e20/0x4000 (unreliable)
[c000000000b33dc0] [c0000000005fbadc] cpuidle_enter_state+0x1a0/0x2e4
[c000000000b33e20] [c0000000000facf4] call_cpuidle+0x6c/0x74
[c000000000b33e40] [c0000000000faf9c] do_idle+0x1f0/0x250
[c000000000b33ea0] [c0000000000fb188] cpu_startup_entry+0x30/0x34
[c000000000b33ed0] [c00000000000d29c] rest_init+0xd8/0xe4
[c000000000b33f00] [c000000000993c28] start_kernel+0x520/0x528
[c000000000b33f90] [c00000000000ad70] start_here_common+0x1c/0x4ac
Instruction dump:
eb41ffd0 eb61ffd8 eb81ffe0 e8010010 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8 
4e800020 ebc1fff0 e8010010 ebe1fff8 <7c0803a6> 4e800020 3c4c0043 38420080 
Watchdog CPU:0 became unstuck
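
As a rough aid for checking whether OPAL shows up in traces like the one
above, here is a minimal Python sketch that pulls the symbol names out of
the "Call Trace" frames and lists those starting with "opal_". The helper
names (trace_symbols, opal_frames) and the frame regex are assumptions made
for illustration, based only on the frame layout seen in this log; this is
not an existing tool.

```python
import re

# Assumed frame layout, taken from the log above:
# [stack pointer] [return address] symbol+0xoffset/0xsize
FRAME_RE = re.compile(
    r"^\[(?P<sp>[0-9a-f]{16})\] \[(?P<ip>[0-9a-f]{16})\] "
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def trace_symbols(log_text):
    """Return the symbol name of every call-trace frame, in log order."""
    symbols = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            symbols.append(match.group("sym"))
    return symbols


def opal_frames(log_text):
    """Return only the symbols that start with 'opal_' (hypothetical filter)."""
    return [sym for sym in trace_symbols(log_text) if sym.startswith("opal_")]


# A few frames copied from the trace above, used as sample input.
SAMPLE = """\
[c000000000b33480] [c000000000112e88] unmask_irq+0x50/0x6c (unreliable)
[c000000000b334b0] [c0000000001134e0] handle_level_irq+0x164/0x168
[c000000000b334e0] [c00000000010d980] generic_handle_irq+0x34/0x54
[c000000000b33500] [c000000000076264] opal_handle_events+0x90/0xa8
[c000000000b33550] [c000000000194dfc] irq_work_run_list+0x98/0xc0
"""

if __name__ == "__main__":
    print(opal_frames(SAMPLE))
```

Run against a saved console log, this would only show which frames carry
an OPAL-related symbol; it says nothing about the root cause of the lockup.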



On Thu, 8 Feb 2018 10:29:32 -0200
Breno Leitao <brenohl@br.ibm.com> wrote:

> On 02/07/2018 11:55 PM, Natanael Copa wrote:
> > On Fri, 2 Feb 2018 10:59:02 -0200
> > Breno Leitao <brenohl@br.ibm.com> wrote:
> >   
> >> Hi Natanael,
> >>
> >> On 02/02/2018 10:50 AM, Natanael Copa wrote:  
> >>> I am slightly worried about what caused this. I guess we should upgrade
> >>> the kernel to 4.14. If this happens again we may want to try to debug it.
> >>
> >> Yes, we are still using the old 4.4 kernel, and we can install a newer kernel
> >> just to try it if it crashes again.  
> > 
> > Hi!
> > 
> > I have upgraded kernel and updated grub config. Should be ready for a
> > reboot tomorrow morning. Can you help me with that?  
> 
> Done. The machine is now with the 4.14 kernel:
> 
> alpinebox:~# uname -a
> Linux alpinebox 4.14.17-0-vanilla #1-Alpine SMP Mon Feb 5 22:42:16 UTC 2018
> ppc64le Linux
> 
> I also added a safe 4.4 kernel into the grub, just to guarantee that if your
> kernel didn't boot, we have a backup.
> 
> Let's hope we do not see the RCU issue anymore.
>