Date: Wed, 01 Jul 2020 23:10:30 -0400
From: "Alex Xu (Hello71)"
Subject: Distro optimization flags
To: ~alpine/devel@lists.alpinelinux.org
Message-Id: <1593625212.dirkptm3b0.none@localhost>

Recently there was some discussion on #alpine-devel about optimization
flags. I think it's worth looking at this issue more closely.

=== Rationale ===

-Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
Debian takes only 22 seconds. In other words, O2 is about a 15% speedup,
and LTO is another 30-50% on top of that.

https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
says that the kernel ran 'hackbench 15' 10% faster using -O2.

http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
appears to say that rv8 ran about 25% faster using -O2 compared to -Os.
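
For anyone who wants to check the percentages quoted in the Rationale,
here is a minimal sketch that recomputes them from the build times given
above. It treats "speedup" as the ratio of the slower time to the faster
time minus one, which is an assumption about how the figures were
derived; the dictionary layout and the function name speedup_percent are
purely illustrative. The Debian-vs-O2 gap is what the message attributes
to LTO.

```python
# Wall-clock seconds for the Linux 5.7.7 allnoconfig build, copied from
# the Rationale section. Nothing here is newly measured.
timings = {
    "Intel laptop": {"edge": 45, "O2": 39, "Debian sid": 30},
    "Ryzen desktop": {"edge": 38, "O2": 33, "Debian sid": 22},
}


def speedup_percent(slow, fast):
    """Gain of `fast` over `slow` in percent: (slow / fast - 1) * 100."""
    return (slow / fast - 1.0) * 100.0


results = {}
for machine, t in timings.items():
    results[machine] = {
        # Alpine edge gcc (-Os) versus the rebuilt -O2 gcc.
        "O2 vs edge": speedup_percent(t["edge"], t["O2"]),
        # Rebuilt -O2 gcc versus Debian sid's gcc.
        "Debian vs O2": speedup_percent(t["O2"], t["Debian sid"]),
    }

for machine, r in results.items():
    for label, pct in r.items():
        print(f"{machine}: {label}: {pct:.1f}%")
```

Under that definition the -O2 gain comes out at a little over 15% on both
machines, and the further gap to Debian at 30% (Intel) and 50% (Ryzen),
which matches the ranges stated above.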

=== Drawbacks ===

Obviously, the main issue with this change is increased code size.
However, this issue is likely less severe than presented at [1],
because:

1. libtracker and some other packages had incorrect APKBUILDs that
   didn't strip libs. I think -O2 causes slightly larger debug tables to
   be generated. I have submitted merge requests to fix the packages I
   have found, and we may fix abuild to not require special ordering of
   subpackages in these cases.
2. It is possible to use a more limited -O2, which does not cause as
   much code ballooning. I got this idea from [2], which is a bad idea
   to do in a specific package but seems reasonable system-wide. The
   -O2 sub-options disabled this way (see {3} below) give a small
   improvement on old Intel processors, but actually slow down AMD
   processors, and significantly increase code size.
3. LTO is roughly as powerful at reducing code size as O2 is at
   increasing it.

I checked size of attica (example from [1]) with these configurations.
Column 1 is package size, column 2 is installed size as reported by apk,
and column 3+ is the CFLAGS/CXXFLAGS.

{1} 165461 585728 -Os
{2} 225285 823296 -O2
{3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple

gcc size is harder to measure here, as I built gcc without most
languages.
The size of usr/libexec/gcc increased from 43076k excluding
cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
may not be accurate, as for some reason my attica -Os is a different
size from the edge attica.

=== Analysis ===

Unfortunately, it doesn't seem safe to set
-fno-asynchronous-unwind-tables globally. I provide it here only as a
reference (and because I did the benchmark before looking up exactly
what the flag does).

LTO is a can of worms that I think is definitely worth opening at some
point, but should wait at least until both musl 1.2 and gcc 10 are done,
which I gather will take some time. Additionally, it is somewhat
orthogonal to -Ox. So, the question now is whether a 10-25% increase in
performance justifies a 15-30% increase in code size.

There is also a third option: we can use -O2 in some common CPU-heavy
programs and libraries, such as gcc and openssl. Alpine already uses
default optimization for musl, which I think works out to -Os for most
components and -O3 for performance-sensitive areas. It would be great if
all packages could do this, but it also sounds like way too much work to
patch every single package (and probably PGO is the right answer there
anyways).

There are also probably other compile flags that we should be looking
at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
-Wl,-O, etc). However, I didn't investigate those at this time.

=== Other distros ===

Although I didn't do much research, I think other distros did not
carefully select their optimization flags (as opposed to security
flags). Most mainstream distros seem to basically use whatever gcc gives
them for -O2. Clear Linux seems to set everything to MAXIMUM
OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
decision to the user.
OpenWRT uses -Os, which can be overridden
per-target, although I couldn't find any targets overriding the
optimization flags.

=== Limitations ===

These benchmarks are obviously very limited. However, I don't want to go
down the path of extensive benchmarks just to find people coming out of
the woodwork and complaining that a 20% increase in code size (i.e.
excluding scripts, docs, FS overhead, etc) overflows their hard drives.

Additionally, whoever desperately needs that extra few dozen megabytes
should be using squashfs or zstd apk, so the uncompressed/gzip numbers
are not that useful.

=== Conclusions ===

Personally, I think a 15% speedup is very much worth a 15% increase in
the small portion of my storage used for storing programs. I definitely
think that the optimization level for gcc itself should be changed, and
building it with LTO should be fixed/implemented as soon as possible. I
certainly hope that nobody is installing gcc on their minimal IoT
systems or whatever that cannot spare 10 MB of space. (Also, those
people are wasting space already on Obj-C and D support.)

In my opinion, anybody that doesn't want to use an extra few dozen
megabytes of space either should care more about the extra power
consumption, or should be using a custom OpenWRT or Buildroot anyways,
where they can customize everything.

[1] https://lists.alpinelinux.org/~alpine/devel/%3C2896c13070c508a49cbaa72c8fb7f34ea947358b.camel%40cogitri.dev%3E
[2] https://github.com/richfelker/mallocng-draft/commit/a9187f0387dcbb77f1f7e4d7774602fd394fb27b

Cheers,
Alex.