From: Ariadne Conill <ariadne@dereferenced.org>
To: ~alpine/devel@lists.alpinelinux.org
Subject: Re: Distro optimization flags
Date: Thu, 02 Jul 2020 11:42:10 -0600
Message-ID: <3042121.WL6ZjG3rU8@localhost>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>
References: <1593625212.dirkptm3b0.none.ref@localhost> <1593625212.dirkptm3b0.none@localhost>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"

Hello,

On Wednesday, July 1, 2020 9:10:30 PM MDT Alex Xu (Hello71) wrote:
> Recently there was some discussion on #alpine-devel about optimization
> flags. I think it's worth looking at this issue more closely.
> 
> === Rationale ===
> 
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
> Debian takes only 22 seconds. In other words, O2 is about a 15% speedup,
> and LTO is another 30-50% on top of that.
> 
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
> 
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

I don't have any major objection to changing from -Os to -O2.  In most cases, 
it will not be a major increase in code size.  In the case of some 
applications like Chromium, I suspect that the problems we are having with it 
hanging are due to -Os anyway.
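
As a rough sanity check of the percentages quoted above, the arithmetic can be
sketched in Python. The timings are the ones from the quoted message; the names
`TIMINGS` and `speedup_pct` are made up for this illustration, and "speedup" is
assumed to mean old time divided by new time, minus one.

```python
# Timings (seconds) quoted above for building Linux 5.7.7 allnoconfig.
TIMINGS = {
    "intel_laptop": {"edge_Os": 45, "O2": 39, "debian_sid": 30},
    "ryzen_desktop": {"edge_Os": 38, "O2": 33, "debian_sid": 22},
}


def speedup_pct(slow_s: float, fast_s: float) -> float:
    """Return how much faster `fast_s` is than `slow_s`, in percent."""
    return (slow_s / fast_s - 1.0) * 100.0


for machine, t in TIMINGS.items():
    os_to_o2 = speedup_pct(t["edge_Os"], t["O2"])
    o2_to_debian = speedup_pct(t["O2"], t["debian_sid"])
    print(f"{machine}: -Os -> -O2 {os_to_o2:.1f}%, "
          f"-O2 -> Debian {o2_to_debian:.1f}%")
```

Under that definition the -Os to -O2 step comes out at roughly 15% on both
machines, and the remaining gap to Debian sid at roughly 30% and 50%, which
matches the figures in the quoted text. Attributing that second gap to LTO is
the original poster's reading, not something this calculation shows.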

> === Drawbacks ===
> 
> Obviously, the main issue with this change is increased code size.
> However, this issue is likely less severe than presented at [1],
> because:
> 
> 1. libtracker and some other packages had wrong APKBUILDs that didn't
>    strip libs. I think -O2 causes slightly larger debug tables to be
>    generated. I have submitted merge requests to fix the packages I
>    have found, and we may fix abuild to not require special ordering of
>    subpackages in these cases.
> 
> 2. It is possible to use a more limited -O2, which does not cause as
>    much code ballooning. I got this idea from [2], which is a bad idea
>    to do in a specific package but seems reasonable system-wide. The
>    optimizations disabled this way give a small improvement on old
>    Intel processors, but actually slow down AMD processors and
>    significantly increase code size.
> 
> 3. LTO is roughly as powerful at reducing code size as O2 is at
>    increasing it.
> 
> I checked size of attica (example from [1]) with these configurations.
> Column 1 is package size, column 2 is installed size as reported by apk,
> and column 3+ is the CFLAGS/CXXFLAGS.
> 
> {1} 165461 585728 -Os
> {2} 225285 823296 -O2
> {3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions
>     -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple
> {4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions
>     -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple
> {5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps
>     -fno-align-functions -fno-align-loops -fno-align-labels
>     -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables
>     -fno-align-jumps -fno-align-functions -fno-align-loops
>     -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple

We're not going to use a bunch of custom CFLAGS.  I think -O2 is good enough
and in most cases won't cause much bloat.
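
For reference, the relative size change of each attica configuration against
the -Os baseline can be worked out with a short Python sketch. The byte counts
are copied from the quoted table; the short labels and the helper name
`change_pct` are only illustrative.

```python
# (package size, installed size) in bytes for attica, from the quoted table.
# "limited" stands for the -fno-align-* / -fno-prefetch-loop-arrays /
# -freorder-blocks-algorithm=simple set listed under {3}.
SIZES = {
    "{1} -Os": (165461, 585728),
    "{2} -O2": (225285, 823296),
    "{3} -O2 limited": (198665, 757760),
    "{4} -O2 limited + LTO": (175413, 614400),
    "{5} -O2 limited + no unwind tables": (176036, 675840),
    "{6} -O2 limited + LTO + no unwind tables": (154055, 540672),
}


def change_pct(size: int, baseline: int) -> float:
    """Return the size change relative to `baseline`, in percent."""
    return (size / baseline - 1.0) * 100.0


base_pkg, base_inst = SIZES["{1} -Os"]
for label, (pkg, inst) in SIZES.items():
    print(f"{label:42s} package {change_pct(pkg, base_pkg):+6.1f}%  "
          f"installed {change_pct(inst, base_inst):+6.1f}%")
```

For this one package, plain -O2 is about 36% larger than -Os as a package and
about 41% larger installed, while the limited flag set in {3} is about 20% and
29% larger. A single package is a small sample, so these numbers are only
indicative.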

> gcc size is harder to measure here, as I built gcc without most
> languages. The size of usr/libexec/gcc increased from 43076k excluding
> cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
> may not be accurate, as for some reason my attica -Os is a different
> size from the edge attica.
> 
> === Analysis ===
> 
> Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
> globally. I provide it here only as a reference (and because I did the
> benchmark before looking up exactly what the flag does).
> 
> LTO is a can of worms that I think is definitely worth opening at some
> point, but should wait at least until both musl 1.2 and gcc 10 are done,
> which I gather will take some time. Additionally, it is somewhat
> orthogonal to -Ox. So, the question now is whether a 10-25% increase in
> performance justifies a 15-30% increase in code size.

Most likely we should wait until after the Alpine 3.13 release for this.

> There is also a third option: we can use -O2 in some common CPU-heavy
> programs and libraries, such as gcc and openssl. Alpine already uses
> default optimization for musl, which I think works out to -Os for most
> components and -O3 for performance-sensitive areas. It would be great if
> all packages could do this, but it also sounds like way too much work to
> patch every single package (and probably PGO is the right answer there
> anyways).
> 
> There are also probably other compile flags that we should be looking
> at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
> -Wl,-O, etc). However, I didn't investigate those at this time.
> 
> === Other distros ===
> 
> Although I didn't do much research, I think other distros did not
> carefully select their optimization flags (as opposed to security
> flags). Most mainstream distros seem to basically use whatever gcc gives
> them for -O2. Clear Linux seems to set everything to MAXIMUM
> OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
> decision to the user. OpenWRT uses -Os, which can be overridden
> per-target, although I couldn't find any targets overriding the
> optimization flags.
> 
> === Limitations ===
> 
> These benchmarks are obviously very limited. However, I don't want to go
> down the path of extensive benchmarks just to find people coming out of
> the woodwork and complaining that a 20% increase in code size (i.e.
> excluding scripts, docs, FS overhead, etc) overflows their hard drives.
> 
> Additionally, whoever desperately needs that extra few dozen megabytes
> should be using squashfs or zstd apk, so the uncompressed/gzip numbers
> are not that useful.
> 
> === Conclusions ===
> 
> Personally, I think a 15% speedup is very much worth a 15% increase in
> the small portion of my storage used for storing programs. I definitely
> think that the optimization level for gcc itself should be changed, and
> building it with LTO should be fixed/implemented as soon as possible. I
> certainly hope that nobody is installing gcc on their minimal IoT
> systems or whatever that cannot spare 10 MB of space. (Also, those
> people are wasting space already on Obj-C and D support.)
> 
> In my opinion, anybody that doesn't want to use an extra few dozen
> megabytes of space either should care more about the extra power
> consumption, or should be using a custom OpenWRT or Buildroot anyways,
> where they can customize everything.

It is possible to customize everything in Alpine too: just rebuild the
packages you want customized.

Ariadne