From: Ariadne Conill
To: ~alpine/devel@lists.alpinelinux.org
Subject: Re: Distro optimization flags
Date: Thu, 02 Jul 2020 11:42:10 -0600
Message-ID: <3042121.WL6ZjG3rU8@localhost>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>
References: <1593625212.dirkptm3b0.none.ref@localhost> <1593625212.dirkptm3b0.none@localhost>

Hello,

On Wednesday, July 1, 2020 9:10:30 PM MDT Alex Xu (Hello71) wrote:
> Recently there was some discussion on #alpine-devel about optimization
> flags. I think it's worth looking at this issue more closely.
>
> === Rationale ===
>
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
> Debian takes only 22 seconds.
> In other words, O2 is about a 15% speedup, and LTO is another 30-50%
> on top of that.
>
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
>
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

I don't have any major objection to changing from -Os to -O2. In most
cases, it will not be a major increase in code size. In the case of some
applications like Chromium, I suspect that the problems we are having
with it hanging are due to -Os anyway.

> === Drawbacks ===
>
> Obviously, the main issue with this change is increased code size.
> However, this issue is likely less severe than presented at [1],
> because:
>
> 1. libtracker and some other packages had wrong APKBUILDs that didn't
>    strip libs. I think -O2 causes slightly larger debug tables to be
>    generated. I have submitted merge requests to fix the packages I
>    have found, and we may fix abuild to not require special ordering of
>    subpackages in these cases.
>
> 2. It is possible to use a more limited -O2, which does not cause as
>    much code ballooning. I got this idea from [2], which is a bad idea
>    to do in a specific package but seems reasonable system-wide. These
>    -O2 flags have a small improvement on old Intel processors, but
>    actually slow down speed on AMD processors, and significantly
>    increase code size.
>
> 3. LTO is roughly as powerful at reducing code size as O2 is at
>    increasing it.
>
> I checked size of attica (example from [1]) with these configurations.
> Column 1 is package size, column 2 is installed size as reported by apk,
> and column 3+ is the CFLAGS/CXXFLAGS.
>
> {1} 165461 585728 -Os
> {2} 225285 823296 -O2
> {3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions
>     -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple
> {4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions
>     -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple
> {5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps
>     -fno-align-functions -fno-align-loops -fno-align-labels
>     -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables
>     -fno-align-jumps -fno-align-functions -fno-align-loops
>     -fno-align-labels -fno-prefetch-loop-arrays
>     -freorder-blocks-algorithm=simple

We're not going to use a bunch of custom CFLAGS. I think -O2 is good
enough and in most cases won't cause much bloat.

> gcc size is harder to measure here, as I built gcc without most
> languages. The size of usr/libexec/gcc increased from 43076k excluding
> cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
> may not be accurate, as for some reason my attica -Os is a different
> size from the edge attica.
>
> === Analysis ===
>
> Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
> globally. I provide it here only as a reference (and because I did the
> benchmark before looking up exactly what the flag does).
>
> LTO is a can of worms that I think is definitely worth opening at some
> point, but should wait at least until both musl 1.2 and gcc 10 are done,
> which I gather will take some time. Additionally, it is somewhat
> orthogonal to -Ox. So, the question now is whether a 10-25% increase in
> performance justifies a 15-30% increase in code size.

Most likely we should wait until after the Alpine 3.13 release for this.

> There is also a third option: we can use -O2 in some common CPU-heavy
> programs and libraries, such as gcc and openssl.
> Alpine already uses default optimization for musl, which I think works
> out to -Os for most components and -O3 for performance-sensitive areas.
> It would be great if all packages could do this, but it also sounds like
> way too much work to patch every single package (and probably PGO is the
> right answer there anyways).
>
> There are also probably other compile flags that we should be looking
> at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
> -Wl,-O, etc). However, I didn't investigate those at this time.
>
> === Other distros ===
>
> Although I didn't do much research, I think other distros did not
> carefully select their optimization flags (as opposed to security
> flags). Most mainstream distros seem to basically use whatever gcc gives
> them for -O2. Clear Linux seems to set everything to MAXIMUM
> OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
> decision to the user. OpenWRT uses -Os, which can be overridden
> per-target, although I couldn't find any targets overriding the
> optimization flags.
>
> === Limitations ===
>
> These benchmarks are obviously very limited. However, I don't want to go
> down the path of extensive benchmarks just to find people coming out of
> the woodwork and complaining that a 20% increase in code size (i.e.
> excluding scripts, docs, FS overhead, etc) overflows their hard drives.
>
> Additionally, whoever desperately needs that extra few dozen megabytes
> should be using squashfs or zstd apk, so the uncompressed/gzip numbers
> are not that useful.
>
> === Conclusions ===
>
> Personally, I think a 15% speedup is very much worth a 15% increase in
> the small portion of my storage used for storing programs. I definitely
> think that the optimization level for gcc itself should be changed, and
> building it with LTO should be fixed/implemented as soon as possible. I
> certainly hope that nobody is installing gcc on their minimal IoT
> systems or whatever that cannot spare 10 MB of space.
> (Also, those people are wasting space already on Obj-C and D support.)
>
> In my opinion, anybody that doesn't want to use an extra few dozen
> megabytes of space either should care more about the extra power
> consumption, or should be using a custom OpenWRT or Buildroot anyways,
> where they can customize everything.

It is possible to customize everything in Alpine too: just rebuild the
packages you want customized.

Ariadne
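
For reference, the percentages quoted in this thread can be recomputed
from the raw kernel-build timings and the attica sizes for
configurations {1} and {2}. The following is only a minimal sketch in
Python. The names (timings, attica, speedup_pct, growth_pct) are
illustrative choices and do not come from the thread. The script only
recomputes ratios and says nothing about why the Debian build is faster.

```python
# Kernel build timings in seconds, copied from the quoted message.
# Dictionary keys are illustrative labels, not names from the thread.
timings = {
    "Intel laptop": {"edge": 45, "O2": 39, "Debian": 30},
    "Ryzen desktop": {"edge": 38, "O2": 33, "Debian": 22},
}

# attica sizes in bytes as (package size, installed size),
# for configuration {1} (-Os) and configuration {2} (-O2).
attica = {"-Os": (165461, 585728), "-O2": (225285, 823296)}


def speedup_pct(slow_seconds, fast_seconds):
    """Percent throughput gain when a job drops from slow to fast seconds."""
    return (slow_seconds / fast_seconds - 1) * 100


def growth_pct(before, after):
    """Percent size increase from before to after."""
    return (after / before - 1) * 100


for machine, t in timings.items():
    os_to_o2 = speedup_pct(t["edge"], t["O2"])
    o2_to_debian = speedup_pct(t["O2"], t["Debian"])
    print(f"{machine}: edge -> O2 {os_to_o2:.0f}%, O2 -> Debian {o2_to_debian:.0f}%")

pkg = growth_pct(attica["-Os"][0], attica["-O2"][0])
inst = growth_pct(attica["-Os"][1], attica["-O2"][1])
print(f"attica -Os -> -O2: package +{pkg:.0f}%, installed +{inst:.0f}%")
```

With the numbers above this gives roughly 15% for edge to O2 on both
machines and 30% and 50% for O2 to Debian, which matches the "about 15%"
and "30-50%" figures in the quoted message. For attica, plain -O2 comes
out about 36% larger as a package and about 41% larger installed than
-Os.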