Distro optimization flags

Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Message-ID: <1593625212.dirkptm3b0.none@localhost>

Recently there was some discussion on #alpine-devel about optimization 
flags. I think it's worth looking at this issue more closely.

=== Rationale ===

-Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with 
arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from 
tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45 
seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds. 
On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and 
Debian takes only 22 seconds. In other words, -O2 is about a 15% speedup, 
and the remaining gap to Debian's gcc, which I attribute to LTO, is 
another 30-50% on top of that.
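
The percentages above follow from the quoted timings. A minimal Python sketch of the arithmetic; the function name `speedup_pct` and the dictionary layout are illustrative only, and attributing the gap between the -O2 build and Debian's gcc to LTO is the assumption made in the text, not something the script can confirm.

```python
# Wall-clock times in seconds, copied from the paragraph above.
timings = {
    "intel": {"edge_Os": 45, "O2": 39, "debian": 30},
    "ryzen": {"edge_Os": 38, "O2": 33, "debian": 22},
}

def speedup_pct(slow, fast):
    """Percent speedup going from `slow` to `fast` (time ratio minus one)."""
    return (slow / fast - 1) * 100

for machine, t in timings.items():
    o2 = speedup_pct(t["edge_Os"], t["O2"])        # edge (-Os) gcc vs. -O2 gcc
    rest = speedup_pct(t["O2"], t["debian"])       # -O2 gcc vs. Debian sid gcc
    print(f"{machine}: -Os -> -O2 {o2:.0f}%, -O2 -> Debian {rest:.0f}%")
```

Measured as a time ratio, both machines give roughly 15% for the -O2 step, and 30% (Intel) and 50% (Ryzen) for the step to Debian's gcc.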

https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011 
says that the kernel ran 'hackbench 15' 10% faster using -O2.

http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017 
appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

=== Drawbacks ===

Obviously, the main issue with this change is increased code size. 
However, this issue is likely less severe than presented at [1], 
because:

1. libtracker and some other packages had wrong APKBUILDs that didn't 
   strip libs. I think -O2 causes slightly larger debug tables to be 
   generated. I have submitted merge requests to fix the packages I 
   have found, and we may fix abuild to not require special ordering of 
   subpackages in these cases.

2. It is possible to use a more limited -O2, which does not cause as 
   much code ballooning. I got this idea from [2], which is a bad idea 
   to do in a specific package but seems reasonable system-wide. The 
   -O2 sub-options disabled in {3} below give a small improvement on 
   old Intel processors, but actually slow down AMD processors, and 
   significantly increase code size.

3. LTO is roughly as powerful at reducing code size as O2 is at 
   increasing it.

I checked size of attica (example from [1]) with these configurations. 
Column 1 is package size, column 2 is installed size as reported by apk, 
and column 3+ is the CFLAGS/CXXFLAGS.

{1} 165461 585728 -Os
{2} 225285 823296 -O2
{3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
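
To make the table easier to compare, here is a small Python sketch that expresses each configuration as a percent change relative to the -Os baseline {1}. The helper name `change_pct` is illustrative; the numbers are copied from the table and nothing is re-measured.

```python
# (package size, installed size) in bytes, copied from the table above.
sizes = {
    1: (165461, 585728),  # -Os
    2: (225285, 823296),  # -O2
    3: (198665, 757760),  # limited -O2
    4: (175413, 614400),  # limited -O2 + LTO
    5: (176036, 675840),  # limited -O2, no async unwind tables
    6: (154055, 540672),  # limited -O2 + LTO, no async unwind tables
}

def change_pct(value, baseline):
    """Percent change of `value` relative to `baseline`."""
    return (value / baseline - 1) * 100

base_pkg, base_inst = sizes[1]
for cfg, (pkg, inst) in sizes.items():
    print(f"{{{cfg}}} package {change_pct(pkg, base_pkg):+.1f}%  "
          f"installed {change_pct(inst, base_inst):+.1f}%")
```

By these numbers, plain -O2 grows the installed size by roughly 40%, the limited flag set {3} by roughly 29%, and adding LTO in {4} brings it down to about 5% over -Os.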

gcc size is harder to measure here, as I built gcc without most 
languages. The size of usr/libexec/gcc increased from 43076k excluding 
cc1obj and d21 to 49144k excluding cc1plus. However, the latter number 
may not be accurate, as for some reason my attica -Os is a different 
size from the edge attica.

=== Analysis ===

Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
globally. I provide it here only as a reference (and because I did the 
benchmark before looking up exactly what the flag does).

LTO is a can of worms that I think is definitely worth opening at some 
point, but should wait at least until both musl 1.2 and gcc 10 are done, 
which I gather will take some time. Additionally, it is somewhat 
orthogonal to -Ox. So, the question now is whether a 10-25% increase in 
performance justifies a 15-30% increase in code size.

There is also a third option: we can use -O2 in some common CPU-heavy 
programs and libraries, such as gcc and openssl. Alpine already uses 
default optimization for musl, which I think works out to -Os for most 
components and -O3 for performance-sensitive areas. It would be great if 
all packages could do this, but it also sounds like way too much work to 
patch every single package (and probably PGO is the right answer there 
anyways).
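
The per-package option amounts to swapping the optimization level in the inherited flags for a chosen set of packages. A rough Python sketch of that substitution follows; the function names, the `O2_PACKAGES` set, and the example default flags `-Os -pipe` are all hypothetical, and a real APKBUILD would do this in shell rather than Python.

```python
import shlex

# Hypothetical selection of CPU-heavy packages (names taken from the text above).
O2_PACKAGES = {"gcc", "openssl"}

def with_opt_level(flags, level):
    """Drop any existing -O* token from `flags` and prepend `level`."""
    tokens = shlex.split(flags)
    kept = [t for t in tokens if not t.startswith("-O")]
    return " ".join([level] + kept)

def cflags_for(package, default_flags):
    """Use -O2 for selected packages, the distro default for everything else."""
    if package in O2_PACKAGES:
        return with_opt_level(default_flags, "-O2")
    return default_flags

# "-Os -pipe" is an assumed default for illustration, not Alpine's actual setting.
for pkg in ("gcc", "some-small-tool"):
    print(pkg, "->", cflags_for(pkg, "-Os -pipe"))
```

The point of the design is that the default stays untouched and only the listed packages pay the size cost.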

There are also probably other compile flags that we should be looking 
at, such as security flags, or linker flags (-Wl,--hash-style=gnu, 
-Wl,-O, etc). However, I didn't investigate those at this time.

=== Other distros ===

Although I didn't do much research, I think other distros did not 
carefully select their optimization flags (as opposed to security 
flags). Most mainstream distros seem to basically use whatever gcc gives 
them for -O2. Clear Linux seems to set everything to MAXIMUM 
OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the 
decision to the user. OpenWRT uses -Os, which can be overridden 
per-target, although I couldn't find any targets overriding the 
optimization flags.

=== Limitations ===

These benchmarks are obviously very limited. However, I don't want to go 
down the path of extensive benchmarks just to find people coming out of 
the woodwork and complaining that a 20% increase in code size (i.e. 
excluding scripts, docs, FS overhead, etc) overflows their hard drives.

Additionally, whoever desperately needs that extra few dozen megabytes 
should be using squashfs or zstd apk, so the uncompressed/gzip numbers 
are not that useful.

=== Conclusions ===

Personally, I think a 15% speedup is very much worth a 15% increase in 
the small portion of my storage used for storing programs. I definitely 
think that the optimization level for gcc itself should be changed, and 
building it with LTO should be fixed/implemented as soon as possible. I 
certainly hope that nobody is installing gcc on their minimal IoT 
systems or whatever that cannot spare 10 MB of space. (Also, those 
people are wasting space already on Obj-C and D support.)

In my opinion, anybody that doesn't want to use an extra few dozen 
megabytes of space either should care more about the extra power 
consumption, or should be using a custom OpenWRT or Buildroot anyways, 
where they can customize everything.

[1] https://lists.alpinelinux.org/~alpine/devel/%3C2896c13070c508a49cbaa72c8fb7f34ea947358b.camel%40cogitri.dev%3E
[2] https://github.com/richfelker/mallocng-draft/commit/a9187f0387dcbb77f1f7e4d7774602fd394fb27b

Cheers,
Alex.

Message-ID: <3042121.WL6ZjG3rU8@localhost>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

Hello,

On Wednesday, July 1, 2020 9:10:30 PM MDT Alex Xu (Hello71) wrote:
> Recently there was some discussion on #alpine-devel about optimization
> flags. I think it's worth looking at this issue more closely.
> 
> === Rationale ===
> 
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
> Debian takes only 22 seconds. In other words, O2 is about a 15% speedup,
> and LTO is another 30-50% on top of that.
> 
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
> 
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

I don't have any major objection to changing from -Os to -O2.  In most cases, 
it will not be a major increase in code size.  In the case of some 
applications like Chromium, I suspect that the problems we are having with it 
hanging are due to -Os anyway.

> === Drawbacks ===
> 
> Obviously, the main issue with this change is increased code size.
> However, this issue is likely less severe than presented at [1],
> because:
> 
> 1. libtracker and some other packages had wrong APKBUILDs that didn't
>    strip libs. I think -O2 causes slightly larger debug tables to be
>    generated. I have submitted merge requests to fix the packages I
>    have found, and we may fix abuild to not require special ordering of
>    subpackages in these cases.
> 
> 2. It is possible to use a more limited -O2, which does not cause as
>    much code ballooning. I got this idea from [2], which is a bad idea
>    to do in a specific package but seems reasonable system-wide. These
>    -O2 flags have a small improvement on old Intel processors, but
>    actually slow down speed on AMD processors, and significantly
>    increase code size.
> 
> 3. LTO is roughly as powerful at reducing code size as O2 is at
>    increasing it.
> 
> I checked size of attica (example from [1]) with these configurations.
> Column 1 is package size, column 2 is installed size as reported by apk,
> and column 3+ is the CFLAGS/CXXFLAGS.
> 
> {1} 165461 585728 -Os
> {2} 225285 823296 -O2
> {3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple

We're not going to use a bunch of custom CFLAGS. I think -O2 is good enough 
and in most cases won't cause much bloat.

> gcc size is harder to measure here, as I built gcc without most
> languages. The size of usr/libexec/gcc increased from 43076k excluding
> cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
> may not be accurate, as for some reason my attica -Os is a different
> size from the edge attica.
> 
> === Analysis ===
> 
> Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
> globally. I provide it here only as a reference (and because I did the
> benchmark before looking up exactly what the flag does).
> 
> LTO is a can of worms that I think is definitely worth opening at some
> point, but should wait at least until both musl 1.2 and gcc 10 are done,
> which I gather will take some time. Additionally, it is somewhat
> orthogonal to -Ox. So, the question now is whether a 10-25% increase in
> performance justifies a 15-30% increase in code size.

Most likely we should wait until after Alpine 3.13 release for this.

> There is also a third option: we can use -O2 in some common CPU-heavy
> programs and libraries, such as gcc and openssl. Alpine already uses
> default optimization for musl, which I think works out to -Os for most
> components and -O3 for performance-sensitive areas. It would be great if
> all packages could do this, but it also sounds like way too much work to
> patch every single package (and probably PGO is the right answer there
> anyways).
> 
> There are also probably other compile flags that we should be looking
> at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
> -Wl,-O, etc). However, I didn't investigate those at this time.
> 
> === Other distros ===
> 
> Although I didn't do much research, I think other distros did not
> carefully select their optimization flags (as opposed to security
> flags). Most mainstream distros seem to basically use whatever gcc gives
> them for -O2. Clear Linux seems to set everything to MAXIMUM
> OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
> decision to the user. OpenWRT uses -Os, which can be overridden
> per-target, although I couldn't find any targets overriding the
> optimization flags.
> 
> === Limitations ===
> 
> These benchmarks are obviously very limited. However, I don't want to go
> down the path of extensive benchmarks just to find people coming out of
> the woodwork and complaining that a 20% increase in code size (i.e.
> excluding scripts, docs, FS overhead, etc) overflows their hard drives.
> 
> Additionally, whoever desperately needs that extra few dozen megabytes
> should be using squashfs or zstd apk, so the uncompressed/gzip numbers
> are not that useful.
> 
> === Conclusions ===
> 
> Personally, I think a 15% speedup is very much worth a 15% increase in
> the small portion of my storage used for storing programs. I definitely
> think that the optimization level for gcc itself should be changed, and
> building it with LTO should be fixed/implemented as soon as possible. I
> certainly hope that nobody is installing gcc on their minimal IoT
> systems or whatever that cannot spare 10 MB of space. (Also, those
> people are wasting space already on Obj-C and D support.)
> 
> In my opinion, anybody that doesn't want to use an extra few dozen
> megabytes of space either should care more about the extra power
> consumption, or should be using a custom OpenWRT or Buildroot anyways,
> where they can customize everything.

It is possible to customize everything in Alpine too: just rebuild the 
packages you want customized.

Ariadne

Message-ID: <20200707140641.122e8f09@ncopa-desktop.copa.dup.pw>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

Hi!

On Wed, 01 Jul 2020 23:10:30 -0400
"Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> wrote:

> Recently there was some discussion on #alpine-devel about optimization 
> flags. I think it's worth looking at this issue more closely.
> 
> === Rationale ===
> 
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with 
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from 
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45 
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds. 
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and 
> Debian takes only 22 seconds. In other words, O2 is about a 15% speedup, 
> and LTO is another 30-50% on top of that.
> 
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011 
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
> 
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017 
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

Thank you for doing those tests. Results are interesting.

...
 
> There is also a third option: we can use -O2 in some common CPU-heavy 
> programs and libraries, such as gcc and openssl. Alpine already uses 
> default optimization for musl, which I think works out to -Os for most 
> components and -O3 for performance-sensitive areas. It would be great if 
> all packages could do this, but it also sounds like way too much work to 
> patch every single package (and probably PGO is the right answer there 
> anyways).

We already do this. We set -O2 for zlib for example.

...
 
> === Conclusions ===
> 
> Personally, I think a 15% speedup is very much worth a 15% increase in 
> the small portion of my storage used for storing programs. I definitely 
> think that the optimization level for gcc itself should be changed, and 
> building it with LTO should be fixed/implemented as soon as possible. I 
> certainly hope that nobody is installing gcc on their minimal IoT 
> systems or whatever that cannot spare 10 MB of space. (Also, those 
> people are wasting space already on Obj-C and D support.)
> 
> In my opinion, anybody that doesn't want to use an extra few dozen 
> megabytes of space either should care more about the extra power 
> consumption, or should be using a custom OpenWRT or Buildroot anyways, 
> where they can customize everything.

I think we should keep -Os as the default and enable -O2 on the few
packages where it makes sense. Alpine Linux is "Small. Simple. Secure"
after all.

Those who really want an -O2 distro have a lot of other distros to
choose from.

That said, I agree that it makes sense to build gcc with -O2.

> 
> [1] https://lists.alpinelinux.org/~alpine/devel/%3C2896c13070c508a49cbaa72c8fb7f34ea947358b.camel%40cogitri.dev%3E
> [2] https://github.com/richfelker/mallocng-draft/commit/a9187f0387dcbb77f1f7e4d7774602fd394fb27b
> 
> Cheers,
> Alex.

Message-ID: <dcb0ce15-c5dc-3b38-39d8-a0b907e96c7a@postmarketos.org>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

Hi all,

While I can't look into this in detail right now, I'd like to share a
data point. I just switched the CI job of a Python program from Debian
stretch to Alpine 3.12 and found that the test suite takes almost 8x the
time now (~8 min instead of ~1 min).

https://gitlab.com/postmarketOS/build.postmarketos.org/-/commit/bc3567ce2216226e78f0e31a9da22f3049f94c64

I wonder whether compiling Python with different flags would already make
a big difference. Maybe I'll try it out at some point.

Best regards,
Oliver
