Distro optimization flags

Alex Xu (Hello71) <alex_y_xu@yahoo.ca>

Message ID: <1593625212.dirkptm3b0.none@localhost>

Recently there was some discussion on #alpine-devel about optimization
flags. I think it's worth looking at this issue more closely.

=== Rationale ===

-Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
Debian takes only 22 seconds. In other words, O2 is about a 15% speedup,
and LTO is another 30-50% on top of that.
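For reference, the percentages above can be re-derived from the raw timings. This is only a sketch in Python: the dictionary keys and function names are made up for illustration, and "speedup" is taken to mean old time divided by new time, minus one. Attributing the Debian gap to LTO is this message's reading; the arithmetic only shows the ratio between the two compilers.

```python
# Wall-clock build times in seconds, copied from the paragraph above.
# The key names are illustrative labels, not anything the tools print.
timings = {
    "Intel laptop": {"edge (-Os)": 45, "O2": 39, "Debian sid": 30},
    "Ryzen desktop": {"edge (-Os)": 38, "O2": 33, "Debian sid": 22},
}


def speedup_percent(slow, fast):
    """Throughput gain of `fast` over `slow`, in percent: (slow / fast - 1) * 100."""
    return (slow / fast - 1.0) * 100.0


def summarize(times):
    """Return, per machine, the gain of O2 over edge and of Debian over O2."""
    rows = {}
    for machine, t in times.items():
        rows[machine] = {
            "O2 vs edge": speedup_percent(t["edge (-Os)"], t["O2"]),
            "Debian vs O2": speedup_percent(t["O2"], t["Debian sid"]),
        }
    return rows


if __name__ == "__main__":
    for machine, row in summarize(timings).items():
        for label, pct in row.items():
            print(f"{machine:14s} {label:13s} {pct:5.1f}%")
```

Measured as a reduction in wall-clock time instead, the same numbers come out smaller, so it matters which definition is used when comparing against the other benchmarks cited here.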

https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
says that the kernel ran 'hackbench 15' 10% faster using -O2.

http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

=== Drawbacks ===

Obviously, the main issue with this change is increased code size.
However, this issue is likely less severe than presented at [1],
because:

1. libtracker and some other packages had wrong APKBUILDs that didn't
strip libs. I think -O2 causes slightly larger debug tables to be
generated. I have submitted merge requests to fix the packages I
have found, and we may fix abuild to not require special ordering of
subpackages in these cases.

2. It is possible to use a more limited -O2, which does not cause as
much code ballooning. I got this idea from [2], which is a bad idea
to do in a specific package but seems reasonable system-wide. The
optimizations that this turns off (see {3} below) give a small
improvement on old Intel processors, but actually slow things down
on AMD processors, and significantly increase code size.

3. LTO is roughly as powerful at reducing code size as O2 is at
increasing it.

I checked the size of attica (the example from [1]) with these
configurations. Column 1 is the package size, column 2 is the installed
size as reported by apk, and the remaining columns are the CFLAGS/CXXFLAGS.

{1} 165461 585728 -Os
{2} 225285 823296 -O2
{3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
{6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
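To make the table easier to compare against the -Os baseline, the relative changes can be computed directly. The following Python sketch only restates the numbers above; the `sizes` mapping and the `relative_change` helper are names invented for this illustration.

```python
# (package size, installed size) for attica, copied from rows {1} to {6}.
# Row {1} is the -Os baseline.
sizes = {
    1: (165461, 585728),
    2: (225285, 823296),
    3: (198665, 757760),
    4: (175413, 614400),
    5: (176036, 675840),
    6: (154055, 540672),
}


def relative_change(config, baseline=1):
    """Return (package %, installed %) change of `config` relative to `baseline`."""
    pkg, inst = sizes[config]
    base_pkg, base_inst = sizes[baseline]
    return (
        (pkg / base_pkg - 1.0) * 100.0,
        (inst / base_inst - 1.0) * 100.0,
    )


if __name__ == "__main__":
    for config in sorted(sizes):
        pkg_pct, inst_pct = relative_change(config)
        print(f"{{{config}}} package {pkg_pct:+6.1f}%  installed {inst_pct:+6.1f}%")
```

A positive value means the configuration is larger than -Os, and a negative value means it is smaller.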

gcc size is harder to measure here, as I built gcc without most
languages. The size of usr/libexec/gcc increased from 43076k excluding
cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
may not be accurate, as for some reason my attica -Os is a different
size from the edge attica.

=== Analysis ===

Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
globally. I provide it here only as a reference (and because I did the
benchmark before looking up exactly what the flag does).

LTO is a can of worms that I think is definitely worth opening at some
point, but should wait at least until both musl 1.2 and gcc 10 are done,
which I gather will take some time. Additionally, it is somewhat
orthogonal to -Ox. So, the question now is whether a 10-25% increase in
performance justifies a 15-30% increase in code size.

There is also a third option: we can use -O2 in some common CPU-heavy
programs and libraries, such as gcc and openssl. Alpine already uses
default optimization for musl, which I think works out to -Os for most
components and -O3 for performance-sensitive areas. It would be great if
all packages could do this, but it also sounds like way too much work to
patch every single package (and probably PGO is the right answer there
anyways).

There are also probably other compile flags that we should be looking
at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
-Wl,-O, etc). However, I didn't investigate those at this time.

=== Other distros ===

Although I didn't do much research, I think other distros did not
carefully select their optimization flags (as opposed to security
flags). Most mainstream distros seem to basically use whatever gcc gives
them for -O2. Clear Linux seems to set everything to MAXIMUM
OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
decision to the user. OpenWRT uses -Os, which can be overridden
per-target, although I couldn't find any targets overriding the
optimization flags.

=== Limitations ===

These benchmarks are obviously very limited. However, I don't want to go
down the path of extensive benchmarks just to find people coming out of
the woodwork and complaining that a 20% increase in code size (i.e.
excluding scripts, docs, FS overhead, etc) overflows their hard drives.

Additionally, whoever desperately needs that extra few dozen megabytes
should be using squashfs or zstd apk, so the uncompressed/gzip numbers
are not that useful.

=== Conclusions ===

Personally, I think a 15% speedup is very much worth a 15% increase in
the small portion of my storage used for storing programs. I definitely
think that the optimization level for gcc itself should be changed, and
building it with LTO should be fixed/implemented as soon as possible. I
certainly hope that nobody is installing gcc on their minimal IoT
systems or whatever that cannot spare 10 MB of space. (Also, those
people are wasting space already on Obj-C and D support.)

In my opinion, anybody that doesn't want to use an extra few dozen
megabytes of space either should care more about the extra power
consumption, or should be using a custom OpenWRT or Buildroot anyways,
where they can customize everything.

[1] https://lists.alpinelinux.org/~alpine/devel/%3C2896c13070c508a49cbaa72c8fb7f34ea947358b.camel%40cogitri.dev%3E
[2] https://github.com/richfelker/mallocng-draft/commit/a9187f0387dcbb77f1f7e4d7774602fd394fb27b

Cheers,
Alex.

Ariadne Conill <ariadne@dereferenced.org>

Message ID: <3042121.WL6ZjG3rU8@localhost>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

5 years ago

Hello,

On Wednesday, July 1, 2020 9:10:30 PM MDT Alex Xu (Hello71) wrote:
> Recently there was some discussion on #alpine-devel about optimization
> flags. I think it's worth looking at this issue more closely.
> 
> === Rationale ===
> 
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds.
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and
> Debian takes only 22 seconds. In other words, O2 is about a 15% speedup,
> and LTO is another 30-50% on top of that.
> 
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
> 
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

I don't have any major objection to changing from -Os to -O2.  In most cases, 
it will not be a major increase in code size.  In the case of some 
applications like Chromium, I suspect that the problems we are having with it 
hanging are due to -Os anyway.

> === Drawbacks ===
> 
> Obviously, the main issue with this change is increased code size.
> However, this issue is likely less severe than presented at [1],
> because:
> 
> 1. libtracker and some other packages had wrong APKBUILDs that didn't
>    strip libs. I think -O2 causes slightly larger debug tables to be
>    generated. I have submitted merge requests to fix the packages I
>    have found, and we may fix abuild to not require special ordering of
>    subpackages in these cases.
> 
> 2. It is possible to use a more limited -O2, which does not cause as
>    much code ballooning. I got this idea from [2], which is a bad idea
>    to do in a specific package but seems reasonable system-wide. These
>    -O2 flags have a small improvement on old Intel processors, but
>    actually slow down speed on AMD processors, and significantly
>    increase code size.
> 
> 3. LTO is roughly as powerful at reducing code size as O2 is at
>    increasing it.
> 
> I checked size of attica (example from [1]) with these configurations.
> Column 1 is package size, column 2 is installed size as reported by apk,
> and column 3+ is the CFLAGS/CXXFLAGS.
> 
> {1} 165461 585728 -Os
> {2} 225285 823296 -O2
> {3} 198665 757760 -O2 -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {4} 175413 614400 -O2 -flto -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {5} 176036 675840 -O2 -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple
> {6} 154055 540672 -O2 -flto -fno-asynchronous-unwind-tables -fno-align-jumps -fno-align-functions -fno-align-loops -fno-align-labels -fno-prefetch-loop-arrays -freorder-blocks-algorithm=simple

We're not going to use a bunch of custom CFLAGS; I think -O2 is good enough
and in most cases won't cause much bloat.

> gcc size is harder to measure here, as I built gcc without most
> languages. The size of usr/libexec/gcc increased from 43076k excluding
> cc1obj and d21 to 49144k excluding cc1plus. However, the latter number
> may not be accurate, as for some reason my attica -Os is a different
> size from the edge attica.
> 
> === Analysis ===
> 
> Unfortunately, it doesn't seem safe to set -fno-asynchronous-unwind-tables
> globally. I provide it here only as a reference (and because I did the
> benchmark before looking up exactly what the flag does).
> 
> LTO is a can of worms that I think is definitely worth opening at some
> point, but should wait at least until both musl 1.2 and gcc 10 are done,
> which I gather will take some time. Additionally, it is somewhat
> orthogonal to -Ox. So, the question now is whether a 10-25% increase in
> performance justifies a 15-30% increase in code size.

Most likely we should wait until after Alpine 3.13 release for this.

> There is also a third option: we can use -O2 in some common CPU-heavy
> programs and libraries, such as gcc and openssl. Alpine already uses
> default optimization for musl, which I think works out to -Os for most
> components and -O3 for performance-sensitive areas. It would be great if
> all packages could do this, but it also sounds like way too much work to
> patch every single package (and probably PGO is the right answer there
> anyways).
> 
> There are also probably other compile flags that we should be looking
> at, such as security flags, or linker flags (-Wl,--hash-style=gnu,
> -Wl,-O, etc). However, I didn't investigate those at this time.
> 
> === Other distros ===
> 
> Although I didn't do much research, I think other distros did not
> carefully select their optimization flags (as opposed to security
> flags). Most mainstream distros seem to basically use whatever gcc gives
> them for -O2. Clear Linux seems to set everything to MAXIMUM
> OPTIMIZATION. Gentoo recommends -O2 -march=native -pipe and punts the
> decision to the user. OpenWRT uses -Os, which can be overridden
> per-target, although I couldn't find any targets overriding the
> optimization flags.
> 
> === Limitations ===
> 
> These benchmarks are obviously very limited. However, I don't want to go
> down the path of extensive benchmarks just to find people coming out of
> the woodwork and complaining that a 20% increase in code size (i.e.
> excluding scripts, docs, FS overhead, etc) overflows their hard drives.
> 
> Additionally, whoever desperately needs that extra few dozen megabytes
> should be using squashfs or zstd apk, so the uncompressed/gzip numbers
> are not that useful.
> 
> === Conclusions ===
> 
> Personally, I think a 15% speedup is very much worth a 15% increase in
> the small portion of my storage used for storing programs. I definitely
> think that the optimization level for gcc itself should be changed, and
> building it with LTO should be fixed/implemented as soon as possible. I
> certainly hope that nobody is installing gcc on their minimal IoT
> systems or whatever that cannot spare 10 MB of space. (Also, those
> people are wasting space already on Obj-C and D support.)
> 
> In my opinion, anybody that doesn't want to use an extra few dozen
> megabytes of space either should care more about the extra power
> consumption, or should be using a custom OpenWRT or Buildroot anyways,
> where they can customize everything.

It is possible to customize everything in Alpine too: just rebuild the
packages you want customized.

Ariadne

Natanael Copa <ncopa@alpinelinux.org>

Message ID: <20200707140641.122e8f09@ncopa-desktop.copa.dup.pw>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

5 years ago

Hi!

On Wed, 01 Jul 2020 23:10:30 -0400
"Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> wrote:

> Recently there was some discussion on #alpine-devel about optimization 
> flags. I think it's worth looking at this issue more closely.
> 
> === Rationale ===
> 
> -Os is much slower than -O2. I recompiled gcc 9.3.0-r3 from head with 
> arguments {3} from below and tested compiling Linux 5.7.7 allnoconfig from 
> tmpfs on edge with make -j4. On my Intel laptop, edge gcc takes about 45 
> seconds, O2 gcc takes 39 seconds, and Debian sid takes only 30 seconds. 
> On my Ryzen desktop, edge takes 38 seconds, O2 takes 33 seconds, and 
> Debian takes only 22 seconds. In other words, O2 is about a 15% speedup, 
> and LTO is another 30-50% on top of that.
> 
> https://lore.kernel.org/lkml/20110323211415.GA8791@elte.hu/ from 2011 
> says that the kernel ran 'hackbench 15' 10% faster using -O2.
> 
> http://web.archive.org/web/20200408145313/https://rv8.io/bench from 2017 
> appears to say that rv8 ran about 25% faster using -O2 compared to -Os.

Thank you for doing those tests. Results are interesting.

...
 
> There is also a third option: we can use -O2 in some common CPU-heavy 
> programs and libraries, such as gcc and openssl. Alpine already uses 
> default optimization for musl, which I think works out to -Os for most 
> components and -O3 for performance-sensitive areas. It would be great if 
> all packages could do this, but it also sounds like way too much work to 
> patch every single package (and probably PGO is the right answer there 
> anyways).

We already do this. We set -O2 for zlib for example.

...
 
> === Conclusions ===
> 
> Personally, I think a 15% speedup is very much worth a 15% increase in 
> the small portion of my storage used for storing programs. I definitely 
> think that the optimization level for gcc itself should be changed, and 
> building it with LTO should be fixed/implemented as soon as possible. I 
> certainly hope that nobody is installing gcc on their minimal IoT 
> systems or whatever that cannot spare 10 MB of space. (Also, those 
> people are wasting space already on Obj-C and D support.)
> 
> In my opinion, anybody that doesn't want to use an extra few dozen 
> megabytes of space either should care more about the extra power 
> consumption, or should be using a custom OpenWRT or Buildroot anyways, 
> where they can customize everything.

I think we should keep -Os as the default and enable -O2 on the few
packages where it makes sense. Alpine Linux is "Small. Simple. Secure"
after all.

Those who really want an -O2 distro have a lot of other distros to choose
from.

That said, I agree that it makes sense to build gcc with -O2.

> 
> [1] https://lists.alpinelinux.org/~alpine/devel/%3C2896c13070c508a49cbaa72c8fb7f34ea947358b.camel%40cogitri.dev%3E
> [2] https://github.com/richfelker/mallocng-draft/commit/a9187f0387dcbb77f1f7e4d7774602fd394fb27b
> 
> Cheers,
> Alex.

Oliver Smith <ollieparanoid@postmarketos.org>

Message ID: <dcb0ce15-c5dc-3b38-39d8-a0b907e96c7a@postmarketos.org>
In-Reply-To: <1593625212.dirkptm3b0.none@localhost>

4 years ago

Hi all,

While I can't look into this in detail right now, I'd like to share a
data point. I just switched the CI job of a Python program from Debian
stretch to Alpine 3.12 and found that the test suite now takes almost 8x
as long (~8 min instead of ~1 min).

https://gitlab.com/postmarketOS/build.postmarketos.org/-/commit/bc3567ce2216226e78f0e31a9da22f3049f94c64

I wonder whether compiling Python with different flags alone would make
a big difference; maybe I'll try it out at some point.
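As a starting point for that comparison, the flags an interpreter was built with can usually be read back from the interpreter itself. The sketch below uses the standard `sysconfig` module; the helper names are invented for this example, and the variables it reads (`CFLAGS`, `OPT`, `CONFIG_ARGS`) come from the build-time Makefile, so they may be empty on some platforms or altered by the packager.

```python
import sysconfig


def optimization_flags(cflags):
    """Return the -O* options found in a compiler flag string, in order."""
    return [tok for tok in cflags.split() if tok.startswith("-O")]


def interpreter_build_flags():
    """Collect build-time settings recorded by CPython, as strings.

    Variables that are not recorded on this platform are returned as "".
    """
    names = ("CFLAGS", "OPT", "CONFIG_ARGS")
    return {name: sysconfig.get_config_var(name) or "" for name in names}


if __name__ == "__main__":
    flags = interpreter_build_flags()
    for name, value in flags.items():
        print(f"{name}: {value}")
    combined = flags["CFLAGS"] + " " + flags["OPT"]
    print("optimization level(s):", optimization_flags(combined))
```

Running this in both CI images would show whether the two interpreters differ in -O level, and `CONFIG_ARGS` would show whether either was configured with options such as `--enable-optimizations` or `--with-lto`. It does not by itself explain the 8x difference, which could have other causes.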

Best regards,
Oliver
