~alpine/users


[alpine-user] FYI: community/zstd binary much (up to 4x) slower than necessary

Steffen Nurpmeso <steffen@sdaoden.eu>
Details
Message ID
<20180313180648.kXWsR%steffen@sdaoden.eu>
Sender timestamp
1520964408
Hello, for your possible interest.

In a thread for the LUGA(ustria) i eventually had to time some
compression algorithms and wondered why zstd is so slow,
especially in the decompression stage, which is a key feature of
this one.  It turns out that the -Os compilation causes, well,
dramatic performance degradation.  I compiled my own with -O3 and
the difference is up to a factor of four.  Just one example:

POSIX standard (C165.txt):

Alpine, -Os:
  #?0[steffen@essex tmp]$ time zstd --rm x4.txt
  x4.txt               : 20.95%   (12513780 => 2621685 bytes, x4.txt.zst)
      0m00.57s real     0m00.23s user     0m00.12s system
  #?0[steffen@essex tmp]$ time zstd -d -c x4.txt.zst   >/dev/null
  x4.txt.zst          : 12513780 bytes
      0m00.38s real     0m00.15s user     0m00.12s system

  #?0[steffen@essex tmp]$ time zstd --rm -19 x5.txt
  x5.txt               : 15.40%   (12513780 => 1926643 bytes, x5.txt.zst)
      0m16.30s real     0m13.53s user     0m00.27s system

  #?0[steffen@essex tmp]$ time zstd -d -c x5.txt.zst   >/dev/null
  x5.txt.zst          : 12513780 bytes
      0m00.39s real     0m00.12s user     0m00.14s system

-O3:
  #?0[steffen@essex tmp]$ time x/zstd/zstd -f x1.txt
  x1.txt               : 20.95%   (12513780 => 2621685 bytes, x1.txt.zst)
      0m00.34s real     0m00.12s user     0m00.10s system
  #?0[steffen@essex tmp]$ time x/zstd/zstd -d -c x1.txt.zst >/dev/null
  x1.txt.zst          : 12513780 bytes
      0m00.10s real     0m00.02s user     0m00.05s system

  #?0[steffen@essex tmp]$ time x/zstd/zstd -19 x1.txt
  x1.txt               : 15.40%   (12513780 => 1926643 bytes, x1.txt.zst)
      0m13.29s real     0m11.27s user     0m00.17s system
  #?0[steffen@essex tmp]$ time x/zstd/zstd -d -c x1.txt.zst >/dev/null
  x1.txt.zst          : 12513780 bytes
      0m00.12s real     0m00.02s user     0m00.07s system

That actually makes me wonder how ports should deal with CFLAGS.
Is it acceptable for a port to check the compiler flags and set
its own?  My MUA would then go for PIE, relro and all that.

Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


---
Unsubscribe:  alpine-user+unsubscribe@lists.alpinelinux.org
Help:         alpine-user+help@lists.alpinelinux.org
---
Cág <ca6c@bitmessage.ch>
Details
Message ID
<20180313211323.aFXkP%ca6c@bitmessage.ch>
In-Reply-To
<20180313180648.kXWsR%steffen@sdaoden.eu> (view parent)
Sender timestamp
1520975603
Steffen Nurpmeso wrote:

> Hello, for your possible interest.
> 
> In a thread for the LUGA(ustria) i eventually had to time some
> compression algorithms and wondered why zstd is so slow,
> especially in the decompression stage, which is a key feature of
> this one.  It turns out that the -Os compilation causes, well,
> dramatic performance degradation.  I compiled my own with -O3 and
> the difference is up to a factor of four.  Just one example:

What about -O2? Also what are the differences in binary sizes? Are you
using gcc? If yes, try clang.

--
caóc



Steffen Nurpmeso <steffen@sdaoden.eu>
Details
Message ID
<20180314181551.dESjX%steffen@sdaoden.eu>
In-Reply-To
<20180313211323.aFXkP%ca6c@bitmessage.ch> (view parent)
Sender timestamp
1521051351
Cág <ca6c@bitmessage.ch> wrote:
 |Steffen Nurpmeso wrote:
 |> Hello, for your possible interest.
 |> 
 |> In a thread for the LUGA(ustria) i eventually had to time some
 |> compression algorithms and wondered why zstd is so slow,
 |> especially in the decompression stage, which is a key feature of
 |> this one.  It turns out that the -Os compilation causes, well,
 |> dramatic performance degradation.  I compiled my own with -O3 and
 |> the difference is up to a factor of four.  Just one example:
 |
 |What about -O2? Also what are the differences in binary sizes? Are you
 |using gcc? If yes, try clang.

I thought it could be of interest for those who have many files or
whatever.  A factor of four is not nothing, especially if it is lost
at the bottommost level of computing.
In a private message i responded:

  Not really comparable since it found development stuff of other
  archivers and compiled that in -- he adds more and more support
  for other archive formats and i think that will end up like tar
  a.k.a. libarchive umbrellas do.  I do not know how i could have
  an isolated quickshot or what make flags i would have to use to
  get a stripped version that is comparable.  (Too lazy, too late.)

  But sure it will be somewhat larger, -Os is like -O2 (?) with some
  reduction -- then again this is not chromium or something but
  a (per se) small archiver, and a factor of four on the decompression
  side is drastic.  It may also be platform dependent.  I mean, for my
  use case that is all right (but now that i have the binary around
  it stays for a while), but if it were to drive a compressed file
  system, or if i had a lot of compressed files to deal with
  regularly, or if i had a server with a database or whatever that
  was based on such files, then it would matter.  (That is why
  i said FYI.)

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Natanael Copa <ncopa@alpinelinux.org>
Details
Message ID
<20180316091207.3ad9dd48@ncopa-desktop.copa.dup.pw>
In-Reply-To
<20180313180648.kXWsR%steffen@sdaoden.eu> (view parent)
Sender timestamp
1521187927
On Tue, 13 Mar 2018 19:06:48 +0100
Steffen Nurpmeso <steffen@sdaoden.eu> wrote:

> Hello, for your possible interest.
> 
> In a thread for the LUGA(ustria) i eventually had to time some
> compression algorithms and wondered why zstd is so slow,
> especially in the decompression stage, which is a key feature of
> this one.  It turns out that the -Os compilation causes, well,
> dramatic performance degradation.  I compiled my own with -O3 and
> the difference is up to a factor of four.  Just one example:
> 
> POSIX standard (C165.txt):
> 
> Alpine, -Os:
>   #?0[steffen@essex tmp]$ time zstd --rm x4.txt
>   x4.txt               : 20.95%   (12513780 => 2621685 bytes, x4.txt.zst)
>       0m00.57s real     0m00.23s user     0m00.12s system
>   #?0[steffen@essex tmp]$ time zstd -d -c x4.txt.zst   >/dev/null
>   x4.txt.zst          : 12513780 bytes
>       0m00.38s real     0m00.15s user     0m00.12s system
> 
>   #?0[steffen@essex tmp]$ time zstd --rm -19 x5.txt
>   x5.txt               : 15.40%   (12513780 => 1926643 bytes, x5.txt.zst)
>       0m16.30s real     0m13.53s user     0m00.27s system
> 
>   #?0[steffen@essex tmp]$ time zstd -d -c x5.txt.zst   >/dev/null
>   x5.txt.zst          : 12513780 bytes
>       0m00.39s real     0m00.12s user     0m00.14s system
> 
> -O3:
>   #?0[steffen@essex tmp]$ time x/zstd/zstd -f x1.txt
>   x1.txt               : 20.95%   (12513780 => 2621685 bytes, x1.txt.zst)
>       0m00.34s real     0m00.12s user     0m00.10s system
>   #?0[steffen@essex tmp]$ time x/zstd/zstd -d -c x1.txt.zst >/dev/null
>   x1.txt.zst          : 12513780 bytes
>       0m00.10s real     0m00.02s user     0m00.05s system
> 
>   #?0[steffen@essex tmp]$ time x/zstd/zstd -19 x1.txt
>   x1.txt               : 15.40%   (12513780 => 1926643 bytes, x1.txt.zst)
>       0m13.29s real     0m11.27s user     0m00.17s system
>   #?0[steffen@essex tmp]$ time x/zstd/zstd -d -c x1.txt.zst >/dev/null
>   x1.txt.zst          : 12513780 bytes
>       0m00.12s real     0m00.02s user     0m00.07s system

Are you compressing the same file? I see x4.txt and x5.txt vs. x1.txt.
File content may make a difference too.
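
One way to rule out input differences, sketched here as an assumption and not as what was actually run: first check that the test files are byte-identical, then time each binary several times on one and the same file. `same_input` and `bench` are hypothetical helper names, and `bench` assumes a shell that provides `time`, as the one in the transcripts does.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every listed file has the same
# content as the first one.
same_input() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || { echo "differs: $f" >&2; return 1; }
    done
}

# Hypothetical helper: run one command RUNS times on the same input.
# Usage: bench RUNS INPUT CMD [ARGS...]
bench() {
    runs=$1
    input=$2
    shift 2
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the timing report matters.
        time "$@" < "$input" > /dev/null
        i=$((i + 1))
    done
}

# Example (not executed here):
#   same_input x1.txt x4.txt x5.txt && bench 3 x1.txt /usr/bin/zstd -c
#   bench 3 x1.txt x/zstd/zstd -c
```

Repeating each run a few times also shows how much of a difference is just cache or scheduling noise.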
 
> That makes me actually wonder how ports should deal with CFLAGS.
> Is it acceptable for a port to watch for compiler flags and set
> them, my MUA would go for PIE, relro and all that, then?

I think if the difference is 4x then, yes, we should explicitly set
CFLAGS in the aport, with a comment explaining why. I do prefer -O2
over -O3 though, so it would be nice to see the numbers with -O2 and
also what the numbers are on different platforms.

We already explicitly set -O2 for zlib, because it's a case where we do
want to trade size for more speed.
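
As a rough illustration of such an override, assuming the default flags carry -Os: the sketch below swaps -Os for -O2 in a flag string. `prefer_o2` is a hypothetical helper, and the `build()` shown in the comment is only an outline of how an aport might use it, not the actual zlib or zstd APKBUILD.

```shell
#!/bin/sh
# Hypothetical helper: print the given flag string with -Os replaced
# by -O2; all other flags are passed through unchanged.
prefer_o2() {
    out=
    for flag in $1; do
        case $flag in
            -Os) flag=-O2 ;;
        esac
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}

# Outline of use inside a hypothetical APKBUILD (not executed here):
#   build() {
#       # decompression measured up to 4x slower with -Os, see list thread
#       export CFLAGS="$(prefer_o2 "$CFLAGS")"
#       make
#   }
```

Keeping the reason in a comment next to the override is what makes it safe to revisit later.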

-nc


Steffen Nurpmeso <steffen@sdaoden.eu>
Details
Message ID
<20180316153741.sbhOT%steffen@sdaoden.eu>
In-Reply-To
<20180316091207.3ad9dd48@ncopa-desktop.copa.dup.pw> (view parent)
Sender timestamp
1521214661
Hello.

Natanael Copa <ncopa@alpinelinux.org> wrote:
 |On Tue, 13 Mar 2018 19:06:48 +0100
 |Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
 |
 |> Hello, for your possible interest.
 |>
 |> In a thread for the LUGA(ustria) i eventually had to time some
 |> compression algorithms and wondered why zstd is so slow,
 |> especially in the decompression stage, which is a key feature of
 |> this one.  It turns out that the -Os compilation causes, well,
 |> dramatic performance degradation.  I compiled my own with -O3 and
 |> the difference is up to a factor of four.  Just one example:
 ...
 |Are you compressing the same file? I see x4.txt and x5.txt vs. x1.txt.
 |File content may make a difference too.

Yes, it was all the same.  It was just an excerpt of that LUGA
message, sorry.

 |> That makes me actually wonder how ports should deal with CFLAGS.
 |> Is it acceptable for a port to watch for compiler flags and set
 |> them, my MUA would go for PIE, relro and all that, then?
 |
 |I think if the difference is 4x then, yes, we should explicitly set
 |CFLAGS in the aport, with a comment explaining why. I do prefer -O2
 |over -O3 though, so it would be nice to see the numbers with -O2 and
 |also what the numbers are on different platforms.
 |
 |We already explicitly set -O2 for zlib, because it's a case where we do
 |want to trade size for more speed.

I see.  I only have access to x86 (with Linux) for now; i really
have to do something about that some day...  With -O2:

  #?0[steffen@essex zstd]$ CFLAGS=-O2 make zstd
  ...
  #?0[steffen@essex zstd]$ ll zstd
  -rwxr-x---    1 steffen  steffen     582392 Mar 16 16:11 zstd*
  #?0[steffen@essex zstd]$ ldd zstd
          /lib/ld-musl-x86_64.so.1 (0x7fc87972c000)
          libz.so.1 => /lib/libz.so.1 (0x7fc879291000)
          libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7fc87972c000)
  #?0[steffen@essex zstd]$ time ./zstd -c < C165.txt > .t1
      0m00.40s real     0m00.27s user     0m00.09s system
  #?0[steffen@essex zstd]$ time ./zstd -c < C165.txt > .t1
      0m00.31s real     0m00.23s user     0m00.07s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < C165.txt > .t1
      0m12.50s real     0m12.35s user     0m00.13s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < C165.txt > .t1
      0m12.32s real     0m12.14s user     0m00.15s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.17s real     0m00.11s user     0m00.06s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.13s real     0m00.09s user     0m00.03s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.12s real     0m00.09s user     0m00.02s system

No difference with -O3, actually:

  #?0[steffen@essex zstd]$ CFLAGS=-O3 make zstd
  ...
  #?0[steffen@essex zstd]$ ll zstd
  -rwxr-x---    1 steffen  steffen     619296 Mar 16 16:17 zstd*
  #?0[steffen@essex zstd]$ ldd zstd
          /lib/ld-musl-x86_64.so.1 (0x7f423a622000)
          libz.so.1 => /lib/libz.so.1 (0x7f423a17e000)
          libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f423a622000)
  #?0[steffen@essex zstd]$ time ./zstd -c < C165.txt > .t1
      0m00.33s real     0m00.26s user     0m00.06s system
  #?0[steffen@essex zstd]$ time ./zstd -c < C165.txt > .t1
      0m00.28s real     0m00.23s user     0m00.04s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < C165.txt > .t1
      0m12.45s real     0m12.19s user     0m00.21s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < C165.txt > .t1
      0m12.97s real     0m12.82s user     0m00.14s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.13s real     0m00.07s user     0m00.06s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.13s real     0m00.08s user     0m00.05s system

But lots of difference for /usr/bin/zstd:

  #?0[steffen@essex zstd]$ ll /usr/bin/zstd
  -rwxr-xr-x    1 root     root        382792 Dec 27 15:17 /usr/bin/zstd*
  #?0[steffen@essex zstd]$ ldd /usr/bin/zstd
          /lib/ld-musl-x86_64.so.1 (0x7f2255a3d000)
          libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f2255a3d000)
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -c < C165.txt > .t1
      0m00.53s real     0m00.44s user     0m00.07s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -c < C165.txt > .t1
      0m00.52s real     0m00.44s user     0m00.07s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -19 -c < C165.txt > .t1
      0m15.16s real     0m15.06s user     0m00.09s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -19 -c < C165.txt > .t1
      0m15.35s real     0m15.19s user     0m00.14s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d -c < .t1 >/dev/null
      0m00.40s real     0m00.27s user     0m00.12s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d -c < .t1 >/dev/null
      0m00.36s real     0m00.30s user     0m00.05s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d -c < .t1 >/dev/null
      0m00.40s real     0m00.27s user     0m00.14s system

A quick test with a PDF, Steven-Levy_Hackers-Heroes-Computer-Revolution.pdf:
the difference is not so big here, but decompression is still near
a factor of two:

  #?0[steffen@essex zstd]$ ll slhhcr.pdf
  -rw-r-----    1 steffen  steffen    2761072 Mar 16 16:24 slhhcr.pdf
  #?0[steffen@essex zstd]$ time ./zstd -c < slhhcr.pdf >.t1
      0m00.13s real     0m00.06s user     0m00.06s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < slhhcr.pdf >.t1
      0m01.58s real     0m01.50s user     0m00.08s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.03s real     0m00.02s user     0m00.01s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.04s real     0m00.01s user     0m00.02s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.05s real     0m00.02s user     0m00.03s system

  #?0[steffen@essex zstd]$ time /usr/bin/zstd -c < slhhcr.pdf >.t1
      0m00.18s real     0m00.11s user     0m00.07s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -19 -c < slhhcr.pdf >.t1
      0m01.82s real     0m01.74s user     0m00.07s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d < .t1 >/dev/null
      0m00.07s real     0m00.03s user     0m00.04s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d < .t1 >/dev/null
      0m00.09s real     0m00.04s user     0m00.04s system

And finally the Guide_to_Digital_Signal_Processing (a directory of
PDFs) as a tar file; decompression differs by a factor of three to four:

  #?0[steffen@essex zstd]$ ll gtdsp.tar
  -rw-r-----    1 steffen  steffen   16537600 Mar 16 16:29 gtdsp.tar
  #?0[steffen@essex zstd]$ time ./zstd -c < gtdsp.tar >.t1
      0m00.36s real     0m00.22s user     0m00.13s system
  #?0[steffen@essex zstd]$ time ./zstd -19 -c < gtdsp.tar >.t1
      0m06.78s real     0m06.62s user     0m00.14s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.10s real     0m00.06s user     0m00.04s system
  #?0[steffen@essex zstd]$ time ./zstd -d -c < .t1 >/dev/null
      0m00.10s real     0m00.05s user     0m00.04s system

  #?0[steffen@essex zstd]$ time /usr/bin/zstd -c < gtdsp.tar >.t1
      0m00.62s real     0m00.43s user     0m00.18s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -19 -c < gtdsp.tar >.t1
      0m07.43s real     0m07.16s user     0m00.23s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d -c < .t1 >/dev/null
      0m00.37s real     0m00.21s user     0m00.15s system
  #?0[steffen@essex zstd]$ time /usr/bin/zstd -d -c < .t1 >/dev/null
      0m00.33s real     0m00.29s user     0m00.04s system
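
The factors quoted above can be recomputed from the "real" columns of the transcripts. The following is only a small sketch of that arithmetic; `speedup` is a hypothetical helper name, and single wall-clock samples this short are noisy, so the results are indicative at best.

```shell
#!/bin/sh
# Hypothetical helper: print SLOW / FAST with two decimals.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Decompression, /usr/bin/zstd vs. the locally built binary (first runs):
speedup 0.37 0.10   # gtdsp.tar   -> prints 3.70
speedup 0.40 0.13   # C165.txt    -> prints 3.08
speedup 0.07 0.03   # slhhcr.pdf  -> prints 2.33
```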

Since i have no chance to test on other architectures i leave the
arch= unmodified, but i wonder about it, since the Makefile has
explicit arm flags.

Ciao,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Natanael Copa <ncopa@alpinelinux.org>
Details
Message ID
<20180316165056.7768bcf0@ncopa-desktop.copa.dup.pw>
In-Reply-To
<20180316153741.sbhOT%steffen@sdaoden.eu> (view parent)
Sender timestamp
1521215456
On Fri, 16 Mar 2018 16:37:41 +0100
Steffen Nurpmeso <steffen@sdaoden.eu> wrote:

> Hello.
> 
> Natanael Copa <ncopa@alpinelinux.org> wrote:
>  |On Tue, 13 Mar 2018 19:06:48 +0100
>  |Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
>  |
>  |> Hello, for your possible interest.
>  |>
>  |> In a thread for the LUGA(ustria) i eventually had to time some
>  |> compression algorithms and wondered why zstd is so slow,
>  |> especially in the decompression stage, which is a key feature of
>  |> this one.  It turns out that the -Os compilation causes, well,
>  |> dramatic performance degradation.  I compiled my own with -O3 and
>  |> the difference is up to a factor of four.  Just one example:
>  ...
>  |Are you compressing the same file? I see x4.txt and x5.txt vs. x1.txt.
>  |File content may make a difference too.
> 
> Yes, it was all the same.  It was just an excerpt of that LUGA
> message, sorry.
> 
>  |> That makes me actually wonder how ports should deal with CFLAGS.
>  |> Is it acceptable for a port to watch for compiler flags and set
>  |> them, my MUA would go for PIE, relro and all that, then?  
>  |
>  |I think if the difference is 4x then, yes, we should explicitly set
>  |CFLAGS in the aport, with a comment explaining why. I do prefer -O2
>  |over -O3 though, so it would be nice to see the numbers with -O2 and
>  |also what the numbers are on different platforms.
>  |
>  |We already explicitly set -O2 for zlib, because it's a case where we do
>  |want to trade size for more speed.
> 
> I see.  I only have access to x86 (with Linux) for now; i really
> have to do something about that some day...  With -O2:
> 
>   #?0[steffen@essex zstd]$ CFLAGS=-O2 make zstd
>   ...
>   #?0[steffen@essex zstd]$ ll zstd
>   -rwxr-x---    1 steffen  steffen     582392 Mar 16 16:11 zstd*

...

> 
> No difference with -O3, actually:
> 
>   #?0[steffen@essex zstd]$ CFLAGS=-O3 make zstd
>   ...
>   #?0[steffen@essex zstd]$ ll zstd
>   -rwxr-x---    1 steffen  steffen     619296 Mar 16 16:17 zstd*

Yes, no big difference in performance between -O2 and -O3, but the
binary gets bigger.
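
To put numbers on the size cost, the byte counts from the `ll` output can be compared directly. This is a sketch only: `growth_pct` is a hypothetical helper, and the comparison against /usr/bin/zstd is not purely about flags, since the locally built binary also links libz.

```shell
#!/bin/sh
# Hypothetical helper: percentage growth from BASE to NEW, one decimal.
growth_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) * 100 / base }'
}

growth_pct 582392 619296   # -O2 build -> -O3 build:     prints 6.3
growth_pct 382792 582392   # /usr/bin/zstd -> -O2 build: prints 52.1
growth_pct 382792 619296   # /usr/bin/zstd -> -O3 build: prints 61.8
```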

...

> But lots of difference for /usr/bin/zstd:
> 
>   #?0[steffen@essex zstd]$ ll /usr/bin/zstd
>   -rwxr-xr-x    1 root     root        382792 Dec 27 15:17 /usr/bin/zstd*

I assume that is with -Os.

I think this alone is good enough reason to force -O2.

Thanks!

-nc

