My personal journey to unlock more performance on Linux - Part 4: Compiling from Source
Thank you for coming back for the fourth part of my Linux Gaming Tweaks series. If you missed the first part, head over here to get a general overview and learn more about my hardware and Linux distribution choice or visit the second part where I cover the Linux Kernel. In the third episode, I took a look at the GCC and LLVM/Clang compiler toolchain and implemented several tweaks to each of them. This episode is a spiritual continuation of the last episode as we now put our custom compiled toolchains to good use.
Editing the makepkg.conf file
Let's talk about the choice of my CFLAGS first, a term which I will use for compiler settings in general.
On Arch, most packages you compile from the AUR take their compiler settings from /etc/makepkg.conf. The default flags there are rather conservative, so if you want to use your own, you need to edit that file. Head over to the Arch documentation if you want to know more about the other settings in that file. However, some packages, e.g. the Linux Kernel, don't take their build flags from there or deliberately disable the ones from makepkg.conf. Check each PKGBUILD for a "!buildflags" entry in its options, which tells makepkg not to apply the build flags set there.
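A rough sketch of that "!buildflags" check as a shell helper. The function name uses_buildflags is my own invention and not part of makepkg. It simply greps the whole file, so a "!buildflags" inside a comment would also count.

```shell
# Rough check: does a PKGBUILD opt out of the flags from makepkg.conf?
# uses_buildflags is a hypothetical helper name, not part of makepkg.
uses_buildflags() {
    # $1: path to the PKGBUILD (defaults to ./PKGBUILD)
    if grep -q '!buildflags' "${1:-PKGBUILD}"; then
        echo "no"    # the package ignores the build flags from makepkg.conf
    else
        echo "yes"
    fi
}

# Usage, from a directory that contains a PKGBUILD:
#   uses_buildflags ./PKGBUILD
```

If it prints "no", your custom flags will not reach that package unless you edit its PKGBUILD.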
For example, these are my CFLAGS when using GCC for most packages:
CFLAGS="-O3 -march=native -fno-semantic-interposition -falign-functions=32 -fipa-pta -flive-range-shrinkage -fno-math-errno -fno-trapping-math -mtls-dialect=gnu2 -feliminate-unused-debug-types -floop-nest-optimize -fgraphite-identity -fcf-protection=none -pipe -flto=auto -floop-parallelize-all -ftree-parallelize-loops=18 -fdevirtualize-at-ltrans -mharden-sls=none"
For a more technical description of what each of the settings mean, visit the GCC documentation.
In general, I optimize for my CPU (the "native" argument auto-detects it), for cache and data locality, and for auto-parallelization of loops. I disable two security features with "-fcf-protection=none" and "-mharden-sls=none" because of their negative impact on performance; the trade-off is weaker security. As you can see, I make use of GRAPHITE and Link Time Optimization (LTO). The other flags aim to produce more efficient or faster code, roughly speaking.
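If you are curious what "native" resolves to on your machine, GCC can print its target settings with "-Q --help=target". The sketch below pulls the "-march=" value out of that output. The helper name parse_march is hypothetical, and it assumes the usual layout where the option name and its value sit on the same line.

```shell
# Print the CPU type GCC picked for -march=, reading the output of
# "gcc -march=native -Q --help=target" on stdin.
# parse_march is a hypothetical helper name.
parse_march() {
    # The wanted line looks like "  -march=    <cpu-type>":
    # field 1 is the option name, field 2 the detected value.
    awk '$1 == "-march=" { print $2; exit }'
}

# Usage (needs gcc installed):
#   gcc -march=native -Q --help=target | parse_march
```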
The LDFLAGS are also important: these are the settings for your linker (which is either BFD, GOLD or LLD).
I use: LDFLAGS="-march=native -Wl,-O3,--as-needed,-Bsymbolic-functions,-flto=auto -fopenmp"
"O3" means something different in this context than in the CFLAGS, but it also enables more optimizations. "Bsymbolic-functions" is a bit controversial, as some developers oppose it, but others have found that together with "fno-semantic-interposition" it could yield significantly lower compilation times and better performance in some cases. One notable exception is Glibc: with these settings the produced binary is faulty and can leave your installation useless. You have been warned!
The "fopenmp" setting is there because of the "-floop-parallelize-all -ftree-parallelize-loops=18" flags in my CFLAGS: GCC needs OpenMP for loop parallelization, and the linker needs to know that it should link against the OpenMP runtime library. Strictly speaking, "fopenmp" should be picked up automatically in LDFLAGS when these flags are used, so adding it is redundant. For my peace of mind I added it manually anyway, as I have seen bugs caused by a missing OpenMP runtime in the past.
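One way to check after a build whether the OpenMP runtime really got linked in is to look at the NEEDED entries that "readelf -d" prints for the binary. A small sketch under that assumption; links_openmp is a hypothetical helper name, and it looks for libgomp (GCC) or libomp (LLVM).

```shell
# Read "readelf -d <binary>" output on stdin and report whether an
# OpenMP runtime (libgomp or libomp) is among the NEEDED libraries.
# links_openmp is a hypothetical helper name.
links_openmp() {
    if grep -Eq 'NEEDED.*\[lib(gomp|omp)\.so'; then
        echo "yes"
    else
        echo "no"
    fi
}

# Usage (the path is a placeholder):
#   readelf -d /path/to/binary | links_openmp
```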
"-flto=auto" stands for using LTO and making use of all CPU threads.
When using LLVM/Clang, I insert the following after the LTOFLAGS entry - makepkg will then use this toolchain instead of GCC:
export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CFLAGS="-O3 -march=native -mllvm -polly -mllvm -polly-parallel -fopenmp -mllvm -polly-vectorizer=stripmine -mllvm -polly-omp-backend=LLVM -mllvm -polly-num-threads=36 -mllvm -polly-scheduling=dynamic -mllvm -polly-scheduling-chunksize=1 -mllvm -polly-ast-use-context -mllvm -polly-invariant-load-hoisting -mllvm -polly-loopfusion-greedy -mllvm -polly-run-inliner -mllvm -polly-run-dce -fno-math-errno -fno-trapping-math -falign-functions=32 -fno-semantic-interposition -fcf-protection=none -flto=thin"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-march=native -Wl,--lto-O3,-O3,-Bsymbolic-functions,--as-needed -flto=thin -fuse-ld=lld"
You might see the similarities: Polly is used to optimize for data/cache locality, vectorization and auto-parallelization. Most other flags are similar or identical to the ones used with GCC and serve the same purpose. For the LLVM/Clang toolchain you need to use "-flto=thin" to make use of all available CPU threads. You get a slightly larger binary, but huge savings in compilation time.
Identifying packages for optimization
As discussed in the first episode, distributions tend to be conservative with the CFLAGS for their packages. They also need to stay compatible with a wide range of CPUs, hence some advanced instructions are not widely used yet. We covered that part with my custom CFLAGS above, which are indeed more aggressive. Now that you are ready and willing to start compiling packages from the AUR, you should know how to identify suitable packages for optimization; after all, it doesn't make sense to optimize something you won't use in the end. Of course there are some easy picks, like the Linux Kernel, Mesa, Systemd or Xorg-Server and a couple of their dependencies. Others are a bit more subtle, such as compression, audio, image or crypto libraries, which matter because they are used everywhere.
To identify the libraries a specific running program uses, you need two terminal windows. In the first, "htop" lists all running processes with their PID. In the second, "sudo pmap *PID*" lists all libraries mapped by that process - just replace *PID* with the number htop shows for the running application. Sometimes you will see obscure file names and not know which package they belong to, but Google and a search in the AUR will most probably help you out.
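The pmap step can be condensed a little. The sketch below reads pmap output and prints every mapped shared library once. The helper name list_libs is hypothetical, and it assumes the library path is the last field of each line, which is what "pmap -p" (show full paths) gives you. Paths containing spaces are not handled.

```shell
# Read pmap output on stdin and print each mapped shared library once.
# list_libs is a hypothetical helper name.
list_libs() {
    # Keep lines whose last field ends in .so or .so.<version>,
    # then drop the duplicates (one library has several mappings).
    awk '$NF ~ /\.so(\.[0-9]+)*$/ { print $NF }' | LC_ALL=C sort -u
}

# Usage, with the PID taken from htop:
#   sudo pmap -p 1234 | list_libs
# On Arch, pacman can then tell you which package owns each file:
#   sudo pmap -p 1234 | list_libs | xargs -r pacman -Qo
```

The "pacman -Qo" lookup only works for files installed by a package, so anything you copied in by hand will show up as unowned.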
Here are some examples of the more subtle packages:
- compression: zlib-ng, zstd, lz4, brotli
- audio: libogg, libvorbis, FLAC, pipewire, wireplumber
- image: libjpeg-turbo, libpng
- crypto: libgcrypt, openssl 1.1.1
You will find a large pile of optimized PKGBUILDs in my GitHub repository. Just be aware of the downside: the more packages you compile yourself, the less maintainable your system gets. Some future updates might either force you to re-compile specific packages or might even break your system.
Some packages are also a pain to deal with (e.g. qt5-base, GCC, Glibc and the whole KDE packaging mess). They take a long time to build, need to be compiled in a specific order, depend on too many other projects, break from time to time, break your system, or simply don't react well when confronted with aggressive compiler flags. Silent breakage also happens from time to time, where the compilation finishes successfully but triggers problems later on. Most often, though, the compile process simply errors out and you need to analyze what triggered it. Often some compiler flags are to blame, e.g. LTO or the auto-parallelization flags, and simply deleting these flags from your makepkg.conf might help you out. But sometimes other CFLAGS, or simply the program or its configuration options, are at fault. You can play it safe by using just "-O3 -march=native", see if that compiles fine, and enable more and more flags later. Try all possible things before giving up. There can also be bugs, either on the compiler or the program side, and each new major version might behave differently, for better or for worse. Some PKGBUILDs from the AUR might also need manual intervention to make use of your build flags and/or enable/disable some configuration options; edit the PKGBUILD accordingly.
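Deleting suspect flags by hand gets tedious when you are hunting for the one that breaks a build. Here is a minimal sketch of a helper that removes the named flags from a flag string. drop_flags is a hypothetical name of my own, and it only matches whole flags exactly as written.

```shell
# Print the flag string in $1 with every flag given in the remaining
# arguments removed. drop_flags is a hypothetical helper name.
drop_flags() {
    flags=$1
    shift
    out=""
    for f in $flags; do            # unquoted on purpose: split on spaces
        keep=1
        for bad in "$@"; do
            if [ "$f" = "$bad" ]; then
                keep=0
            fi
        done
        if [ "$keep" -eq 1 ]; then
            out="$out $f"
        fi
    done
    printf '%s\n' "${out# }"       # strip the leading space
}

# Usage, e.g. to retry a build without LTO and auto-parallelization:
#   CFLAGS=$(drop_flags "$CFLAGS" -flto=auto -floop-parallelize-all)
```

Working from "-O3 -march=native" upwards, or from the full set downwards, narrows the culprit down in a few builds.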
Let's talk about some drop-in-replacements
There are several alternative projects out there which might be worth a look. Let's talk about three such projects today which I use for a better experience. The first one is KwinFT, which serves as a more modern drop-in replacement for KDE's Kwin compositor; read more about its history in the developer's blog. A low-latency compositor and its dependencies are critical for a good desktop experience: you want it to be snappy, fast and smooth. KwinFT and its related replacements are better in this regard, just try it out yourself to see the difference!
Speaking of compositors, display managers and their impact, here is a small anecdote about my monitor refresh rate when using KwinFT/KDisplay. By default, it is set to 165 Hz on the desktop, but my favorite games limit that to 120 Hz. You might think that performance should not be affected by the desktop refresh rate, but I found out that it is. If I set the desktop refresh rate to 119.997 Hz, I get a far smoother gaming experience and 7-8% better FPS. That is a huge difference, and it might be that multiples of 60 Hz are simply better supported than non-standard refresh rates! 119.997 Hz even makes a big difference over 120.046 Hz - don't ask me why.
The second project is dbus-broker, a more performant implementation of inter-process communication (programs talking to each other). The AUR package lacks the "-D linux-4-17=true" build option, which makes use of newer Kernel features to make the program even better; that was an easy fix. One additional tip from me: look out for such configuration options, as the package maintainer might have overlooked something. Where do you find them? That depends on the build system. If the package uses the classic autotools system, running "./configure --help" in the source directory reveals them. With Meson it is even easier: the options and their arguments are listed in the "meson_options.txt" file in the main source directory.
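For Meson projects, a quick way to get an overview is to pull just the option names out of "meson_options.txt". A minimal sketch, assuming each option is declared on its own line as option('name', ...); list_meson_options is a hypothetical helper name.

```shell
# Print the option names declared in a meson_options.txt file.
# list_meson_options is a hypothetical helper name.
list_meson_options() {
    # $1: path to the options file (defaults to ./meson_options.txt)
    # Capture the first quoted string after "option(" on each line.
    sed -n "s/^[[:space:]]*option([[:space:]]*'\([^']*\)'.*/\1/p" "${1:-meson_options.txt}"
}

# Usage, from the main source directory:
#   list_meson_options
# Any name it prints can be passed to the build as "-D name=value".
```

For the type and default value of an option you still need to read the file itself.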
The last alternative project today is zlib-ng which serves as a more modern drop-in-replacement for the often-used zlib compression library. Try it out, it has more optimizations for modern processors and also aims to be compatible with zlib.
Closing Remarks
There will be blood, toil, tears and sweat, but you might notice a better desktop experience in the end. I might find a topic to extend this series further in the future, but I hope that I've already inspired some of you to tinker a bit with your systems. A fully optimized state is admittedly hard to maintain, so always be prepared for the worst. I hope that Linux gaming and Linux distributions will advance further, though, as a fully optimized system has shown me how much untapped performance is still left in a CPU from 2014. A lower barrier to entry and less effort to get there would be much appreciated. In the meantime, you can try out Endeavour with the x86-64-v3 repo as described in Part 1, or CachyOS, which also comes with a lot of optimizations.