Unlocking Speed – part 2

How Profile-Guided Optimization Revolutionizes Performance of Go

In our previous blog post, we demonstrated the advantages of Profile-Guided Optimization (PGO), highlighting a 10-30% decrease in end-user latency and datacenter cost for server applications without requiring the developer to change source code or even look at profiles. We applied PGO to a couple of Go benchmarks (go-json and 1BRC) and showed its effectiveness. In this blog post, we will discuss the state of PGO across various widely-used programming languages and delve into the challenges encountered in deploying PGO in a large-scale production environment.

Just-In-Time (JIT) compiled languages such as Java and JavaScript perform PGO under the hood. Because compilation happens during application execution, JIT compilers can automatically profile an application (using some combination of instrumentation, sampling, and hardware performance monitoring) and apply the collected profiles to optimize the hot paths of an application.

Statically-compiled languages – such as Go, Swift, and Rust – don’t perform PGO automatically. They require the developer to first profile their applications and then explicitly feed the profiles to the compiler in a subsequent build. Doing these PGO steps manually in a production environment is cumbersome and hard to maintain. As a result, these languages typically leave at least 10% performance on the table.

PGO can be automated in a way that is continuous, transparent, and low overhead. In the diagram below, we show a simple PGO deployment workflow for a production environment. First, the standard compilation pipeline produces a binary artifact without using any profiling data (i.e., no PGO enabled) and deploys it in production. In production, a profile collection and management platform triggers a job that periodically gathers and aggregates profiling data from production services, storing the collected profiles in a database. On subsequent deployments, the compiler uses the aggregated profiling data to perform profile-guided optimizations (such as code layout) on frequently executed code paths to produce an optimized binary artifact. This new optimized binary improves instruction fetch efficiency and reduces the number of instructions executed at runtime, thus improving both performance and efficiency (see our previous blog post).
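For Go specifically, this workflow maps onto a few toolchain commands. A minimal sketch follows; the paths, host, and binary names are illustrative placeholders, and the profile-fetch step assumes the service exposes a pprof HTTP endpoint:

```shell
# 1) Initial build and deploy: no profile yet, so PGO is effectively off.
go build -o server ./cmd/server

# 2) In production, periodically collect CPU profiles from the running
#    service (e.g., via its net/http/pprof endpoint) and aggregate them.
curl -o cpu.pprof "http://prod-host:6060/debug/pprof/profile?seconds=30"

# 3) On the next deployment, feed the aggregated profile to the compiler.
#    Since Go 1.21, a default.pgo file placed in the main package directory
#    is picked up automatically, without the explicit flag.
go build -pgo=cpu.pprof -o server ./cmd/server
```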

The profile collection and management platform (as shown above) plays a crucial role in deploying PGO. It should run continuously to collect fresh profiling data without slowing down production. It can be set up as an independent service, either on-prem or in the cloud, or incorporated into the CI/CD pipeline via a batch job with secure access to production. Moreover, profile collection should be performed via sampling-based profilers rather than instrumentation to minimize overhead on running services. Furthermore, depending on the scale of the production fleet, the computational and storage needs of this platform can be significant – profiling data per microservice per deployed instance can range from a few hundred MB to several GB. If profiling data is collected frequently or retained for long periods, the platform must therefore manage, store, and aggregate it efficiently.

When it comes to selecting a profiling tool for PGO, developers have several options: perf, pprof, and eBPF. The perf tool is based on hardware performance counters and is the most accurate of the three. However, its use in production environments, both on-prem and in the cloud, is limited by its need for special access permissions – many cloud vendors disable perf commands over security concerns in virtualized environments. Additionally, the call stacks generated by perf require further processing for symbol resolution before they can be integrated into the compiler.
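For reference, a typical sampling invocation looks like the following. The PID is a placeholder, and running this requires perf to be installed plus sufficient privileges (e.g., a permissive `perf_event_paranoid` setting or `CAP_PERFMON`):

```shell
# Sample on-CPU call stacks of one process at 99 Hz for 30 seconds.
perf record -F 99 -g -p "$PID" -- sleep 30

# Resolve symbols and inspect the hottest stacks.
perf report --stdio
```

The output of `perf report` is the symbol-resolution step mentioned above; feeding the data to a compiler typically requires a further conversion into a format it accepts.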

On the other hand, the pprof tool is slightly less accurate than perf, but it works universally in both on-prem and cloud settings and is popular among Go developers. For example, the Go compiler's PGO implementation consumes profiling data generated by pprof. However, pprof's CPU profiler can introduce 3-5% overhead during the profiling phase. Additionally, the memory profiler in pprof tends to have higher overhead than its CPU counterpart, which could be a major issue in production.

Lastly, eBPF allows developers to safely and dynamically insert light-weight custom profiling hooks into running OS kernels. It is particularly useful for complex performance analysis, such as investigating networking issues, and for security monitoring. Although eBPF has a growing community and ecosystem, it has limited portability (it is only available on Linux) and a steep learning curve, which can be a barrier for developers new to system-level profiling.

Therefore, the choice of profiling tool (whether perf, pprof, or eBPF) should be based on your production environment settings and the programming languages you use, ensuring the most effective integration with PGO.
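To make the eBPF option concrete, the bpftrace front-end can express a sampling profiler in a single line (Linux only, root required). Whether its output can feed a PGO pipeline depends on additional conversion tooling, so treat this purely as a flavor of what eBPF-based profiling looks like:

```shell
# Sample user-space stacks across the system at 99 Hz;
# Ctrl-C prints the aggregated stack counts.
sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
```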

Enabling PGO in a production environment doesn't always lead to instant performance improvements. The process involves fine-tuning various knobs within the compiler (e.g., adjusting the inlining budget for hot functions). If the threshold is set too high, too many functions may be inlined, which not only makes the binary larger but also increases the likelihood of instruction cache misses. Moreover, it may be necessary to aggregate profiles across the entire production fleet to bubble up critical libraries and frameworks that contribute most to performance across a datacenter. By fine-tuning compiler thresholds for these key libraries, rather than applying a uniform inlining budget across the entire application, we can incrementally realize the performance gains of PGO.
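In Go, some of this tuning is observable and adjustable from the command line. Note that the `-d` debug knobs below are internal, version-dependent compiler options (check `go tool compile -d help` for your toolchain), so this is a sketch of the tuning loop rather than a stable interface:

```shell
# See which functions the compiler chose to inline under the profile.
go build -pgo=default.pgo -gcflags='-m' ./...

# Experiment with a larger PGO inlining budget for hot functions
# (internal debug flag; name and default vary by Go release).
go build -pgo=default.pgo -gcflags='-d=pgoinlinebudget=4000' ./...
```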

PGO deployment becomes considerably easier if your entire application codebase uses the same version of the compiler with the exact same versions of libraries and frameworks; diverging versions can make fine-tuning compiler thresholds quite cumbersome. Adopting a monolithic repository (aka monorepo), with its unified tooling ecosystem, can help mitigate some of these challenges. Moreover, PGO can introduce delays in your deployment cycle: because fresh profiling data forces a full clean build, previously cached build artifacts are invalidated, leading to longer build times. This can be very frustrating for developers who need to quickly deploy hotfixes.

Together, these challenges demand a comprehensive approach that combines expertise in compilers, profiling tools, build systems, and CI/CD to successfully harness the power of PGO in production-scale applications.

At Gitar, we are exploring PGO-as-a-service for emerging statically-compiled programming languages such as Go and Rust. If you are using these languages in your production environment and are keen on reducing your cloud or on-prem data center costs, we'd love to hear from you.

We invite you to join our Slack community, where we continue to explore and discuss these topics further.