The Go Blog

Profile-guided optimization in Go 1.21

Michael Pratt
5 September 2023

Earlier in 2023, Go 1.20 shipped a preview of profile-guided optimization (PGO) for users to test. After addressing known limitations in the preview, and with additional refinements thanks to community feedback and contributions, PGO support in Go 1.21 is ready for general production use! See the profile-guided optimization user guide for complete documentation.

Below we will run through an example of using PGO to improve the performance of an application. Before we get to that, what exactly is “profile-guided optimization”?

When you build a Go binary, the Go compiler performs optimizations to try to generate the best performing binary it can. For example, constant propagation can evaluate constant expressions at compile time, avoiding runtime evaluation cost. Escape analysis avoids heap allocations for locally-scoped objects, avoiding GC overheads. Inlining copies the body of simple functions into callers, often enabling further optimization in the caller (such as additional constant propagation or better escape analysis). Devirtualization converts indirect calls on interface values whose type can be determined statically into direct calls to the concrete method (which often enables inlining of the call).

Go improves optimizations from release to release, but doing so is no easy task. Some optimizations are tunable, but the compiler can’t just “turn it up to 11” on every optimization because overly aggressive optimizations can actually hurt performance or cause excessive build times. Other optimizations require the compiler to make a judgment call about what the “common” and “uncommon” paths in a function are. The compiler must make a best guess based on static heuristics because it can’t know which cases will be common at run time.

Or can it?

With no definitive information about how the code is used in a production environment, the compiler can operate only on the source code of packages. But we do have a tool to evaluate production behavior: profiling. If we provide a profile to the compiler, it can make more informed decisions: more aggressively optimizing the most frequently used functions, or more accurately selecting common cases.

Using profiles of application behavior for compiler optimization is known as Profile-Guided Optimization (PGO) (also known as Feedback-Directed Optimization (FDO)).

Example

Let’s build a service that converts Markdown to HTML: users upload Markdown source to /render, which returns the HTML conversion. We can use gitlab.com/golang-commonmark/markdown to implement this easily.

Set up

$ go mod init example.com/markdown
$ go get gitlab.com/golang-commonmark/markdown@bf3e522c626a

In main.go:

package main
import (
    "bytes"
    "io"
    "log"
    "net/http"
    _ "net/http/pprof"
    "gitlab.com/golang-commonmark/markdown"
)
func render(w http.ResponseWriter, r *http.Request) {
    if r.Method != "POST" {
        http.Error(w, "Only POST allowed", http.StatusMethodNotAllowed)
        return
    }
    src, err := io.ReadAll(r.Body)
    if err != nil {
        log.Printf("error reading body: %v", err)
        http.Error(w, "Internal Server Error", http.StatusInternalServerError)
        return
    }
    md := markdown.New(
        markdown.XHTMLOutput(true),
        markdown.Typographer(true),
        markdown.Linkify(true),
        markdown.Tables(true),
    )
    var buf bytes.Buffer
    if err := md.Render(&buf, src); err != nil {
        log.Printf("error converting markdown: %v", err)
        http.Error(w, "Malformed markdown", http.StatusBadRequest)
        return
    }
    if _, err := io.Copy(w, &buf); err != nil {
        log.Printf("error writing response: %v", err)
        http.Error(w, "Internal Server Error", http.StatusInternalServerError)
        return
    }
}
func main() {
    http.HandleFunc("/render", render)
    log.Printf("Serving on port 8080...")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Build and run the server:

$ go build -o markdown.nopgo.exe
$ ./markdown.nopgo.exe
2023/08/23 03:55:51 Serving on port 8080...

Let’s try sending some Markdown from another terminal. We can use the README.md from the Go project as a sample document:

$ curl -o README.md -L "https://raw.githubusercontent.com/golang/go/c16c2c49e2fa98ae551fc6335215fadd62d33542/README.md"
$ curl --data-binary @README.md http://localhost:8080/render
<h1>The Go Programming Language</h1>
<p>Go is an open source programming language that makes it easy to build simple,
reliable, and efficient software.</p>
...

Profiling

Now that we have a working service, let’s collect a profile and rebuild with PGO to see if we get better performance.

In main.go, we imported net/http/pprof which automatically adds a /debug/pprof/profile endpoint to the server for fetching a CPU profile.

Normally you want to collect a profile from your production environment so that the compiler gets a representative view of behavior in production. Since this example doesn’t have a “production” environment, I have created a simple program to generate load while we collect a profile. Fetch and start the load generator (make sure the server is still running!):

$ go run github.com/prattmic/markdown-pgo/load@latest

While that is running, download a profile from the server:

$ curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

Once this completes, kill the load generator and the server.

Using the profile

The Go toolchain will automatically enable PGO when it finds a profile named default.pgo in the main package directory. Alternatively, the -pgo flag to go build takes a path to a profile to use for PGO.

We recommend committing default.pgo files to your repository. Storing profiles alongside your source code ensures that users automatically have access to the profile simply by fetching the repository (either via the version control system, or via go get) and that builds remain reproducible.

Let’s build:

$ mv cpu.pprof default.pgo
$ go build -o markdown.withpgo.exe

We can check that PGO was enabled in the build with go version:

$ go version -m markdown.withpgo.exe
./markdown.withpgo.exe: go1.21.0
...
        build   -pgo=/tmp/pgo121/default.pgo

Evaluation

We will use a Go benchmark version of the load generator to evaluate the effect of PGO on performance.

First, we will benchmark the server without PGO. Start that server:

$ ./markdown.nopgo.exe

While that is running, run several benchmark iterations:

$ go get github.com/prattmic/markdown-pgo@latest
$ go test github.com/prattmic/markdown-pgo/load -bench=. -count=40 -source $(pwd)/README.md > nopgo.txt

Once that completes, kill the original server and start the version with PGO:

$ ./markdown.withpgo.exe

While that is running, run several benchmark iterations:

$ go test github.com/prattmic/markdown-pgo/load -bench=. -count=40 -source $(pwd)/README.md > withpgo.txt

Once that completes, let’s compare the results:

$ go install golang.org/x/perf/cmd/benchstat@latest
$ benchstat nopgo.txt withpgo.txt
goos: linux
goarch: amd64
pkg: github.com/prattmic/markdown-pgo/load
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
        │  nopgo.txt  │            withpgo.txt             │
        │   sec/op    │   sec/op     vs base               │
Load-12   374.5µ ± 1%   360.2µ ± 0%  -3.83% (p=0.000 n=40)

The new version is around 3.8% faster! In Go 1.21, workloads typically get between 2% and 7% CPU usage improvements from enabling PGO. Profiles contain a wealth of information about application behavior and Go 1.21 just begins to crack the surface by using this information for a limited set of optimizations. Future releases will continue improving performance as more parts of the compiler take advantage of PGO.

Next steps

In this example, after collecting a profile, we rebuilt our server using the exact same source code used in the original build. In a real-world scenario, there is always ongoing development. So we may collect a profile from production, which is running last week’s code, and use it to build with today’s source code. That is perfectly fine! PGO in Go can handle minor changes to source code without issue. Of course, over time source code will drift more and more, so it is still important to update the profile occasionally.

For much more information on using PGO, best practices and caveats to be aware of, please see the profile-guided optimization user guide. If you are curious about what is going on under the hood, keep reading!

Under the hood

To get a better understanding of what made this application faster, let’s take a look under the hood to see how performance has changed. We are going to take a look at two different PGO-driven optimizations.

Inlining

To observe inlining improvements, let’s analyze this markdown application both with and without PGO.

I will compare this using a technique called differential profiling, where we collect two profiles (one with PGO and one without) and compare them. For differential profiling, it’s important that both profiles represent the same amount of work, not the same amount of time, so I’ve adjusted the server to automatically collect profiles, and the load generator to send a fixed number of requests and then exit the server.

The changes I have made to the server as well as the profiles collected can be found at https://github.com/prattmic/markdown-pgo. The load generator was run with -count=300000 -quit.

As a quick consistency check, let’s take a look at the total CPU time required to handle all 300k requests:

$ go tool pprof -top cpu.nopgo.pprof | grep "Total samples"
Duration: 116.92s, Total samples = 118.73s (101.55%)
$ go tool pprof -top cpu.withpgo.pprof | grep "Total samples"
Duration: 113.91s, Total samples = 115.03s (100.99%)

CPU time dropped from ~118s to ~115s, or about 3%. This is in line with our benchmark results, which is a good sign that these profiles are representative.

Now we can open a differential profile to look for savings:

$ go tool pprof -diff_base cpu.nopgo.pprof cpu.withpgo.pprof
File: markdown.profile.withpgo.exe
Type: cpu
Time: Aug 28, 2023 at 10:26pm (EDT)
Duration: 230.82s, Total samples = 118.73s (51.44%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top -cum
Showing nodes accounting for -0.10s, 0.084% of 118.73s total
Dropped 268 nodes (cum <= 0.59s)
Showing top 10 nodes out of 668
      flat  flat%   sum%        cum   cum%
    -0.03s 0.025% 0.025%     -2.56s  2.16%  gitlab.com/golang-commonmark/markdown.ruleLinkify
     0.04s 0.034% 0.0084%     -2.19s  1.84%  net/http.(*conn).serve
     0.02s 0.017% 0.025%     -1.82s  1.53%  gitlab.com/golang-commonmark/markdown.(*Markdown).Render
     0.02s 0.017% 0.042%     -1.80s  1.52%  gitlab.com/golang-commonmark/markdown.(*Markdown).Parse
    -0.03s 0.025% 0.017%     -1.71s  1.44%  runtime.mallocgc
    -0.07s 0.059% 0.042%     -1.62s  1.36%  net/http.(*ServeMux).ServeHTTP
     0.04s 0.034% 0.0084%     -1.58s  1.33%  net/http.serverHandler.ServeHTTP
    -0.01s 0.0084% 0.017%     -1.57s  1.32%  main.render
     0.01s 0.0084% 0.0084%     -1.56s  1.31%  net/http.HandlerFunc.ServeHTTP
    -0.09s 0.076% 0.084%     -1.25s  1.05%  runtime.newobject
(pprof) top
Showing nodes accounting for -1.41s, 1.19% of 118.73s total
Dropped 268 nodes (cum <= 0.59s)
Showing top 10 nodes out of 668
      flat  flat%   sum%        cum   cum%
    -0.46s  0.39%  0.39%     -0.91s  0.77%  runtime.scanobject
    -0.40s  0.34%  0.72%     -0.40s  0.34%  runtime.nextFreeFast (inline)
     0.36s   0.3%  0.42%      0.36s   0.3%  gitlab.com/golang-commonmark/markdown.performReplacements
    -0.35s  0.29%  0.72%     -0.37s  0.31%  runtime.writeHeapBits.flush
     0.32s  0.27%  0.45%      0.67s  0.56%  gitlab.com/golang-commonmark/markdown.ruleReplacements
    -0.31s  0.26%  0.71%     -0.29s  0.24%  runtime.writeHeapBits.write
    -0.30s  0.25%  0.96%     -0.37s  0.31%  runtime.deductAssistCredit
     0.29s  0.24%  0.72%      0.10s 0.084%  gitlab.com/golang-commonmark/markdown.ruleText
    -0.29s  0.24%  0.96%     -0.29s  0.24%  runtime.(*mspan).base (inline)
    -0.27s  0.23%  1.19%     -0.42s  0.35%  bytes.(*Buffer).WriteRune

When specifying pprof -diff_base, the values in displayed in pprof are the difference between the two profiles. So, for instance, runtime.scanobject used 0.46s less CPU time with PGO than without. On the other hand, gitlab.com/golang-commonmark/markdown.performReplacements used 0.36s more CPU time. In a differential profile, we typically want to look at the absolute values (flat and cum columns), as the percentages aren’t meaningful.

top -cum shows the top differences by cumulative change. That is, the difference in CPU of a function and all transitive callees from that function. This will generally show the outermost frames in our program’s call graph, such as main or another goroutine entry point. Here we can see most savings are coming from the ruleLinkify portion of handling HTTP requests.

top shows the top differences limited only to changes in the function itself. This will generally show inner frames in our program’s call graph, where most of the actual work is happening. Here we can see that individual savings are coming mostly from runtime functions.

What are those? Let’s peek up the call stack to see where they come from:

(pprof) peek scanobject$
Showing nodes accounting for -3.72s, 3.13% of 118.73s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            -0.86s 94.51% |   runtime.gcDrain
                                            -0.09s  9.89% |   runtime.gcDrainN
                                             0.04s  4.40% |   runtime.markrootSpans
    -0.46s  0.39%  0.39%     -0.91s  0.77%                | runtime.scanobject
                                            -0.19s 20.88% |   runtime.greyobject
                                            -0.13s 14.29% |   runtime.heapBits.nextFast (inline)
                                            -0.08s  8.79% |   runtime.heapBits.next
                                            -0.08s  8.79% |   runtime.spanOfUnchecked (inline)
                                             0.04s  4.40% |   runtime.heapBitsForAddr
                                            -0.01s  1.10% |   runtime.findObject
----------------------------------------------------------+-------------
(pprof) peek gcDrain$
Showing nodes accounting for -3.72s, 3.13% of 118.73s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                               -1s   100% |   runtime.gcBgMarkWorker.func2
     0.15s  0.13%  0.13%        -1s  0.84%                | runtime.gcDrain
                                            -0.86s 86.00% |   runtime.scanobject
                                            -0.18s 18.00% |   runtime.(*gcWork).balance
                                            -0.11s 11.00% |   runtime.(*gcWork).tryGet
                                             0.09s  9.00% |   runtime.pollWork
                                            -0.03s  3.00% |   runtime.(*gcWork).tryGetFast (inline)
                                            -0.03s  3.00% |   runtime.markroot
                                            -0.02s  2.00% |   runtime.wbBufFlush
                                             0.01s  1.00% |   runtime/internal/atomic.(*Bool).Load (inline)
                                            -0.01s  1.00% |   runtime.gcFlushBgCredit
                                            -0.01s  1.00% |   runtime/internal/atomic.(*Int64).Add (inline)
----------------------------------------------------------+-------------

So runtime.scanobject is ultimately coming from runtime.gcBgMarkWorker. The Go GC Guide tells us that runtime.gcBgMarkWorker is part of the garbage collector, so runtime.scanobject savings must be GC savings. What about nextFreeFast and other runtime functions?

(pprof) peek nextFreeFast$
Showing nodes accounting for -3.72s, 3.13% of 118.73s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            -0.40s   100% |   runtime.mallocgc (inline)
    -0.40s  0.34%  0.34%     -0.40s  0.34%                | runtime.nextFreeFast
----------------------------------------------------------+-------------
(pprof) peek writeHeapBits
Showing nodes accounting for -3.72s, 3.13% of 118.73s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            -0.37s   100% |   runtime.heap

Vytvořeno 2y | 5. 9. 2023 18:40:03

Chcete-li přidat komentář, přihlaste se