小能豆

A question about `for` loop performance

go

Here is my code.
Code A:

func RandBytes(r *rand.Rand, b []byte) {
    for i := 0; i < len(b); i += 4 {
        int31 := r.Int31()
        for j := 0; j < 4; j++ {
            if i+j < len(b) {
                b[i+j] = letters[(int31&0b11111111)%lettersLen]
                int31 = int31 >> 8
            }
        }
    }
}

code B:

func RandBytes(r *rand.Rand, b []byte) {
    for i := 0; i < len(b); i += 4 {
        int31 := r.Int31()
        b[i] = letters[(int31&0b11111111)%lettersLen]
        int31 = int31 >> 8
        if i+1 < len(b) {
            b[i+1] = letters[(int31&0b11111111)%lettersLen]
            int31 = int31 >> 8
        }
        if i+2 < len(b) {
            b[i+2] = letters[(int31&0b11111111)%lettersLen]
            int31 = int31 >> 8
        }
        if i+3 < len(b) {
            b[i+3] = letters[(int31&0b11111111)%lettersLen]
            int31 = int31 >> 8
        }
    }
}

And the benchmark code:

func BenchmarkRandBytes(b *testing.B) {
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    buf := make([]byte, 100)
    for i := 0; i < b.N; i++ {
        RandBytes(r, buf)
    }
}

The two versions look equivalent to me, but code A gives:

goos: windows
goarch: amd64
pkg: github.com/pingcap/go-ycsb/pkg/util
cpu: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
BenchmarkRandBytes
BenchmarkRandBytes-12            3272442               377.9 ns/op
PASS

and code B gives:

goos: windows
goarch: amd64
pkg: github.com/pingcap/go-ycsb/pkg/util
cpu: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
BenchmarkRandBytes
BenchmarkRandBytes-12            4012189               295.9 ns/op
PASS

The two differ by about a quarter (377.9 ns/op vs. 295.9 ns/op). Why?

I expected them to perform the same. My Go version is go1.21.1 windows/amd64.


2023-12-19

1 answer

小能豆

The two versions of RandBytes compute the same result, so the difference most likely comes from the machine code the Go compiler generates for each. Loop structure, bounds checks, and inlining decisions can all change that code even when the source looks equivalent.

Code B removes the inner loop by unrolling it by hand. That drops the inner loop's counter increment, comparison, and branch for every byte, and the first write to b[i] no longer needs its own length check, because the outer loop condition already guarantees i < len(b). As far as I know, the standard Go compiler does not unroll a loop like this for you, so in code A that overhead stays in the generated code.

Which of these effects dominates depends on the compiler version and the target architecture, so the size of the gap may change on another Go release or another CPU.

To investigate further, look at the assembly the compiler generates for each version. You can use the following command:

go build -gcflags="-S" yourfile.go

This prints the assembly code to the console. Compare the output for the two versions of RandBytes and look at the body of the outer loop: the extra compare and jump instructions for the inner loop in code A should be visible there.

Keep in mind that microbenchmarks like this can be sensitive to various factors, and small variations in the results may not always reflect significant real-world performance differences.

2023-12-19