In this C++ code, sorting the data (before the timed region) makes the primary loop ~6x faster:
```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            // Primary loop.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
```
The crucial line is `std::sort(data, data + arraySize);`. (Sorting itself takes more time than this one pass over the array, so it's not actually worth doing if we needed to calculate this for an unknown array.)
Initially, I thought this might be just a language or compiler anomaly, so I tried Java:
```java
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {
                // Primary loop.
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
```
The Java version shows a similar, but less extreme, result.
My first thought was that sorting brings the data into the cache, but that’s silly because the array was just generated.
The code is summing up some independent terms, so the order should not matter. What is going on? Why is processing a sorted array faster than processing an unsorted array?
The performance difference you're observing between the sorted and the unsorted array is caused by branch prediction, not by caching.
Your instinct about the cache was right: this is not a locality effect. The primary loop walks the array in index order, `data[0]`, `data[1]`, `data[2]`, ..., and sorting changes the values, not the access pattern, so the memory traffic is identical in both cases. Besides, the whole array is only 32768 × 4 bytes = 128 KB, small enough to stay cache-hot across all 100,000 outer iterations either way.
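You can rule caching out empirically by deleting the condition and timing the same sequential pass with and without the sort. A minimal sketch (the branch-free loop body is my addition; everything else mirrors the code above):

```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // Toggle this line: with no branch in the loop below,
    // sorting makes no measurable difference.
    // std::sort(data, data + arraySize);

    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
        for (unsigned c = 0; c < arraySize; ++c)
            sum += data[c]; // unconditional: nothing to predict

    std::cout << static_cast<double>(clock() - start) / CLOCKS_PER_SEC
              << "  sum = " << sum << '\n';
}
```

If caching were the cause, the sorted run would still be faster; in practice the two times match. (At high optimization levels the compiler may vectorize this loop, which makes both runs fast but doesn't change the comparison.)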
What does matter is the branch: `if (data[c] >= 128)`. A modern CPU is deeply pipelined and must guess the direction of each conditional branch before the comparison has actually been evaluated; when it guesses wrong, it discards the speculatively executed instructions and restarts from the correct path, which typically costs on the order of 10-20 cycles. With sorted data the branch is almost perfectly predictable: it goes one way for the entire run of values below 128 and the other way for the rest, so the predictor misses only a handful of times per pass. With random data each outcome is essentially a coin flip, the predictor is wrong about half the time, and the loop spends much of its time recovering from mispredictions.
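One way both to confirm this and to fix it is to remove the branch entirely. The classic bit trick below is a sketch, not part of the original code; it assumes a 32-bit `int` and an arithmetic right shift on negative values (implementation-defined in C++, but what every mainstream compiler does). It is a drop-in replacement for the primary loop:

```cpp
for (unsigned c = 0; c < arraySize; ++c)
{
    // (data[c] - 128) >> 31 is 0 when data[c] >= 128 and -1 (all ones)
    // when data[c] < 128, because the arithmetic shift replicates the
    // sign bit. ~t is therefore a mask that keeps exactly the elements
    // the original `if` kept.
    int t = (data[c] - 128) >> 31;
    sum += ~t & data[c];
}
```

With this version, sorted and unsorted input run at the same speed, because no branch direction depends on the data. A gentler alternative is `sum += (data[c] >= 128) ? data[c] : 0;`, which many compilers turn into a branch-free conditional move, though that transformation isn't guaranteed.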
In summary, sorting the array helps because it makes the `data[c] >= 128` branch predictable, not because it improves cache usage; the memory access pattern is identical either way. How large the effect is depends on the hardware's branch predictor and pipeline depth, and on whether the compiler eliminates the branch on its own (at higher optimization levels, vectorization or conditional moves can make the difference disappear).
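Incidentally, a full O(n log n) sort is more work than needed to get the speedup: the branch only cares about which side of 128 each value falls on, so grouping the elements with a single O(n) partition pass produces the same long, predictable runs. A sketch, meant as a replacement for the `std::sort` call in the code above (the lambda predicate is my addition):

```cpp
#include <algorithm>

// Put all values below 128 before all values >= 128. The branch in
// the primary loop then changes direction exactly once per pass,
// just as with fully sorted data, at linear rather than O(n log n) cost.
std::partition(data, data + arraySize,
               [](int x) { return x < 128; });
```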