Improving cloud performance with caching

Recently the compound cloud performance has been attempted to be improved. One of those was to introduce caching for cloud to world conversion. That requires 81 vectors in total to handle all combinations, and a dictionary was added to cache this. Then later it was brought up that a FrozenDictionary should be faster. So I decided to uset BenchmarkDotNet so that it would be clear what is faster. For fun I also added a benchmark that recalculated the world vector each time, and the results are a bit surprising.

Here’s the raw table output:

Method NonSequentialReads ExtraReaderThreads Mean Error StdDev Allocated
RecalculateKeyEachTime 0 0 26.93 ms 0.325 ms 0.271 ms 328 B
DictionaryTryGetValue 0 0 41.54 ms 0.380 ms 0.337 ms 2640 B
DictionaryMarshalGetRef 0 0 39.84 ms 0.241 ms 0.226 ms 2640 B
FrozenDictionaryTryGet 0 0 39.71 ms 0.256 ms 0.240 ms 3776 B
FrozenDictionaryGetValueRef 0 0 39.54 ms 0.324 ms 0.303 ms 3776 B
RecalculateKeyEachTime 0 1 26.61 ms 0.125 ms 0.105 ms 896 B
DictionaryTryGetValue 0 1 40.11 ms 0.187 ms 0.175 ms 3080 B
DictionaryMarshalGetRef 0 1 40.21 ms 0.228 ms 0.202 ms 3080 B
FrozenDictionaryTryGet 0 1 37.41 ms 0.206 ms 0.182 ms 4216 B
FrozenDictionaryGetValueRef 0 1 35.49 ms 0.135 ms 0.126 ms 4284 B
RecalculateKeyEachTime 0 2 27.19 ms 0.122 ms 0.115 ms 1216 B
DictionaryTryGetValue 0 2 42.54 ms 0.380 ms 0.356 ms 3456 B
DictionaryMarshalGetRef 0 2 42.06 ms 0.369 ms 0.345 ms 3456 B
FrozenDictionaryTryGet 0 2 37.56 ms 0.332 ms 0.310 ms 4602 B
FrozenDictionaryGetValueRef 0 2 38.64 ms 0.171 ms 0.143 ms 4592 B
RecalculateKeyEachTime 0.01 0 27.77 ms 0.252 ms 0.224 ms 328 B
DictionaryTryGetValue 0.01 0 42.06 ms 0.201 ms 0.178 ms 2640 B
DictionaryMarshalGetRef 0.01 0 42.04 ms 0.317 ms 0.281 ms 2640 B
FrozenDictionaryTryGet 0.01 0 37.06 ms 0.124 ms 0.104 ms 3776 B
FrozenDictionaryGetValueRef 0.01 0 41.23 ms 0.211 ms 0.187 ms 3776 B
RecalculateKeyEachTime 0.01 1 28.50 ms 0.160 ms 0.150 ms 772 B
DictionaryTryGetValue 0.01 1 44.47 ms 0.247 ms 0.219 ms 3080 B
DictionaryMarshalGetRef 0.01 1 45.20 ms 0.349 ms 0.326 ms 3080 B
FrozenDictionaryTryGet 0.01 1 37.46 ms 0.276 ms 0.258 ms 4216 B
FrozenDictionaryGetValueRef 0.01 1 39.89 ms 0.243 ms 0.216 ms 4216 B
RecalculateKeyEachTime 0.01 2 28.77 ms 0.153 ms 0.136 ms 1327 B
DictionaryTryGetValue 0.01 2 42.90 ms 0.322 ms 0.301 ms 3592 B
DictionaryMarshalGetRef 0.01 2 42.45 ms 0.227 ms 0.202 ms 3490 B
FrozenDictionaryTryGet 0.01 2 39.44 ms 0.074 ms 0.066 ms 4592 B
FrozenDictionaryGetValueRef 0.01 2 39.26 ms 0.504 ms 0.472 ms 4592 B
RecalculateKeyEachTime 0.1 0 41.39 ms 0.108 ms 0.096 ms 328 B
DictionaryTryGetValue 0.1 0 58.23 ms 0.312 ms 0.277 ms 2640 B
DictionaryMarshalGetRef 0.1 0 61.12 ms 0.295 ms 0.261 ms 2640 B
FrozenDictionaryTryGet 0.1 0 48.05 ms 0.149 ms 0.124 ms 3776 B
FrozenDictionaryGetValueRef 0.1 0 50.45 ms 0.492 ms 0.436 ms 3776 B
RecalculateKeyEachTime 0.1 1 41.51 ms 0.266 ms 0.236 ms 886 B
DictionaryTryGetValue 0.1 1 61.36 ms 1.043 ms 0.975 ms 3080 B
DictionaryMarshalGetRef 0.1 1 65.37 ms 0.867 ms 0.811 ms 3080 B
FrozenDictionaryTryGet 0.1 1 48.73 ms 0.419 ms 0.392 ms 4216 B
FrozenDictionaryGetValueRef 0.1 1 47.94 ms 0.130 ms 0.115 ms 4216 B
RecalculateKeyEachTime 0.1 2 42.11 ms 0.326 ms 0.289 ms 1234 B
DictionaryTryGetValue 0.1 2 66.24 ms 0.339 ms 0.300 ms 3456 B
DictionaryMarshalGetRef 0.1 2 60.65 ms 0.201 ms 0.168 ms 3456 B
FrozenDictionaryTryGet 0.1 2 48.95 ms 0.203 ms 0.180 ms 4604 B
FrozenDictionaryGetValueRef 0.1 2 49.50 ms 0.231 ms 0.181 ms 4592 B
RecalculateKeyEachTime 0.9 0 101.42 ms 0.276 ms 0.244 ms 328 B
DictionaryTryGetValue 0.9 0 130.20 ms 0.642 ms 0.569 ms 2640 B
DictionaryMarshalGetRef 0.9 0 130.23 ms 0.287 ms 0.254 ms 2640 B
FrozenDictionaryTryGet 0.9 0 86.01 ms 0.336 ms 0.315 ms 3776 B
FrozenDictionaryGetValueRef 0.9 0 86.36 ms 0.318 ms 0.282 ms 3776 B
RecalculateKeyEachTime 0.9 1 103.33 ms 0.147 ms 0.131 ms 768 B
DictionaryTryGetValue 0.9 1 132.43 ms 0.554 ms 0.462 ms 3144 B
DictionaryMarshalGetRef 0.9 1 134.23 ms 0.728 ms 0.608 ms 3080 B
FrozenDictionaryTryGet 0.9 1 87.85 ms 0.150 ms 0.117 ms 4307 B
FrozenDictionaryGetValueRef 0.9 1 88.98 ms 0.824 ms 0.730 ms 4234 B
RecalculateKeyEachTime 0.9 2 107.62 ms 0.572 ms 0.478 ms 1208 B
DictionaryTryGetValue 0.9 2 134.71 ms 1.245 ms 1.039 ms 3524 B
DictionaryMarshalGetRef 0.9 2 134.57 ms 1.533 ms 1.434 ms 3483 B
FrozenDictionaryTryGet 0.9 2 90.13 ms 0.958 ms 0.896 ms 4683 B
FrozenDictionaryGetValueRef 0.9 2 90.46 ms 1.590 ms 1.409 ms 4660 B

It is quite surprising that the recalculation is faster in many cases. However FrozenDictionary clearly wins over normal one. And interestingly enough when there is enough randomness in the reads, it wins over the precalculation approach.

So I guess the ultimate question is if most of our cloud calculations are actually sequential or more random? I would bet that we have quite a lot of sequential reads because cells absorb things in a circle around them thus leading to quite similar positions in a row.

Here’s my draft PR where I have the benchmark code:

2 Likes

That’s pretty interesting.
It looks like direct computation is actually faster due to CPU caching for sequential access, as the number of different configurations for the calculations is small. If the access is random FrozenDictionary is better because it relies on precomputed keys, making the reading overhead almost zero.

1 Like

Interestingly by making the random number generations always happen, that changes the results significantly. Now the frozen dictionary always wins, and even over the recalculation each time:

Method NonSequentialReads ExtraReaderThreads Mean Error StdDev
RecalculateKeyEachTime 0 0 117.79 ms 0.386 ms 0.361 ms
DictionaryTryGetValue 0 0 98.94 ms 0.738 ms 0.654 ms
DictionaryMarshalGetRef 0 0 98.40 ms 0.396 ms 0.331 ms
FrozenDictionaryTryGet 0 0 93.57 ms 0.515 ms 0.457 ms
FrozenDictionaryGetValueRef 0 0 91.02 ms 0.566 ms 0.502 ms
RecalculateKeyEachTime 0 1 117.64 ms 0.383 ms 0.339 ms
DictionaryTryGetValue 0 1 104.39 ms 0.662 ms 0.587 ms
DictionaryMarshalGetRef 0 1 100.53 ms 0.553 ms 0.491 ms
FrozenDictionaryTryGet 0 1 95.67 ms 0.486 ms 0.431 ms
FrozenDictionaryGetValueRef 0 1 94.42 ms 0.391 ms 0.365 ms
RecalculateKeyEachTime 0 2 122.23 ms 0.586 ms 0.519 ms
DictionaryTryGetValue 0 2 105.33 ms 0.777 ms 0.689 ms
DictionaryMarshalGetRef 0 2 99.32 ms 0.666 ms 0.556 ms
FrozenDictionaryTryGet 0 2 96.77 ms 1.037 ms 0.919 ms
FrozenDictionaryGetValueRef 0 2 110.77 ms 2.179 ms 2.332 ms
RecalculateKeyEachTime 0.01 0 121.49 ms 0.420 ms 0.372 ms
DictionaryTryGetValue 0.01 0 99.49 ms 0.490 ms 0.459 ms
DictionaryMarshalGetRef 0.01 0 100.45 ms 0.893 ms 0.836 ms
FrozenDictionaryTryGet 0.01 0 91.49 ms 0.734 ms 0.686 ms
FrozenDictionaryGetValueRef 0.01 0 96.39 ms 0.334 ms 0.296 ms
RecalculateKeyEachTime 0.01 1 121.24 ms 0.862 ms 0.806 ms
DictionaryTryGetValue 0.01 1 103.98 ms 0.785 ms 0.656 ms
DictionaryMarshalGetRef 0.01 1 103.60 ms 0.593 ms 0.555 ms
FrozenDictionaryTryGet 0.01 1 95.69 ms 0.452 ms 0.423 ms
FrozenDictionaryGetValueRef 0.01 1 95.45 ms 0.364 ms 0.340 ms
RecalculateKeyEachTime 0.01 2 124.16 ms 0.649 ms 0.575 ms
DictionaryTryGetValue 0.01 2 105.92 ms 0.677 ms 0.565 ms
DictionaryMarshalGetRef 0.01 2 104.74 ms 0.760 ms 0.673 ms
FrozenDictionaryTryGet 0.01 2 97.99 ms 1.604 ms 1.500 ms
FrozenDictionaryGetValueRef 0.01 2 100.18 ms 0.588 ms 0.521 ms
RecalculateKeyEachTime 0.1 0 117.61 ms 0.272 ms 0.241 ms
DictionaryTryGetValue 0.1 0 107.77 ms 0.647 ms 0.573 ms
DictionaryMarshalGetRef 0.1 0 108.34 ms 0.449 ms 0.398 ms
FrozenDictionaryTryGet 0.1 0 100.19 ms 0.946 ms 0.838 ms
FrozenDictionaryGetValueRef 0.1 0 96.80 ms 0.695 ms 0.650 ms
RecalculateKeyEachTime 0.1 1 128.40 ms 0.387 ms 0.362 ms
DictionaryTryGetValue 0.1 1 112.01 ms 0.441 ms 0.412 ms
DictionaryMarshalGetRef 0.1 1 112.70 ms 0.414 ms 0.387 ms
FrozenDictionaryTryGet 0.1 1 101.37 ms 2.010 ms 2.064 ms
FrozenDictionaryGetValueRef 0.1 1 99.01 ms 0.262 ms 0.204 ms
RecalculateKeyEachTime 0.1 2 124.14 ms 0.396 ms 0.330 ms
DictionaryTryGetValue 0.1 2 114.56 ms 1.038 ms 0.920 ms
DictionaryMarshalGetRef 0.1 2 113.30 ms 0.721 ms 0.674 ms
FrozenDictionaryTryGet 0.1 2 98.69 ms 0.393 ms 0.328 ms
FrozenDictionaryGetValueRef 0.1 2 101.33 ms 0.510 ms 0.478 ms
RecalculateKeyEachTime 0.9 0 110.35 ms 0.281 ms 0.235 ms
DictionaryTryGetValue 0.9 0 127.40 ms 0.710 ms 0.664 ms
DictionaryMarshalGetRef 0.9 0 127.97 ms 1.092 ms 0.968 ms
FrozenDictionaryTryGet 0.9 0 93.13 ms 0.452 ms 0.377 ms
FrozenDictionaryGetValueRef 0.9 0 94.26 ms 0.779 ms 0.729 ms
RecalculateKeyEachTime 0.9 1 111.31 ms 0.255 ms 0.239 ms
DictionaryTryGetValue 0.9 1 130.53 ms 0.596 ms 0.497 ms
DictionaryMarshalGetRef 0.9 1 131.03 ms 0.776 ms 0.688 ms
FrozenDictionaryTryGet 0.9 1 93.97 ms 0.354 ms 0.314 ms
FrozenDictionaryGetValueRef 0.9 1 95.32 ms 0.436 ms 0.408 ms
RecalculateKeyEachTime 0.9 2 117.28 ms 0.354 ms 0.313 ms
DictionaryTryGetValue 0.9 2 132.47 ms 0.754 ms 0.630 ms
DictionaryMarshalGetRef 0.9 2 133.18 ms 0.490 ms 0.383 ms
FrozenDictionaryTryGet 0.9 2 96.27 ms 0.688 ms 0.574 ms
FrozenDictionaryGetValueRef 0.9 2 96.10 ms 0.312 ms 0.277 ms

I’ll try profiling the game itself a bit to see if Thrive itself is closer to the situation with not much other happening in the loop with the linear scan or with a bunch of interleaved work / random access happening. But I’ll at least change the game to use a frozen dictionary when using a dictionary for the cloud processing.

Edit: looks like the game itself is closer to the more complex scenario where FrozenDictionary wins over recalculation so I’ll update my PR accordingly, which is now also ready for review.

2 Likes

I reclaimed a bit more performance by ditching the dictionary completely and using a flat array.

Method NonSequentialReads Reads ExtraReaderThreads Mean Error StdDev Allocated
RecalculateKeyEachTime 0 5000000 0 230.3 ms 0.37 ms 0.33 ms 328 B
DictionaryTryGetValue 0 5000000 0 223.7 ms 4.35 ms 7.38 ms 2640 B
DictionaryMarshalGetRef 0 5000000 0 208.1 ms 4.14 ms 4.43 ms 2640 B
FrozenDictionaryTryGet 0 5000000 0 187.9 ms 1.09 ms 0.97 ms 3776 B
FrozenDictionaryGetValueRef 0 5000000 0 189.0 ms 3.46 ms 3.24 ms 3776 B
ArrayFlatIndex 0 5000000 0 147.3 ms 0.39 ms 0.37 ms 328 B
RecalculateKeyEachTime 0.01 5000000 0 233.3 ms 2.77 ms 2.31 ms 328 B
DictionaryTryGetValue 0.01 5000000 0 206.9 ms 3.00 ms 2.81 ms 2640 B
DictionaryMarshalGetRef 0.01 5000000 0 204.7 ms 4.03 ms 3.96 ms 2640 B
FrozenDictionaryTryGet 0.01 5000000 0 189.1 ms 2.54 ms 2.38 ms 3776 B
FrozenDictionaryGetValueRef 0.01 5000000 0 205.8 ms 4.09 ms 5.31 ms 3776 B
ArrayFlatIndex 0.01 5000000 0 152.1 ms 0.30 ms 0.24 ms 328 B
RecalculateKeyEachTime 0.1 5000000 0 234.5 ms 0.33 ms 0.29 ms 328 B
DictionaryTryGetValue 0.1 5000000 0 222.7 ms 4.40 ms 10.28 ms 2640 B
DictionaryMarshalGetRef 0.1 5000000 0 232.8 ms 4.58 ms 10.34 ms 2640 B
FrozenDictionaryTryGet 0.1 5000000 0 200.2 ms 3.97 ms 5.16 ms 3776 B
FrozenDictionaryGetValueRef 0.1 5000000 0 206.1 ms 3.91 ms 3.65 ms 3776 B
ArrayFlatIndex 0.1 5000000 0 157.4 ms 0.36 ms 0.30 ms 328 B
RecalculateKeyEachTime 0.9 5000000 0 217.6 ms 1.04 ms 0.97 ms 328 B
DictionaryTryGetValue 0.9 5000000 0 235.0 ms 4.60 ms 4.73 ms 2640 B
DictionaryMarshalGetRef 0.9 5000000 0 232.7 ms 4.55 ms 5.24 ms 2640 B
FrozenDictionaryTryGet 0.9 5000000 0 176.7 ms 2.12 ms 1.98 ms 3776 B
FrozenDictionaryGetValueRef 0.9 5000000 0 181.6 ms 3.56 ms 3.33 ms 3776 B
ArrayFlatIndex 0.9 5000000 0 148.4 ms 0.13 ms 0.11 ms 328 B

The benchmark code I added is

[Benchmark]
    public void ArrayFlatIndex()
    {
        var tasks = new Task[ExtraReaderThreads];

        void TryRead(int key)
        {
            var cached = sourceDataArray[key];

            _ = cachedWorldPosition + cached;
        }

        for (int task = 0; task < ExtraReaderThreads; ++task)
        {
            tasks[task] = Task.Run(() =>
            {
                var random = new Random(RandomSeed);
                int xRead = 0;
                int yRead = 0;
                int playerXRead = 0;
                int playerYRead = 0;

                for (int i = 0; i < Reads; ++i)
                {
                    int key = GetRandomKeyArray(random);
                    if (random.NextDouble() < NonSequentialReads)
                    {
                        TryRead(key);
                    }
                    else
                    {
                        int index = GetIndexKeyArray(xRead, yRead, playerXRead, playerYRead);
                        TryRead(index);
                        IncrementRead(ref xRead, ref yRead, ref playerXRead, ref playerYRead);
                    }
                }
            });
        }

        {
            var random = new Random(RandomSeed);
            int xRead = 0;
            int yRead = 0;
            int playerXRead = 0;
            int playerYRead = 0;

            for (int i = 0; i < Reads; ++i)
            {
                int key = GetRandomKeyArray(random);
                if (random.NextDouble() < NonSequentialReads)
                {
                    TryRead(key);
                }
                else
                {
                    int index = GetIndexKeyArray(xRead, yRead, playerXRead, playerYRead);
                    TryRead(index);
                    IncrementRead(ref xRead, ref yRead, ref playerXRead, ref playerYRead);
                }
            }
        }

        Task.WaitAll(tasks);
    }

    private static int GetRandomKeyArray(Random random)
    {
        var key = random.Next(PlaneOffsets.Length) << 6 | random.Next(PlaneOffsets.Length) << 4 |
            random.Next(PlayerPositions.Length) << 2 | random.Next(PlayerPositions.Length);
        return key;
    }

    private static int GetIndexKeyArray(int xRead, int yRead, int playerXRead, int playerYRead)
    {
        return (xRead << 6) | (yRead << 4) | (playerXRead << 2) | playerYRead;
    }

Where sourceDataArray is initialized as a 256 length array.

Forgot about the precalculation

private static Vector2[] PrecalculateWorldShiftVectorsArray()
    {
        var shiftCache = new Vector2[256];
        int worldShift = CLOUD_SIZE / CLOUD_PLANE_SQUARES_PER_SIDE * cloudResolution;

        for (int xi = 0; xi < 3; xi++)
        {
            for (int yi = 0; yi < 3; yi++)
            {
                for (int pX = 0; pX < 3; pX++)
                {
                    for (int pY = 0; pY < 3; pY++)
                    {
                        int x0 = xi * 100;
                        int y0 = yi * 100;

                        int xShift = GetEdgeShift(x0, pX);
                        int yShift = GetEdgeShift(y0, pY);

                        var wholePlaneShift = new Vector2(
                            worldShift * ((4 - pX) % 3 - 1) - CLOUD_SIZE,
                            worldShift * ((4 - pY) % 3 - 1) - CLOUD_SIZE
                        );

                        var edgePlanesShift = new Vector2(xShift * worldShift, yShift * worldShift);
                        int index = (xi << 6) | (yi << 4) | (pX << 2) | pY;

                        shiftCache[index] = wholePlaneShift + edgePlanesShift;
                    }
                }
            }
        }

        return shiftCache;
    }

I thought about using a flat array, however I decided against it as the key-space was seemingly quite sparse (and would require like thousands of useless entries). Looks like it is not the case.

However, I do have one concern which is that if the key space has to be increased then the simple bit-packing approach you showed cannot be expanded (or at least I would lack the skill to converting that to morton order function). So while that approach would work right now, there’s a threat that it might require updating in the future and then it would be a lot harder to do the update for the code…

In principle the general key for n²×m² such loop (where n is the range for x,y and m is the range for playerX and playerY) can be constructed as, if we define

a = ceil(log_2(m))
b = ceil(log_2(n))
key = x << 2*(a+b) | y << 2a+b | playerX << 2a | playerY << a

Anyways, this is not morton code but a simple bitpacking. True morton code requires n*m to be a power of two.

1 Like

Okay, so I’m done testing now, and here’s the final numbers before I talk about them (I excluded the 0.9 random reads numbers as they tell just more of the same story).

Method NonSequentialReads ExtraReaderThreads Mean Error StdDev Allocated
RecalculateKeyEachTime 0 0 116.23 ms 0.394 ms 0.329 ms 328 B
DictionaryTryGetValue 0 0 98.61 ms 0.792 ms 0.702 ms 2640 B
DictionaryMarshalGetRef 0 0 98.91 ms 0.985 ms 0.873 ms 2640 B
FrozenDictionaryTryGet 0 0 94.52 ms 0.563 ms 0.499 ms 3776 B
FrozenDictionaryGetValueRef 0 0 93.80 ms 0.294 ms 0.261 ms 3776 B
FrozenDictionaryDirectRandomTry 0 0 97.59 ms 0.565 ms 0.529 ms 3776 B
FrozenDictionaryDirectRandomRef 0 0 94.59 ms 0.656 ms 0.613 ms 3776 B
FlatArray 0 0 118.41 ms 0.554 ms 0.491 ms 368 B
FlatArrayDirectRandom 0 0 84.24 ms 0.284 ms 0.266 ms 368 B
RecalculateKeyEachTime 0 1 119.72 ms 0.349 ms 0.310 ms 800 B
DictionaryTryGetValue 0 1 99.64 ms 0.871 ms 0.772 ms 3080 B
DictionaryMarshalGetRef 0 1 101.91 ms 1.209 ms 1.072 ms 3080 B
FrozenDictionaryTryGet 0 1 96.46 ms 1.054 ms 0.934 ms 4216 B
FrozenDictionaryGetValueRef 0 1 97.28 ms 1.606 ms 1.502 ms 4234 B
FrozenDictionaryDirectRandomTry 0 1 95.71 ms 0.304 ms 0.269 ms 4253 B
FrozenDictionaryDirectRandomRef 0 1 96.15 ms 0.832 ms 0.737 ms 4234 B
FlatArray 0 1 119.69 ms 0.301 ms 0.266 ms 808 B
FlatArrayDirectRandom 0 1 121.85 ms 0.703 ms 0.624 ms 869 B
RecalculateKeyEachTime 0 2 121.47 ms 0.590 ms 0.552 ms 1344 B
DictionaryTryGetValue 0 2 101.15 ms 0.391 ms 0.346 ms 3501 B
DictionaryMarshalGetRef 0 2 102.29 ms 1.011 ms 0.844 ms 3456 B
FrozenDictionaryTryGet 0 2 95.61 ms 0.600 ms 0.532 ms 4728 B
FrozenDictionaryGetValueRef 0 2 97.57 ms 1.045 ms 0.816 ms 4728 B
FrozenDictionaryDirectRandomTry 0 2 97.48 ms 0.712 ms 0.631 ms 4592 B
FrozenDictionaryDirectRandomRef 0 2 97.22 ms 0.493 ms 0.411 ms 4637 B
FlatArray 0 2 121.71 ms 0.350 ms 0.310 ms 1184 B
FlatArrayDirectRandom 0 2 121.51 ms 1.107 ms 1.035 ms 1320 B
RecalculateKeyEachTime 0.01 0 118.18 ms 0.513 ms 0.455 ms 328 B
DictionaryTryGetValue 0.01 0 99.17 ms 0.675 ms 0.632 ms 2640 B
DictionaryMarshalGetRef 0.01 0 100.84 ms 0.936 ms 0.875 ms 2640 B
FrozenDictionaryTryGet 0.01 0 93.94 ms 0.909 ms 0.850 ms 3776 B
FrozenDictionaryGetValueRef 0.01 0 94.52 ms 1.053 ms 0.934 ms 3776 B
FrozenDictionaryDirectRandomTry 0.01 0 94.08 ms 0.327 ms 0.273 ms 3776 B
FrozenDictionaryDirectRandomRef 0.01 0 94.38 ms 0.412 ms 0.365 ms 3776 B
FlatArray 0.01 0 119.39 ms 0.615 ms 0.545 ms 368 B
FlatArrayDirectRandom 0.01 0 84.69 ms 0.294 ms 0.245 ms 368 B
RecalculateKeyEachTime 0.01 1 124.17 ms 0.615 ms 0.575 ms 768 B
DictionaryTryGetValue 0.01 1 100.22 ms 1.089 ms 0.966 ms 3080 B
DictionaryMarshalGetRef 0.01 1 103.16 ms 0.539 ms 0.478 ms 3080 B
FrozenDictionaryTryGet 0.01 1 98.96 ms 0.468 ms 0.438 ms 4344 B
FrozenDictionaryGetValueRef 0.01 1 96.06 ms 0.496 ms 0.464 ms 4216 B
FrozenDictionaryDirectRandomTry 0.01 1 95.90 ms 0.624 ms 0.553 ms 4253 B
FrozenDictionaryDirectRandomRef 0.01 1 98.90 ms 1.110 ms 0.984 ms 4237 B
FlatArray 0.01 1 124.48 ms 0.319 ms 0.282 ms 936 B
FlatArrayDirectRandom 0.01 1 124.15 ms 0.880 ms 0.824 ms 1195 B
RecalculateKeyEachTime 0.01 2 122.89 ms 0.546 ms 0.510 ms 1305 B
DictionaryTryGetValue 0.01 2 102.98 ms 0.787 ms 0.657 ms 3456 B
DictionaryMarshalGetRef 0.01 2 104.25 ms 0.595 ms 0.556 ms 3456 B
FrozenDictionaryTryGet 0.01 2 96.95 ms 0.612 ms 0.573 ms 4615 B
FrozenDictionaryGetValueRef 0.01 2 97.15 ms 1.223 ms 1.144 ms 4705 B
FrozenDictionaryDirectRandomTry 0.01 2 97.31 ms 0.526 ms 0.492 ms 4660 B
FrozenDictionaryDirectRandomRef 0.01 2 97.42 ms 0.431 ms 0.360 ms 4728 B
FlatArray 0.01 2 124.16 ms 0.528 ms 0.468 ms 1207 B
FlatArrayDirectRandom 0.01 2 124.14 ms 0.857 ms 0.760 ms 1320 B
RecalculateKeyEachTime 0.1 0 120.52 ms 0.645 ms 0.603 ms 328 B
DictionaryTryGetValue 0.1 0 109.20 ms 0.936 ms 0.782 ms 2640 B
DictionaryMarshalGetRef 0.1 0 109.37 ms 0.772 ms 0.644 ms 2640 B
FrozenDictionaryTryGet 0.1 0 98.26 ms 0.945 ms 0.838 ms 3776 B
FrozenDictionaryGetValueRef 0.1 0 100.45 ms 0.900 ms 0.798 ms 3776 B
FrozenDictionaryDirectRandomTry 0.1 0 97.39 ms 0.495 ms 0.439 ms 3776 B
FrozenDictionaryDirectRandomRef 0.1 0 98.52 ms 0.625 ms 0.554 ms 3776 B
FlatArray 0.1 0 121.64 ms 1.102 ms 1.031 ms 368 B
FlatArrayDirectRandom 0.1 0 90.21 ms 0.406 ms 0.360 ms 368 B
RecalculateKeyEachTime 0.1 1 123.12 ms 0.392 ms 0.367 ms 800 B
DictionaryTryGetValue 0.1 1 111.55 ms 0.500 ms 0.468 ms 3080 B
DictionaryMarshalGetRef 0.1 1 112.78 ms 0.935 ms 0.875 ms 3101 B
FrozenDictionaryTryGet 0.1 1 99.58 ms 0.519 ms 0.460 ms 4323 B
FrozenDictionaryGetValueRef 0.1 1 101.14 ms 0.474 ms 0.396 ms 4216 B
FrozenDictionaryDirectRandomTry 0.1 1 99.09 ms 0.366 ms 0.286 ms 4216 B
FrozenDictionaryDirectRandomRef 0.1 1 100.47 ms 0.530 ms 0.470 ms 4216 B
FlatArray 0.1 1 126.76 ms 0.519 ms 0.485 ms 1067 B
FlatArrayDirectRandom 0.1 1 123.01 ms 0.330 ms 0.275 ms 936 B
RecalculateKeyEachTime 0.1 2 126.19 ms 0.434 ms 0.363 ms 1321 B
DictionaryTryGetValue 0.1 2 113.35 ms 0.620 ms 0.550 ms 3479 B
DictionaryMarshalGetRef 0.1 2 113.39 ms 0.348 ms 0.326 ms 3456 B
FrozenDictionaryTryGet 0.1 2 101.27 ms 0.782 ms 0.693 ms 4705 B
FrozenDictionaryGetValueRef 0.1 2 100.92 ms 0.936 ms 0.781 ms 4637 B
FrozenDictionaryDirectRandomTry 0.1 2 101.67 ms 0.586 ms 0.520 ms 4637 B
FrozenDictionaryDirectRandomRef 0.1 2 101.19 ms 0.615 ms 0.513 ms 4615 B
FlatArray 0.1 2 128.57 ms 0.531 ms 0.470 ms 1320 B
FlatArrayDirectRandom 0.1 2 127.42 ms 0.682 ms 0.569 ms 1320 B

Okay so my final conclusion is that the FrozenDictionary TryGet is actually the fastest option, because when adding one extra reader thread (so 2 at once) the very specific runtime optimizations that the flat array code provided above hits become ineffective. The direct ref get from the dictionary sometimes wins but it is very close in performance to the other option and if I counted right the TryGet variant actually wins slightly more often.

What I noticed is that when I did my own flat array implementation it was actually the slowest option, so calculating the random read values directly is very fast when the runtime optimization hits it (likely due to the avoided array reads). But that optimization seems to only work with a single reader thread and all other tests the array code is very slow. As Thrive is meant to run multiple cloud absorbers /emitters at once for performance reasons, I choose to put more weight on the more reader thread numbers. And also it would be unclear whether in the game the the runtime optimizer would manage to do the same optimization as in the benchmark.

So FrozenDictionary is the fastest of the tested options that doesn’t really care if there’s 0, 1 or 2 extra threads competing for the reads. I’ve updated my PR and as I’ve basically now spent 2 days on this, I’m going to merge it once the CI checks pass.

2 Likes