AMD Llano A-Series: Architecture Analysis - Caches Architecture


Caches Architecture

The Llano cores have an L1 instruction cache, an L1 data cache and a unified L2 cache. Unlike some of the Stars cores, Llano has no L3 cache, so its cache structure follows that of the Propus core. This can cost a few percentage points of performance in memory-hungry applications but, as we will see later, the improvements to the core, the L2 cache and the memory controller, combined with faster memory, in many cases compensate for its absence.

The L3 cache was eliminated to save energy as well as die space. Because of the particular architecture of AMD's CPU caches, a lookup for a datum has to search the L3 cache and the caches of all the other cores. With the C6 power state, however, those cores may first have to be woken up, which makes the process very slow, complicated and expensive. In essence, implementing the C6 state would be inefficient with an L3 cache.

Another difference from the Stars cores is that the L1 data cache in Llano uses 8-transistor (8T) cells, compared with the 6-transistor (6T) cells of the previous generation. As one would expect, 8T cells take up more area, but they allow higher speed, better reliability and operation at lower voltage, and therefore lower power consumption.

 

L1 instruction cache

The instruction cache is 64KB in size and it is two-way set associative, with 64-byte lines. This unit is responsible for fetching instructions, prefetching, pre-decoding (determining where each instruction ends and the next one begins) and holding the information used for branch prediction.
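To make these figures concrete, here is a minimal Python sketch of how a cache with this geometry would split an address into tag, set index and byte offset. The function name `split_address` is hypothetical, and the split follows the textbook scheme for set-associative caches; the exact bits the hardware uses for indexing are an assumption, not something stated in this article.

```python
# Geometry from the text: 64 KB, two-way set associative, 64-byte lines.
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 6 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1         # 9 bits select a set


def split_address(addr):
    """Return (tag, set_index, byte_offset) using the textbook split (assumed)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
print(split_address(0x12345))
```

With two ways per set, two lines whose addresses share the same set index can coexist; a third one forces a replacement.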

Data not present in the cache is requested from the L2 cache or from system memory. In this case the cache requests two consecutive, naturally aligned 64-byte lines, thereby prefetching the instructions that are likely to come next, since code typically has spatial locality.
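As an illustration, the two lines requested on a miss could be computed as follows. This is only a sketch: `fill_request` is a hypothetical name, and it assumes the pair consists of the aligned line containing the missing address plus the line that immediately follows it.

```python
LINE_BYTES = 64  # line size from the text


def fill_request(miss_addr):
    """Return the base addresses of the two 64-byte lines fetched on a miss.

    Assumption: the first line is the naturally aligned line containing
    miss_addr, the second is the next sequential line (the prefetch).
    """
    first = miss_addr & ~(LINE_BYTES - 1)
    return first, first + LINE_BYTES


print([hex(a) for a in fill_request(0x1234)])
```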

The cache lines are replaced with the LRU (least recently used) algorithm.

During these cache fills, the pre-decoding information that marks the instruction boundaries is generated and stored alongside the instructions in dedicated bits. This lets the downstream decoders decode the instructions more efficiently. The cache is protected only by parity bits.

 

L1 Data Cache

The data cache is 64KB in size and it is two-way set associative, with 64-byte lines and two 128-bit ports. It uses a write-allocate policy (when a datum is written, it is always brought into the L1 cache) and a write-back policy (the data is physically written to the lower levels, such as the L2 cache or RAM, only when it has to be evicted from the cache).
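The two policies can be seen at work in a toy model. The sketch below is not a description of the real hardware: the class `WriteBackCache` is hypothetical, it tracks single values rather than 64-byte lines, and for brevity it evicts the oldest allocated entry rather than the least recently used one.

```python
class WriteBackCache:
    """Toy write-allocate, write-back cache in front of a backing store."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory  # dict address -> value, stands in for L2/RAM
        self.lines = {}       # address -> (value, dirty), oldest entry first

    def _allocate(self, addr):
        if addr in self.lines:
            return
        if len(self.lines) == self.capacity:
            old_addr = next(iter(self.lines))
            old_value, dirty = self.lines.pop(old_addr)
            if dirty:
                # Write-back: the lower level is updated only on eviction.
                self.memory[old_addr] = old_value
        self.lines[addr] = (self.memory.get(addr, 0), False)

    def write(self, addr, value):
        self._allocate(addr)  # write-allocate: bring the line in first
        self.lines[addr] = (value, True)

    def read(self, addr):
        self._allocate(addr)
        return self.lines[addr][0]


memory = {0: 10, 1: 11}
cache = WriteBackCache(capacity=1, memory=memory)
cache.write(0, 99)
print(memory[0])      # still the old value: nothing has been written back yet
cache.read(1)         # evicts address 0, which triggers the write-back
print(memory[0])
```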

The cache lines are replaced with the LRU algorithm.
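For a two-way set, LRU simply means that the way touched least recently is the one that gets replaced. A minimal sketch, with a hypothetical `LruSet` class that models a single set and stores only tags:

```python
from collections import OrderedDict


class LruSet:
    """One cache set with LRU replacement (two ways by default)."""

    def __init__(self, ways=2):
        self.ways = ways
        self.lines = OrderedDict()  # least recently used tag first

    def access(self, tag):
        """Touch a tag; return the evicted tag, or None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return None
        victim = None
        if len(self.lines) == self.ways:
            victim, _ = self.lines.popitem(last=False)  # drop the LRU way
        self.lines[tag] = None
        return victim


s = LruSet()
for tag in (1, 2, 1, 3):
    print(tag, "evicts", s.access(tag))
```

In the sequence above, tag 3 evicts tag 2 rather than tag 1, because tag 1 was touched again just before.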

It is divided into 8 banks of 16 bytes each. Two accesses through the two ports can proceed together only if they target different banks. The cache supports the MOESI coherence protocol (Modified, Owned, Exclusive, Shared, Invalid) and ECC protection. It has a prefetcher that loads data in advance to avoid misses, and its latency is 3 clock cycles.
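The bank rule can be sketched in a few lines. The article does not say which address bits select the bank, so the interleaving used here (consecutive 16-byte chunks map to consecutive banks, wrapping every 128 bytes) is an assumption, and `bank_of` and `can_dual_issue` are hypothetical names.

```python
BANKS = 8        # from the text
BANK_BYTES = 16  # from the text


def bank_of(addr):
    """Bank index under the assumed 16-byte interleaving."""
    return (addr // BANK_BYTES) % BANKS


def can_dual_issue(addr_a, addr_b):
    """Two accesses can use both ports only if they hit different banks."""
    return bank_of(addr_a) != bank_of(addr_b)


print(can_dual_issue(0x00, 0x10))  # different banks
print(can_dual_issue(0x00, 0x80))  # same bank under this mapping
```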

 

L2 Cache

The L2 cache is integrated on-die, runs at the same frequency as the CPU, and there is one for each core. It is also an exclusive cache: it contains only lines coming from the L1, namely modified lines that must be written to RAM and that the LRU algorithm has selected for removal from the L1 cache to make room for new data. These lines are called victims.
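The exclusive relationship between the two levels can be sketched with a toy model. This is a simplification under stated assumptions: the class `ExclusiveHierarchy` is hypothetical, both levels are treated as small fully associative LRU structures, and the modified/clean state of lines is ignored, so every line evicted from L1 becomes a victim.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy two-level hierarchy: a line lives in L1 or in L2, never in both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # least recently used line first
        self.l2 = OrderedDict()

    def access(self, line):
        """Touch a line and report where it was found: 'L1', 'L2' or 'RAM'."""
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1"
        if line in self.l2:
            del self.l2[line]  # exclusive: the line leaves L2
            source = "L2"
        else:
            source = "RAM"
        if len(self.l1) == self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) == self.l2_lines:
                self.l2.popitem(last=False)
            self.l2[victim] = None  # the L1 victim is what fills L2
        self.l1[line] = None
        return source


h = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
for line in (1, 2, 3, 1):
    print(line, h.access(line))
```

One consequence of this design is that the usable capacity is the sum of the two levels, since no line is duplicated.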

The L2 cache latency is 9 clock cycles, in addition to that of the L1 cache. In Llano the L2 cache is 1MB, 16-way, against 512KB, 8-way for most of the Stars cores (only the dual-core Regor has 1MB of cache). The L2 cache is protected with ECC.
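Taking the 3-cycle L1 data cache latency together with these 9 cycles, a load that misses in L1 and hits in L2 takes 12 cycles. The sketch below extends this to an average access time; the hit rates and the memory penalty passed to `average_latency` are hypothetical inputs chosen for illustration, not measurements of Llano.

```python
L1_LATENCY = 3        # cycles, L1 data cache (from the text)
L2_EXTRA_LATENCY = 9  # cycles added when an L1 miss hits in L2 (from the text)


def average_latency(l1_hit_rate, l2_hit_rate, memory_penalty):
    """Average load latency in cycles.

    memory_penalty is the extra cost of going to RAM after an L2 miss;
    it is a hypothetical parameter, not a figure from the article.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return L1_LATENCY + l1_miss * (L2_EXTRA_LATENCY + l2_miss * memory_penalty)


print(L1_LATENCY + L2_EXTRA_LATENCY)    # latency of an L2 hit
print(average_latency(0.9, 0.5, 200))   # illustrative numbers only
```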

Translation-Lookaside Buffer

The translation-lookaside buffer (TLB) holds the most recently used virtual address translations and therefore speeds up address translation.

Each memory access goes through several stages. The first is addressing: an instruction specifies an addressing mode, which is simply the procedure used to compute the address (linear, also called virtual) of a given datum.

There are various addressing modes. The simplest is immediate, in which the data is included in the instruction itself, so no further memory access is needed. Then there is direct addressing, in which the instruction specifies the absolute address of the data. Other addressing modes (indirect, indexed, with offset, etc.) and other features of the x86-64 architecture (segmentation) require a more or less complicated calculation of the final address (usually done in the AGUs, which are described below), but the end result is always a linear, or virtual, address.

If virtual memory is enabled (and in modern operating systems it always is), the virtual address must be translated into a physical address. This is where the TLB comes into play. Address translation divides the address space into pages, typically of 4KB (although the CPU supports pages of 4KB, 2MB, 4MB and 1GB), and for each page it stores protection information and the physical location of the data.

So each process has a page table, organized as a multi-level tree. Without the TLB, every memory access would have to walk this tree in RAM, which would be a very slow process. Instead, the TLB caches the most recently used translations.
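The following sketch shows both halves of the idea. `split_virtual` breaks a virtual address into the per-level table indices used by x86-64 long mode with 4KB pages (four levels of 9 bits each, plus a 12-bit page offset). `TinyTlb` is a hypothetical toy: it replaces the real tree walk with a dictionary lookup and has unlimited capacity, so it only illustrates why caching translations avoids repeated walks.

```python
PAGE_SHIFT = 12  # 4 KB pages
LEVEL_BITS = 9   # 512 entries per table in x86-64 long mode
LEVELS = 4


def split_virtual(vaddr):
    """Return ([top-level index, ..., last-level index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    vpn = vaddr >> PAGE_SHIFT
    mask = (1 << LEVEL_BITS) - 1
    indices = [(vpn >> (level * LEVEL_BITS)) & mask
               for level in reversed(range(LEVELS))]
    return indices, offset


class TinyTlb:
    """Toy TLB: caches page translations, counts the slow walks it avoids."""

    def __init__(self, page_table):
        self.page_table = page_table  # dict: virtual page -> physical frame
        self.entries = {}
        self.walks = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.entries:
            self.walks += 1  # stands in for the multi-level walk in RAM
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << PAGE_SHIFT) | offset


tlb = TinyTlb({0x10: 0x99})
print(hex(tlb.translate(0x10ABC)), hex(tlb.translate(0x10DEF)), tlb.walks)
```

Both accesses fall in the same 4KB page, so only the first one pays for a walk.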

Llano uses a 2-level TLB structure.

 

Level 1 instruction TLB

The L1 instruction TLB is fully associative, with space for 32 translations of 4KB pages and 16 of 2MB pages. 4MB pages require two 2MB entries.

 

Level 1 data TLB

The L1 data TLB is fully associative, with space for 48 translations of 4KB, 2MB, 4MB and 1GB pages. 4MB pages require two 2MB entries.

 

Level 2 instruction TLB

The L2 instruction TLB is 4-way associative, with space for 512 4KB page translations.

 

Level 2 data TLB

The L2 data TLB has space for 1024 translations of 4KB pages (4-way associative), 128 translations of 2MB pages (2-way associative) and 16 translations of 1GB pages (8-way associative). 4MB pages require two 2MB entries. The 1024 entries for 4KB pages compare with 512 in the Stars core: this means a greater probability of finding the translation in the TLB and the ability to cover twice as much memory at the same performance.
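The amount of memory a TLB can cover without a page walk (often called its reach) is simply the number of entries multiplied by the page size. A quick calculation with the figures above; the helper `reach` and the constant names are only illustrative:

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# L2 data TLB entry counts from the text, keyed by page size in bytes.
L2_DTLB = {4 * KB: 1024, 2 * MB: 128, 1 * GB: 16}
STARS_L2_DTLB_4K_ENTRIES = 512


def reach(entries, page_bytes):
    """Bytes of address space covered by the given number of entries."""
    return entries * page_bytes


for page_bytes, entries in L2_DTLB.items():
    print(page_bytes, entries, reach(entries, page_bytes))

print(reach(STARS_L2_DTLB_4K_ENTRIES, 4 * KB))
```

With 4KB pages this gives 4MB of reach for Llano against 2MB for the Stars core, while the 2MB and 1GB entries cover 256MB and 16GB respectively.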
