Cache Memory Tutorial - 1
-Aviral Mittal (avimit att yhaoo dat cam)

Connect @ https://www.linkedin.com/in/avimit/

This techerature is about cache memories. If you have ever wondered what cache memories are, why they are needed, or how they work, you are at the right place.

Motivation for this Techerature

There is a ton of information on cache memories scattered all over the Internet. So why one more? The problem with most of it is a common one: the descriptions are usually very high level, and ordinary readers may not be able to follow them. The other problem is that they usually fail to let the reader appreciate objectively why such memories are needed. Even when the need is established, the explanation of the working mechanism fails to connect with readers who have little or no prior knowledge of the subject. This techerature will hopefully help the average reader fully understand the need for, the structure of, and the functionality of these wonderful devices. It is written in plain English, with no jargon.

Cache-Memories - 1: The need for cache memory, what a cache memory is, what a cache hit is, and what a cache miss is.
First of all, the tutorial establishes the need for cache memories with a very simple example: why are they needed at all?
It then attempts to explain more complex cache memories in very simple language, with detailed diagrams and their working principles.

Consider the following example of a very simple System on Chip design with a microprocessor such as an ARM Cortex-M3.



Figure-1: Simple System with Processor, SPI-Memory Controller, and SPI Flash Memory


The SoC will have software for the processor to execute. Let us say that this software is stored in a non-volatile memory, say a Flash Memory, as shown in the above diagram.
The Problem: The Flash Memory can be integrated within the SoC (e-flash, or embedded flash, as it is called). For smaller technology nodes, i.e. smaller than 28 nm, the system software is usually stored outside the main SoC/microcontroller in a separate Flash Memory IC. In either case, the time taken by the processor to access the memory is significant, because Flash memories are quite slow compared to SRAMs, for example. If the flash is off the main SoC, the access times are even worse. These large access times compromise the processor's performance. This is the problem statement: a fast processor is being let down by a slow memory.

More elaborate example:
The Processor: The processor on the SoC is an ARM Cortex-M3, which can run at, say, 100 MHz. This means a clock period of 10 ns, so, ideally, the processor is able to execute 1 instruction every 10 ns.
The Memory: Before an SPI-Flash memory can send back read data, it needs a 'ritual' to be performed. First the chip select must be asserted, followed by a series of steps, i.e. sending the command, the address, etc., and only then does the SPI-Flash start sending the data corresponding to the address. In summary, it can take 100s of ns before the SPI-Flash sends the first data. Once the initial 'ritual' has been performed, data can be fetched sequentially from the start address for any length. This means it is efficient to read multiple data bytes once the initial 'ritual' has been performed. On the other hand, if we read just one byte or word at a time and de-assert the chip select, the whole 'ritual' must be performed again before the next data byte/word can be fetched. A bit more about SPI/QSPI is here.

The processor runs a lot faster than the memory. When the processor requests one 32 bit word, it is desirable to fetch a few more 32 bit 'anticipatory' data words from the SPI-Flash and store them somewhere, so that if the next request from the processor is a sequential fetch from the next address location, the SPI-Flash 'ritual' can be avoided. This arrangement is shown in the figure below. Notice the '256 bit Register + Control Logic' between the processor and the SPI controller.


Figure-2: Simple System with Processor, SPI-Memory Controller, and SPI Flash Memory with 256-bit register+control Logic

Using the scheme of Figure 2 shown above, when the processor sends its first access, i.e. ACCESS0 @ address 0, it waits a long time to get the data back, as the full SPI 'ritual' must be performed. However, if the following accesses from the processor, e.g. ACCESS1, ACCESS2 and ACCESS3, are to address 1, address 2 and address 3 sequentially, the processor will not wait as long as it did for ACCESS0. This is because the '256 bit reg + controller' logic fetches the contents of these addresses in anticipation and stores them in the 256 bit register, so the initial SPI 'ritual' is avoided for these subsequent accesses. The processor will still 'stall' between accesses, but the 'stall' times are clearly reduced.

Worked-out example:
SPI initial 'ritual' time (Tr) = 1000 ns
SPI time to fetch one 32 bit data word following the ritual (Ta) = 200 ns
Time needed to get eight 32 bit data words in Figure 1 (Ttf1):
Ttf1 = 8 x (Tr + Ta) = 8 x (1000 + 200) ns = 9600 ns
Time needed to get eight 32 bit data words in Figure 2 (Ttf2):
Ttf2 = Tr + 8 x Ta = (1000 + 8 x 200) ns = 2600 ns

It is very clear how the introduction of a 256 bit register + controller helps the processor reduce its average access time and run a lot faster. The controller is the one which fetches the 'anticipatory' data words following the first request from the processor, and stores them in the 256 bit register.
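The same arithmetic can be written as a small Python sketch that works for any number of words. The function names are made up for this illustration, and the timing values are simply the example figures used above, not measurements of a real SPI flash.

```python
TR_NS = 1000  # SPI initial 'ritual' time (Tr) from the worked example
TA_NS = 200   # SPI time per 32 bit word after the ritual (Ta)


def time_without_register(num_words, tr=TR_NS, ta=TA_NS):
    """Figure 1: every word pays for the full ritual plus its own transfer."""
    return num_words * (tr + ta)


def time_with_register(num_words, tr=TR_NS, ta=TA_NS):
    """Figure 2: one ritual, then the words stream out back to back."""
    return tr + num_words * ta


ttf1 = time_without_register(8)
ttf2 = time_with_register(8)
print(ttf1, ttf2)             # 9600 2600
print(round(ttf1 / ttf2, 2))  # 3.69
```

For a single word both functions give the same 1200 ns, so the register only pays off when the following words are actually used.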

Consider one more aspect. Once this 256 bit register has fetched the contents from the SPI flash, it may keep them for future use. Now, any time the processor fetches data from addresses whose data is held in this 256 bit register, external SPI accesses can be avoided completely. Assuming the register can deliver one word per 10 ns clock cycle, the time to access eight 32 bit words in this case will be 10 ns x 8 = 80 ns. This is incredible. In a hypothetical example, if the code the processor executes takes no more than 256 bits to store, then the entire code can be held in this 256 bit register and the processor will work very fast.
Hence, there are 2 advantages of this '256 bit register + controller'.
i). It helps in reducing average access times from the off-chip SPI-Flash.
ii). If the code that the processor intends to run can be found in this 256 bit register, external fetches can be avoided for that portion of the code, and this portion of the code will run extremely fast.
Note that the second advantage is the one which is exploited more frequently, and it is the one a cache memory typically tries to address.
This raises a lot of questions, e.g. which 256 bits should be kept in the register, when these 256 bits should be replaced/renewed from the SPI-Flash, etc. These will be addressed in later chapters.
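To get a feel for how quickly keeping the data pays off, here is a rough Python sketch. It assumes the same eight words are fetched again and again (for instance a tiny loop, which is an assumption of this sketch), that a word read from the register costs one 10 ns clock, and it reuses the Tr and Ta values from the worked example. The function name is hypothetical.

```python
TR_NS = 1000    # SPI 'ritual' time from the worked example
TA_NS = 200     # SPI time per 32 bit word after the ritual
CLOCK_NS = 10   # one clock at 100 MHz; assumed cost of reading the register
WORDS = 8       # the 256 bit register holds 8 words


def total_time_ns(passes, reuse_register):
    """Time to fetch the same 8 words `passes` times.

    With reuse, only the first pass goes to the SPI flash; later passes
    are served from the 256 bit register. Without reuse, every pass
    repeats the full SPI burst.
    """
    spi_burst = TR_NS + WORDS * TA_NS
    if passes == 0:
        return 0
    if reuse_register:
        return spi_burst + (passes - 1) * WORDS * CLOCK_NS
    return passes * spi_burst


for passes in (1, 10, 100):
    print(passes, total_time_ns(passes, False), total_time_ns(passes, True))
```

With 100 passes the version that reuses the register needs 10520 ns instead of 260000 ns, because 99 of the passes cost only 80 ns each.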

Now the question is how and when the 'control logic' around this 256 bit register should issue a read request to the Flash memory. The answer is: when it does not have the data corresponding to the address the processor has asked for. So the control logic must first determine whether it has the data corresponding to the address the processor has issued. This means that, in addition to maintaining the 256 bits of data, the control logic must also store the addresses (or a part of the address bits, as we will learn later) for which it has data. Upon receiving a transaction request (i.e. a read data request or a write data operation) from the processor, the control logic searches for the transaction address within its address storage. If the control logic determines that the 256 bit register has the data corresponding to the address issued by the processor, it returns that data. If the 256 bit register does not have the corresponding data, the control logic issues a read request to the Flash memory and gets eight 32 bit words.
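The decision described above can be sketched in a few lines of Python. This is only a behavioural model under stated assumptions, not how real hardware is built: the class name is made up, the flash is a plain list, addresses are word addresses as in ACCESS0 to ACCESS3, only reads are handled, and a refill is assumed to fetch the aligned group of 8 words containing the requested address.

```python
WORDS_PER_LINE = 8  # 256 bits = 8 words of 32 bits


class OneLineCache:
    """A 256 bit register plus the control logic described above.

    Assumption: a refill always fetches the aligned group of 8 words
    that contains the requested word address.
    """

    def __init__(self, flash_words):
        self.flash = flash_words   # stand-in for the SPI flash contents
        self.base = None           # stored address of the first word held
        self.data = []             # the 8 words currently held
        self.flash_reads = 0       # number of read requests sent to flash

    def read(self, addr):
        base = addr - (addr % WORDS_PER_LINE)
        if base != self.base:
            # Data for this address is not held: issue one flash read.
            self.data = self.flash[base:base + WORDS_PER_LINE]
            self.base = base
            self.flash_reads += 1
        return self.data[addr - base]


flash = [0x1000 + i for i in range(32)]  # made-up flash contents
cache = OneLineCache(flash)
values = [cache.read(a) for a in (0, 1, 2, 3, 9, 0)]
print([hex(v) for v in values], cache.flash_reads)
```

Six processor reads result in only three flash reads here: address 0 fills the register, addresses 1 to 3 are served from it, address 9 replaces its contents, and the final read of address 0 has to go back to the flash.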

In the above example, the 256 bit register is a simple 'cache memory' and the control logic is a simple cache controller. The storage space for the addresses for which the cache memory has data is called the 'tag' memory (not shown here). The functionality of the cache controller is not discussed at this point; it will come later.

Why do we need cache memory?
It is clear from the above example and explanation that cache memory is needed to help speed up processor execution from slow memories, while keeping costs under control. Theoretically, it is possible to have an on-chip RAM or even a ROM of the size of the flash memory, close to the processor, store everything in it, and not have a cache at all. This would make processing very fast, but the cost would be way too high. Remember that the cost per bit of on-chip ROM/RAM is much higher than the cost per bit of storage in off-chip flash. The cache memory is a compromise, which helps the processor run faster, lets the system have lots of low-cost storage (e.g. SPI-Flash), and helps keep the total system cost in check.

Cache Hit
If the search of the addresses for which the cache holds data gives a positive result, that is, the data is present in the cache memory, it is called a 'Cache Hit'.
Cache Miss
On the other hand, if the search gives a negative result, that is, the data is not present in the cache memory, it is called a 'Cache Miss'. A 'Miss' will then trigger a read of the Flash memory, or whatever the slower memory is.
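As a rough illustration, the Python sketch below labels each access in a made-up sequence of word addresses as a hit or a miss and adds up the time. It assumes a cache that holds one aligned group of 8 words, that a hit costs one 10 ns clock, and that a miss costs a full refill of Tr + 8 x Ta from the worked example. The function name and the address sequence are hypothetical.

```python
WORDS_PER_LINE = 8                     # the 256 bit register holds 8 words
HIT_NS = 10                            # assumed: a hit takes one 100 MHz clock
MISS_NS = 1000 + WORDS_PER_LINE * 200  # Tr + 8 x Ta from the worked example


def classify(trace):
    """Label each word address in `trace` as 'hit' or 'miss'.

    Models a cache holding a single aligned group of 8 words;
    the alignment is an assumption of this sketch.
    """
    held_base = None
    outcomes = []
    for addr in trace:
        base = addr - (addr % WORDS_PER_LINE)
        if base == held_base:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            held_base = base  # the miss triggers a refill from flash
    return outcomes


outcomes = classify([0, 1, 2, 3, 8, 9, 0])
total_ns = sum(HIT_NS if o == "hit" else MISS_NS for o in outcomes)
print(outcomes)
print(total_ns)
```

This sequence gives 3 misses and 4 hits, for a total of 7840 ns, almost all of which is spent on the misses.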

Note that a cache memory has storage for data and storage for all the corresponding addresses (or parts of addresses, as will become evident later) for which it has data. Hence a cache memory is just a couple of RAMs, i.e. the 'DATA RAM' and the 'TAG RAM', organized in a way which enables the functionality of the cache memory. While the data RAM stores the data, the tag RAM stores the addresses or parts of the addresses.
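One possible way to picture the two RAMs, and why only a part of the address needs to be stored, is the Python sketch below. It assumes byte addresses and a single 32 byte (256 bit) line, so the low 5 address bits select a position inside the line and only the remaining upper bits go into the tag. The names and the data values are made up for the illustration.

```python
LINE_BYTES = 32    # 256 bits
WORD_BYTES = 4     # one 32 bit word


def split_address(byte_addr):
    """Split a byte address into (tag, word_index) for one 32 byte line.

    Assumption: the tag is simply the address with the low 5 bits
    (the position inside the line) removed.
    """
    tag = byte_addr // LINE_BYTES
    word_index = (byte_addr % LINE_BYTES) // WORD_BYTES
    return tag, word_index


tag_ram = [None]        # one entry: which line is currently held
data_ram = [0] * 8      # eight 32 bit words

# Pretend the line starting at byte address 0x40 has been fetched.
tag_ram[0], _ = split_address(0x40)
data_ram[:] = [0xA0 + i for i in range(8)]

tag, idx = split_address(0x4C)
print(tag == tag_ram[0], hex(data_ram[idx]))  # True 0xa3
```

Address 0x4C lies in the same line as 0x40, so its tag matches the stored one and the word is read from position 3 of the data RAM.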

This example explains a very simple cache memory, a 256 bit cache memory. For more elaborate example(s) and more cache fundamentals, see the next chapters.
Conclusion: What is a cache memory?
A cache memory is an intermediate memory located between the processor and a relatively slower memory, to help the processor speed up its execution. In the example above the slower memory is Flash Memory, but it does not have to be Flash; it can be any memory which is slower than the memory closer to the processor.
 

Apologies, it's still under construction....



NEXT => Spatial Locality And Temporal Locality

Cache Organizations:
Direct-Mapped-Cache
4-Way-Set-Associative Cache
2-Way-Set-Associative Cache

