This techerature is about cache memories. If you have ever wondered what cache memories are, why they are needed, or how they work, you are in the right place.
Motivation for this Techerature
There is a ton of information on cache memories scattered across the Internet. So why one more? The problem with most of it is very common: the material usually consists of very high-level descriptions, which ordinary readers may not be able to follow. The other problem is that it usually fails to let the reader appreciate, objectively, why such memories are needed at all. And even where the need is established, the explanation of how they work fails to connect with readers who have little or no prior knowledge of the subject. This techerature will hopefully give the average reader a full understanding of the need for, structure of, and functionality of these wonderful devices. And it is written in plain English, with no jargon.
Cache-Memories - 1 : Need of Cache Memory, What is Cache Memory, What is Cache-Hit, What is Cache-Miss
First of all, the tutorial establishes the need for cache memories, i.e. why cache memories are needed at all, with a very simple example. It then attempts to explain more complex cache memories in very simple language, with detailed diagrams and an explanation of their working.
Consider the following example of a very simple System on Chip design with a microprocessor such as an ARM Cortex-M3. The SoC will have software for the processor to execute. Let us say that this software is stored in a non-volatile memory, say a Flash memory, as shown in the figure above.
The Problem: The Flash memory can either be integrated within the SoC (e-flash or embedded flash, as it is called) or, for smaller technology nodes, i.e. smaller than 28nm, the system software is usually stored off the main SoC/Microcontroller in a separate Flash memory IC. In either case, the time taken by the processor to access the memory is significant. This is because Flash memories are quite slow compared to SRAMs, for example. And if the Flash is off the main SoC, the access times are even worse. Due to these large access times, the processor's performance is compromised. This is the problem statement here: a fast processor is being let down by slow memory.
A more elaborate example:
The Processor: The processor on the SoC is an ARM Cortex-M3 processor which can run at, say, 100 MHz. This means a clock cycle period of 10 ns, and hence the processor is able to execute 1 instruction every 10 ns.
The Memory: The way the SPI-Flash memory works is that it needs a 'ritual' of things to be done before it can send back read data. First the chip select must be asserted, followed by a series of steps, i.e. sending the command, the address, etc., and finally the SPI-Flash starts sending the data corresponding to the address. In summary, it can be hundreds of ns before the SPI-Flash sends the data corresponding to the address. Once the initial 'ritual' is performed, data can be fetched sequentially from the start address to any length. This means it is efficient to read multiple data bytes once the initial 'ritual' is performed. On the other hand, if we read just one byte or word at a time and then de-assert the chip-select, the whole 'ritual' must be performed again before the next data byte/word may be fetched. A bit more about SPI/QSPI is here.
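The 'ritual' described above can be sketched as a sequence of steps. Here is a minimal illustration in Python; the 0x03 opcode is the common SPI-NOR 'Read Data' command, but the `spi` bus object and its methods are hypothetical stand-ins, not any real driver API:

```python
# Illustrative sketch of the SPI-Flash read 'ritual' described above.
# The `spi` object and its assert_cs/write/read methods are hypothetical.

READ_CMD = 0x03  # common SPI-NOR 'Read Data' opcode

def spi_flash_read(spi, address, nbytes):
    spi.assert_cs()                      # 1. assert chip select
    spi.write([READ_CMD])                # 2. send the read command
    spi.write([(address >> 16) & 0xFF,   # 3. send the 24-bit address,
               (address >> 8) & 0xFF,    #    most significant byte first
               address & 0xFF])
    data = spi.read(nbytes)              # 4. data now streams out sequentially
    spi.deassert_cs()                    # 5. de-asserting CS ends the ritual;
    return data                          #    the next read pays it all again
```

The point to notice is that steps 1-3 are a fixed overhead paid once per chip-select cycle, while step 4 can keep streaming sequential bytes for as long as chip select stays asserted.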
The processor runs a lot faster; the memory is a lot slower. When the processor requests one 32-bit word at a time, it would be desirable to get a few more 'anticipatory' 32-bit data words from the SPI-Flash and store them somewhere, so that, if the following request from the processor is a sequential fetch from the next address location, the SPI-Flash 'ritual' can be avoided. This arrangement is shown in the figure below. Notice the '256 bit Register + Control Logic' between the processor and the SPI-controller.
Figure-2: Simple System with Processor, SPI-Memory Controller,
and SPI Flash Memory with 256-bit register+control Logic
Using the scheme of Figure 2 shown above, when the processor sends its first access, i.e. ACCESS0 @ address 0, it waits for a long time to get the data back, as the full SPI 'ritual' must be performed. However, if the following accesses from the processor, e.g. ACCESS1, ACCESS2, ACCESS3, are at address 1, address 2 and address 3 sequentially, then for these requests the processor will not wait as long as it had to for ACCESS0, because the '256 bit reg + controller' logic fetches the locations at these addresses in anticipation and stores them in the 256-bit register, and hence the initial SPI 'ritual' is avoided for these subsequent accesses. The processor will still 'stall' between accesses, but the 'stall' times are clearly reduced.
Worked-out example:
SPI initial 'ritual' time (Tr) = 1000 ns
SPI time to fetch one 32-bit data word following the ritual (Ta) = 200 ns
Time needed to get 8 32-bit data words in Figure 1 (Ttf1) = 8 x (Tr + Ta) = 8 x (1000 + 200) ns = 9600 ns
Time needed to get 8 32-bit data words in Figure 2 (Ttf2) = Tr + 8 x Ta = (1000 + 8 x 200) ns = 2600 ns
It is very clear how the introduction of the 256-bit register + controller has helped the processor reduce average access times and run a lot faster. The controller is the one which fetches the 'anticipatory' data words following the first request from the processor and stores them in the 256-bit register. Now consider one more aspect. Once this 256-bit register has fetched the contents from the SPI-Flash, it may save them for future use. Then, any time the processor fetches data from the addresses for which this 256-bit register already has the data, external SPI accesses can be avoided completely. The time to access 8 32-bit words in this case will be 10 ns x 8 = 80 ns. This is in-crrrrreee-diiiiib-le.
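The three access-time figures above can be reproduced with a few lines of arithmetic, using exactly the numbers from the worked-out example:

```python
# Timings from the worked-out example above (all in nanoseconds).
Tr = 1000    # SPI initial 'ritual' time
Ta = 200     # SPI time per 32-bit word after the ritual
Tclk = 10    # processor clock period (100 MHz)

# Figure 1: every word pays the full ritual.
Ttf1 = 8 * (Tr + Ta)

# Figure 2: one ritual, then 8 sequential word fetches.
Ttf2 = Tr + 8 * Ta

# All 8 words already held in the 256-bit register: one clock each.
Thit = 8 * Tclk

print(Ttf1, Ttf2, Thit)  # 9600 2600 80
```

Going from 9600 ns to 2600 ns to 80 ns shows why each refinement matters: first the ritual is amortized over a block, then it is eliminated entirely on a repeat access.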
In a hypothetical example, if the code the processor executes takes less than 256 bits to store, then the entire code can be stored in this 256-bit register and the processor will run very fast.
Hence, there are 2 advantages of this '256 bit register + control logic':
i). It helps in reducing average access times from the off-chip Flash memory.
ii). If the code that the processor intends to run can be found in this 256-bit register, the external fetches can be avoided for the portion of the code that is stored there, and this portion of the code will run extremely fast.
Note that the second advantage is the one which is exploited more frequently, and it is the one that cache memory typically tries to address. This brings up a lot of questions, e.g. which 256 bits should be kept in the register, and when should these 256 bits be replaced/renewed from the SPI-Flash? These will be addressed in later chapters.
Now the question is: how and when should the 'control logic' around this 256-bit register issue a read request to the Flash memory? The answer is: when it does not have the data corresponding to the address the processor has asked for. So the control logic must first determine whether it has the data corresponding to the address the processor has issued. This means that, in addition to maintaining the 256 bits of data, the control logic must also store the addresses (or a part of the address bits, as we will later learn) for which it has data. Upon receiving a transaction request (i.e. a read-data request or a write-data operation) from the processor, the control logic searches for the transaction address within its address storage. If the control logic determines that the 256-bit register has the data corresponding to the address issued by the processor, it returns that data; if the 256-bit register does not have the corresponding data, it issues a read request to the Flash memory and gets 8 32-bit words.
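The decision flow just described — check the stored address, return the data on a match, otherwise fetch a fresh block of 8 words — can be sketched in a few lines. This is only an illustration of the idea, not real hardware: the `SimpleCache` class and the `fetch_block` callback (standing in for the slow SPI 'ritual') are made-up names.

```python
# Illustrative model of the '256 bit register + control logic'.
# 8 x 32-bit words of data, plus the block address they belong to.

class SimpleCache:
    WORDS = 8  # 8 x 32 bits = 256 bits

    def __init__(self, fetch_block):
        self.fetch_block = fetch_block  # slow backing-store read (the 'ritual')
        self.base = None                # word address of the stored block
        self.data = [0] * self.WORDS

    def read(self, addr):
        base = addr - (addr % self.WORDS)   # start of the 8-word block
        if self.base == base:               # data already present: fast path
            return self.data[addr - base]
        # data not present: perform the slow fetch of 8 sequential words
        self.data = self.fetch_block(base, self.WORDS)
        self.base = base
        return self.data[addr - base]
```

With this in place, eight sequential reads from addresses 0 to 7 trigger only one slow `fetch_block` call; the remaining seven reads are served straight from the register.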
In the above example, the 256-bit register is a simple 'cache memory', and the control logic is a simple cache controller. The storage space for the addresses corresponding to which the cache memory has data is called the 'tag' memory (not shown here). The functionality of the cache controller is not being discussed at the moment; it will be covered in later chapters.
Why do we need cache memory?
It is clear from the above example and explanation that cache memory is needed to help speed up processor execution from slow memories, while keeping costs under control. Theoretically, it is possible to have a RAM or even a ROM the size of the Flash memory, not have a cache at all, and store everything in that on-chip RAM/ROM, close to the processor. This would make processing very fast, but the cost would be way too high. Remember that the per-bit cost of on-chip ROM/RAM is much higher than the per-bit cost of storage in off-chip Flash. The cache memory is a compromise: it helps the processor run faster, lets the system have lots of low-cost storage (e.g. SPI-Flash), and helps keep the total system cost in check.
If the process of searching the addresses for which corresponding data may be present in the cache results in a positive, that is, the data is present in the cache memory, it is called a 'Cache Hit'. On the other hand, if the search results in a negative, that is, the data is not present in the cache memory, it is called a 'Cache Miss'. A 'Miss' will then trigger a read of the Flash memory, or whichever slower memory backs the cache.
Note that a cache memory has storage for the data and storage for all the corresponding addresses (or parts of addresses, as will become evident later) for which it has data. Hence a cache memory is just a couple of RAMs, i.e. the 'DATA RAM' and the 'TAG RAM', organized in a way which enables the functionality of the cache memory. While the data RAM stores the data, the tag RAM stores the addresses or parts of addresses.
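For the simple 256-bit cache in this example, the 'part of the address' stored as the tag can just be the upper address bits that identify which 8-word block is held, while the lowest bits select a word within that block. A small sketch of that split (assuming word addresses, with 8 words per block):

```python
# How a word address splits into tag and offset for an 8-word (256-bit) block.
# The lower 3 bits select one of the 8 words; the remaining bits form the tag.

OFFSET_BITS = 3  # 2**3 = 8 words per block

def split_address(word_addr):
    offset = word_addr & ((1 << OFFSET_BITS) - 1)   # word within the block
    tag = word_addr >> OFFSET_BITS                  # what the tag RAM stores
    return tag, offset
```

This is why the tag RAM does not need to store the full address: the offset bits are implied by a word's position inside the data RAM.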
This example explains a very simple cache memory, a 256-bit cache memory. For more elaborate example(s) and more cache fundamentals, read on.
Conclusion: What is a cache memory?
A cache memory is an intermediate memory, located between the processor and a relatively slower memory, that helps the processor speed up its execution. In the example above the slower memory is Flash memory, but the slower memory does not have to be Flash; it can be any memory that is slower than the memory closer to the processor.