This is a greatly simplified explanation of how a processor reads from memory, how the performance flaw affects those reads, and how to bypass the flaw. It is intended for casual programmers and, hopefully, for advanced users. Serious programmers should go to the in-depth technical analysis.

Modern processors read from memory or the secondary cache in bursts. A burst (also known as a line fill) is a series of 4 successive contiguous reads. Processors read in bursts rather than in single accesses because bursts are quicker: if a single read takes 5 cycles, a burst read takes 5+2+2+2 = 11 cycles, that is 11/4 = 2.75 cycles per read. On this page we will assume that a burst takes 5+2+2+2 = 11 cycles.

The size of each burst is fixed at 4 transfers of 8 bytes each, 32 bytes in total.
All cycle counts on this page are for a Pentium with a VX/HX/TX chipset and EDO memory. Slightly different timings apply to other configurations. (The read accesses on this page have been simplified to 64-bit ones.)
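
The burst arithmetic above can be checked with a few lines of Python. This is only a sketch of the numbers already given on this page; the variable names are ours, and the 20-cycle figure assumes that four isolated reads would each cost the full 5 cycles.

```python
# Per-transfer cycle counts of one burst, as assumed on this page (5+2+2+2).
BURST_TIMING = (5, 2, 2, 2)
BYTES_PER_TRANSFER = 8  # each transfer is simplified to a 64-bit read

burst_cycles = sum(BURST_TIMING)                      # cycles for the whole burst
cycles_per_read = burst_cycles / len(BURST_TIMING)    # average cost of one read
burst_bytes = BYTES_PER_TRANSFER * len(BURST_TIMING)  # bytes moved by one burst

# Assumption: four separate single reads would each cost the first-read time.
single_reads_cycles = BURST_TIMING[0] * len(BURST_TIMING)

print(burst_cycles, cycles_per_read, burst_bytes, single_reads_cycles)
```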

A' Conventional reading:
The conventional (normal) method of reading, searching or transferring is the sequential one, which is also the method recommended by Intel.

1' First read:

The first read is performed. It takes 5 cycles.

2' Second read:

According to Intel's data sheet, the 2nd read cannot be served until the whole burst has finished, that is after the remaining 2+2+2 = 6 cycles. But as we found out, it takes longer than that: there is also a mysterious undocumented penalty when a burst line is accessed before it has been completely loaded! Therefore the delay will be 2+2+2+penalty cycles.

3' Third and Fourth reads:

Because the whole burst is already loaded (and stored in the primary cache), the remaining 2 reads incur no delay. It should be noted that the processor's bus is idle during this time: nothing is read from memory or the secondary cache while the 3rd and 4th reads are made.

The total time taken by the conventional method is 5+2+2+2+penalty cycles, plus a few cycles for the program's instructions to execute. This results in a total of about 17 cycles per burst, which is about 117 Mbytes per second.
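
The cycle budget of the conventional method can be sketched as follows. The helper `conventional_cycles` and its parameter names are hypothetical. This page does not give the size of the penalty or of the instruction overhead separately, so the sketch only shows how much of the quoted total they account for together.

```python
BURST_TIMING = (5, 2, 2, 2)  # cycles per transfer, as assumed on this page
TOTAL_PER_BURST = 17         # approximate total quoted for the conventional method

def conventional_cycles(penalty, instruction_overhead):
    """Cycles for one burst read sequentially (hypothetical helper).

    penalty: the undocumented penalty; its value is not given on this page.
    instruction_overhead: cycles spent executing the program's instructions.
    """
    first_read = BURST_TIMING[0]                    # 5 cycles
    second_read = sum(BURST_TIMING[1:]) + penalty   # wait for 2+2+2, then the penalty
    later_reads = 0                                 # 3rd and 4th reads hit the primary cache
    return first_read + second_read + later_reads + instruction_overhead

# Cycles that the quoted total leaves for penalty and instruction overhead together.
unexplained = TOTAL_PER_BURST - sum(BURST_TIMING)
print(unexplained)
```

Any split of those remaining cycles between the two parameters reproduces the quoted total, for example `conventional_cycles(4, 2)`; that particular split is an illustration, not a measured value.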

B' Innovative reading:
The unconventional method we discovered is non-sequential.

1' First read:

As in the conventional method, the first read is performed. It takes 5 cycles.

2' Second read:

The second read is not done at address 8 but at address 32, that is at the start of the next burst! Of course this second read can only be served after the whole current burst has finished, that is after 2+2+2 = 6 cycles. It then takes 3 cycles instead of 5 to be performed, because Intel's chipsets have a special case: when a burst is initiated immediately after another one has finished, it is treated as an extension of the previous one and takes only 3 cycles! With the conventional method this cannot happen, because of the processor's penalty. Here there is no penalty, because no access is made to the current burst while it is loading. Total delay: 6+3 = 9 cycles.

3' Third, fourth and fifth reads: 

Because the whole first burst is already loaded (and stored in the primary cache), the remaining 3 reads incur no delay. It should be noted that while the processor makes these 3 reads, it continues to load the 2nd burst, so no bus time is lost. This method exploits the processor's parallelism to its maximum: all the instructions of the program run while the processor loads bursts from outside, that is, the bus operates at its maximum bandwidth!
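
The order of the reads described in steps 1' to 3' can be sketched in Python as a list of byte offsets. The function name `innovative_order` is hypothetical, and continuing the same pattern beyond the second burst is our assumption; this page only walks through the first five reads.

```python
BURST_BYTES = 32  # 4 transfers of 8 bytes
READ_BYTES = 8    # each read is simplified to a 64-bit access

def innovative_order(n_bursts):
    """Byte offsets in the order the non-sequential method would read them."""
    if n_bursts <= 0:
        return []
    order = [0]  # first read: start of the first burst
    for burst in range(n_bursts):
        base = burst * BURST_BYTES
        if burst + 1 < n_bursts:
            # Touch the start of the next burst before finishing this one.
            order.append(base + BURST_BYTES)
        # Then read the remaining three 8-byte words of the current burst.
        order.extend(base + k * READ_BYTES for k in range(1, 4))
    return order

print(innovative_order(2))  # [0, 32, 8, 16, 24, 40, 48, 56]
```

Every offset is still read exactly once; only the order differs from the sequential 0, 8, 16, 24, 32, ... of the conventional method.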

The total time taken by the innovative method is 5+2+2+2+3 = 14 cycles for one full burst and the start of the next one. Burst 1 takes 5+2+2+2 = 11 cycles and burst 2 takes 3+2+2+2 = 9 cycles; that is 10 cycles per burst on average over these two bursts, which is exactly 200 Mbytes per second!
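
The averages and the two throughput figures on this page can be cross-checked with a short sketch. Both helper functions are hypothetical. Since the clock rate is not stated here, the sketch scales from the quoted 200 Mbytes per second rather than assuming a bus frequency.

```python
def average_cycles_per_burst(first=11, following=9, n_bursts=2):
    """Average over n_bursts when only the first burst pays the full 5+2+2+2."""
    return (first + following * (n_bursts - 1)) / n_bursts

def scaled_throughput(ref_mbytes_per_s, ref_cycles, cycles):
    """Throughput at `cycles` per burst, scaled from a known reference point."""
    return ref_mbytes_per_s * ref_cycles / cycles

innovative = average_cycles_per_burst()                # bursts of 11 and 9 cycles
conventional = scaled_throughput(200, innovative, 17)  # 17 cycles per burst
speedup = 17 / innovative

print(innovative, round(conventional, 1), speedup)
```

Under these figures the conventional method comes out at roughly 117.6 Mbytes per second, in line with the value of about 117 quoted for it above, and the non-sequential method is about 1.7 times faster.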

Comments and feedback are welcome.

Everything at this web site is the property of Intelligent Firmware Ltd. You may not repost/publish this information without our explicit permission.