Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream and detect and prefetch nonunit strides. In contrast, software prefetching can determine nonunit strides as easily as it can determine unit strides. Assume that prefetches write directly into the cache and that there is no pollution (that is, prefetched data never overwrites data that must be used before the prefetched data is used).
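To make the idea of a "nonunit stride" prefetcher concrete, here is a minimal sketch of one possible detection rule: a stride is treated as confirmed once the same address difference appears between two consecutive pairs of misses. The function name `stride_prefetches`, the `degree` parameter, and the two-in-a-row confirmation rule are illustrative assumptions, not something the exercise specifies, and real hardware typically tracks strides per instruction or per stream.

```python
def stride_prefetches(miss_addrs, degree=1):
    """Toy stride detector (hypothetical): for each miss address, list the
    addresses a prefetcher would request once a stride has been confirmed."""
    last_addr = None
    last_stride = None
    issued = []
    for addr in miss_addrs:
        targets = []
        if last_addr is not None:
            stride = addr - last_addr
            # Confirm a stride only after seeing it twice in a row.
            if stride != 0 and stride == last_stride:
                targets = [addr + stride * k for k in range(1, degree + 1)]
            last_stride = stride
        last_addr = addr
        issued.append((addr, targets))
    return issued
```

For example, a miss stream that walks down one column of a row-major matrix with 2048-byte rows (an assumed row size) gives a constant stride of 2048, and the sketch starts prefetching on the third miss. A sequential-only prefetcher would instead fetch the next cache block, which such a column walk does not use.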
c) For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches need to be outstanding at a given time? 
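One way to reason about part (c) is a Little's-law style estimate: to hide the miss latency completely, the number of prefetches in flight must cover all the misses the loop generates during one miss latency. The sketch below encodes that estimate. The function name and the numbers in the usage note are placeholders, since this excerpt gives neither the memory latency nor the cost of a loop iteration.

```python
import math

def outstanding_prefetches(miss_latency_cycles, cycles_per_iteration,
                           misses_per_iteration=1.0):
    """Estimate how many prefetches must be in flight to hide miss latency.

    All three parameters are assumptions supplied by the caller; they are
    not given in the exercise text shown here.
    """
    # Iterations that complete while one prefetch is still outstanding.
    iterations_in_flight = miss_latency_cycles / cycles_per_iteration
    # Each of those iterations generates this many new misses on average.
    return math.ceil(iterations_in_flight * misses_per_iteration)
```

With an assumed 100-cycle miss latency and 8 cycles per iteration, a stream that misses on every iteration (the nonunit stride side of the transpose) would need `outstanding_prefetches(100, 8)` = 13 prefetches in flight, while a unit-stride stream that misses once every 8 iterations would need `outstanding_prefetches(100, 8, 0.125)` = 2. The steady-state total would be the sum over both streams.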
