Department of Computer and IT Engineering University of Kurdistan Computer Architecture

Memory Structure
By: Dr. Alireza Abdollahpouri

Main Memory
• Main memory is built from fast circuits and holds the programs and data needed during execution.
• It comes in two types:
  • ROM
  • RAM

ROM
• A read-only memory: its contents are written once and are not changed after it is installed in the computer.
• It usually stores programs such as the bootstrap loader, which are needed to start up the computer.
• It comes in several types:
  • Masked ROM
  • PROM
  • EPROM
  • EEPROM

RAM
• Most of a computer's main memory is built from RAM.
• Two kinds of RAM are commonly used in computers:
  • DRAM: Dynamic Random Access Memory
    • High density, low power, cheap, slow
    • Dynamic: needs to be "refreshed" regularly
  • SRAM: Static Random Access Memory
    • Low density, high power, expensive, fast
    • Static: content lasts "forever" (until power is lost)

Structure of an SRAM Memory Cell: 6-Transistor SRAM Cell
[Figure: SRAM cell with a word line (row select) and two complementary bit lines, bit and bit-bar.]
• Write:
  1. Drive bit lines (bit = 1, bit-bar = 0)
  2. Select row
• Read:
  1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amplifier on column detects the difference between bit and bit-bar

Structure of DRAM
[Figure: DRAM cell with a row select line, a bit line, and a storage capacitor.]
• Write:
  1. Drive bit line
  2. Select row
• Read:
  1. Precharge bit line to Vdd/2
  2. Select row
  3. Cell and bit line share charges
     • Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     • Can detect changes of ~10^6 electrons
  5. Write: restore the value
• Refresh:
  1. Just do a dummy read to every cell.

Memory Block Diagrams
• 128 × 8 RAM: Chip Select 1 (CS1), Chip Select 2 (CS2), Read (RD), Write (WR), 7-bit address (AD), 8-bit data bus
• 512 × 8 ROM: Chip Select 1 (CS1), Chip Select 2 (CS2), 9-bit address (AD), 8-bit data bus

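Memory Address Map
• When a computer is designed, the amount of address space assigned to each of the RAM and ROM memories is fixed.
• Example: connecting one kilobyte of memory to a CPU (512 bytes of RAM and 512 bytes of ROM)
[Figure: connection diagram; address lines A8 and A7 feed a decoder (DEC).]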
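The address map in this example can be sketched in C. The slide states only the 512 B RAM / 512 B ROM split and shows A8 and A7 going to a decoder. Everything else here is an assumption made for illustration: four 128 × 8 RAM chips at 0x000-0x1FF, one 512 × 8 ROM at 0x200-0x3FF, and A9 choosing between RAM and ROM. The names `chip_select_t` and `decode_address` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (not fully specified on the slide):
 *   0x000-0x1FF : four 128 x 8 RAM chips, picked by a 2-to-4 decoder on A8..A7
 *   0x200-0x3FF : one 512 x 8 ROM chip, enabled when A9 = 1
 */
typedef struct {
    int is_rom;       /* 1 if the ROM is selected, 0 if a RAM chip is */
    int ram_chip;     /* 0..3 for RAM, -1 when the ROM is selected */
    uint16_t offset;  /* address presented to the selected chip */
} chip_select_t;

chip_select_t decode_address(uint16_t addr)
{
    assert(addr < 1024);                 /* only A9..A0 are used for 1 KB */
    chip_select_t sel;
    sel.is_rom = (addr >> 9) & 1;        /* A9 separates RAM from ROM */
    if (sel.is_rom) {
        sel.ram_chip = -1;
        sel.offset = addr & 0x1FF;       /* A8..A0: 9-bit ROM address */
    } else {
        sel.ram_chip = (addr >> 7) & 3;  /* A8..A7: decoder inputs */
        sel.offset = addr & 0x7F;        /* A6..A0: 7-bit RAM address */
    }
    return sel;
}
```

Under these assumptions, address 0x085 selects RAM chip 1 at offset 5, and address 0x2AB selects the ROM at offset 0x0AB.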

Connecting Memory to the Processor
[Figure: CPU chip (register file, ALU, bus interface), memory bus, memory interface, main memory modules.]

The Importance of Memory
• Why does memory matter?
• Memory access speed is a major bottleneck in raising computer performance.
[Figure: processor-memory performance gap; the gap grew 50% per year.]

Progress in Memory Technology
• While memory capacity has grown rapidly over the years, the time needed to access memory contents has improved far less.

DRAM
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

Improvement from 1980 to 1995: size 1000:1, cycle time 2:1.

Problem Statement
• Properties of memories:
  • Large memories are slow
  • Fast memories are small
• Question: how can we have a memory that is large, fast, and cheap?
  • Use a hierarchy
  • Parallel access

The Principle of Locality of Reference
• Analysis of computer programs shows that, within a given time interval, memory references for data or instructions tend to cluster around the data and instructions that were referenced recently.
• The reason is that programs execute instructions from consecutive locations, and while a loop runs its instructions are repeated over and over.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

The Principle of Locality of Reference
• Locality of reference takes two forms:
  • Temporal locality: data and instructions that were used recently will be referenced again soon.
  • Spatial locality: data or instructions stored at nearby memory locations will be referenced within a short time of each other.

Example
• Which of the following programs has better locality of reference?

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
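One rough way to compare the two loop orders is to count how often consecutive accesses land in different memory blocks. C stores a two-dimensional array in row-major order, so element `a[i][j]` sits at element index `i * n + j`. The sketch below is only a proxy for spatial locality, not a cache model; the function name `block_switches` and the block size parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Walks an m x n int array in row order (by_rows != 0) or column order
 * (by_rows == 0) and counts how many consecutive accesses fall in
 * different blocks of block_ints elements. Fewer switches suggests
 * better spatial locality. */
size_t block_switches(size_t m, size_t n, size_t block_ints, int by_rows)
{
    assert(m > 0 && n > 0 && block_ints > 0);
    size_t outer = by_rows ? m : n;
    size_t inner = by_rows ? n : m;
    size_t switches = 0;
    size_t prev_block = 0;
    int first = 1;

    for (size_t o = 0; o < outer; o++) {
        for (size_t k = 0; k < inner; k++) {
            size_t i = by_rows ? o : k;
            size_t j = by_rows ? k : o;
            size_t block = (i * n + j) / block_ints;  /* row-major index */
            if (!first && block != prev_block)
                switches++;
            prev_block = block;
            first = 0;
        }
    }
    return switches;
}
```

For a 4 × 8 array with 4-element blocks, the row-order walk changes block 7 times while the column-order walk changes block on every one of its 31 steps, which is why `sumarrayrows` is the friendlier loop for a cache.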

Memory Access Time
[Figure: timeline showing access time and cycle time.]
• Access time: the time between a request for data and the moment the data is ready.
• Cycle time: the time between two consecutive accesses.
• For example, a DRAM may have an access time of 60 ns, but once the time to move the address from the processor to memory, bus delays, and so on are included, its cycle time may reach 180-250 ns.

Cache Memory
• If the active portion of the program and data is kept in a small, fast memory, the average memory access time drops and the program runs faster.
• This small, fast memory is called a cache.
• A cache usually has a much shorter access time than main memory (5 to 10 times shorter).
• A computer may repeat the caching idea at several levels.

Cache introduction
• Today we'll answer the following questions.
  • What are the challenges of building big, fast memory systems?
  • What is a cache?
  • Why do caches work? (answer: locality)
  • How are caches organized?
    • Where do we put things and how do we find them?

An Example Memory Hierarchy
Toward the top: smaller, faster, and costlier (per byte) storage devices. Toward the bottom: larger, slower, and cheaper (per byte) storage devices.
• L0: registers. CPU registers hold words retrieved from L1 cache.
• L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
• L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
• L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
• L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
• L5: remote secondary storage (distributed file systems, Web servers).

How a Cache Operates
• When the CPU needs to access memory, it first searches the cache. If the data is found in this fast memory, it is used directly. Otherwise main memory is accessed, and a block containing the data the CPU needs is transferred from main memory to the cache.
[Figure: CPU, cache (512 x 12), main memory (32K x 12); word access between CPU and cache, block transfer between cache and main memory.]

Hit Ratio
• Cache performance is measured by a quantity called the hit ratio.
• The hit ratio is the number of references for which the data was found in the cache divided by the total number of memory references.
• The higher the hit ratio, the closer the memory access speed is to the speed of the cache.
• Example:
  • A computer with a cache access time of 100 ns, a main memory access time of 1000 ns, and a hit ratio of 0.9 has an average access time of 200 ns (0.9 × 100 + 0.1 × (100 + 1000) = 200 ns).

Cache, Hit/Miss Rate, and Effective Access Time
[Figure: CPU and register file exchange words with the cache (fast) memory; the cache exchanges lines with main (slow) memory.]
• Cache is transparent to the user; transfers occur automatically.
• Data is in the cache a fraction h of the time (say, a hit rate of 98%).
• Go to main memory 1 - h of the time (say, a cache miss rate of 2%).
• One level of cache with hit rate h:
  $C_{eff} = h C_{fast} + (1 - h)(C_{slow} + C_{fast}) = C_{fast} + (1 - h) C_{slow}$
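The single-level formula is easy to check numerically. The following is a minimal sketch; the function and parameter names are illustrative, and the times can be in nanoseconds or cycles as long as they are consistent.

```c
#include <assert.h>

/* Effective access time for one cache level:
 *   C_eff = C_fast + (1 - h) * C_slow
 * Every access pays the cache time; a fraction (1 - h) also pays the
 * main memory time. */
double effective_access_time(double hit_rate, double c_fast, double c_slow)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return c_fast + (1.0 - hit_rate) * c_slow;
}
```

With the numbers from the hit ratio example (100 ns cache, 1000 ns main memory, h = 0.9), `effective_access_time(0.9, 100.0, 1000.0)` comes out at about 200 ns.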

Example
• A computer has a two-level cache (L1, L2). The access latency is 1 ns for L1 and 10 ns for L2, and the main memory access time for one block is 100 ns. If the hit ratios of L1 and L2 are 90% and 50% respectively, what is the average memory access time?
  $t_{av} = t_{L1} + (1 - h_{L1}) t_{L2} + (1 - h_{L1})(1 - h_{L2}) t_m = 1 + 0.1 \times 10 + 0.1 \times 0.5 \times 100 = 1 + 1 + 5 = 7$ ns
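The same calculation can be written as a small C function. This is a sketch with illustrative names; `h_l2` is the local hit ratio of L2, that is, the fraction of L1 misses that hit in L2.

```c
#include <assert.h>

/* Average access time for a two-level cache:
 *   t_av = t_L1 + (1 - h_L1) * t_L2 + (1 - h_L1) * (1 - h_L2) * t_m */
double two_level_access_time(double t_l1, double t_l2, double t_mem,
                             double h_l1, double h_l2)
{
    assert(h_l1 >= 0.0 && h_l1 <= 1.0);
    assert(h_l2 >= 0.0 && h_l2 <= 1.0);
    double l1_miss = 1.0 - h_l1;
    double l2_miss = 1.0 - h_l2;
    return t_l1 + l1_miss * t_l2 + l1_miss * l2_miss * t_mem;
}
```

Calling `two_level_access_time(1.0, 10.0, 100.0, 0.9, 0.5)` reproduces the 7 ns result above, up to floating-point rounding.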

Performance of a Two-Level Cache System
Example
A system with L1 and L2 caches has a CPI of 1.2 with no cache miss. There are 1.1 memory accesses on average per instruction. What is the effective CPI with cache misses factored in? What are the effective hit rate and miss penalty overall if L1 and L2 caches are modeled as a single cache?

Level   Local hit rate   Miss penalty
L1      95%              8 cycles
L2      80%              60 cycles

[Figure: 95% of accesses hit in L1, 4% hit in L2 (8 cycles), 1% go to main memory (60 cycles).]

Solution
$C_{eff} = C_{fast} + (1 - h_1)[C_{medium} + (1 - h_2) C_{slow}]$
Because $C_{fast}$ is included in the CPI of 1.2, we must account for the rest:
CPI = 1.2 + 1.1 × (1 - 0.95) × [8 + (1 - 0.8) × 60] = 1.2 + 1.1 × 0.05 × 20 = 2.3
Overall: hit rate 99% (95% + 80% of 5%), miss penalty 60 cycles
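A short C sketch of this solution, under the same model as the slide (the base CPI already includes the L1 hit time). The function names are hypothetical.

```c
#include <assert.h>

/* Effective CPI when the base CPI already covers L1 hits:
 *   CPI = base + accesses_per_instr * (1 - h1) * [p1 + (1 - h2) * p2]
 * p1 and p2 are the L1 and L2 miss penalties in cycles. */
double effective_cpi(double base_cpi, double accesses_per_instr,
                     double h1, double p1, double h2, double p2)
{
    assert(h1 >= 0.0 && h1 <= 1.0 && h2 >= 0.0 && h2 <= 1.0);
    return base_cpi + accesses_per_instr * (1.0 - h1) * (p1 + (1.0 - h2) * p2);
}

/* Hit rate of L1 and L2 viewed as a single cache: L1 hits plus the
 * share of L1 misses that L2 catches. */
double combined_hit_rate(double h1, double h2)
{
    assert(h1 >= 0.0 && h1 <= 1.0 && h2 >= 0.0 && h2 <= 1.0);
    return h1 + (1.0 - h1) * h2;
}
```

With the table values, `effective_cpi(1.2, 1.1, 0.95, 8.0, 0.80, 60.0)` gives about 2.3 and `combined_hit_rate(0.95, 0.80)` gives about 0.99.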

Effect of Cache Block Size
• In general, a larger cache block size raises the hit rate, because larger blocks exploit spatial locality. The miss penalty, that is, the time needed to transfer the data into the cache, grows as well.
• With very large blocks the cache holds fewer blocks, which compromises temporal locality, so both the miss penalty and the miss rate increase.
[Figure: miss rate versus block size, and average access time versus block size.]
• Average Access Time = Hit Time × Hit Ratio + Miss Penalty × (1 - Hit Ratio), where Miss Penalty here denotes the total time to service a miss.

Associative Memory
• This memory is used to speed up searching among stored data.
• To reduce search time, the content of the data itself is used instead of an address. This approach is sometimes called content addressable memory (CAM).
• No address is needed when writing data into this memory!
• The memory can find an empty location and store the data there.
• When reading, the data value or part of it is presented to the memory. The memory marks every stored word that matches the presented value and makes it ready to be read.

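The Associated Hardware
[Figure: Argument Register (A), Key Register (K), and match register around an associative memory array and logic of m words, n bits per word; Input, Read, Write, and Output lines.]
• Example: only the bit positions where K holds a 1 are compared.

  A     = 101 11100
  K     = 111 00000
  Word1 = 100 11100   no match (N)
  Word2 = 101 00001   match (Y)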
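The match rule in this example can be sketched in C for 8-bit words. Hardware compares all words in parallel; the loop below is only a sequential stand-in. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A word matches when it agrees with the argument in every bit position
 * where the key holds a 1; positions where the key is 0 are ignored. */
int cam_match(uint8_t word, uint8_t argument, uint8_t key)
{
    return ((word ^ argument) & key) == 0;
}

/* Sets match[i] to 1 for every matching word and returns the number of
 * matches, playing the role of the match register. */
int cam_search(const uint8_t *words, int count, uint8_t argument,
               uint8_t key, uint8_t *match)
{
    assert(words != NULL && match != NULL && count >= 0);
    int matches = 0;
    for (int i = 0; i < count; i++) {
        match[i] = (uint8_t)cam_match(words[i], argument, key);
        matches += match[i];
    }
    return matches;
}
```

Writing the slide's values in hexadecimal (A = 0xBC, K = 0xE0, Word1 = 0x9C, Word2 = 0xA1), only Word2 matches, because it agrees with A in the three leftmost bits.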

Mapping
• The transfer of data from main memory to the cache is called mapping. It can be done in three ways:
  • Associative mapping
  • Direct mapping
  • Set-associative mapping

Fully Associative mapping
• This method, the fastest way to implement a cache, uses an associative memory.
• Both the address and the contents of a word are stored in this memory, so the cache can hold the contents of any main memory location.
• To look up a datum, its address is presented to the associative memory. If a matching entry exists, the corresponding data appears at the output. Otherwise the word is read from main memory and both the address and the data are stored in the associative memory.
• The FIFO algorithm is usually used for replacement in the cache.
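A minimal C sketch of this lookup-then-store behavior with FIFO replacement. The four-entry capacity, the type `assoc_cache_t`, and the function names are assumptions chosen for illustration; real associative hardware compares all entries at once rather than in a loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ASSOC_ENTRIES 4   /* hypothetical capacity */

typedef struct {
    uint32_t address[ASSOC_ENTRIES];
    uint16_t data[ASSOC_ENTRIES];
    int valid[ASSOC_ENTRIES];
    int next_victim;      /* FIFO pointer: the oldest entry goes first */
} assoc_cache_t;

void assoc_init(assoc_cache_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Returns 1 on a hit and copies the cached word into *out. */
int assoc_lookup(const assoc_cache_t *c, uint32_t addr, uint16_t *out)
{
    assert(c != NULL && out != NULL);
    for (int i = 0; i < ASSOC_ENTRIES; i++) {
        if (c->valid[i] && c->address[i] == addr) {
            *out = c->data[i];
            return 1;
        }
    }
    return 0;
}

/* Called after a miss: stores the address-data pair, overwriting
 * entries in the order in which they were filled. */
void assoc_insert(assoc_cache_t *c, uint32_t addr, uint16_t value)
{
    assert(c != NULL);
    int slot = c->next_victim;
    c->address[slot] = addr;
    c->data[slot] = value;
    c->valid[slot] = 1;
    c->next_victim = (slot + 1) % ASSOC_ENTRIES;
}
```

After four more insertions into this four-entry cache, the first pair stored is the one that gets overwritten, which is the FIFO policy named in the last bullet.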

Fully Associative mapping
[Figure: the CPU address is placed in the argument register and searched for in the address field of the cache.]

Cache contents:
Address   Data
00        1220
02        6710

Direct Mapping
• In this method the cache is implemented with a fast RAM (static RAM).
• If main memory has 2^n words and the cache has 2^k words, they need n and k address bits respectively.
• The n-bit main memory address is divided as follows:
  Tag (n-k bits) | Index (k bits)

Direct Mapping
• In addition to the data, the cache also stores the Tag information.

Main memory:
Address   Data
00000     1220
00FFF     2340
01000     3450
01FFF     4560
02000     5670
2FFFF     6710

Cache:
Index   Tag   Data
000     00    1220
FFF     2F    6710

Direct Mapping
• When the CPU generates an address, the Index part of the address is used to read one word from the cache, and the Tag stored with that word is compared with the Tag of the address.
• If the two Tags are equal, the cache data is used (hit).
• Otherwise (miss), the data is read from main memory and written into the cache together with the new Tag.
• With this method, if references to addresses with the same index occur frequently, the hit ratio drops.
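The read procedure can be sketched in C. The sizes below follow the earlier 32K × 12 main memory and 512 × 12 cache (n = 15, k = 9); the type `dm_cache_t` and the function names are hypothetical, and the 12-bit words are simply held in 16-bit storage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_BITS   15u                  /* n: main memory has 2^15 words */
#define INDEX_BITS  9u                   /* k: cache has 2^9 = 512 words */
#define CACHE_WORDS (1u << INDEX_BITS)

typedef struct {
    uint8_t  valid[CACHE_WORDS];
    uint8_t  tag[CACHE_WORDS];           /* n - k = 6 tag bits per entry */
    uint16_t data[CACHE_WORDS];
} dm_cache_t;

void dm_init(dm_cache_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Reads one word through the cache. Returns 1 on a hit and 0 on a miss;
 * on a miss the word and its tag replace whatever the index held. */
int dm_read(dm_cache_t *c, const uint16_t *memory, uint32_t addr, uint16_t *out)
{
    assert(c != NULL && memory != NULL && out != NULL);
    assert(addr < (1u << ADDR_BITS));
    uint32_t index = addr & (CACHE_WORDS - 1u);   /* low k bits */
    uint8_t tag = (uint8_t)(addr >> INDEX_BITS);  /* high n - k bits */
    int hit = c->valid[index] && c->tag[index] == tag;
    if (!hit) {
        c->data[index] = memory[addr];
        c->tag[index] = tag;
        c->valid[index] = 1;
    }
    *out = c->data[index];
    return hit;
}
```

Addresses 0x0000 and 0x0200 share index 0 but have different tags, so alternating between them misses every time. That is the drop in hit ratio described in the last bullet.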

Direct mapped

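Direct Mapping with a Larger Block Size
• The block size is usually taken to be larger than one word. The k-bit index is then made up of a block field and a word field:
  Tag (n-k bits) | Block | Word  (Block and Word together form the k-bit Index)

Cache:
Block     Index        Tag   Data
Block 0   000 to 007   01    3450 ... 6578
Block 1   010 to 017   00    1340 ... 1658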

Direct-Mapped Cache (Block size=4) Direct-mapped cache holding 32 words within eight 4-word lines. Each line is associated with a tag and a valid bit.

Accessing a Direct-Mapped Cache - example
Show cache addressing for a byte-addressable memory with 32-bit addresses. Cache line W = 16 B. Cache size L = 4096 lines (64 KB).
Solution
• Byte offset in line is $\log_2 16 = 4$ b.
• Cache line index is $\log_2 4096 = 12$ b.
• This leaves 32 - 12 - 4 = 16 b for the tag.
[Figure: components of the 32-bit address in an example direct-mapped cache with byte addressing: tag (16 bits) | line index (12 bits) | byte offset (4 bits).]
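For this particular configuration the three fields can be pulled out of an address with shifts and masks. A minimal sketch; the type `addr_fields_t` and the function `split_address` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { OFFSET_BITS = 4, INDEX_BITS = 12 };   /* 16 B lines, 4096 lines */

typedef struct {
    uint32_t tag;     /* upper 16 bits */
    uint32_t index;   /* middle 12 bits: which cache line */
    uint32_t offset;  /* lower 4 bits: which byte in the line */
} addr_fields_t;

addr_fields_t split_address(uint32_t addr)
{
    addr_fields_t f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Sanity check: the fields must reassemble into the original address. */
    assert(((f.tag << (OFFSET_BITS + INDEX_BITS)) |
            (f.index << OFFSET_BITS) | f.offset) == addr);
    return f;
}
```

For example, address 0x12345678 splits into tag 0x1234, line index 0x567, and byte offset 0x8.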

Direct-Mapped Cache Behavior
Address trace: 1, 7, 6, 5, 32, 33, 1, 2, . . .
• 1: miss, line 3, 2, 1, 0 fetched
• 7: miss, line 7, 6, 5, 4 fetched
• 6: hit
• 5: hit
• 32: miss, line 35, 34, 33, 32 fetched (replaces 3, 2, 1, 0)
• 33: hit
• 1: miss, line 3, 2, 1, 0 fetched (replaces 35, 34, 33, 32)
• 2: hit
• ... and so on
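The trace can be replayed with a few lines of C. The sketch assumes the cache described earlier (eight lines of four words, word addressing) and tracks only tags and valid bits, not data. The function name `simulate_trace` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LINES = 8, WORDS_PER_LINE = 4 };

/* Replays a word-address trace through a direct-mapped cache.
 * hit[i] is set to 1 for a hit and 0 for a miss; returns the hit count. */
size_t simulate_trace(const unsigned *trace, size_t n, int *hit)
{
    assert(trace != NULL && hit != NULL);
    int valid[LINES] = {0};
    unsigned tag[LINES] = {0};
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / WORDS_PER_LINE;  /* memory block number */
        unsigned line = block % LINES;               /* cache line it maps to */
        unsigned t = block / LINES;                  /* remaining bits: tag */
        hit[i] = valid[line] && tag[line] == t;
        if (hit[i]) {
            hits++;
        } else {
            valid[line] = 1;                         /* fetch the whole line */
            tag[line] = t;
        }
    }
    return hits;
}
```

For the trace 1, 7, 6, 5, 32, 33, 1, 2 this yields miss, miss, hit, hit, miss, hit, miss, hit: addresses 1 and 32 both map to line 0 and evict each other.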

Cache simulator: http://www.ecs.umass.edu/ece/koren/architecture/Cache/default.htm

Set-associative mapping
• In this method each cache location can store several words that share the same Index.
• The tag-data pairs stored at one cache location form a set; their number is the set size.
• An associative search is used to compare the tag of the generated address with the stored tags.
• As the sets grow larger, the cache hit ratio increases.

Set-associative mapping: 2-way Set associative Cache

Index   Tag   Data   Tag   Data
000     00    1220   02    5670
FFF     02    6710   00    2340

Cache Address Mapping - example
A 64 KB four-way set-associative cache is byte-addressable and contains 32 B lines. Memory addresses are 32 b wide.
a. How wide are the tags in this cache?
b. Which main memory addresses are mapped to set number 5?
Solution
a. Address (32 b) = 5 b byte offset + 9 b set index + 18 b tag
   • Line width = 32 B = $2^5$ B
   • Set size = 4 × 32 B = 128 B
   • Number of sets = $2^{16}/2^7 = 2^9$
   • Tag width = 32 - 5 - 9 = 18
b. Addresses that have their 9-bit set index equal to 5. These are of the general form $2^{14}a + 2^5 \times 5 + b$ with $0 \le b \le 31$; e.g., 160-191, 16 544-16 575, . . .
32-bit address: Tag (18 bits) | Set index (9 bits) | Offset (5 bits)
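Part b can be checked with a short C sketch that extracts the set index and rebuilds the first address of a line from its tag and set number. The function names are hypothetical; the field widths are the ones derived in part a.

```c
#include <assert.h>
#include <stdint.h>

enum { OFFSET_BITS = 5, SET_BITS = 9, TAG_BITS = 18 };

/* Set index: the 9 bits just above the 5-bit byte offset. */
uint32_t set_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1u);
}

/* Tag: the upper 18 bits of the address. */
uint32_t tag_of(uint32_t addr)
{
    return addr >> (OFFSET_BITS + SET_BITS);
}

/* First byte address of the line with the given tag in the given set,
 * i.e. 2^14 * tag + 2^5 * set. */
uint32_t line_base(uint32_t tag, uint32_t set)
{
    assert(tag < (1u << TAG_BITS) && set < (1u << SET_BITS));
    return (tag << (OFFSET_BITS + SET_BITS)) | (set << OFFSET_BITS);
}
```

`line_base(0, 5)` is 160 and `line_base(1, 5)` is 16544, so the first two address ranges that map to set 5 are 160-191 and 16544-16575.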

Accessing a Sample Cache
• 32 KB cache, 2-way set-associative, 16-byte block size
• 32 KB / 16 bytes / 2 = 1 K cache sets (numbered 0 to 1023)
• 32-bit address: tag (bits 31-14, 18 bits) | index (bits 13-4, 10 bits) | word offset (bits 3-0, 4 bits)
[Figure: the index selects one set; each of its two entries holds a valid bit, a tag, and data. Both stored tags are compared (=) with the address tag to produce hit/miss.]
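The field widths for any such configuration follow from the cache size, line size, and associativity. A hedged C sketch, assuming all three are powers of two; `geometry_t`, `log2_exact`, and `cache_geometry` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    unsigned offset_bits;  /* selects a byte within a line */
    unsigned index_bits;   /* selects a set */
    unsigned tag_bits;     /* whatever is left of the address */
} geometry_t;

/* Base-2 logarithm of a power of two. */
static unsigned log2_exact(unsigned long x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

geometry_t cache_geometry(unsigned long cache_bytes, unsigned long line_bytes,
                          unsigned long ways, unsigned addr_bits)
{
    assert(line_bytes > 0 && ways > 0);
    geometry_t g;
    unsigned long sets = cache_bytes / (line_bytes * ways);
    g.offset_bits = log2_exact(line_bytes);
    g.index_bits = log2_exact(sets);
    assert(addr_bits >= g.offset_bits + g.index_bits);
    g.tag_bits = addr_bits - g.offset_bits - g.index_bits;
    return g;
}
```

`cache_geometry(32 * 1024, 16, 2, 32)` gives a 4-bit offset, a 10-bit index, and an 18-bit tag, matching this slide. A direct-mapped cache is the special case `ways = 1`.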

Mapping - Comparison
• Block 12 placed in an 8-block cache: fully associative, direct mapped, 2-way set associative
• S.A. mapping = block number modulo number of sets
• Fully associative: block 12 can go anywhere (blocks 0 to 7)
• Direct mapped: block 12 can go only into block 4 (12 mod 8)
• Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: three 8-block caches (block no. 0 to 7; the set-associative cache is divided into sets 0 to 3) shown below memory block-frame addresses 0 to 31.]
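All three placements are the same rule with a different number of ways. The C sketch below returns the range of cache block numbers a memory block may occupy. It assumes that set s occupies the consecutive cache blocks s × ways through s × ways + ways - 1; the names `frame_range_t` and `candidate_frames` are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned first;  /* lowest cache block the memory block may use */
    unsigned last;   /* highest cache block the memory block may use */
} frame_range_t;

/* ways = 1 is direct mapped, ways = cache_blocks is fully associative. */
frame_range_t candidate_frames(unsigned block, unsigned cache_blocks,
                               unsigned ways)
{
    assert(ways > 0 && cache_blocks >= ways && cache_blocks % ways == 0);
    unsigned sets = cache_blocks / ways;
    unsigned set = block % sets;       /* block number modulo number of sets */
    frame_range_t r;
    r.first = set * ways;
    r.last = set * ways + ways - 1;
    return r;
}
```

For block 12 in an 8-block cache this gives block 4 only when direct mapped, blocks 0 to 1 (set 0) when 2-way set associative, and blocks 0 to 7 when fully associative.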

Replacement Algorithms
• When a miss occurs and the set is full, one of its entries must be removed so that the new data can enter the cache.
• There are several ways to choose the entry:
  • First-in first-out (FIFO)
  • Random replacement
  • Least recently used (LRU)
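As one illustration, LRU for a single 2-way set can be sketched in C with a time stamp per entry. This is only one possible bookkeeping scheme, chosen here for clarity; the type `lru_set_t` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WAYS 2

typedef struct {
    int valid[WAYS];
    uint32_t tag[WAYS];
    unsigned long last_used[WAYS];  /* time of the most recent access */
    unsigned long clock;            /* counts accesses to this set */
} lru_set_t;

void lru_init(lru_set_t *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
}

/* Accesses a tag in the set. Returns 1 on a hit. On a miss it installs
 * the tag in an empty way if there is one, otherwise in the least
 * recently used way, and returns 0. */
int lru_access(lru_set_t *s, uint32_t tag)
{
    assert(s != NULL);
    s->clock++;
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->last_used[w] = s->clock;
            return 1;
        }
    }
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) {          /* an empty way is always preferred */
            victim = w;
            break;
        }
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    }
    s->valid[victim] = 1;
    s->tag[victim] = tag;
    s->last_used[victim] = s->clock;
    return 0;
}
```

With the access sequence A, B, A, C, the entry evicted for C is B, because A was touched more recently. FIFO would have evicted A, the entry that entered first.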

Writing to the Cache
• An important issue with cache memory is how writes to memory are handled.
• On a read, there is no need to go to main memory when the data is in the cache. On a write, however, there are two possible approaches:
  • Write through: on every write, the data is written both to the cache and to main memory.
  • Write back: the data is written only to the cache and marked with a flag. As long as the data stays in the cache, the cached copy is used; when the data is removed from the cache, its value is also updated in main memory.
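The difference between the two policies shows up in the number of main memory writes. The C sketch below models a single cached word; the type `cached_word_t`, the `dirty` flag name, and the write counter are hypothetical and exist only to make the traffic visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy_t;

typedef struct {
    uint16_t data;           /* the cached copy */
    int dirty;               /* the flag set by a write-back write */
    uint16_t *backing;       /* the main memory word being cached */
    unsigned memory_writes;  /* number of writes sent to main memory */
} cached_word_t;

void cache_write(cached_word_t *line, uint16_t value, write_policy_t policy)
{
    assert(line != NULL && line->backing != NULL);
    line->data = value;
    if (policy == WRITE_THROUGH) {
        *line->backing = value;  /* main memory is updated on every write */
        line->memory_writes++;
    } else {
        line->dirty = 1;         /* main memory is updated later, on eviction */
    }
}

/* Removes the word from the cache, writing it back first if needed. */
void cache_evict(cached_word_t *line)
{
    assert(line != NULL && line->backing != NULL);
    if (line->dirty) {
        *line->backing = line->data;
        line->memory_writes++;
        line->dirty = 0;
    }
}
```

Three consecutive writes to the same word cost three main memory writes under write through, but only one (at eviction) under write back. The price of write back is that main memory is out of date until then.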

Average memory access time
• Average memory access time =
  % instructions × (hit_time + instruction miss rate × miss_penalty)
  + % data × (hit_time + data miss rate × miss_penalty)

Average memory access time
• Assume 40% of memory accesses are data accesses; the remaining 60% are instruction fetches.
• Let a hit take 1 clock cycle and the miss penalty be 100 clock cycles.
• Assume the instruction miss rate is 4% and the data access miss rate is 12%. What is the average memory access time?

Average memory access time
60% × (1 + 4% × 100) + 40% × (1 + 12% × 100) = 0.6 × 5 + 0.4 × 13 = 8.2 clock cycles
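The weighted formula can be written as a small C function to try other mixes and miss rates. A sketch with illustrative names; `instr_fraction` is the share of accesses that are instruction fetches, and the rest are treated as data accesses.

```c
#include <assert.h>

/* Average memory access time with separate instruction and data miss rates:
 *   AMAT = f_i * (hit + m_i * penalty) + (1 - f_i) * (hit + m_d * penalty) */
double average_access_time(double instr_fraction, double hit_time,
                           double instr_miss_rate, double data_miss_rate,
                           double miss_penalty)
{
    assert(instr_fraction >= 0.0 && instr_fraction <= 1.0);
    assert(instr_miss_rate >= 0.0 && instr_miss_rate <= 1.0);
    assert(data_miss_rate >= 0.0 && data_miss_rate <= 1.0);
    double data_fraction = 1.0 - instr_fraction;
    return instr_fraction * (hit_time + instr_miss_rate * miss_penalty)
         + data_fraction * (hit_time + data_miss_rate * miss_penalty);
}
```

`average_access_time(0.6, 1.0, 0.04, 0.12, 100.0)` reproduces the 8.2 clock cycles computed above, up to floating-point rounding.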

Memory Hierarchy in Modern Processors
[Figure: Intel Pentium 4, 2.2 GHz Processor.]
