# NEC # **User's Manual** # $VR10000^{TM}, VR12000^{TM}$ **64-bit Microprocessor** $\mu$ PD30700 $\mu$ PD30700L $\mu$ PD30710 [MEMO] # **Summary of Contents** | Chapter 1 | Introduction to the R10000 Processor | <b></b> 21 | |------------|---------------------------------------|------------| | Chapter 2 | System Configurations | <b></b> 53 | | Chapter 3 | Interface Signal Descriptions | <b></b> 57 | | Chapter 4 | Cache Organization and Coherency | <b></b> 65 | | Chapter 5 | Secondary Cache Interface | <b></b> 81 | | Chapter 6 | System Interface Operations | 101 | | Chapter 7 | Clock Signals | 177 | | Chapter 8 | Initialization | 181 | | Chapter 9 | Error Protection and Handling | 191 | | Chapter 10 | JTAG Interface Operation | 211 | | Chapter 11 | Electrical Specifications | 217 | | Chapter 12 | Packaging | 227 | | Chapter 13 | Coprocessor 0 | 241 | | Chapter 14 | Floating-Point Unit | 291 | | Chapter 15 | Memory Management | 303 | | Chapter 16 | CPU Exceptions | 319 | | Chapter 17 | Cache Test Mode | 347 | | Appendix A | Glossary | 357 | | Appendix B | Differences between R10000 and R12000 | 365 | | Appendix C | Index | 377 | | | | | #### NOTES FOR CMOS DEVICES - #### (1) PRECAUTION AGAINST ESD FOR SEMICONDUCTORS Note: Strong electric field, when exposed to a MOS device, can cause destruction of the gate oxide and ultimately degrade the device operation. Steps must be taken to stop generation of static electricity as much as possible, and quickly dissipate it once, when it has occurred. Environmental control must be adequate. When it is dry, humidifier should be used. It is recommended to avoid using insulators that easily build static electricity. Semiconductor devices must be stored and transported in an anti-static container, static shielding bag or conductive material. All test and measurement tools including work bench and floor should be grounded. The operator should be grounded using wrist strap. Semiconductor devices must not be touched with bare hands. Similar precautions need to be taken for PW boards with semiconductor devices on it. #### (2) HANDLING OF UNUSED INPUT PINS FOR CMOS Note: No connection for CMOS device inputs can be cause of malfunction. If no connection is provided to the input pins, it is possible that an internal input level may be generated due to noise, etc., hence causing malfunction. CMOS devices behave differently than Bipolar or NMOS devices. Input levels of CMOS devices must be fixed high or low by using a pull-up or pull-down circuitry. Each unused pin should be connected to VDD or GND with a resistor, if it is considered to have a possibility of being an output pin. All handling related to the unused pins must be judged device by device and related specifications governing the devices. ## ③ STATUS BEFORE INITIALIZATION OF MOS DEVICES Note: Power-on does not necessarily define initial status of MOS device. Production process of MOS does not define the initial operation status of the device. Immediately after the power source is turned ON, the devices with reset function have not yet been initialized. Hence, power-on does not guarantee out-pin levels, I/O settings or contents of registers. Device is not initialized until the reset signal is received. Reset operation must be executed immediately after power-on for devices having reset function. VR3000,VR4200,VR4400,VR4400,VR5000,VR10000,VR12000,and,VR-Series are trademarks of NEC Corporation. RISCompiler, RISC/os, R2000,R3000,R4000,and R6000 are trademarks of MIPS Computer Systems Inc. MIPS,R4200,R4300,R4400,R8000, and R10000 are trademarks of MIPS Technologies, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. The export of this product from Japan is prohibited without governmental license. To export or re-export this product from a country other than Japan may also be prohibited without a license from that country. Please call an NEC sales representative. Exporting this product or equipment that includes this product may require a governmental license from the U.S.A. for some countries because this product utilizes technologies limited by the export control regulations of the U.S.A. #### The information in this document is subject to change without notice. No part of this document may be copied or reproduced in any form or by any means without the prior written consent of NEC Corporation. NEC Corporation assumes no responsibility for any errors which may appear in this document. NEC Corporation does not assume any liability for infringement of patents, copyrights or other intellectual property rights of third parties by or arising from use of a device described herein or any other liability arising from use of such device. No license, either express, implied or otherwise, is granted under any patents, copyrights or other intellectual property rights of NEC Corporation or others. While NEC Corporation has been making continuous effort to enhance the reliability of its semiconductor devices, the possibility of defects cannot be eliminated entirely. To minimize risks of damage or injury to persons or property arising from a defect in an NEC semiconductor device, customers must incorporate sufficient safety measures in its design, such as redundancy, fire-containment, and anti-failure features. NEC devices are classified into the following three quality grades: "Standard", "Special", and "Specific". The Specific quality grade applies only to devices developed based on a customer designated "quality assurance program" for a specific application. The recommended applications of a device depend on its quality grade, as indicated below. Customers must check the quality grade of each device before using it in a particular application. - Standard: Computers, office equipment, communications equipment, test and measurement equipment, audio and visual equipment, home electronic appliances, machine tools, personal electronic equipment and industrial robots - Special: Transportation equipment (automobiles, trains, ships, etc.), traffic control systems, anti-disaster systems, anti-crime systems, safety equipment and medical equipment (not specifically designed for life support) - Specific: Aircrafts, aerospace equipment, submersible repeaters, nuclear reactor control systems, life support systems or medical equipment for life support, etc. The quality grade of NEC devices is "Standard" unless otherwise specified in NEC's Data Sheets or Data Books. If customers intend to use NEC devices for applications other than those specified for Standard quality grade, they should contact an NEC sales representative in advance. Anti-radioactive design is not implemented in this product. [MEMO] # RegionalInformation Some information contained in this document may vary from country to country. Before using any NEC product in your application, please contact the NEC office in your country to obtain a list of authorized representatives and distributors. They will verify: - · Device availability - · Ordering information - · Product release schedule - · Availability of related technical literature - Development environment specifications (for example, specifications for third-party tools and components, host computers, power plugs, AC supply voltages, and so forth) - · Network requirements In addition, trademarks, registered trademarks, export restrictions, and other legal issues may also vary from country to country. #### **NEC Electronics Inc. (U.S.)** Santa Clara, California Tel: 408-588-6000 800-366-9782 Fax: 408-588-6130 800-729-9288 #### **NEC Electronics (Germany) GmbH** Duesseldorf, Germany Tel: 0211-65 03 02 Fax: 0211-65 03 490 #### **NEC Electronics (UK) Ltd.** Milton Keynes, UK Tel: 01908-691-133 Fax: 01908-670-290 ## NEC Electronics Italiana s.r.1. Milano, Italy Tel: 02-66 75 41 Fax: 02-66 75 42 99 #### **NEC Electronics (Germany) GmbH** Benelux Office Eindhoven, The Netherlands Tel: 040-2445845 Fax: 040-2444580 #### **NEC Electronics (France) S.A.** Velizy-Villacoublay, France Tel: 01-30-67 58 00 Fax: 01-30-67 58 99 #### **NEC Electronics (France) S.A.** Spain Office Madrid, Spain Tel: 01-504-2787 Fax: 01-504-2860 #### **NEC Electronics (Germany) GmbH** Scandinavia Office Taeby, Sweden Tel: 08-63 80 820 Fax: 08-63 80 388 ## **NEC Electronics Hong Kong Ltd.** Hong Kong Tel: 2886-9318 Fax: 2886-9022/9044 #### **NEC Electronics Hong Kong Ltd.** Seoul Branch Seoul, Korea Tel: 02-528-0303 Fax: 02-528-4411 #### **NEC Electronics Singapore Pte. Ltd.** United Square, Singapore 1130 Tel: 65-253-8311 Fax: 65-250-3583 #### **NEC Electronics Taiwan Ltd.** Taipei, Taiwan Tel: 02-719-2377 Fax: 02-719-5951 #### **NEC do Brasil S.A.** Cumbica-Guarulhos-SP, Brasil Tel: 011-6465-6810 Fax: 011-6465-6829 J98. 2 ## **Main Revision in This Edition** | Page | Description | |-------|-----------------------------------------------------------------------------------------------| | p.26 | Addition of Load/store dependency is speculatively ignored (R12000) in 1.3 What is an | | | R10000 Microprocessor? | | p.29 | Addition of Increase in pre-decode buffering (R12000) in 1.3 What is an R10000 | | | Microprocessor? | | p.30 | Addition of Branch Target Address Cache (R12000) in 1.3 What is an R10000 | | | Microprocessor? | | p.31 | Addition of Additional cycles for System Interface Transactions (R12000) in 1.3 What is an | | | R10000 Microprocessor? | | p.32 | Addition of FP and Integer-Queue Issue Policy in 1.4 Instruction Queues | | p.32 | Addition of Address calculation for load/store instructions uses integer queue (R12000) in | | | 1.4 Instruction Queues | | p.63 | Addition of <r12000> Changed Spare (1,3) pins to NC (No Connection) in Table3-4 Test</r12000> | | | Interface Signals | | p.69 | Addition of DCache set locking relaxed (R12000) in 4.2 Primary Data Cache | | p.73 | Addition of <r12000> in 4.3 Secondary Cache</r12000> | | p.87 | Addition of Increased the Way Prediction Table (MRU table) to 16K single-bit entries, | | | Direct Cache Test Mode in 5.4 Secondary Cache Way Prediction Table | | p.187 | Modification of description in Table 8-1 Mode Bits | | p.262 | Modification of description in Table 13-15 Config Register Field Definitions | | p.262 | Modification of description in Figure 13-16 Config Register Format | | p.269 | Addition of description in 13.19 Diagnostic Register (22) | | p.293 | Addition of Eliminate traps for Denorm/NaN FP inputs (R12000) in 14.2 Floating-Point Unit | | | Control | | p.357 | Addition of Appendix A Glossary | | p.365 | Addition of Appendix B Differences between R10000 and R12000 | The mark ★ shows major revised points. #### **PREFACE** **Readers** This manual targets users who intends to understand the functions of the Vr10000 and Vr12000, and to design application systems using this microprocessor. Purpose This manual introduces the architecture and hardware functions of the Vr10000, Vr12000 to users, following the organization described below. **Organization** This manual consists of the following contents: - Introduction - Cache - Hardware - Coprocessor 0 - Floating point unit - Memory management system - Exception processing - · Instruction set details #### How to read this manual It is assumed that the reader of this manual has general knowledge in the fields of electric engineering, logic circuits, and microcomputers. The R3000<sup>TM</sup> in this manual represents the V<sub>R</sub>3000<sup>TM</sup>. The R4200<sup>TM</sup> in this manual represents the V<sub>R</sub>4200<sup>TM</sup>. The R4300<sup>TM</sup> in this manual represents the VR4300<sup>TM</sup>. The R4400 $^{\text{TM}}$ in this manual represents the $V_{\text{R}}4400^{\text{TM}}.$ The $R10000^{TM}$ in this manual represents the $V_R10000^{TM}$ . The $R12000^{TM}$ in this manual represents the $V_R12000^{TM}$ . To learn about detailed function of a specific instruction. → Read Chapter 14 Floating-Point Unit, Chapter 16 CPU Exceptions, or refer to Vr5000, Vr10000 User's Manual INSTRUCTION which is separately available. To learn about the overall functions of the V<sub>R</sub>10000 and V<sub>R</sub>12000 → Read this manual in sequential order. To learn about electrical specifications, → Refer to **Data Sheet** which is separately available. Unless otherwise specified, the R10000 is treated as the representative model throughout this document. **Legend** Data significance: Higher on left and lower on right Active low: XXX\* Numeric representation: binary ... XXXX or XXXX<sub>2</sub> decimal ... XXXX hexadecimal ... 0xXXXX Important information Underlined #### **Related Documents** The related documents indicated here may include preliminary version. However, preliminary versions are not marked as such. · Data sheet Vr10000, Vr12000 Data Sheet To be issued • User's Manual VR5000, VR10000 User's Manual INSTRUCTION U12754E [MEMO] ## **CONTENTS** ## 1. Introduction to the R10000 Processor | 1.1 | MIPS Instruction Set Architecture (ISA) | 22 | |-----|-----------------------------------------------------------------------------|----| | 1.2 | What is a Superscalar Processor? | 23 | | | Pipeline and Superpipeline Architecture | 23 | | | Superscalar Architecture | 23 | | 1.3 | What is an R10000 Microprocessor? | 24 | | | R10000 Superscalar Pipeline | 25 | | | Instruction Queues | 26 | | | Execution Pipelines | | | | Load/store dependency is speculatively ignored (R12000) | | | | 64-bit Integer ALU Pipeline | | | | Load/Store Pipeline | | | | 64-bit Floating-Point Pipeline | | | | Functional Units | | | | Increase in pre-decode buffering (R12000) | | | | Primary Instruction Cache (I-cache) | | | | Primary Data Cache (D-cache) | | | | Branch Target Address Cache (R12000) | | | | Instruction Decode And Rename Unit | | | | Branch Unit | | | | External Interfaces | | | | Additional cycles for System Interface transactions (R12000) | | | 1.4 | Instruction Queues | | | 1.4 | FP and Integer-Queue Issue Policy (R12000) | | | | Integer Queue | | | | Address calculation for load/store instructions uses integer queue (R12000) | | | | Floating-Point Queue | | | | Address Queue | | | 1.5 | | | | 1.3 | Program Order and Dependencies | | | | Instruction Dependencies | | | | Execution Order and Stalling | | | | Branch Prediction and Speculative Execution | | | | Resolving Operand Dependencies | | | | Resolving Exception Dependencies | | | | Strong Ordering | | | 1. | An Example of Strong Ordering | | | 1.6 | R10000 Pipelines | | | | Stage 1 | | | | Stage 2 | 39 | | | Stage 3 | | | | Stages 4-6 | | | | Floating-Point Multiplier (3-stage Pipeline) | | | | Floating-Point Divide and Square-Root Units | | | | Floating-Point Adder (3-stage Pipeline) | | | | Integer ALU1 (1-stage Pipeline) | | | | Integer ALU2 (1-stage Pipeline) | | | | Address Calculation and Translation in the TLB | | | 1.7 | Implications of R10000 Microarchitecture on Software | | | | Superscalar Instruction Issue | | | | Speculative Execution | | | | Side Effects of Speculative Execution | | | | Nonblocking Caches | 47 | | | 1.8 | Performance | 48 | |----|-------|----------------------------------------------------------------------------------------------------|----| | | | User Instruction Latency and Repeat Rate | 49 | | | | Other Performance Issues | 51 | | | | Cache Performance | 51 | | 2. | Syste | em Configurations | | | | | | | | | 2.1 | Uniprocessor Systems | | | | 2.2 | Multiprocessor Systems | | | | | Multiprocessor Systems Using Dedicated External Agents Multiprocessor Systems Using a Cluster Bus | | | 3. | Inte | rface Signal Descriptions | | | | 3.1 | Power Interface Signals. | 58 | | | 3.2 | Secondary Cache Interface Signals | | | | 3.3 | System Interface Signals | | | | 3.4 | Test Interface Signals | | | | | Unused Inputs | | | 4. | Cacl | ne Organization and Coherency | | | | 4.1 | Primary Instruction Cache | 66 | | | 4.2 | Primary Data Cache | | | | | DCache set locking relaxed (R12000) | | | | 4.3 | Secondary Cache | 71 | | | | <r12000></r12000> | 73 | | | 4.4 | Cache Algorithms | 74 | | | | Descriptions of the Cache Algorithms | 75 | | | | Uncached | 75 | | | | Cacheable Noncoherent | | | | | Cacheable Coherent Exclusive | | | | | Cacheable Coherent Exclusive on Write | 75 | | | | Uncached Accelerated | | | | 4.5 | Relationship Between Cached and Uncached Operations | | | | 4.6 | Cache Algorithms and Processor Requests | | | | 4.7 | Cache Block Ownership | 79 | | 5. | Seco | ndary Cache Interface | | | | 5.1 | Tag and Data Arrays | 82 | | | 5.2 | Secondary Cache Interface Frequencies | 83 | | | 5.3 | Secondary Cache Indexing | | | | | Indexing the Data Array | 84 | | | | Indexing the Tag Array | | | | 5.4 | Secondary Cache Way Prediction Table | | | | | Increased the Way Prediction Table (MRU table) to 16K single-bit entries | | | | | Direct Cache Test Mode | | | | 5.5 | Secondary Cache Tag | | | | | SCTag(25:4), Physical Tag | | | | | SCTag(3:2), PIdx | 89 | | | | SCTag(1:0), Cache Block State | | |----|-------|-------------------------------------------------------|-----| | | 5.6 | Read Sequences | | | | | 4-Word Read Sequence | | | | | 8-Word Read Sequence | | | | | 16 or 32-Word Read Sequence | | | | | Tag Read Sequence | | | | 5.7 | Write Sequences | 95 | | | | 4-Word Write Sequence | 96 | | | | 8-Word Write Sequence | 97 | | | | 16 or 32-Word Write Sequence | 98 | | | | Tag Write Sequence | 99 | | 6. | Syste | em Interface Operations | | | | 6.1 | Request and Response Cycles | 102 | | | 6.2 | System Interface Frequencies | 102 | | | 6.3 | Register-to-Register Operation | | | | 6.4 | System Interface Signals | | | | 6.5 | Master and Slave States | | | | 6.6 | Connecting to an External Agent | | | | 6.7 | Cluster Bus | | | | 6.8 | System Interface Connections | | | | 0.0 | Uniprocessor System | | | | | Multiprocessor System Using Dedicated External Agents | | | | | Multiprocessor System Using the Cluster Bus | | | | 6.9 | System Interface Requests and Responses | | | | 0.5 | Processor Requests | | | | | External Responses | | | | | External Requests | | | | | Processor Responses | | | | | Outstanding Requests and Request Numbers | | | | | Request and Response Relationship | | | | 6.10 | System Interface Buffers | | | | 0.10 | Cluster Request Buffer | | | | | Cached Request Buffer | | | | | Incoming Buffer | | | | | Outgoing Buffer | | | | | Uncached Buffer | | | | 6.11 | System Interface Flow Control | | | | 0.11 | Processor Write and Eliminate Request Flow Control | | | | | Processor Read and Upgrade Request Flow Control | | | | | Processor Coherency Data Response Flow Control | | | | | External Request Flow Control | | | | | External Data Response Flow Control | | | | 6.12 | System Interface Block Data Ordering | | | | 0.12 | · · · | | | | | External Block Data Responses | | | | | Processor Coherency Data Responses | | | | 6 12 | Processor Block Write Requests | | | | 6.13 | System Interface Bus Encoding | | | | | SysCmd[111] Encoding | | | | | SysCmd[11] Encoding | | | | | SysCmd[10:0] Address Cycle Encoding | | | | | SysCmd[10:0] Data Cycle Encoding | 121 | | | SysCmd[11:0] Map | 123 | |------|-------------------------------------------------------------|-----| | | SysAD[63:0] Encoding | 124 | | | SysAD[63:0] Address Cycle Encoding | 124 | | | SysAD[63:0] Data Cycle Encoding | | | | SysState[2:0] Encoding | | | | SysResp[4:0] Encoding | | | 6.14 | Interrupts | | | | Hardware Interrupts | | | | Software Interrupts | | | | Timer Interrupt | | | | Nonmaskable Interrupt | | | 6.15 | Protocol Abbreviations | | | 6.16 | System Interface Arbitration | | | 0.10 | System Interface Arbitration Rules | | | | Uniprocessor System | | | | Multiprocessor System Using Cluster Bus | | | 6.17 | System Interface Request and Response Protocol | | | 0.17 | Processor Request Protocol | | | | Processor Block Read Request Protocol | | | | Processor Double/Single/Partial-Word Read Request Protocol | | | | Processor Block Write Request Protocol | | | | Processor Double/Single/Partial-Word Write Request Protocol | | | | Processor Upgrade Request Protocol | | | | Processor Eliminate Request Protocol | | | | Processor Request Flow Control Protocol | | | | External Response Protocol | | | | External Block Data Response Protocol | | | | External Double/Single/Partial-Word Data Response Protocol | | | | External Completion Response Protocol | | | | External Request Protocol | | | | External Intervention Request Protocol | | | | External Allocate Request Number Request Protocol | | | | External Invalidate Request Protocol | | | | External Interrupt Request Protocol | | | | | | | | Processor Response Protocol | | | | Processor Coherency State Response Protocol | | | c 10 | Processor Coherency Data Response Protocol | | | 6.18 | System Interface Coherency | | | | External Intervention Shared Request | | | | External Intervention Exclusive Request | | | | External Invalidate Request | | | | External Coherency Request Action | | | | Coherency Conflicts | | | | Internal Coherency Conflicts | | | | External Coherency Conflicts | | | | External Coherency Request Latency | | | | SysGblPerf* Signal | | | 6.19 | Cluster Bus Operation | | | 6.20 | Support for I/O | | | 6.21 | Support for External Duplicate Tags | | | 6.22 | Support for a Directory-Based Coherency Protocol | | | 6.23 | Support for Uncached Attribute | | | 6.24 | Support for Hardware Emulation | 176 | # 7. Clock Signals | | 7.1 | System Interface Clock and Internal Processor Clock Domains | 178 | |-----|--------|-------------------------------------------------------------|-----| | | 7.2 | Secondary Cache Clock | 179 | | | 7.3 | Phase-Locked-Loop | 180 | | 8. | Initia | alization | | | | 8.1 | Initialization of Logical Registers | 187 | | | 8.2 | Power-On Reset Sequence | | | | 8.3 | Cold Reset Sequence | | | | 8.4 | Soft Reset Sequence | | | | 8.5 | Mode Bits | | | 9. | Erro | or Protection and Handling | | | | 9.1 | Correctable Errors | 192 | | | 9.2 | Uncorrectable Errors | | | | 9.3 | Propagation of Uncorrectable Errors | | | | 9.4 | Cache Error Exception | | | | 9.5 | CP0 CacheErr Register EW Bit | | | | 9.6 | CP0 Status Register DE Bit | 196 | | | 9.7 | CACHE Instruction | 196 | | | 9.8 | Error Protection Schemes Used by R10000 | | | | | Sparse Encoding | | | | | ECC | | | | 9.9 | Primary Instruction Cache Error Protection and Handling | 198 | | | | Error Protection | | | | | Error Handling | | | | 9.10 | Primary Data Cache Error Protection and Handling | | | | | Error Protection | | | | | Error Handling | | | | 9.11 | Secondary Cache Error Protection and Handling | | | | | Error Protection | | | | | Error Handling | | | | | Data Array | | | | 0.12 | Tag Array System Interface Error Protection and Handling | 203 | | | 9.12 | Error Protection | | | | | Error Handling | | | | | SysCmd(11:0) Bus | | | | | SysAD(63:0) Bus | | | | | SysState(2:0) Bus | | | | | SysResp(4:0) Bus | | | | | Protocol Observation | | | 10. | JTA | G Interface Operation | | | | 10.1 | Test Access Port (TAP) | 212 | | | | TAP Controller (Input) | | | | | | 213 | | 10.3 | Bypass Register | 213 | |-----------|-----------------------------------------------------------------|-----| | 10.4 | Boundary Scan Register | | | 11. Elect | trical Specifications | | | 11.1 | DC Electrical Specification | 218 | | | DC Power Supply Levels | | | | DCOk and Power Supply Sequencing | | | | Maximum Operating Conditions | | | | Input Signal Level Sensing | | | | Mode Definitions | | | | Vref[SC,Sys] | | | | DC Input/Output Specifications | | | 11.2 | AC Electrical Specification | | | | Maximum Operating Conditions | | | | Test Specification | 222 | | | Secondary Cache and System Interface Timing | 222 | | | Enable/Output Delay, Setup, Hold Time | 223 | | | Asynchronous Inputs | 223 | | 11.3 | Signal Integrity Issues | 224 | | | Reference Voltage | 224 | | | Power Supply Regulation | 224 | | | Maximum Input Voltage Levels | 224 | | | Decoupling Capacitance | 225 | | 12.1 Pack | R10000 Single-Chip Package, 599CLGA Mechanical Characteristics | | | | Electrical Characteristics | | | | Thermal Characteristics | 230 | | | Assembly Drawings and Pinout List | 230 | | | 599CLGA Pinout | 232 | | 13. Сорі | rocessor 0 | | | 13.1 | Index Register (0) | 243 | | 13.2 | Random Register (1) | | | 13.3 | EntryLo0 (2), and EntryLo1 (3) Registers | 245 | | 13.4 | Context (4) | 247 | | 13.5 | PageMask Register (5) | 248 | | 13.6 | Wired Register (6) | 249 | | 13.7 | BadVAddr Register (8) | 250 | | 13.8 | Count and Compare Registers (9 and 11) | 250 | | 13.9 | EntryHi Register (10) | 251 | | 13.10 | Status Register (12) | 252 | | | Status Register Fields | 254 | | | Diagnostic Status Field | 255 | | | Coprocessor Accessibility | 257 | | 13.11 | Cause Register (13) | | | 13.12 | Exception Program Counter (14) | | | 13.13 | Processor Revision Identifier (PRId) Register (15) | 261 | | | | | | 13.14 | Config Register (16) | 262 | |------------------|---------------------------------------------------------------|-----| | 13.15 | Load Linked Address (LLAddr) Register (17) | 263 | | 13.16 | WatchLo (18) and WatchHi (19) Registers | 264 | | 13.17 | XContext Register (20) | 265 | | 13.18 | FrameMask Register (21) | 266 | | 13.19 | Diagnostic Register (22) | 267 | | 13.20 | Performance Counter Registers (25) | 270 | | 13.21 | ECC Register (26) | 279 | | 13.22 | CacheErr Register (27) | 280 | | | CacheErr Register Format for Primary Instruction Cache Errors | 280 | | | CacheErr Register Format for Primary Data Cache Errors | 281 | | | CacheErr Register Format for Secondary Cache Errors | 282 | | | CacheErr Register Format for System Interface Errors | 283 | | 13.23 | TagLo (28) and TagHi (29) Registers | 284 | | | CacheOp is Index Load/Store Tag | 284 | | | Primary Instruction Cache Operation | 285 | | | Primary Data Cache Operation | 285 | | | Secondary Cache Operation | 287 | | | CacheOp is Index Load/Store Data | 288 | | | Primary Instruction Cache Operation | 288 | | | Primary Data Cache Operation | 289 | | | Secondary Cache Operation | 289 | | 13.24 | ErrorEPC Register (30) | 290 | | <b>14. Float</b> | Floating Point Unit Operations | 292 | | 14.2 | Floating-Point Unit Control | | | | Eliminate traps for Denorm/NaN FP inputs (R12000) | | | 14.3 | Floating-Point General Registers (FGRs) | | | | 32- and 64-Bit Operations | | | | Load and Store Operations | | | 14.4 | Floating-Point Control Registers | | | | Floating-Point Implementation and Revision Register | | | | Floating-Point Status Register (FSR) | | | | Bit Descriptions of the FSR | 300 | | | Loading the FSR | 301 | | 15. Mem | ory Management | | | 15.1 | Processor Modes | | | | Processor Operating Modes | | | | Addressing Modes | | | 15.2 | Virtual Address Space | | | | User Mode Operations | | | | 32-bit User Mode (useg) | | | | 64-bit User Mode (xuseg) | | | | Supervisor Mode Operations | | | | 32-bit Supervisor Mode, User Space (suseg) | | | | 32-bit Supervisor Mode, Supervisor Space (sseg) | | | | 64-bit Supervisor Mode, User Space (xsuseg) | | | | 64-bit Supervisor Mode, Current Supervisor Space (xsseg) | | | | 64-bit Supervisor Mode, Separate Supervisor Space (csseg) | 309 | | | Kernel Mode Operations | 310 | |--------------|-------------------------------------------------------------------------|---------------------------------| | | 32-bit Kernel Mode, User Space (kuseg) | | | | 32-bit Kernel Mode, Kernel Space 0 (kseg0) | | | | 32-bit Kernel Mode, Kernel Space 1 (kseg1) | | | | 32-bit Kernel Mode, Supervisor Space (ksseg) | | | | 32-bit Kernel Mode, Kernel Space 3 (kseg3) | | | | 64-bit Kernel Mode, User Space (xkuseg) | | | | 64-bit Kernel Mode, Current Supervisor Space (xksseg) | | | | 64-bit Kernel Mode, Physical Spaces (xkphys) | | | | 64-bit Kernel Mode, Kernel Space (xkseg) | | | | 64-bit Kernel Mode, Compatibility Spaces (ckseg1:0, cksseg, ckseg3) | | | | Address Space Access Privilege Differences Between the R4400 and R10000 | | | 15.3 | Virtual Address Translation | | | 13.3 | | | | | Virtual Pages | | | | Virtual Page Size Encodings | | | | Using the TLB | | | | Cache Algorithm Field | | | | Format of a TLB Entry | | | | Address Translation | | | | Address Space Identification (ASID) | | | | Global Processes (G) | | | | Avoiding TLB Conflict | 318 | | 16.1<br>16.2 | Causing and Returning from an Exception | | | | • | | | 16.3 | TLB Refill Vector Selection | | | | Priority of Exceptions | | | | Cold Reset Exception | | | | Soft Reset Exception | | | | NMI Exception | | | | Address Error Exception | | | | TLB Exceptions | | | | TLB Refill Exception | | | | TLB Invalid Exception | | | | TLB Modified Exception | | | | Cache Error Exception | | | | Virtual Coherency Exception | | | | Bus Error Exception | | | | Integer Overflow Exception | | | | Trap Exception | | | | System Call Exception | | | | Breakpoint Exception | | | | Reserved Instruction Exception. | | | | Coprocessor Unusable Exception | 340 | | | Floating-Point Exception | 2/1 | | | 1 10 wing 1 0 mil 2 morphism | 341 | | | Watch Exception | 342 | | | | 342 | | 16.4 | Watch Exception | 342 | | 16.4<br>16.5 | Watch Exception Interrupt Exception | 342<br>343<br>344 | | | Watch Exception | 342<br>343<br>344<br>345<br>345 | # 17. Cache Test Mode | 17.1 | Interface Signals | | |--------|--------------------------------------------------------------------------|-----| | 17.2 | System Interface Clock Divisor | 348 | | 17.3 | Entering Cache Test Mode | 349 | | 17.4 | Exit Sequence | 350 | | 17.5 | SysAD(63:0) Encoding | 351 | | 17.6 | Cache Test Mode Protocol | 352 | | | Normal Write Protocol | 352 | | | Auto-Increment Write Protocol | 353 | | | Normal Read Protocol | 354 | | | Auto-Increment Read Protocol | 355 | | Append | ix A Glossary | | | A.1 | Superscalar Processor | | | A.2 | Pipeline | | | A.3 | Pipeline Latency | | | A.4 | Pipeline Repeat Rate | | | A.5 | Out-of-Order Execution | | | A.6 | Dynamic Scheduling | | | A.7 | Instruction Fetch, Decode, Issue, Execution, Completion, and Graduation | | | A.8 | Active List | 359 | | A.9 | Free List and Busy Registers | 360 | | A.10 | Register Renaming | 360 | | A.11 | Nonblocking Loads and Stores | 361 | | A.12 | Speculative Branching | | | A.13 | Logical and Physical Registers | 363 | | A.14 | Register Files | 363 | | A.15 | ANDES Architecture | 364 | | Append | ix B Differences between R10000 and R12000 | | | B.1 | Mode bits changed in R12000 | 366 | | B.2 | DSD (Delay Speculative Dirty) | 367 | | B.3 | Config Register[22] | 368 | | B.4 | Config Register[23] | 368 | | B.5 | Changes in the Branch Diag Register | 368 | | B.6 | Eliminate traps for Denorm/NaN FP inputs | 369 | | B.7 | Increase in pre-decode buffering | 370 | | B.8 | Increased penalty for indirect branches | 370 | | B.9 | Addition of a Branch Target Address Cache | 370 | | B.10 | Use of global history in branch-prediction | 371 | | B.11 | Increase in branch prediction table size | 371 | | B.12 | Address calculation for load/store instructions uses integer queue | 371 | | B.13 | Load/store dependency is speculatively ignored | 371 | | B.14 | DCache set locking relaxed | 372 | | B.15 | SC refill blocking reduced | | | B.16 | Increased the Way Prediction Table (MRU table) to 16K single-bit entries | 372 | | B.17 | Additional cycles for System Interface transactions | 372 | | B.18 | FP and Integer-Queue Issue Policy | | | B.19 | Active List entries are increased to 48 | 373 | | | | | | B.20 | Cache Error inhibits graduation | 373 | |------|----------------------------------------------------------------|-----| | | Changed Spare(1, 3) pins to NC (No Connection) | | | | CacheOp Index Write Back Invalidate(D) also clears Primary Tag | | | | Summary of the differences | | # Appendix C Index # 1. Introduction to the R10000 Processor This user's manual describes the R10000 superscalar microprocessor for the system designer, paying special attention to the external interface and the transfer protocols. This chapter describes the following: - MIPSTM ISA - what makes a generic superscalar microprocessor - specifics of the R10000 superscalar microprocessor - implementation-specific CPU instructions ## 1.1 MIPS Instruction Set Architecture (ISA) MIPS has defined an instruction set architecture (ISA), implemented in the following sets of CPU designs: - MIPS I, implemented in the R2000<sup>TM</sup> and R3000 - MIPS II, implemented in the R6000<sup>TM</sup> - MIPS III, implemented in the R4400 - MIPS IV, implemented in the R8000™ and R10000 The original MIPS I CPU ISA has been extended forward three times, as shown in Figure 1-1; each extension is backward compatible. The ISA extensions are inclusive; each new architecture level (or version) includes the former levels.<sup>†</sup> Figure 1-1 MIPS ISA with Extensions The practical result is that a processor implementing MIPS IV is also able to run MIPS I, MIPS II, or MIPS III binary programs without change. <sup>†</sup> For more ISA information, please refer to the *MIPS IV Instruction Set Architecture*, published by MIPS Technologies, and written by Charles Price. Contact information is provided both in the *Preface*, and inside the front cover, of this manual. ## 1.2 What is a Superscalar Processor? A superscalar processor is one that can fetch, execute and complete more than one instruction in parallel. ## Pipeline and Superpipeline Architecture Previous MIPS processors had linear pipeline architectures; an example of such a linear pipeline is the R4400 superpipeline, shown in Figure 1-2. In the R4400 superpipeline architecture, an instruction is executed each cycle of the pipeline clock (PCycle), or each *pipe stage*. Figure 1-2 R4400 Pipeline ## **Superscalar Architecture** The structure of 4-way superscalar pipeline is shown in Figure 1-3. At each stage, four instructions are handled in parallel. Note that there is only one EX stage for integers. Figure 1-3 4-Way Superscalar Pipeline ## 1.3 What is an R10000 Microprocessor? The R10000 processor is a single-chip superscalar RISC microprocessor that is a followon to the MIPS RISC processor family that includes, chronologically, the R2000, R3000, R6000, R4400, and R8000. The R10000 processor uses the MIPS ANDES architecture, or *Architecture with Non-sequential Dynamic Execution Scheduling*. The R10000 processor has the following major features (terms in **bold** are defined in the Glossary): - it implements the 64-bit MIPS IV instruction set architecture (ISA) - it can decode four instructions each pipeline cycle, appending them to one of three *instruction queues* - it has five *execution pipelines* connected to separate internal integer and floating-point execution (or *functional*) units - it uses dynamic instruction scheduling and out-of-order execution - it uses speculative instruction issue (also termed "speculative branching") - it uses a precise exception model (exceptions can be traced back to the instruction that caused them) - it uses non-blocking caches - it has separate on-chip 32-Kbyte primary instruction and data caches - · it has individually-optimized secondary cache and System interface ports - it has an internal controller for the external secondary cache - it has an internal System interface controller with multiprocessor support ## Errata The R10000 processor is implemented using 0.35-micron CMOS VLSI circuitry on a single 17 mm-by-18 mm chip that contains about <u>6.7</u> million transistors, including about 4.4 million transistors in its primary caches. ## **R10000 Superscalar Pipeline** The R10000 superscalar processor fetches and decodes four instructions in parallel each cycle (or pipeline stage). Each pipeline includes stages for fetching (stage 1 in Figure 1-4), decoding (stage 2) issuing instructions (stage 3), reading register operands (stage 3), executing instructions (stages 4 through 6), and storing results (stage 7). Figure 1-4 Superscalar Pipeline Architecture in the R10000 ## **Instruction Queues** As shown in Figure 1-4, each instruction decoded in stage 2 is appended to one of three instruction *queues*: - · integer queue - · address queue - floating-point queue ## **Execution Pipelines** The three instruction queues can issue (see the Glossary for a definition of *issue*) one new instruction per cycle to each of the five execution pipelines: - the integer queue issues instructions to the two integer ALU pipelines - the address queue issues one instruction to the Load/Store Unit pipeline - the floating-point queue issues instructions to the floating-point adder and multiplier pipelines A sixth pipeline, the fetch pipeline, reads and decodes instructions from the instruction cache. #### **★** Load/store dependency is speculatively ignored (R12000) When a load follows a store in program-order, and the address of the load is known to the Address Queue (AQ) before the address of the store, then the AQ may speculatively issue the load to tag-check and data access. When the address of the store is determined, the AQ can undo the effects of the load through the use of the "soft-exception" mechanism. Since almost all loads which are actually dependent on previous stores use the same registers to form their addresses, normally either the two instructions are independent, or their addresses are resolved in program order, so the soft-exception should occur rarely. #### 64-bit Integer ALU Pipeline The 64-bit integer pipeline has the following characteristics: - it has a 16-entry integer instruction queue that dynamically issues instructions - it has a 64-bit 64-location integer physical register file, with seven read and three write ports (32 logical registers; see *register renaming* in the Glossary) - it has two 64-bit arithmetic logic units: - ALU1 contains an arithmetic-logic unit, shifter, and integer branch comparator - ALU2 contains an arithmetic-logic unit, integer multiplier, and divider ## **Load/Store Pipeline** The load/store pipeline has the following characteristics: - it has a 16-entry address queue that dynamically issues instructions, and uses the integer register file for base and index registers - it has a 16-entry address stack for use by non-blocking loads and stores - it has a 44-bit virtual address calculation unit - it has a 64-entry fully associative **Translation-Lookaside Buffer** (TLB), which converts virtual addresses to physical addresses, using a 40-bit physical address. Each entry maps two pages, with sizes ranging from 4 Kbytes to 16 Mbytes, in powers of 4. ## **64-bit Floating-Point Pipeline** The 64-bit floating-point pipeline has the following characteristics: - it has a 16-entry instruction queue, with dynamic issue - it has a 64-bit 64-location floating-point physical register file, with five read and three write ports (32 logical registers) - it has a 64-bit parallel multiply unit (3-cycle pipeline with 2-cycle latency) which also performs move instructions - it has a 64-bit add unit (3-cycle pipeline with 2-cycle latency) which handles addition, subtraction, and miscellaneous floating-point operations - it has separate 64-bit divide and square-root units which can operate concurrently (these units share their issue and completion logic with the floating-point multiplier) A block diagram of the processor and its interfaces is shown in Figure 1-5, followed by a description of its major logical blocks. Figure 1-5 Block Diagram of the R10000 Processor #### **Functional Units** The five execution pipelines allow overlapped instruction execution by issuing instructions to the following five functional units: - two integer ALUs (ALU1 and ALU2) - the Load/Store unit (address calculate) - the floating-point adder - the floating-point multiplier There are also three "iterative" units to compute more complex results: - Integer multiply and divide operations are performed by an Integer Multiply/ Divide execution unit; these instructions are issued to ALU2. ALU2 remains busy for the duration of the divide. - Floating-point divides are performed by the Divide execution unit; these instructions are issued to the floating-point multiplier. - Floating-point square root are performed by the Square-root execution unit; these instructions are issued to the floating-point multiplier. #### **★** Increase in pre-decode buffering (R12000) Up to 12 instruction may be buffered before being decoded. This should normally be invisible to the end user, but can be important when debugging systems in uncached-mode, since fetch and decode are now further de-coupled. ## **Primary Instruction Cache (I-cache)** The primary instruction cache has the following characteristics: - it contains 32 Kbytes, organized into 16-word blocks, is 2-way set associative, using a **least-recently used** (LRU) replacement algorithm - it reads four consecutive instructions per cycle, beginning on any word boundary within a cache block, but cannot fetch across a block boundary. - its instructions are predecoded, its fields are rearranged, and a 4-bit unit select code is appended - · it checks parity on each word - it permits non-blocking instruction fetch #### Primary Data Cache (D-cache) The primary data cache has the following characteristics: - it has two interleaved arrays (two 16 Kbyte ways) - it contains 32 Kbytes, organized into 8-word blocks, is 2-way set associative, using an LRU replacement algorithm. - it handles 64-bit load/store operations - it handles 128-bit refill or write-back operations - it permits non-blocking loads and stores - it checks parity on each byte ## **★** Branch Target Address Cache (R12000) This 32-entry two-way set-associative cache holds the target addresses of previously-taken branches. When a branch is executed a hit in the BTAC eliminates the one-cycle fetch bubble with the R10000 experiences for every taken branch. However, if a branch which hits in the BTAC is actually predicted not-taken, then a one cycle fetch bubble is introduced where none was present before. Performance simulations indicate that the BTAC is a net win, but because of its "mixed-blessing" nature, a mechanism has been provided to disable it via software. (See description of changes to diag register). #### **Instruction Decode And Rename Unit** The instruction decode and rename unit has the following characteristics: - it processes 4 instructions in parallel - it replaces logical register numbers with physical register numbers (register renaming) - it maps integer registers into a 33-word-by-6-bit mapping table that has 4 write and 12 read ports - it maps floating-point registers into a 32-word-by-6-bit mapping table that has 4 write and 16 read ports - it has a 32-entry active list of all instructions within the pipeline. #### **Branch Unit** The branch unit has the following characteristics: - it allows one branch per cycle - conditional branches can be executed speculatively, up to 4-deep - it has a 44-bit adder to compute branch addresses - it has a 4-quadword branch-resume buffer, used for reversing mispredicted speculatively-taken branches ## Errata - the Branch <u>Return Cache</u> contains four instructions following a subroutine call, for rapid use when returning from leaf subroutines - it has program trace RAM that stores the program counter for each instruction in the pipeline ## **External Interfaces** The external interfaces have the following characteristics: - a 64-bit System interface allows direct-connection for 2-way to 4-way multiprocessor systems. 8-bit ECC Error Check and Correction is made on address and data transfers. - a secondary cache interface with 128-bit data path and tag fields. 9-bit ECC Error Check and Correction is made on data quadwords, 7-bit ECC is made on tag words. It allows connection to an external secondary cache that can range from 512 Kbytes to 16 Mbytes, using external static RAMs. The secondary cache can be organized into either 16- or 32-word blocks, and is 2-way set associative. Bit definitions are given in Chapter 3. ## **★** Additional cycles for System Interface transactions (R12000) All transactions which go through the system interface unit (in particular, SCache refills and writebacks) have one additional CPU-clock of latency added to them. ## 1.4 Instruction Queues The processor keeps decoded instructions in three instruction queues, which dynamically issue instructions to the execution units. The queues allow the processor to fetch instructions at its maximum rate, without stalling because of instruction conflicts or dependencies. Each queue uses instruction tags to keep track of the instruction in each execution pipeline stage. These tags set a *Done* bit in the active list as each instruction is completed. ## **★** FP and Integer-Queue Issue Policy (R12000) The integer and floating-point queues are altered so that they are now composed of two 8-entry banks. Instructions are issued into the two banks in an alternating fashion. Each bank independently nominates instructions for the functional units. For each FU, the banks nominate the oldest instruction they contain which is ready to execute. If both banks nominate an instruction for a given FU, a winner is chosen by a priority bit which alternates between the two banks on each cycle. ## **Integer Queue** The integer queue issues instructions to the two integer arithmetic units: ALU1 and ALU2. The integer queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded integer instructions are written into empty entries in no particular order. Instructions remain in this queue only until they have been issued to an ALU. Branch and shift instructions can be issued only to ALU1. Integer multiply and divide instructions can be issued only to ALU2. Other integer instructions can be issued to either ALU. The integer queue controls six dedicated ports to the integer register file: two operand read ports and a destination write port for each ALU. #### **★** Address calculation for load/store instructions uses integer queue (R12000) When load, store, cacheop, or prefetch instructions are decoded, they are sent to both the AQ and IQ units. The IQ treats the address-calculate unit as a third "ALU" and issues instructions to it. When an instruction completes address calculation, the results are forwarded to the AQ. Unlike previously, if an address instruction must be retried for any reason, address calculation is not redone. If the address queue is full, but the integer queue has free entries at the time a load/store instruction is decoded, the load/store is sent only to the integer queue. When the address queue has an available entry the calculated address is forwarded to that entry and the remainder of the load/store execution continues. ## **Floating-Point Queue** The floating-point queue issues instructions to the floating-point multiplier and the floating-point adder. The floating-point queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded floating-point instructions are written into empty entries in random order. Instructions remain in this queue only until they have been issued to a floating-point execution unit. The floating-point queue controls six dedicated ports to the floating-point register file: two operand read ports and a destination port for each execution unit. The floating-point queue uses the multiplier's issue port to issue instructions to the square-root and divide units. These instructions also share the multiplier's register ports. The floating-point queue contains simple sequencing logic for multiple-pass instructions such as Multiply-Add. These instructions require one pass through the multiplier, then one pass through the adder. #### **Address Queue** The address queue issues instructions to the load/store unit. The address queue contains 16 instruction entries. Unlike the other two queues, the address queue is organized as a circular **First-In First-Out** (FIFO) buffer. A newly decoded load/store instruction is written into the next available sequential empty entry; up to four instructions may be written during each cycle. The FIFO order maintains the program's original instruction sequence so that memory address dependencies may be easily computed. Instructions remain in this queue until they have graduated; they cannot be deleted immediately after being issued, since the load/store unit may not be able to complete the operation immediately. The address queue contains more complex control logic than the other queues. An issued instruction may fail to complete because of a memory dependency, a cache miss, or a resource conflict; in these cases, the queue must continue to reissue the instruction until it is completed. The address queue has three issue ports: • First, it issues each instruction once to the address calculation unit. This unit uses a 2-stage pipeline to compute the instruction's memory address and to translate it in the TLB. Addresses are stored in the address stack and in the queue's dependency logic. This port controls two dedicated read ports to the integer register file. If the cache is available, it is accessed at the same time as the TLB. A tag check can be performed even if the data array is busy. - Second, the address queue can re-issue accesses to the data cache. The queue allocates usage of the four sections of the cache, which consist of the tag and data sections of the two cache banks. Load and store instructions begin with a tag check cycle, which checks to see if the desired address is already in cache. If it is not, a refill operation is initiated, and this instruction waits until it has completed. Load instructions also read and align a doubleword value from the data array. This access may be either concurrent to or subsequent to the tag check. If the data is present and no dependencies exist, the instruction is marked *done* in the queue. - Third, the address queue can issue store instructions to the data cache. A store instruction may not modify the data cache until it graduates. Only one store can graduate per cycle, but it may be anywhere within the four oldest instructions, if all previous instructions are already completed. The access and store ports share four register file ports (integer read and write, floating-point read and write). These shared ports are also used for Jump and Link and Jump Register instructions, and for move instructions between the integer and register files. ## 1.5 Program Order and Dependencies From a programmer's perspective, instructions appear to execute sequentially, since they are fetched and graduated in program order (the order they are presented to the processor by software). When an instruction stores a new value in its destination register, that new value is immediately available for use by subsequent instructions. Internal to the processor, however, instructions are executed dynamically, and some results may not be available for many cycles; yet the hardware must behave as if each instruction is executed sequentially. This section describes various conditions and dependencies that can arise from them in pipeline operation, including: - instruction dependencies - execution order and stalling - branch prediction and speculative execution - · resolving operand dependencies - resolving exception dependencies ## **Instruction Dependencies** Each instruction depends on all previous instructions which produced its operands, because it cannot begin execution until those operands become valid. These dependencies determine the order in which instructions can be executed. #### **Execution Order and Stalling** The actual execution order depends on the processor's organization; in a typical pipelined processor, instructions are executed only in program order. That is, the next sequential instruction may begin execution during the next cycle, if all of its operands are valid. Otherwise, the pipeline stalls until the operands do become valid. Since instructions execute in order, stalls usually delay all subsequent instructions. A clever compiler can improve performance by re-arranging instructions to reduce the frequency of these stall cycles. - In an *in-order superscalar processor*, several consecutive instructions may begin execution simultaneously, if all their operands are valid, but the processor stalls at any instruction whose operands are still busy. - In an *out-of-order superscalar processor*, such as the R10000, instructions are decoded and stored in queues. Each instruction is eligible to begin execution as soon as its operands become valid, independent of the original instruction sequence. In effect, the hardware rearranges instructions to keep its execution units busy. This process is called *dynamic issuing*. ## **Branch Prediction and Speculative Execution** Although one or more instructions may begin execution during each cycle, each instruction takes several (or many) cycles to complete. Thus, when a branch instruction is decoded, its branch condition may not yet be known. However, the R10000 processor can *predict* whether the branch is taken, and then continue decoding and executing subsequent instructions along the predicted path. #### Errata When a branch prediction is wrong, the processor must back up to the original branch and take the other path. This technique is called *speculative execution*. Whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores the <u>processor's state</u> to the state it held before the branch. <u>However, the cache state is not restored (see the section titled "Side Effects of Speculative Execution").</u> Branch prediction can be controlled by the CPO *Diagnostic* register. Branch Likely instructions are always predicted as taken, which also means the instruction in the delay slot of the Branch Likely instruction will always be speculatively executed. Since the branch predictor is neither used nor updated by branch-likely instructions, these instructions do not affect the prediction of "normal" conditional branches. ## **Resolving Operand Dependencies** Operands include registers, memory, and condition bits. Each operand type has its own dependency logic. In the R10000 processor, dependencies are resolved in the following manner: - register dependencies are resolved by using register renaming and the associative comparator circuitry in the queues - memory dependencies are resolved in the Load/Store Unit - condition bit dependencies are resolved in the active list and instruction queues ## **Resolving Exception Dependencies** In addition to operand dependencies, each instruction is implicitly dependent upon any previous instruction that generates an exception. Exceptions are caused whenever an instruction cannot be properly completed, and are usually due to either an untranslated virtual address or an erroneous operand. The processor design implements precise exceptions, by: - identifying the instruction which caused the exception - preventing the exception-causing instruction from graduating - aborting all subsequent instructions Thus, all register values remain the same as if instructions were executed singly. Effectively, all previous instructions are completed, but the faulting instruction and all subsequent instructions do not modify any values. ## **Strong Ordering** A multiprocessor system that exhibits the same behavior as a uniprocessor system in a multiprogramming environment is said to be *strongly ordered*. The R10000 processor behaves as if strong ordering is implemented, although it does not actually execute all memory operations in strict program order. In the R10000 processor, store operations remain pending until the store instruction is ready to graduate. Thus, stores are executed in program order, and memory values are precise following any exception. For improved performance however, cached load operations my occur in any order, subject to memory dependencies on pending store instructions. To maintain the appearance of strong ordering, the processor detects whenever the reordering of a cached load might alter the operation of the program, backs up, and then re-executes the affected load instructions. Specifically, whenever a primary data cache block is invalidated due to an external coherency request, its index is compared with all outstanding load instructions. If there is a match and the load has been completed, the load is prevented from graduating. When it is ready to graduate, the entire pipeline is flushed, and the processor is restored to the state it had before the load was decoded. An uncached or uncached accelerated load or store instruction is executed when the instruction is ready to graduate. This guarantees strong ordering for uncached accesses. Since the R10000 processor behaves as if it implemented strong ordering, a suitable system design allows the processor to be used to create a shared-memory multiprocessor system with strong ordering. ### **An Example of Strong Ordering** Given that locations X and Y have no particular relationship—that is, they are not in the same cache block—an example of strong ordering is as follows: - Processor A performs a store to location X and later executes a load from location Y. - Processor B performs a store to location Y and later executes a load from location X. The two processors are running asynchronously, and the order of the above two sequences is unknown. For the system to be strongly ordered, either processor A must load the new value of Y, or processor B must load the new value of X, or both processors A and B must load the new values of Y and X, respectively, under all conditions. If processors A and B both load old values of Y and X, respectively, under any conditions, the system is not strongly ordered. | New Value | | Strongly | | |-------------|-------------|----------|--| | Processor A | Processor B | Ordered | | | No | No | No | | | Yes | No | Yes | | | No | Yes | Yes | | | Yes | Yes | Yes | | # **1.6 R10000 Pipelines** This section describes the stages of the superscalar pipeline. Instructions are processed in six partially-independent pipelines, as shown in Figure 1-4. The Fetch pipeline reads instructions from the instruction cache<sup>†</sup>, decodes them, renames their registers, and places them in three instruction queues. The instruction queues contain integer, address calculate, and floating-point instructions. From these queues, instructions are dynamically issued to the five pipelined execution units. ## Stage 1 In stage 1, the processor fetches four instructions each cycle, independent of their alignment in the instruction cache — except that the processor cannot fetch across a 16-word cache block boundary. These words are then aligned in the 4-word *Instruction* register. If any instructions were left from the previous decode cycle, they are merged with new words from the instruction cache to fill the *Instruction* register. ## Stage 2 In stage 2, the four instructions in the *Instruction* register are decoded and renamed. (Renaming determines any dependencies between instructions and provides precise exception handling.) When renamed, the *logical* registers referenced in an instruction are mapped to *physical* registers. Integer and floating-point registers are renamed independently. A logical register is mapped to a new physical register whenever that logical register is the destination of an instruction. Thus, when an instruction places a new value in a logical register, that logical register is renamed (mapped) to a new physical register, while its previous value is retained in the old physical register. As each instruction is renamed, its logical register numbers are compared to determine if any dependencies exist between the four instructions decoded during this cycle. After the physical register numbers become known, the Physical Register Busy table indicates whether or not each operand is valid. The renamed instructions are loaded into integer or floating-point instruction queues. Only one branch instruction can be executed during stage 2. If the instruction register contains a second branch instruction, this branch is not decoded until the next cycle. The branch unit determines the next address for the Program Counter; if a branch is taken and then reversed, the branch resume cache provides the instructions to be decoded during the next cycle. <sup>†</sup> The processor checks only the instruction cache during an instruction fetch; it does not check the data cache. ## Stage 3 In stage 3, decoded instructions are written into the queues. Stage 3 is also the start of each of the five execution pipelines. ## Stages 4-6 In stages 4 through 6, instructions are executed in the various functional units. These units and their execution process are described below. ### Floating-Point Multiplier (3-stage Pipeline) Single- or double-precision multiply and conditional move operations are executed in this unit with a 2-cycle latency and a 1-cycle repeat rate. The multiplication is completed during the first two cycles; the third cycle is used to pack and transfer the result. ### Floating-Point Divide and Square-Root Units Single- or double-precision division and square-root operations can be executed in parallel by separate units. These units share their issue and completion logic with the floating-point multiplier. #### Floating-Point Adder (3-stage Pipeline) Single- or double-precision add, subtract, compare, or convert operations are executed with a 2-cycle latency and a 1-cycle repeat rate. Although a final result is not calculated until the third pipeline stage, internal bypass paths set a 2-cycle latency for dependent add or multiply instructions. #### **Integer ALU1 (1-stage Pipeline)** Integer add, subtract, shift, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. This ALU also verifies predictions made for branches that are conditional on integer register values. #### **Integer ALU2 (1-stage Pipeline)** Integer add, subtract, and logic operations are executed with a 1-cycle latency and a 1-cycle repeat rate. Integer multiply and divide operations take more than one cycle. #### Address Calculation and Translation in the TLB A single memory address can be calculated every cycle for use by either an integer or floating-point load or store instruction. Address calculation and load operations can be calculated out of program order. ## Errata The calculated address is translated from a 44-bit virtual address into a 40-bit physical address using a translation-lookaside buffer. The TLB contains 64 entries, each of which can translate two pages. Each entry can select a page size ranging from 4 Kbytes to 16 Mbytes, inclusive, in <u>powers</u> of 4, as shown in Figure 1-6. Load instructions have a 2-cycle latency if the addressed data is already within the data cache. Store instructions do not modify the data cache or memory until they graduate. # 1.7 Implications of R10000 Microarchitecture on Software The R10000 processor implements the MIPS architecture by using the following techniques to improve throughput: - · superscalar instruction issue - speculative execution - non-blocking caches These microarchitectural techniques have special implications for compilation and code scheduling. ## **Superscalar Instruction Issue** The R10000 processor has parallel functional units, allowing up to four instructions to be fetched and up to five instructions to be issued or completed each cycle. An ideal code stream would match the fetch bandwidth of the processor with a mix of independent instructions to keep the functional units as busy as possible. To create this ideal mix, every cycle the hardware would select one instruction from each of the columns below. (Floating-point divide, floating-point square root, integer multiply and integer divide cannot be started on each cycle.) The processor can look ahead in the code, so the mix should be kept close to the ideal described below. | Column A | Column B | Column C | Column D | Column E | |----------|----------|----------|----------|----------| | FPadd | FP mul | FPload | add/sub | add/sub | | | FPdiv | FPstore | shift | mul | | | FPsqrt | load | branch | div | | | | store | logical | logical | Data dependencies are detected in hardware, but limit the degree of parallelism that can be achieved. Compilers can intermix instructions from independent code streams. ## **Speculative Execution** Speculative execution increases parallelism by fetching, issuing, and completing instructions even in the presence of unresolved conditional branches and possible exceptions. Following are some suggestions for increasing program efficiency: - Compilers should reduce the number of branches as much as possible - "Jump Register" instructions should be avoided. - Aggressive use of the new integer and floating point conditional move instructions is recommended. - Branch prediction rates may be improved by organizing code so that each branch goes the same direction most of the time, since a branch that is taken 50% of the time has higher average cost than one taken 90% of the time. The MIPS IV conditional move instructions may be effective in improving performance by replacing unpredictable branches. ## Errata #### **Side Effects of Speculative Execution** To improve performance, R10000 instructions can be speculatively fetched and executed. Side-effects are harmless in cached coherent operations; however there are potential side-effects with non-coherent cached operations. These side-effects are described in the sections that follow. Speculatively fetched instructions and speculatively executed loads or stores to a cached address initiate a *Processor Block Read Request* to the external interface if it misses in the cache. The speculative operation may modify the cache state and/or data, and this modification may not be reversed even if the speculation turns out to be incorrect and the instruction is aborted. #### Speculative Processor Block Read Request to an I/O Address Accesses to I/O addresses often cause side-effects. Typically, such I/O addresses are mapped to an uncached region and uncached reads and writes are made as double/single/partial-word reads and writes (non-block reads and writes) in R10000. Uncached reads and writes are guaranteed to be non-speculative. However, if R10000 has a "garbage" value in a register, a speculative block read request to an unpredictable physical address can occur, if it speculatively fetches data due to a Load or Jump Register instruction specifying this register. Therefore, speculative block accesses to load-sensitive I/O areas can present an unwanted side-effect. #### **Unexpected Write Back Due to Speculative Store Instruction** When a Store instruction is speculated and the target address of the speculative Store instruction is missing in the cache, the cache line is refilled and the state is marked to be *Dirty*. However the refilled data may not be actually changed in the cache if this *store* instruction is later aborted. This could present a side-effect in cases such as the one described below: - The processor is storing data sequentially to memory area A, using a code-loop that includes Store and Cond.branch instructions. - A DMA write operation is performed to memory area B. - DMA area B is contiguous to the sequential storage area A. - The DMA operation is noncoherent. - The processor does not cache any lines of DMA area B. If the processor and the DMA operations are performed in sequence, the following could occur: - 1. Due to speculative execution at the exit of the code-loop, the line of data beyond the end of the memory area A that is, the starting line of memory area B is refilled to the cache. This cache line is then marked *Dirty*. - 2. The DMA operation starts writing noncoherent data into memory area B. - 3. A cache line replacement is caused by later activities of the processor, in which the cache line is written back to the top of area B. Thus, the first line of the DMA area B is overwritten by old cache data, resulting in incorrect DMA operation and data. The OS can restrict the writable pages for each user process and so can prevent a user process from interfering with an active DMA space. The kernel, on the other hand, retains *xkphys* and *kseg0* addresses in registers. There is no write protection against the speculative use of the address values in these registers. User processes which have pages mapped to physical spaces not in RAM may also have side-effects. These side-effects can be avoided if DMA is coherent. #### **Speculative Instruction Fetch** The change in a cache line's state due to a speculative instruction fetch is not reversed if the speculation is aborted. This does not cause any problems visible to the program except during a noncoherent memory operation. Then the following side-effect exists: if a noncoherent line is changed to *Clean Exclusive* and this line is also present in noncoherent space, the noncoherent data could be modified by an external component and the processor would then have stale data. #### **Workarounds for Noncoherent Cached Systems** The suggestions presented below are not exhaustive; the solutions and trade-offs are system dependent. Any one or more of the items listed below might be suitable in a particular system, and testing and simulations should be used to verify their efficacy. - 1. The external agent can reject a *processor block read request* to any I/O location in which a speculative load would cause an undesired affect. Rejection is made by returning an external *NACK completion response*. - 2. A serializing instruction such as a cache barrier or a CP0 instruction can be used to prevent speculation beyond the point where speculative stores are allowed to occur. This could be at the beginning of a basic block that includes instructions that can cause a store with an unsafe pointer. (Stores to addresses like stack-relative, global-pointer-relative and pointers to non-I/O memory might be safe.) Speculative loads can also cause a side-effect. To make sure there is no stale data in the cache as a result of undesired speculative loads, portions of the cache referred by the address of the DMA read buffers could be flushed after every DMA transfer from the I/O devices. - 3. Make references to appropriate I/O spaces uncached by changing the cache coherency attribute in the TLB. - 4. Generally, arbitrary accesses can be controlled by mapping selected addresses through the TLB. However, references to an unmapped cached *xkphys* region could have hazardous affects on I/O. A solution for this is given below: First of all, note that the *xkphys* region is hard-wired into cached and uncached regions, however the cache attributes for the *kseg0* region are programmed through the *Config* register. Therefore, clear the *KX* bit (to a zero) and set (to ones) the *SX* and *UX* bits in the *Status* register. This disables access to the *xkphys* region and restricts access to only the User and Supervisor portions of the 64-bit address space. In general, the system needs either a coherent or a noncoherent protocol — but not both. Therefore these cache attributes can be used by the external hardware to filter accesses to certain parts of the <code>kseg0</code> region. For instance, the cache attributes for the <code>kseg0</code> address space might be defined in the <code>Config</code> register to be <code>cache coherent</code> while the cache attributes in the TLB for the rest of virtual space are defined to be <code>cached-noncoherent</code> or <code>uncached</code>. The external hardware could be designed to reject all <code>cache coherent</code> mode references to the memory except to that prior-defined <code>safe</code> space in <code>kseg0</code> within which there is no possibility of an I/O DMA transfer. Then before the DMA read process and before the cache is flushed for the DMA read buffers, the cache attributes in the TLB for the I/O buffer address space are changed from <code>noncoherent</code> to <code>uncached</code>. After the DMA read, the access modes are returned to the <code>cached-noncoherent</code> mode. 5. Just before load/store instruction, use a *conditional move* instruction which tests for the reverse condition in the speculated branch, and make all aborted branch assignments *safe*. An example is given below: ``` bne r1, r0, label ---- r0, r1 # test to see if r1 != 0; if r1 != 0 then branch ra, movn # is mispredicted; move safe address (r0) # into ra ld r4, 0 (ra) # Without the previous movn, this Ild # could create damaging read. label: ``` In the above example, without the MOVN the read to the address in register *ra* could be speculatively executed and later aborted. It is possible that this load could be premature and thus damaging. The MOVN guarantees that if there is a misprediction (*r1* is not equal to 0) *ra* will be loaded with an address to which a read will not be damaging. 6. The following is similar to the conditional-move example given above, in that it protects speculation only for a single branch, but in some instances it may be more efficient than either the conditional move or the cache barrier workarounds. This workaround uses the fact that branch-likely instructions are always predicted as taken by the R10000. Thus, any incorrect speculation by the R10000 on a branch-likely always occurs on a taken path. Sample code is: ``` beql rx, r1, label nop sw r2, 0x0(r1) label: ---- ``` The store to r1 will never be to an address referred to by the content of rx, because the store will never be executed speculatively. Thus, the address referred to by the content of rx is protected from any spurious write-backs. This workaround is most useful when the branch is often taken, or when there are few instructions in the protected block that are not memory operations. Note that no instructions in a block following a branch-likely will be initiated by speculation on that branch; however, in the case of a *serial instruction* workaround, only memory operations are prevented from speculative initiation. In the case of the *conditional-move* workaround, speculative initiation of all instructions continues unimpeded. Also, similar to the *conditional-move* workaround, this workaround only protects fall-through blocks from speculation on the immediately preceding branch. Other mechanisms must be used to ensure that no other branches speculate into the protected block. However, if a block that *dominates* the fall-through block can be shown to be protected, this may be sufficient. Thus, if block (a) dominates block (b), and block (b) is the fall-through block shown above, and block (a) is the immediately previous block in the program (i.e., only the single conditional branch that is being replaced intervenes between (a) and (b)), then ensuring that (a) is protected by *serial instruction* means a branch-likely can safely be used as protection for (b). # **Nonblocking Caches** As processor speed increases, the processor's data latency and bandwidth requirements rise more rapidly than the latency and bandwidth of cost-effective main memory systems. The memory hierarchy of the R10000 processor tries to minimize this effect by using large set-associative caches and higher bandwidth cache refills to reduce the cost of loads, stores, and instruction fetches. Unlike the R4400, the R10000 processor does not stall on data cache misses, instead defers execution of any dependent instructions until the data has been returned and continues to execute independent instructions (including other memory operations that may miss in the cache). Although the R10000 allows a number of outstanding primary and secondary cache misses, compilers should organize code and data to reduce cache misses. When cache misses are inevitable, the data reference should be scheduled as early as possible so that the data can be fetched in parallel with other unrelated operations. As a further antidote to cache miss stalls, the R10000 processor supports prefetch instructions, which serve as hints to the processor to move data from memory into the secondary and primary caches when possible. Because prefetches do not cause dependency stalls or memory management exceptions, they can be scheduled as soon as the data address can be computed, without affecting exception semantics. Indiscriminate use of prefetch instructions can slow program execution because of the instruction-issue overhead, but selective use of prefetches based on compiler miss prediction can yield significant performance improvement for dense matrix computations. <sup>†</sup> In compiler parlance, block (a) *dominates* block (b) if and only if every time block (b) is executed, block (a) is executed first. Note that block (a) does not have to immediately precede block (b) in execution order; some other block may intervene. # 1.8 Performance As it executes programs, the R10000 superscalar processor performs many operations in parallel. Instructions can also be executed out of order. Together, these two facts greatly improve performance, but they also make it difficult to predict the time required to execute any section of a program, since it often depends on the instruction mix and the critical dependencies between instructions. The processor has five largely independent execution units, each of which are individualized for a specific class of instructions. Any one of these units may limit processor performance, even as the other units sit idle. If this occurs, instructions which use the idle units can be added to the program without adding any appreciable delay. # **User Instruction Latency and Repeat Rate** Table 1-1 shows the latencies and repeat rates for all user instructions executed in ALU1, ALU2, Load/Store, Floating-Point Add and Floating-Point Multiply functional units (definitions of *latency* and *repeat rate* are given in the Glossary). Kernel instructions are not included, nor are control instructions not issued to these execution units. Table 1-1 Latencies and Repeat Rates for User Instructions | Instruction Type | <b>Execution Unit</b> | Latency | Repeat<br>Rate | Comment | | |---------------------------------------------------|-----------------------|-------------|----------------|-----------------------------------------------------------------------------------------------|--| | Integer Instructions | | | | | | | Add/Sub/Logical/Set | ALU 1/2 | 1 | 1 | | | | MF/MT HI/LO | ALU 1/2 | 1 | 1 | | | | Shift/LUI | ALU 1 | 1 | 1 | | | | Cond. Branch Evaluation | ALU 1 | 1 | 1 | | | | Cond. Move | ALU 1 | 1 | 1 | | | | MULT | ALU 2 | 5/6 | 6 | Latency relative to Lo/Hi | | | MULTU | ALU 2 | 6/7 | 7 | Latency relative to Lo/Hi | | | DMULT | ALU 2 | 9/10 | 10 | Latency relative to Lo/Hi | | | DMULTU | ALU 2 | 10/11 | 11 | Latency relative to Lo/Hi | | | DIV/DIVU | ALU 2 | 34/35 | 35 | Latency relative to Lo/Hi | | | DDIV/DDIVU | ALU 2 | 66/67 | 67 | Latency relative to Lo/Hi | | | Load (not include loads to CP1) | Load/Store | 2 | 1 | Assuming cache hit | | | Store | Load/Store | - | 1 | Assuming cache hit | | | | Floati | ng-Point In | structions | | | | MTC1/DMTC1 | ALU 1 | 3 | 1 | | | | Add/Sub/Abs/Neg/Round/Trunc/<br>Ceil/Floor/C.cond | FADD | 2 | 1 | | | | CVT.S.W/CVT.S.L | FADD | 4 | 2 | Repeat rate is on average | | | CVT (others) | FADD | 2 | 1 | - | | | Mul | FMPY | 2 | 1 | | | | MFC1/DMFC1 | FMPY | 2 | 1 | | | | Cond. Move/Move | FMPY | 2 | 1 | | | | DIV.S/RECIP.S | FMPY | 12 | 14 | | | | DIV.D/RECIP.D | FMPY | 19 | 21 | | | | SQRT.S | FMPY | 18 | 20 | | | | SQRT.D | FMPY | 33 | 35 | | | | RSQRT.S | FMPY | 30 | 20 | | | | RSQRT.D | FMPY | 52 | 35 | | | | MADD | FADD+FMPY | 2/4 | 1 | Latency is 2 only if the result is used as the operand specified by <i>fr</i> of another MADD | | | LWC1/LDC1/LWXC1/LDXC1 | LoadStore | 3 | 1 | Assuming cache hit | | Please note the following about Table 1-1: - For integer instructions, conditional trap evaluation takes a single cycle, like conditional branches. - Branches and conditional moves are not conditionally issued. - The repeat rate above for Load/Store does not include Load Link and Store Conditional. - Prefetch instruction is not included here. - The latency for multiplication and division depends upon the next instruction. - An instruction using register *Lo* can be issued one cycle earlier than one using *Hi*. - For floating-point instructions, CP1 branches are evaluated in the Graduation Unit. - CTC1 and CFC1 are not included in this table. - The repeat pattern for the CVT.S.(W/L) is "I I x x I I x x ..."; the repeat rate given here, 2, is the average. - The latency for MADD instructions is 2 cycles if the result is used as the operand specified by *fr* of the second MADD instruction. - Load Linked and Store Conditional instructions (LL, LLD, SC, and SCD) do not implicitly perform SYNC operations in the R10000. Any of the following events that occur between a Load Linked and a Store Conditional will cause the Store Conditional to fail: an exception; execution of an ERET, a load, a store, a SYNC, a CacheOp, a prefetch, or an external intervention/invalidation on the block containing the linked address. Instruction cache misses do not cause the Store Conditional to fail. - Up to four branches can be evaluated at one cycle. For more information about implementations of the LL, SC, and SYNC instructions, please see the section titled, R10000-Specific CPU Instructions, in this chapter. beq r2,r3,L1 A comparison of r2 and r3 is made to determine whether the branch is taken or not. If the branch prediction is correct, the branch instruction is graduated. Otherwise, the processor must back out of the instruction stream decoded after this branch, and inform the IFetch to fetch the correct instructions. The evaluation is made in the ALU for integer branches and in the Graduation Unit for floating-point branches. A single integer branch can be evaluated during any cycle, but there may be up to 4 condition codes waiting to be evaluated for floating-point branches. Once the condition code is evaluated, all dependant FP branches can be evaluated during the same cycle. <sup>†</sup> Only one branch can be decoded at any particular cycle. Since each conditional branch is predicted, the real direction of each branch must be "evaluated." For example, #### **Other Performance Issues** Table 1-1 shows execution times within the functional units only. Performance may also be affected by instruction fetch times, and especially by the execution of conditional branches. In an effort to keep the execution units busy, the processor predicts branches and speculatively executes instructions along the predicted path. When the branch is predicted correctly, this significantly improves performance: for typical programs, branch prediction is 85% to 90% correct. When a branch is mispredicted, the processor must discard instructions which were speculatively fetched and executed. Usually, this effort uses resources which otherwise would have been idle, however in some cases speculative instructions can delay previous instructions. #### **Cache Performance** The execution of load and store instructions can greatly affect performance. These instructions are executed quickly if the required memory block is contained in the primary data cache, otherwise there are significant delays for accessing the secondary cache or main memory. Out-of-order execution and non-blocking caches reduce the performance loss due to these delays, however. The latency and repeat rates for accessing the secondary cache are summarized in Table 1-2. These rates depend on the ratio of the secondary cache's clock to the processor's internal pipeline clock. The best performance is achieved when the clock rates are equal; slower external clocks add to latency and repeat times. The primary data cache contains 8-word blocks, which are refilled using 2-cycle transfers from the quadword-wide secondary cache. Latency runs to the time in which the processor can use the addressed data. The primary instruction cache contains 16-word blocks, which are refilled using 4-cycle transfers. | SCClkDiv<br>Mode | Latency <sup>‡</sup><br>(PClk Cycles) | Repeat<br>Rate*<br>(PClk Cycles) | |------------------|---------------------------------------|-----------------------------------------| | 1 | 6 | 2 (data cache) 4 (instruction cache) | | 1.5 | 8-10 <sup>†</sup> | 3 (data cache)<br>6 (instruction cache) | | 2 | 9-12 <sup>†</sup> | 4 (data cache)<br>8 (instruction cache) | Table 1-2 Latency and Repeat Rates for Secondary Cache Reads <sup>‡</sup> Assumes the cache way was correctly predicted, and there are no conflicting requests. <sup>\*</sup> Repeat rate = **PClk** cycles needed to transfer 2 quadwords (data cache) or 4 quadwords (instruction cache). Rate is valid for bursts of 2 to 3 cache misses; if more than three cache misses in a row, there can be a 1-cycle "bubble." <sup>†</sup> Clock synchronization causes variability. The processor mitigates access delays to the secondary cache in the following ways: - The processor can execute up to 16 load and store instructions speculatively and out-of-order, using non-blocking primary and secondary caches. That is, it looks ahead in its instruction stream to find load and store instructions which can be executed early; if the addressed data blocks are not in the primary cache, the processor initiates cache refills as soon as possible. - If a speculatively executed load initiates a cache refill, the refill is completed even if the load instruction is aborted. It is likely the data will be referenced again. - The data cache is interleaved between two banks, each of which contains independent tag and data arrays. These four sections can be allocated separately to achieve high utilization. Five separate circuits compete for cache bandwidth (address calculate, tag check, load unit, store unit, external interface.) - The external interface gives priority to its refill and interrogate operations. The processor can execute tag checks, data reads for load instructions, or data writes for store instructions. When the primary cache is refilled, any required data can be streamed directly to waiting load instructions. - The external interface can handle up to four non-blocking memory accesses to secondary cache and main memory. Main memory typically has much longer latencies and lower bandwidth than the secondary cache, which make it difficult for the processor to mitigate their effect. Since main memory accesses are non-blocking, delays can be reduced by overlapping the latency of several operations. However, although the first part of the latency may be concealed, the processor cannot look far enough ahead to hide the entire latency. Programmers may use pre-fetch instructions to load data into the caches before it is needed, greatly reducing main memory delays for programs which access memory in a predictable sequence. | 2. | System Configurations | | |----|-----------------------|---------------------------------------------------------------------------------------------------------------------| | | | | | | | | | | | | | | The P10000 masses | | | | chapter describes som | r provides the capability for a wide range of computer systems; this e of the uni- and multiprocessor alternatives. | | | | | | | | | # 2.1 Uniprocessor Systems In a typical uniprocessor system, the System interface of the R10000 processor connects in a point-to-point fashion with an external agent. Such a system is shown in Figure 2-1. The external agent is typically an ASIC that provides a gateway to the memory and I/O subsystems; in fact, this ASIC may incorporate the memory controller itself. If hardware I/O coherency is desired, the external agent may use the multiprocessor primitives provided by the processor to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests. Figure 2-1 Uniprocessor System Organization # 2.2 Multiprocessor Systems Two types of multiprocessor systems can be implemented with R10000 processor: - a dedicated external agent interfaces with each R10000 processor - up to four R10000 processors and an external agent reside on a cluster bus ## **Multiprocessor Systems Using Dedicated External Agents** A multiprocessor system may be created with R10000 processors by providing a dedicated external agent for each processor; such a system is shown in Figure 2-2. The external agent provides a path between the processor System interface and some type of coherent interconnect. In such a system, the processor provides support for three coherency schemes: - · snoopy-based - snoopy-based with external duplicate tags and control - directory-based with external directory structure and control Figure 2-2 Multiprocessor System Organization using Dedicated External Agents ## **Multiprocessor Systems Using a Cluster Bus** A multiprocessor system may be created with R10000 processors by using a cluster bus configuration. Such a system is shown in Figure 2-3. A cluster bus is created by attaching the System interfaces of up to four R10000 processors with an external agent (the *cluster coordinator*). The cluster coordinator is responsible for managing the flow of data within the cluster. This organization can reduce the number of ASICs and the pin count needed for a small multiprocessor systems. The cluster bus protocol supports three coherency schemes: - · snoopy-based - snoopy-based with external duplicate tags and control - directory-based with external directory structure and control Figure 2-3 Multiprocessor System Organization Using the Cluster Bus # 3. Interface Signal Descriptions This chapter gives a list and description of the interface signals. The R10000 interface signals may be divided into the following groups: - Power interface - Secondary Cache interface - System interface - · Test interface The following sections present a summary of the external interface signals for each of these groups. An asterisk (\*) indicates signals that are asserted as a logical 0. # 3.1 Power Interface Signals Table 3-1 presents the R10000 processor power interface signals. Table 3-1 Power Interface Signals | Signal Name | Description | Туре | |-------------|------------------------------------------------------------------------------------------------------------------------------------------|-------| | Vcc | Vcc core Vcc for the core circuits. | Input | | VccQSC | Vcc output driver secondary cache Vcc for the secondary cache interface output drivers. | Input | | VccQSys | Vcc output driver system Vcc for the System interface output drivers. | Input | | VrefSC | Voltage reference secondary cache Voltage reference for the secondary cache interface input receivers. | Input | | VrefSys | Voltage reference system Voltage reference for the System interface input receivers. | Input | | VrefByp | Voltage reference bypass This pin must be tied to <b>Vss</b> (preferably) or <b>VrefSys</b> , through at least a 100 ohm resistor. | Input | | Vss | Vss Vss for the core circuits and output drivers. | Input | | VccPa | Vcc PLL analog Vcc for the PLL analog circuits. | Input | | VssPa | Vss PLL analog Vss for the PLL analog circuits. | Input | | VccPd | Vcc PLL digital Vcc for the PLL digital circuits. | Input | | VssPd | Vss PLL digital Vss for the PLL digital circuits. | Input | | DCOk | DC voltages are OK The external agent asserts these two signals when Vcc, VccQ[SC,Sys], Vref[SC,Sys], Vcc[Pa,Pd], and SysClk are stable. | Input | # Errata **VrefByp** description changed in Table 3-1. # 3.2 Secondary Cache Interface Signals # Errata *Table 3-2; description of SCBAddr(18:0) is revised.* Table 3-2 presents the R10000 processor secondary cache interface signals. Table 3-2 Secondary Cache Interface Signals | Signal Name | Description | Type | |--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| | | SSRAM <sup>‡</sup> Clock Signals | | | SCClk(5:0)<br>SCClk*(5:0) | Secondary cache clock Duplicated complementary secondary cache clock outputs. | Output | | | SSRAM Address Signals | | | SCAAddr(18:0)<br>SCBAddr(18:0) | Secondary cache address bus SCBAddr is complementary SCAAddr 19-bit bus, which specifies the set address of the secondary cache data and tag SSRAM that is to be accessed. | Output | | SCTagLSBAddr | Secondary cache tag LSB address Signal that specifies the least significant bit of the address for the secondary cache tag SSRAM. | Output | | | SSRAM Data Signals | | | SCADWay<br>SCBDWay | Secondary cache data way Duplicated signal that indicates the way of the secondary cache data SSRAM that is to be accessed. | Output | | SCData(127:0) | Secondary cache data bus 128-bit bus to read/write cache data from/to secondary cache data SSRAM. | Bidirectional | | SCDataChk(9:0) | Secondary cache data check bus A 10-bit bus used to read/write ECC and even parity from/to the secondary cache data SSRAM. | Bidirectional | | SCADOE*<br>SCBDOE* | Secondary cache data output enable Duplicated signal that enables the outputs of the secondary cache data SSRAM. | Output | | SCADWr*<br>SCBDWr* | Secondary cache data write enable Duplicated signal that enables writing the secondary cache data SSRAM. | Output | | SCADCS*<br>SCBDCS* | Secondary cache data chip select Duplicated signal that enables the secondary cache data SSRAM. | Output | <sup>‡</sup> All cache static RAM (SRAM) are synchronous SRAM (SSRAM). Table 3-2 (cont.) Secondary Cache Interface Signals | Signal Name | Description | Type | | |---------------|---------------------------------------------------------------------------------------------------------|---------------|--| | | SSRAM Tag Signals | | | | SCTWay | Secondary cache tag way Signal indicating the way of the secondary cache tag SSRAM to be accessed. | Output | | | SCTag(25:0) | Secondary cache tag bus A 26-bit bus to read/write cache tags from/to the secondary cache tag SSRAM. | Bidirectional | | | SCTagChk(6:0) | Secondary cache tag check bus A 7-bit bus used to read/write ECC from/to the secondary cache tag SSRAM. | Bidirectional | | | SCTOE* | Secondary cache tag output enable A signal that enables the outputs of the secondary cache tag SSRAM. | Output | | | SCTWr* | Secondary cache tag write enable A signal that enables writing the secondary cache tag SSRAM. | Output | | | SCTCS* | Secondary cache tag chip select A signal which enables the secondary cache tag SSRAM. | Output | | # 3.3 System Interface Signals Table 3-3 presents the R10000 processor System interface signals. Table 3-3 System Interface Signals | Signal Name | Description | Туре | | | |-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|--|--| | System Clock Signals | | | | | | SysClk<br>SysClk* | System clock Complementary system clock input. | Input | | | | SysClkRet<br>SysClkRet* | System clock return Complementary system clock return output used for termination of the system clock. | Output | | | | | System Arbitration Signals | | | | | SysReq* | System request The processor asserts this signal when it wants to perform a processor request and it is not already master of the System interface. | Output | | | | SysGnt* | System grant The external agent asserts this signal to grant mastership of the System interface to the processor. | Input | | | | SysRel* | System release The master of the System interface asserts this signal for one <b>SysClk</b> cycle to indicate that it will relinquish mastership of the System interface in the following <b>SysClk</b> cycle. | Bidirectional | | | | | System Flow Control Signals | | | | | SysRdRdy* | System read ready The external agent asserts this signal to indicate that it can accept processor read and upgrade requests. | Input | | | | SysWrRdy* | System write ready The external agent asserts this signal to indicate that it can accept processor write and eliminate requests. | Input | | | | | System Address/Data Bus Signals | | | | | SysCmd(11:0) | System command A 12-bit bus for transferring commands between processor and the external agent. | Bidirectional | | | | SysCmdPar | System command bus parity Odd parity for the system command bus. | Bidirectional | | | | SysAD(63:0) | System address/data bus A 64-bit bus for transferring addresses and data between R10000 and the external agent. | Bidirectional | | | Table 3-3 (cont.) System Interface Signals | Signal Name | Description | Туре | | | | |---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|--|--|--| | | System State Bus Signals | | | | | | SysADChk(7:0) | System address/data check bus An 8-bit ECC bus for the system address/data bus. | Bidirectional | | | | | SysVal* | System valid The master of the System interface asserts this signal when it is driving valid information on the system command and system address/data buses. | Bidirectional | | | | | SysState(2:0) | System state bus A 3-bit bus used for issuing processor coherency state responses and also additional status indications. | Output | | | | | SysStatePar | System state bus parity Odd parity for the system state bus. | Output | | | | | SysStateVal* | System state bus valid The processor asserts this signal for one <b>SysClk</b> cycle when issuing a processor coherency state response on the system state bus. | Output | | | | | | System Response Bus Signals | | | | | | SysResp(4:0) | System response bus A 5-bit bus used by the external agent for issuing external completion responses. | Input | | | | | SysRespPar | System response bus parity Odd parity for the system response bus. | Input | | | | | SysRespVal* | System response bus valid The external agent asserts this signal for one <b>SysClk</b> cycle when issuing an external completion response on the system response bus. | Input | | | | | | System Miscellaneous Signals | | | | | | SysReset* | System reset The external agent asserts this signal to reset the processor. | Input | | | | | SysNMI* | System non-maskable interrupt The external agent asserts this signal to indicate a non-maskable interrupt. | Input | | | | | SysCorErr* | System correctable error The processor asserts this signal for one <b>SysClk</b> cycle when a correctable error is detected and corrected. | Output | | | | | SysUncErr* | System uncorrectable error The processor asserts this signal for one <b>SysClk</b> cycle when an uncorrectable tag error is detected. | Output | | | | | SysGblPerf* | System globally performed The external agent asserts this signal to indicate that all processor requests have been globally performed with respect to all external agents. | Input | | | | | SysCyc* | System cycle The external agent may use this signal to define a virtual System interface clock in a hardware emulation environment. | Input | | | | # **3.4 Test Interface Signals** Table 3-4 presents the R10000 processor test interface signals. # Errata PLLDis and SelDVCO signal descriptions are revised in Table 3-4. Table 3-4 Test Interface SignalsPLLDis | Signal Name | Description | Туре | | | |---------------|---------------------------------------------------------------------------------------------------------------------------------|--------|--|--| | JTAG Signals | | | | | | JTDI | JTAG serial data input<br>Serial data input. | Input | | | | JTDO | JTAG serial data output<br>Serial data output. | Output | | | | JTCK | JTAG clock<br>Clock input. | Input | | | | JTMS | JTAG mode select<br>Mode select input. | Input | | | | | Miscellaneous Test Signals | | | | | TCA | Testability control A (for manufacturing test only) This signal must be tied to <b>Vss</b> , through a 100 ohm resistor. | Input | | | | ТСВ | Testability control B (for manufacturing test only) This signal must be tied to <b>Vss</b> , through a 100 ohm resistor. | Input | | | | PLLDis | PLL disable (for manufacturing test only) This signal must be tied to Vss through a 100 ohm resistor. | Input | | | | PLLRC | PLL Control Node (for manufacturing test only) There must be no connection made to this signal. | | | | | PLLSpare(1:4) | These four pins must be tied to <b>Vss</b> . | | | | | Spare(1,3) | These two pins must be tied to <b>Vss</b> , through a 100 ohm resistor. | | | | | 3-State | 3-state Control The system asserts this signal to 3-state all outputs and input/output pads except for SCClk, SCCLK*, and JTDO. | Input | | | | SelDVCO | Select differential VCO (for manufacturing test only) This signal must be tied to Vcc. | Input | | | #### \* ### <R12000> Changed Spare (1, 3) pins to NC (No Connection) The spare (1, 3), shown in this manual, page 63 tied to Vss through a 100 ohm resister, is used in R12000 for diagnostic purpose and thus for R12000 should not be connected to anything. ## **Unused Inputs** Several input pins are unused during normal system operation, and should be tied to **V**cc through resistors: - JTDI - JTCK - JTMS Several input pins are unused during normal system operation, and should be tied to **Vss** through 100 ohm resistors: - TCA, TCB - PLLDis - Spare1, Spare3 Several input pins are unused during normal system operation, and should be tied to Vss: - PLLSpare1, PLLSpare2, PLLSpare3, PLLSpare4 - SelDVCO #### Errata The following input pins may be unused in certain system configurations, and each of them should be tied to **VccQSys**, preferably, through a resistor of 100 ohms or greater value: SysNMI\* The following input pins may be unused in certain system configurations, and each of them should be tied to **Vss**, preferably, through a resistor of 100 ohms or greater value: - SysRdRdy\* - SysWrRdy\* - SysGblPerf\* - SysCyc\* The following input pins may be unused in certain system configurations, and each of them should be tied (preferably) to **Vss**, or **VccQSys**, through a resistor of 100 ohms or greater value: • SysADChk(7:0) # 4. Cache Organization and Coherency The processor implements a two-level cache structure consisting of separate primary instruction and data caches and a joint secondary cache. Each cache is two-way set associative and uses a **write back protocol**; that is, two cache blocks are assigned to each set (as shown in Figure 4-1), and a cache store writes data into the cache instead of writing it directly to memory. Some time later this data is independently written to memory. A write-invalidate cache coherency protocol (described later in this chapter) is supported through a set of cache states and external coherency requests. # 4.1 Primary Instruction Cache The processor has an on-chip 32-Kbyte primary instruction cache (also referred to simply as the *instruction cache*), which is a subset of the secondary cache. Organization of the instruction cache is shown in Figure 4-1. The instruction cache has a fixed block size of 16 words and is two-way set associative with a **least-recently-used** (LRU) replacement algorithm.<sup>†</sup> The instruction cache is indexed with a virtual address and tagged with a physical address. Figure 4-1 Organization of Primary Instruction Cache Each instruction cache block is in one of the following two states: - Invalid - Valid <sup>†</sup> The precise implementation of the LRU algorithm is affected by the speculative execution of instructions. An instruction cache block can be changed from one state to the other as a result of any one of the following events: - a primary instruction cache read miss - subset property enforcement - any of various CACHE instructions - external intervention exclusive and invalidate requests These events are illustrated in Figure 4-2, which shows the primary instruction cache state diagram. Figure 4-2 Primary Instruction Cache State Diagram ## 4.2 Primary Data Cache The processor has an on-chip 32-Kbyte primary data cache (also referred to simply as the *data cache*), which is a subset of the secondary cache. The data cache uses a fixed block size of 8 words and is two-way set associative (that is, two cache blocks are assigned to each set, as shown in Figure 4-3) with an LRU replacement algorithm.<sup>†</sup> Figure 4-3 Organization of Primary Data Cache The data cache uses a write back protocol, which means a cache store writes data into the cache instead of writing it directly to memory. Sometime later this data is independently written to memory, as shown in Figure 4-4. Figure 4-4 Write Back Protocol Write back from the primary data cache goes to the secondary cache, and write back from the secondary cache goes to main memory, through the system interface. The primary data cache is written back to the secondary cache before the secondary cache is written back to the system interface. <sup>†</sup> The precise implementation of the LRU algorithm is affected by the speculative execution of instructions. The data cache is indexed with a virtual address and tagged with a physical address. Each primary cache block is in one of the following four states: - Invalid - CleanExclusive - DirtyExclusive - Shared A primary data cache block is said to be *Inconsistent* when the data in the primary cache has been modified from the corresponding data in the secondary cache. The primary data cache is maintained as a subset of the secondary cache where the state of a block in the primary data cache always matches the state of the corresponding block in the secondary cache A data cache block can be changed from one state to another as a result of any one of the following events: - · primary data cache read/write miss - primary data cache write hit - subset enforcement - a CACHE instruction - external intervention shared request - intervention exclusive request - invalidate request These events are illustrated in Figure 4-5, which shows the primary data cache state diagram. #### **★** DCache set locking relaxed (R12000) In R10000, when an AQ entry accesses a DCache line, that line is locked into the cache until the entry graduates, so that the entry will not be removed from the cache until the access completes. If another entry which needs to access exactly the same line arrives in the AQ before the first completes, the two may share the lock. In this way, a line is locked in the cache until all access to it complete. In order to prevent a deadlock from arising, whenever a cache line is locked in this way, only the oldest AQ entry can obtain a lock on the other "way" of the same cache set, thus ensuring that forward progress can be made. This algorithm can cause problems, because often the oldest entry in the AQ is the one which already owns the lock on the first way - thus ensuring that no other entries can access the second way of the cache for that set index. For some algorithms, most notably FFT's, this can cause severe performance degradation. R12000 allows an entry to obtain the lock on the second way of a set if it is the oldest entry which does not already own a lock. Thus, any entries which have already acquired a lock, including those locking the first way, will not prevent another, younger, entry from accessing that second way. Figure 4-5 Primary Data Cache State Diagram # 4.3 Secondary Cache The R10000 processor must have an external secondary cache, ranging in size from 512 Kbytes to 16 Mbytes, in powers of 2, as set by the **SCSize** mode bit. The **SCBlkSize** mode bit selects a block size of either 16 or 32 words. The secondary cache is two-way set associative (that is, two cache blocks are assigned to each set, as shown in Figure 4-6) with an LRU replacement algorithm.<sup>†</sup> The secondary cache uses a write back protocol, which means a cache store writes data into the cache instead of writing it directly to memory. Some time later this data is independently written to memory. The secondary cache is indexed with a physical address and tagged with a physical address. Figure 4-6 Organization of Secondary Cache Each secondary cache block is in one of the following four states: - Invalid - CleanExclusive - DirtyExclusive - Shared <sup>†</sup> The precise implementation of the LRU algorithm is affected by the speculative execution of instructions. A secondary cache block can be changed from one state to another as a result of any of the following events: - primary cache read/write miss - primary cache write hit to a Shared or CleanExclusive block - secondary cache read miss - secondary cache write hit to a Shared or CleanExclusive block - a CACHE instruction - external intervention shared request - intervention exclusive request - invalidate request These events are illustrated in Figure 4-7, which shows the secondary cache state diagram. Figure 4-7 Secondary Cache State Diagram #### ★ <R12000> #### Pad-ring clock slowed The clock used to drive data to/from SC around the pad-ring has been slowed to a 2:3 clock divisor, thus sometimes adding an additional cycle of latency to secondary-cache accesses. #### SC refill blocking reduced In R10000, during the time that an SCache line is being refilled from system interface via the "incoming buffer (IB)", no other accesses to the SCache are allowed. If the external interface sees an ACK to a line that is being refilled before the last words of the SCache line are received by R10000, this means that several cycles can elapse during which SCache access is blocked. By breaking the SCache refill transaction into 64-byte blocks, and allowing other requests to proceed during breaks between the blocks, this effect could be reduced. R12000 pulls in SCache lines with two "pause points." This first occurs when R12000 receives the ACK for a request. If the first two quad-words are already valid in the Incoming Buffer at that time, then R12000 will proceed to refill the SCache with those two, and forward the results to the DCache or ICache at the same time as normal. The next two quad-words will be refilled as they return, thus continuing to block any other access to the SCache just as today. If however, when the initial ACK is received, the first two are not valid (i.e., either 0 or 1 quad-words are valid at that time) then R12000 will "pause" the SCache refill and wait for both of them to be brought in to the IB. Once the first half is filled in to the SCache, R12000 will again check the IB to see if an additional 3 quad-words are valid (thus 7 out of the 8 quad-words in the SCache line should have arrived into the IB). Until that is the case, R12000 will again "pause" the SCache refill and allow other accesses to reach the SCache. These two pauses allow for other requests to slip in during an SCache refill. Using only two pauses both simplifies the logic and reduces bus turnarounds. #### DCache writebacks never piggyback In R10000 when a DCache line is written back to SCache, the following line in the DCache might be written back in a "piggybacked" manner. In order for this to occur the following line must have the same tag as the initially-written line, and must be in the "dirty inconsistent" state. This feature is being dropped form R12000. #### DCache writebacks never bypass In R10000 when a DCache line is written back to SCache, if the SCache interface is not otherwise occupied when the writeback begins, the writeback is bypassed directly to the SCache interface, avoiding the cycles required to write the data into the writeback buffer. This feature is being dropped form R12000. ## 4.4 Cache Algorithms The behavior of the processor when executing load and store instructions is determined by the cache algorithm specified for the accessed address. The processor supports five different cache algorithms: - uncached - · cacheable noncoherent - cacheable coherent exclusive - · cacheable coherent exclusive on write - · uncached accelerated Cache algorithms are specified in three separate places, depending upon the access: - the cache algorithm for the mapped address space is specified on a per-page basis by the 3-bit cache algorithm field in the TLB - the cache algorithm for the *kseg0* address space is specified by the 3-bit *K0* field of the CP0 *Config* register - the cache algorithm for the *xkphys* address space is specified by **VA[61:59]** Table 4-1 presents the encoding of the 3-bit cache algorithm field used in the TLB; EntryLo0 and EntryLo1 registers; CP0 Config register K0 field for the kseg0 address space; and VA[61:59] for the xkphys address space. Table 4-1 Cache Algorithm Field Encodings | Value | Cache Algorithm | | | |-------|---------------------------------------|--|--| | 0 | Reserved | | | | 1 | Reserved | | | | 2 | Uncached | | | | 3 | Cacheable noncoherent | | | | 4 | Cacheable coherent exclusive | | | | 5 | Cacheable coherent exclusive on write | | | | 6 | Reserved | | | | 7 | Uncached accelerated | | | #### **Descriptions of the Cache Algorithms** This section describes the cache algorithms listed in Table 4-1. #### Uncached Loads and stores under the *Uncached* cache algorithm bypass the primary and secondary caches. They are issued directly to the System interface using processor double/single/partial-word read or write requests. #### Cacheable Noncoherent Under the *Cacheable noncoherent* cache algorithm, load and store secondary cache misses result in processor noncoherent block read requests. External agents containing caches need not perform a coherency check for such processor requests. #### Cacheable Coherent Exclusive Under the *Cacheable coherent exclusive* cache algorithm, load and store secondary cache misses result in processor coherent block read exclusive requests. Such processor requests indicate to external agents containing caches that a coherency check must be performed and that the cache block must be returned in an *Exclusive* state. #### Cacheable Coherent Exclusive on Write The Cacheable coherent exclusive on write cache algorithm is similar to the Cacheable coherent exclusive cache algorithm except that load secondary cache misses result in processor coherent block read shared requests. Such processor requests indicate to external agents containing caches that a coherency check must be performed and that the cache block may be returned in either a Shared or Exclusive state. Store hits to a *Shared* block result in a processor upgrade request. This indicates to external agents containing caches that the block must be invalidated. #### **Uncached Accelerated** The R10000 processor implements a new cache algorithm, *Uncached accelerated*. This allows the kernel to mark the TLB entries for certain regions of the physical address space, or certain blocks of data, as uncached while signalling to the hardware that data movement optimizations are permissible. This permits the hardware implementation to gather a number of uncached writes together, either a series of writes to the same address or sequential writes to all addresses in the block, into an uncached accelerated buffer and then issue them to the system interface as processor block write requests. The *uncached accelerated* algorithm differs from the *uncached* algorithm in that block write gathering is not performed. There is no difference between an uncached accelerated load and an uncached load. Only word or doubleword stores can take advantage of this mode. Stores under the *Uncached accelerated* cache algorithm bypass the primary and secondary caches. Stores to identical or sequential addresses are gathered in the uncached buffer, described in Chapter 6, the section titled "Uncached Buffer." Completely gathered uncached accelerated blocks are issued to the System interface as processor block write requests. Incompletely gathered uncached accelerated blocks are issued to the System interface using processor double/single-word write requests; this is also described in Chapter 6, the section titled "Uncached Buffer." ## 4.5 Relationship Between Cached and Uncached Operations Uncached and uncached accelerated load and store instructions are executed in order, and non-speculatively. Such accesses are buffered in the uncached buffer by the processor until they can be issued to the System interface. All uncached and uncached accelerated accesses retain program order within the uncached buffer. The processor continues issuing cached accesses while uncached accesses are queued in the uncached buffer. NOTE: Cached accesses do not probe the uncached buffer for conflicts. Buffered uncached stores prevent a SYNC instruction from graduating. However buffered uncached accelerated stores do not prevent a SYNC instruction from graduating. The processor continues issuing cached accesses speculatively and out of order beyond a SYNC instruction that is waiting to graduate. An uncached load may be used to guarantee that the uncached buffer is flushed of all uncached and uncached accelerated accesses. A SYNC instruction and the **SysGblPerf**\* signal may be used to guarantee that all cache accesses and uncached stores have been globally performed as described in Chapter 6, the section titled "SysGblPerf\* Signal." An uncached load followed by a SYNC instruction may be used to guarantee that all cache accesses, uncached accesses, and uncached accelerated accesses have been globally performed. # 4.6 Cache Algorithms and Processor Requests The cache algorithm determines the type of processor request generated for secondary cache load misses, secondary cache store misses, and store hits. Table 4-2 presents the relationship between the cache algorithm and processor requests. Table 4-2 Cache Algorithms and Processor Requests | Cache Algorithm | Load Miss | Store Miss | Store Hit | |---------------------------------------|----------------------------------------------------|----------------------------------|---------------------------------------| | Uncached | Double/single/partial-word read | Double/single/partial-word write | NA | | Cacheable noncoherent | Noncoherent block read | Noncoherent block read | Upgrade if <i>Shared</i> <sup>‡</sup> | | Cacheable coherent exclusive | Coherent block read exclusive | Coherent block read exclusive | Upgrade if Shared <sup>‡</sup> | | Cacheable coherent exclusive on write | Coherent block read shared | Coherent block read exclusive | Upgrade if Shared | | Uncached accelerated | rached accelerated Double/single/partial-word read | | NA | <sup>‡</sup> Should not occur under normal circumstances. Most systems return the *Exclusive* state for a cacheable noncoherent line; therefore, the *Shared* state is not normal. ## 4.7 Cache Block Ownership The processor requires cache blocks to have a single owner at all times. The owner is responsible for providing the current contents of the cache block to any requestor. The processor uses the following ownership rules: - The processor assumes ownership of a cache block if the state of the cache block becomes *DirtyExclusive*. For a processor block read request, the processor assumes ownership of the block after receiving the last doubleword of a *DirtyExclusive* external block data response and an external ACK completion response. For a processor upgrade request, the processor assumes ownership of the block after receiving an external ACK completion response. - The processor gives up ownership of a cache block if the state of the cache block changes to *Invalid*, *CleanExclusive*, or *Shared*. - *CleanExclusive* and *Shared* cache blocks are always considered to be owned by memory. # [MEMO] # 5. Secondary Cache Interface The processor supports a mandatory secondary cache by providing an internal secondary cache controller with a dedicated secondary cache port. The cache's tag and data arrays each consist of an external bank of industry-standard synchronous SRAM (SSRAM). This SSRAM must have registered inputs and outputs, asynchronous output enables, and use the late write protocol (data is expected one cycle after the address). ## 5.1 Tag and Data Arrays The secondary cache consists of a 138-bit wide data array (128 data bits + 9 ECC bits + 1 parity bit) and a 33-bit wide tag array (26 tag bits + 7 ECC bits), as shown in Figure 5-1. ECC is supported for both the data and tag arrays to improve data integrity. Figure 5-1 Secondary Cache Data and Tag Array The secondary cache is implemented as a two-way set associative, combined instruction/data cache, which is physically addressed and physically tagged, as described in Chapter 4, the section titled "Cache Organization and Coherency." ### Errata The **SCSize** mode bits specify the secondary cache size; minimum secondary cache size is 512 Kbytes and the maximum secondary cache size is 16 Mbytes, in power of 2 (512 Kbytes, 1 Mbyte, 2 Mbytes, etc.). The **SCBlkSize** mode bit specifies the secondary cache block size. When negated, the block size is 16 words, and when asserted, the block size is 32 words. ## **5.2 Secondary Cache Interface Frequencies** The secondary cache interface operates at the frequency of **SCClk**, which is derived from **PClk**. The **SCClkDiv** mode bits select a **PClk** to **SCClk** divisor of 1, 1.5, 2, 2.5, or 3, using the formula described in Chapter 7, the section titled "Secondary Cache Clock." Synchronization between the **PClk** and **SCClk** is performed internally and is invisible to the system. The processor supplies six complementary copies of the secondary cache clock on **SCClk(5:0)** and **SCClk(5:0)**\*. #### Errata The outputs and inputs at this interface are triggered by an internal SCClk. The relationship between the internal SCClk and the external SCClk[5:0]/SCClk[5:0]\* can be programmed during boot time by setting the SCClkTap mode bits (see the section titled "Mode Bits" in Chapter 8 for detail on mode bits). ## 5.3 Secondary Cache Indexing The secondary cache data array width is one quadword, and therefore **PA(3:0)**, which specify a byte within a quadword, are unused by the Secondary Cache interface. #### **Indexing the Data Array** Since the maximum secondary cache size is 16 Mbytes (8 Mbytes per way), each way requires a maximum of 23 bits to index a byte within a selected way, or 19 bits to index a quadword within a way. Consequently, the processor supplies **PA(22:4)** on **SC(A,B)Addr(18:0)** to index a quadword within a way. The processor selects a secondary cache data way with the **SC(A,B)DWay** signal. Table 5-1 presents the secondary cache data array index for each secondary cache size; for instance, a 4 Mbyte cache uses the 17 address bits, **PA(20:4)** on **SC(A,B)Addr(16:0)**, concatenated with the way bit, **SC(A,B)DWay**, to index a quadword within a 2 Mbyte way. Table 5-1 Secondary Cache Data Array Index | SCSize<br>Mode<br>Bits | Secondary<br>Cache Size | Secondary Cache Data Array Index | Physical Address<br>Bits Used | |------------------------|-------------------------|-------------------------------------------|-------------------------------| | 0 | 512 Kbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(13:0)$ | PA(17:4) | | 1 | 1 Mbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(14:0)$ | PA(18:4) | | 2 | 2 Mbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(15:0)$ | PA(19:4) | | 3 | 4 Mbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(16:0)$ | PA(20:4) | | 4 | 8 Mbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(17:0)$ | PA(21:4) | | 5 | 16 Mbyte | $SC(A,B)DWay \parallel SC(A,B)Addr(18:0)$ | PA(22:4) | #### **Indexing the Tag Array** The processor supplies the secondary cache tag array's least significant index bit on **SCTagLSBAddr** to support two block sizes without system hardware changes. This signal functions normally as a least significant index bit when the secondary cache block size is 16 words. However, when the secondary cache block size is 32 words, this signal is always negated, since only half as many tags are required. The processor supplies the secondary cache tag way on **SCTWay**. Table 5-2 presents the secondary cache tag array index for each secondary cache size; it shows each index is composed of a physical address loaded onto SC(A,B)Addr(), concatenated with SCTWay and SCTagLSBAddr. Table 5-2 Secondary Cache Tag Array Index | SCSize<br>Mode<br>Bits | Secondary<br>Cache Size | Secondary Cache Tag Array Index | |------------------------|-------------------------|---------------------------------------------| | 0 | 512 Kbyte | SCTWay SC(A,B)Addr(13:3) SCTagLSBAddr | | 1 | 1 Mbyte | SCTWay SC(A,B)Addr(14:3) SCTagLSBAddr | | 2 | 2 Mbyte | SCTWay SC(A,B)Addr(15:3) SCTagLSBAddr | | 3 | 4 Mbyte | SCTWay SC(A,B)Addr(16:3) SCTagLSBAddr | | 4 | 8 Mbyte | SCTWay SC(A,B)Addr(17:3) SCTagLSBAddr | | 5 | 16 Mbyte | SCTWay SC(A,B)Addr(18:3) SCTagLSBAddr | For a system design that only supports a secondary cache block size of 32 words, the secondary cache tag array need not use **SCTagLSBAddr** as an index bit. ## 5.4 Secondary Cache Way Prediction Table The primary and secondary caches are two-way set associative. However, the implementation of the secondary cache is different than the primary caches. The primary caches read simultaneously from two separate tag arrays, corresponding to each way in the cache, and then select the data based on the result of two parallel tag compares. The secondary cache does not use this implementation because it would either require too many pins to read in two full copies of the data and tags, or add latency to externally multiplex two banks of memory. Instead, a way prediction table is used to determine which way to read from first. The way prediction table is internal to the processor and has 8K one-bit entries, each entry corresponding to a pair of secondary cache blocks. The bit entry indicates which way of the addressed set has been most-recently used (MRU). When the secondary cache is accessed, this prediction bit is used as an address bit; thus the two ways in the secondary cache are shared in the same SSRAM bank. The secondary cache way prediction table is indexed with a subset of 11 to 13 bits of the physical address, based on both the secondary cache block size, and the secondary cache size, as shown in Table 5-3. "0 |" indicates a zero bit concatenated to the address to pad the index out to a full 13-bits. | SCSize<br>Mode Bits | Secondary Cache<br>Size | SCBlkSize<br>Mode Bit | Secondary Cache<br>Block Size | Secondary Cache<br>Way Prediction Table Index | |---------------------|-------------------------|-----------------------|-------------------------------|-----------------------------------------------| | 0 | 512 Kbyte | 0 | 16-word | 0 PA(17:6) | | | | 1 | 32-word | 0 0 PA(17:7) | | 1 | 1 Mbyte | 0 | 16-word | PA(18:6) | | | | 1 | 32-word | 0 PA(18:7) | | 2 to 5 | 2M to 16 Mbyte | 0 | 16-word | PA(18:6) | | | | 1 | 32-word | PA(19:7) | Table 5-3 Secondary Cache Way Prediction Table Index Three states are possible in the way prediction table: - the desired data is in the predicted way - · the desired data is in the non-predicted way - the desired data is not in the secondary cache The tags for both ways are read "underneath" the data access cycles in order to discern as rapidly as possible which of these states are valid. This reading is possible because it takes two accesses to read a primary data block (8 words) and 4 cycles to read a primary instruction block (16 words); thus the bandwidth needed to read the tag array twice exists in all cases. Only an extra address pin to the tag array is needed to make this operation parallel and this is implemented by the **SCTWay** pin. The three possible states are handled in the following manner: - If, after reading the tags for both ways, it is discovered that the data exists in the predicted way, the processor continues normally. - If the data exists in the non-predicted way, the processor accesses this non-predicted way in the secondary cache and updates the way prediction table to point to this way. #### Errata • If the access misses in both ways of the secondary cache, the data is fetched from the system interface. If the state of the predicted way is found to be *invalid*, the fetched data is placed in it and the MRU is unchanged. However, if the state of the predicted way is found to be *valid* then the fetched data is placed into the non-predicted way, and the way prediction table is updated to point to this way since it is now the most-recently-used. The way prediction table can cover up to a 2 Mbyte secondary cache when the secondary cache block size is 32 words. If the secondary cache exceeds this size, the accuracy of the way prediction table diminishes slightly. However, the extremely large performance gain made by making the secondary cache larger far outstrips any performance loss in the way prediction table. #### **★** Increased the Way Prediction Table (MRU table) to 16K single-bit entries The size of the table has been increased to 16K entries, so that 4MB caches with 128B lines or 2MB caches with 64B lines can be fully mapped. #### **★** Direct Cache Test Mode Due to the increase size of the Way Prediction Table, Direct Cache Test Mode have been modified for testing the Way Prediction Table. ## 5.5 Secondary Cache Tag The secondary cache tag, transferred on the **SCTag(25:0)** bus, is divided into three fields, as shown in Figure 5-2 below. Figure 5-2 Secondary Cache Tag Fields #### SCTag(25:4), Physical Tag The minimum secondary cache size is 512 Kbytes (256 Kbytes per way), so a minimum of 18 bits are required to index a data byte within a selected way. Since the processor supports 40 physical bits, a maximum of 22 bits are required for the physical tag: 40 physical address bits - 18 minimum required = 22 Consequently, the processor supplies the 22 physical address bits, **PA(39:18)**, on **SCTag(25:4)** for the physical tag. When the secondary cache is larger than the minimum size, the secondary cache tag array must still maintain the full physical tag supplied by the processor, even though some bits are redundant. #### SCTag(3:2), PIdx Bits SCTag(3:2) of the secondary cache tag contain the primary cache index, Pldx. The *PIdx* field contains VA(13:12), which are the two lowest virtual address bits above the minimum 4 Kbyte page size. This field is written into the secondary cache tag during a secondary cache refill. For each processor-initiated secondary cache access, the virtual address bits are compared with the *PIdx* field of the secondary cache tag. If a mismatch occurs, a virtual coherency condition exists and the value of the *PIdx* field is used by internal control logic to purge primary cache locations, so that all primary cache blocks holding valid data have indices known to the secondary cache. This mechanism, unlike that of the R4400 processor, is implemented in hardware. It helps preserve the integrity of cached accesses to a physical address using different virtual addresses, an occurrence called **virtual aliasing**. For each external coherency request, the *PIdx* field of the secondary cache tag provides a mechanism to locate subset lines in the primary caches. ## SCTag(1:0), Cache Block State The lower two bits of the secondary cache tag, **SCTag(1:0**), contain the cache block state, which can be *Invalid*, *Shared*, *CleanExclusive*, or *DirtyExclusive* as shown in Table 5-4. Table 5-4 Secondary Cache Tag State Field Encoding | SCTag(1:0) | State | |------------|----------------| | 0 | Invalid | | 1 | Shared | | 2 | CleanExclusive | | 3 | DirtyExclusive | Since the secondary cache tags are updated immediately for stores to the primary data cache, and all caches use a write back protocol, the data in the secondary cache may not always be consistent with data in the primary cache even though the tags always reflect the correct state of a secondary cache block. # **5.6 Read Sequences** There are five basic read sequences: - · a 4-word read - an 8-word read - a 16-word read - a 32-word read - a tag read #### Errata The SCClk referred in the secondary cache read and write timing diagrams is an internal SCClk. The relationship between this internal SCClk and the external SCClk[5:0]/ SCClk[5:0]\* can be programmed during boot time by setting the SCClkTap mode bits (see the section titled "Mode Bits" in Chapter 8 for detail on mode bits). #### 4-Word Read Sequence A 4-word read sequence is performed by a CACHE Index Load Data (S) instruction to read a doubleword of data and 10 check bits from the secondary cache data array. Figure 5-3 depicts a secondary cache 4-word read sequence. A quadword is read from the index specified by **PA(23:6)**, and the way specified by **VA(0)** of the CACHE instruction. The doubleword specified by **VA(3)** is then stored into the CP0 *TagHi* and *TagLo* registers, and the corresponding check bits are stored into the CP0 *ECC(9:0)* register. The data may be examined by copying the CP0 *TagHi*, *TagLo*, and *ECC* registers to the general registers with the MTC0 instruction. Figure 5-3 4-Word Read Sequence #### 8-Word Read Sequence An 8-word read sequence refills the primary data cache from the secondary cache after a primary data cache miss. Figure 5-4 depicts a secondary cache 8-word read sequence. In it, **SC(A,B)DWay** and **SCTWay** are driven with value X on the first address cycle, which is obtained from the way prediction table. On the next address cycle, **SCTWay** is complemented in order to read the tag from the non-predicted way of the addressed set. **SC(A,B)DWay** is not changed since it is assumed that the way prediction table is correct and the read is likely to hit in the predicted way. The tag for the non-predicted way is returned to the processor in the same cycle as the second quadword of data. Reads that miss in the predicted way, but hit in the non-predicted way, are noted by the internal control logic and reissued to the secondary cache as soon as possible. Figure 5-4 8-Word Read Sequence #### 16 or 32-Word Read Sequence A 16-word read sequence refills the primary instruction cache from the secondary cache after a primary instruction cache miss. A 16-word read sequence is also performed when the secondary cache block size is 16 words, and a *DirtyExclusive* secondary cache block must be written back to the System interface. A 32-word read sequence is performed when the secondary cache block size is 32 words, and a *DirtyExclusive* secondary cache block must be written back to the System interface. Figure 5-5 depicts a secondary cache 16 or 32-word read sequence. This is similar to an 8-word read sequence except that more addresses must be issued, in order to read the appropriate number of quadwords. Figure 5-5 16 or 32-Word Read Sequence # **Tag Read Sequence** A tag read sequence is performed when the state of a secondary cache block is required, but it is not necessary to access the data array. This sequence is used for the CACHE Index Load Tag (S) instruction. Figure 5-6 depicts a secondary cache tag read sequence. Figure 5-6 Tag Read Sequence # **5.7 Write Sequences** There are five basic write sequences: - a 4-word write. - an 8-word write - a 16-word write - a 32-word write - a tag write #### Errata The SCClk referred in the secondary cache read and write timing diagrams is an internal SCClk. The relationship between this internal SCClk and the external SCClk[5:0]/ SCClk[5:0]\* can be programmed during boot time by setting the SCClkTap mode bits (see the section titled "Mode Bits" in Chapter 8 for detail on mode bits). ### **4-Word Write Sequence** A 4-word write sequence is performed by a CACHE Index Store Data (S) instruction to store a quadword of data and 10 check bits into the secondary cache data array. Figure 5-7 depicts a secondary cache 4-word write sequence. A quadword is written to the index specified by **PA(23:6)**, and the way specified by **VA(0)** of the CACHE instruction. A doubleword specified by VA(3) is obtained from the CP0 TagHi and TagLo registers, and the other half of the doubleword is padded to zeros. Normal ECC and parity generation is bypassed and the check field of the data array is written with the contents of the CP0 ECC(9:0) register. Figure 5-7 4-Word Write Sequence #### **8-Word Write Sequence** An 8-word write sequence writes back a dirty block from the primary data cache to the secondary cache. Figure 5-8 depicts a secondary cache 8-word write sequence. **SC(A,B)DWay** are driven with the way bit obtained from the primary data cache tag. The secondary cache tag is not written since it was previously updated when the primary data cache block was modified. Figure 5-8 8-Word Write Sequence ## 16 or 32-Word Write Sequence A 16- or 32-word write sequence refills a secondary cache block from the System interface after a secondary cache miss. A 16-word write sequence is performed when the secondary cache block size is 16 words, and a 32-word write sequence is performed when the secondary cache block size is 32 words. Figure 5-9 depicts a secondary cache 16 or 32-word write sequence. Figure 5-9 16/32-Word Write Sequence ## **Tag Write Sequence** A tag write sequence updates the secondary cache tag array without affecting the data array. This sequence is used for the following: - to reflect primary cache state changes in the secondary cache - for external coherency requests - for the CACHE Index Store Tag (S) instruction Figure 5-10 depicts the secondary cache tag write protocol. Figure 5-10 Tag Write Sequence # [MEMO] | 6. | System | Interface | | perations | |----|--------|-----------|---|-----------| | U. | system | merjace | U | peranons | The R10000 System interface provides a gateway between processor, with its associated secondary cache, and the remainder of the computer system. For convenience, any device communicating with the processor through the System interface is referred to as the **external agent**. ## 6.1 Request and Response Cycles The System interface supports the following request and response cycles: - Processor requests are generated by the processor, when it requires a system resource. - External responses are supplied by an external agent in response to a processor request. - **External requests** are generated by an external agent when it requires a resource within the processor. - **Processor responses** are supplied by the processor in response to an external request. # **6.2 System Interface Frequencies** The System interface operates at **SysClk** frequency, supplied by the external agent. The internal processor clock, **PClk**, is derived from this same **SysClk**. The **SysClkDiv** mode bits select a **PClk** to **SysClk** divisor of 1, 1.5, 2, 2.5, 3, 3.5, or 4, using the formula described in Chapter 7, the section titled "System Interface Clock and Internal Processor Clock Domains." # 6.3 Register-to-Register Operation The System interface is designed to operate in the following register-to-register fashion with the external agent: - all System interface outputs are sourced directly from registers clocked on the rising edge of SysClk - all System interface inputs directly feed registers that are clocked on the rising edge of SysClk This allows the System interface to run at the highest possible clock frequency. # **6.4 System Interface Signals** The R10000 System interface is composed of: - 3 arbitration signals - 2 flow-control input signals - a bidirectional 12-bit command bus - · a bidirectional 64-bit multiplexed address/data bus - a 3-bit state output bus - a 5-bit response input bus #### **6.5** Master and Slave States At any time, the System interface is either in master or slave state. In **master** state, the processor drives the bidirectional System interface signals and is permitted to issue processor requests to the external agent. In **slave** state, the processor tristates the bidirectional System interface signals and accepts external requests from the external agent. # 6.6 Connecting to an External Agent In a uni- or multiprocessor system using dedicated external agents, the System interface connects to a single external agent. In a multiprocessor system using the cluster bus (see below), the system can connect up to four R10000 processors to an external agent. This external agent is referred to as the **cluster coordinator**. #### 6.7 Cluster Bus In a multiprocessor system using the cluster bus, the cluster coordinator performs the cluster bus arbitration and data flow management. The arbitration scheme assures that either one of the processors or the cluster coordinator is master at any given time, while the remaining devices are slave. A processor request issued by the master processor is observed as an external request by all slave R10000 processors, as shown in Figure 6-1. Similarly, a processor coherency data response issued by a master processor is observed as an external data response by the slave processors. Figure 6-1 Processor Request Master/Slave Status In a multiprocessor system using the cluster bus, a mode bit specifies whether processor coherent requests are to target the external agent only, or all processors and the external agent. This allows systems with efficient snoopy, duplicate tag, or directory-based coherency protocols to be created. # **6.8 System Interface Connections** The major System interface connections required for various system configurations are presented in this section. #### **Uniprocessor System** Figure 6-2 shows the major System interface connections required for a typical uniprocessor system. Figure 6-2 System Interface Connections for Uniprocessor System #### **Multiprocessor System Using Dedicated External Agents** Figure 6-3 shows the major System interface connections required for a typical multiprocessor system using dedicated external agents. Figure 6-3 System Interface Connections for Multiprocessor using Dedicated External Agents #### **Multiprocessor System Using the Cluster Bus** Figure 6-4 presents the major System interface connections required for a typical multiprocessor system using the cluster bus. Figure 6-4 System Interface Connections for Multiprocessor Using the Cluster Bus ## 6.9 System Interface Requests and Responses The System interface supports the following: - · processor request - external response - external request - · processor response The following sections describe these request and response types, and their operations. #### **Processor Requests** Processor requests are generated by the processor when it requires a system resource. The following processor requests are supported: - coherent block read shared request - coherent block read exclusive request - · noncoherent block read request - double/single/partial-word read request - · block write request - double/single/partial-word write request - · upgrade request - · eliminate request Processor write and eliminate requests do not require or expect a response by the external agent. However, if an external agent detects an error in a processor write or eliminate request, it may use an interrupt to signal the processor. It is not possible to generate precise exceptions for processor write and eliminate requests for which an external agent detects an error. Processor read and upgrade requests require some type of response by the external agent. ## **External Responses** External responses are supplied by an external agent or another processor in response to a processor request. The following external responses are supported: - block data response - · double/single/partial-word data response - completion response ## **External Requests** External requests are generated by an external agent when it requires a resource within the processor. The following external requests are supported: - · intervention shared request - · intervention exclusive request - · allocate request number request - · invalidate request - · interrupt request External intervention and invalidate requests require some type of response by the processor. #### **Processor Responses** Processor responses are supplied by the processor in response to an external request. The following processor responses are supported: - · coherency state response - · coherency data response #### **Outstanding Requests and Request Numbers** The processor allows requests and corresponding responses to be split transactions, which enables additional processor and external requests to be issued while waiting for a prior response. The System interface supports a request number field to link requests with their corresponding responses, so responses can be returned out of order. The processor allows a maximum of eight outstanding requests on the System interface through a 3-bit request number. These outstanding requests may be composed of any mix of processor and external requests. An individual processor (as opposed to the System interface, above) supports a maximum of four outstanding processor requests at any given time. # **Request and Response Relationship** The relationship between processor and external requests, and their acceptable responses, is presented in Table 6-1. The data in this table is given with respect to a single processor, in either a uni- or multiprocessor system (independent of cluster/non-cluster configuration). Table 6-1 Request and Response Relationship | Request | Acceptable Response Sequences | |----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | | External NACK or ERR completion response | | Processor block read request | 0 or more external block data responses followed by a final external block data response with a coincidental or subsequent external ACK, NACK, or ERR completion response | | | External NACK or ERR completion response | | Processor double/single/partial-word read request | 0 or more external double/single/partial-word data responses followed by a final external double/single/partial-word data response with a coincidental or subsequent external ACK, NACK, or ERR completion response | | Processor block write request | None | | Processor double/single/partial-word write request | None | | | External ACK, NACK, or ERR completion response | | Processor upgrade request | 0 or more external block data responses followed by a final external block data response with a coincidental or subsequent external ACK, NACK, or ERR completion response | | Processor eliminate request | None | | External intervention request | Processor coherency state response followed by processor coherency data response (if <i>DirtyExclusive</i> ) with a coincidental or subsequent external ACK, NACK, or ERR completion response <sup>‡</sup> | | External allocate request number request | External ACK, NACK, or ERR completion response <sup>‡</sup> | | External invalidate request | Processor coherency state response followed by external ACK, NACK, or ERR completion response ‡ | | External interrupt request | None | <sup>‡</sup> External completion response is required to free the request number. # **6.10 System Interface Buffers** The processor contains the following five buffers to enhance the performance of the System interface and to simplify the system design: - cluster request buffer - cached request buffer - incoming buffer - · outgoing buffer - · uncached buffer These buffers are described in the following sections. ## **Cluster Request Buffer** The System interface contains an 8-entry cluster request buffer. This buffer maintains the status of the eight possible outstanding requests on the System interface. When the System interface is in master state, and it issues the address cycle of processor read or upgrade request, the processor places an entry into the cluster request buffer. When the System interface is in slave state, and an external agent issues an external coherency or allocate request number request, it places an entry into the cluster request buffer. Once an entry is placed into the cluster request buffer, the associated request number transitions from *free* to *busy*. An entry remains busy until the processor receives an external completion response. Processor requests that are ready to be issued to the System interface bus probe the cluster request buffer to detect conflict conditions. #### **Cached Request Buffer** The System interface contains a four-entry cached request buffer. This buffer holds the status of the four possible outstanding processor cached requests, including processor block read and upgrade requests. The relative order of the requests is maintained in the cached request buffer. External coherency requests probe the cached request buffer to detect conflict conditions. #### **Incoming Buffer** The System interface contains an incoming buffer for external block and double/single/partial-word data responses. The four 32-word entries of the incoming buffer correspond to the four possible outstanding processor requests. Block data in each entry of the incoming buffer is stored in subblock order, beginning with a quadword-aligned address. The incoming buffer eliminates the need for the processor to flow-control the external agent that is providing the external data responses. Regardless of the cache bandwidth or internal resource availability, the external agent may supply external data response data for all outstanding read and upgrade requests at the maximum System interface data rate. The external agent may issue any number of external data responses for a particular request number before issuing a corresponding external completion response. An external data response remains in the incoming buffer until a corresponding external completion response is received. A former buffered external data response for a particular request number is over-written by a subsequent external data response for the same request number. An external ACK completion response frees buffered data to be forwarded to the caches and other internal resources while an external NACK or ERR completion response purges any corresponding buffered data. For minimum latency, the external agent should issue an external ACK completion response coincident with the first doubleword of an external data response. External coherency requests that target blocks residing in the incoming buffer are stalled until the incoming buffer data is forwarded to the secondary cache, and the instruction that caused the secondary miss is satisfied. Each doubleword of the incoming buffer has an Uncorrectable Error flag. When an external data response provides a doubleword, the processor asserts the corresponding incoming buffer Uncorrectable Error flag if the data quality indicator, **SysCmd[5]**, is asserted, or if an uncorrectable ECC error is encountered on the system address/data bus and the ECC check indication on **SysCmd[0]** is asserted. When the processor forwards block data from an incoming buffer entry after receiving an external ACK completion response, the associated incoming buffer Uncorrectable Error flags are checked, and if any are asserted, a single Cache Error exception is posted. When the processor forwards double/single/partial-word data from an incoming buffer entry after receiving an external ACK completion response, the associated incoming buffer Uncorrectable Error flag is checked, and if asserted, a Bus Error exception is posted. ## **Outgoing Buffer** The System interface contains a five-entry outgoing buffer to provide buffering for the following: - DirtyExclusive blocks that are cast out of the secondary cache because of a block replacement - various CACHE instructions - an external intervention request. Four 32-word *typical* entries are associated with the four possible outstanding processor cached requests allowed by the processor. One 32-word *special* entry is reserved for external intervention requests only. The data is stored in each entry of the outgoing buffer in sequential order, beginning with a secondary cache block-aligned address. An instruction or data access that misses in the secondary cache but targets an entry in the outgoing buffer is stalled until the outgoing buffer entry is issued as a processor block write request or coherency data response to the System interface bus. External coherency requests probe the four typical outgoing buffer entries, with the following results: - If an external intervention request hits a typical entry, that entry is converted from a processor block write request to a processor coherency data response. - If an external invalidate request hits a typical outgoing buffer entry, that entry is deleted. - If an external intervention request does not hit a typical outgoing buffer entry, but hits a *DirtyExclusive* block in the secondary cache, the special outgoing buffer entry is used to buffer the processor coherency data response. A typical outgoing buffer entry containing a block write is ready for issue to the System interface bus when the first quadword is received from the secondary cache. The processor allows data to stream from the secondary cache to the System interface bus through the outgoing buffer. #### Errata An outgoing buffer entry containing a coherency data response is ready for issue to the System interface bus when the quadword specified by the corresponding external intervention request is received from the secondary cache. The processor then allows the data to stream from the secondary cache to the System interface bus through the outgoing buffer. Each quadword of the outgoing buffer maintains an Uncorrectable Error flag. If an uncorrectable error is encountered while a block is being cast out of the secondary cache, the associated outgoing buffer quadword Uncorrectable Error flag is asserted. When the processor empties an outgoing buffer entry by issuing a processor block write or coherency data response, the outgoing buffer Uncorrectable Error flags are reflected by the data quality indication on **SysCmd[5**]. #### **Uncached Buffer** The System interface contains an uncached buffer to provide buffering for uncached and uncached accelerated load and store operations. All operations retain program order within the uncached buffer. The uncached buffer is organized as a 4-entry FIFO followed by a 2-entry gatherer. Each gathered entry has a capacity of 16 or 32 words, as specified by the **SCBlkSize** mode bit. The uncached buffer begins gathering when an uncached accelerated double or singleword block-aligned store is executed. Gathering continues if the subsequent uncached operation executed is an uncached accelerated double or singleword store to a sequential or identical address. Once a second uncached accelerated store is gathered, the gathering mode is determined to be sequential or identical. Gathering continues until one of the following conditions occurs: - a complete block is gathered - an uncached or uncached accelerated load is executed - an uncached or uncached accelerated partial-word store is executed - · an uncached store is executed - a change in the current gathering mode is observed - · a change in the uncached attribute is observed When gathering terminates, the data is ready for issue to the System interface bus. A processor uncached accelerated block write request is used to issue a completely gathered uncached accelerated block. One or more disjoint processor uncached accelerated double or singleword write requests are used to issue an incompletely gathered uncached accelerated block. When gathering in an identical mode, uncached accelerated double or singleword stores may be freely mixed. The uncached buffer packs the associated data into the gatherer. When gathering in sequential mode, uncached accelerated singleword stores must occur in pairs, to prevent an address error exception. For instance, SW, SW, SD, SW, SW is legal. SD, SW, SD, is not. External coherency requests have no effect on the uncached buffer. CACHE instructions have no effect on the uncached buffer. SYNC instructions are prevented from graduating if an uncached store resides in the uncached buffer. # **6.11 System Interface Flow Control** The System interface supports a maximum *request rate* of one request per **SysClk** cycle, and a maximum *data rate* of one doubleword per **SysClk** cycle. Various flow control mechanisms are provided to limit these rates, as described below. ## **Processor Write and Eliminate Request Flow Control** The processor can only issue a processor write or eliminate request if: - · the System interface is in master state - SysWrRdy\* was asserted two SysClk cycles previously #### **Processor Read and Upgrade Request Flow Control** The processor can only issue a processor read or upgrade request if: - the System interface is in master state - SysRdRdy\* was asserted two SysClk cycles previously - the maximum number of outstanding processor requests specified by the PrcReqMax mode bits is not exceeded - there is a free request number ### **Processor Coherency Data Response Flow Control** The processor can only issue a processor coherency data response if: - the System interface is in master state - SysWrRdy\* was asserted two SysClk cycles previously #### **External Request Flow Control** When the System interface is in *Slave* state, it is capable of accepting external requests. An external agent may issue external requests in adjacent **SysClk** cycles. #### **External Data Response Flow Control** Since the processor has an incoming buffer, an external agent may supply external data response data in adjacent **SysClk** cycles, without regard to cache bandwidth or internal resource availability. # 6.12 System Interface Block Data Ordering During block data transfers on the System interface **SysAD[63:0]** bus, even doublewords (Dat0, Dat2,...) always correspond to **SCData[127:64]**, and odd doublewords (Dat1, Dat3,...) always correspond to **SCData[63:0]**. ## **External Block Data Responses** During the address cycle of processor block read and upgrade requests, the processor specifies a quadword-aligned address. The processor expects the external block data response to be supplied in a subblock order sequence, beginning at the specified quadword-aligned address. ## **Processor Coherency Data Responses** The address of external intervention requests are internally aligned by the processor to a quadword address. If the processor determines that it must issue a processor coherency data response, it supplies the data in a subblock order sequence beginning at the quadword-aligned address specified by the corresponding external coherency request. #### **Processor Block Write Requests** During the address cycle of processor block write requests, the processor specifies a cache block-aligned address. During the subsequent data cycles for typical processor block write requests, the processor supplies the data in sequence, beginning with the secondary cache block-aligned address. # **6.13** System Interface Bus Encoding This section presents the encoding of the following four System interface buses: - SysCmd[11:0] - SysAD[63:0] - SysState[2:0] - SysResp[4:0] ## SysCmd[11:0] Encoding This section describes address and data cycle encodings for the system command bus, **SysCmd[11:0]**. ## SysCmd[11] Encoding When **SysVal\*** is asserted, **SysCmd[11]** indicates whether the **SysAD[63:0]** bus represents an address or a data cycle, as shown in Table 6-2. Table 6-2 Encoding of SysCmd[11] | SysCmd[11] | Data/Address Cycle Indication | |------------|-------------------------------| | 0 | SysAD[63:0] address cycle | | 1 | SysAD[63:0] data cycle | ## SysCmd[10:0] Address Cycle Encoding During the address cycle of processor read and upgrade requests, **SysCmd[10:8]** contain the request number, as shown in Table 6-3. The request number provides a mechanism to associate an external response with the corresponding processor request. Table 6-3 Encoding of SysCmd[10:8] for Processor Read and Upgrade Requests | SysCmd[10:8] | Request Number | |--------------|----------------| | | <b>1</b> | During the address cycle of processor requests, **SysCmd[7:5**] contain the command, as shown in Table 6-4. Table 6-4 Encoding of SysCmd[7:5] for Processor Requests | SysCmd[7:5] | Command | |-------------|----------------------------------| | 0 | Coherent block read shared | | 1 | Coherent block read exclusive | | 2 | Noncoherent block read | | 3 | Double/single/partial-word read | | 4 | Block write | | 5 | Double/single/partial-word write | | 6 | Upgrade | | 7 | Special | During the address cycle of processor read requests, **SysCmd[4:3**] contain the read cause indication, as shown in Table 6-5. This information is useful in handling the associated external response. Table 6-5 Encoding of SysCmd[4:3] for Processor Read Requests | SysCmd[4:3] | Read Cause Indication | |-------------|-----------------------| | 0 | Instruction access | | 1 | Data typical access | | 2 | Data LL/LLD access | | 3 | Data prefetch access | During the address cycle of processor write requests, **SysCmd[4:3]** contain the write cause indication, as shown in Table 6-6. This information is useful in handling the associated write data. Table 6-6 Encoding of SysCmd[4:3] for Processor Write Requests | SysCmd[4:3] | Write Cause Indication | |-------------|---------------------------------------------| | 0 | Reserved | | 1 | Data typical access | | 2 | Data uncached accelerated sequential access | | 3 | Data uncached accelerated identical access | During the address cycle of processor upgrade requests, **SysCmd[4:3**] contain the upgrade cause indication, as shown in Table 6-7. This information useful in handling the associated external response. Table 6-7 Encoding of SysCmd[4:3] for Processor Upgrade Requests | SysCmd[4:3] | Upgrade Cause Indication | |-------------|--------------------------| | 0 | Reserved | | 1 | Data typical access | | 2 | Data SC/SCD access | | 3 | Data prefetch access | During the address cycle of processor special requests, **SysCmd[4:3]** contain the processor special cause indication, as shown in Table 6-8. This information differentiates between the various processor special requests. Table 6-8 Encoding of SysCmd[4:3] for Processor Special Requests | SysCmd[4:3] | Special Cause Indication | |-------------|--------------------------| | 0 | Reserved | | 1 | Eliminate | | 2 | Reserved | | 3 | Reserved | During the address cycle of processor block read, typical block write, upgrade, and eliminate requests, **SysCmd[2:1]** contain the secondary cache block former state, as shown in Table 6-9. This information may be useful for system designs implementing a duplicate tag or a directory-based coherency protocol. Table 6-9 Encoding of SysCmd[2:1] for Processor Block Read/Write, Upgrade, Eliminate Requests | SysCmd[2:1] | Secondary Cache Block Former State | |-------------|------------------------------------| | 0 | Invalid | | 1 | Shared | | 2 | CleanExclusive | | 3 | DirtyExclusive | During the address cycle of processor double/single/partial-word read and write requests, **SysCmd[2:0]** contain the data size indication, as shown in Table 6-10. Table 6-10 Encoding of SysCmd[2:0] for Processor Double/Single/Partial-Word Read/ Write Requests | SysCmd[2:0] | Data Size Indication | |-------------|--------------------------------| | 0 | One byte valid (Byte) | | 1 | Two bytes valid (Halfword) | | 2 | Three bytes valid (Tribyte) | | 3 | Four bytes valid (Word) | | 4 | Five bytes valid (Quintibyte) | | 5 | Six bytes valid (Sextibyte) | | 6 | Seven bytes valid (Septibyte) | | 7 | Eight bytes valid (Doubleword) | During the address cycle of external intervention and invalidate requests, **SysCmd[10:8]** contain the request number, as shown in Table 6-11. The request number provides a mechanism to associate a potential processor coherency data response with the corresponding external coherency request. Table 6-11 Encoding of SysCmd[10:8] for External Intervention and Invalidate Requests | SysCmd[10:8] Request Number | | |-----------------------------|--| |-----------------------------|--| During the address cycle of external requests, **SysCmd**[7:5] contain the command, as shown in Table 6-12. Table 6-12 Encoding of SysCmd[7:5] for External Requests | SysCmd[7:5] | Command | |-------------|-------------------------| | 0 | Intervention shared | | 1 | Intervention exclusive | | 2 | Allocate request number | | 3 | Allocate request number | | 4 | NOP | | 5 | NOP | | 6 | Invalidate | | 7 | Special | During the address cycle of external special requests, **SysCmd**[4:3] contain the external special cause indication, as shown in Table 6-13. This information is used to differentiate between the various external special requests. Table 6-13 Encoding of SysCmd[4:3] for External Special Requests | SysCmd[4:3] | Special Cause Indication | |-------------|--------------------------| | 0 | Reserved | | 1 | NOP | | 2 | Interrupt | | 3 | Reserved | #### Errata During external address cycles, **SysCmd[0]** specifies whether ECC checking and correcting is to be performed for the **SysAD[63:0]** bus, as shown in Table 6-14. <u>During the address cycle of processor block read, data typical block write, upgrade, and eliminate requests, the processor asserts SysCmd[0]. Consequently, in a multiprocessor system using the cluster bus, ECC checking and correcting is enabled for external coherency requests resulting from processor coherent block read and upgrade requests.</u> Table 6-14 Encoding of SysCmd[0] for External Address Cycles | SysCmd[0] | ECC check indication | | | | |-----------|-------------------------------------|--|--|--| | 0 | ECC checking and correcting disable | | | | | 1 | ECC checking and correcting enable | | | | #### SysCmd[10:0] Data Cycle Encoding During the data cycles of an external data response or a processor coherency data response, **SysCmd[10:8]** contain the request number associated with the original request, as shown in Table 6-15. Table 6-15 Encoding of SysCmd[10:8] for Data Responses | SysCmd[10:8] | Request Number | |--------------|----------------| |--------------|----------------| During data cycles, **SysCmd**[5] indicates the data quality, as shown in Table 6-16. Table 6-16 Encoding of SysCmd[5] for Data Cycles | SysCmd[5] | Data quality indication | | | | | |-----------|-------------------------|--|--|--|--| | 0 | Data is good | | | | | | 1 | Data is bad | | | | | During data cycles, **SysCmd[4:3]** indicate the data type, as shown in Table 6-17. Processor block write and double/single/partial-word write requests use request data and request last data type indications. External data and processor coherency data responses use response data and response last data type indications. Table 6-17 Encoding of SysCmd[4:3] for Data Cycles | SysCmd[4:3] | Data type Indication | | | | |-------------|----------------------|--|--|--| | 0 | Request data | | | | | 1 | Response data | | | | | 2 | Request last | | | | | 3 | Response last | | | | During data cycles of an external block data response or processor coherency data response, **SysCmd[2:1]** contain the state of the cache block, as shown in Table 6-18. Table 6-18 Encoding of SysCmd[2:1] for Block Data Responses | SysCmd[2:1] | Cache Block State | | | | |-------------|-------------------|--|--|--| | 0 | Reserved | | | | | 1 | Shared | | | | | 2 | CleanExclusive | | | | | 3 | DirtyExclusive | | | | During data cycles, **SysCmd[0]** specifies whether ECC checking and correcting is to be performed for the **SysAD[63:0]** bus, as shown in Table 6-19. During processor data cycles, the processor asserts **SysCmd[0]**. Consequently, in a multiprocessor system using the cluster bus, ECC checking and correcting will be enabled for external block data responses resulting from processor coherency data responses. Table 6-19 Encoding of SysCmd[0] for External Data Cycles | SysCmd[0] | ECC check indication | | | | | |-----------|-------------------------------------|--|--|--|--| | 0 | ECC checking and correcting disable | | | | | | 1 | ECC checking and correcting enable | | | | | # SysCmd[11:0] Map # Table 6-20 presents a map for the **SysCmd[11:0]** bus. Table 6-20 SysCmd[11:0] Map | Cycle | Command | | | SysCmd[11:0] Bit | | | | | | | | | | | | | |-------------|-------------------------------------------------------------|--------------------------|---------------------|------------------|---------|------|----------------|-------------|------------------------|---------------|-------------|----------|---------|-----|---|-----| | Type | | Command | | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | | | | | Coherent block read shared | | | Request number | | 0 | 0 | 0 | Read cause | | Block state | | 1 | | | | | | Coherent block read exclusive | | | | | 0 | 0 | 1 | | | | | | | | | | Noncohe | | rent block read | 1 | | | 0 | 1 | 0 | | | | | | | | | | | Double/si | ingle/partial-word read | 1 | | | 0 | 1 | 1 | | | Data size | | e | | | | | Processor | Block wr | ite | 1 | 0 | | 1 | 0 | 0 Write cau | | | Block state | | 1 | | | | | address | Double/si | ingle/partial-word write | 0 | | | 1 | 0 | 1 | Write | cause | : | Data siz | e | | | | | cycles | Upgrade | | 1 | Requ | uest nu | mber | 1 | 1 | 0 | Upgrade cause | | Block | state | 1 | | | | | | Reserved | 1 | F | Reserve | ed | | | | 0 | 0 | | Reserve | d | | | | | C1 | Eliminate | 1 | 0 | | | 4 | 0 | 1 | Block | state | 1 | | | | | | | Special | D 1 | 1 | | | | 1 1 | 1 | 1 | 1 | 0 | Reserved | | 1 | | | | | | Reserved | | ŀ | Reserve | ea | | | | 1 | 1 | | | a | | | | | Double/si | ingle/partial-word write | | 0 | | | | 0 | | | C | ) | | | | | | | Processor data cycles Block write Coherency data response | | 1 | 0 | | 0 | | Data | Data duality Data type | | Block state | 1 | | | | | | data cycles | | | | Request number | | | quality | | | | | | | | | | | | Intervention shared | | Intervention shared | | | 0 | 0 | 0 | - E | | | ECC | | | | | | | Interventi | Intervention exclusive | | Dh | | 0 | 0 | 1 | | | | | ECC | | | | | | Alloanta | roquast number | | Request number | | 0 | 1 | 0 | | | | | | | | | | | Anocate | request number | | | | 0 | 1 | 1 | X | | | X | | | | | | External | NOP | | | | X | | 1 | 0 | 0 | | | | X | Λ | | | | address | NOF | | 0 | Λ | | 1 | 0 | 1 | | | | | | | | | | cycles | S Invalidate | | | Requ | uest nu | mber | 1 | 1 | 0 | 1 | | EC | | ECC | | | | | | NOP | | | | | | | ( | 0 | 0 | | X | | | | | | Special | NOI | - | X | 1 | 1 | 1 | 0 | 1 | X | 7 | Λ | | | | | | | | Interrupt | | | | | | 1 | 0 | | • | ECC | | | | | | | NOP | | | | | | | | 1 | 1 | | | X | | | | | External | Block dat | a response | | | | | | Data | | | Block | state | | | | | | data cycles | S Double/single/partial-word data response | | | | 1 | Req | Request number | | X | | quality | Data | ı type | Х | Ι | ECC | ## SysAD[63:0] Encoding This section describes the system address/data bus encoding. #### SysAD[63:0] Address Cycle Encoding Table 6-21 presents the encoding of the **SysAD[63:0]** bus for address cycles. Table 6-21 Encoding of SysAD[63:0] for Address Cycles | SysAD[63:60] | Target Indication | |--------------|-----------------------------------------| | SysAD[63] | Target processor with <b>DevNum</b> = 3 | | SysAD[62] | Target processor with <b>DevNum</b> = 2 | | SysAD[61] | Target processor with <b>DevNum</b> = 1 | | SysAD[60] | Target processor with <b>DevNum</b> = 0 | | SysAD[59:58] | Uncached attribute | | SysAD[57] | Secondary cache block way indication | | SysAD[56:40] | Reserved | | SysAD[39:0] | Physical address | #### SysAD[63:60] During the address cycle of processor noncoherent block read, double/single/partial-word read, block write, double/single/partial-word write, and eliminate requests, the processor always drives a target indication of 0 on **SysAD[63:60]**. This indicates that the request targets the external agent only. When the **CohPrcReqTar** mode bit is negated, during the address cycle of processor coherent block read and upgrade requests, the processor also drives a target indication of 0 on **SysAD[63:60]**. However, when the **CohPrcReqTar** mode bit is asserted, during the address cycle of processor coherent block read and upgrade requests, the processor drives a target indication of 0xF on **SysAD[63:60]**. This indicates that the request targets all processors, together with the external agent, on the cluster bus. In multiprocessor systems using the cluster bus, the **CohPrcReqTar** mode bit is asserted for a snoopy-based coherency protocol, and negated for a duplicate tag or directory-based coherency protocol. When the processor is in slave state, an external agent uses the target indication field to specify which processors are targets of an external request. #### SysAD[59:58] Uncached Attribute During the address cycle of processor double/single/partial-word read and write requests and during the address cycle of processor *Uncached accelerated* block write requests, the processor drives the uncached attribute onto **SysAD[59:58]**. See the section titled, Support for Uncached Attribute, in this chapter for more information. #### **SysAD[57]** During the address cycle of processor block read, typical block write, upgrade, and eliminate requests, **SysAD**[57] contains the secondary cache block way indication. This information may be useful for system designs implementing a duplicate tag or a directory-based coherency protocol. #### **SysAD**[56:40] When processor is in master state, it drives SysAD[56:40] to zero during address cycles. #### SysAD[39:0] During the address cycle of processor and external requests, **SysAD[39:0]** contain the physical address. Table 6-22 presents the processor request address cycle address alignment. Table 6-22 Processor Request Address Cycle Alignment | Processor Request Type | Address Alignment | Address Bits Which<br>Are Driven to 0 | | | |------------------------------------------------------------|-----------------------|------------------------------------------------------------|--|--| | Block read | Quadword | 3:0 | | | | Doubleword read/write | Doubleword | 2:0 | | | | Singleword read/write | Singleword | 1:0 | | | | Halfword read/write | Halfword | 0 | | | | Byte, tribyte, quintibyte, sextibyte, septibyte read/write | Byte | - | | | | Block write | Secondary cache block | 5:0 ( <b>SCBlkSize</b> = 0)<br>6:0 ( <b>SCBlkSize</b> = 1) | | | | Upgrade | Quadword | 3:0 | | | | Eliminate | Secondary cache block | 5:0 ( <b>SCBlkSize</b> = 0)<br>6:0 ( <b>SCBlkSize</b> = 1) | | | Table 6-23 presents the external coherency request address cycle address alignment. Table 6-23 External Coherency Request Address Cycle Alignment | External Request Type | Address Alignment | Address Bits Which<br>Are Ignored | | | |-----------------------|-----------------------|--------------------------------------------|--|--| | Intervention | Quadword | 3:0 | | | | Invalidate | Secondary cache block | 5:0 (SCBlkSize = 0)<br>6:0 (SCBlkSize = 1) | | | #### SysAD[63:0] Data Cycle Encoding During System interface data cycles, when less than a doubleword is transferred on the **SysAD[63:0]** bus, the valid byte lanes depend on the request address and the **MemEnd** mode bit. For example, consider the data cycle for a byte request whose address modulo 8 is 1. When **MemEnd** is negated (little endian), the **SysAD[15:8]** byte lane is valid. When **MemEnd** is asserted (big endian), the **SysAD[55:48]** byte lane is valid. ## SysState[2:0] Encoding The processor provides a processor coherency state response by driving the targeted secondary cache block tag quality indication on **SysState[2]**, driving the targeted secondary cache block former state on **SysState[1:0]** and asserting **SysStateVal\*** for one **SysClk** cycle. Table 6-24 presents the encoding of the **SysState[2:0]** bus when **SysStateVal\*** is asserted. Table 6-24 Encoding of SysState[2:0] when SysStateVal\* Asserted | SysState[2] | Secondary cache block tag quality indication | |---------------|----------------------------------------------| | 0 | Tag is good | | 1 | Tag is bad | | SysState[1:0] | Secondary cache block former state | | 0 | Invalid | | 1 | Shared | | 2 | CleanExclusive | | 3 | DirtyExclusive | When **SysStateVal\*** is negated, **SysState[0]** indicates if a processor coherency data response is ready for issue. Table 6-25 presents the encoding of the **SysState[2:0]** bus when **SysStateVal\*** is negated. Table 6-25 Encoding of SysState[2:0] When SysStateVal\* Negated | SysState[2:1] | Reserved | |---------------|----------------------------------------------| | 0 | Reserved | | 1 | | | 2 | | | 3 | | | SysState[0] | Processor coherency data response indication | | 0 | Not ready for issue | | 1 | Ready for issue | ## SysResp[4:0] Encoding An external agent issues an external completion response by driving the request number associated with the corresponding request on **SysResp[4:2**], driving the completion indication on **SysResp[1:0**], and asserting **SysRespVal\*** for one **SysClk** cycle. Table 6-26 presents the encoding of the **SysResp[4:0**] bus. Table 6-26 Encoding of SysResp[4:0] | SysResp[4:2] | Request number | |--------------|-----------------------------| | SysResp[1:0] | Completion indication | | 0 | Acknowledge (ACK) | | 1 | Error (ERR) | | 2 | Negative acknowledge (NACK) | | 3 | Reserved | # **6.14 Interrupts** The processor supports five hardware, two software, one timer, and one nonmaskable interrupt. The Interrupt exception is described in Chapter 17, the section titled "Interrupt Exception." ## **Hardware Interrupts** Five hardware interrupts are accessible to an external agent via external interrupt requests. An external interrupt request consists of a single address cycle on the System interface. During the address cycle, **SysAD[63:60]** specify the target indication, which allows an external agent to define the target processors of the external interrupt request. If a processor determines it is an external interrupt request target, **SysAD[20:16]** are the write enables for the five individual *Interrupt* register bits and **SysAD[4:0]** are the values to be written into these bits, as shown in Figure 6-5. This allows any subset of the *Interrupt* register bits to be set or cleared with a single external interrupt request. The *Interrupt* register is an architecturally transparent, level-sensitive register that is directly readable as bits 14:10 of the *Cause* register. Since it is level-sensitive, an interrupt bit must remain asserted until the interrupt is taken, at which time the interrupt handler must cause a second external interrupt request to clear the bit. The processor clears the *Interrupt* register during any of the reset sequences. Figure 6-5 Hardware Interrupts ### **Software Interrupts** The two software interrupts are accessible as bits 9:8 of the *Cause* register, as shown in Figure 6-5. An MTC0 instruction is used to write these bits. #### **Timer Interrupt** The timer interrupt is accessible as bit 15 of the *Cause* register, **IP**[7], as shown in Figure 6-5. This bit is set when one of the following occurs: - the Count register is equal to the Compare register - either one of the two performance counters overflows #### Nonmaskable Interrupt A nonmaskable interrupt is accessible to an external agent as the **SysNMI\*** signal. To post a nonmaskable interrupt, an external agent asserts **SysNMI\*** for at least one **SysClk** cycle. The processor recognizes the nonmaskable interrupt on the first **SysClk** cycle that **SysNMI\*** is asserted. After the nonmaskable interrupt is serviced, an external agent may post another nonmaskable interrupt by first negating **SysNMI\*** for at least one **SysClk** cycle, and reasserting **SysNMI\*** for at least one **SysClk** cycle. ## 6.15 Protocol Abbreviations The following abbreviations are used in the System interface protocols: #### SysCmd[11:0] Abbreviations **Cmd** Unspecified command BlkRd Block read request command **RdShd** Coherent block read shared request command **RdExc** Coherent block read exclusive request command **DSPRd** Double/single/partial-word read command BlkWr Block write request command **DSPWr** Double/single/partial-word write request command Ugd Upgrade request command Elm Eliminate request command IvnShd Intervention shared request command IvnExc Intervention exclusive request command Alc Allocate request number command Ivd Invalidate request command Int Interrupt request command **ExtCoh** External coherency request command ReqDat Request dataRspDat Response dataReqLst Request lastRspLst Response last Empty Empty; SysCmd(11:0) and SysAD(63:0) are undefined #### SysAD[63:0] Abbreviations Adr Physical address Dat Unspecified data **Dat<n>** Doubleword n of a block ## SysState[2:0] Abbreviations State Unspecified state Ivd InvalidShd Shared ClnExc CleanExclusive DrtExc DirtyExclusive #### SysResp[4:0] Abbreviations **Rsp** Unspecified completion response **ACK** Acknowledge completion response **ERR** Error completion response **NACK** Negative acknowledge completion response #### **Master** Abbreviations **EA** External agent **Pn** R10000 processor whose device number is n Dead cycle # **6.16 System Interface Arbitration** The processor supports a simple System interface arbitration protocol, which relies on an external arbiter. This protocol is used in uniprocessor systems, multiprocessor systems using dedicated external agents, and multiprocessor systems using the cluster bus. System interface arbitration is handled by the **SysReq\***, **SysGnt\***, and **SysRel\*** signals (request, grant, and release). As described earlier in this chapter, the System interface resides in either master or slave state; the processor enters slave state during all of the reset sequences. When mastership of the System interface changes, there is always one dead **SysClk** cycle during which the bidirectional signals are not driven; the processor ignores all bidirectional signals during this dead **SysClk** cycle. The protocol supports overlapped arbitration which allows arbitration to occur in parallel with requests and responses. This results in fewer wasted cycles when mastership of the System interface changes. **Grant parking** is also supported, allowing a device to retain mastership of the System interface as long as no other device requests the System interface. In multiprocessor systems using the cluster bus, the external arbiter typically implements a round-robin priority scheme. #### **System Interface Arbitration Rules** The rules for the System interface arbitration are listed below: - If the System interface is in slave state, and a processor request or coherency data response is ready for issue, and the required resources are available (e.g. a free request number, SysRdRdy\* asserted, etc.), the processor asserts SysReq\*. The processor will not assert SysReq\* unless all of the above conditions are met. - The processor waits for the assertion of **SysGnt\***. - When the processor observes the assertion of SysGnt\* it negates SysReq\* two SysClk cycles later. Once the processor asserts SysReq\*, it does not negate SysReq\* until the assertion of SysGnt\*, even if the need for the System interface bus is contravened by an external coherency request. - When the processor observes the assertion of SysRel\*, it enters master state two SysClk cycles later, and begins to drive the System interface bus. SysRel\* may be asserted coincidentally with or later than SysGnt\*. - Once in master state, the processor does not relinquish mastership of the System interface until it observes the negation of SysGnt\*. - The processor indicates it is relinquishing mastership of the System interface bus by asserting SysRel\* for one SysClk cycle, two or more SysClk cycles after the negation of SysGnt\*. The processor may issue any type of processor request or coherency data response in the two SysClk cycles following the negation of SysGnt\*. This may delay the assertion of SysRel\*. ## **Uniprocessor System** Figure 6-6 shows how the System interface arbitration signals are used in a uniprocessor system. Note that this same configuration would be used in a multiprocessor system using dedicated external agents. Figure 6-6 Arbitration Signals for Uniprocessor System Figure 6-7 is an example of the operation of the System interface arbitration in a uniprocessor system. The *Master* row in the following figures indicates which device is driving the System interface bidirectional signals (P<sub>0</sub> and EA in Figure 6-7). When this row contains a dash (-), as shown in Cycle 12 of Figure 6-7, mastership of the System interface is changing and no device is driving the System interface bidirectional signals for this one dead **SysClk** cycle. The external agent generally asserts the **SysGnt\*** signal, which allows the processor to issue requests at any time. When the external agent needs to return an external data response, it negates **SysGnt\*** for a minimum of one cycle, waits for the processor to assert **SysRel\***, and then begins driving the System interface bus after one dead **SysClk** cycle. Figure 6-7 Arbitration Protocol for Uniprocessor System ## **Multiprocessor System Using Cluster Bus** Figure 6-8 shows how the System interface arbitration signals are used in a four-processor system using the cluster bus. Figure 6-8 Arbitration Signals for Multiprocessor System Using the Cluster Bus Figure 6-9 is an example of the System interface arbitration in a four-processor system using the cluster bus. Figure 6-9 Arbitration Protocol for Multiprocessor System Using the Cluster Bus # 6.17 System Interface Request and Response Protocol The following sections detail the System interface request and response protocol. A 32-word secondary cache block size is assumed in the examples below. ### **Processor Request Protocol** A processor request is generated when the R10000 processor requires a system resource. The processor may only issue a processor request when the System interface is in master state. If the System interface is in master state, the processor may issue a processor request immediately. Processor requests may occur in adjacent SysClk cycles. If the System interface is not in master state, the processor must first assert SysReq\*, and then wait for the external agent to relinquish mastership of the System interface bus by asserting SysGnt\* and SysRel\*. When multiple, nonconflicting processor requests and/or coherency data responses are ready and meet all issue requirements, the processor uses the following priority: - block read and upgrade requests have the highest priority, followed by - · processor coherency data responses, - processor eliminate and typical block write requests, - processor double/single/partial-word read/write and uncached accelerated block write requests, which have the lowest priority. ## **Processor Block Read Request Protocol** #### Errata A processor block read request results from a cached instruction fetch, load, store, or prefetch that misses in the secondary cache. Before issuing a processor block read request, the processor changes the secondary cache state to *Invalid*. Additionally, if the secondary cache block former state was *DirtyExclusive*, a write back is scheduled. Note that if the processor block read request receives an external NACK or ERR completion response, the secondary cache block state remains *Invalid*. The processor issues a processor block read request with a single address cycle. The address cycle consists of the following: - negating SysCmd[11] - driving a free request number on SysCmd[10:8] - driving the block read command on **SysCmd**[7:5] - driving the read cause indication on SysCmd[4:3] - driving the secondary cache block former state on **SysCmd[2:1**] - asserting SysCmd[0] - driving the target indication on SysAD[63:60] - driving the secondary cache block way on SysAD[57] - driving the physical address on SysAD[39:0] - asserting SysVal\* The processor may only issue a processor block read request address cycle when the following are true: - the System interface is in master state - SysRdRdy\* was asserted two SysClk cycles earlier - there is no conflicting entry in the outgoing buffer - the maximum number of outstanding processor requests specified by the PrcReqMax mode bits is not exceeded - there is a free request number - the processor is not the target of a conflicting outstanding external coherency request A single processor may have as many as four processor block read requests outstanding on the System interface at any given time. Figure 6-10 depicts four processor block read requests. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent relinquishes mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-10 Processor Block Read Request Protocol #### Processor Double/Single/Partial-Word Read Request Protocol A processor double/single/partial-word read request results from an uncached instruction fetch or load. The processor issues a processor double/single/partial-word read request with a single address cycle. The address cycle consists of: - negating SysCmd[11] - driving a free request number on SysCmd[10:8] - driving the double/single/partial-word read command on SysCmd[7:5] - driving the read cause indication on SysCmd[4:3] - driving the data size indication on SysCmd[2:0] - driving the target indication on SysAD[63:60] - driving the uncached attribute on SysAD[59:58] - driving the physical address on SysAD[39:0] - asserting SysVal\* The processor may only issue a processor double/single/partial-word read request address cycle when: - the System interface is in master state - SysRdRdy\* was asserted two SysClk cycles previously - the maximum number of outstanding processor requests specified by the PrcReqMax mode bits is not exceeded - there is a free request number A single processor may have a maximum of one processor double/single/partial-word read request outstanding on the System interface at any given time. Figure 6-11 depicts a processor double/single/partial-word read request. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent gives up mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-11 Processor Double/Single/Partial-Word Read Request Protocol #### **Processor Block Write Request Protocol** A processor block write request results from the following: - replacement of a *DirtyExclusive* secondary cache block due to a load, store, or prefetch secondary cache miss - a CACHE Index WriteBack Invalidate (S) or Hit WriteBack Invalidate (S) instruction - a completely gathered uncached accelerated block As shown in Figure 6-12, the processor issues a processor block write request with a single address cycle followed by 8 or 16 data cycles. The address cycle consists of the following: - negating SysCmd[11] - driving the block write command on **SysCmd**[7:5] - driving the write cause indication on **SysCmd[4:3**] - driving the target indication on SysAD[63:60] - driving the physical address on **SysAD[39:0**] - asserting SysVal\* #### Errata If the processor block write request results from the writeback of a secondary cache block, the *Dirty Exclusive* secondary cache block former state is driven on **SysAD[2:1]**, the secondary cache block way is driven on **SysAD[57]** and **SysCmd[0]** is asserted. If the processor block write request results from a completely gathered uncached accelerated block, the uncached attribute is driven on **SysAD[59:58]** and **SysCmd[0]** is negated. Each data cycle consists of the following: - asserting SysCmd[11] - driving the data quality indication on SysCmd[5] - driving the data type indication on **SysCmd[4:3**] - driving the data on **SysAD[63:0]** - asserting SysVal\* The first 7 or 15 data cycles have a request data type indication, and the last data cycle has a request last data type indication. The processor may negate **SysVal\*** between data cycles of a processor block write request only if the **SCClk** frequency is less than half of the **SysClk** frequency. The processor may only issue a processor block write request address cycle when the following are true: - the System interface is in master state - SysWrRdy\* was asserted two SysClk cycles previously - the processor is not the target of a conflicting outstanding external coherency request Figure 6-12 depicts two adjacent processor block write requests. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent relinquishes mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-12 Processor Block Write Request Protocol #### Processor Double/Single/Partial-Word Write Request Protocol A processor double/single/partial-word write request results from an uncached store or incompletely gathered uncached accelerated block. As shown in Figure 6-13, the processor issues a processor double/single/partial-word write request with a single address cycle immediately followed by a single data cycle. The address cycle consists of the following: - negating SysCmd[11] - driving the double/single/partial-word write command on **SysCmd[7:5**] - driving the write cause indication on **SysCmd[4:3**] - driving the data size indication on SysCmd[2:0] - driving the target indication on SysAD[63:60] - driving the uncached attribute on **SysAD**[59:58] - driving the physical address on SysAD[39:0] - asserting SysVal\* The data cycle consists of the following: - asserting SysCmd[11] - driving the request last data type indication on **SysCmd[4:3**] - driving the write data on SysAD[63:0] - asserting SysVal\* The processor may only issue a processor double/single/partial-word write request address cycle when the System interface is in master state and **SysWrRdy\*** was asserted two **SysClk** cycles previously. Figure 6-13 depicts three processor double/single/partial write requests. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent relinquishes mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-13 Processor Double/Single/Partial-Word Write Request Protocol #### **Processor Upgrade Request Protocol** A processor upgrade request results from a store or prefetch exclusive that hits a *Shared* block in the secondary cache. As shown in Figure 6-14, the processor issues a processor upgrade request with a single address cycle. This address cycle consists of the following: - negating SysCmd[11] - driving a free request number on SysCmd[10:8] - driving the upgrade command on **SysCmd**[7:5] - driving the upgrade cause indication on SysCmd[4:3] - driving the secondary cache block former state on SysCmd[2:1] - asserting SysCmd[0] - driving the target indication on SysAD[63:60] - driving the secondary cache block way on SysAD[57] - driving the physical address on **SysAD[39:0**] - asserting SysVal\* The processor may only issue a processor upgrade request address cycle when the following are true: - the System interface is in master state - SysRdRdy\* was asserted two SysClk cycles previously - the maximum number of outstanding processor requests specified by the PrcReqMax mode bits is not exceeded - there is a free request number - the processor is not the target of a conflicting outstanding external coherency request A single processor may have as many as four processor upgrade requests outstanding on the System interface at any given time. Figure 6-14 depicts four processor upgrade requests. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent relinquishes mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-14 Processor Upgrade Request Protocol ### **Processor Eliminate Request Protocol** A processor eliminate request results from the following: - a cached instruction fetch, load, store, or prefetch that misses in the secondary cache and forces the replacement of a *Shared* or *CleanExclusive* secondary cache block - a CACHE Index WriteBack Invalidate (S), Hit Invalidate (S), or Hit WriteBack Invalidate (S) instruction that forces the invalidation of a *Shared* or *CleanExclusive* secondary cache block - a CACHE Hit Invalidate (S) instruction that forces the invalidation of a *DirtyExclusive* secondary cache block. A processor eliminate request notifies the external agent that a *Shared, CleanExclusive*, or *DirtyExclusive* block has been eliminated from the secondary cache. Such requests are useful for systems implementing a directory-based coherency protocol, and are enabled by asserting the **PrcElmReq** mode bit. The processor issues a processor eliminate request with a single address cycle. This address cycle consists of the following: - negating SysCmd[11] - driving the special command on SysCmd[7:5] - driving the eliminate special cause indication on **SysCmd[4:3**] - driving the secondary cache block former state on **SysCmd[2:1**] - asserting **SysCmd[0]** - driving the target indication on SysAD[63:60] - driving the secondary cache block way on SysAD[57] - driving the physical address of the eliminated secondary cache block on SysAD[39:0] - asserting SysVal\* The processor may only issue a processor eliminate request address cycle when the following are true: - the System interface is in master state - SysWrRdy\* was asserted two SysClk cycles previously - the PrcElmReq mode bit is asserted - the processor is not the target of a conflicting outstanding external coherency request Figure 6-15 depicts three processor eliminate requests. Since the System interface is initially in slave state, the processor must first assert **SysReq\*** and then wait until the external agent relinquishes mastership of the System interface by asserting **SysGnt\*** and **SysRel\***. Figure 6-15 Processor Eliminate Request Protocol ### **Processor Request Flow Control Protocol** The processor provides the signals **SysRdRdy\*** and **SysWrRdy\*** to allow an external agent to control the flow of processor requests. **SysRdRdy\*** controls the flow of processor read and upgrade requests whereas **SysWrRdy\*** controls the flow of processor write and eliminate requests. The processor can only issue a processor read or upgrade request address cycle to the System interface if **SysRdRdy\*** was asserted two **SysClk** cycles previously. Similarly, the processor can only issue the address cycle of a processor write or eliminate request to the System interface if **SysWrRdy\*** was asserted two **SysClk** cycles previously. To determine the processor request buffering requirements for the external agent, note that the processor can issue any combination of processor requests in adjacent **SysClk** cycles. Also, since the System interface operates register-to-register with the external agent, a round trip delay of four **SysClk** cycles occurs between a processor request address cycle which prompts the external agent for flow control, and the flow control actually preventing any additional processor request address cycles from occurring. Consequently, if the maximum number of outstanding processor requests specified by the **PrcReqMax** mode bits is four, the external agent must be able to accept at least four processor read or upgrade requests. Also, the external agent must be able to accept at least four processor eliminate requests, two processor double/single/partial-word write requests, or one processor block write request. Figure 6-16 depicts three processor double/single/partial-word write requests and four processor block read requests. After sensing the first processor double/single/partial-word write request, the external agent negates **SysWrRdy\***. The external agent must have buffering sufficient for one additional processor write request before the flow control takes effect. The external agent negates **SysRdRdy\*** upon observing the first processor read request. The external agent must have buffering sufficient for three additional processor read requests before the flow control takes effect. Figure 6-16 Processor Request Flow Control Protocol ### **External Response Protocol** The processor supports two classes of external responses: - external data responses provide a double/single/partial-word of data or provide a block of data using the SysAD[63:0] bus - external completion responses provide an acknowledge, error, or negative acknowledge indication using the **SysResp[4:0]** bus An external agent may only issue an external data response to the processor when the System interface is in slave state. If the System interface is not already in slave state, the external agent must first negate **SysGnt\*** and then wait for the processor to assert **SysRel\***. If the System interface is already in slave state, the external agent may issue an external data response immediately. External data responses may be accepted by the processor in adjacent **SysClk** cycles and in arbitrary order, relative to corresponding processor requests. An external agent may issue an external completion response when the System interface is in either master or slave state. External completion responses may be accepted by the processor in adjacent **SysClk** cycles and in arbitrary order, relative to the corresponding processor requests. ### **External Block Data Response Protocol** An external agent may issue an external block data response in response to a processor block read or upgrade request. An external agent issues an external block data response with 8 or 16 data cycles. Each data cycle consists of the following: - asserting SysCmd[11] - driving the request number associated with the corresponding processor request on SysCmd[10:8] - driving the data quality indication on SysCmd[5] - driving the data type indication on **SysCmd[4:3**] - driving the cache block state on SysCmd[2:1] - driving the ECC check indication on SysCmd[0] - driving the data on SysAD[63:0] - asserting SysVal\* The first 7 or 15 data cycles have a response data type indication, and the last data cycle has a response last data type indication. The external agent may negate **SysVal\*** between data cycles of an external block data response. External block data response data must be supplied in subblock order, beginning with the quadword-aligned address specified by the corresponding processor request. External block data responses for processor coherent block read shared or noncoherent block read requests may indicate a state of *Shared*, *CleanExclusive*, or *DirtyExclusive*. External block data responses for processor coherent block read exclusive or upgrade requests may indicate a state of *CleanExclusive* or *DirtyExclusive*. Figure 6-17 depicts two processor block read requests and the corresponding external block data responses. Figure 6-17 External Block Data Response Protocol ### External Double/Single/Partial-Word Data Response Protocol An external agent may issue an external double/single/partial-word data response in response to a processor double/single/partial-word read request. An external agent issues an external double/single/partial-word data response with a single data cycle; the data cycle consists of: - asserting SysCmd[11] - driving the request number associated with the corresponding processor request on SysCmd[10:8] - driving the data quality indication on **SysCmd**[5] - driving the response last data type indication on **SysCmd[4:3**] - driving the ECC check indication on **SysCmd[0]** - driving the data on **SysAD[63:0]** - asserting SysVal\* Figure 6-18 depicts a processor double/single/partial-word read request and the corresponding external double/single/partial-word data response. Figure 6-18 External Double/Single/Partial-Word Data Response Protocol ### **External Completion Response Protocol** An external agent issues an external completion response to provide an acknowledge, error, or negative acknowledge to an outstanding request, and to free the associated request number. An external agent issues an external completion response by driving the response on **SysResp[4:0]** and asserting **SysRespVal\*** for one **SysClk** cycle. **SysResp[4:2]** contains the request number associated with the corresponding outstanding request and **SysResp[1:0]** contains an acknowledge, error, or negative acknowledge indication, as described below: - The external agent issues an external ACK completion response for a processor read or upgrade request to indicate that the request was successful. An external ACK completion response may only be issued for a processor read request if a corresponding external data response is coincidentally or previously issued. - The external agent issues an external ERR completion response for a processor read or upgrade request to indicate that the request was unsuccessful. Upon receiving an external ERR completion response, the processor takes a Bus Error exception on the associated instruction. If the processor read or upgrade request was caused by a PREFETCH instruction, no exception is taken. Also, if the request was caused by a speculative instruction, no exception is taken. - The external agent issues an external NACK completion response for a processor read or upgrade request to indicate that the request was not accepted. Upon receiving an external NACK completion response, the processor re-evaluates the associated instruction. Due to the speculative nature of the R10000 processor, the re-evaluation may or may not result in the reissue of a similar processor request. An external ERR or NACK completion response issued in response to an external intervention, allocate request number, or invalidate has no affect on the processor except to free the request number. Figure 6-19 depicts a processor upgrade request and a corresponding external completion response. Figure 6-19 External Completion Response Protocol ### **External Request Protocol** ### Errata An external agent issues an external request when it requires a resource within the processor. The external agent refers to any device attached to the processor system interface. It may be memory interface or cluster coordinator ASIC, or another processor residing on the cluster bus. An external agent may only issue an external request to the processor when the System interface is in slave state. If the System interface is not already in slave state, the external agent must first negate **SysGnt\*** and then wait for the processor to assert **SysRel\***. If the System interface is already in slave state, the external agent may issue an external request immediately. The total number of outstanding external requests, including interventions, allocate request numbers, and invalidates, cannot exceed eight. External requests may be accepted by the processor in adjacent **SysClk** cycles. External intervention and invalidate requests are considered external coherency requests. ### **External Intervention Request Protocol** An external agent issues an external intervention request to obtain a *Shared* or *Exclusive* copy of a secondary cache block. An external agent issues an external intervention request with a single address cycle; this address cycle consists of the following: - negating SysCmd[11] - driving a request number on **SysCmd[10:8]** - driving the intervention command on SysCmd[7:5] - driving the ECC check indication on **SysCmd[0]** - driving the target indication on SysAD[63:60] - driving the physical address on SysAD[39:0] - asserting SysVal\* An external agent may only issue an external intervention request address cycle when the System interface is in slave state; typically a free request number is specified. An external agent may have as many as eight external intervention requests outstanding on the System interface at any given time. Figure 6-20 depicts three external intervention requests. Since the System interface is initially in master state, the external agent must first negate **SysGnt\*** and then wait until the processor relinquishes mastership of the System interface by asserting **SysRel\***. Figure 6-20 External Intervention Request Protocol ### **External Allocate Request Number Request Protocol** An external agent issues an external allocate request number request to reserve a request number for private use. Once allocated, the processor is prevented from using the request number until an external completion response for the request number is received. An external agent issues an external allocate request number request with a single address cycle; this address cycle consists of the following: - negating SysCmd[11] - driving a free request number on SysCmd[10:8] - driving the allocate request number command on SysCmd[7:5] - asserting SysVal\* An external agent may only issue an external allocate request number request address cycle when the System interface is in slave state and there is a free request number. The external agent may have as many as eight external allocate request number requests outstanding on the System interface at any given time. Figure 6-21 depicts three external allocate request number requests. Since the System interface is initially in master state, the external agent must first negate **SysGnt\*** and then wait until the processor relinquishes mastership of the System interface by asserting **SysRel\***. Figure 6-21 External Allocate Request Number Request Protocol ### **External Invalidate Request Protocol** An external agent issues an external invalidate request to invalidate a secondary cache block. An external agent issues an external invalidate request with a single address cycle. This address cycle consists of the following: - negating SysCmd[11] - driving a request number on **SysCmd[10:8]** - driving the invalidate command on **SysCmd**[7:5] - driving the ECC check indication on **SysCmd[0]** - driving the target indication on SysAD[63:60] - driving the physical address on **SysAD[39:0**] - asserting SysVal\* An external agent may only issue an external invalidate request address cycle when the System interface is in slave state; typically a free request number is specified. An external agent may have as many as eight external invalidate requests outstanding on the System interface at any given time. Figure 6-22 depicts three external invalidate requests. Since the System interface is initially in master state, the external agent must first negate **SysGnt\*** and then wait until the processor relinquishes mastership of the System interface by asserting **SysRel\***. Figure 6-22 External Invalidate Request Protocol ### **External Interrupt Request Protocol** An external agent issues an external interrupt request to interrupt the normal instruction flow of the processor. An external agent issues an external interrupt request with a single address cycle. This address cycle consists of the following: - negating SysCmd[11] - driving the special command on **SysCmd[7:5**] - driving the interrupt special cause indication on **SysCmd[4:3**] - driving the ECC check indication on **SysCmd[0]** - driving the target indication on SysAD[63:60] - driving the *Interrupt* register write enables on **SysAD[20:16]** - driving the *Interrupt* register values on **SysAD[4:0**] - asserting SysVal\* An external agent may only issue an external interrupt request address cycle when the System interface is in slave state. Figure 6-23 depicts three external interrupt requests. Since the System interface is initially in master state, the external agent must first negate **SysGnt\*** and then wait until the processor relinquishes mastership of the System interface by asserting **SysRel\***. Figure 6-23 External Interrupt Request Protocol # **Processor Response Protocol** Processor responses are supplied by the processor in response to external coherency requests that target the processor. The R10000 processor issues a processor coherency state response for each external coherency request that targets the processor. The processor issues a processor coherency data response for each external intervention request that targets the processor and hits a *DirtyExclusive* secondary cache block. Processor coherency state responses are issued by the processor in the same order that the corresponding external coherency requests are received. Processor coherency state and data responses may occur in adjacent **SysClk** cycles. ### **Processor Coherency State Response Protocol** A processor coherency state response results from an external coherency request that targets the processor. #### Errata The processor issues a processor coherency state response by driving the secondary cache block tag quality indication on **SysState[2]**, driving the secondary cache block former state on **SysState[1:0]**, and asserting **SysStateVal\*** for one **SysClk** cycle. The processor coherency state responses are issued in an order designated by the external coherency requests and will always be issued before an associated processor coherency data response. Note that processor coherency state responses can be pipelined ahead of the associated processor coherency data responses, and processor coherency data responses can be returned out-of-order. These cases typically arise from external coherency requests hitting outgoing buffer entries. Figure 6-24 depicts two external coherency requests and the resulting processor coherency state responses. Figure 6-24 Processor Coherency State Response Protocol ### **Processor Coherency Data Response Protocol** A processor coherency data response results from an external intervention request that targets the processor and hits a *DirtyExclusive* secondary cache block. The processor issues a processor coherency data response with a single empty cycle followed by either 8 or 16 data cycles. The empty cycle consists of negating **SysVal\*** for a single **SysClk** cycle. The data cycles consist of the following: - asserting SysCmd[11] - driving the request number associated with the corresponding external coherency request on **SysCmd[10:8**] - driving the data quality indication on SysCmd[5] - driving the data type indication on **SysCmd[4:3**] - driving the state of the cache block on SysCmd[2:1] - asserting SysCmd[0] - driving the data on SysAD[63:0], - asserting SysVal\* The first 7 or 15 data cycles have a response data type indication, and the last data cycle has a response last data indication. The processor may negate **SysVal\*** between data cycles of a processor coherency data response only if the **SCClk** frequency is less than half of the **SysClk** frequency. The processor may only issue a processor coherency data response when the System interface is in master state and SysWrRdy\* was asserted two SysClk cycles previously. Note that the empty cycle is considered the issue cycle for a processor coherency data response. If the System interface is not already in master state, the processor must first assert SysReq\*, and then wait for the external agent to relinquish mastership of the System interface bus by asserting SysGnt\* and SysRel\*. If the System interface is already in master state, the processor may issue a processor coherency data response immediately. ### Errata When **SysStateVal\*** is negated, **SysState[0]** provides the processor coherency data response indication. The processor asserts the processor coherency data response indication when there are one or more processor coherency data responses pending issue in the outgoing buffer. Once asserted, the indication is negated when the first doubleword of the last pending issue processor coherency data response is issued to the system interface bus. The processor coherency data response indication is not affected by **SysWrRdy\***. However, as previously noted the processor may only issue a processor coherency data response when **SysWrRdy\*** was asserted two **SysClk** cycles previously. Processor coherency data response data is supplied in subblock order, beginning with the quadword-aligned address specified by the corresponding external coherency request. Processor coherency data responses are not necessarily issued in the same order as the external coherency requests; however each processor coherency data response always follows the corresponding processor coherency state response. Note that more than one processor coherency state response may be pipelined ahead of the corresponding processor coherency data responses. Figure 6-25 depicts one external coherency request and the resulting processor coherency state and data responses. Figure 6-25 Processor Coherency Data Response Protocol ### **6.18 System Interface Coherency** The System interface supports external intervention shared, intervention exclusive, and invalidate coherency requests. These requests are used by an external agent or other R10000 processors on the cluster bus to maintain cache coherency. Each external coherency request that targets an R10000 results in a processor coherency state response. Additionally, each external intervention request that targets the R10000 and hits a *DirtyExclusive* secondary cache block results in a processor coherency data response. External coherency requests and the corresponding processor coherency state responses are handled in FIFO order. ### **External Intervention Shared Request** An external intervention shared request is used by an external agent to obtain a *Shared* copy of a cache block. If the desired block resides in the processor cache, it is marked *Shared*. If the secondary cache block's former state was *DirtyExclusive*, the processor issues a processor coherency data response. ### **External Intervention Exclusive Request** An external intervention exclusive request is used by an external agent to obtain an *Exclusive* copy of a cache block. If the desired block resides in the processor cache, it is marked *Invalid*. If the secondary cache block's former state was *DirtyExclusive*, the processor issues a processor coherency data response. ### **External Invalidate Request** An external invalidate request is used by an external agent to invalidate a cache block. If the desired block resides in the processor cache, it is marked *Invalid*. Under normal circumstances, the secondary cache block former state should not be *CleanExclusive* or *DirtyExclusive*. # **External Coherency Request Action** Table 6-27 indicates the action taken for external coherency requests that target the processor. Table 6-27 Action Taken for External Coherency Requests that Target the R10000 Processor $^{\dagger}$ | Secondary Cache<br>Block<br>Former State | Type of<br>External Request | Secondary<br>Cache Block<br>New State | Processor Coher-<br>ency State Re-<br>sponse<br>SysState[1:0] | Processor Co-<br>herency Data<br>Response Re-<br>quired? | Processor Coher-<br>ency Data Re-<br>sponse State<br>SysCmd[2:1] | |------------------------------------------|--------------------------------------------------------------------------|---------------------------------------|---------------------------------------------------------------|----------------------------------------------------------|------------------------------------------------------------------| | Invalid | Intervention shared Intervention exclusive | Invalid<br>Invalid | 0 | No<br>No | N/A<br>N/A | | | Invalidate | Invalid | 0 | No | N/A | | Shared | Intervention shared<br>Intervention exclusive<br>Invalidate | Shared<br>Invalid<br>Invalid | 1<br>1<br>1 | No<br>No<br>No | N/A<br>N/A<br>N/A | | CleanExclusive | Intervention shared<br>Intervention exclusive<br>Invalidate <sup>‡</sup> | Shared<br>Invalid<br>Invalid | 2<br>2<br>2 | No<br>No<br>No | N/A<br>N/A<br>N/A | | DirtyExclusive | Intervention shared* Intervention exclusive Invalidate* | Shared<br>Invalid<br>Invalid | 3<br>3<br>3 | Yes<br>Yes<br>No | Shared<br>DirtyExclusive<br>N/A | <sup>‡</sup> This should not occur under normal circumstances. <sup>\*</sup> The processor coherency data response must be written back to memory. <sup>†</sup> These actions are taken in cases where there are no internal coherency conflicts. For exceptions due to internal coherency conflicts, please refer to Table 6-28. ### **Coherency Conflicts** Coherency conflicts arise when a processor request and an external request target the same secondary cache block. Coherency conflicts may be categorized as either internal or external, and are described in this section. ### **Internal Coherency Conflicts** A processor request is considered to be **pending issue** when it is buffered in the processor and has not yet been issued to the System interface bus. Internal coherency conflicts occur when the processor has a processor request pending issue and a conflicting external coherency request is received. Internal coherency conflicts are unavoidable and cannot be anticipated by the external agent since it cannot anticipate when the processor will have processor requests pending issue. Table 6-28 describes the manner in which the processor resolves internal coherency conflicts. Table 6-28 Internal Coherency Conflict Resolution | Processor Request Pending Issue | Conflicting External Coherency Request | Resolution | | | | |---------------------------------|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--| | Coherent block read | Intervention shared | The processor allows the conflicting external | | | | | | Intervention exclusive | coherency request to proceed and provides an <i>Invalid</i> processor coherency state response. The processor | | | | | | Invalidate | stalls the processor coherent block read request until the conflicting external coherency request has received an external completion response. | | | | | Upgrade | Intervention shared | The processor allows the conflicting external | | | | | | Intervention exclusive | coherency request to proceed and provides a <i>Shared</i> processor coherency state response. Once the | | | | | | Invalidate | conflicting external coherency request has received an external completion response, the processor internally NACKs the processor upgrade request that is pending issue. | | | | | | Intervention shared | The processor provides a <i>DirtyExclusive</i> processor | | | | | Block write | Intervention exclusive | coherency state response and changes the processor block write request that is pending issue into a <i>DirtyExclusive</i> processor coherency data response. | | | | | | Invalidate | The processor provides a <i>DirtyExclusive</i> processor coherency state response and deletes the processor block write request that is pending issue. | | | | | | Intervention shared | The processor provides a Shared or CleanExclusive | | | | | Eliminate | Intervention exclusive | processor coherency state response and deletes the | | | | | | Invalidate | processor eliminate request that is pending issue. <sup>‡</sup> | | | | <sup>‡</sup> If the processor eliminate request that is pending issue has a *DirtyExclusive* state, a *CleanExclusive* processor coherency state response is provided. ### **External Coherency Conflicts** #### Errata A processor request is considered to be **pending response** when it has been issued to the System interface bus but has not yet received an external data or completion response. External coherency conflicts occur when the processor has a processor request that is pending response and a conflicting external coherency request is received. The processor relies on the external agent to detect and resolve external coherency conflicts. If the external agent chooses to issue an external coherency request to the processor which causes an external coherency conflict, the external coherency request must be completed before an external response is given to the conflicting processor request. External coherency conflicts may be avoided if the point of coherence is the processor System interface bus and only one request is allowed to be outstanding for any given secondary cache block. However, in some system designs external coherency conflicts are unavoidable. Processor block write and eliminate requests are never pending response, and therefore cannot cause external coherency conflicts. Table 6-29 describes the manner in which the external agent resolves external coherency conflicts. Table 6-29 External Coherency Conflict Resolution | Processor Requests that are Pending Response | Conflicting External Co-<br>herency Request | Resolution | | | |----------------------------------------------|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--| | Coherent block read | Intervention shared | The external agent responds to the external coherency | | | | | Intervention exclusive | requestor that the block is <i>Invalid</i> . At some later time the external agent supplies an external response to th | | | | | Invalidate | processor coherent block read request that is pending response. <sup>‡</sup> | | | | | Intervention shared | The external agent responds to the external coherency requestor that the block is <i>Shared</i> . At some later time, the external agent supplies an external response to the processor upgrade request that is pending response.* | | | | | Intervention exclusive | The external agent issues the conflicting external | | | | Upgrade | Invalidate | coherency request to the processor. The processor allows the conflicting external coherency request to proceed and supplies a <i>Shared</i> processor coherency state response. After observing the processor coherency state response, the external agent provides an external ACK completion response for the conflicting external coherency request. At some later time, the external agent supplies an external response for the processor upgrade request that is pending response. This external response may not be an external ACK completion response unless it is associated with an external block data response. | | | <sup>‡</sup> Although it is not required, the external agent may choose to issue the conflicting external coherency request to R10000 and the processor will return an *invalid* processor coherency state response. # Errata Revised the two footnotes in Table 6-29 above. <sup>\*</sup> Although it is not required, the external agent may choose to issue the conflicting external coherency request to R10000 and the processor will return a *shared* processor coherency state response. ### **External Coherency Request Latency** This section describes the R10000 external coherency request latency. Figure 6-26 depicts the following: - an external coherency request which targets the processor - the resulting processor coherency state response - the potential processor coherency data response Two external coherency request latency parameters are also defined: - the processor coherency state response latency, t<sub>pcsr</sub>, specifies the time from external coherency request to processor coherency state response - the processor coherency data response latency, t<sub>pcdr</sub>, specifies the time from the external coherency request to the processor coherency data response if a master, or to the assertion of the processor coherency data response indication on SysState[0] if a slave. Figure 6-26 External Coherency Request Latency Parameters The external coherency request latency is presented in Table 6-30. Table 6-30 External Coherency Request Latency | | Latency <sup>‡</sup> (PClk cycles) | | | | | | |----------|------------------------------------------------------------|-------------------|---------------------------------------------------------|-------------------|--------------------|--------| | | Processor Coherency State<br>Response (t <sub>pcsr</sub> ) | | Processor Coherency Data Response* (t <sub>pcdr</sub> ) | | | | | SCClkDiv | Min <sup>†</sup> | Typ <sup>‡‡</sup> | Max** | Min <sup>††</sup> | Typ <sup>‡‡‡</sup> | Max*** | | 1 | 5 | 10 | 39 | 8 | 28 | 70 | | 1.5 | 5 | 13 | 48 | 8 | 33 | 88 | | 2 | 5 | 14 | 59 | 8 | 38 | 105 | | 2.5 | 5 | 16 | 71 | 8 | 43 | 128 | | 3 | 5 | 17 | 79 | 8 | 43 | 141 | - ‡ This latency assumes no other previously issued external coherency requests are outstanding. 1 to 3 additional PClk cycles may be required for synchronization with SysClk depending on the SysClkDiv mode bits. - \* This value assumes a 32-word secondary cache block size. - † This value assumes the external coherency request hits a cached or outgoing buffer entry. - ‡‡ This value assumes the external coherency request does not hit a cached or outgoing buffer entry, the secondary cache is not busy, and the external coherency request hits in the MRU way of the secondary cache. If the external coherency request misses in the most-recently used (MRU) way of the secondary cache, 1 to 3 additional PClk cycles are required to query the LRU way of the secondary cache, depending on the SCClkDiv mode bits. - \*\* This value assumes the external coherency request does not hit a cached or outgoing buffer entry, the secondary cache just commenced an index-conflicting CACHE Hit WriteBack Invalidate (S), and the external coherency request misses in the secondary cache MRU way. - †† This value assumes the external coherency request hits an outgoing buffer entry. - ‡‡‡ This value assumes the external coherency request does not hit a cached or outgoing buffer entry, the secondary cache is not busy, the external coherency request hits in the MRU way of the secondary cache, no subset primary data cache blocks are inconsistent, and the external coherency request is secondary cache block-aligned. If the external coherency request misses in the MRU way of the secondary cache, 1 to 3 additional PClk cycles are required to query the LRU way of the secondary cache, depending on the SCClkDiv mode bits. - \*\*\* This value assumes the external coherency request does not hit a cached or outgoing buffer entry, the secondary cache just commenced an index-conflicting CACHE Hit WriteBack Invalidate (S), the external coherency request hits in the LRU way of the secondary cache, all subset primary data cache blocks are inconsistent, and the external coherency request is not secondary cache block-aligned. ### SysGblPerf\* Signal The **SysGblPerf\*** signal is provided for systems implementing a relaxed consistency memory model. The external agent asserts this signal when all processor requests are globally performed, thereby allowing the processor to graduate SYNC instructions. The external agent negates this signal when some processor requests are not yet globally performed, thereby preventing the processor from graduating SYNC instructions. To prevent a SYNC instruction from graduating, the external agent must negate the SysGblPerf\* signal no later than the same SysClk cycle in which it issued the external completion response for a processor read or upgrade request which is not yet globally performed. Also, the external agent must negate the SysGblPerf\* signal no later than two SysClk cycles after the address cycle of a processor double/single/partial-word write request which has not yet been globally performed. The **SysGblPerf\*** signal may be permanently asserted in systems implementing a sequential consistency memory model. # 6.19 Cluster Bus Operation A R10000 multiprocessor cluster may be created by directly attaching the System interfaces of 2 to 4 R10000 processors, and providing an external cluster coordinator to handle arbitration and coherency management. The cluster coordinator arbitrates the multiprocessors using the **SysReq\***, **SysGnt\***, and **SysRel\*** signals. A processor request issued by an R10000 processor in master state is observed as an external request by any R10000 processors in the slave state on the cluster bus. This is described Table 6-31. Table 6-31 Relationship Between Processor and External Requests for the Cluster Bus | Processor Request | External Request | |----------------------------------|-------------------------| | Coherent block read shared | Intervention shared | | Coherent block read exclusive | Intervention exclusive | | Noncoherent block read | Allocate request number | | Double/single/partial-word read | Allocate request number | | Block write | NOP | | Double/single/partial-word write | NOP | | Upgrade | Invalidate | | Eliminate | NOP | In the same manner, a processor coherency data response issued by a processor in the master state is observed as an external block data response by any processors in the slave state. External coherency requests that target a processor are handled in FIFO order and result in processor coherency state responses. If an external coherency request that targets a processor hits a *DirtyExclusive* secondary cache block, the processor also provides a processor coherency data response. Figure 6-27 presents an example of a processor read request with four R10000 processors residing on the cluster bus. The **CohPrcReqTar** mode bit is asserted for a snoopy-based coherency protocol. R10000<sub>0</sub> issues a processor coherent read exclusive request. This is observed as an external intervention exclusive request by R10000<sub>1</sub>, R10000<sub>2</sub>, and R10000<sub>3</sub>. R10000<sub>1</sub> and R10000<sub>3</sub> respond with *Invalid* processor coherency state responses. R10000<sub>2</sub> responds with a *DirtyExclusive* processor coherency state response. Based on these processor coherency state responses, the cluster coordinator allows R10000<sub>2</sub> to become master of the System interface so that it may provide a processor coherency data response, which will be observed as an external block data response by R10000<sub>0</sub>. Finally, the cluster coordinator issues an external ACK completion response to forward the external block data response and to free the request number. Figure 6-28 presents an example of a processor upgrade request with four R10000 processors residing on the cluster bus. The **CohPrcReqTar** mode bit is asserted for a snoopy-based coherency protocol. R10000<sub>0</sub> issues a processor upgrade request, observed as an external invalidate request by R10000<sub>1</sub>, R10000<sub>2</sub>, and R10000<sub>3</sub>. R10000<sub>2</sub> and R10000<sub>3</sub> provide *Shared* processor coherency state responses. R10000<sub>1</sub> provides an *Invalid* processor coherency state response. Based on these processor coherency state responses, the cluster coordinator issues an external ACK completion response for the processor upgrade request to indicate that the request was successful and to free the request number. Figure 6-27 R10000 Multiprocessor Cluster Processor Read Request Example Figure 6-28 R10000 Multiprocessor Cluster Processor Upgrade Request Example ## 6.20 Support for I/O The processor assumes a memory-mapped I/O model. Consequentially, no special System interface encodings are provided, or required to designate I/O accesses. It is left to the programmer to ensure that I/O addresses have the appropriate TLB mappings. The processor supports system designs utilizing hardware or software for coherent I/O. The external coherency requests are useful for creating systems with hardware I/O coherency, and the CACHE instruction is sufficient for creating a system with software I/O coherency. ## **6.21 Support for External Duplicate Tags** Some system designs implement an external duplicate copy of the secondary cache tags to reduce the coherency request latency and also filter out unnecessary external coherency requests made to the R10000 processor. For such systems, it must be remembered that blocks may reside in either the secondary cache or in the outgoing buffer. During the address cycle of processor block read requests, the secondary cache block former state is provided. The external agent may use this information to maintain the external duplicate tags. Typically, in a multiprocessor system using the cluster bus, the cluster coordinator specifies a free request number for an external coherency request. However, in a system using a duplicate-tag or directory-based coherency protocol, where the **CohPrcReqTar** mode bit is negated, the cluster coordinator may specify a busy request number for an external coherency request, providing each targeted R10000 processor has the request number busy due to an outstanding processor coherency request from another processor. For example, suppose the processor in master state issues a processor coherent block read or upgrade request. The processors in slave state observe the processor request as an external coherency request that targets the external agent only, causing the associated request number to become busy. The cluster coordinator checks the duplicate tag or directory structure to determine if the block resides in the cache of one of the processors that was in slave state. If necessary, the cluster coordinator issues an external coherency request targeted at one or more of the processors that were in slave state. By using the same request number as the original processor request, this external coherency request does not consume a free request number, and allows a potential processor coherency data response to be supplied as an external block data response to the original processor request. # 6.22 Support for a Directory-Based Coherency Protocol Some system designs implement a directory-based coherency protocol. For such systems, the processor provides the processor eliminate request cycle. If the **PrcElmReq** mode bit is asserted, the processor issues a processor eliminate request whenever it intends to eliminate a *Shared*, *CleanExclusive*, or *DirtyExclusive* block from the secondary cache. During the address cycle of the processor eliminate request, the physical address and the secondary cache block former state are provided. The external agent may then use this information to maintain an external directory structure. # **6.23 Support for Uncached Attribute** The processor supports a 2-bit user-defined *Uncached Attribute*, which is driven on **SysAD[59:58]** during the address cycle of the following: - processor double/single/partial-word read requests - double/single/partial-word write requests - block write requests resulting from completely gathered uncached accelerated blocks For unmapped accesses, the uncached attribute is sourced from VA[58:57]. For mapped accesses, the uncached attribute is sourced from the TLB *Uncached Attribute* field. The TLB *Uncached Attribute* field may be initialized in 64-bit mode using bits 63:62 of the CP0 *EntryLo0* and *EntryLo1* registers. # **6.24 Support for Hardware Emulation** When using the R10000 processor in hardware emulation, it is desirable to operate the System interface at a relative low frequency (typically 1 MHz or below). Since the R10000 processor contains dynamic circuitry, an external agent cannot simply provide low frequency **SysClk**, so a **SysCyc\*** input to the processor allows an external agent to define a virtual system clock, and yet supply a **SysClk** within the acceptable operating range. The assertion of **SysCyc\*** in a particular **SysClk** cycle creates a virtual system clock pulse four **SysClk** cycles later. **SysCyc\*** may be asserted aperiodically. In a normal system environment, the SysCyc\* input should be permanently asserted. Figure 6-29 depicts the use of **SysCyc\*** to create a virtual **SysClk** of one-third the normal **SysClk** frequency. Figure 6-29 Hardware Emulation Protocol # 7. Clock Signals The R10000 processor has differential PECL clock inputs, **SysClk** and **SysClk\***, from which all processor internal clock signals and secondary cache clock signals are derived. Three major clock domains are in the processor: - the **System interface clock domain**, which operates at the system clock frequency and controls the System interface signals - the internal processor clock domain, which controls the processor core logic - the **secondary cache clock domain**, which controls signals communicating with the external secondary cache synchronous SRAM These domains are described in this chapter. # 7.1 System Interface Clock and Internal Processor Clock Domains In high performance systems, PECL-level differential clocks are routinely used to minimize system clock skews. The R10000 processor receives differential system clock signals at the **SysClk** and **SysClk\*** pins; two additional pins, **SysClkRet** and **SysClkRet\***, are the return paths for termination of these signals. **SysClk** and **SysClk\*** are used to drive an on-chip phase-locked loop (PLL), which multiplies the system clock to create an internal processor clock, **PClk**. The R10000 processor always communicates with the system at the **SysClk** frequency, and **PClk** always runs at a frequency-multiple of **SysClk**, according to the following formula: For example, in a 50 MHz system with **SysClkDiv** = 7 and **SCClkDiv**=2, **PClk**= 50\*8/2 = 200 MHz. **NOTE:** It is preferred that the R10000 processor uses a differential PECL clock input. However, in a less-aggressive system, a CMOS/TTL single-ended clock can be used to drive the processor, provided its complementary clock input, **SysClk\***, is tied to an appropriate reference voltage (1.4V for TTL, Vcc/2 for CMOS). In any case, the reference voltage applied to SysClk\* should not be less than 1.2V. # 7.2 Secondary Cache Clock The processor uses registered synchronous SRAMs for its secondary cache, to allow pipelined accesses. ### Errata The processor provides 6 pairs of differential clock outputs, SCClk(5:0) and SCClk\*(5:0), to be used by the secondary cache synchronous SRAMs. These outputs swing between VccQSC and Vss. The SCClkTap mode bits (Mode bits are described in Chapter 8, the section titled "Mode Bits.") specify the alignment of SCClk(5:0) and SCClk\*(5:0) relative to the internal secondary cache clock. Note that the output buffer delay is not included. The secondary cache interface clock is generated by dividing down the internal processor clock, **PClk**. SCClk is related to SysClk according to the following formula: SCClk = SysClk\*(SysClkDiv+1)/(SCClkDiv+1) For example, in a 50 MHz system with **SysClkDiv**=7 and **SCClkDiv**=2, **SCClk** = 50\*8/3 = 133 MHz. # 7.3 Phase-Locked-Loop The processor uses the internal PLL for clock generation and multiplication as shown in Figure 7-1. Values of the termination resistors for the **SysClkRet/SysClkRet\*** signals are system-dependent. The system designer must select a value based upon the characteristic impedance of the board, therefore it is beyond the scope of this manual to specify values for these termination resistors. Figure 7-1 R10000 System and Secondary Cache Clock Interface # 8. Initialization This section describes initialization of the R10000 processor, including initialization of logical registers. Initialization of the processor occurs during a reset sequence. The processor supports three separate reset sequences: - Power-on reset - · Cold reset - Soft reset These sequences are described in this chapter. Also described are the mode bits. ### 8.1 Initialization of Logical Registers After a power-on or cold reset sequence, all logical registers (both in the integer and the floating-point register files) must be written before they can be read. Failure to write any of these registers before reading from them will have an unpredictable result. # **8.2 Power-On Reset Sequence** The Power-on Reset sequence is used to reset the processor after the initial power-on, or whenever power or **SysClk** are interrupted. The Power-on Reset sequence is as follows: - The external agent negates DCOk. - The external agent asserts SysReset\*. - The external agent negates SysGnt\*. - The external agent negates SysRespVal\*. - Once Vcc, VccQ[SC,Sys], Vref[SC,Sys], Vcc[Pa,Pd], and SysClk stabilize, the external agent waits at least 1ms and then asserts DCOk. - At this time, the System interface resides in slave state and all internal state is initialized. - The SysClkDiv mode bits default to divide-by-1. - The **SCClkDiv** mode bits default to divide-by-3. - After waiting at least 100 ms for the internal clocks to stabilize, the external agent loads the mode bits into the processor by driving the mode bits on SysAD[63:0], waiting at least two SysClk cycles, and then asserting SysGnt\* for at least one SysClk cycle. - After waiting at least another 100 ms for the internal clocks to restabilize, the external agent synchronizes all clocks internal to the processor. This is performed by asserting SysRespVal\* for one SysClk cycle. - After waiting at least 100 ms for the internal clocks to again restabilize, (a third 100 ms restabilization period) the external agent negates **SysReset\***. - The external agent must retain mastership of the System interface, refrain from issuing external requests or nonmaskable interrupts, and ignore the system state bus until the processor asserts SysReq\*. The assertion of SysReq\* indicates the processor is ready for operation. In a cluster arrangement, all processors must assert SysReq\*, indicating they are ready for operation. ### Errata If the **virtual SysClk** is used during the reset sequence, the mode bits, **SysGnt\***, **SysRespVal\***, and **SysReset\*** should all be referenced to the virtual **SysClk** that is created with **SysCyc\***. This approach will cause the R10000 to come out of reset synchronously with the **virtual SysClk**, which will allow repeatable and lock-step operation (see Chapter 6, the section titled "Support for Hardware Emulation," for description of **virtual SysClk** operation). During a Power-on Reset sequence, all internal state is initialized. A Power-on Reset sequence causes the processor to start with the Reset exception. Figure 8-1 Power-On Reset Sequence ### 8.3 Cold Reset Sequence The Cold Reset sequence is used to reset the entire processor, and possibly alter the mode bits while power and **SysClk** are stable. The Cold Reset sequence is as follows: - The external agent negates SysGnt\* and SysRespVal\*. - After waiting at least one SysClk cycle, the external agent asserts SysReset\*. - After waiting at least 100 ms, the external agent loads the mode bits into R10000. This is performed by driving the mode bits on SysAD[63:0], waiting at least two SysClk cycles, and then asserting SysGnt\* for at least one SysClk cycle. - After waiting at least another 100 ms for the internal clocks to restabilize, the external agent synchronizes all processor internal clocks by asserting SysRespVal\* for one SysClk cycle. - After waiting at least 100 ms for the internal clocks to again restabilize, (a third 100 ms restabilization period) the external agent negates SysReset\*. - The external agent must retain mastership of the System interface, refrain from issuing external requests or nonmaskable interrupts, and ignore the system state bus until the processor asserts SysReq\*. The assertion of SysReq\* indicates the processor is ready for operation. In a cluster arrangement, all processors must assert SysReq\*, indicating they are ready for operation. During a Cold Reset sequence all processor internal state is initialized. A Cold Reset sequence causes the processor to start with a Reset exception. Figure 8-2 shows the cold reset sequence. # 8.4 Soft Reset Sequence A Soft Reset sequence is used to reset the external interface of the processor without altering the mode bits while power and **SysClk** are stable. The Soft Reset sequence is as follows: - The external agent negates **SysGnt\*** and **SysRespVal\***. - After waiting at least one SysClk cycle, the external agent asserts SysReset\* for at least 16 SysClk cycles. - The external agent must retain mastership of the System interface, refrain from issuing external requests or nonmaskable interrupts, and ignore system state bus until the processor asserts **SysReq\***. The assertion of **SysReq\*** indicates the processor is ready for operation. In a cluster arrangement, all processors must assert **SysReq\***, indicating they are ready for operation. During a Soft Reset sequence, all external interface state is initialized. The internal and secondary cache clocks are not affected by a Soft Reset sequence. The general purpose, CP0, and CP1 registers are preserved, as well as the primary and secondary caches. A Soft Reset sequence causes a Soft Reset exception, in which the Soft Reset exception handler executes instructions from uncached space and uses CACHE instructions to analyze and dump the contents of the primary and secondary caches. To resume normal operation, a Cold Reset sequence must be initiated. Figure 8-3 presents the Soft Reset sequence. Figure 8-3 Soft Reset Sequence # 8.5 Mode Bits The R10000 processor uses mode bits to configure the operation of the microprocessor. These mode bits are loaded into the processor from the **SysAD[63:0]** bus during a power-on or cold reset sequence while **SysGnt\*** is asserted. The **SysADChk[7:0]** bus does not have to contain correct ECC during mode bit initialization. During the reset sequence, the mode bits obtained from **SysAD[24:0]** are written into bits 24:0 of the CP0 *Config* register. The mode bits are described in Table 8-1. Table 8-1 Mode Bits | Coug A D Did | None and Europian | Valera | Mode | Setting | |--------------|----------------------------------------------------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|---------------| | SysAD Bit | Name and Function | Value | R10000 | <u>R12000</u> | | 2:0 | <b>Kseg0CA</b> Specifies the $kseg0$ cache algorithm. | 0<br>1<br>2<br>3<br>4<br>5<br>6<br>7 | Reserved Reserved Uncached Cacheable noncoherent Cacheable coherent exclusiv Cacheable coherent exclusiv Reserved Uncached accelerated | | | 4:3 | <b>DevNum</b> Specifies the processor device number. | 0-3 | | | | 5 | CohPrcReqTar Specifies the target of processor coherent requests issued on the System interface by the processor. | 0 | External agent only<br>Broadcast | | | 6 | PrcElmReq Specifies whether to enable processor eliminate requests onto the System interface by the processor. | 0 | Disable<br>Enable | | | 8:7 | PrcReqMax Specifies the maximum number of outstanding processor requests allowed on the System interface by the processor. | 0<br>1<br>2<br>3 | 1 outstanding processor requ<br>2 outstanding processor requ<br>3 outstanding processor requ<br>4 outstanding processor requ | ests<br>ests | Table 8-1 (cont.) Mode Bits | Cara A D. D. | Name and Every 4 | <b>X</b> 70.1 | Mode | Setting | |--------------|--------------------------------------------|---------------|----------------------------|----------------------------------------------| | SysAD Bit | Name and Function | Value | R10000 | <u>R12000</u> | | | | 0 | Reserved | Reserved | | | | 1 | Result of division by 1 | NOT AVAILABLE | | | | 2 | Result of division by 1.5 | NOT AVAILABLE | | | | 3 | Result of division by 2 | Result of division by 2 | | | | 4 | Result of division by 2.5 | Result of division by 2.5 | | | SysClkDiv | 5 | Result of division by 3 | Result of division by 3 | | | Sets PClk to SysClk ratio; determines the | 6 | Result of division by 3.5 | Result of division by 3.5 | | 12:9 | System interface clock frequency; see | 7 | Result of division by 4 | Result of division by 4 | | 12.9 | Chapter 7, the section titled "System | 8 | Reserved | Result of division by 4.5 | | | Interface Clock and Internal Processor | 9 | Reserved | Result of division by 5 | | | Clock Domains | A | Reserved | Result of division by 5.5 | | | | В | Reserved | Result of division by 6 | | | | C | Reserved | Result of division by 7 | | | | D | Reserved | Result of division by 8 | | | | Е | Reserved | Result of division by 9 | | | | F | Reserved | Result of division by 10 | | 12 | SCBlkSize | 0 | 16-word | | | 13 | Specifies the secondary cache block size. | 1 | 32-word | | | | SCCorEn | 0 | Retry access through corre | ctor | | 14 | Specifies the method of correcting | 1 | Always access through cor | | | | secondary cache data array ECC errors. | 1 | Aiways access through cor | rector | | 15 | MemEnd | 0 | Little endian | | | | Specifies the memory system endianness. | 1 | Big endian | | | | | 0 | 512 Kbyte | | | | | 1 | 1 Mbyte | | | | a aa. | 2 | 2 Mbyte | | | 18:16 | SCSize | 3 | 4 Mbyte | | | | Specifies the size of the secondary cache. | 4 | 8 Mbyte | | | | | 5 | 16 Mbyte | | | | | 6 | Reserved | | | | | 7 | Reserved | | | | | 0 | Reserved | Reserved | | | SCClkDiv | 1 | Result of division by 1 | NOT AVAILABLE | | | Sets PClk to SCClk ratio; determines the | 2 | Result of division by 1.5 | Result of division by 1.5 | | 21:19 | secondary cache clock frequency; see | 3 | Result of division by 2 | Result of division by 2 | | | Chapter 7, the section titled "System | 4 | Result of division by 2.5 | Result of division by 2.5 | | | Interface Clock and Internal Processor | 5 | Result of division by 3 | Result of division by 3 | | | Clock Domains | 6 | Reserved | Reserved | | | | 7 | Reserved | Result of division by 4 | | | | 0 | | Reserved | | | | 1 | | Reserved | | | | 2 | | Reserved | | 24.55 | | 3 | | Reserved | | 24:22 | Reserved | 4 | | Delay Speculative Dirty - | | | | _ | | $\frac{\text{fix for speculative store}}{1}$ | | | | 5 | | Reserved | | | | 6 | | Reserved | | | | 7 | | Reserved | Table 8-1 (cont.) Mode Bits | Swa A D Dit | Name and Function | Value | Mode S | Setting | |-------------|--------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | SysAD Bit | Name and Function | value | R10000 | <u>R12000</u> | | 28:25 | SCClkTap Specifies the alignment of SCClk[5:0] and SCClk*[5:0] relative to the internal secondary cache clock. | 0<br>1<br>2<br>3<br>4<br>5<br>6<br>7<br>8<br>9<br>A<br>B<br>C<br>D<br>E<br>F | SCClk same phase as intern SCClk 1/12 PClk period ea SCClk 2/12 PClk period ea SCClk 3/12 PClk period ea SCClk 4/12 PClk period ea SCClk 5/12 PClk period ea undefined undefined SCClk 6/12 PClk period ea SCClk 7/12 PClk period ea SCClk 8/12 PClk period ea SCClk 8/12 PClk period ea SCClk 10/12 PClk period ea SCClk 10/12 PClk period ea SCClk 11/12 Undefined | arlier than internal clock | | 29 | Reserved | 0 | | | | 30 | ODrainSys Specifies whether or not to configure select* System interface bidirectional and output signals as open drain. | 0<br>1 | Push-pull<br>Open drain | | | 31 | <b>CTM</b> Specifies whether or not to enable cache test mode. | 0 | Disable<br>Enable | | | 63:32 | Reserved | 0 | | | <sup>†</sup> The Boot Mode bit 24 corresponds to the Config register[24] bit and this controls DSD during kernel and supervisor modes. However, the DSD mode can also be enabled in the user mode by setting the Status register[24] bit. Config register[24] is read-only and can be set only at boot time. #### If the DSD mode is set - - a) R12000 will not set the Dirty bit for a secondary cache block until the store instruction is the oldest in the Active List and is about to be executed. (An interrupt could cause a case where the dirty bit is set (store is no longer speculative), but the store does not immediately graduate. We believe this case should not cause any problem. This mode does prevent speculative stores from setting the dirty bit.) - b) This mode will have slightly lower performance due to the delay in the setting of the Dirty bit. This delay will occur just once per block refill from main memory, when it is necessary to set the dirty bit. Setting the bit requires about ten cycles; but usually the processor will continue to overlap execution of other instructions. Once a block becomes dirty in secondary cache, this mode has no performance effect. - c) In this mode, a miss in secondary cache, due to a store instruction which is not already the oldest in the pipeline, will cause a refill to the "clean exclusive" state. A hit to a shared line will immediately cause an upgrade to "clean exclusive". Thus, bus operations (which are relatively slow) will still begin speculatively. Independent of the DSD mode, R12000 will delay a "cached, non-coherent" load until it is the oldest instruction. This change is implemented because a speculative load accessing an unmapped "xkphys" address as "cached, non-coherent" might bring data into the secondary cache without the proper coherency checks. R12000 is doing no changes to prevent it from speculatively refilling cache lines in shared or clean states except the "xkphys" case described above. ‡‡ Does not include the output buffer delay. $* SysReq^*, SysRel^*, SysCmd[11:0], SysCmdPar, SysAD[63:0], SysADChk[7:0], SysVal^*, SysState[2:0], SysStatePar, SysStateVal^*, SysCorErr^*, SysUncErr^*$ # Errata The description of bits 28:25 of Table 8-1 has been revised. # [MEMO] # 9. Error Protection and Handling This chapter presents the error protection and handling features provided by the R10000 processor. Two types of errors can occur in an R10000 system: - correctable - uncorrectable The following two sections describe them. # 9.1 Correctable Errors Correctable errors consist of: - secondary cache tag array correctable ECC errors - secondary cache data array correctable ECC errors - System interface address/data bus correctable ECC errors When the processor detects a correctable error, the error is automatically corrected, and normal operation continues. Secondary cache array scrubbing is not performed. The processor informs the external agent that a correctable error was detected and then corrected by asserting the **SysCorErr\*** signal for one **SysClk** cycle. #### 9.2 Uncorrectable Errors Uncorrectable errors consist of: - · Primary instruction cache array parity errors - · Primary data cache array parity errors - Secondary cache tag array uncorrectable ECC errors - Secondary cache data array uncorrectable ECC errors - System interface command bus parity errors - System interface address/data bus uncorrectable ECC errors - System interface response bus parity errors #### Errata When the processor detects an uncorrectable error, a Cache Error exception is posted. In general, the <u>detection</u> of an uncorrectable error does not disrupt any ongoing operations. However, the instruction fetch and load/store units never use data which contains an uncorrectable error. To inform the external agent, the processor asserts **SysUncErr\*** for one **SysClk** cycle whenever any of the following uncorrectable errors are detected: - Primary instruction cache tag array parity errors - Primary data cache tag array parity errors - Secondary cache tag array uncorrectable ECC errors - System interface command bus parity errors - System interface address/data bus external address cycle uncorrectable ECC errors - System interface response bus parity errors. The processor informs the external agent that an uncorrectable tag error has been detected by asserting **SysUncErr\*** for one **SysClk** cycle. # 9.3 Propagation of Uncorrectable Errors The processor assists the external agent in limiting the propagation of uncorrectable errors in the following manner: - During external block data response cycles, if the data quality indication on SysCmd(5) is asserted, or if an uncorrectable ECC error is encountered on the system address/data bus while the ECC check indication on SysCmd(0) is asserted, the processor intentionally corrupts the ECC of the corresponding secondary cache quadword after receiving an external ACK completion response. - During processor data cycles, the processor asserts the data quality indication on SysCmd(5) if the data is known to contain uncorrectable errors. The System interface ECC is never intentionally corrupted; the SysCmd(5) bit is used to indicate corrupted data. - If an uncorrectable cache tag error is detected, the processor asserts SysUncErr\* for one SysClk cycle. - An external coherency request that detects a secondary cache tag array uncorrectable error asserts the secondary cache block tag quality indication on **SysState(2)** during the corresponding processor coherency state response. - If an external coherency request requires a processor coherency data response, and a primary data cache tag parity error is encountered during the primary cache interrogation, or a secondary cache tag array uncorrectable error is encountered during the secondary cache interrogation, the processor asserts the data quality indication on **SysCmd(5)** for all doublewords of the corresponding processor coherency data response. # 9.4 Cache Error Exception The processor indicates an uncorrectable error has occurred by asserting a Cache Error exception. The following four internal units detect and report uncorrectable errors: - · instruction cache - · data cache - secondary cache - · System interface Each of these four units maintains a unique local CacheErr register. A Cache Error exception is imprecise; that is, it is not associated with a particular instruction. When any of the four units post a Cache Error exception, completed instructions are graduated before the Cache Error exception is taken. If there are Cache Error exceptions posted from more than one of the units, the exceptions are prioritized in the following order: - 1. instruction cache - 2. data cache - 3. secondary cache - 4. System interface. The corresponding local *CacheErr* register is transferred to the CP0 *CacheErr* register and the CP0 *Status* register *ERL* bit is asserted. Instruction fetching begins from 0xa0000100 or 0xbfc00300, depending on the CP0 *Status* register *BEV* bit. The CP0 *ErrorEPC* register is loaded with the virtual address of the next instruction that has not been graduated, so that execution can resume after the Cache Error exception handler completes. When *ERL*=1, the user address region becomes a 2-Gbyte uncached space mapped directly to the physical addresses. This allows the Cache Error handler to save registers directly to memory without having to use a register to construct the address. The processor does not support nested Cache Error exception handling. While the CP0 *Status* register *ERL* bit is asserted, any subsequent Cache Error exceptions are ignored. However, the detection of additional uncorrectable errors is not inhibited, and additional Cache Error exceptions may be posted. † <sup>†</sup> The hardware does not handle the case of multiple Cache Error exceptions in any special manner; caches are refilled as normal, and data forwarded to the appropriate functional units. ## 9.5 CP0 CacheErr Register EW Bit When a unit detects an uncorrectable error, it records information about the error in its local *CacheErr* register and posts a Cache Error exception. If a subsequent uncorrectable error occurs while waiting for the Cache Error exception to be taken and transfer of the local *CacheErr* register to the CPO *CacheErr* register to complete, the *EW* bit is set in its local *CacheErr* register. Once the Cache Error exception is taken, the *EW* bit in the CPO *CacheErr* register is set and the Cache Error exception handler now determines that a second error has occurred. Once the CPO CacheErr register EW bit is set, it can only be cleared by a reset sequence. # 9.6 CP0 Status Register DE Bit Asserting the CP0 *Status* register *DE* bit suppresses the posting of future Cache Error exceptions. All local *CacheErr* registers are also prevented from being updated. Unlike the R4400 processor architecture, when the *DE* bit is asserted, cache hits are not inhibited when an uncorrectable error is detected. Correctable errors are handled normally when the *DE* bit is set. **NOTE:** Be careful when setting this bit, since it may cause erroneous data and/or instructions to be propagated. ### 9.7 CACHE Instruction Uncorrectable error protection is suppressed for the Index Load Tag, Index Store Tag, Index Load Data, and Index Store Data CACHE instruction variations. These four variations may be used within a Cache Error exception handler to examine the cache tags and data without the occurrence of further uncorrectable errors. # 9.8 Error Protection Schemes Used by R10000 Error protection schemes used in the R10000 processor are: - parity - · sparse encoding - ECC These schemes are described in this section, and listed in Table 9-1. Table 9-1 Error Protection Schemes Used in the R10000 Processor | Error Detection Used | What is Protected | |----------------------|----------------------------------------------------------------------------| | Parity | Primary caches Secondary cache data System interface buses | | Sparse encoding | Primary data cache state mod array | | ECC (SECDED) | Secondary cache tag Secondary cache data System interface address/data bus | ### **Parity** Parity is used to protect the primary caches and various System interface buses. The processor uses both odd and even parity schemes: - in an odd parity scheme, the total number of ones on the protected data and the corresponding parity bit should be odd - in an even parity scheme, the total number of ones on the protected data and the corresponding parity bit should be even. ### **Sparse Encoding** A sparse encoding is used to protect the primary data cache state mod array. In such a scheme, valid encodings are chosen so that altering a single bit creates an invalid encoding. #### **ECC** An error correcting code (ECC) is used to protect the secondary cache tag, the secondary cache data, and the System interface address/data bus. A distinct single-bit error correction and double-bit error detection (SECDED) code is used for each of these three applications. # 9.9 Primary Instruction Cache Error Protection and Handling This section describes error protection and error handling schemes for the primary instruction cache. #### **Error Protection** The primary instruction cache arrays have the following error protection schemes, as listed in Table 9-2. Table 9-2 Primary Instruction Cache Array Error Protection | Array | Width | Error Protection | |-------------|--------|------------------| | Tag Address | 27-bit | Even parity | | Tag State | 1-bit | Even parity | | Data | 36-bit | Even parity | | LRU | 1-bit | None | ## **Error Handling** All primary instruction cache errors are uncorrectable. If an error is detected, the instruction cache unit posts a Cache Error exception and initializes the *D*, *TA*, *TS*, and *PIdx* fields in the local *CacheErr* register (see Chapter 14, *CacheErr Register* (27), for more information). If an error is detected on the tag address or state array, the processor informs the external agent that an uncorrectable tag error was detected by asserting **SysUncErr\*** for one **SysClk** cycle. # 9.10 Primary Data Cache Error Protection and Handling This section describes error protection and error handling schemes for the primary data cache. #### **Error Protection** The primary data cache arrays have the following error protection schemes, as listed in Table 9-3. Table 9-3 Primary Data Cache Array Error Protection | Array | Width | Error Protection | | | | | | |-------------|--------|------------------|--|--|--|--|--| | Tag Address | 28-bit | Even parity | | | | | | | Tag State | 3-bit | Even parity | | | | | | | Tag Mod | 3-bit | Sparse encoding | | | | | | | Data | 8-bit | Even parity | | | | | | | LRU | 1-bit | None | | | | | | # **Error Handling** All primary data cache errors are uncorrectable. If an error is detected, the data cache unit posts a Cache Error exception and initializes the *EE*, *D*, *TA*, *TS*, *TM*, and *PIdx* fields in the local *CacheErr* register (see Chapter 14, *CacheErr Register (27)*, for more information). If an error is detected on the tag address, state, or mod array, the processor informs the external agent that an uncorrectable tag error was detected by asserting **SysUncErr\*** for one **SysClk** cycle. # 9.11 Secondary Cache Error Protection and Handling This section describes error protection and error handling schemes for the secondary cache. #### **Error Protection** The secondary cache arrays have the following error protection schemes, as listed in Table 9-4. Table 9-4 Secondary Cache Array Error Protection | Array | Width | Error Protection | |----------------------------|---------|-------------------------| | Data | 128-bit | 9-bit ECC + even parity | | Tag | 26-bit | 7-bit ECC | | MRU (Way prediction table) | 1-bit | None | ## **Error Handling** This section describes error handling for the data array and the tag array. As shown in Table 9-4, errors are not detected for the way prediction table. #### **Data Array** #### Errata The 128-bit wide secondary cache data array is protected by a 9-bit wide ECC. An even parity bit <u>for the 128 bits of data</u> is used for rapid detection of correctable (single-bit) errors; when a correctable parity error is detected, the data is sent through the data corrector. The parity bit does not have any logical effect on the processor's ability to either detect or correct errors. Whenever the processor writes the secondary cache data array, it drives the proper ECC on SCDataChk(8:0) and even parity on SCDataChk(9). #### **Data Array in Correction Mode** The secondary cache operates in correction mode when the **SCCorEn** mode bit is asserted. Whenever the processor reads the secondary cache data array in correction mode, the data is sent through a data corrector. If a correctable error is detected, in-line correction is automatically made without affecting latency. The processor informs the external agent that a correctable error was detected and corrected by asserting **SysCorErr\*** for one **SysClk** cycle. If an uncorrectable error is detected, the secondary cache unit posts a Cache Error exception and initializes the *D* and *SIdx* fields in the local *CacheErr* register (see Chapter 14, *CacheErr Register* (27), for more information). In correction mode, secondary-to-primary cache refill latency is increased by two **PClk** cycles. Multiple processors, operating in a lock-step fashion, remain synchronized in the presence of secondary cache data array correctable errors. Table 9-5 presents the ECC matrix for the secondary cache data array. Table 9-5 ECC Matrix for Secondary Cache Data Array | Check Bit | 8 | 7654 | 3210 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |---------------------------|-----|------|------|------|------|------|------|--------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------| | Data Bit | | | | | | | 1 | 1111<br>1100 | | | | 9999 | 9988 | 8888 | 8888 | 7777 | 7777 | 7766 | 6666 | 6666 | 5555 | 5555 | 5544 | 4444 | 4444 | 3333 | 3333 | 3322 | 2222 | 2222 | 1111 | 1111 | 11 | | | | | | | | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | 9876 | 5432 | 1098 | 7654 | 3210 | | 54 | 1 | 0000 | 0000 | 1111 | 1111 | 0000 | 0011 | 1111 | 1111 | 1110 | 0011 | 0000 | 0110 | 1111 | 1111 | 0000 | 0111 | 0010 | 0001 | 0110 | 0010 | 1101 | 0000 | 1011 | 0000 | 1101 | 0000 | 0000 | 0000 | 0010 | 0000 | 0010 | 0000 | 1000 | 0000 | | | 1 - | | | | | l | 1 | 0000 | | | | | | | - | | | - | | | | | | | | | | | | | | | | | | | 54 | 0 | 0100 | 0000 | 1000 | 0011 | 1111 | 1111 | 1111 | 1111 | 0000 | 0011 | 0000 | 0101 | 0000 | 0011 | 1111 | 1111 | 0101 | 1100 | 1000 | 0000 | 1000 | 0100 | 0000 | 0100 | 1100 | 0100 | 0011 | 1000 | 0100 | 1000 | 0100 | 1000 | 0101 | 0000 | | Number of ones 53 | | | | | | l | | 1 | | | | | | l | 1 | 1 | | | | | | | l . | | | - 1 | | | | | | | | | | | | - 1 | | | | | l | l . | 0100 | | | | | | l | 1 | 1 | | | | | | | | | | - 1 | | | | | | | | - 1 | 0100 | | <u>-</u> | + | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - 1 | | | | | l | l . | 0001 | | | | | | l | 1 | 1 | | | | | | | | | | - 1 | | | | | | | | - 1 | | | | - 1 | | | | | l | l . | 0000 | | | | | | l | 1 | 1 | | | | | | | | | | - 1 | | | | | | | | - 1 | | | Number of ones per column | 1 | 1111 | 1111 | 3333 | 3355 | 3333 | 3355 | 3333 | 3355 | 3333 | 3355 | 3333 | 3555 | 3333 | 3355 | 3333 | 3555 | 3355 | 5555 | 5555 | 5533 | 5553 | 3333 | 5533 | 3333 | 5553 | 3333 | 5533 | 3333 | 5533 | 3333 | 5533 | 3333 | 5533 | 3333 | #### **Data Array in Noncorrection Mode** When the **SCCorEn** mode bit is negated, the secondary cache operates in noncorrection mode. Whenever the processor reads the secondary cache data array in noncorrection mode, it checks for even parity on **SCDataChk(9)**. If a parity error is detected, it is assumed that a correctable error has occurred, and the secondary cache block is again read through a data corrector. During this re-read, the processor checks the **SCDataChk(8:0)** bus for the proper ECC. If a correctable error is detected, correction is automatically performed in-line. To inform the external agent that a correctable error had been detected and corrected, the processor asserts **SysCorErr\*** for one **SysClk** cycle. If an uncorrectable error is detected, the secondary cache unit posts a Cache Error exception and initializes the *D* and *SIdx* fields in the local *CacheErr* register. Secondary cache data array correctable errors are monitored with Performance Counter 0. #### Tag Array The 26-bit-wide secondary cache tag array is protected by a 7-bit-wide ECC. Table 9-6 presents the ECC matrix for the secondary cache tag array. Check Bit 12 34 56 Data Bit 2222 22 11 111 1111111 5432 10 98 5432 1098 7654 3210 76 |11|0|0100|1000|1000|0001|1111|1000|1000|1000 13|0|1000|0100|0100|0010|1111|1111|0000|0100 11|1|0010|1000|0001|1000|0000|1111|0100|0010 Number of |11|1|0100|0100|0010|0100|1000|0100|1111|0000 ones per row |13|0|1000|0001|1000|1000|0100|0000|1111|1111 |3|3331|3311|3311|3311|3333|3333|3333|3333 Number of ones per column Table 9-6 ECC Matrix for Secondary Cache Tag Array Whenever the processor reads the secondary cache tag array, it checks the **SCTagChk**(6:0) bus for the proper ECC. If a correctable error is detected, correction is automatically performed in-line, without affecting latency. The processor asserts **SysCorErr\*** for one **SysClk** cycle to inform the external agent that a correctable error has been detected and corrected. If an uncorrectable error is detected, the secondary cache unit posts a Cache Error exception and initializes the *TA* and *SIdx* fields in the local *CacheErr* register. The processor asserts **SysUncErr\*** for one **SysClk** cycle to inform the external agent that an uncorrectable tag error has been detected. Whenever the processor writes the secondary cache tag array, it drives the proper ECC on the SCTagChk(6:0) bus. # 9.12 System Interface Error Protection and Handling This section describes error protection and error handling schemes for the System interface. # **Error Protection** The System interface buses have the following error protection schemes, as listed in Table 9-7. Table 9-7 System Interface Bus Error Protection | Bus | Width | Error Protection | |----------|--------|------------------| | SysCmd | 12-bit | Odd parity | | SysAD | 64-bit | 8-bit ECC | | SysState | 3-bit | Odd parity | | SysResp | 5-bit | Odd parity | ### **Error Handling** This section describes error handling on the system command bus, system address/data bus, system state bus, and system response bus. ### SysCmd(11:0) Bus The 12-bit wide system command bus, **SysCmd(11:0)**, is protected by odd parity. Whenever the processor is in master state and it asserts **SysVal\*** to indicate that it is driving valid information on the **SysCmd(11:0)** bus, it also drives odd parity on the **SysCmdPar** signal. #### Errata Whenever the processor is in slave state and an external agent asserts **SysVal\*** to indicate that it is driving valid information on the **SysCmd(11:0)** bus, the processor checks the **SysCmdPar** signal for odd parity. If a parity error is detected, the processor ignores the **SysCmd(11:0)** and **SysAD(63:0)** buses for one **SysClk** cycle. The System interface unit posts a Cache Error exception and sets the SC bit in the local *CacheErr* register. Additionally, the processor informs the external agent by asserting **SysUncErr\*** for one **SysClk** cycle. Caution: By ignoring the **SysCmd(11:0)** and **SysAD(63:0)** buses, the processor to become unsynchronized with other processors or the external agent on the cluster bus. #### SysAD(63:0) Bus The 64-bit wide system address/data bus, **SysAD**(63:0), is protected by an 8-bit-wide ECC. #### **Processor in Master State** Whenever the processor is in master state and it asserts **SysVal\*** to indicate it is driving valid information on the **SysAD(63:0)** bus, it also drives the proper ECC on the **SysADChk(7:0)** bus. #### **Processor in Slave State** Whenever the processor is in slave state, error checking is enabled with the assertion of **SysCmd(0)**, and an external agent asserts **SysVal\*** to indicate it is driving valid information on the **SysAD(63:0)** bus, the processor checks the **SysADChk(7:0)** bus for the proper ECC. #### **Correctable Error Detected** If a correctable error is detected during an external address cycle, or during an external data cycle for a processor read or upgrade request originated by the R10000 processor, correction is automatically performed in-line without affecting latency. The processor asserts **SysCorErr\*** for one **SysClk** cycle to inform the external agent that a correctable error has been detected and corrected. #### **Uncorrectable Error Detected** #### Errata If an uncorrectable error is detected during an external address cycle, the processor ignores the **SysCmd(11:0)** and **SysAD(63:0)** buses for one **SysClk** cycle, and the System interface unit posts a Cache Error exception and sets the *SA* bit in the local *CacheErr* register. Additionally, the processor informs the external agent by asserting **SysUncErr\*** for one **SysClk** cycle. Caution: By ignoring the **SysCmd(11:0)** and **SysAD(63:0)** buses, this processor may become unsynchronized with other processors or the external agent on the cluster bus. If an uncorrectable error is detected or the data quality indication on **SysCmd(5)** is asserted during an external data cycle for a processor read or upgrade request originated by the processor, the R10000 asserts the corresponding incoming buffer uncorrectable error flag. When the processor forwards block data from an incoming buffer entry after receiving an external ACK completion response, the associated incoming buffer uncorrectable error flags are checked, and if any are asserted, the System interface unit posts a single Cache Error exception and initializes the *EE*, *D*, and *SIdx* fields in the local *CacheErr* register. When the processor forwards double/single/partial-word data from an incoming buffer entry after receiving an external ACK completion response, the associated incoming buffer uncorrectable error flag is checked and, if asserted, the System interface unit posts a Bus Error exception. Table 9-8 presents the ECC matrix for the System interface address/data bus. This ECC matrix is identical to that used by the R4X00 System interface. Table 9-8 ECC Matrix for System Interface Address/Data Bus | Check Bit | | | 43 | | 52 | | | | | | | | | | | 70 | | 61 | | |----------------------------|----------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|--------------------------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------| | Data Bit | | 6666<br>3210 | | 5555<br>7654 | | ı | l | | l . | 3333<br>5432 | | l . | | | | | 9876 | 54 | 3210 | | Number of ones per row | 27<br>27<br>27<br>27<br>27<br>27 | 1111<br>0000<br>0000<br>1000<br>0100<br>0010 | 1000<br>1000<br>1010<br>0101<br>1100<br>0100 | 1000<br>1100<br>0100<br>0011<br>0010<br>0011 | 1000<br>1000<br>1010<br>1100<br>0100<br>0101<br>1100<br>0100 | 0100<br>0010<br>0001<br>0000<br>1111<br>1111 | 0000<br>1111<br>1111<br>1000<br>0100<br>0010 | 0000<br>1111<br>0000<br>1000<br>0100<br>0010 | 0000<br>0000<br>1111<br>1000<br>0100<br>0010 | 1111<br>0000<br>1111<br>1000<br>0100<br>0010 | 0100<br>0010<br>0001<br>1111<br>0000<br>1111 | 0100<br>0010<br>0001<br>1111<br>0000<br>0000 | 0100<br>0010<br>0001<br>0000<br>1111<br>0000 | 0100<br>0010<br>0001<br>1111<br>1111<br>0000 | 1111<br>1111<br>0000<br>1000<br>0100<br>0010 | 1100<br>1000<br>1000<br>1100<br>0100<br>0100 | 1100<br>1000<br>1100<br>0001<br>0011<br>0010 | 1010<br>1100<br>1000<br>0100<br>0100<br>0101 | 0100<br>0010<br>0001<br>0000<br>0000<br>1111 | | Number of on<br>per column | es | 3333 | 5511 | 3333 | 5511 | 3333 | 3333 | 3333 | 3333 | 3333 | 3333 | 3333 | 3333 | 3333 | 3333 | 5511 | 3333 | 5511 | 3333 | #### SysState(2:0) Bus The 3-bit wide system state bus, **SysState(2:0)**, is protected by odd parity. The processor drives odd parity on the **SysStatePar** signal. #### SysResp(4:0) Bus The 5-bit wide system response bus, **SysResp(4:0)**, is protected by odd parity. #### Errata Whenever an external agent asserts **SysRespVal\*** to indicate it is driving valid information on the **SysResp(4:0)** bus, the processor checks the **SysRespPar** signal for odd parity. If a parity error is detected, the processor ignores the **SysResp(4:0)** bus for one **SysClk** cycle. The System interface unit posts a Cache Error exception and sets the *SR* bit in the local *CacheErr* register. Additionally, the processor informs the external agent by asserting **SysUncErr\*** for one **SysClk** cycle. Caution: If the processor ignores the **SysResp(4:0)** bus, it may become unsynchronized with other processors or the external agent on the cluster bus. Also, the processor will "hang" if a parity error is detected on the **SysResp[4:0]** bus during an external completion response cycle for a processor double/single/partial-word read request originated by the processor. The external agent may initiate a Soft Reset sequence to obtain the contents of the *CacheErr* register, and the *CacheErr* register will indicate a System interface uncorrectable system response bus error. # **Protocol Observation** The processor continuously observes the protocol on the System interface. Table 9-9 presents the supported protocol observations and the associated error handling sequence. Table 9-9 Protocol Observation | Protocol Observation | Error Handling | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------| | External response data cycle with an unexpected request number during an external block data response for a processor block read or upgrade request originated by the processor. | Ignore the external response data cycle | | External block data response specifying a <i>Reserved</i> cache block state for a processor block read or upgrade request originated by the processor. | Override the cache block state to CleanExclusive | | External block data response specifying a <i>Shared</i> cache block state for a processor coherent block read exclusive or upgrade request originated by the processor. | Override the cache block state to CleanExclusive | | External completion response specifying a <i>Reserved</i> completion indication. | Ignore the external completion response | | External ACK completion response for a processor read request originated by the processor that has not received an external data response. | Override the external ACK completion response to a NACK | # [MEMO] | | <i>10</i> . | JTAG Inter | face ( | Operation | |--|-------------|------------|--------|-----------| |--|-------------|------------|--------|-----------| The JTAG interface is implemented according to the standard IEEE 1149.1 test access port protocol specifications. ### Errata The JTAG interface accesses the JTAG controller and instruction register as well as a boundary scan register. The JTAG operation does not require **DCOk** to be asserted or **SysClk** to be running; however, if **DCOk** is asserted the **SysClk** must run at the specified minimum frequency or the core logic may be damaged. ### 10.1 Test Access Port (TAP) The test access port (TAP) consists of four interface signals. These signals are used to control the serial loading and unloading of instructions and test data, as well as to execute tests. The TAP consists of the following signals: JTDI: Serial data input (Input signal) JTDO: Serial data output (Output signal) JTMS: Mode select (Input signal) JTCK: Clock (Input signal) The timing and the relationship of the TAP signals follows the IEEE 1149.1 standard protocol. ### **TAP Controller (Input)** The R10000 processor implements the 16-state TAP controller specified by the IEEE 1149.1 standard in the following manner: - The JTMS signal operates the state machine synchronized by the JTCK signal. - The TAP controller is reset by keeping the JTMS signal asserted through five consecutive edges of JTCK. This reset condition sets the reset state of the controller. The TAP controller is also reset by asserting SysReset\*. This pin must not be asserted while using the boundary scan register. # 10.2 Instruction Register The JTAG instruction register is four bits wide, permitting a total of 16 instructions to control the selection of the bypass register, the boundary scan register, and other data registers. The encoding of the instruction register is given in Table 10-1: Table 10-1 JTAG Instruction Register Encoding | MSBLSB | Selected Data Register | |--------------------|--------------------------------------------| | 0000<br>0001 | Boundary Scan Register<br>Sample - Preload | | 0010<br>to<br>1110 | Data Register (not used) | | 1111 | Bypass Register | The 0001 value is provided to represent sample-preload, but also selects the boundary scan register. During a reset of the TAP controller, the value 1111 is loaded into the parallel output of the instruction register, thus selecting the bypass register as the default. During the Shift-IR state of the TAP controller, data is shifted serially into the instruction register from **JTDI**, and the LSB of the instruction register is shifted out onto **JTDO**. During the Update-IR state, the current state of the instruction register is shifted to its parallel output for decoding. # 10.3 Bypass Register The bypass register is 1 bit wide. When the bypass register is selected and the TAP controller is in the Shift-DR state, data on **JTDI** is shifted into the bypass register and the output of the bypass register is shifted out onto **JTDO**. # 10.4 Boundary Scan Register The bypass register is 1 bit wide. The boundary scan data register is selected by loading 0000 into the instruction register. The Shift-DR, Update-DR, and Capture-DR states of the TAP controller are used to operate the boundary scan register according to the IEEE 1149.1 standard specifications. The boundary scan register provides serial access to each of the processor interface pins, as shown in Figure 10-1. Hence, the boundary scan register can be used to load and observe specific logic values on the processor pins. Figure 10-1 JTAG Boundary Scan Cells The main application of the boundary scan register is board-level interconnect testing. The use of the boundary scan register for applying data to and capturing data from the internal microprocessor circuitry is not supported. The boundary scan register list for rev 1.2 of the fab is given in Table 10-2. The **TriState** signal will be eliminated from the BSR in rev 2.0 of the fab, and beyond. An additional bit is provided in the boundary scan register to control the direction of bidirectional pins. As it is loaded through JTDI, this bit is the first bit in the boundary scan chain. The logic value of this bit is latched during the Update-DR state, and sets the direction of all bidirectional pins as follows: | Value | Direction | |-------|-----------| | 0 | Input | | 1 | Output | The value is set to 0 during reset, setting all bidirectional pins to *input* prior to any boundary scan operations. Table 10-2 Boundary Scan Register Pinlist, rev 1.2 | | Signal | | Signal | | Signal | | Signal | | Signal | | Signal | |------|--------------|------|--------------|------|--------------|------|--------------|------|-----------------------|------|--------------| | 1. | SCDataChk[1] | 2. | SCData[63] | 3. | SCData[62] | 4. | SCData[61] | 5. | SCData[60] | 6. | SCData[59] | | 7. | SCData[58] | 8. | SCData[57] | 9. | SCData[56] | 10. | SCData[55] | 11. | SCData[54] | 12. | SCData[53] | | 13. | SCData[52] | 14. | SCData[51] | 15. | SCData[50] | 16. | SCData[49] | 17. | SCData[48] | 18. | SCData[47] | | 19. | SCData[46] | 20. | SCData[45] | 21. | SCData[44] | 22. | SCData[43] | 23. | SCData[42] | 24. | SCData[41] | | 25. | SCData[40] | 26. | SCData[39] | 27. | SCData[38] | 28. | SCData[37] | 29. | SCData[36] | 30. | SCData[35] | | 31. | SCData[34] | 32. | SCData[33] | 33. | SCData[32] | 34. | SysAD[0] | 35. | SysAD[1] | 36. | SysAD[2] | | 37. | SysAD[3] | 38. | SysAD[4] | 39. | SysAD[5] | 40. | SysAD[6] | 41. | SysAD[7] | 42. | SysAD[8] | | 43. | SysAD[9] | 44. | SysAD[10] | 45. | SysAD[11] | 46. | SysAD[12] | 47. | SysAD[13] | 48. | SysAD[14] | | 49. | SysAD[15] | 50. | SCData[0] | 51. | SCData[1] | 52. | SCData[2] | 53. | SCData[3] | 54. | SCData[4] | | 55. | SCData[5] | 56. | SCData[6] | 57. | SCData[7] | 58. | SCData[8] | 59. | SCData[9] | 60. | SCData[10] | | 61. | SCData[11] | 62. | SCData[12] | 63. | SCData[13] | 64. | SCData[14] | 65. | SCData[15] | 66. | SCData[16] | | 67. | SCData[17] | 68. | SCData[18] | 69. | SCData[19] | 70. | SCData[20] | 71. | SCData[21] | 72. | SCData[22] | | 73. | SCData[23] | 74. | SCData[24] | 75. | SCData[25] | 76. | SCData[26] | 77. | SCData[27] | 78. | SCData[28] | | 79. | SCData[29] | 80. | SCData[30] | 81. | SCData[31] | 82. | SCDataChk[0] | 83. | SCAAddr[18] | 84. | SCAAddr[17] | | 85. | SCAAddr[16] | 86. | SCAAddr[15] | 87. | SCAAddr[14] | 88. | SCAAddr[13] | 89. | SCAAddr[12] | 90. | SCAAddr[11] | | 91. | SCAAddr[10] | 92. | SCAAddr[9] | 93. | SCDataChk[2] | 94. | SCDataChk[4] | 95. | SCData[64] | 96. | SCData[65] | | 97. | SCData[66] | 98. | SCData[67] | 99. | SCData[68] | 100. | SCData[69] | 101. | SCData[70] | 102. | SCData[71] | | 103. | SCDataChk[9] | 104. | SysCyc* | 105. | SysAD[32] | 106. | SysAD[33] | 107. | SysAD[34] | 108. | SysAD[35] | | 109. | SysAD[36] | 110. | SysAD[37] | 111. | SysAD[38] | 112. | SysAD[39] | 113. | SysAD[40] | 114. | SysAD[41] | | 115. | SysAD[42] | 116. | SysAD[43] | 117. | SysAD[44] | 118. | SysAD[45] | 119. | SysAD[46] | 120. | SysAD[47] | | 121. | SCData[72] | 122. | SCData[73] | 123. | SCData[74] | 124. | SCData[75] | 125. | SCData[76] | 126. | SCData[77] | | 127. | SCData[78] | 128. | SCData[79] | 129. | SCAAddr[0] | 130. | SCAAddr[1] | 131. | SCAAddr[2] | 132. | SCAAddr[3] | | 133. | SCAAddr[4] | 134. | SCAAddr[5] | 135. | SCAAddr[6] | 136. | SCAAddr[7] | 137. | SCAAddr[8] | 138. | SCADWay | | 139. | SCADCS* | 140. | SCADOE* | 141. | SCADWr* | 142. | SCData[80] | 143. | SCData[81] | 144. | SCData[82] | | 145. | SCData[83] | 146. | SCData[84] | 147. | SCData[85] | 148. | SCData[86] | 149. | SCData[87] | 150. | SCData[88] | | 151. | SCData[89] | 152. | SCData[90] | 153. | SCData[91] | 154. | SCData[92] | 155. | SCData[93] | 156. | SCData[94] | | 157. | SCData[95] | 158. | SCDataChk[6] | 159. | SCDataChk[8] | 160. | Spare1 | 161. | SCTCS* | 162. | SCTOE* | | 163. | SCTWr* | 164. | SCTag[25] | 165. | SCTag[24] | 166. | SCTag[23] | 167. | SCTag[22] | 168. | SCTag[21] | | 169. | SCTag[20] | 170. | SCTag[19] | 171. | SCTag[18] | 172. | SCTag[17] | 173. | SCTag[16] | 174. | SCTag[15] | | 175. | SCTag[14] | | SCTag[13] | | SCTag[12] | _ | SCTag[11] | _ | SCTag[10] | 180. | SCTag[9] | | 181. | SCTag[8] | | SCTag[7] | | SCTag[6] | 184. | SCTag[5] | 185. | SCTag[4] | 186. | SCTag[3] | | 187. | SCTag[2] | | SCTag[1] | | SCTag[0] | 190. | SCTagLSBAddr | | TriState <sup>‡</sup> | 192. | SCTWay | | 193. | SCTagChk[6] | 194. | SCTagChk[5] | 195. | SCTagChk[4] | 196. | SCTagChk[3] | 197. | SCTagChk[2] | 198. | SCTagChk[1] | | 199. | SCTagChk[0] | 200. | SysCmd[0] | 201. | SysCmd[1] | 202. | | 203. | SysCmd[3] | 204. | SysCmd[4] | | 205. | SysCmd[5] | 206. | SysCmd[6] | 207. | SysCmd[7] | 208. | SysCmd[8] | 209. | SysCmd[9] | 210. | SysCmd[10] | | 211. | SysCmd[11] | 212. | SysCmdPar | 213. | SysVal* | 214. | SysReq* | 215. | SysRel* | 216. | SysGnt* | | | SysReset* | | SysRespVal* | | SysRespPar | | SysResp[4] | | SysResp[3] | | SysResp[2] | | | SysResp[1] | | SysResp[0] | _ | SysGblPerf* | | SysRdRdy* | | SysWrRdy* | | SysStateVal* | | | SysStatePar | | SysState[2] | | SysState[1] | | SysState[0] | _ | SysCorErr* | | SysUncErr* | | | SysNMI* | | SCDataChk[7] | | SCDataChk[5] | | SCData[127] | | SCData[126] | | SCData[125] | | | SCData[124] | | SCData[123] | | SCData[122] | | SCData[121] | | SCData[120] | | SCData[119] | | - | SCData[118] | | SCData[117] | | SCData[116] | | SCData[115] | _ | SCData[114] | | SCData[113] | <sup>‡</sup> Will be eliminated after rev. 1.2. Table 10-2 (cont.) Boundary Scan Register Pinlist, rev 1.2 | Signal | Signal | Signal | Signal | Signal | Signal | |------------------|-------------------|-------------------|-------------------|----------------------------|-------------------| | 253. SCData[112] | 254. SCBDWr* | 255. SCBDOE* | 256. SCBDCS* | 257. SCBDWay | 258. SCBAddr[8] | | 259. SCBAddr[7] | 260. SCBAddr[6] | 261. SCBAddr[5] | 262. SCBAddr[4] | 263. SCBAddr[3] | 264. SCBAddr[2] | | 265. SCBAddr[1] | 266. SCBAddr[0] | 267. SCData[111] | 268. SCData[110] | 269. SCData[109] | 270. SCData[108] | | 271. SCTag[8] | 272. SCTag[7] | 273. SCTag[6] | 274. SCTag[5] | 275. SCTag[4] | 276. SCTag[3] | | 277. SCTag[2] | 278. SCTag[1] | 279. SCTag[0] | 280. SCTagLSBAddr | 281. TriState <sup>‡</sup> | 282. SCTWay | | 283. SCTagChk[6] | 284. SCTagChk[5] | 285. SCTagChk[4] | 286. SCTagChk[3] | 287. SCTagChk[2] | 288. SCTagChk[1] | | 289. SCTagChk[0] | 290. SysCmd[0] | 291. SysCmd[1] | 292. SysCmd[2] | 293. SysCmd[3] | 294. SysCmd[4] | | 295. SysCmd[5] | 296. SysCmd[6] | 297. SysCmd[7] | 298. SysCmd[8] | 299. SysCmd[9] | 300. SysCmd[10] | | 301. SysCmd[11] | 302. SysCmdPar | 303. SysVal* | 304. SysReq* | 305. SysRel* | 306. SysGnt* | | 307. SysReset* | 308. SysRespVal* | 309. SysRespPar | 310. SysResp[4] | 311. SysResp[3] | 312. SysResp[2] | | 313. SysResp[1] | 314. SysResp[0] | 315. SysGblPerf* | 316. SysRdRdy* | 317. SysWrRdy* | 318. SysStateVal* | | 319. SysStatePar | 320. SysState[2] | 321. SysState[1] | 322. SysState[0] | 323. SysCorErr* | 324. SysUncErr* | | 325. SysNMI* | 326. SCDataChk[7] | 327. SCDataChk[5] | 328. SCData[127] | 329. SCData[126] | 330. SCData[125] | | 331. SCData[124] | 332. SCData[123] | 333. SCData[122] | 334. SCData[121] | 335. SCData[120] | 336. SCData[119] | | 337. SCData[118] | 338. SCData[117] | 339. SCData[116] | 340. SCData[115] | 341. SCData[114] | 342. SCData[113] | | 343. SCData[112] | 344. SCBDWr* | 345. SCBDOE* | 346. SCBDCS* | 347. SCBDWay | 348. SCBAddr[8] | | 349. SCBAddr[7] | 350. SCBAddr[6] | 351. SCBAddr[5] | 352. SCBAddr[4] | 353. SCBAddr[3] | 354. SCBAddr[2] | | 355. SCBAddr[1] | 356. SCBAddr[0] | 357. SCData[111] | 358. SCData[110] | 359. SCData[109] | 360. SCData[108] | | 361. SCData[107] | 362. SCData[106] | 363. SCData[105] | 364. SCData[104] | 365. SysAD[63] | 366. SysAD[62] | | 367. SysAD[61] | 368. SysAD[60] | 369. SysAD[59] | 370. SysAD[58] | 371. SysAD[57] | 372. SysAD[56] | | 373. SysAD[55] | 374. SysAD[54] | 375. SysAD[53] | 376. SysAD[52] | 377. SysAD[51] | 378. SysAD[50] | | 379. SysAD[49] | 380. SysAD[48] | 381. SysADChk[7] | 382. SysADChk[6] | 383. SysADChk[5] | 384. SysADChk[4] | | 385. SysADChk[3] | 386. SysADChk[2] | 387. SysADChk[1] | 388. SysADChk[0] | 389. SysAD[31] | 390. SysAD[30] | | 391. SysAD[29] | 392. SysAD[28] | 393. SysAD[27] | 394. SysAD[26] | 395. SysAD[25] | 396. SysAD[24] | | 397. SysAD[23] | 398. SysAD[22] | 399. SysAD[21] | 400. SysAD[20] | 401. SysAD[19] | 402. SysAD[18] | | 403. SysAD[17] | 404. SysAD[16] | 405. SCData[103] | 406. SCData[102] | 407. SCData[101] | 408. SCData[100] | | 409. SCData[99] | 410. SCData[98] | 411. SCData[97] | 412. SCData[96] | 413. SCDataChk[3] | 414. SCBAddr[9] | | 415. SCBAddr[10] | 416. SCBAddr[11] | 417. SCBAddr[12] | 418. SCBAddr[13] | 419. SCBAddr[14] | 420. SCBAddr[15] | | 421. SCBAddr[16] | 422. SCBAddr[17] | 423. SCBAddr[18] | | | | $<sup>\</sup>ddagger$ Will be eliminated after rev. 1.2. # 11. Electrical Specifications This chapter contains the following electrical and signal information about the R10000 processor: - DC electrical specification - AC electrical specification - signal integrity issues # 11.1 DC Electrical Specification This section describes the following DC electrical characteristics of the R10000 processor: - DC power supply levels - DCOk and power supply sequencing - maximum operating conditions - input signal level sensing - mode definitions - Vref[SC,Sys] - · unused inputs - DC input/output specifications ### **DC Power Supply Levels** The processor core is powered by a +3.3V (+/-5%) supply. The processor output drivers are powered from a separate supply, dependent on the output logic family used in the application system: - For JEDEC-compatible HSTL operation, the nominal value for **VccQSC** and **VccQSys** are in the 1.5V (+/- 100 millivolt) range. - For CMOS/TTL compatible systems, **VccQSC** and **VccQSys** can be externally tied to the same **Vcc** as the core power supply. **NOTE:** The I/O pins of the R10000 processor may not be driven higher than 4.0V by any device in the system until the **Vcc** and **VccQ** inputs are stable. ### DCOk and Power Supply Sequencing The following guidelines are designed to protect the processor from damage or latch-up: - With respect to the **Vcc** (3.3V) (supply to the core), **VccQ[SC,Sys]** (either 1.5V or 3.3V) must not be driven more than a diode threshold voltage. - Vref should not go higher than VccQ[SC,Sys]. Generally, Vref is derived from VccQ through a resistor divider, and therefore cannot rise above VccQ. - The power to termination resistors must not arrive before **Vcc** and **VccQ[SC,Sys]** arrive at the processor. - None of the supplies can float or be driven negative. One method of protecting the processor from excessive input voltage is to sequence the power supplies for the entire system, ensuring that the power to the processor is stable before any components drive signals to the processor. Another method to tristate all external drivers to the processor with the **DCOk** pin, until the processor has stabilized. **NOTE:** The input voltage required for the **DCOk** is 3.3V in either the CMOS/TTL or the HSTL configuration. Both **DCOk** pins must be tied together externally. ### **Maximum Operating Conditions** Table 11-1 shows the maximum conditions under which the processor operates. Table 11-1 Maximum Operating Conditions | Parameter | Symbol | Value | | |------------------------|--------------------------------|------------------------|--| | Core Supply Voltage | Vcc | 3.6 volts | | | Output Supply Voltage | VccQ (HTSL)<br>VccQ (CMOS/TTL) | 1.6 volts<br>3.6 volts | | | Case Temperature | <u>Tc</u> | 20° to 85° C | | | Applied Input Voltage: | Vin | -0.5 to Vcc+0.5 volts | | | Maximum Power | P <sub>R10000</sub> | 30 watts | | | PClk Frequency | f | 200 MHz | | #### Errata Revised "Case Temperature" in Table 11-1, above. ### **Input Signal Level Sensing** The processor input signals are all received by CMOS receivers that are compatible with either HSTL or CMOS/TTL logic levels. The I/O levels are defined by **VrefSC** and **VrefSys**, according to the appropriate logic family (HSTL or CMOS/TTL). #### **Mode Definitions** The mode bit, **ODrainSys**, is provided to select the characteristics of the pad ring. When asserted, this mode bit tristates the PMOS pullup devices to select system interface output drivers. This mode is included to allow for multiprocessor systems to use a GTL-like open drain configuration with external load/termination resistors providing logic high levels. ### Vref[SC,Sys] The **Vref[SC,Sys]** pins must be connected to a stable reference voltage source. This reference point is used in the input sense amp current mirror to provide the switch point for the logic levels. Inside the processor, the **Vref[SC,Sys]** signals have a large capacitance, and a low-pass filter at each receiver. The **DCOk** pins must not be asserted until there has been sufficient time for **Vref[SC,Sys]** to stabilize at each of the receivers inside the processor. A typical **Vref[SC,Sys]** generator is two resistors which provide the **Vref[SC,Sys]** level associated with the chosen logic family, and a 10µF tantalum capacitor connected to the processor's **Vref[SC,Sys]** pin to provide stability. ### **DC Input/Output Specifications** All processor output drivers are CMOS push-pull, and the signals swing between **VccQ** and **Vss**. In open drain mode, the gates of the PMOS pullup devices are disabled. Input-only pins include a disabled output buffer for implicit ESD protection. Tables 11-2 and 11-3 describe the DC characteristics of the I/O signals for the HSTL and CMOS/TLL configurations. **NOTE:** As the JEDEC Standard 8-x evolves, the HSTL specifications will also change, and the processor will remain compliant with these standards. Table 11-2 DC Characteristics for HSTL Configuration | Symbol | Parameter | Minimum | Maximum | Units | Conditions | |--------|---------------------|---------------|----------------|-------|------------| | VOH | Output high voltage | VccQ/2 + 0.3V | N/A | V | N/A | | VOL | Output low voltage | | VccQ /2 - 0.3V | V | N/A | | VIH | Input high voltage | Vref + 100mV | Vcc + 300mV | V | N/A | | VIL | Input low voltage | -300mV | Vref - 100mV | V | N/A | | ILeak | I/O leakage current | -TBD | TBD | mA | N/A | Table 11-3 DC Characteristics for CMOS/TTL Configuration | Symbol | Parameter | Minimum | Maximum | Units | Conditions | |--------|---------------------|---------|---------|-------|------------------| | VOH | Output high voltage | 2.4 | N/A | V | Vcc = VccQ = min | | VOL | Output low voltage | N/A | 0.4 | V | Vcc = VccQ = min | | VIH | Input high voltage | 2.0 | N/A | V | N/A | | VIL | Input low voltage | N/A | 0.8 | V | N/A | | ILeak | I/O leakage current | -TBD | TBD | mA | N/A | ### Errata All the JTAG output drivers are push-pull CMOS/TTL compatible, with Vcc (core) as the supply (independent of VccQ[SC, Sys]). All the JTAG inputs require full CMOS swings, as given by the DC specifications in the Table 11-3. # 11.2 AC Electrical Specification This section describes the following AC electrical characteristics of the R10000 processor: - · maximum operating conditions - test specification - secondary cache and system interface timing - enable/output delay, setup, hold time - asynchronous inputs ### **Maximum Operating Conditions** The R10000 chip clamps signals that overshoot the DC limits established for input logic levels. These limits are published as part of the fabrication process characterization. The R10000 chip provides silicon diode clamps on all signal pins. ### **Test Specification** HSTL test conditions are based on the JEDEC Standard conditions. ### **Secondary Cache and System Interface Timing** Timing measurements are referenced from the mid-swing point of the input signal to the crossing point of the **SysClk** and **SysClk\*** input clocks. All input signals maintain a 1 V/ns edge rate in the 20% to 80% range of the input signal swing. # **Enable/Output Delay, Setup, Hold Time** Table 11-4 lists the delay, setup, and hold times for the HTSL version of the processor. Table 11-4 AC Characteristics for HSTL Configuration | HSTL | Minimum | Maximum | |--------------|---------|---------| | Output delay | 0.5 ns | 1.5 ns | | Setup | 1.0 ns | | | Hold | 1.0 ns | | Table 11-5 lists the delay, setup, and hold times for the CMOS/TTL version of the processor. Table 11-5 AC Characteristics for CMOS/TTL Configuration | LVCMOS | Minimum | Maximum | |--------------|---------|---------| | Output delay | 0.5 ns | 2.0 ns | | Setup | 1.0 ns | | | Hold | 1.0 ns | | # **Asynchronous Inputs** The **SysReset\*** input can be asserted asynchronously to **SysClk**, but must be negated synchronously with **SysClk**, adhering to the AC electrical specifications listed above. # 11.3 Signal Integrity Issues In this section, the following signal integrity considerations are described for a R10000-based system: - Power Supply Regulation - Decoupling Capacitance - Reference Voltage - Maximum Input Voltage Levels - Output I-V Curves - Switching and Slew Rate Characteristics #### **Reference Voltage** Most input pins on the processor use a current-mirror sense amp with **Vref[SC,Sys]** supplied to the negative input to provide a single rail input receiver. The following input pins are exceptions to this rule: - SysClk and SysClk\* - DCOk All other inputs require a stable **Vref[SC,Sys]** supply for proper operation. The **Vref[SC,Sys]** source can be a simple voltage divider; the actual impedance of this source is not critical, since the **Vref[SC,Sys]** signals are sampled through a low-pass filter on the processor. ### **Power Supply Regulation** The system must provide connections to all of the Vcc, VccQ[SC,Sys], and Vss pins on the processor package. The power supply voltages must be held to 5% tolerance at the processor pin connection. #### **Maximum Input Voltage Levels** Maximum excursion of the input signal due to ringing may reach Vcc+0.5V or Vss-0.5V for periods of less than 10% of the total driven waveform period. The R10000 processor includes overshoot clamps by silicon diode protection which limit the overshoot to approximately 500 mV beyond each supply rail. ### **Decoupling Capacitance** #### Errata In order to regulate the transient current requirements of a R10000-based system, it is suggested that explicit decoupling capacitors be used. The R10000 package allows for the following capacitors: - eight Vcc-Vss - five VccQSC-Vss - four VccQSys-Vss The package also provides six connections for the PLL power supplies and loop capacitors. **VccPa** (**VccPd**) is connected to **VssPa** (**VssPd**) through three decoupling capacitors, as shown in Figures 11-1 and 11-2. The $0.1\mu F$ and 1 nF low-inductance capacitors are placed in parallel with the $10~\mu F$ capacitor, as close to the R10000 package as possible. $^{\dagger}$ Figure 11-1 Decoupling VccPa and VssPa Figure 11-2 Decoupling VccPd and VssPd <sup>†</sup> Decoupling between VccPa and VssPa is far more important than decoupling between VccPd and VssPd, if both are not possible. # [MEMO] # 12. Packaging The R10000 microprocessor is presently supplied in one standard package configuration: • a single-chip 599 ceramic LGA (Land Grid Array) MIPS Licensees are encouraged to develop package solutions with MIPS Semiconductor Partners to meet specific requirements. # 12.1 R10000 Single-Chip Package, 599CLGA The standard single-chip R10000 package is a **599CLGA** (ceramic land grid array), as shown in Figure 12-1. The 599CLGA package minimizes output switching noise by reducing the inductance of the power and ground paths leading into the package. Much of the decrease in power/ground inductance is accomplished by shortening the wire bonds running from the die pads to the package inner leads. The 599CLGA is designed with its cavity-side down, and the die is connected directly to a thermal slug. #### **Mechanical Characteristics** The 599CLGA has lands on a straight 1.27mm (.050inch) grid. It is a cavity-down, multi-layer ceramic package with an integral copper-tungsten slug, and is designed for use with a socket. Preliminary information suggests that the 599CLGA can withstand a force of 100 kilograms applied to the CuW slug, without damage, and a PWB assembly should insure that this force is not exceeded. Drawings for a reference LGA-PWB assembly are included in this chapter. #### **Electrical Characteristics** $T_{pd}$ Propagation delay The 599CLGA uses multilayer construction, incorporating stripline configuration for signals. Multiple planes distribute power and ground throughout the package and provide built-in distributed bypass/coupling capacitance between the primary power supplies: **Vcc**, **VccQSC**, **VccQSys**, and **Vss**. Pads are present on the package body for attaching chip-capacitors to provide additional bypass capacitance between the primary power supplies and the PLL power supply (**VccPa** and **VssPa**), and to provide an additional PLL loop filter capacitor (**PLLRC**). Chip-capacitors on the R10000 are assembled by the chip manufacturer. Detailed electrical package characteristics will be provided by the MIPS Semiconductor Partners as they become available. The data in Table 12-1 is provided as an estimate of the package parasitics. These estimates include the effects of bondwires, package traces and vias, but not the sockets. **Parameter Description** Minimum **Typical** Maximum $L_{sig}$ Effective signal inductance 4.0nH 8.4nH $M_{sig}$ Signal-to-signal mutual inductance 1.3nH $C_{sig}$ Signal loading capacitance 3.0pF 5.6pF 0.5pF $C_{\rm m}$ Signal-to-signal mutual capacitance $R_{sig}$ Signal resistance $400 \text{m}\Omega$ $1300 \text{m}\Omega$ $Z_0$ Characteristic impedance $40\Omega$ Table 12-1 R10000 599CLGA Electrical Characteristics The copper-tungsten slug (provided for thermal performance) is hard-connected to $\mathbf{Vss}$ to minimize EMI radiation from the package. 200ps #### **Thermal Characteristics** The 599CLGA incorporates a copper-tungsten slug to provide an efficient thermal path from the processor to the heatsink. The thermal analysis listed in Table 12-2 gives a preliminary indication of heatsink requirements for the 599CLGA. Table 12-2 R10000 599CLGA Thermal Characteristics - Preliminary | Parameter | Description | Value | |------------------------------|---------------------------------------------|----------| | T <sub>c</sub> | Maximum case temperature | 85° C | | $T_a^{\ddagger}$ | Maximum ambient temperature | 40° C | | P <sub>R10000</sub> | Maximum power dissipation | 30 watts | | T <sub>ja</sub> | Minimum temperature differential | 45° C | | Θ <sub>ca</sub> <sup>‡</sup> | Required case to ambient thermal resistance | 1.5° C/W | $<sup>\ \ ^{\</sup>downarrow}$ $\ Q_{ca}$ is used as an example to calculate the ambient temperature, $T_{c},$ needed. #### Errata #### Revised Table 12-2. System designers must take care, especially in desktop applications, to ensure sufficient airflow and heat-dissipation surface area to meet the required <u>case-to-ambient thermal</u> resistance, $\Theta_{ca}$ . The thermal interface between the package and heatsink is very important. Typically, grease or compliant material is inserted between the package and heatsink to increase the contact area between their surfaces. ### **Assembly Drawings and Pinout List** The following pages contain a pinout list (Table 12-3), and drawings of an example R10000 LGA-PWB assembly, including details of the PWB, heatsink, and bolster plate. Actual hardware specifications are dependent on the user. An assembly drawing of the 599LGA is also shown in Figure 12-2. Note that hardware specifications given in this drawing will require modifications to accommodate the actual dimensions of the socket, PWB, heatsink, bolster, etc. Figure 12-1 R10000 599CLGA Package Outline # 599CLGA Pinout Table 12-3 599CLGA Pinout | Signal | Location | Signal | Location | Signal | Location | |-------------|----------|-------------|----------|-------------|----------| | DCOk | AF2 | DCOk | B22 | JTCK | W 33 | | JTDI | W35 | JTDO | Y31 | JTMS | AA34 | | PLLDis | E24 | PLLRC | A25 | PLLSpare1 | C 21 | | PLLSpare2 | A21 | PLLSpare3 | D21 | PLLSpare4 | B 21 | | SCAAddr<0> | E13 | SCAAddr<1> | A11 | SCAAddr<2> | D 12 | | SCAAddr<3> | C11 | SCAAddr<4> | E12 | SCAAddr<5> | B10 | | SCAAddr<6> | D11 | SCAAddr<7> | C10 | SCAAddr<8> | A9 | | SCAAddr<9> | В30 | SCAAddr<10> | E29 | SCAAddr<11> | A31 | | SCAAddr<12> | D30 | SCAAddr<13> | C31 | SCAAddr<14> | E 30 | | SCAAddr<15> | В32 | SCAAddr<16> | D31 | SCAAddr<17> | В 33 | | SCAAddr<18> | C32 | SCADCS* | В9 | SCADOE* | D9 | | SCADWay | E10 | SCADWr* | A8 | SCBAddr<0> | AL 13 | | SCBAddr<1> | AP12 | SCBAddr<2> | AM12 | SCBAddr<3> | AR 11 | | SCBAddr<4> | AL12 | SCBAddr<5> | AN11 | SCBAddr<6> | AM 11 | | SCBAddr<7> | AP10 | SCBAddr<8> | AL11 | SCBAddr<9> | AL 29 | | SCBAddr<10> | AP30 | SCBAddr<11> | AM30 | SCBAddr<12> | AR 31 | | SCBAddr<13> | AL30 | SCBAddr<14> | AN31 | SCBAddr<15> | AM 31 | | SCBAddr<16> | AP32 | SCBAddr<17> | AP33 | SCBAddr<18> | AN 32 | | SCBDCS* | AN 10 | SCBDOE* | AL10 | SCBDWay | AR9 | | SCBDWr* | AP9 | SCClk<0> | B13 | SCClk<1> | A26 | | SCClk<2> | AA31 | SCClk<3> | AM15 | SCClk<4> | W 1 | | SCClk<5> | E1 | SCClk<0>* | E15 | SCClk<1>* | B26 | | SCClk<2>* | AB33 | SCClk<3>* | AR14 | SCClk<4>* | W 4 | | SCClk<5>* | F4 | SCData<0> | R31 | SCData<1> | N34 | | SCData<2> | P33 | SCData<3> | M35 | SCData<4> | P32 | | SCData<5> | M34 | SCData<6> | N33 | SCData<7> | L35 | | SCData<8> | N31 | SCData<9> | L33 | SCData<10> | M 32 | | SCData<11> | K34 | SCData<12> | M31 | SCData<13> | J 35 | | SCData<14> | L32 | SCData<15> | J34 | SCData<16> | K33 | | SCData<17> | Н35 | SCData<18> | K31 | SCData<19> | G34 | | SCData<20> | J32 | SCData<21> | G33 | SCData<22> | J 31 | | SCData<23> | F35 | SCData<24> | Н32 | SCData<25> | F34 | | SCData<26> | G31 | SCData<27> | E35 | SCData<28> | F 32 | | SCData<29> | D34 | SCData<30> | F31 | SCData<31> | E 32 | | SCData<32> | AA32 | SCData<33> | AB35 | SCData<34> | AC 34 | | SCData<35> | AB32 | SCData<36> | AD35 | SCData<37> | AC 33 | | SCData<38> | AD 34 | SCData<39> | AC31 | SCData<40> | AE 35 | | SCData<41> | AD32 | SCData<42> | AE33 | SCData<43> | AD31 | | SCData<44> | AF34 | SCData<45> | AE32 | SCData<46> | AG 35 | | SCData<47> | AF33 | SCData<48> | AG34 | SCData<49> | AF31 | | SCData<50> | AH35 | SCData<51> | AG32 | SCData<52> | AJ 34 | | SCData<53> | AG31 | SCData<54> | AJ33 | SCData<55> | AH 32 | | SCData<56> | AK35 | SCData<57> | AJ31 | SCData<58> | AK 34 | | SCData<59> | AK 32 | SCData<60> | AL35 | SCData<61> | AK 31 | *Table 12-3 (cont.)* | Signal | Location | Signal | Location | Signal | Location | |--------------|----------|--------------|----------|--------------|----------| | SCData<62> | AM34 | SCData<63> | AM33 | SCData<64> | D28 | | SCData<65> | В29 | SCData<66> | E27 | SCData<67> | C28 | | SCData<68> | D27 | SCData<69> | E26 | SCData<70> | A28 | | SCData<71> | C26 | SCData<72> | B15 | SCData<73> | D 15 | | SCData<74> | A14 | SCData<75> | C14 | SCData<76> | A12 | | SCData<77> | D14 | SCData<78> | B12 | SCData<79> | C13 | | SCData<80> | E9 | SCData<81> | C8 | SCData<82> | D8 | | SCData<83> | В7 | SCData<84> | C7 | SCData<85> | A 6 | | SCData<86> | E7 | SCData<87> | В6 | SCData<88> | D 6 | | SCData<89> | A5 | SCData<90> | E6 | SCData<91> | C 5 | | SCData<92> | D5 | SCData<93> | B4 | SCData<94> | C 4 | | SCData<95> | В3 | SCData<96> | AN29 | SCData<97> | AP29 | | SCData<98> | AM28 | SCData<99> | AN28 | SCData<100> | AL 27 | | SCData<101> | AR28 | SCData<102> | AM27 | SCData<103> | AP27 | | SCData<104> | AL16 | SCData<105> | AP15 | SCData<106> | AL 15 | | SCData<107> | AP13 | SCData<108> | AN14 | SCData<109> | AN 13 | | SCData<110> | AM14 | SCData<111> | AR12 | SCData<112> | AM 9 | | SCData<113> | AR8 | SCData<114> | AL9 | SCData<115> | AN 8 | | SCData<116> | AM8 | SCData<117> | AP7 | SCData<118> | AN 7 | | SCData<119> | AR6 | SCData<120> | AL7 | SCData<121> | AP 6 | | SCData<122> | AM6 | SCData<123> | AR5 | SCData<124> | AL 6 | | SCData<125> | AN5 | SCData<126> | AM5 | SCData<127> | AP4 | | SCDataChk<0> | D33 | SCDataChk<1> | AL32 | SCDataChk<2> | C29 | | SCDataChk<3> | AR30 | SCDataChk<4> | A30 | SCDataChk<5> | AP3 | | SCDataChk<6> | E4 | SCDataChk<7> | AN4 | SCDataChk<8> | D 3 | | SCDataChk<9> | В27 | SCTCS* | D2 | SCTag<0> | R 1 | | SCTag<1> | R4 | SCTag<2> | P1 | SCTag<3> | R 5 | | SCTag<4> | P3 | SCTag<5> | N2 | SCTag<6> | P4 | | SCTag<7> | M1 | SCTag<8> | N3 | SCTag<9> | M 2 | | SCTag<10> | N 5 | SCTag<11> | M4 | SCTag<12> | L 1 | | SCTag<13> | M5 | SCTag<14> | K2 | SCTag<15> | L 4 | | SCTag<16> | J1 | SCTag<17> | K3 | SCTag<18> | J 2 | | SCTag<19> | K5 | SCTag<20> | Н1 | SCTag<21> | J 4 | | SCTag<22> | G2 | SCTag<23> | J5 | SCTag<24> | G3 | | SCTag<25> | H4 | SCTagChk<0> | V4 | SCTagChk<1> | W3 | | SCTagChk<2> | V2 | SCTagChk<3> | V5 | SCTagChk<4> | V1 | | SCTagChk<5> | U3 | SCTagChk<6> | U1 | SCTOE* | G5 | | SCTWay | T3 | SCTWr* | F2 | SCTagLSBAddr | T 5 | | SelDVCO | E21 | Spare1 | F5 | Spare3 | U4 | | SysAD<0> | Y 34 | SysAD<1> | W32 | SysAD<2> | V 35 | | SysAD<3> | V31 | SysAD<4> | V34 | SysAD<5> | U 33 | | SysAD<6> | V32 | SysAD<7> | U32 | SysAD<8> | U35 | | SysAD<9> | T33 | SysAD<10> | T34 | SysAD<11> | T31 | | SysAD<12> | R35 | SysAD<13> | R32 | SysAD<14> | R 34 | | SysAD<15> | P35 | SysAD<16> | AL26 | SysAD<17> | AR 27 | | SysAD<18> | AN 26 | SysAD<19> | AP26 | SysAD<20> | AL 25 | *Table 12-3 (cont.)* | Signal | Location | Signal | Location | Signal | Location | |-------------|----------|-------------|----------|-------------|----------| | SysAD<21> | AN25 | SysAD<22> | AM25 | SysAD<23> | AR 25 | | SysAD<24> | AL24 | SysAD<25> | AP24 | SysAD<26> | AM 24 | | SysAD<27> | AR24 | SysAD<28> | AL23 | SysAD<29> | AN 23 | | SysAD<30> | AM22 | SysAD<31> | AP23 | SysAD<32> | C20 | | SysAD<33> | В20 | SysAD<34> | D19 | SysAD<35> | A 19 | | SysAD<36> | C19 | SysAD<37> | A18 | SysAD<38> | D 18 | | SysAD<39> | E18 | SysAD<40> | B18 | SysAD<41> | C17 | | SysAD<42> | A 17 | SysAD<43> | D17 | SysAD<44> | В16 | | SysAD<45> | C16 | SysAD<46> | A15 | SysAD<47> | E16 | | SysAD<48> | AN 20 | SysAD<49> | AR19 | SysAD<50> | AL 19 | | SysAD<51> | AN 19 | SysAD<52> | AM19 | SysAD<53> | AP 18 | | SysAD<54> | AM18 | SysAD<55> | AR18 | SysAD<56> | AL 18 | | SysAD<57> | AR17 | SysAD<58> | AM17 | SysAD<59> | AN 17 | | SysAD<60> | AL17 | SysAD<61> | AP16 | SysAD<62> | AN 16 | | SysAD<63> | AR15 | SysADChk<0> | AN22 | SysADChk<1> | AR 22 | | SysADChk<2> | AL21 | SysADChk<3> | AP21 | SysADChk<4> | AM 21 | | SysADChk<5> | AR21 | SysADChk<6> | AL20 | SysADChk<7> | AP 20 | | SysClk | A22 | SysClk* | A23 | SysClkRet* | C 23 | | SysClkRet | В23 | SysCmd<0> | Y2 | SysCmd<1> | Y3 | | SysCmd<2> | AA1 | SysCmd<3> | Y5 | SysCmd<4> | AA2 | | SysCmd<5> | AA4 | SysCmd<6> | AB1 | SysCmd<7> | AA5 | | SysCmd<8> | AB3 | SysCmd<9> | AC2 | SysCmd<10> | AB 4 | | SysCmd<11> | AD1 | SysCmdPar | AC3 | SysCorErr* | AK4 | | SysCyc* | E20 | SysGblPerf* | AG4 | SysGnt* | AD4 | | SysNMI* | AK5 | SysRdRdy* | AJ3 | SysRel* | AE 1 | | SysReq* | AC5 | SysReset* | AD5 | SysResp<0> | AJ2 | | SysResp<1> | AF5 | SysResp<2> | AH1 | SysResp<3> | AF3 | | SysResp<4> | AG2 | SysRespPar | AE4 | SysRespVal* | AG1 | | SysState<0> | AL1 | SysState<1> | AJ5 | SysState<2> | AK 2 | | SysStatePar | AH4 | SysStateVal | AK1 | SysUncErr* | AM 2 | | SysVal* | AD2 | SysWrRdy* | AG5 | TCA | AM 3 | | TCB | AL4 | TriState | T2 | VccPa | В25 | | VccPa | C25 | VccPd | E22 | VrefByp | C 22 | | VssPa | A24 | VssPa | B24 | VssPd | D 22 | | Vcc | A2 | Vcc | A34 | Vcc | AB 2 | | Vcc | AB34 | Vcc | AE3 | Vcc | AF 32 | | Vcc | AF4 | Vcc | AH2 | Vcc | AH 34 | | Vcc | AL3 | Vcc | AL31 | Vcc | AL 33 | | Vcc | AL5 | Vcc | AM10 | Vcc | AM 16 | | Vcc | AM20 | Vcc | AM26 | Vcc | AN 18 | | Vcc | AN2 | Vcc | AN34 | Vcc | AP1 | | Vcc | AP14 | Vcc | AP22 | Vcc | AP28 | | Vcc | AP35 | Vcc | AP8 | Vcc | AR 2 | | Vcc | AR34 | Vcc | B1 | Vcc | B14 | | Vcc | В28 | Vcc | B35 | Vcc | В8 | | Vcc | C18 | Vcc | C2 | Vcc | C 34 | *Table 12-3 (cont.)* | Signal | Location | Signal | Location | Signal | Location | |---------|----------|---------|----------|---------|----------| | Vcc | D10 | Vcc | D16 | Vcc | D 20 | | Vcc | D26 | Vcc | Е3 | Vcc | E 31 | | Vcc | E33 | Vcc | E5 | Vcc | F1 | | Vcc | Н2 | Vcc | Н34 | Vcc | K32 | | Vcc | K4 | Vcc | L3 | Vcc | P2 | | Vcc | P34 | Vcc | T32 | Vcc | T 4 | | Vcc | V3 | Vcc | V33 | Vcc | Y32 | | Vcc | Y4 | VccQSC | A10 | VccQSC | A32 | | VccQSC | A4 | VccQSC | AB31 | VccQSC | AD 33 | | VccQSC | AF35 | VccQSC | AH31 | VccQSC | AH 33 | | VccQSC | AK33 | VccQSC | AL14 | VccQSC | AL 28 | | VccQSC | AL8 | VccQSC | AM35 | VccQSC | AN 12 | | VccQSC | AN 3 | VccQSC | AN30 | VccQSC | AN 33 | | VccQSC | AN 6 | VccQSC | AR10 | VccQSC | AR 32 | | VccQSC | AR4 | VccQSC | C12 | VccQSC | C 3 | | VccQSC | C30 | VccQSC | C33 | VccQSC | C6 | | VccQSC | D1 | VccQSC | D35 | VccQSC | E 14 | | VccQSC | E28 | VccQSC | E8 | VccQSC | F3 | | VccQSC | F33 | VccQSC | Н3 | VccQSC | Н31 | | VccQSC | Н33 | VccQSC | Н5 | VccQSC | K1 | | VccQSC | K35 | VccQSC | M3 | VccQSC | M33 | | VccQSC | P31 | VccQSC | P5 | VccQSC | R 2 | | VccQSC | T1 | VccQSys | A16 | VccQSys | A20 | | VccQSys | AB 5 | VccQSys | AD3 | VccQSys | AF1 | | VccQSys | AH3 | VccQSys | AH5 | VccQSys | AK3 | | VccQSys | AL22 | VccQSys | AM1 | VccQSys | AN24 | | VccQSys | AR 16 | VccQSys | AR20 | VccQSys | AR 26 | | VccQSys | T35 | VccQSys | Y1 | VccQSys | Y35 | | VrefSC | AA35 | VrefSys | Y33 | Vss | A13 | | Vss | A27 | Vss | A29 | Vss | A3 | | Vss | A33 | Vss | A35 | Vss | A7 | | Vss | AA3 | Vss | AA33 | Vss | AC 1 | | Vss | AC 32 | Vss | AC35 | Vss | AC 4 | | Vss | AE2 | Vss | AE31 | Vss | AE 34 | | Vss | AE5 | Vss | AG3 | Vss | AG 33 | | Vss | AJ1 | Vss | AJ32 | Vss | AJ 35 | | Vss | AJ4 | Vss | AL2 | Vss | AL 34 | | Vss | AM13 | Vss | AM23 | Vss | AM 29 | | Vss | AM32 | Vss | AM4 | Vss | AM 7 | | Vss | AN 1 | Vss | AN15 | Vss | AN 21 | | Vss | AN 27 | Vss | AN35 | Vss | AN 9 | | Vss | AP11 | Vss | AP17 | Vss | AP 19 | | Vss | AP2 | Vss | AP25 | Vss | AP31 | | Vss | AP34 | Vss | AP5 | Vss | AR 1 | | | | | AR23 | | AR 29 | | Vss | AR13 | Vss | | Vss | | | Vss | AR3 | Vss | AR33 | Vss | AR 35 | *Table 12-3 (cont.)* | Signal | Location | Signal | Location | Signal | Location | |--------|----------|--------|----------|--------|----------| | Vss | AR7 | Vss | В11 | Vss | В17 | | Vss | В19 | Vss | В2 | Vss | B31 | | Vss | В34 | Vss | В5 | Vss | C 1 | | Vss | C15 | Vss | C24 | Vss | C27 | | Vss | C35 | Vss | C9 | Vss | D13 | | Vss | D23 | Vss | D24 | Vss | D25 | | Vss | D29 | Vss | D32 | Vss | D4 | | Vss | D7 | Vss | E11 | Vss | E17 | | Vss | E19 | Vss | E2 | Vss | E23 | | Vss | E25 | Vss | E34 | Vss | G1 | | Vss | G32 | Vss | G35 | Vss | G4 | | Vss | J3 | Vss | J33 | Vss | L2 | | Vss | L31 | Vss | L34 | Vss | L5 | | Vss | N 1 | Vss | N32 | Vss | N35 | | Vss | N4 | Vss | R3 | Vss | R33 | | Vss | U2 | Vss | U31 | Vss | U34 | | Vss | U5 | Vss | W2 | Vss | W31 | | Vss | W34 | Vss | W5 | | | Figure 12-2 599LGA Assembly Drawing Figure 12-3 599LGA PWB Footprint Figure 12-4 599LGA Heatsink Figure 12-5 599LGA Bolster Plate # 13. Coprocessor 0 This chapter describes the Coprocessor 0 operation, concentrating on the CP0 register definitions and the R10000 processor implementation of CP0 instructions. The Coprocessor 0 (CP0) registers control the processor state and report its status. These registers can be read using MFC0 instructions and written using MTC0 instructions. CP0 registers are listed in Table 13-1. Table 13-1 Coprocessor 0 Registers | Register No. | Register Name | Description | |--------------|---------------|----------------------------------------------------------------------------------| | 0 | Index | Programmable register to select TLB entry for reading or writing | | 1 | Random | Pseudo-random counter for TLB replacement | | 2 | EntryLo0 | Low half of TLB entry for even VPN (Physical page number) | | 3 | EntryLo1 | Low half of TLB entry for odd VPN (Physical page number) | | 4 | Context | Pointer to kernel virtual PTE table in 32-bit addressing mode | | 5 | Page Mask | Mask that sets the TLB page size | | 6 | Wired | Number of wired TLB entries (lowest TLB entries not used for random replacement) | | 7 | Undefined | Undefined | | 8 | BadVAddr | Bad virtual address | | 9 | Count | Timer count | | 10 | EntryHi | High half of TLB entry (Virtual page number and ASID) | | 11 | Compare | Timer compare | | 12 | Status | Processor Status Register | | 13 | Cause | Cause of the last exception taken | | 14 | EPC | Exception Program Counter | | 15 | PRId | Processor Revision Identifier | | 16 | Config | Configuration Register (secondary cache size, etc.) | | 17 | LLAddr | Load Linked memory address | | 18 | WatchLo | Memory reference trap address (low bits Adr[39:32]) | | 19 | WatchHi | Memory reference trap address (high bits Adr[31:3]) | | 20 | XContext | Pointer to kernel virtual PTE table in 64-bit addressing mode | | 21 | FrameMask | Mask the physical addresses of entries which are written into the TLB | | 22 | BrDiag | Branch Diagnostic register | | 23 | Undefined | Undefined | | 24 | Undefined | Undefined | | 25 | PC | Performance Counters | | 26 | ECC | Secondary cache ECC and primary cache parity | | 27 | CacheErr | Cache Error and Status register | | 28 | TagLo | Cache Tag register - low bits | | 29 | TagHi | Cache Tag register - high bits | | 30 | ErrorEPC | Error Exception Program Counter | Coprocessor 0 instructions are enabled if the processor is in Kernel mode, or if bit $28 \, (CU0)$ is set in the *Status* register. Otherwise, executing one of these instructions generates a Coprocessor 0 Unusable exception. # 13.1 Index Register (0) The *Index* register is a 32-bit, read/write register containing six bits to index an entry in the TLB. The high-order bit of the register shows the success or failure of a TLB Probe (TLBP) instruction. The *Index* register also specifies the TLB entry affected by TLB Read (TLBR) or TLB Write Index (TLBWI) instructions. Figure 13-1 shows the format of the *Index* register; Table 13-2 describes the *Index* register fields. ### **Index Register** Figure 13-1 Index Register Table 13-2 Index Register Field Descriptions | Field | Description | | | | | | | | | |-------|-----------------------------------------------------------------------------------------|--|--|--|--|--|--|--|--| | P | Probe failure. Set to 1 when the previous TLBProbe (TLBP) instruction was unsuccessful. | | | | | | | | | | Index | Index to the TLB entry affected by the TLBRead and TLBWrite instructions | | | | | | | | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | | | | | | | | ## 13.2 Random Register (1) The *Random* register is a read-only register of which six bits index an entry in the TLB. This register decrements when any instruction graduates at that particular cycle, and its values range between an upper and a lower bound, as follows: - The lower bound is set by the number of TLB entries reserved for exclusive use by the operating system (the contents of the *Wired* register). - The upper bound is set by the total number of TLB entries minus 1 (64 1 maximum). The *Random* register specifies the entry in the TLB that is affected by the TLB Write Random instruction. The register does not need to be read for this purpose; however, the register is readable to verify proper operation of the processor. To simplify testing, the *Random* register is set to the value of the upper bound upon system reset. This register is also set to the upper bound when the *Wired* register is written. Figure 13-2 shows the format of the *Random* register; Table 13-3 describes the *Random* register fields. Table 13-3 Random Register Field Descriptions | Field | Description | | | | | | | |--------|--------------------------------------------------------------------|--|--|--|--|--|--| | Random | TLB Random index | | | | | | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | | | | | | # 13.3 EntryLo0 (2), and EntryLo1 (3) Registers The *EntryLo* register consists of two registers with identical formats: - EntryLo0 is used for even virtual pages. - EntryLo1 is used for odd virtual pages. The *EntryLo0* and *EntryLo1* registers are read/write registers. They hold the physical page frame number (PFN) of the TLB entry for even and odd pages, respectively, when performing TLB read and write operations. Figure 13-3 shows the format of these registers. #### EntryLo0 and EntryLo1 Registers Figure 13-3 Fields of the EntryLo0 and EntryLo1 Registers Table 13-4 Description of EntryLo Registers' Fields | Field | Description | |-------|--------------------------------------------------------------------------------------------------------------------------| | UC | Uncached attribute | | PFN | Page frame number; the upper bits of the physical address. | | С | Specifies the TLB page coherency attribute. | | D | Valid. If this bit is set, it indicates that the TLB entry is valid; otherwise, a TLBL or TLBS invalid exception occurs. | | V | Valid. If this bit is set, it indicates that the TLB entry is valid; otherwise, a TLBL or TLBS invalid exception occurs. | | G | Global. If this bit is set in both Lo0 and Lo1, then the processor ignores the ASID during TLB lookup. | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | The *PFN* fields of the *EntryLo0* and *EntryLo1* registers span bits 33:6 of the 40-bit physical address. Two additional bits for the mapped space's *uncached attribute* can be loaded into bits 63:62 of the *EntryLo* register, which are then written into the TLB with a TLB Write. During the address cycle of processor double/single/partial-word read and write requests, and during the address cycle of processor *uncached accelerated* block write requests, the processor drives the uncached attribute on **SysAD[59:58]**. The same *EntryLo* registers are used for the 64-bit and 32-bit addressing modes. In both modes the registers are 64 bits wide, however when the MIPS III ISA is not enabled (32-bit User and Supervisor modes) only the lower 32 bits of the *EntryLo* registers are accessible. MIPS III is disabled when the processor is in 32-bit Supervisor or User mode. Loading of the integer registers is limited to bits 31:0, sign-extended through bits 63:32. EntryLo[33:31] or PFN[39:38] can only be set to all zeroes or all ones. In 32- and 64-bit modes, the UC and PFN bits of both EntryLo registers are written into the TLB. The PFN bits can be masked by setting bits in the FrameMask register (described in this chapter) but the UC bits cannot be masked or initialized in 32-bit User or Supervisor modes. In 32-bit Kernel mode, MIPS III is enabled and 64-bit operations are always available to program the UC bits. There is only one G bit per TLB entry, and it is written with EntryLo0[0] and EntryLo1[0] on a TLB write. ## **13.4** Context (4) The *Context* register is a read/write register containing the pointer to an entry in the page table entry (PTE) array; this array is an operating system data structure that stores virtual-to-physical address translations. When there is a TLB miss, the CPU loads the TLB with the missing translation from the PTE array. Normally, the operating system uses the *Context* register to address the current page map which resides in the kernel-mapped segment, *kseg3*. The *Context* register duplicates some of the information provided in the *BadVAddr* register, but the information is arranged in a form that is more useful for a software TLB exception handler. Figure 13-4 shows the format of the *Context* register; Table 13-5 describes the *Context* register fields. #### **Context Register** Figure 13-4 Context Register Format #### Errata The 0 field in Table 13-5 is revised. Table 13-5 Context Register Fields | Field | Description | |---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | BadVPN2 | This field is written by hardware on a miss. It contains the virtual page number (VPN) of the most recent virtual address that did not have a valid translation. | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | PTEBase | This field is a read/write field for use by the operating system. It is normally written with a value that allows the operating system to use the <i>Context</i> register as a pointer into the current PTE array in memory. | The 19-bit *BadVPN2* field contains bits 31:13 of the virtual address that caused the TLB miss; bit 12 is excluded because a single TLB entry maps to an even-odd page pair. For a 4-Kbyte page size, this format can directly address the pair-table of 8-byte PTEs. For other page and PTE sizes, shifting and masking this value produces the appropriate address. # 13.5 PageMask Register (5) The *PageMask* register is a read/write register used for reading from or writing to the TLB; it holds a comparison mask that sets the variable page size for each TLB entry, as shown in Table 13-6. Format of the register is shown in Figure 13-5. TLB read and write operations use this register as either a source or a destination; when virtual addresses are presented for translation into physical address, the corresponding bits in the TLB identify which virtual address bits among bits 24:13 are used in the comparison. When the *Mask* field is not one of the values shown in Table 13-6, the operation of the TLB is undefined. The 0 field is reserved; it must be written as zeroes, and returns zeroes when read. Figure 13-5 PageMask Register | Page Size<br>(Mask) | Bit | | | | | | | | | | | | |---------------------|-----|----|----|----|----|----|----|----|----|----|----|----| | | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | | 4 Kbytes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | 16 Kbytes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | | 64 Kbytes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | | 256 Kbytes | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | | 1 Mbyte | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | 4 Mbytes | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | 16 Mbytes | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Table 13-6 Mask Field Values for Page Sizes # 13.6 Wired Register (6) The *Wired* register is a read/write register that specifies the boundary between the *wired* and *random* entries of the TLB as shown in Figure 13-6. Wired entries are fixed, nonreplaceable entries, which cannot be overwritten by a TLB write operation. Random entries can be overwritten. Figure 13-6 Wired Register Boundary The *Wired* register is set to 0 upon system reset. Writing this register also sets the *Random* register to the value of its upper bound (see *Random* register, above). Figure 13-7 shows the format of the *Wired* register; Table 13-7 describes the register fields. Figure 13-7 Wired Register Table 13-7 Wired Register Field Descriptions | Field | Description | | | | | | | | |-------|--------------------------------------------------------------------|--|--|--|--|--|--|--| | Wired | TLB Wired boundary | | | | | | | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | | | | | | | ## 13.7 BadVAddr Register (8) The Bad Virtual Address register (*BadVAddr*) is a read-only register that displays the most recent virtual address that caused either a TLB or Address Error exception. The *BadVAddr* register remains unchanged during Soft Reset, NMI, or Cache Error exceptions. Otherwise, the architecture leaves this register undefined. Figure 13-8 shows the format of the *BadVAddr* register. Figure 13-8 BadVAddr Register Format # 13.8 Count and Compare Registers (9 and 11) The *Count* and *Compare* registers are 32-bit read/write registers whose formats are shown in Figure 13-9. The *Count* register acts as a real-time timer. Like the R4400 implementation, the R10000 *Count* register is incremented every *other* **PClk** cycle. However, unlike the R4400, the R10000 processor has no Timer Interrupt Enable boot-mode bit, so the only way to disable the timer interrupt is to negate the interrupt mask bit, *IM*[7], in the *Status* register. This means the timer interrupt cannot be disabled without also disabling the *Performance Counter* interrupt, since they share *IM*[7]. The *Compare* register can be programmed to generate an interrupt at a particular time, and is continually compared to the *Count* register. Whenever their values equal, the interrupt bit *IP*[7] in the *Cause* register is set. This interrupt bit is reset whenever the *Compare* register is written. Figure 13-9 Count and Compare Registers # 13.9 EntryHi Register (10) The *EntryHi* register holds the high-order bits of a TLB entry for TLB read and write operations. The *EntryHi* register is accessed by the TLB Probe, TLB Write Random, TLB Write Indexed, and TLB Read Indexed instructions. Figure 13-10 shows the format of this register and Table 13-8 describes the register's fields... #### **EntryHi Register** Figure 13-10 EntryHi Register Table 13-8 EntryHi Register Fields | Field | Description | |-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------| | VPN2 | Virtual page number divided by two (maps to two pages); upper bits of the virtual address | | ASID | Address space ID field. An 8-bit field that lets multiple processes share the TLB; each process has a distinct mapping of otherwise identical virtual page numbers. | | R | Region. $(00 \rightarrow \text{user}, 01 \rightarrow \text{supervisor}, 11 \rightarrow \text{kernel})$ used to match vAddr <sub>6362</sub> | | Fill | Reserved. 0 on read; ignored on write. | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | In 64-bit addressing mode, the VPN2 field contains bits 43:13 of the 44-bit virtual address. In 32-bit addressing mode only the lower 32 bits of the *EntryHi* register are used, so the format remains the same as in the R4400 processor's 32-bit addressing mode. The *FILL* field is ignored on write and read as zeroes, as it was in the R4400 implementation. When either a TLB refill, TLB invalid, or TLB modified exception occurs, the *EntryHi* register is loaded with the virtual page number (VPN2) and the ASID of the virtual address that did not have a matching TLB entry. ## 13.10 Status Register (12) The *Status* register (SR) is a read/write register that contains the operating mode, interrupt enabling, and the diagnostic states of the processor. The following list describes the more important *Status* register fields; Figure 13-11 shows the format of the entire register, and Table 13-10 describes the *Status* register fields. Some of the important fields include: - The 4-bit *Coprocessor Usability (CU)* field controls the usability of 4 possible coprocessors. Regardless of the *CU0* bit setting, CP0 is always usable in Kernel mode. The *XX* bit enables the MIPS IV ISA in User mode. - By default, the R10000 processor implements the same user instruction set as the R4400 processor. To enable execution of the MIPS IV instructions in User mode, the MIPS IV User Mode bit, (XX) of the CP0 Status register must be set The MIPS IV instruction extension uses COP1X as the opcode; this designation was COP3 in the R4400 processor. For this reason the *CU3* bit is omitted in the R10000 processor, and is used as the *XX* bit. In *Kernel* and *Supervisor* modes, the state of the *XX* bit is ignored, and MIPS IV instructions are always available. Mode bit settings are shown in Table 13-9; dashes in the table represent *don't cares*. Table 13-9 ISA and Status Register Settings for User, Supervisor and Kernel Mode Operations | Mode | UX | SX | KX | XX | MIPS II | MIPS III | MIPS IV | |------------|----|----|----|----|---------|----------|---------| | User | 0 | - | - | 0 | Yes | No | No | | | 0 | - | - | 1 | Yes | No | Yes | | | 1 | - | - | 0 | Yes | Yes | No | | | 1 | - | - | 1 | Yes | Yes | Yes | | Supervisor | - | 0 | - | - | Yes | No | Yes | | | - | 1 | - | - | Yes | Yes | Yes | | Kernel | - | - | - | - | Yes | Yes | Yes | **NOTE:** Operation with the MIPS IV ISA does not assume or require that the MIPS III instruction set or 64-bit addressing be enabled — *KX*, *SX* and *UX* may all be set to zero. - The *Reduced Power (RP)* bit is reserved and should be zero. The R10000 processor does not define a reduced power mode. - The *Reverse-Endian (RE)* bit, bit 25, reverses the endianness of the machine. The processor can be configured as either little-endian or big-endian at system reset; reverse-endian selection is available in Kernel and Supervisor modes, and in the User mode when the *RE* bit is 0. Setting the *RE* bit to 1 inverts the User mode endianness. - The 9-bit *Diagnostic Status (DS)* field is used for self-testing, and checks the cache and virtual memory system. This field is described in Table 13-11 and Figure 13-12. - The 8-bit *Interrupt Mask (IM)* field controls the enabling of eight interrupt conditions. Interrupts must be enabled before they can be asserted, and the corresponding bits are set in both the *Interrupt Mask* field of the *Status* register and the *Interrupt Pending* field of the *Cause* register. - The processor mode is undefined if the *KSU* field is set to 3 (11<sub>2</sub>). The R10000 processor implements this as User mode. #### **Status Register** Figure 13-11 Status Register # **Status Register Fields** Table 13-10 describes the *Status* register fields. Table 13-10 Status Register Fields | Field | Description | |-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | XX | Enables execution of MIPS IV instructions in User mode. $1 \rightarrow$ MIPS IV instructions usable $0 \rightarrow$ MIPS IV instructions unusable | | CU | Controls the usability of each of the four coprocessor unit numbers. CP0 is always usable when in Kernel mode, regardless of the setting of the $CU_0$ bit. $1 \rightarrow$ usable $0 \rightarrow$ unusable | | RP | In the R4400 processor, this bit enables reduced-power operation by reducing the internal clock frequency. In the R10000 processor, this bit should be set to zero. | | FR | Enables additional floating-point registers $0 \rightarrow 16$ registers $1 \rightarrow 32$ registers | | RE | Reverse-Endian bit, valid in User mode. | | DS | Diagnostic Status field (see Figure 13-12). | | IM | <ul> <li>Interrupt Mask: controls the enabling of each of the external, internal, and software interrupts. An interrupt is taken if interrupts are enabled, and the corresponding bits are set in both the Interrupt Mask field of the Status register and the Interrupt Pending field of the Cause register.</li> <li>0 → disabled</li> <li>1 → enabled</li> </ul> | | KX | Enables 64-bit addressing in Kernel mode. The extended-addressing TLB refill exception is used for TLB misses on kernel addresses. $0 \rightarrow 32\text{-bit} \\ 1 \rightarrow 64\text{-bit}$ | | SX | Enables 64-bit addressing and operations in Supervisor mode. The extended-addressing TLB refill exception is used for TLB misses on supervisor addresses. $0 \rightarrow 32\text{-bit}$ $1 \rightarrow 64\text{-bit}$ | | UX | Enables 64-bit addressing and operations in User mode. The extended-addressing TLB refill exception is used for TLB misses on user addresses. $0 \rightarrow 32\text{-bit} \\ 1 \rightarrow 64\text{-bit}$ | Table 13-10 (cont.) Status Register Fields | Field | Description | | | |-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--| | KSU | Mode bits $ \begin{array}{c} 11_2 \rightarrow \text{Undefined (implemented as User mode)} \\ 10_2 \rightarrow \text{User} \\ 01_2 \rightarrow \text{Supervisor} \\ 00_2 \rightarrow \text{Kernel} \end{array} $ | | | | ERL | Error Level; set by the processor when Reset, Soft Reset, NMI, or Cache Error exception are taken. $0 \rightarrow \text{normal} \\ 1 \rightarrow \text{error}$ | | | | EXL | Exception Level; set by the processor when any exception other than Reset, Soft Reset, NMI, or Cache Error exception are taken. 0 → normal 1 → exception | | | | IE | Interrupt Enable $0 \rightarrow \text{disable all interrupts}$ $1 \rightarrow \text{enables all interrupts}$ | | | #### **Diagnostic Status Field** The 9-bit *Diagnostic Status* (*DS*) field is used for self-testing, and checks the cache and virtual memory system. This field is described in Table 13-11 and shown Figure 13-12. Some of the important DS fields include: • In the R4400, the *TS* bit of the diagnostic field indicates a TLB *shutdown* has occurred due to matching of multiple virtual page entries during address translation. In the R10000 processor, the *TS* bit indicates a TLB write has introduced an entry that would allow matching of more than one virtual page entry during translation. In this case, the TLB entries that allow the multiple matches, even in the *Wired* area, are invalidated before the new TLB entry is written. This prevents multiple matches during address translation. The *TS* bit is updated for each TLB write. It can also be read and written by software (in the R4400, the *TS* bit is read-only); to clear the *TS* bit one needs to write a 0 into it. As in the R4400, Reset/Soft Reset/NMI exceptions also clear the *TS* bit. - The *NMI* bit is new to the R10000 processor; it distinguishes between Soft Reset and NMI exceptions. Both exceptions set the *SR* bit to 1; the NMI exception sets the *NMI* bit to 1, whereas the Soft Reset exception sets it to 0. - The CE bit is reserved in the R10000 processor and should be a 0. Figure 13-12 Diagnostic Status Field Table 13-11 Status Register Diagnostic Status Bits | Bit | Description | | |-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--| | BEV | Controls the location of TLB refill and general exception vectors. $0 \rightarrow \text{normal}$ $1 \rightarrow \text{bootstrap}$ | | | TS | This bit is set when a TLB write presents an entry that matches any other virtual page entry in the TLB. Should this occur, any TLB entries that allow multiple matches, even in the <i>Wired</i> area, are invalidated before this new entry can be written into the TLB. This prevents multiple matches during address translation. $0 \rightarrow$ normal $1 \rightarrow$ TLB shutdown has occurred. | | | SR | 1 → Indicates a Soft Reset or NMI exception. | | | NMI | 1 → Indicates a nonmaskable interrupt has occurred. Used to distinguish between a Soft Reset and a nonmaskable interrupt in a Soft Reset exception. | | | СН | Hit (tag match and valid state) or miss indication for last CACHE Hit Invalidate, Hit Write Back Invalidate for a secondary cache. $0 \rightarrow \text{miss}$ $1 \rightarrow \text{hit}$ | | | CE | Reserved in the R10000, and should be set to 0. | | | DE | Specifies that cache parity or ECC errors cannot cause exceptions. $0 \rightarrow \text{parity/ECC}$ remain enabled $1 \rightarrow \text{disables parity/ECC}$ | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | # **Coprocessor Accessibility** Three *Status* register *CU* bits control coprocessor accessibility: *CU0*, *CU1*, and *CU2* enable coprocessors 0, 1, and 2, respectively. If a coprocessor is unusable, any instruction that accesses it generates an exception. The following describes the coprocessor implementations and operations on the R10000: - Coprocessor 0 is always enabled in kernel mode, regardless of the CU0 bit. - Coprocessor 1 is the floating-point coprocessor. If CU1 is 0 (disabled), all floating-point instructions generate a Coprocessor Unusable exception. In MIPS IV, the COP3 instruction is replaced with a second floating-point instruction, COP1X. In addition, new functions are added to COP1 (see Chapter 14, FPU Instructions). The floating-point branch conditional and compare instructions are expanded to use the eight Floating-Point Status register condition bits, instead of the original single bit. If any of these extra bits are referenced (cc > 0) when not using the MIPS IV ISA, an Unimplemented Instruction exception is taken. The integer conditional move (MOVC) instruction tests a floating-point condition bit; it causes a coprocessor unusable exception if coprocessor 1 is disabled. - Coprocessor 2 is defined, but does not exist in the R10000; its instructions (COP2, LWC2, LDC2, SWC2, SDC2) always cause an exception, but the exception code depends upon whether the coprocessor, as indicated by CU2, is enabled. - Coprocessor 3 has been removed from the MIPS III ISA, and is no longer defined. If MIPS IV is disabled, the coprocessor 3 instruction (COP3) always causes a Reserved Instruction exception. # **13.11** Cause Register (13) The 32-bit read/write *Cause* register describes the cause of the most recent exception. Figure 13-13 shows the fields of this register; Table 13-12 describes the *Cause* register fields. A 5-bit exception code (*ExcCode*) indicates one of the causes, as listed in Table 13-13. All bits in the *Cause* register, with the exception of the IP[1:0] bits, are read-only; IP[1:0] are used for software interrupts. Table 13-12 Cause Register Fields | Field | Description | |---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | BD | Indicates whether the last exception taken occurred in a branch delay slot. $1 \rightarrow$ delay slot $0 \rightarrow$ normal | | CE | Coprocessor unit number referenced when a Coprocessor Unusable exception is taken. This bit is undefined for any other exception. | | IP | Indicates an interrupt is pending. This bit remains unchanged for NMI, Soft Reset, and Cache Error exceptions. $1 \rightarrow \text{interrupt pending} \\ 0 \rightarrow \text{no interrupt}$ | | ExcCode | Exception code field (see Table 13-13) | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | #### **Cause Register** Figure 13-13 Cause Register Format Table 13-13 Cause Register ExcCode Field | Exception | Mnemonic | Description | | |------------|----------|-----------------------------------------------------|--| | Code Value | | | | | 0 | Int | Interrupt | | | 1 | Mod | TLB modification exception | | | 2 | TLBL | TLB exception (load or instruction fetch) | | | 3 | TLBS | TLB exception (store) | | | 4 | AdEL | Address error exception (load or instruction fetch) | | | 5 | AdES | Address error exception (store) | | | 6 | IBE | Bus error exception (instruction fetch) | | | 7 | DBE | Bus error exception (data reference: load or store) | | | 8 | Sys | Syscall exception | | | 9 | Вр | Breakpoint exception | | | 10 | RI | Reserved instruction exception | | | 11 | CpU | Coprocessor Unusable exception | | | 12 | Ov | Arithmetic Overflow exception | | | 13 | Tr | Trap exception | | | 14 | _ | Reserved | | | 15 | FPE | Floating-Point exception | | | 16–22 | - | Reserved | | | 23 | WATCH | Reference to WatchHi/WatchLo address | | | 24–30 | _ | Reserved | | | 31 | _ | Reserved | | # 13.12 Exception Program Counter (14) The Exception Program Counter $(EPC)^{\dagger}$ is a read/write register that contains the address at which processing resumes after an exception has been serviced. For synchronous exceptions, the *EPC* register contains either: - the virtual address of the instruction that was the direct cause of the exception, or - the virtual address of the immediately preceding branch or jump instruction (when the instruction is in a branch delay slot, and the *Branch Delay* bit in the *Cause* register is set). The processor does not write to the *EPC* register when the *EXL* bit in the *Status* register is set to a 1. Figure 13-14 shows the format of the *EPC* register. Figure 13-14 EPC Register Format <sup>†</sup> The *ErrorEPC* register provides a similar capability, described later in this chapter. # 13.13 Processor Revision Identifier (PRId) Register (15) The 32-bit, read-only *Processor Revision Identifier (PRId)* register contains information identifying the implementation and revision level of the CPU and CP0. Figure 13-15 shows the format of the *PRId* register; Table 13-14 describes the *PRId* register fields. #### **PRId Register** Figure 13-15 Processor Revision Identifier Register Format Table 13-14 PRId Register Fields | Field | Description | | | |-------|--------------------------------------------------------------------|--|--| | Imp | Implementation number | | | | Rev | Revision number | | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | | The low-order byte (bits 7:0) of the *PRId* register is interpreted as a revision number, and the high-order byte (bits 15:8) is interpreted as an implementation number. The implementation number of the R10000 processor is 0x09. The content of the high-order halfword (bits 31:16) of the register are reserved. The revision number is stored as a value in the form y.x, where y is a major revision number in bits 7:4 and x is a minor revision number in bits 3:0. The revision number can distinguish some chip revisions, however there is no guarantee that changes to the chip will necessarily be reflected in the *PRId* register, or that changes to the revision number necessarily reflect real chip changes. For this reason, software should not rely on the revision number in the *PRId* register to characterize the chip. # 13.14 Config Register (16) The R10000 processor's *Config* register has a different format from that of the R4400, since the R10000 processor has different mode bits and configurations, however some fields are still compatible: *K0*, *DC*, *IC*, and *BE*. The value of bits 24:0 are taken directly from the Mode bit settings during a reset sequence; refer to Table 8-1 for these bit definitions. Table 13-15 shows the R10000 *Config* register fields, along with values which are hardwired into the register at boot time; Figure 13-16 shows the *Config* register format. Table 13-15 Config Register Field Definitions | Field | Bits | Name | | | Hardwired | | | |-------|-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|--------------------|----------------------------------------------|--------------------------------------------------------------------------|-----------| | Field | Dits | R10000 | | | | <u>R12000</u> | Values | | КО | 2:0 | Coherency algorithm $000_2 \rightarrow \text{reserved}$ $001_2 \rightarrow \text{reserved}$ $010_2 \rightarrow \text{uncached}$ $011_2 \rightarrow \text{cacheable noncoherent}$ $100_2 \rightarrow \text{cacheable coherent exclusive}$ $101_2 \rightarrow \text{cacheable coherent exclusive on write}$ $110_2 \rightarrow \text{reserved}$ $111_2 \rightarrow \text{uncached accelerated}$ | | | | | | | DN | 4:3 | Device number | | | | | | | CT | 5 | CohPrcReqTar | | | | | | | PE | 6 | PrcElmReq | | | | | | | PM | 8:7 | PrcReqMax | | | | | | | EC | 12:9 | SysClkDiv | | | | | | | SB | 13 | SCBlkSize | | | | | | | SK | 14 | SCCorEn | | | | | | | BE | 15 | MemEnd | | | | | | | SS | 18:16 | SCSize | | | | | | | SC | 21:19 | SCClkDiv | | | | | | | | 25:22 | Reserved | | Field<br>SD<br>PDR | <b>Bits</b> 22 <sup>†</sup> 23 <sup>††</sup> | Name SC Data and Tag Corrector disable Processor coherency data response | 0 | | DC | 28:26 | | Pri | mary d | lata cac | the size (hardwired to 011 <sub>2</sub> ) | 32 Kbytes | | IC | 31:29 | • | | | 32 Kbytes | | | <sup>†</sup> Bit 22 of the Config register is 'SC Data and Tag Corrector disable'. This bit turns off use of ECC to correct errors in the SC data and tags. <sup>††</sup> When Bit[23] of the Config register is set, the response that R12000 produces to an external intervention (shared or exclusive) which hits on a CleanExclusive line is changed. As before, the state of the line in the cache is changed, and the former state of the line is sent out on SysState[1:0]. Moreover, when Bit[23] of Config is set, a processor coherency data response is sent with the state response. In other words, when this bit is set, external interventions which hit CleanExclusive or DirtyExclusive lines in the Secondary Cache result in a processor coherency data response. Figure 13-16 Config Register Format # 13.15 Load Linked Address (LLAddr) Register (17) Physical addresses for Load Link instructions are no longer written into this register. *LLAddr* is implemented as a read/write scratch register used for NT compatibility. Figure 13-17 shows the format of the *LLAddr* register. Figure 13-17 LLAddr Register Format # 13.16 WatchLo (18) and WatchHi (19) Registers *WatchHi* and *WatchLo* are 32-bit read/write registers which contain a physical address of a doubleword location in main memory. If enabled, any attempt to read or write this location causes a Watch exception. This feature is used for debugging. Bits 7:0 of the *WatchHi* register contain bits 39:32 of the trap physical address, shown in Figure 13-18. The WatchLo register contains physical address bits 31:3. The remaining bits of the register are ignored on write and read as zero. Table 13-16 describes the WatchLo and WatchHi register fields. Figure 13-18 WatchLo and WatchHi Register Formats Field Description PAddr1 Bits 39:32 of the physical address PAddr0 Bits 31:3 of the physical address R Trap on load references if set to 1 W Trap on store references if set to 1 0 Ignored on write and read as zero. Table 13-16 WatchHi and WatchLo Register Fields # 13.17 XContext Register (20) The read/write *XContext* register contains a pointer to an entry in the page table entry (PTE) array, an operating system data structure that stores virtual-to-physical address translations. When there is a TLB miss, the operating system software loads the TLB with the missing translation from the PTE array. The *XContext* register no longer shares the information provided in the *BadVAddr* register, as it did in the R4400. The *XContext* register is for use with the XTLB refill handler, which loads TLB entries for references to a 64-bit address space, and is included solely for operating system use. The operating system sets the PTE base field in the register, as needed. Normally, the operating system uses the *Context* register to address the current page map, which resides in the kernel-mapped segment *kseg3*. Figure 13-19 shows the format of the *XContext* register; Table 13-17 describes the *XContext* register fields. # XContext Register 63 37 36 35 34 4 3 0 PTEBase R BadVPN2 0 27 2 31 4 Figure 13-19 XContext Register Format The 31-bit *BadVPN2* field holds bits 43:13 of the virtual address that caused the TLB miss; bit 12 is excluded because a single TLB entry maps to an even-odd page pair. For a 4-Kbyte page size, this format may be used directly to address the pair-table of 8-byte PTEs. For other page and PTE sizes, shifting and masking this value produces the appropriate address. #### Errata The 0 field in Table 13-17 is revised. Table 13-17 XContext Register Fields | Field | Description | | |---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--| | BadVPN2 | the <i>Bad Virtual Page Number</i> /2 field is written by hardware on a miss. It contains the VPN of the most excent invalidly translated virtual address. | | | R | The Region field contains bits 63:62 of the virtual address. $00_2 = \text{user}$ $01_2 = \text{supervisor}$ $11_2 = \text{kernel}$ . | | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | | | PTEBase | The <i>Page Table Entry Base</i> read/write field is normally written with a value that allows the operating system to use the <i>Context</i> register as a pointer into the current PTE array in memory. | | # 13.18 FrameMask Register (21) The *FrameMask* register is new with the R10000 processor. It masks bits of the *EntryLo0* and *EntryLo1* registers so that these masked bits are not passed to the TLB while doing a TLB write (either TLBWI or TLBWR). A zero in the *FrameMask* register allows its corresponding bit in the *EntryLo*[1,0] registers to pass to the TLB; a one in the *FrameMask* register masks off its corresponding bit in the *EntryLo* registers and passes a zero to the TLB. Bits 15:0 of the *FrameMask* register control bits 33:18 of the *EntryLo* registers. The remaining bits of this register are ignored on write and read as zeroes. The content of this register is set to zero after a processor reset or a power-up event. Figure 13-20 shows the *FrameMask* register format. #### FrameMask Register Figure 13-20 FrameMask Register Format # 13.19 Diagnostic Register (22) CP0 register 22, the *Diagnostic* register, is a new 64-bit register for processor-specific diagnostic functions. (Since this register is designed for local use, the diagnostic functions are subject to change without notice.) Currently, this register helps test the ITLB, branch caches, and the branch prediction scheme. In addition, it provides choices for branch prediction algorithms, to help diagnostic program writing. #### Errata The <u>twelve</u> fields of the *Diagnostic* register, shown in Figure 13-21, are described below. All fields are read-only (all writes are ignored). *ITLBM*: this field is a 4-bit read-only counter. This field is incremented by one for each ITLB miss, and any overflow is ignored. Its value is undefined during reset, and its value is meaningless when used in an unmapped space. *BSIdx*: this field defines the entry in the branch stack to be used for the latest conditional branch decoded. Its value is meaningless if the latest branch was an unconditional branch. DBRC: this field disables the use of the branch return cache (BRC). *BRCV*: this field indicates whether or not the branch return cache (BRC) is valid. BRC has only one entry (four instructions). *BRCW*: this field indicates whether or not the latest branch (JAL, JALR RX, BGEZAL, BGEZALL, BLTZALL) caused a write into BRC. It is not affected by any other type of branch. *BRCH*: this field indicates whether or not the latest branch (JR r31 or JALR rx,r31) has a BRC hit. It not affected by any other type of branch. *MP*: this field indicates whether or not the latest conditional branch verified was mispredicted. *BPMode*: this is a read-write field for branch prediction algorithm control. $00_2$ : 2-bit counter scheme 012: all conditional branches are predicted not taken 10<sub>2</sub>: all conditional branches are predicted taken $11_2$ : forward conditional branches are predicted not taken and backward conditional branches are predicted taken. The default mode is 00 on processor reset. *BPState*: this field contains the new 2-bit state for a conditional branch after it is verified. It is also used to hold the 2-bit state to read/write when a branch prediction table read/write operation is executed. *BPIdx*: this field contains the index to the Branch Prediction Table (BPT) for BPT read/write/initialization operations, and should contain *VA[11:3]* of the branch for BPT read/write operations. The upper six bits of the *BPIdx* field contain the line address for BPT line initialization operations; the lower three bits of *BPIdx* are ignored. BPOp: this field indicates the following BPT operations: 002: BPT read 01<sub>2</sub>: BPT write 10<sub>2</sub>: initializes BPT line to all zeroes (strongly not taken) 112: initializes BPT line to all ones (strongly taken). #### Errata #### 0: Reserved. Must be written as zeroes, and returns zeroes when read. Figure 13-21 shows the format of the *Diagnostic* register. Figure 13-21 Diagnostic Register Format #### Errata There are two ways to read the branch prediction state from the *Branch Prediction Table* (BPT): - Place an *mfc0 rx*, *C0 Diag* (a Move From *Diagnostic* register to *GPR rx*) in the delay slot of the conditional branch. This read of the *Diagnostic* register returns the next predicted state from the branch stacks before the *BPT* is updated. - Move the *Index* and the *BPT read* operation into the *Idx* and *BPOp* field of the *Diagnostic* register. This *mtc0* into *CP0 Diag* graduates as soon as the write is completed; however, there could be a significant delay in transferring the data from *BPT* to *CP0 Diag*. This delay occurs because *C0 Diag* has a lower priority to access the BPT as compared to the accesses by IFETCH and other processes. Thus, the prediction state read from the *C0 Diag* may not reflect the content of the BPT. Use the code sequence shown below to get the correct prediction state from the BPT: | | <u>li</u> | <u>rx</u> | # rx has index and BPT read for | |--------|-----------|------------------|-------------------------------------------------------| | | | | # Idx and BPOp, respectively. | | | mtc0 | rx, C0 Diag | # Set the Diagnostic register for reading the BPT | | | <u>la</u> | <u>ry, label</u> | # ry !=r31; la could be replaced by a dla for 64-bits | | | <u>jr</u> | <u>ry</u> | # This gives priority for C0 Diag to access BPT | | label: | mfc0 | rz, C0 Diag | # rz holds the state from BPT entry pointed by Idx | #### <12000> In R12000 two fields are added to the "Diag Resister" - CP0 Register 22. One field is "ghistory enable", bits 26:23. The other is "BTAC disable", bit 27. #### The definitions are: #### Ghistory enable: If bit 26 is set, branch prediction uses all eight bits of the global history register. If bit 26 is not set, then bits 25:23 specify a count of the number of bits of global history to be used. Thus if bits 26:23 are all zero, global history is disabled. The global history contains a record of the taken/not-taken status of recently executed branches, and when used is XOR'ed with the PC of a branch being predicted to produce a hashed value for indexing the BPT. Some programs with small "working set of conditional branches" benefit significantly from the use of such hashing, some see slight performance degradation. #### BTAC disable: If bit 27 is set, the use of the Branch Target Address Cache (BTAC) is disabled. The BTAC is used to reduce the instruction fetch penalty of taken branches by providing the target address of fixed-address branch and jump instructions. $\star$ # 13.20 Performance Counter Registers (25) The R10000 processor defines two performance counters and two associated control registers, which are mapped into CP0 register 25. An encoding in the MTC0/MFC0 instructions on register 25 indicates which counter or control register is used. Each counter is a 32-bit read/write register and is incremented by one each time the countable event, specified in its associated control register, occurs. Each counter can independently count one type of event at a time. The counter asserts an interrupt, IP[7], when its most significant bit (bit 31) becomes one (the counter overflows) and the associated performance control register enables the interrupt. The counting continues after counter overflow whether or not an interrupt is signalled. The format of the control registers are shown in Figure 13-22. Figure 13-22 Control Register Format The fields of the *Control* register are: • The *Event* field specifies the event to be counted, listed in Table 13-18. Table 13-18 Counter Events | Event | Counter 0 | Counter 1 | |-------|------------------------------------------------|----------------------------------------------------------------------------------| | 0 | Cycles | Cycles | | 1 | Instructions issued | Instructions graduated | | 2 | Load/prefetch/sync/CacheOp issued | Load/prefetch/sync/CacheOp graduated | | 3 | Stores (including store-conditional) issued | Stores (including store-conditional) graduated | | 4 | Store conditional issued | Store conditional graduated | | 5 | Failed store conditional | Floating-point instructions graduated | | 6 | Branches resolved | Quadwords written back from primary data cache | | 7 | Quadwords written back from secondary cache | TLB refill exceptions | | 8 | Correctable ECC errors on secondary cache data | Branches mispredicted | | 9 | Instruction cache misses | Secondary cache load/store and cache-ops operations | | 10 | Secondary cache misses (instruction) | Secondary cache misses (data) | | 11 | Secondary cache way mispredicted (instruction) | Secondary cache way mispredicted (data) | | 12 | External intervention requests | External intervention request is determined to have hit in secondary cache | | 13 | External invalidate requests | External invalidate request is determined to have hit in secondary cache | | 14 | Functional unit completion cycles | Stores or prefetches with store hint to<br>CleanExclusive secondary cache blocks | | 15 | Instructions graduated | Stores or prefetches with store hint to <i>Shared</i> secondary cache blocks | #### Errata Made various changes to Table 13-18, as indicated by the underlines. Note that the updated material reflects the functionality of silicon revision 3.0 and later. The status of earlier silicon revisions are documented as silicon errata available on www.mips.com. - The *IE* bit enables the assertion of *IP*[7] when the associated counter overflows. - The *U*, *S*, *K*, and *EXL* bits indicate the processor modes in which the event is counted: U is user mode; S is supervisor mode; K is kernel mode when *EXL* and *ERL* both are set to 0; the system is in kernel mode and handling an exception when *EXL* is set to 1, as shown in Table 13-22. - <u>0</u>: Reserved. Must be written as zeroes, and returns zeroes when read. These modes can be set individually; for example, one could set all four bits to count a certain event in all processor modes except during a cache error exception. # Errata In describing the rules that are applied for the counting of each events listed in Table 13-18, following terminology is used: *Done* is defined as the point at which the instruction is successfully executed by the functional unit but is not yet graduated. <u>Graduated</u> is defined as the point in time when the instruction is successfully executed (done), and it is the oldest instruction. <u>Secondary Cache Transaction Processing</u> (SCTP) logic is on-chip logic in which up to four internally-generated and one-externally generated secondary cache transactions are queued to be processed. The following rules apply for the counting of each event listed in Table 13-16: # **Event 0 for Counter 0 and Counter 1: Cycles** The counter is incremented on each PClk cycle. #### **Event 1 for Counter 0: Instructions Issued** The counter is incremented on each cycle by the sum of the three following events: - Integer operations marked as *done* on the cycle. 0, 1 or 2 such operations can be marked on each cycle. Since these operations (all except for MUL and DIV) are marked done on the cycle following their being issued to a functional unit, this number is nearly identical to the number issued. The only difference is that re-issues are not counted. - Floating point operations marked *done* in the active list. Possible values are 0, 1 or 2. Since these operations take more than one cycle to complete, it is possible for an instruction to be issued and then aborted before it is counted, due to a branch-misprediction or exception rollback. - Load/store instructions first issued to the address calculation unit on the previous cycle. Possible values are 0 or 1. Prefetch instructions are counted as issued. Load/store instructions are counted as being issued only once, even though they may have been issued more than one time. Any instruction which does not go to the load/store unit, integer functional unit, or FP functional is counted. Some of those not counted are: nops, bc1{f,t,fl,tl}, break, syscall, j, jal, jr, jalr, cp0 instructions. #### **Event 1 for Counter 1: Instruction Graduation.** The counter is incremented by the number of instructions that were graduated on the previous cycle. When an integer multiply or divide instruction graduates, it is counted as two instructions. #### Event 2 for Counter 0: Load/Prefetch/Sync/CacheOp Issue. Each of these instructions are counted as they are issued. A load instruction is only counted once, even though it may have been issued more than one time. † #### Event 2 for Counter 1: Load/Prefetch/Sync/CacheOp Graduation. Each of these instructions are counted as they are graduated. Up to four loads can graduate in one cycle. <sup>†</sup> This could be a result of *DCache Tag* being busy or four Instruction or Data cache misses already present and waiting to be processed in the Secondary Cache Transaction Processing (SCTP) logic. #### **Event 3 for Counter 0: Stores (Including Store-Conditional) Issued.** The counter is incremented on the cycle after a store instruction is issued to the address-calculation unit. Note that a store can only be counted as having been issued once, even though it may actually be issued more than once due to DCache Tag being busy or there already being four load/store cache misses waiting in the SCTP logic. #### **Event 3 For Counter 1: Store (Including Store-Conditional) Graduation.** Each graduating store (including SC) increments the counter. At most one store can graduate per cycle. #### **Event 4 for Counter 0: Store-Conditional Issued.** This counter is incremented on the cycle after a store conditional instruction is issued to the address-calculation unit. Note that an SC can only be counted as having been issued once, even though it may actually be issued more than once due to DCache Tag being busy or there already being four load/store cache misses waiting in the SCTP logic. #### **Event 4 for Counter 1: Store-Conditional Graduation.** At most, one store-conditional can graduate per cycle. This counter is incremented on the cycle following the graduation of a store-conditional instruction. #### **Event 5 for Counter 0: Failed Store Conditional.** This counter is incremented when a store-conditional instruction fails. #### **Event 5 for Counter 1: Floating-Point Instruction Graduation.** This counter is incremented by the number of FP instructions which graduated on the previous cycle. Any instruction that sets the FP *Status* register bits (*EVZOUI*) is counted as a graduated floating point instruction. There can be 0 to 4 such instructions each cycle. #### **Event 6 for Counter 0: Conditional Branch Resolved** This counter is incremented when a conditional branch is determined to have been "resolved." Note that when multiple floating-point conditional branches are resolved in a single cycle, this counter is still only incremented by one. Although this is a rare event, in this case the count would be incorrect. <sup>†</sup> In other words, this count is the sum of the conditional branches that are known to be both correctly predicted and mispredicted. #### **Event 6 for Counter 1: Quadwords Written Back From Primary Data Cache** This counter is incremented once each cycle that a quadword of data is written from primary data cache to secondary cache. #### **Event 7 for Counter 0: Quadwords Written Back From Secondary Cache** This counter is incremented once each cycle that a quadword of data is written back from the secondary cache to the outgoing buffer located in the on-chip system-interface unit. (Note that data from the outgoing buffer could be invalidated by an external request and not sent out of the processor.) #### **Event 7 for Counter 1: TLB Refill Exception (Due To TLB Miss)** This counter is incremented on the cycle after the TLB miss handler is invoked. All TLB misses are counted, whether they occur in the native code or within the TLB handler. #### **Event 8 for Counter 0: Correctable ECC Errors On Secondary Cache Data.** This counter is incremented on the cycle after the correction of a single-bit error on a quadword read from the secondary cache data array. #### **Event 8 for Counter 1: Branch Misprediction.** This counter is incremented on the cycle after a branch is restored because of misprediction. Note that the misprediction is determined on the same cycle that the conditional branch is resolved. The misprediction rate is the ratio of *branch mispredicted* count to *conditional branch resolve* count. #### **Event 9 for Counter 0: Primary Instruction Cache Misses.** This counter is incremented one cycle after an instruction refill request is sent to the SCTP logic. #### **Event 9 for Counter 1: Secondary Cache Load/Store and Cache-ops Operations** This counter is incremented one cycle after a request is entered into the SCTP logic, provided the request was initially targeted at the primary data cache. Such requests fall into three categories: - primary data cache misses - requests to change the state of primary and secondary and primary data cache lines from *Clean* to *Dirty*, due to stores hitting a clean line in the primary data cache - requests initiated by Cache-op instructions #### **Event 10 for Counter 0: Secondary Cache Misses (Instruction)** This counter is incremented the cycle after the last quadword of a primary instruction cache line is written from the main memory, while the secondary cache refill continues. #### **Event 10 for Counter 1: Secondary Cache Misses (Data)** This counter is incremented the cycle after the second quadword of a data cache line is written from the main memory, while the secondary cache refill continues. #### **Event 11 for Counter 0: Secondary Cache Way Misprediction (Instruction)** This counter is incremented when the secondary cache controller begins to retry an access to the secondary cache after it hit in the non-predicted way, provided the secondary cache access was initiated by the primary instruction cache. #### **Event 11 for Counter 1: Secondary Cache Way Misprediction (Data)** This counter is incremented when the secondary cache controller begins to retry an access to the secondary cache because it hit in the non-predicted way, provided the secondary cache access was initiated by the primary data cache. #### **Event 12 for Counter 0: External Intervention Requests** This counter is incremented on the cycle after an external intervention request enters the SCTP logic. #### **Event 12 for Counter 1: External Intervention Requests Hits In Secondary Cache** This counter is incremented on the cycle after an external intervention request is determined to have hit in the secondary cache. #### **Event 13 for Counter 0: External Invalidate Requests** This counter is incremented on the cycle after an external invalidate request enters the SCTP logic. #### Event 13 for Counter 1: External Invalidate Requests Hits In Secondary Cache This counter is incremented on the cycle after an external invalidate request is determined to have hit in the secondary cache. #### **Event 14 for Counter 0: Functional Unit Completion Cycles** This counter is incremented once on the cycle after at least one of the functional units — ALU1, ALU2, FPU1, or FPU2 — marks an instruction as done. # **Event 14 for Counter 1: Stores, or Prefetches with Store Hint to Clean Exclusive Secondary Cache Blocks.** This counter is incremented on the cycle after a request to change the *Clean Exclusive* state of the targeted secondary cache line to *Dirty Exclusive* is sent to the SCTP logic. #### **Event 15 for Counter 0: Instruction Graduation.** This counter is incremented by the number of instructions that were graduated on the previous cycle. When an integer multiply or divide instruction graduates, it is counted as two graduated instructions. #### Event 15 for Counter 1: Stores or Prefetches with Store Hint to Shared Secondary Cache Blocks. This counter is incremented on the cycle after a request to change the *Shared* state of the targeted secondary cache line to *Dirty Exclusive* is sent to the SCTP logic. The performance counters and associated control registers are written by using an MTC0 instruction, as shown in Table 13-19. Table 13-19 Writing Performance Registers Using MTC0 | Opcode[15:11] | Opcode[1:0] | Operation | |---------------|-------------|-------------------------------| | 11001 | 00 | Move to Performance Control 0 | | 11001 | 01 | Move to Performance Counter 0 | | 11001 | 10 | Move to Performance Control 1 | | 11001 | 11 | Move to Performance Counter 1 | The performance counters and associated control registers are read by using a MFC0 instruction, as shown in Table 13-20. Table 13-20 Reading Performance Registers Using MFC0 | Opcode[15:11] | Opcode[1:0] | Operation | |---------------|-------------|---------------------------------| | 11001 | 00 | Move from Performance Control 0 | | 11001 | 01 | Move from Performance Counter 0 | | 11001 | 10 | Move from Performance Control 1 | | 11001 | 11 | Move from Performance Counter 1 | The format of the performance control registers are shown in Table 13-21. Table 13-21 Performance Control Register Format | [8:5] | [4] | [3:0] | |--------------|------------------------|-----------------------------| | Event select | IP[7] interrupt enable | Count enable<br>(U/S/K/EXL) | The count enable field specifies whether counting is to be enabled during User, Supervisor, Kernel, and/or Exception level mode. Any combination of count enable bits may be asserted. All unused bits in the performance control registers are reserved. All counting is disabled when the *ERL* bit of the CPO *Status* register is asserted. Table 13-22 defines the operation of the count enable bits of the performance control registers. Table 13-22 Count Enable Bit Definition | Count Enable Bit | Count Qualifier (CP0 Status Register Fields) | | |------------------|----------------------------------------------|--| | U | KSU = 2 (User mode), $EXL = 0$ , $ERL = 0$ | | | S | KSU = 1 (Supervisor mode), EXL = 0, ERL = 0 | | | K | KSU = 0 (Kernel mode), $EXL = 0$ , $ERL = 0$ | | | EXL | EXL = 1, $ERL = 0$ | | The following rules apply: - The performance counter registers may be preloaded with an MTC0 instruction, and counting is enabled by asserting one or more of the count enable bits in the performance control registers. - The interrupt enable bit must be asserted to cause *IP*[7]. - To determine the cause of the interrupt, the interrupt handler routine must query the following: - the performance counter register - the interrupt enable bit of the associated performance control register of both counters - If neither of the counters caused the interrupt, *IP*[7] must be the result of the CP0 *Count* register matching the CP0 *Compare* register. # **13.21 ECC Register (26)** The R10000 processor implements a 10-bit read/write *ECC* register which is used to read and write the secondary cache data ECC or the primary cache data parity bits. (Tag ECC and parity are loaded to and stored from the *TagLo* register.) Unlike the R4400, the only CacheOps that use *ECC* register are *Index Load Data* and *Index Store Data*. In the R4400, both the primary instruction and data caches are parity byte-protected. In the R10000 processor, the following protection schemes are used: - The primary instruction cache is word-protected (where one word contains 36 bits), and one parity bit is used for each instruction word (*IP* in Figure 13-23). - The primary data cache is byte-protected, with four bits used for each 32-bit data word (*DP* in Figure 13-23). - Each quadword of the secondary cache data uses nine bits of ECC and one bit of parity (SP and ECC in Figure 13-23). The primary instruction CacheOps load or store one instruction word at a time; therefore, one bit is used in the *ECC* register. The primary data CacheOps load or store four bytes at a time; therefore, four bits are used in the *ECC* register. The secondary CacheOps use **ECC[9]** as the parity bit and **ECC[8:0]** as the 9-bit ECC. For the *Index Store Data* CacheOps, the unused bits are ignored. For *Index Load Data* CacheOps, the unused a bits are with zeroes. Figure 13-23 shows the format of the *ECC* register; Table 13-23 describes the register fields. Figure 13-23 ECC Register Format Table 13-23 ECC Register Fields | Field | Description | |-------|-----------------------------------------------------------------------------------------------| | SP | A 1-bit field specifying the parity bit read from or written to a secondary cache. | | ECC | An 9-bit field specifying the ECC bits read from or written to a secondary cache. | | DP | An 4-bit field specifying the parity bits read from or written to a primary data cache. | | IP | An 1-bit field specifying the parity bit read from or written to a primary instruction cache. | | 0 | Reserved. Must be written as zeroes, and returns zeroes when read. | # 13.22 CacheErr Register (27) The *CacheErr* register is a 32-bit read-only register that handles ECC errors in the secondary cache or system interface, and parity errors in the primary caches. R10000 processor correction policy is as follows: - Parity errors cannot be corrected. - Single-bit ECC errors can be corrected by hardware without taking a Cache Error exception. - Double-bit ECC errors can be detected but not corrected by hardware. - All uncorrectable errors take Cache Error exceptions unless the *DE* bit of the *Status* register is set. - As in the R4400, cache errors are imprecise. The *CacheErr* register provides cache index and status bits which indicate the source and nature of the error; it is loaded when a Cache Error exception is taken. #### **CacheErr Register Format for Primary Instruction Cache Errors** Figure 13-24 shows the format of the *CacheErr* register when a primary instruction cache error occurs. Figure 13-24 CacheErr Register Format for Primary Instruction Cache Errors EW: set when CacheErr register is already holding the values of a previous error *D*: data array error (way1 | way0) *TA*: tag address array error (way1 | way0) TS: tag state array error (way1 | way0) PIdx: primary cache virtual block index, VA[13:6] #### Errata 0: Reserved. Must be written as zeroes, and returns zeroes when read. # CacheErr Register Format for Primary Data Cache Errors Figure 13-25 shows the format of the *CacheErr* register when a primary data cache error occurs. Figure 13-25 CacheErr Register Format for Primary Data Cache Errors EW: set when CacheErr register is already holding the values of a previous error EE: tag error on an inconsistent block D: data array error (way1 || way0) TA: tag address array error (way1 || way0) TS: tag state array error (way1 || way0) TM: tag mod array error (way1 || way0) PIdx: primary cache virtual double word index, VA[13:6] # Errata 0: Reserved. Must be written as zeroes, and returns zeroes when read. # **CacheErr Register Format for Secondary Cache Errors** Figure 13-26 shows the format of the *CacheErr* register when a secondary cache error occurs. Figure 13-26 CacheErr Register Format for Secondary Cache Errors EW: set when CacheErr register is already holding the values of a previous error D: uncorrectable data array error (way1 || way0) TA: uncorrectable tag array error (way1 || way0) *SIdx*: secondary cache physical block index (**PA[22:6**] for 16-word block size or **PA[22:7**] for 32-word block size) # Errata <u>0</u>: Reserved. Must be written as zeroes, and returns zeroes when read. # **CacheErr Register Format for System Interface Errors** Figure 13-27 shows the format of the *CacheErr* register when a System interface error occurs. Figure 13-27 CacheErr Register Format for System Interface Errors EW: set when CacheErr register is already holding the values of a previous error EE: data error on a CleanExclusive or DirtyExclusive D: uncorrectable system block data response error (way1 || way0) SA: uncorrectable system address bus error SC: uncorrectable system command bus error SR: uncorrectable system response bus error SIdx: secondary cache physical block index # Errata 0: Reserved. Must be written as zeroes, and returns zeroes when read. # 13.23 TagLo (28) and TagHi (29) Registers The TagHi and TagLo registers are 32-bit read/write registers used to hold the following: - · the primary cache tag and parity - · the secondary cache tag and ECC - the data in primary or secondary caches for certain CacheOps *TagHi/Lo* formats in the R10000 processor differ from those in the R4400 due to changes in CacheOps and cache architecture. R10000 formats depend on the type of CacheOp executed and the cache to which it is applied. The reserved fields are read as zeroes after executing an *Index Load Tag* or an *Index Load Data* CacheOp and ignored when executing an *Index Store Tag* or an *Index Store Data* CacheOp. To ensure NT kernel compatibility, the *TagLo* register is implemented as a 32-bit read/write register. The value written by an MTC0 instruction can be retrieved by a MFC0 instruction, unless an intervening CACHE instruction has modified the content. This section gives the TagLo and TagHi register formats for the following CacheOp and cache combinations: - CacheOp is Index Load/Store Tag - primary instruction cache operation - primary data cache operation - secondary cache operation - CacheOp is Index Load/Store Data - primary instruction cache operation - primary data cache operation - secondary cache operation #### CacheOp is Index Load/Store Tag This section describes the three states of the *TagLo* and *TagHi* registers, when the CacheOp is an *Index Load/Store Tag* for the following operations: - primary instruction cache operation - · primary data cache operation - · secondary cache operation <sup>†</sup> To ensure NT kernel compatibility, the TagLo register is implemented as a 32-bit read/write register. The value written by a MTC0 instruction can be retrieved by a MFC0 instruction, unless intervening CACHE instructions modify the content. #### **Primary Instruction Cache Operation** If the CacheOp is an *Index Load/Store Tag* for a primary instruction cache operation, the fields of the *TagHi* and *TagLo* registers are defined as follows: PTag0: contains physical address bits [35:12] stored in the cache tag *PState*: contains the primary instruction cache state for the line, as follows: 1 = Valid 0 = Invalid #### Errata LRU: indicates which way is the least recently used of the set. SP: state even parity bit for the PState field TP: tag even parity bit. PTag1: contains physical address bits [39:36] stored in the cache tag Figure 13-28 shows the fields of the *TagHi* and *TagLo* registers. Figure 13-28 TagHi/Lo Register Fields in Primary Instruction Cache When CacheOp is Index Load/Store Tag 0: Reserved. Must be written as zeroes, and returns zeroes when read. #### **Primary Data Cache Operation** If the CacheOp is an *Index Load/Store Tag* for primary data cache operations, the fields of the *TagHi* and *TagLo* registers are defined as follows: State Modifier: holds the status of the line, as follows: $001_2$ = neither refilled or written $010_2$ = this line may have been written and inconsistent from the secondary cache (W bit) $100_2$ = this line is being refilled (*Refill* bit). PTag1: contains physical address bits [39:36] stored in the cache tag PTag0: contains physical address bits [35:12] stored in the cache tag *PState*: together with the *Refill* bit of the *State Modifier* in the *TagHi* register, *PState* determines the state of the cache block in the primary data cache, as shown in Table 13-24. Table 13-24 PState Field Definition in TagHi/Lo Registers, For Primary Data Cache Operation When CacheOp is Index Load/Store Tag | PState | Refill=0 | Refill=1 | |------------------------|--------------------|-----------------------------------------------------------| | 002 | Invalid | Refill <i>clean</i> (block is being refilled) | | <b>01</b> <sub>2</sub> | Shared | Upgrade Share (converting shared to dirty) | | <b>10</b> <sub>2</sub> | Clean<br>Exclusive | Upgrade Clean (converting clean to dirty). | | 112 | Dirty<br>Exclusive | Refill <i>dirty</i> (block is being refilled for a store) | #### Errata *LRU*: indicates which way is the least recently used of the set. SP: state even parity bit for the PState field and the Way bit Way: indicates which secondary cache set contains the primary cache line for this tag TP: tag even parity bit. 0: Reserved. Must be written as zeroes, and returns zeroes when read. Figure 13-29 shows the fields of the *TagHi* and *TagLo* registers. Figure 13-29 TagHi/Lo Register Fields in Primary Data Cache When CacheOp is Index Load/Store Tag #### **Secondary Cache Operation** If the CacheOp is an *Index Load/Store Tag* for secondary cache operations, the fields of the *TagHi* and *TagLo* registers are defined as follows: STagO: contains physical address bits [35:18] stored in the cache tag SState: contains the secondary cache state of the line, as follows: $00_2 = Invalid$ $01_2 = Shared$ $10_2 = Clean Exclusive$ $11_2 = Dirty Exclusive$ *VIndex* (virtual index): contains only two bits of significance since the 32 Kbyte 2-way set associative primary caches are addressed using only two untranslated address bits (VA[13:12]) plus the offset within the virtual page. ECC: contains the ECC for the STag, SState and VIndex fields. #### Errata MRU: indicates which way was the most recently used in the set. STag1: contains the physical address bits [39:36] stored in the cache tag. <u>0</u>: Reserved. Must be written as zeroes, and returns zeroes when read. Figure 13-30 shows the fields of the *TagHi* and *TagLo* registers. Figure 13-30 TagHi/Lo Register Fields in Secondary Cache When CacheOp is Index Load/Store Tag #### Errata Figure 13-30, size of the STag0 field is revised. # CacheOp is Index Load/Store Data This section describes the following three states of the *TagLo* and *TagHi* registers, when the CacheOp is an *Index Load/Store Data*: - primary instruction cache operation - primary data cache operation - secondary cache operation #### **Primary Instruction Cache Operation** If the CacheOp is an *Index Load/Store Data* for the primary instruction cache, the *TagHi* register stores the most significant four bits of a 36-bit instruction, as shown in Figure 13-31; the rest of the instruction is stored in the *TagLo* register. Figure 13-31 TagHi/Lo Register Fields in Primary Instruction Cache When CacheOp is Index Load/Store Data #### Errata 0: Reserved. Must be written as zeroes, and returns zeroes when read. #### **Primary Data Cache Operation** If the CacheOp is *Index Load/Store Data* for primary data cache, the *TagHi* register is not used. The *TagLo* registers contains a 32-bit data word for the cache operation, as shown in Figure 13-32. Figure 13-32 TagHi/Lo Register Fields in Primary Data Cache When CacheOp is Index Load/Store Data #### **Secondary Cache Operation** If the CacheOp is *Index Load/Store Data* for the secondary cache, a doubleword of data is required for the CacheOp. The *TagHi* register stores the upper 32 bits of the doubleword and the *TagLo* register stores the lower 32 bits, as shown below in Figure 13-33. Figure 13-33 TagHi/Lo Register Fields in Secondary Cache When CacheOp is Index Load/Store Data #### 13.24 ErrorEPC Register (30) The *ErrorEPC* register is similar to the *EPC* register, except that *ErrorEPC* is used on ECC and parity error exceptions. It is also used to store the program counter (PC) on Reset, Soft Reset, and nonmaskable interrupt (NMI) exceptions. The read/write *ErrorEPC* register contains the virtual address at which instruction processing can resume after servicing an error. Figure 13-34 shows the format of the *ErrorEPC* register. #### **ErrorEPC Register** Figure 13-34 ErrorEPC Register Format ### 14. Floating-Point Unit This section describes the operation of the FPU, including the register definitions. The Floating-Point unit consists of the following functional units: - · add unit - · multiply unit - divide unit - · square-root unit The **add unit** performs floating-point add and subtract, compare, and conversion operations. Except for Convert Integer To Single-Precision (float), all operations have a 2-cycle latency and a 1-cycle repeat rate. The **multiply unit** performs single-precision or double-precision multiplication with a 2-cycle latency and a 1-cycle repeat rate. The **divide and square-root units** do single- or double-precision operations. They have long latencies and low repeat rates (20 to 40 cycles). #### **14.1 Floating Point Unit Operations** The floating-point add, multiply, divide, and square-root units read their operands and store their results in the floating-point register file. Values are loaded to or stored from the register file by the load/store and move units. A logic diagram of floating-point operations is shown in Figure 14-1, in which data and instructions are read from the secondary cache into the primary caches, and then into the processor. There they are decoded and appended to the floating-point queue, passed into the FP register file where each is dynamically issued to the appropriate functional unit. After execution in the functional unit, results are stored, through the register file, in the primary data cache. Figure 14-1 Logical Diagram of FP Operations The floating-point queue can issue one instruction to the adder unit and one instruction to the multiplier unit. The adder and multiplier each have two dedicated read ports and a dedicated write port in the floating-point register file. Because of their low repeat rates, the divide and square-root units do not have their own issue port. Instead, they decode instructions issued to the multiplier unit, using its operand registers and bypass logic. They appropriate a second cycle later for storing their result. When an instruction is issued, up to two operands are read from dedicated read ports in the floating-point register file. After the operation has been completed, the result can be written back into the register file using a dedicated write port. For the add and multiply units, this write occurs four cycles after its operands were read. #### 14.2 Floating-Point Unit Control The control of floating-point execution is shared by the following units: - The floating-point queue determines operand dependencies and dynamically issues instructions to the execution units. It also controls the destination registers and register bypass. - The execution units control the arithmetic operations and generate status. - The graduate unit saves the status until the instructions graduate, and then it updates the *Floating-Point Status* register. #### **★** Eliminate traps for Denorm/NaN FP inputs (R12000) The R10000 currently takes Unimplemented Exception when an FPU gets a NaN or Denorm as an input. R12000 suppresses these traps whenever the FS bit is set in the FCSR (ref. $V_R5000$ , $V_R10000$ User's Manual INSTRUCTION). R12000 simply passes through NaN's and Denorm's when the bit is set. This change in no way affects the handling of QNaNs and Denorms when they are produced, it only changes the way they are handled when they are received as input operands. Case of Denorm when the FS bit is set to 1: A Denorm received as an input to the FP unit is flushed to zero before the FP unit begins to process the operand. The behavior of the unit (when FS is 1) will be exactly that seen when the input is zero. Specifically, if the zero input would itself cause a trap (due to divide by zero, for example) then the that zero-generated trap will be taken. When a Denorm is seen at the input, the Inexact bit is set, except in the cases described below: The Inexact bit will not be set, even if FS=1 and a Denorm is seen on input, if the other input to the FP operation is a value which pre-determines the FP result (e.g. QNaN). When the result is not affected by the presence or absence of the Denorm input, the result is EXACT. Hence the Inexact bit should not be set, even if Flush to Zero mode is ON. Case of QNaNs when the FS bit is set to 1: A QNaN received as an input operand for an FP unit will cause the unit to produce the standard QNaN (which is not necessarily same as the input QNaN). Note that FP units will not propagate the QNaN to the output, but will always produce the same, standard, QNaN. When the FS bit is set to zero, the behavior will be exactly as in R10000. #### 14.3 Floating-Point General Registers (FGRs) The Floating-Point Unit is the hardware implementation of Coprocessor 1 in the MIPS IV Instruction Set Architecture. The MIPS IV ISA defines 32 logical floating-point general registers (FGRs), as shown in Figure 14-2. Each FGR is 64 bits wide and can hold either 32-bit single-precision or 64-bit double-precision values. The hardware actually contains 64 physical 64-bit registers in the Floating-Point Register File, from which the 32 logical registers are taken. FP instructions use a 5-bit logical number to select an individual FGR. These logical numbers are mapped to physical registers by the rename unit (in pipeline stage 2), before the Floating-Point Unit executes them. Physical registers are selected using 6-bit addresses. #### 32- and 64-Bit Operations The FR bit (26) in the Status register determines the number of logical floating-point registers available to the program, and it alters the operation of single-precision load/store instructions, as shown in Figure 14-2. - FR is reset to 0 for compatibility with earlier MIPS I and MIPS II ISAs, and instructions use only the 16 physical even-numbered floating-point registers (32 logical registers). Each logical register is 32 bits wide. - FR is set to 1 for normal MIPS III and MIPS IV operations, and all 32 of the 64-bit logical registers are available. Figure 14-2 Floating-Point Registers #### **Load and Store Operations** When FR = 0, floating-point load and stores operate as follows: - A doubleword load or store is handled the same as if the FR bit was set to 1, as long as the register selected is even (0, 2, 4, etc.). - If the register selected is odd, the load/store is invalid. These operations are shown in Figure 14-3. Singleword loads/stores to even and odd registers are also shown. #### FR=0 16-Register Mode #### **Doubleword Load/Store** Same as FR=1 if register is even, else invalid. # Singleword Load/Store when Register is Even 31 0 Memory† LWC1 ft,address (MTC1 ft,rs) 63 32 31 0 Unchanged Load 32-bit SWC1 ft,address (MFC1 rt,fs) 63 32 31 0 Sign extend reg. Memory† # LWC1 ft,address (MTC1 ft,rs) 63 32 31 0 SWC1 ft,address (MFC1 rt,fs) 63 32 31 0 Sign extend reg. Memory<sup>†</sup> Singleword Load/Store when Register is Odd Memory<sup>†</sup> 31 <sup>†</sup>Move to/from selects an integer register instead. Moved 32-bit data is sign-extended in 64-bit register. Figure 14-3 Loading and Storing Floating-Point Registers in 16-Register Mode **NOTE:** Move (MOV) and conditional move (MOVC, MOVN, MOVZ are included in these arithmetic operations, although no arithmetic is actually performed. <sup>&</sup>lt;sup>†</sup>Move to/from selects an integer register instead. Moved 32-bit data is sign-extended in 64-bit register. When FR = 1, floating-point load and stores operate as follows: - Single-precision operands are read from the low half of a register, leaving the upper half ignored. Single-precision results are written into the low half of the register. The high half of the result register is architecturally undefined; in the R10000 implementation, it is set to zero. - Double-precision arithmetic operations use the entire 64-bit contents of each operand or result register. Because of register renaming, every new result is written into a temporary register, and conditional move instructions select between a new operand and the previous old value. The high half of the destination register of a single-precision conditional move instruction is undefined (shown in Figure 14-5), even if no move occurs. Singleword and doubleword loads and stores with the FPU in 32-register mode (FR=1) are shown in Figure 14-4. #### Doubleword Load/Store Singleword Load/Store Memory<sup>†</sup> (or 64-bit register) Memory<sup>†</sup> zero (dup) LDC1 ft,address (DMTC1 ft,rs) LWC1 ft,address (MTC1 ft,rs) 63 63 32 31 Load 64-bit Value Undefined 32-bit Value SDC1 ft,address (DMFC1 rt,fs) SWC1 ft,address (MFC1 rt,fs) Sign extend reg. Memory<sup>†</sup> (or 64-bit register) Memory<sup>†</sup> FR=1 32-Register Mode Figure 14-4 Loading and Storing Floating-Point Registers in 32-Register Mode <sup>†</sup>Move to/from selects an integer register instead. Moved 32-bit data is sign-extended in 64-bit register. Doubleword load, store and move to/from instructions load or store an entire 64-bit floating-point register, as shown in Figure 14-5. In MIPS 1 and II ISA, arithmetic operations are valid only for even-numbered registers. Figure 14-5 Operators on Floating-Point Registers In MIPS I and MIPS II ISAs, all arithmetic instructions, whether single- or double-precision, are limited to using even register numbers. Load, store and move instructions transfer only a single word. Even and odd register numbers are used to access the low and high halves, respectively, of double-precision registers. When storing a floating-point register (SWC1 or MFC1), the processor reads the entire register but writes only the selected half to memory or to an integer register. Because the register renaming scheme creates a new physical register for every destination, it is not sufficient just to enable writing half of the Floating-Point register file when loading (LWC1 or MTC1); the unchanged half must also be copied into the destination. This old value is read using the shared read port, it is then merged with the new word, and the merged doubleword value is written. (A write to the register file writes all 64 bits in parallel.) When instructions are renamed in MIPS I or II, the low bit of any FGR field is forced to zero. Thus, each even/odd logical register number pair is treated as an even-numbered double-precision register. Odd numbered logical registers are not used in the mapping tables and dependency logic, but they remain mapped to their latest physical registers. #### 14.4 Floating-Point Control Registers The MIPS IV ISA permits up to 32 control registers to be defined for each coprocessor, but the Floating-Point Unit uses only two: - Control register 0, the FP *Implementation and Revision* register - Control register 31, the *Floating-Point Status* register (*FSR*) #### Floating-Point Implementation and Revision Register The following fields are defined for control register 0 in Coprocessor 1, the FP *Implementation and Revision* register, as shown in Figure 14-6: - The *Implementation* field holds an 8-bit number, 0x09, which identifies the R10000 implementation of the floating point coprocessor. - The *Revision* field is an 8-bit number that defines a particular revision of the floating point coprocessor. Since it can be arbitrarily changed, it is not defined here. #### Implementation and Revision Register Figure 14-6 FP Implementation and Revision Register Format #### Floating-Point Status Register (FSR) Figure 14-7 shows the Floating-Point *Status* register (FSR), control register 31 in Coprocessor 1. It is implemented in the graduation unit rather than the Floating-Point Unit, because it is closely tied to the active list. Bits 22:18 are unimplemented and must be set to zero. All other bits may be read or written using Control Move instructions from or to Coprocessor 1 (subfunctions CFC1 or CTC1). These move instructions are fully interlocked; they are delayed in the decode stage until all previous instructions have been graduated, and no subsequent instruction is decoded until they have been completed. #### **FP Status Register** Condition bits are True/False values set by floating-point compare instructions. Flush (FS) bit: 0: A denormalized result causes an Unimplemented Operation exception. 1: A denormalized result is replaced with zero. No exception is flagged. Cause bits indicate the status of each floating-point arithmetic instruction. (Not by load, store, or move.) Enable bits enable an exception if the corresponding Cause bit is set. Flag bits are set whenever the corresponding Cause bit is a 1. These bits are cumulative. Once a bit is set, it remains set until the FSR is written by a CTC1 instruction. Unimplemented operation. This exception is always enabled. IEEE 754 Exception bits: The following bits may be individually enabled: > V Invalid operation. *Z O* Division by zero. (Divide unit only.) Overflow. U Underflow. Inexact operation. (Result can not be stored precisely.) Round Mode (RM): (IEEE specification) 0: RN, Round to nearest representable value. If two values are equally near, set the lowest bit to zero. - 1: RZ, Round toward Zero. Round to the closest value whose magnitude is not greater than the result. - 2: RP, Round to Plus Infinity. Round to the closest value whose magnitude is not less than the result. - 3: *RM*, Round to Minus Infinity. Round to the closest value whose magnitude is not greater. Figure 14-7 Floating-Point Status Register (FSR) #### Bit Descriptions of the FSR Description of the bits in the FSR are as follows: Condition Bits [31:25,23]: The Condition bits indicate the result of floating-point compare instructions. The active list keeps track of these bits. Cause Bits [17:12]: Each functional unit can detect exceptional cases in their function codes, operands, or results. These cases are indicated by setting one of six specific Cause bits. The Cause bits indicate the status of the floating-point arithmetic instruction which graduated most recently or caused an exception to be taken. The FSR is not modified by load, store, or move instructions. All cause bits, except E, have corresponding Enable and Flag bits in the FSR. - E *Unimplemented operation*: the execution unit does not perform the specified operation. This exception is always enabled. - V *Invalid operation*: this operation is not valid for the given operands. - Z *Division by zero*: (divide unit only) the result of division by zero is not defined. - O *Overflow*: the result is too large in magnitude to be correctly represented in the result format. - U *Underflow*: the result is too small in magnitude to be correctly represented in the result format. - I *Inexact Result*: the result cannot be represented exactly. **NOTE:** The *FSR* is modified only for instructions issued by the floating-point queue. Move From (MFC or DMFC) instructions never set the *Cause* field; status bits from the functional unit (multiplier) must be ignored. Move or Move Conditional instructions can set the Unimplemented Operation exception only in the *Cause* field. Load and store instructions are issued by the address queue.) The functional units generate the *Cause* bits and send them to the graduation unit when the operation is completed. *Enable Bits* [11:7]: The five *Enable* bits individually enable (when set to a 1) or disable (when set to a 0) exceptions when the corresponding *Cause* bit is set. Flag Bits [6:2]: One of the five Flag bits is set when a floating-point arithmetic instruction graduates, if the corresponding Cause bit is set. The Flag bits are sticky and remain set until the FSR is written. Thus, the Flag bits indicate the status of all floating-point instructions graduated since the FSR was last written. The Flag bits are not modified for any instructions which cause an exception to be taken. Round Mode [1:0]: RM bits select one of the four IEEE rounding modes. Most floating-point results cannot be precisely represented by the 32-bit or 64-bit register formats, and must be truncated and rounded to a representable value. The modes selected by the RM bit values are: - 0: RN, round to nearest representable value. If two values are equally near, set the lowest bit to zero. - 1: RZ, round toward zero. Round to the closest value whose magnitude is not greater than the result. - 2: *RP*, round to plus infinity. Round to the closest value whose magnitude is not less than the result. - 3: *RM*, round to minus infinity. Round to the closest value whose magnitude is not greater. The *Round* and *Enable* bits only change when the *FSR* is written by a CTC1 (Move To Coprocessor 1 Control Register) instruction. Each CTC1 instruction is executed sequentially, after all previous floating-point instructions have been completed, so these *FSR* bits do not change while any floating-point instruction is active. These bits are broadcast from the graduation unit to all the floating-point functional units. When a *Cause* bit is set and its corresponding *Enable* bit is also set, an exception is taken on the instruction. The result of the instruction is not stored, and the *Flag* bits are not changed. If no exception is taken, the corresponding *Flag* bits are set. The *Cause* and *Flag* bits may be read or written. If a CTC1 instruction sets both a *Cause* bit and its *Enable* bit, an exception is taken immediately. The *FSR* is written, but the exception is reported on the move instruction. #### Loading the FSR The *FSR* may be loaded from an integer register by a CTC1 instruction which selects control register 31. This instruction is executed serially; that is, it is delayed during decode until the entire pipeline has emptied, and it is completed before the next instruction is decoded. This instruction writes all *FSR* bits. If any *Cause* bit and its corresponding *Enable* bit are both set, an exception is taken after FSR has been modified. The CTC1 instruction is aborted; it does not graduate, even though it has changed the processor state. #### [MEMO] # 15. Memory Management This section describes the R10000 processor memory management, including: - processor modes and exceptions - virtual address space - virtual address translation #### 15.1 Processor Modes The R10000 has three operating modes and two addressing modes. All are described in this section. #### **Processor Operating Modes** The three operating modes are listed in order of decreasing system privilege: - **Kernel mode** (highest system privilege): can access and change any register. The innermost core of the operating system runs in kernel mode. - **Supervisor mode**: has fewer privileges and is used for less critical sections of the operating system. - **User mode** (lowest system privilege): prevents users from interfering with one another. Selection between the three modes can be made by the operating system (when in Kernal mode) by writing into *Status* register's *KSU* field. The processor is forced into Kernel mode when the processor is handling an error (the *ERL* bit is set) or an exception (the *EXL* bit is set). Table 15-1 shows the selection of operating modes with respect to the *KSU*, *EXL* and *ERL* bits. Table 15-1 also shows how different instruction sets and addressing modes are enabled by the *Status* register's *XX*, *UX*, *SX* and *KX* bits. A dash ("-") in this table indicates a "don't care." For detailed information on the address spaces available in each mode, refer to section titled, "Virtual Address Space," in this chapter. The R10000 processor was designed for use with the MIPS IV ISA; however, for compatibility with earlier machines, the useable ISAs can be limited to either MIPS III or MIPSI/II. | XX<br>31 | KX<br>7 | SX<br>6 | UX<br>5 | KSU<br>4:3 | ERL<br>2 | EXL<br>1 | Description | ISA <sup>‡</sup><br>III | ISA*<br>IV | Addressing Mode<br>32-Bit/64-Bit | |----------|---------|---------|---------|------------|----------|----------|------------------|-------------------------|------------|----------------------------------| | 0 | * | - | 0 | 10 | 0 | 0 | | No | No | 32 | | 1 | - | - | 0 | 10 | 0 | 0 | User mode. | No | Yes | 32 | | 0 | - | - | 1 | 10 | 0 | 0 | Oser mode. | Yes | No | 64 | | 1 | - | - | 1 | 10 | 0 | 0 | | Yes | Yes | 64 | | - | - | 0 | - | 01 | 0 | 0 | Supervisor mode | No | Yes | 32 | | - | - | 1 | - | 01 | 0 | 0 | Supervisor mode. | Yes | Yes | 64 | | - | 0 | - | - | 00 | 0 | 0 | Kernel mode. | Yes | Yes | 32 | | - | 1 | - | - | 00 | 0 | 0 | Kerner mode. | Yes | Yes | 64 | | - | 0 | - | - | - | 0 | 1 | Expantion Lavel | Yes | Yes | 32 | | - | 1 | - | - | - | 0 | 1 | Exception Level | Yes | Yes | 64 | | - | 0 | - | - | - | 1 | X | Error Level. | Yes | Yes | 32 | | - | 1 | - | - | - | 1 | X | EHOI LEVEI. | Yes | Yes | 64 | Table 15-1 Processor Modes <sup>‡</sup> No means the ISA is disabled; Yes means the ISA is enabled. <sup>\*</sup> Dashes (-) are "don't care." #### **Addressing Modes** The processor's *addressing mode* determines whether it generates 32-bit or 64-bit memory addresses. Refer to Table 15-1 for the following addressing mode encodings: - In Kernel mode the *KX* bit allows 64-bit addressing; all instructions are always valid. - In Supervisor mode, the *SX* bit allows 64-bit addressing and the MIPS III instructions. MIPS IV ISA is enabled all the time in Supervisor mode. - In User mode, the *UX* bit allows 64-bit addressing and the MIPS III instructions; the *XX* bit allows the new MIPS IV instructions. #### 15.2 Virtual Address Space The processor uses either 32-bit or 64-bit address spaces, depending on the operating and addressing modes set by the *Status* register. Table 15-1 lists the decoding of these modes. The processor uses the following addresses: - virtual address VA[43:0] - region bits **VA**[**63:59**] If a region is **mapped**, virtual addresses are translated in the TLB. Bits **VA[58:44]** are not translated in the TLB and are sign extensions of bit **VA[43]**. In both 32-bit and 64-bit address mode, the memory address space is divided into many regions, as shown in Figure 15-3. Each region has specific characteristics and uses. The user can access only the *useg* region in 32-bit mode, or *xuseg* in 64-bit mode, as shown in Figure 15-1. The supervisor can access user regions as well as *sseg* (in 32-bit mode) or *xsseg* and *csseg* (in 64-bit mode), shown in Figure 15-2. The kernel can access all regions except those restricted because bits **VA[58:44]** are not implemented in the TLB, as shown in Figure 15-3. The R10000 processor follows the R4400 implementation for *data* references only, ensuring compatibility with the NT kernel. If any of the upper 33 bits are nonzero for an instruction fetch, an Address Error is generated. Refer to Table 15-2 for delineation of the address spaces. #### **User Mode Operations** In User mode, a single, uniform virtual address space—labelled User segment—is available; its size is: - 2 Gbytes (2<sup>31</sup> bytes) in 32-bit mode (*useg*) - 16 Tbytes (2<sup>44</sup> bytes) in 64-bit mode (*xuseg*) Figure 15-1 shows User mode virtual address space. Figure 15-1 User Mode Virtual Address Space The User segment starts at address 0 and the current active user process resides in either *useg* (in 32-bit mode) or *xuseg* (in 64-bit mode). The TLB identically maps all references to *useg/xuseg* from all modes, and controls cache accessibility. #### 32-bit User Mode (useg) In User mode, when UX = 0 in the *Status* register, User mode addressing is compatible with the 32-bit addressing model shown in Figure 15-1, and a 2-Gbyte user address space is available, labelled *useg*. All valid User mode virtual addresses have their most-significant bit cleared to 0; any attempt to reference an address with the most-significant bit set while in User mode causes an Address Error exception. The system maps all references to *useg* through the TLB, and bit settings within the TLB entry for the page determine the cacheability of a reference. #### 64-bit User Mode (xuseg) In User mode, when UX = 1 in the *Status* register, User mode addressing is extended to the 64-bit model shown in Figure 15-1. In 64-bit User mode, the processor provides a single, uniform virtual address space of $2^{44}$ bytes, labelled *xuseg*. All valid User mode virtual addresses have bits 63:44 equal to 0; an attempt to reference an address with bits 63:44 not equal to 0 causes an Address Error exception. Although the system may be in 32-bit mode, address logic still generates 64-bit values. In this case the high 32 bits must equal the sign bit (31), or an Address Error exception is taken. #### **Supervisor Mode Operations** Supervisor mode is designed for layered operating systems in which a true kernel runs in processor Kernel mode, and the rest of the operating system runs in Supervisor mode. The processor operates in Supervisor mode when the *Status* register contains the Supervisor-mode bit-values shown in Table 15-1. Figure 15-2 shows Supervisor mode address mapping. Figure 15-2 Supervisor Mode Address Space #### 32-bit Supervisor Mode, User Space (suseg) In Supervisor mode, when SX = 0 in the *Status* register and the most-significant bit of the 32-bit virtual address is set to 0, the *suseg* virtual address space is selected; it covers the full $2^{31}$ bytes (2 Gbytes) of the current user address space. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. This mapped space starts at virtual address 0x0000 0000 and runs through 0x7FFF FFFF. #### 32-bit Supervisor Mode, Supervisor Space (sseg) In Supervisor mode, when SX = 0 in the *Status* register and the three most-significant bits of the 32-bit virtual address are $110_2$ , the *sseg* virtual address space is selected; it covers $2^{29}$ -bytes (512 Mbytes) of the current supervisor address space. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. This mapped space begins at virtual address 0xC000 0000 and runs through 0xDFFF FFFF. #### 64-bit Supervisor Mode, User Space (xsuseg) In Supervisor mode, when SX = 1 in the *Status* register and bits 63:62 of the virtual address are set to $00_2$ , selection of the *xsuseg* virtual address space is dependent upon the *UX* bit. - if UX = 1, the entire space from 0x0000 0000 0000 0000 through 0000 0FFF FFFF FFFF (16 Tbytes) is selected. - If UX = 0, the address space 0x0000 0000 0000 0000 through 0000 0000 7FFF FFFF (2 Gbytes) is selected. Addressing the space ranging from 0000 0000 8000 0000 through 0000 0FFF FFFF FFFF will cause an address error. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. #### 64-bit Supervisor Mode, Current Supervisor Space (xsseg) In Supervisor mode, when SX = 1 in the *Status* register and bits 63:62 of the virtual address are set to $01_2$ , the *xsseg* current supervisor virtual address space is selected. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. This mapped space begins at virtual address 0x4000 0000 0000 0000 and runs through 0x4000 0FFF FFFF FFFF. #### 64-bit Supervisor Mode, Separate Supervisor Space (csseg) In Supervisor mode, when SX = 1 in the *Status* register and bits 63:62 of the virtual address are set to $11_2$ , the *csseg* separate supervisor virtual address space is selected. Addressing of the *csseg* is compatible with addressing *sseg* in 32-bit mode. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. This mapped space begins at virtual address 0xFFFF FFFF C000 0000 and runs through 0xFFFF FFFF DFFF FFFF. #### **Kernel Mode Operations** The processor operates in Kernel mode when the *Status* register contains the Kernel-mode bit-values shown in Table 15-1. Kernel mode virtual address space is divided into regions differentiated by the high-order bits of the virtual address, as shown in Figure 15-3. #### 64-bit Figure 15-3 Kernel Mode Address Space **NOTE:** If *ERL* = 1, the selected 2 Gbyte space becomes uncached and unmapped. #### 32-bit Kernel Mode, User Space (kuseg) In Kernel mode, when KX = 0 in the *Status* register, and the most-significant bit of the virtual address, A31, is cleared, the 32-bit *kuseg* virtual address space is selected; it covers the full $2^{31}$ bytes (2 Gbytes) of the current user address space. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. #### 32-bit Kernel Mode, Kernel Space 0 (kseg0) In Kernel mode, when KX = 0 in the *Status* register and the most-significant three bits of the virtual address are $100_2$ , 32-bit $kseg\theta$ virtual address space is selected; it is the $2^{29}$ -byte (512-Mbyte) kernel physical space. References to $kseg\theta$ are not mapped through the TLB; the physical address is selected by subtracting $0x8000\ 0000$ from the virtual address. The $K\theta$ field of the Config register determines cacheability and coherency. #### 32-bit Kernel Mode, Kernel Space 1 (kseg1) In Kernel mode, when KX = 0 in the *Status* register and the most-significant three bits of the 32-bit virtual address are $101_2$ , 32-bit ksegI virtual address space is selected; it is the $2^{29}$ -byte (512-Mbyte) kernel physical space. References to *kseg1* are not mapped through the TLB; the physical address is selected by subtracting 0xA000 0000 from the virtual address. Caches are disabled for accesses to these addresses, and physical memory (or memory-mapped I/O device registers) are accessed directly. #### 32-bit Kernel Mode, Supervisor Space (ksseg) In Kernel mode, when KX = 0 in the *Status* register and the most-significant three bits of the 32-bit virtual address are $110_2$ , the *ksseg* virtual address space is selected; it is the current $2^{29}$ -byte (512-Mbyte) supervisor virtual space. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. References to ksseg are mapped through the TLB. #### 32-bit Kernel Mode, Kernel Space 3 (kseg3) In Kernel mode, when KX = 0 in the *Status* register and the most-significant three bits of the 32-bit virtual address are $111_2$ , the kseg3 virtual address space is selected; it is the current $2^{29}$ -byte (512-Mbyte) kernel virtual space. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. References to *kseg3* are mapped through the TLB. #### 64-bit Kernel Mode, User Space (xkuseg) In Kernel mode, when KX = 1 in the *Status* register and bits 63:62 of the 64-bit virtual address are $00_2$ , selection of the *xkuseg* virtual address space is dependent upon the *UX* and *ERL* bits. - if UX = 1 and ERL = 0, the entire space from $0x0000\ 0000\ 0000\ 0000\ 0000$ through $0000\ 0FFF\ FFFF\ (16\ Tbytes)$ is selected. - If UX = 0 or ERL = 1, the address space 0x0000 0000 0000 0000 through 0000 0000 7FFF FFFF (2 Gbytes) is selected. Addressing the space ranging from 0000 0000 8000 0000 through 0000 0FFF FFFF FFFF will cause an address error. Moreover, if ERL=1, the selected 2-Gbyte address space becomes unmapped and uncached. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. #### 64-bit Kernel Mode, Current Supervisor Space (xksseg) In Kernel mode, when KX = 1 in the *Status* register and bits 63:62 of the 64-bit virtual address are $01_2$ , selection of the *xksseg* virtual address space is dependent upon the *SX* bit. - if SX = 1, the entire space from 0x4000 0000 0000 0000 through 4000 0FFF FFFF FFFF (16 Tbytes) is selected. - If SX = 0, access to any address in the space ranging from $0x4000\ 0000\ 0000$ through 4000 0FFF FFFF FFFF causes an address error. The virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address. #### 64-bit Kernel Mode, Physical Spaces (xkphys) In Kernel mode, when KX = 1 in the *Status* register and bits 63:62 of the 64-bit virtual address are $10_2$ , the *xkphys* virtual address space is selected; it is a set of eight kernel physical spaces. Each kernel physical space contains either one or four $2^{40}$ -byte physical pages. References to this space are not mapped; the physical address selected is taken directly from bits 39:0 of the virtual address. Bits 61:59 of the virtual address specify the *cache algorithm*, described in Chapter 4, the section titled "Cache Algorithms." If the cache algorithm is either uncached or uncached accelerated (values of 2 or 7) the space contains four physical pages; access to addresses whose bits 56:40 are not equal to 0 cause an Address Error exception. Address bits 58:57 carry the *uncached attribute* (described in Chapter 6, the section titled "Support for Uncached Attribute"), and are not checked for address errors. If the cache algorithm is neither uncached nor uncached accelerated, the space contains a single physical page, as on the R4400 processor. In this case, access to addresses whose bits 58:40 are not equal to a zero cause an Address Error exception, as shown in Figure 15-4. | Address Error | FFFFFFF | BFFFFFFF | 0X | |---------------------------|-----------------|-----------------|-----| | Audiess Lilui | 0 0 0 0 0 0 0 0 | B E 0 0 0 1 0 0 | 0 X | | Uncached Accelerated | FFFFFFF | BE0000FF | 0 X | | Officacifed Accelerated | 0 0 0 0 0 0 0 0 | B E 0 0 0 0 0 0 | 0 X | | Address Error | FFFFFFF | BDFFFFFF | 0X | | Addic33 Elloi | 0 0 0 0 0 0 0 0 | B C 0 0 0 1 0 0 | 0X | | Uncached Accelerated | FFFFFFF | BC0000FF | 0X | | Chicachica 7 (coolchatea | 0 0 0 0 0 0 0 0 | BC000000 | 0 X | | Address Error | FFFFFFF | BBFFFFFF | 0 X | | Address Eller | 0 0 0 0 0 0 0 0 | BA000100 | 0 X | | Uncached Accelerated | FFFFFFF | BA0000FF | 0 X | | <u> </u> | 0 0 0 0 0 0 0 0 | BA000000 | 0X | | Address Error | FFFFFFF | B9FFFFFF | 0 X | | 7.000.000 2.101 | 0 0 0 0 0 0 0 0 | B 8 0 0 0 1 0 0 | 0X | | Uncached Accelerated | FFFFFFF | B 8 0 0 0 0 F F | 0 X | | | 0 0 0 0 0 0 0 0 | B 8 0 0 0 0 0 0 | 0 X | | Address Error | FFFFFFF | B7FFFFF | 0 X | | | 00000000 | B0000100 | 0 X | | Reserved <sup>‡</sup> | FFFFFFF | B00000FF | 0 X | | | 00000000 | B 0 0 0 0 0 0 0 | | | Address Error | FFFFFFF | AFFFFFFF | 0X | | | 00000000 | A 8 0 0 0 1 0 0 | | | Cacheable Exclusive Write | FFFFFFF | A 8 0 0 0 0 F F | | | | 0 0 0 0 0 0 0 | 0 0 0 0 0 8 A | | | Address Error | FFFFFFF | A7FFFFF | 0 X | | | 0 0 0 0 0 0 0 0 | A 0 0 0 0 1 0 0 | 0 X | | Cacheable Exclusive | FFFFFFF | A 0 0 0 0 0 F F | | | | 00000000 | A0000000 | 0 X | | Address Error | FFFFFFF | 9 F F F F F F | 0 X | |-----------------------|-----------------------|-----------------|-----| | Address Life | 0 0 0 0 0 0 0 0 | 98000100 | 0X | | Cacheable Noncoherent | FFFFFFF | 980000FF | 0 X | | Cacheable Nonconerent | 00000000 | 98000000 | 0X | | Address Error | FFFFFFF | 97FFFFF | 0X | | Address Elloi | 0 0 0 0 0 0 0 0 | 96000100 | 0 X | | Uncached | FFFFFFF | 960000FF | 0X | | Officached | 0 0 0 0 0 0 0 0 | 96000000 | 0X | | Address Error | FFFFFFF | 9 5 F F F F F F | 0 X | | Address Elloi | 00000000 | 94000100 | 0X | | Uncached | FFFFFFF | 9 4 0 0 0 0 F F | 0X | | Uncached | 00000000 | 9 4 0 0 0 0 0 0 | 0 X | | Address Error | FFFFFFF | 93FFFFFF | 0X | | Address Elloi | 00000000 | 92000100 | 0X | | Uncached | FFFFFFF | 9 2 0 0 0 0 F F | 0X | | Unicached | 0 0 0 0 0 0 0 0 | 9 2 0 0 0 0 0 0 | 0 X | | Address Error | FFFFFFF | 91FFFFF | 0X | | Address Life | 0 0 0 0 0 0 0 0 | 90000100 | 0X | | Uncached | FFFFFFF | 900000FF | 0X | | Officached | 0 0 0 0 0 0 0 0 | 90000000 | 0X | | Address Error | FFFFFFF | 8 F F F F F F F | 0X | | Address Life | 0 0 0 0 0 0 0 0 | 88000100 | 0X | | Reserved* | FFFFFFF | 8 8 0 0 0 0 F F | 0X | | I/G9GI VGU | 0 0 0 0 0 0 0 0 | 8 8 0 0 0 0 0 0 | 0X | | Address Error | FFFFFFF | 8 7 F F F F F F | 0X | | Addiess Ellol | 0 0 0 0 0 0 0 0 | 80000100 | 0X | | Reserved* | FFFFFFFF | 800000FF | 0X | | Reserveu | $0\ 0\ 0\ 0\ 0\ 0\ 0$ | 8 0 0 0 0 0 0 0 | 0 X | Figure 15-4 xkphys Virtual Address Space <sup>‡</sup> Accessing a reserved space results in undefined behavior. #### 64-bit Kernel Mode, Kernel Space (xkseg) In Kernel mode, when KX = 1 in the *Status* register and bits 63:62 of the 64-bit virtual address are $11_2$ , the address space selected is one of the following: - kernel virtual space, xkseg, the current kernel virtual space; the virtual address is extended with the contents of the 8-bit ASID field to form a unique virtual address - one of the four 32-bit kernel mode compatibility spaces (described below). #### 64-bit Kernel Mode, Compatibility Spaces (ckseg1:0, cksseg, ckseg3) In Kernel mode, when KX = 1 in the *Status* register, bits 63:62 of the 64-bit virtual address are $11_2$ , and bits 61:31 of the virtual address equal -1, the lower two bytes of address, as shown in Figure 15-3, select one of the following 512-Mbyte compatibility spaces. - *ckseg0*. This 64-bit virtual address space is an unmapped region, compatible with the 32-bit address model *kseg0*. The *K0* field of the *Config* register controls cacheability and coherency. - *ckseg1*. This 64-bit virtual address space is an unmapped and uncached region, compatible with the 32-bit address model *kseg1*. - *cksseg*. This 64-bit virtual address space is the current supervisor virtual space, compatible with the 32-bit address model *ksseg*. - *ckseg3*. This 64-bit virtual address space is kernel virtual space, compatible with the 32-bit address model *kseg3*. #### Address Space Access Privilege Differences Between the R4400 and R10000 In the R4400, the 64-bit Supervisor mode can access the entire *xsuseg* space, and the 64-bit Kernel mode can access the entire *xksseg* and *xkuseg* spaces. Access privileges in the R10000 are also dependent on the *UX* and *SX* bits: - Access to the 64-bit user space in 64-bit Supervisor or Kernel mode (*xsuseg* or *xkuseg*) is controlled by the *UX* bit. If *UX*=0, the 64-bit Supervisor and Kernel modes can only access the 32-bit user space (*suseg* or *kuseg*). - Access to the 64-bit supervisor space in Kernel mode (*xksseg*) is controlled by the SX bit. If SX=0, the 64-bit Kernel mode can only access the 32-bit supervisor space (*ksseg*). An Address Error exception is taken on an illegal access. The R10000 processor implements the same access privileges for 32-bit processor modes as in the R4400. The Table 15-2 summarizes the access privileges for all processor modes in the R10000 processor. Table 15-2 Access Privileges for User, Supervisor and Kernel Mode Operations | 64-bit | | 32-bit Mode | | 64-bit Mode | | | | | | |-----------------------------------------------|-------------------|-------------|---------|-------------|--------------------|--------------------|--------------------|----|----| | Virtual Address | User <sup>‡</sup> | Supervisor | Kernel | User | Supervisor | Kernel &<br>ERL=0 | Kernel &<br>ERL=1 | | | | FFFFFFFF E0000000 TO FFFFFFFF FFFFFFFF | | AddrErr | | | AddrErr | OK | OK | | | | FFFFFFFF C0000000<br>TO<br>FFFFFFFF DFFFFFFF | | OK | OK | | OK | | | | | | FFFFFFF A0000000 TO FFFFFFFF BFFFFFFF | | | OK | | AddrErr | | | | | | FFFFFFFF 80000000 TO FFFFFFFFF 9FFFFFFF | | | | | | | | | | | C0000FFF 00000000<br>TO<br>FFFFFFFF 7FFFFFFF | | | | AddrErr | | AddrErr | AddrErr | | | | C0000000 00000000<br>TO<br>C0000FFE FFFFFFF | AddrErr | | | nauti | | OK | OK | | | | 80000000 000000000 TO BFFFFFFF FFFFFFF | | | | AddrErr | | | | OK | OK | | 40001000 000000000<br>TO<br>7FFFFFFF FFFFFFFF | | | AddrErr | | | AddrErr | AddrErr | | | | 40000000 000000000<br>TO<br>40000FFF FFFFFFFF | | | | | OK | AddrErr if<br>SX=0 | AddrErr if<br>SX=0 | | | | 00001000 000000000<br>TO<br>3FFFFFFF FFFFFFFF | | | | | AddrErr | AddrErr | AddrErr | | | | 00000000 80000000<br>TO<br>00000FFF FFFFFFFF | | | | OK | AddrErr if<br>UX=0 | AddrErr if<br>UX=0 | AddrErr | | | | 00000000 000000000<br>TO<br>00000000 7FFFFFF | OK | OK | OK | OK . | OK | OK | OK | | | <sup>‡</sup> For data references, the upper 32 bits of the virtual addresses are cleared before checking access privilege and TLB translation. #### 15.3 Virtual Address Translation Programs can operate using either physical or virtual memory addresses: - · physical addresses correspond to hardware locations in main memory - virtual addresses are logical values only, and do not correspond to fixed hardware locations Virtual addresses must first be **translated** (finding the physical address at which the virtual address points) before main memory can be accessed. This translation is essential for multitasking computer systems, because it allows the operating system to load programs anywhere in main memory independent of the logical addresses used by the programs. This translation also implements a memory protection scheme, which limits the amount of memory each program may access. The scheme prevents programs from interfering with the memory used by other programs or the operating system. #### Errata #### **Virtual Pages** Translated virtual addresses retrieve data in blocks, which are called **pages**. In the R10000 processor, the size of each page may be selected from a range that runs from 4 Kbytes to 16 Mbytes inclusive, in powers of 4 (that is, 4 Kbytes, 16 Kbytes, 64 Kbytes, etc.). The virtual address bits which select a page (and thus are translated) are called the *page address*. The lower bits which select a byte within the selected page are called the *offset* and are not translated. The number of offset bits varies from 12 to 24 bits, depending on the page size. #### **Virtual Page Size Encodings** Page size is defined in each TLB entry's *PageMask* field. This field is loaded or read using the *PageMask* register, as described in Chapter 13, *PageMask Register* (5). Each entry translates a pair of physical pages. The low bit of the virtual address page is not compared, because it is used to select between these two physical pages. #### Using the TLB Translations are maintained by the operating system, using page tables in memory. A subset of these translations are loaded into a hardware buffer called the **translation-lookaside buffer** or TLB. The contents of this buffer are maintained by the operating system; if an instruction needs a translation which is not already in the buffer, an exception is taken so the operating system can compute and load the needed translation. If all the necessary translations are present, the program is executed without any delays. The TLB contains 64 entries, each of which maps a pair of virtual pages. Formats of TLB entries are shown in Figure 15-5. #### **Cache Algorithm Field** The *Cache Algorithm* fields of the TLB, *EntryLo0*, *EntryLo1*, and *Config* registers indicate how data is cached. Cache algorithms are described in Chapter 4, *Cache Algorithms*. #### Format of a TLB Entry Figure 15-5 shows the TLB entry formats for both 32- and 64-bit modes. Each field of an entry has a corresponding field in the *EntryHi*, *EntryLo0*, *EntryLo1*, or *PageMask* registers, as shown in Chapter 13, *Coprocessor 0*; for example the *PFN* and uncached attribute (*UC*) fields of the TLB entry are also held in the *EntryLo* registers. Figure 15-5 Format of a TLB Entry #### **Address Translation** Because a 64-bit address is unnecessarily large, only the low 44 address bits are translated. The high two virtual address bits (bits 63:62) select between user, supervisor, and kernel address spaces. The intermediate address bits (61:44) must either be all zeros or all ones, depending on the address region. The TLB does not include virtual address bits 61:59, because these are decoded only in the *xkphys* region, which is unmapped. For data cache accesses, the **joint TLB** (JTLB) translates addresses from the address calculate unit. For instruction accesses, the JTLB translates the PC address if it misses in the instruction TLB (ITLB). That entry is copied into the ITLB for subsequent accesses. The ITLB is transparent to system software. #### **Address Space Identification (ASID)** Each independent task, or *process*, has a separate address space, assigned a unique 8-bit Address Space Identifier (**ASID**). This identifier is stored with each TLB entry to distinguish between entries loaded for different processes. The ASID allows the processor to move from one process to another (called a **context switch**) without having to invalidate TLB entries. The processor's current ASID is stored in the low 8 bits of the *EntryHi* register. These bits are also used to load the *ASID* field of an entry during TLB refill. The ASID field of each TLB entry is compared to the *EntryHi* register; if the ASIDs are equal or if the entry is global (see below), this TLB entry may be used to translate virtual addresses. The ASID comparison is performed only when a new value is loaded into the *EntryHi* register; the one-bit result of the match is stored in a static Enable latch. (This bit is set whenever a new entry is loaded.) #### Global Processes (G) A translation may be defined as *global* so that it can be shared by all processes. This *G* bit is set in the TLB entry and enables the entry independent of its ASID value. #### **Avoiding TLB Conflict** Setting the *TS* bit in the *Status* register indicates an entry being presented to the TLB matches more than one virtual page entry in the TLB. Any TLB entries that allow multiple matches, even in the *Wired* area, are invalidated before the new entry can be written into the TLB. This prevents multiple matches during address translation. ## 16. CPU Exceptions This chapter describes the processor exceptions—a general view of the cause and return of an exception, exception vector locations, and the types of exceptions that are supported, including the cause, processing, and servicing of each exception. #### 16.1 Causing and Returning from an Exception When the processor takes an exception, the *EXL* bit in the *Status* register is set to 1, which means the system is in Kernel mode. After saving the appropriate state, the exception handler typically changes the *KSU* bits in the *Status* register to Kernel mode and resets the *EXL* bit back to 0. When restoring the state and restarting, the handler restores the previous value of the *KSU* field and sets the *EXL* bit back to 1. Returning from an exception also resets the EXL bit to 0 (see the ERET instruction in $V_R5000$ , $V_R10000$ User's Manual INSTRUCTION). #### **16.2 Exception Vector Locations** The Cold Reset, Soft Reset, and NMI exceptions are always vectored to the dedicated Cold Reset exception vector at an uncached and unmapped address. Addresses for all other exceptions are a combination of a *vector offset* and a *base address*. The boot-time vectors (when BEV = 1 in the Status register) are at uncached and unmapped addresses. During normal operation (when BEV = 0) the regular exceptions have vectors in cached address spaces; Cache Error is always at an uncached address so that cache error handling can bypass a suspect cache. The exception vector assignments for the R10000 processor shown in Table 16-1; the addresses are the same as for the R4400. | BEV | Evention Type | Exception Vector Address | | | | |--------|-------------------------------|--------------------------|---------------------|--|--| | DE V | Exception Type | 32-bit | 64-bit | | | | | Cold Reset/Soft Reset/<br>NMI | 0xBFC00000 | 0xfffffff BFC00000 | | | | | TLB Refill (EXL=0) | 0x80000000 | 0xFFFFFFFF 80000000 | | | | BEV=0 | XTLB Refill (EXL=0) | 0x80000080 | 0xFFFFFFFF 8000080 | | | | DE V=0 | Cache Error | 0xA0000100 | 0xfffffff A0000100 | | | | | Others | 0x80000180 | 0xFFFFFFFF 80000180 | | | | | TLB Refill (EXL=0) | 0xBFC00200 | 0xFFFFFFFF BFC00200 | | | | BEV=1 | XTLB Refill (EXL=0) | 0xBFC00280 | 0xFFFFFFFF BFC00280 | | | | DEV-1 | Cache Error | 0xBFC00300 | 0xFFFFFFFF BFC00300 | | | | | Others | 0xBFC00380 | 0xFFFFFFFF BFC00380 | | | Table 16-1 Exception Vector Addresses #### 16.3 TLB Refill Vector Selection In all present implementations of the MIPS III ISA, there are two TLB refill exception vectors: - one for references to 32-bit address space (TLB Refill) - one for references to 64-bit address space (XTLB Refill) Table 16-2 lists the exception vector addresses. The TLB refill vector selection is based on the address space of the address (*user*, *supervisor*, or *kernel*) that caused the TLB miss, and the value of the corresponding extended addressing bit in the *Status* register (*UX*, *SX*, or *KX*). The current operating mode of the processor is not important except that it plays a part in specifying in which address space an address resides. The *Context* and *XContext* registers are entirely separate pagetable-pointer registers that point to and refill from two separate page tables, however these two registers share *BadVPN2* fields (see Chapter 13 for more information). For all TLB exceptions (Refill, Invalid, TLBL or TLBS), the *BadVPN2* fields of both registers are loaded as they were in the R4400. In contrast to the R10000, the R4400 processor selects the vector based on the current operating mode of the processor (*user*, *supervisor*, or *kernel*) and the value of the corresponding extended addressing bit in the *Status* register (*UX*, *SX* or *KX*). In addition, the *Context* and *XContext* registers are not implemented as entirely separate registers; the *PTEbase* fields are shared. A miss to a particular address goes through either TLB Refill or XTLB Refill, depending on the source of the reference. There can be only be a single page table unless the refill handlers execute address-deciphering and page table selection in software. **NOTE:** Refills for the 0.5 Gbyte supervisor mapped region, *sseg/ksseg*, are controlled by the value of *KX* rather than *SX*. This simplifies control of the processor when supervisor mode is not being used. Table 16-2 lists the TLB refill vector locations, based on the address that caused the TLB miss and its corresponding mode bit. Table 16-2 TLB Refill Vectors | Space | Address Range | Regions | Exception Vector | |------------|-----------------------------|--------------------------|----------------------| | | 0xffff ffff E000 0000 | | Refill (KX=0) | | Kernel | to<br>0xffff ffff ffff ffff | kseg3 | or<br>XRefill (KX=1) | | | 0xffff ffff C000 0000 | | Refill (KX=0) | | Supervisor | to<br>Oxffff ffff Dfff ffff | sseg, ksseg | or<br>XRefill (KX=1) | | | 0xC000 0000 0000 0000 | | | | Kernel | to<br>0xC000 0FFE FFFF FFFF | xkseg | XRefill(KX=1) | | | 0x4000 0000 0000 0000 | | | | Supervisor | to<br>0x4000 0FFF FFFF FFFF | xsseg, xksseg | XRefill (SX=1) | | | 0x0000 0000 8000 0000 | | | | User | to<br>0x0000 0FFF FFFF FFFF | xsuseg, xuseg,<br>xkuseg | XRefill (UX=1) | | | 0x0000 0000 0000 0000 | useg, xuseg, suseg, | Refill (UX=0) | | User | to<br>0x0000 0000 7FFF FFFF | xsuseg, kuseg,<br>xkuseg | or<br>XRefill (UX=1) | #### **Priority of Exceptions** The remainder of this chapter describes exceptions in the order of their priority shown in Table 16-3 (with certain of the exceptions, such as the TLB exceptions and Instruction/Data exceptions, grouped together for convenience). While more than one exception can occur for a single instruction, only the exception with the highest priority is reported. Some exceptions are not caused by the instruction executed at the time, and some exceptions may be deferred. See the individual description of each exception in this chapter for more detail. Table 16-3 Exception Priority Order | Cold Reset (highest priority) | |--------------------------------------------------------------------------------------------------------------------------| | Soft Reset | | Nonmaskable Interrupt (NMI) <sup>‡</sup> | | Cache error — Instruction cache <sup>‡</sup> | | Cache error — Data cache <sup>‡</sup> | | Cache error — Secondary cache <sup>‡</sup> | | Cache error — System interface <sup>‡</sup> | | Address error — Instruction fetch | | TLB refill — Instruction fetch | | TLB invalid — Instruction fetch | | Bus error — Instruction fetch | | Integer overflow, Trap, System Call, Breakpoint, Reserved Instruction, Coprocessor Unusable, or Floating-Point Exception | | Address error — Data access | | TLB refill — Data access | | TLB invalid — Data access | | TLB modified — Data write | | Watch <sup>‡</sup> | | Bus error — Data access | | Interrupt (lowest priority) <sup>‡</sup> | <sup>‡</sup> These exceptions are interrupt types, and may be imprecise. Priority may not be followed when considering a specific instruction. Generally speaking, the exceptions described in the following sections are handled ("processed") by hardware; these exceptions are then serviced by software. #### **Cold Reset Exception** #### Cause The Cold Reset exception is taken for a power-on or "cold" reset; it occurs when the **SysGnt\*** signal is asserted while the **SysReset\*** signal is also asserted.<sup>†</sup> This exception is not maskable. #### **Processing** The CPU provides a special interrupt vector for this exception: - location 0xBFC0 0000 in 32-bit mode - location 0xFFFF FFFF BFC0 0000 in 64-bit mode The Cold Reset vector resides in unmapped and uncached CPU address space, so the hardware need not initialize the TLB or the cache to process this exception. It also means the processor can fetch and execute instructions while the caches and virtual memory are in an undefined state. The contents of all registers in the CPU are undefined when this exception occurs, except for the following register fields: - In the *Status* register, *SR* and *TS* are cleared to 0, and *ERL* and *BEV* are set to 1. All other bits are undefined. - Config register is initialized with the boot mode bits read from the serial input. - The *Random* register is initialized to the value of its upper bound. - The *Wired* register is initialized to 0. - The EW bit in the CacheErr register is cleared. - The *ErrorEPC* register gets the PC. - The *FrameMask* register is set to 0. - Branch prediction bits are set to 0. - Performance Counter register Event field is set to 0. - All pending cache errors, delayed watch exceptions, and external interrupts are cleared. #### Servicing The Cold Reset exception is serviced by: - initializing all processor registers, coprocessor registers, caches, and the memory system - performing diagnostic tests - bootstrapping the operating system <sup>†</sup> If **SysGnt\*** remains deasserted (high) while **SysReset\*** is asserted, the processor interprets this as a Soft Reset exception. # Soft<sup>†</sup> Reset Exception #### Cause The Soft Reset exception occurs in response to a Soft Reset (See Chapter 8, the section titled "Soft Reset Sequence"). A Soft Reset exception is not maskable. The processor differentiates between a Cold Reset and a Soft Reset as follows: - A Cold Reset occurs when the SysGnt\* signal is asserted while the SysReset\* signal is also asserted. - A Soft Reset occurs if the SysGnt\* signal remains negated when a SysReset\* signal is asserted. In R4400 processor, there is no way for software to differentiate between a Soft Reset exception and an NMI exception. In the R10000 processor, a bit labelled *NMI* has been added to the *Status* register to distinguish between these two exceptions. Both Soft Reset and NMI exceptions set the *SR* bit and use the same exception vector. During an NMI exception, the *NMI* bit is set to 1; during a Soft Reset, the *NMI* bit is set to 0. #### **Processing** When a Soft Reset exception occurs, the *SR* bit of the *Status* register is set, distinguishing this exception from a Cold Reset exception. When a Soft Reset is detected, the processor initializes minimum processor state. This allows the processor to fetch and execute the instructions of the exception handler, which in turn dumps the current architectural state to external logic. Hardware state that loses architectural state is not initialized unless it is necessary to execute instructions from unmapped uncached space that reads the registers, TLB, and cache contents. The Soft Reset can begin on an arbitrary cycle boundary and can abort multicycle operations in progress, so it may alter machine state. Hence, caches, memory, or other processor states can be inconsistent: data cache blocks may stay at the refill state and any cached loads/stores to these blocks will hang the processor. Therefore, CacheOps should be used to dump the cache contents. After the processor state is read out, the processor should be reset with a Cold Reset sequence. <sup>†</sup> Soft Reset is also known colloquially as Warm Reset. A Soft Reset exception preserves the contents of all registers, except for: - ErrorEPC register, which contains the PC - ERL bit of the Status register, which is set to 1 - *SR* bit of the *Status* register, which is set to 1 on Soft Reset or an NMI; 0 for a Cold Reset - BEV bit of the Status register, which is set to 1 - TS bit of the Status register, which is set to 0 - PC is set to the reset vector 0xFFFF FFFF BFC0 0000 - · clears any pending Cache Error exceptions ## **Servicing** A Soft Reset exception is intended to quickly reinitialize a previously operating processor after a fatal error. It is not normally possible to continue program execution after returning from this exception, since a **SysReset\*** signal can be accepted anytime. ## **NMI Exception** #### Cause The NMI exception is caused by assertion of the **SysNMI\*** signal. An NMI exception is not maskable. In R4400 processor, there is no way for software to differentiate between a Soft Reset exception and an NMI exception. In the R10000 processor, a bit labelled *NMI* has been added to the *Status* register to distinguish between these two exceptions. Both Soft Reset and NMI exceptions set the *SR* bit and use the same exception vector. During an NMI exception, the *NMI* bit is set to 1; during a Soft Reset, the *NMI* bit is set to 0. #### **Processing** When an NMI exception occurs, the *SR* bit of the *Status* register is set, distinguishing this exception from a Cold Reset exception. An exception caused by an NMI is taken at the instruction boundary. It does not abort any state machines, preserving the state of the processor for diagnosis. The *Cause* register remains unchanged and the system jumps to the NMI exception handler (see Table 16-1). An NMI exception preserves the contents of all registers, except for: - ErrorEPC register, which contains the PC - ERL bit of the Status register, which is set to 1 - *SR* bit of the *Status* register, which is set to 1 on Soft Reset or an NMI; 0 for a Cold Reset - BEV bit of the Status register, which is set to 1 - TS bit of the Status register, which is set to 0 - PC is set to the reset vector 0xFFFF FFFF BFC0 0000 - clears any pending Cache Error exceptions #### **Servicing** The NMI can be used for purposes other than resetting the processor while preserving cache and memory contents. For example, the system might use an NMI to cause an immediate, controlled shutdown when it detects an impending power failure. It is not normally possible to continue program execution after returning from this exception, since an NMI can occur during another error exception. #### **Address Error Exception** #### Cause The Address Error exception occurs when an attempt is made to execute one of the following: - reference to an illegal address space - reference the supervisor address space from User mode - reference the kernel address space from User or Supervisor mode - · load or store a doubleword that is not aligned on a doubleword boundary - load, fetch, or store a word that is not aligned on a word boundary - load or store a halfword that is not aligned on a halfword boundary This exception is not maskable. #### **Processing** The common exception vector is used for this exception. The *AdEL* or *AdES* code in the *Cause* register is set, indicating whether the instruction caused the exception with an instruction reference, load operation, or store operation shown by the *EPC* register and *BD* bit in the *Cause* register. When this exception occurs, the *BadVAddr* register retains the virtual address that was not properly aligned or that referenced protected address space. The contents of the *VPN* field of the *Context*, *XContext*, and *EntryHi* registers are undefined, as are the contents of the *EntryLo* register. The *EPC* register contains the address of the instruction that caused the exception, unless this instruction is in a branch delay slot. If it is in a branch delay slot, the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set as indication. #### Servicing The process executing at the time is handed a UNIX SIGSEGV (segmentation violation) signal. This error is usually fatal to the process incurring the exception. # **TLB Exceptions** Three types of TLB exceptions can occur: - TLB Refill occurs when there is no TLB entry that matches an attempted reference to a mapped address space. - TLB Invalid occurs when a virtual address reference matches a TLB entry that is marked invalid. - TLB Modified occurs when a store operation virtual address reference to memory matches a TLB entry which is marked valid but is not dirty (the entry is not writable). The following three sections describe these TLB exceptions. **NOTE:** TLB Refill vector selection is also described earlier in this chapter, in the section titled, TLB Refill Vector Selection. #### **TLB Refill Exception** #### Cause The TLB refill exception occurs when there is no TLB entry to match a reference to a mapped address space. This exception is not maskable. #### **Processing** There are two special exception vectors for this exception; one for references to 32-bit address spaces, and one for references to 64-bit address spaces. The *UX*, *SX*, and *KX* bits of the *Status* register determine whether the user, supervisor or kernel address spaces referenced are 32-bit or 64-bit spaces; the TLB refill vector is selected based upon the address space of the address causing the TLB miss (user, supervisor, or kernel mode address space), together with the value of the corresponding extended addressing bit in the *Status* register (*UX*, *SX*, or *KX*). The current operating mode of the processor is not important except that it plays a part in specifying in which space an address resides. An address is in *user* space if it is in *useg*, *suseg*, *suseg*, *xuseg*, or *xkuseg* (see the description of virtual address spaces in Chapter 15). An address is in *supervisor* space if it is in *sseg*, *ksseg*, *xsseg* or *xksseg*, and an address is in *kernel* space if it is in either *kseg3* or *xkseg*. *Kseg0*, *kseg1*, and kernel physical spaces (*xkphys*) are kernel spaces but are not mapped. All references use these vectors when the *EXL* bit is set to 0 in the *Status* register. This exception sets the *TLBL* or *TLBS* code in the *ExcCode* field of the *Cause* register. This code indicates whether the instruction, as shown by the *EPC* register and the *BD* bit in the *Cause* register, caused the miss by an instruction reference, load operation, or store operation. When this exception occurs, the *BadVAddr*, *Context*, *XContext* and *EntryHi* registers hold the virtual address that failed address translation. The *EntryHi* register also contains the ASID from which the translation fault occurred. The *Random* register normally contains a valid location in which to place the replacement TLB entry. The contents of the *EntryLo* register are undefined. The *EPC* register contains the address of the instruction that caused the exception, unless this instruction is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. #### Servicing To service this exception, the contents of the *Context* or *XContext* register are used as a virtual address to fetch memory locations containing the physical page frame and access control bits for a pair of TLB entries. The two entries are placed into the *EntryLo0/EntryLo1* register; the *EntryHi* and *EntryLo* registers are written into the TLB. It is possible that the virtual address used to obtain the physical address and access control information is on a page that is not resident in the TLB. This condition is processed by allowing a TLB refill exception in the TLB refill handler. This second exception goes to the common exception vector because the *EXL* bit of the *Status* register is set. #### **TLB Invalid Exception** #### Cause The TLB invalid exception occurs when a virtual address reference matches a TLB entry that is marked invalid (TLB valid bit cleared). This exception is not maskable. #### **Processing** The common exception vector is used for this exception. The *TLBL* or *TLBS* code in the *ExcCode* field of the *Cause* register is set. This indicates whether the instruction, as shown by the *EPC* register and *BD* bit in the *Cause* register, caused the miss by an instruction reference, load operation, or store operation. When this exception occurs, the *BadVAddr*, *Context*, *XContext* and *EntryHi* registers contain the virtual address that failed address translation. The *EntryHi* register also contains the ASID from which the translation fault occurred. The *Random* register normally contains a valid location in which to put the replacement TLB entry. The contents of the *EntryLo* registers are undefined. The *EPC* register contains the address of the instruction that caused the exception unless this instruction is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. #### **Servicing** A TLB entry is typically marked invalid when one of the following is true: - · a virtual address does not exist - the virtual address exists, but is not in main memory (a page fault) - a trap is desired on any reference to the page (for example, to maintain a reference bit) After servicing the cause of a TLB Invalid exception, the TLB entry is located with TLBP (TLB Probe), and replaced by an entry with that entry's *Valid* bit set. #### **TLB Modified Exception** #### Cause The TLB modified exception occurs when a store operation virtual address reference to memory matches a TLB entry that is marked valid but is not dirty and therefore is not writable. This exception is not maskable. #### **Processing** The common exception vector is used for this exception, and the *Mod* code in the *Cause* register is set. When this exception occurs, the *BadVAddr*, *Context*, *XContext* and *EntryHi* registers contain the virtual address that failed address translation. The *EntryHi* register also contains the ASID from which the translation fault occurred. The contents of the *EntryLo* register are undefined. The *EPC* register contains the address of the instruction that caused the exception unless that instruction is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. #### Servicing The kernel uses the failed virtual address or virtual page number to identify the corresponding access control information. The page identified may or may not permit write accesses; if writes are not permitted, a write protection violation occurs. If write accesses are permitted, the page frame is marked dirty/writable by the kernel in its own data structures. The TLBP instruction places the index of the TLB entry that must be altered into the *Index* register. The *EntryLo* register is loaded with a word containing the physical page frame and access control bits (with the *D* bit set), and the *EntryHi* and *EntryLo* registers are written into the TLB. ## **Cache Error Exception** The Cache Error exception is described in Chapter 9, Cache Error Exception. ## **Virtual Coherency Exception** #### Errata The Virtual Coherency exception is not implemented in the R10000 processor, since the virtual coherency condition is handled in hardware. When the hardware detects the Virtual Coherency exception, it invalidates the lines in all other segments of the primary cache that could cause aliasing. This takes six cycles more than that needed to refill the primary cache line (the refill would have occurred even if there was no Virtual Coherency exception detected). In the R4400 processor, a Virtual Coherency exception occurs when a primary cache miss hits in the secondary cache but **VA[14:12]** are not the same as the *PIdx* field of the secondary cache tag, and the cache algorithm specifies that the page is cached. When such a situation is detected in the R10000 processor, the primary cache lines at the old virtual index are invalidated and the *PIdx* field of the secondary cache is written with the new virtual index. #### **Bus Error Exception** #### Cause A Bus Error exception occurs when a processor block read, upgrade, or double/single/partial-word read request receives an external ERR completion response, or a processor double/single/partial-word read request receives an external ACK completion response where the associated external double/single/partial-word data response contains an uncorrectable error. This exception is not maskable. #### **Processing** The common interrupt vector is used for a Bus Error exception. The *IBE* or *DBE* code in the *ExcCode* field of the *Cause* register is set, signifying whether the instruction (as indicated by the *EPC* register and *BD* bit in the *Cause* register) caused the exception by an instruction reference, load operation, or store operation. The *EPC* register contains the address of the instruction that caused the exception, unless it is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. #### Servicing The physical address at which the fault occurred can be computed from information available in the CP0 registers. - If the *IBE* code in the *Cause* register is set (indicating an instruction fetch reference), the instruction that caused the exception is located at the virtual address contained in the *EPC* register (or 4+ the contents of the *EPC* register if the *BD* bit of the *Cause* register is set). - If the *DBE* code is set (indicating a load or store reference), the instruction that caused the exception is located at the virtual address contained in the *EPC* register (or 4+ the contents of the *EPC* register if the *BD* bit of the *Cause* register is set). The virtual address of the load and store reference can then be obtained by interpreting the instruction. The physical address can be obtained by using the TLBP instruction and reading the *EntryLo* registers to compute the physical page number. The process executing at the time of this exception is handed a UNIX SIGBUS (bus error) signal, which is usually fatal. # **Integer Overflow Exception** #### Cause An Integer Overflow exception occurs when an ADD, ADDI, SUB, DADD, DADDI or DSUB instruction results in a 2's complement overflow. This exception is not maskable. ## **Processing** The common exception vector is used for this exception, and the *OV* code in the *Cause* register is set. The *EPC* register contains the address of the instruction that caused the exception unless the instruction is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. ## Servicing The process executing at the time of the exception is handed a UNIX SIGFPE/FPE\_INTOVF\_TRAP (floating-point exception/integer overflow) signal. This error is usually fatal to the current process. # **Trap Exception** #### Cause The Trap exception occurs when a TGE, TGEU, TLT, TLTU, TEQ, TNE, TGEI, TGEUI, TLTI, TLTUI, TEQI, or TNEI instruction results in a TRUE condition. This exception is not maskable. ## **Processing** The common exception vector is used for this exception, and the *Tr* code in the *Cause* register is set. The *EPC* register contains the address of the instruction causing the exception unless the instruction is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction and the *BD* bit of the *Cause* register is set. ## **Servicing** The process executing at the time of a Trap exception is handed a UNIX SIGFPE/FPE\_INTOVF\_TRAP (floating-point exception/integer overflow) signal. This error is usually fatal. ## **System Call Exception** #### Cause A System Call exception occurs during an attempt to execute the SYSCALL instruction. This exception is not maskable. #### **Processing** The common exception vector is used for this exception, and the *Sys* code in the *Cause* register is set. The *EPC* register contains the address of the SYSCALL instruction unless it is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction. If the SYSCALL instruction is in a branch delay slot, the *BD* bit of the *Status* register is set; otherwise this bit is cleared. #### Servicing When the System Call exception occurs, control is transferred to the applicable system routine. Additional distinctions can be made by analyzing the *Code* field of the SYSCALL instruction (bits 25:6), and loading the contents of the instruction whose address the EPC register contains. To resume execution, the *EPC* register must be altered so that the SYSCALL instruction does not re-execute; this is accomplished by adding a value of 4 to the *EPC* register (*EPC* register + 4) before returning. If a SYSCALL instruction is in a branch delay slot, a more complicated algorithm, beyond the scope of this description, may be required. ## **Breakpoint Exception** #### Cause A Breakpoint exception occurs when an attempt is made to execute the BREAK instruction. This exception is not maskable. #### **Processing** The common exception vector is used for this exception, and the *BP* code in the *Cause* register is set. The *EPC* register contains the address of the BREAK instruction unless it is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction. If the BREAK instruction is in a branch delay slot, the *BD* bit of the *Status* register is set, otherwise the bit is cleared. #### Servicing When the Breakpoint exception occurs, control is transferred to the applicable system routine. Additional distinctions can be made by analyzing the *Code* field of the BREAK instruction (bits 25:6), and loading the contents of the instruction whose address the EPC register contains. A value of 4 must be added to the contents of the EPC register (EPC register + 4) to locate the instruction if it resides in a branch delay slot. To resume execution, the EPC register must be altered so that the BREAK instruction does not re-execute; this is accomplished by adding a value of 4 to the EPC register (EPC register + 4) before returning. If a BREAK instruction is in a branch delay slot, interpretation of the branch instruction is required to resume execution. ## **Reserved Instruction Exception** #### Cause The Reserved Instruction exception occurs when one of the following conditions occurs: - an attempt is made to execute an instruction with an undefined major opcode (bits 31:26) - an attempt is made to execute a SPECIAL instruction with an undefined minor opcode (bits 5:0) - an attempt is made to execute a REGIMM instruction with an undefined minor opcode (bits 20:16) - an attempt is made to execute 64-bit operations in 32-bit mode when in User or Supervisor modes - an attempt is made to execute a COP1X when the MIPS IV ISA is not enabled 64-bit operations are always valid in Kernel mode regardless of the value of the KX bit in the Status register. This exception is not maskable. ## **Processing** The common exception vector is used for this exception, and the *RI* code in the *Cause* register is set. The *EPC* register contains the address of the reserved instruction unless it is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction. #### **Servicing** No instructions in the MIPS ISA are currently interpreted. The process executing at the time of this exception is handed a UNIX SIGILL/ILL\_RESOP\_FAULT (illegal instruction/reserved operand fault) signal. This error is usually fatal. ## **Coprocessor Unusable Exception** #### Cause The Coprocessor Unusable exception occurs when an attempt is made to execute a coprocessor instruction for either: - a corresponding coprocessor unit (CP1 or CP2) that has not been marked usable, or - CP0 instructions, when the unit has not been marked usable and the process executes in either User or Supervisor mode. This exception is not maskable. #### **Processing** The common exception vector is used for this exception, and the *CpU* code in the *Cause* register is set. The contents of the *Coprocessor Usage Error* field of the coprocessor *Control* register indicate which of the four coprocessors was referenced. The *EPC* register contains the address of the unusable coprocessor instruction unless it is in a branch delay slot, in which case the *EPC* register contains the address of the preceding branch instruction. #### Servicing The coprocessor unit to which an attempted reference was made is identified by the Coprocessor Usage Error field, which results in one of the following situations: - If the process is entitled access to the coprocessor, the coprocessor is marked usable and the corresponding user state is restored to the coprocessor. - If the process is entitled access to the coprocessor, but the coprocessor does not exist or has failed, interpretation of the coprocessor instruction is possible. - If the *BD* bit is set in the *Cause* register, the branch instruction must be interpreted; then the coprocessor instruction can be emulated and execution resumed with the *EPC* register advanced past the coprocessor instruction. - If the process is not entitled access to the coprocessor, the process executing at the time is handed a UNIX SIGILL/ILL\_PRIVIN\_FAULT (illegal instruction/privileged instruction fault) signal. This error is usually fatal. # **Floating-Point Exception** #### Cause The Floating-Point exception is used by the floating-point coprocessor. This exception is not maskable. ## **Processing** The common exception vector is used for this exception, and the *FPE* code in the *Cause* register is set. The contents of the *Floating-Point Control/Status* register indicate the cause of this exception. # Servicing This exception is cleared by clearing the appropriate bit in the *Floating-Point Control/ Status* register. ## **Watch Exception** #### Cause A Watch exception occurs when a load or store instruction references the physical address specified in the *WatchLo/WatchHi* System Control Coprocessor (CP0) registers. The *WatchLo* register specifies whether a load or store initiated this exception. A Watch exception violates the rules of a precise exception in the following way: If the load or store reference which triggered the Watch exception has a cacheable address and misses in the data cache, the line will then be read from memory into the secondary cache if necessary, and refilled from the secondary cache into the data cache. In all other cases, cache state is not affected by an instruction which takes a Watch exception. The CACHE instruction never causes a Watch exception. The Watch exception is postponed if either the *EXL* or *ERL* bit is set in the *Status* register. If either bit is set, the instruction referencing the *WatchLo/WatchHi* address is executed and the exception is delayed until the delay condition is cleared; that is, until *ERL* and *EXL* both are cleared (set to 0). The *EPC* contains the address of the next unexecuted instruction. A delayed Watch exception is cleared by system reset or by writing a value to the *WatchLo* register.<sup>†</sup> Watch is maskable by setting the EXL or ERL bits in the Status register. #### **Processing** The common exception vector is used for this exception, and the *Watch* code in the *Cause* register is set. #### **Servicing** The Watch exception is a debugging aid; typically the exception handler transfers control to a debugger, allowing the user to examine the situation. To continue program execution, the Watch exception must be disabled to execute the faulting instruction. The Watch exception must then be reenabled. The faulting instruction can be executed either by interpretation or by setting breakpoints. <sup>†</sup> An MTC0 to the WatchLo register clears a delayed Watch exception. ## **Interrupt Exception** #### Cause The Interrupt exception occurs when one of the eight interrupt conditions is asserted. The significance of these interrupts is dependent upon the specific system implementation. Each of the eight interrupts can be masked by clearing the corresponding bit in the *Interrupt-Mask (IM)* field of the *Status* register, and all of the eight interrupts can be masked at once by clearing the *IE* bit of the *Status* register. #### **Processing** The common exception vector is used for this exception, and the *Int* code in the *Cause* register is set. The *IP* field of the *Cause* register indicates current interrupt requests. It is possible that more than one of the bits can be simultaneously set (or even *no* bits may be set) if the interrupt is asserted and then deasserted before this register is read. On Cold Reset, an R4400 processor can be configured with *IP[7]* either as a sixth external interrupt, or as an internal interrupt set when the *Count* register equals the *Compare* register. There is no such option on the R10000 processor; *IP[7]* is always an internal interrupt that is set when one of the following occurs: - the Count register is equal to the Compare register - · either one of the two performance counters overflows Software needs to poll each source to determine the cause of the interrupt (which could come from more than one source at a time). For instance, writing a value to the *Compare* register clears the timer interrupt but it may not clear *IP*[7] if one of the performance counters is simultaneously overflowing. Performance counter interrupts can be disabled individually without affecting the timer interrupt, but there is no way to disable the timer interrupt without disabling the performance counter interrupt. #### Servicing If the interrupt is caused by one of the two software-generated exceptions (described in Chapter 6, the section titled "Software Interrupts"), the interrupt condition is cleared by setting the corresponding *Cause* register bit, *IP*[1:0], to 0. Software interrupts are imprecise. Once the software interrupt is enabled, program execution may continue for several instructions before the exception is taken. Timer interrupts are cleared by writing to the *Compare* register. The Performance Counter interrupt is cleared by writing a 0 to bit 31, the overflow bit, of the counter. Cold Reset and Soft Reset exceptions clear all the outstanding external interrupt requests, *IP*[2] to *IP*[6]. If the interrupt is hardware-generated, the interrupt condition is cleared by correcting the condition causing the interrupt pin to be asserted. ## 16.4 MIPSIV Instructions The system must either be in Kernel or Supervisor mode, or have set the *XX* bit of the *Status* register to a 1 in order to use the MIPS IV instruction set. In User mode, if *XX* is a 0 and an attempt is made to execute MIPS IV instructions, an exception will be taken. The type of exception that will be taken depends upon the type of instruction whose execution was attempted; a list is given in Table 16-4. Note that operating with MIPS IV instructions does not require that MIPS III instruction set or 64-bit addressing is enabled. MIPS IV instructions that use or modify the floating-point registers (CP1 state) are also affected by the *CU1* bit of the CP0 Status register. If *CU1* is not set, a Coprocessor Unusable exception may be signaled. The Reserved Instruction (RI), Coprocessor Unusable (CU), and Unimplemented Operation (UO) exceptions for MIPS IV instructions are listed in the Table 16-4 below. Instructions MIPS4 **Exceptions** CU<sub>1</sub> RI CPU (undefined) RI MOVN,Z 0 RI 0 MOVT,F CU 0 1 RI **PREF** 0 CU COP1 (all instructions) 0 UO (undefined) 1 RI BC (cc>0) 1 0 UO C (cc>0) 1 0 UO MOVN,Z,T,F 1 0 UO RECIP, RSQRT 1 0 RI COP1X (all instructions) 0 CU 0 (all instructions) 1 RI 1 (undefined) Table 16-4 MIPS IV Instruction Exceptions ### 16.5 COP0 Instructions Execution of an RFE instruction causes a Reserved Instruction exception in the R10000 processor. The execution of undefined COP0 functions is undefined in the R10000 processor. #### 16.6 COP1 Instructions The R10000 and R4400 processors do not generate the same exceptions for undefined COP1 instructions. In the R4400 processor, undefined opcodes or formats in the *sub* field take an Unimplemented Operation exceptions. In the R10000 processor, undefined opcodes (bits 25:24 are 0 or 1) take Reserved Instruction exceptions and undefined formats (bits 25:24 are 2 or 3) take Unimplemented Operation exceptions. In MIPS II on an R4400 processor, the execution of DMTC1, DMFC1, and L format take Unimplemented Operation exceptions. In MIPS II on the R10000 processor, the execution of DMTC1 and DMFC1 take Reserved Instruction exceptions The attempted execution of the L format takes an Unimplemented Operation exception when the MIPS III mode is not enabled. A CTC1 instruction that sets both *Cause* and *Enable* bits also forces an immediate floating-point exception; the *EPC* register points to the offending CTC1 instruction. #### 16.7 COP2 Instructions If the *CU2* bit of the CP0 *Status* register is not set during an attempted execution of such Coprocessor 2 instructions as COP2, LWC2, SWC2, LDC2, and SDC2, the system takes a Coprocessor Unusable exception. In the R4400 processor, if the *CU2* bit is set, COP2 instructions are handled as NOPs; the operations of Coprocessor 2 load/store instructions are undefined. In the R10000 processor, an execution of a Coprocessor 2 instruction takes a Reserved Instruction exception when *CU2* bit is set. # [MEMO] # 17. Cache Test Mode The R10000 processor provides a cache test mode that may be used during manufacturing test and system debug to access the following internal RAM arrays: - data cache data array - data cache tag array - instruction cache data array - instruction cache tag array - secondary cache way predication table # 17.1 Interface Signals Cache test mode is accessed by using a subset of the system interface signals. By not requiring the use of any secondary cache interface signals, the internal RAM arrays may be accessed for single-chip LGA as well as R10000/secondary cache module configurations. The following system interface signals are used during cache test mode: - SysAD(57:0) - SysVal\* Any input signals not listed above are ignored by the processor when it is operating in cache test mode, and any output signals not listed above are undefined during cache test mode. # 17.2 System Interface Clock Divisor Cache test mode is supported for all system interface clock speeds. However, since cache test mode repeat rates and latencies are expressed in terms of **PClk** cycles, the external agent must take care when operating at any system interface clock divisor other than Divide-by-1. # 17.3 Entering Cache Test Mode In order for the processor to enter cache test mode, the external agent must begin a Poweron or Cold Reset sequence. Rather than negating **SysReset\*** at the end of the reset sequence, the external agent loads the mode bits into the processor by driving the mode bits (with the **CTM** signal asserted) on **SysAD(63:0)**, waits at least two **SysClk** cycles, and then asserts **SysGnt\*** for at least one **SysClk** cycle. After waiting at least another 100 ms, the external agent may issue the first cache test mode command. Figure 17-1 shows the cache test mode entry sequence. Figure 17-1 Cache Test Mode Entry Sequence # 17.4 Exit Sequence To leave cache test mode, the external agent does the following: - loads the mode bits into the processor by driving the mode bits (with the CTM mode bit negated) on SysAD(63:0) - waits at least two SysClk cycles - asserts SysGnt\* for at least one SysClk cycle After at least one **SysClk** cycle, the external agent may negate **SysReset\*** to end the reset sequence. Figure 17-2 shows the cache test mode exit sequence. Figure 17-2 Cache Test Mode Exit Sequence # 17.5 **SysAD(63:0) Encoding** Encoding of the **SysAD(63:0)** bus during cache test mode is shown in Table 17-1. "Unused" fields are read as "undefined," and must be written as zeroes. Table 17-1 Cache Test Mode SysAD(63:0) Encoding | SysAD Bit | Data<br>Cache Data<br>Array | Data<br>Cache Tag<br>Array | Instruction<br>Cache Data<br>Array | Instruction<br>Cache Tag<br>Array | Secondary<br>Cache Way<br>Predication<br>Array | |-----------|-----------------------------|----------------------------|------------------------------------|-----------------------------------|------------------------------------------------| | 0 | Data<br>Data parity | Tag parity | Data | Tag parity | MRU | | 1 | | SCWay | | Unused | — Unused | | 2 | | State parity | | State parity | | | 3 | | LRU | | LRU | | | 4 | | Unused | | Unused | | | 5 | | State | | | | | 6 | | | | State | | | 7 | | | | Unused | | | 31:8 | | Tag | | Tag | | | 35:32 | | | | | | | 36 | Unused | StateMod | Data parity | Unused | | | 38:37 | | | 17 1 | | | | 39 | | Unused | Unused | | | | 42:40 | 0 | 1 | 2 | 3 | 4 | | | Array select | | | | | | 43 | Write/Read select | | | | | | 44 | Auto-increment select | | | | | | 45 | Way | | | | | | 57:46 | Address | | | | | | 63:58 | Unused | | | | | ## 17.6 Cache Test Mode Protocol This section describes the cache test mode protocol in detail, including: - normal write protocol - auto-increment protocol - normal read protocol - · auto-increment read protocol #### **Normal Write Protocol** A cache test mode **normal write** operation writes a selected RAM array. The write address, way, array, and data are specified in the write command. The external agent issues a normal write command by: - driving the address on SysAD(57:46) - driving the way on **SysAD(45)** - negating the auto-increment select on **SysAD(44)** - asserting the Write/Read select on SysAD(43) - driving the array select on **SysAD(42:40)** - driving the write data on SysAD(39:0) - asserting SysVal\* for one SysClk cycle Normal writes have a repeat rate of 8 PClk cycles. Figure 17-3 depicts two cache test mode normal writes. Figure 17-3 Cache Test Mode Normal Write Protocol #### **Auto-Increment Write Protocol** A cache test mode **auto-increment write** operation writes a selected RAM array. The write address is obtained by incrementing the previous write address, and the write way is obtained from the previous write way. If an overflow occurs when incrementing the previous write address, the address wraps to 0, and the way is toggled. The write data is identical to the previous write data. For proper results, an auto-increment write must always be proceeded by a normal or auto-increment write. The external agent issues an auto-increment write command by: - asserting the auto-increment select on SysAD(44) - asserting the Write/Read select on **SysAD(43)** - driving the array select on **SysAD**(42:40) - asserting SysVal\* for one SysClk cycle Auto-increment writes have a repeat rate of one PClk cycle. Figure 17-4 depicts three cache test mode auto-increment writes. Figure 17-4 Cache Test Mode Auto-Increment Write Protocol #### **Normal Read Protocol** A cache test mode **normal read** operation reads a selected RAM array. The read address, way, and array are specified by the read command. The external agent issues a normal read command by: - driving the address on **SysAD**(57:46) - driving the way on **SysAD(45)** - negating the auto-increment select on **SysAD(44)** - negating the Write/Read select on SysAD(43) - driving the array select on **SysAD(42:40)** - asserting SysVal\* for one SysClk cycle. After a read latency of 15 PClk cycles, the processor provides the read response by: - entering *Master* state - driving the read data on SysAD(39:0) - asserting SysVal\* for one SysClk cycle. In the following **SysClk** cycle, the processor reverts to *Slave* state. Normal reads have a repeat rate of 17 PClk cycles. Figure 17-5 depicts two cache test mode normal reads. Figure 17-5 Cache Test Mode Normal Read Protocol #### **Auto-Increment Read Protocol** A cache test mode **auto-increment read** operation reads a selected RAM array. The read address is obtained by incrementing the previous access address, and the read way is obtained from the previous access way. If an overflow occurs when incrementing the previous access address, the address wraps to 0, and the way is toggled. The external agent issues an auto-increment read command by: - asserting the auto-increment select on **SysAD(44)** - negating the Write/Read select on SysAD(43) - driving the array select on **SysAD(42:40)** - asserting SysVal\* for one SysClk cycle. After a read latency of 15 PClk cycles, the processor provides the read response by: - entering *Master* state - driving the read data on SysAD(39:0) - asserting SysVal\* for one SysClk cycle. In the following SysClk cycle, the processor reverts to Slave state. Auto-increment reads have a repeat rate of 17 PClk cycles. Figure 17-6 depicts two cache test mode auto-increment reads. Figure 17-6 Cache Test Mode Auto-Increment Read Protocol # [MEMO] # Appendix A Glossary The following terms are defined in this Glossary: - superscalar processor - pipeline - pipeline latency - pipeline repeat rate - out-of-order execution - dynamic scheduling - instruction fetch, decode, issue, execution, completion, and graduation - active list - free list and busy registers - · register renaming and unnaming - · nonblocking loads and stores - speculative branching - logical and physical registers - register files - ANDES architecture # A.1 Superscalar Processor A superscalar processor is one that can fetch, execute and complete more than one instruction in parallel. By implication, a superscalar processor has more than one pipeline (see below). # A.2 Pipeline In the processor **pipeline**, the execution of each instruction is divided into a sequence of simpler suboperations. Each suboperation is performed by a separate hardware section called a **stage**, and each stage passes its result to a succeeding stage. Normally, each instruction only remains in each stage for a single cycle, and each stage begins executing a new instruction as previous instructions are being completed in later stages. Thus, a new instruction can often begin during every cycle. Pipelines greatly improve the rate at which instructions can be executed, as long as there are no dependencies. The efficient use of a pipeline requires that several instructions be executed in parallel, however the result of any instruction is not available for several cycles after that instruction has entered the pipeline. Thus, new instructions must not depend on the results of instructions which are still in the pipeline. # A.3 Pipeline Latency The **latency** of an execution pipeline is the number of cycles between the time an instruction is issued and the time a dependent instruction (which uses its result as an operand) can be issued. In the R10000 processor, most integer instructions have a single-cycle latency, load instructions have a 2-cycle latency for cache hits, and floating-point addition and multiplication have a 2-cycle latency. Integer multiply, floating-point square-root, and all divide instructions are computed iteratively and have longer latencies. # A.4 Pipeline Repeat Rate The **repeat rate** of the pipeline is the number of cycles that occur between the issuance of one instruction and the issuance of the next instruction to the same execution unit. In the R10000 processor, the main five pipelines all have repeat rates of one cycle, but the iterative units have longer repeat delays. #### A.5 Out-of-Order Execution The "program order" of instructions is the sequence in which they are fetched and decoded. In the R10000 processor, instructions may be issued, executed, and completed **out of program order**. They are always graduated in program order. ## A.6 Dynamic Scheduling The R10000 processor can issue instructions to functional units out of program order; this capability is known as **dynamic scheduling** or **dynamic issuing**. The R10000 processor can dynamically issue an instruction as soon as all its operands are available and the required execution unit is not busy. Thus, an instruction is not delayed by a stalled previous instruction unless it needs the results of that previous instruction. ## A.7 Instruction Fetch, Decode, Issue, Execution, Completion, and Graduation In general, instructions are *fetched*, *decoded*, and *graduated* in their original program order, but may be *issued*, *executed*, and *completed* out of program order, as shown in Figure A-1. - Instruction fetching is the process of reading instructions from the instruction cache. - Instruction decode includes register renaming and initial dependency checks. For branch instructions, the branch path is predicted and the target address is computed. - An instruction is **issued** when it is handed over to a functional unit for execution. - An instruction is **complete** when its result has been computed and stored in a temporary physical register. - An instruction graduates when this temporary result is committed as the new state of the processor. An instruction can graduate only after it and all previous instructions have been successfully completed. Figure A-1 Dynamic Scheduling #### A.8 Active List The R10000 processor's **active list** is a program-order list of decoded instructions. For each instruction, the active list indicates the physical register which contained the *previous* value of the destination register (if any). If this instruction graduates, that previous value is discarded and the physical register is returned to the free list. The active list records status, such as those instructions that have completed, or those instructions that have detected exceptions. Instructions are appended to the bottom of the list as they are decoded and instructions are removed from the top as they graduate. $\star$ <R12000> Active List entries are increased to 48: The active list has been enlarged so that it now contains 48 entries. Active list accepts conservatively The read pointer for the active list is now evaluated on four-instruction blocks at a time. This has two effects: - a) There may be up to 11 empty slots in the active list and yet it will report to the decode unit that it cannot accept any new instructions. However this level of blockage only lasts for a single cycle. At most three empty slots will remain empty for more than one cycle. The time at which instructions are removed from the active list has also been changed. Integer and load/store instructions now remain in the list for one cycle after they graduate. This will be compensated for by the increased size of the active list. - b) The graduation of some instructions will be delayed, as the read pointer will not advance past the end of a four-instruction block during a cycle. Thus less than the maximum number of instructions might be graduated because the read pointer can get to them that cycle. ## A.9 Free List and Busy Registers A **busy-bit table** indicates whether or not a result has been written into each of the physical registers. Each register is initially defined to be busy when it is moved from the **free list** to the active list; the register becomes available ("not busy") when its instruction completes and its result is stored in the register file. The busy-bit table is read for each operand while an instruction is decoded, and these bits are written into the queue with the instruction. If an operand is busy, the instruction must wait in the queue until the operand is "not busy." The queues determine when an operand is ready by comparing the register number of the result coming out of each execution unit with the register number of each operand of the instructions waiting in the queue. With a few exceptions, the integer and address queues have integer operand registers, and the floating-point queue has floating-point operand registers. # A.10 Register Renaming As it executes instructions, the processor generates a myriad of *temporary* register results. These temporary values are stored in register files together with *permanent* values. The temporary values become new permanent values when their corresponding instructions graduate. **Register renaming** is used to resolve data dependencies during the dynamic execution of instructions. To ensure each instruction is given correct operand values, the logical register numbers (**names**) used in the instruction are mapped to physical registers. Each time a new value is put in a logical register, it is assigned to a new physical register. Thus, each physical register has only a single value. Dependencies are determined using these physical register numbers. An example of register renaming is shown below. The following Doubleword Shift Left Logical instruction, has one register operand (r2) plus a 5-bit shift count of value two stored in the sa field; the value in r2 is shifted left by two and this value is stored in r3. The physical execution of the instruction above, with register renaming, is given below: Physical execution Rename operation $$p3\leftarrow p2$$ shift left 2 $r3 = p3$ When the DSLL instruction is executed, the *l*ogical destination register r3 is assigned a new physical register, p3, from the free list. Register renaming also allows exceptions to be handled in a precise manner. *Out-of-order execution* means that an instruction can change its result register even before all prior instructions have been completed. However, if any of the prior instructions cause an exception, the original register value must be restored. Since each new register value is loaded into a new physical register (physical register values are not overwritten until the physical register is placed in the free list), previous values remain unchanged in the original physical registers and these previous values can be restored.<sup>†</sup> An instruction can be aborted up until the time it graduates, and all register and memory values can be restored to a precise state following any exception. This state is restored by *unnaming* the temporary physical registers assigned to subsequent instructions. Registers are **unnamed** by writing the old destination register into the mapping table and returning the new destination register to the free list. Unnaming is done in reverse program order, in case a logical register was used more than once. After renaming, the register files contain only the permanent values which were created by instructions prior to the exception. Once an instruction has graduated, all previous values are lost. #### A.11 Nonblocking Loads and Stores Loads and stores are **nonblocking**; that is, cache misses do not stall the processor. All other parts of the processor may continue to work on non-dependent instructions while as many as four cache misses are being processed. <sup>†</sup> This same technique is used to reverse mispredicted speculative branches. #### **A.12 Speculative Branching** Normally, about one of every six instructions is a branch. Since four instructions are fetched each cycle, the R10000 processor encounters, on average, a branch instruction every other cycle, as shown in Figure A-2. Figure A-2 Speculative Branching When a branch instruction was encountered in previous processors, the instruction fetch and instruction issue halted until it was determined whether or not to take the branch. For instance, a branch delay slot was designed into the MIPS architecture to handle the intrinsic delay of a branch and to keep the pipeline filled. Since the processor fetches up to four instructions each clock cycle, there is not enough time to resolve branches without stalling the fetch/decode circuitry. The processor therefore **predicts** the outcome of every branch and **speculatively executes** the branch based on this branch prediction. The branch prediction circuit consists of a 512-entry RAM, using a 2-bit prediction scheme: two bits are assigned to a branch instruction, and indicate whether or not the branch was taken the last time it occurred. The four possible prediction states are: strongly taken, weakly taken, weakly not taken, strongly not taken. If the branch was taken the last two times, there is a good probability it will be taken this time too — or the inverse. † The R10000 processor can speculate up to four branches deep. Shadow copies of the mapping tables are kept every time a prediction is made, allowing the R10000 processor to recover from a mispredicted branch in a single cycle. <sup>†</sup> Simulations have shown the R10000 branch prediction algorithm to be over 90% accurate. $\star$ <R12000> Use of global history in branch-prediction: The history register is 8 bits wide, and implements the 'gshare' predictor (reference to paper that defines will be provided later). The history register is updated speculatively, with a one cycle delay after a prediction before the results are available for use in forming another prediction index. As mentioned earlier, some programs with small "working set of conditional branches" benefit significantly from the use of such hashing; however, a slightly variable number of previously-executed branches may be omitted from the predictions made for any given branch. This will reduce prediction accuracy somewhat. Global history register is enabled via bits 26:23 of the Diag Register (CP0 register 22). If bit 26 is set, branch prediction uses all eight bits of the global history register. If bit 26 is not set, then bits 25:23 specify a count of the number of bits of global history register to be used. Increase in branch prediction table size: The table size is increased to 2048 2-bit entries. #### A.13 Logical and Physical Registers Register renaming (described above) distinguishes between **logical registers**, which are referenced within instruction fields, and **physical registers**, which are actually located in the hardware register file. The programmer is only aware of logical registers; the implementation of physical registers is entirely transparent. Logical register numbers are dynamically mapped onto physical register numbers. This mapping uses *mapping tables* which are updated after each instruction is decoded; each new result is written into a new physical register. This value is temporary and the previous contents of each logical register can be restored if its instruction must be aborted following an exception or a mispredicted branch. Register renaming simplifies dependency checks. Logical register numbers can be ambiguous when instructions are executed out of order, since a succession of different values may be assigned to the same register. But physical register numbers uniquely identify each result, making dependency checking unambiguous. The queues and execution units use physical register numbers. Integer and floating-point registers are implemented with separate renaming hardware and multi-port register files. # A.14 Register Files The R10000 processor has two 64-bit-wide register files to store integer and floating-point values. Each file contains 64 registers. The integer register file has seven read and three write ports; the floating-point register file has five read and three write ports. The integer and floating-point pipelines each use two dedicated operand ports and one dedicated result port in the appropriate register file. The Load/Store unit uses two dedicated integer operand ports for address calculation. It must also load or store either integer or floating-point values, sharing a result port and a read port in both register files. These shared ports are also used to move data between the integer and floating-point register files, to store branch and link return addresses, and to read the target address for branch register instructions. #### **A.15 ANDES Architecture** The R10000 processor uses the MIPS ANDES architecture, or *Architecture with Non-sequential Dynamic Execution Scheduling*. # Appendix B Differences between R10000 and R12000 The following terms are defined in this Appendix: - Mode bits changed in R12000 - DSD (Delay Speculative Dirty) - Config Register [22] - Config Register [23] - Changes in the Branch Diag Register - Eliminate traps for Denorm/NaN FP inputs - Increase in pre-decode buffering - Increased penalty for indirect branches - Addition of a Branch Target Address Cache - Use of global history in branch-prediction - Increase in branch prediction table size - Address calculation for load/store instructions uses integer queue - Load/store dependency is speculatively ignored - DCache set locking relaxed - SC refill blocking reduced - Increased the Way Prediction Table (MRU table) to 16K single-bit entries - Additional cycles for System Interface transactions - FP and Integer-Queue Issue Policy - Active List entires are increased to 48 - Cache Error inhibits graduation - Changed Spare (1, 3) pins to NC (No Connection) - CacheOp Index Write Back Invalidate (D) also clears Primary Tag - Summary of the differences # **B.1** Mode bits changed in R12000 Table B-1 Mode bits 12:9 SysClkDiv | CODE | Divider | SysClk (for PClk = 300 MHz) | |------|---------|--------------------------------------------------| | 0000 | - | Reserved | | 0001 | 1 | NOT AVAILABLE, defaults to 150 MHz (was 300 MHz) | | 0010 | 1.5 | NOT AVAILABLE, defaults to 150 MHz (was 200 MHz) | | 0011 | 2 | 150 MHz | | 0100 | 2.5 | 120 MHz | | 0101 | 3 | 100 MHz | | 0110 | 3.5 | 85.70 MHz | | 0111 | 4 | 75.00 MHz | | 1000 | 4.5 | 66.00 MHz | | 1001 | 5 | 60.00 MHz | | 1010 | 5.5 | 54.55 MHz | | 1011 | 6 | 50.00 MHz | | 1100 | 7 | 42.85 MHz | | 1101 | 8 | 37.50 MHz | | 1110 | 9 | 33.33 MHz | | 1111 | 10 | 30.00 MHz | Table B-2 Mode bits 21:19 SCClkDiv | CODE | Divider | (PCLK) SCCIk | |------|---------|------------------------------------| | 000 | - | Reserved | | 001 | 1 | NOT AVAILABLE, defaults to 200MHz | | 010 | 1.5 | 200MHz | | 011 | 2 | 150 MHz | | 100 | 2.5 | 120 MHz | | 101 | 3 | 100 MHz | | 110 | - | Reserved | | 111 | 4 | 75 MHz (added for testing silicon) | Table B-3 Mode bits 24:22 | CODE | Name | Comments | |------|------|---------------------------------------------------------------| | 000 | - | Reserved | | 001 | - | Reserved | | 010 | - | Reserved | | 011 | - | Reserved | | 100 | DSD | Delay Speculative Dirty - fix for speculative store (see B.2) | | 101 | - | Reserved | | 110 | - | Reserved | | 111 | - | Reserved | #### **B.2 DSD (Delay Speculative Dirty)** The Boot Mode bit 24 corresponds to the Config register[24] bit and this controls DSD during kernel and supervisor modes. However, the DSD mode can also be enabled in the user mode by setting the Status register[24] bit. Config register[24] is read-only and can be set only at boot time. If the DSD mode is set - - a) R12000 will not set the Dirty bit for a secondary cache block until the store instruction is the oldest in the Active List and is about to be executed. (An interrupt could cause a case where the dirty bit is set (store is no longer speculative), but the store does not immediately graduate. We believe this case should not cause any problem. This mode does prevent speculative stores from setting the dirty bit.) - b) This mode will have slightly lower performance due to the delay in the setting of the Dirty bit. This delay will occur just once per block refill from main memory, when it is necessary to set the dirty bit. Setting the bit requires about ten cycles; but usually the processor will continue to overlap execution of other instructions. Once a block becomes dirty in secondary cache, this mode has no performance effect. - c) In this mode, a miss in secondary cache, due to a store instruction which is not already the oldest in the pipeline, will cause a refill to the "clean exclusive" state. A hit to a shared line will immediately cause an upgrade to "clean exclusive". Thus, bus operations (which are relatively slow) will still begin speculatively. Independent of the DSD mode, R12000 will delay a "cached, non-coherent" load until it is the oldest instruction. This change is implemented because a speculative load accessing an unmapped "xkphys" address as "cached, non-coherent" might bring data into the secondary cache without the proper coherency checks. R12000 is doing no changes to prevent it from speculatively refilling cache lines in shared or clean states except the "xkphys" case described above. #### **B.3** Config Register[22] Bit 22 of the Config register is 'SC Data and Tag Corrector disable'. This bit turns off use of ECC to correct errors in the SC data and tags. #### **B.4** Config Register[23] When Bit [23] of the Config register is set, the response that R12000 produces to an external intervention (shared or exclusive) which hits on a CleanExclusive line is changed. As before, the state of the line in the cache is changed, and the former state of the line is sent out on SysState[1:0]. Moreover, when Bit[23] of Config is set, a processor coherency data response is sent with the state response. In other words, when this bit is set, external interventions which hit CleanExclusive or DirtyExclusive lines in the Secondary Cache result in a processor coherency data response. If *Bit*[23] is not set, then a data response is generated only for external interventions that hit a DirtyExclusive line (same behavior as that of R10000). #### **B.5** Changes in the Branch Diag Register In R12000 two fields are added to the "Diag Register" - CPO Register 22. One field is "ghistory enable", bits 26:23. The other is "BTAC disable", bit 27. The definitions are: #### Ghistory enable: - If bit 26 is set, branch prediction uses all eight bits of the global history register. - If bit 26 is not set, then bits 25:23 specify a count of the number of bits of global history to be used. Thus if bits 26:23 are all zero, global history is disabled. The global history contains a record of the taken/not-taken status of recently executed branches, and when used is XOR'ed with the PC of a branch being predicted to produce a hashed value for indexing the BPT. Some programs with small "working set of conditional branches" benefit significantly from the use of such hashing, some see slight performance degradation. #### • BTAC disable: If bit 27 is set, the use of the Branch Target Address Cache (BTAC) is disabled. The BTAC is used to reduce the instruction fetch penalty of taken branches by providing the target address of fixed-address branch and jump instructions. #### **B.6 Eliminate traps for Denorm/NaN FP inputs** The R10000 currently takes Unimplemented Exception when an FPU gets a NaN or Denorm as an input. R12000 suppresses these traps whenever the FS bit is set in the FCSR. R12000 simply passes through NaN's and Denorm's when the bit is set. This change in no way affects the handling of QNaNs and Denorms when they are produced, it only changes the way they are handled when they are received as input operands. Case of Denorm when the FS bit is set to 1: A Denorm received as an input to the FP unit is flushed to zero before the FP unit begins to process the operand. The behavior of the unit (when FS is 1) will be exactly that seen when the input is zero. Specifically, if the zero input would itself cause a trap (due to divide by zero, for example) then the that zero-generated trap will be taken. When a Denorm is seen at the input, the *Inexact bit* is set, except in the cases described below: The *Inexact bit* will not be set, even if FS=1 and a Denorm is seen on input, if the other input to the FP operation is a value which pre-determines the FP result (e.g. QNaN). When the result is not affected by the presence or absence of the Denorm input, the result is EXACT. Hence the Inexact bit should not be set, even if *Flush to Zero* mode is ON. Case of QNaNs when the FS bit is set to 1: A QNaN received as an input operand for an FP unit will cause the unit to produce the standard QNaN (which is not necessarily same as the input QNaN). Note that FP units will not propagate the QNaN to the output, but will always produce the same, standard, QNaN. When the FS bit is set to zero, the behavior will be exactly as in R10000. When Denorms or QNaNs are produced by an FP operation, the behavior will be exactly as in R10000, regardless of the FS bit setting. Handling of signalling NaNs will be unaffected by this change. Only the handling of input quiet NaNs and Denorms will be affected. Arithmetic instructions (like *add/sub/madd/cvt/div/sqrt/recip/rsqrt*) will follow the above behavior in all respects. There are some instructions that deserve special mention: - Mov, conditional mov will not be affected by this mode, i.e. no exceptions based on QNaNs and Denorms. Denorms and QNaNs will be moved without generating an exception, regardless of the FS state. This behavior is unchanged from that of R10000. - When FS=1, the *Abs*, *Neg* and *Compare* instructions will flush Denorm inputs to zero just as the arithmetic operations do. This is different from the behavior of the R8000, R4400 and R10000. In all cases where flushing the Denorm to zero made a difference in the result, the and inexact trap will be taken or the *Inexact bit* will be set. Compatibility with R4400 and R10000 can be achieved by setting FS=0. - The behavior of FP to INT conversion instructions will change in that when FS=1, an input Denorm will be flushed to zero and the Inexact bit will be set. With other inputs, FP to INT conversion will not be affected by the FS mode bit. Previously, R10000 took an *unimplemented exception* whenever a conversion from an FP value would result in a value that cannot be represented in the target format. This will continue to be the case, with the noted exception of Denorm inputs. - FP to FP convert instructions will be affected in the same way as arithmetic operations. That is, cvt FP to FP will not take exceptions on qNaN or Denorm inputs, if and only if FS=1. The above changes in R12000 will allow the compilers and applications can do more aggressive optimizations during loop unrolling like if-conversion, speculative load execution and speculative code motion by making use of this feature. The change is gated by the FS bit so that strict IEEE-compliance is possible, as before, by setting the FS bit to zero. #### **B.7** Increase in pre-decode buffering Up to 12 instruction may be buffered before being decoded. This should normally be invisible to the end user, but can be important when debugging systems in uncached-mode, since fetch and decode are now further de-coupled. #### **B.8** Increased penalty for indirect branches Indirect branches, which were already an expensive operation, have become even more so. Instruction fetch now stalls for a minimum of 5 cycles, rather than the 4 for the R10000. This additional cycle of delay is seen by both *jr* and *jalr* instructions. # B.9 Addition of a Branch Target Address Cache This 32-entry two-way set-associative cache holds the target addresses of previously-taken branches. When a branch is executed a hit in the *BTAC* eliminates the one-cycle fetch bubble with the R10000 experiences for every taken branch. However, if a branch which hits in the *BTAC* is actually predicted not-taken, then a one cycle fetch bubble is introduced where none was present before. Performance simulations indicate that the *BTAC* is a net win, but because of its "mixed-blessing" nature, a mechanism has been provided to disable it via software. (See description of changes to diag register) #### **B.10** Use of global history in branch-prediction The *history register* is 8 bits wide, and implements the 'gshare' predictor (reference to paper that defines will be provided later). The *history register* is updated speculatively, with a one cycle delay after a prediction before the results are available for use in forming another prediction index. As mentioned earlier, some programs with small "working set of conditional branches" benefit significantly from the use of such hashing; however, a slightly variable number of previously-executed branches may be omitted from the predictions made for any given branch. This will reduce prediction accuracy somewhat. *Global history register* is enabled via bits 26:23 of the *Diag Register (CPO register 22)*. If bit 26 is set, branch prediction uses all eight bits of the *global history register*. If bit 26 is not set, then bits 25:23 specify a count of the number of bits of *global history register* to be used. # **B.11** Increase in branch prediction table size The table size is increased to 2048 2-bit entries. #### B.12 Address calculation for load/store instructions uses integer queue When load, store, cacheop, or prefetch instructions are decoded, they are sent to both the AQ and IQ units. The IQ treats the address-calculate unit as a third "ALU" and issues instructions to it. When an instruction completes address calculation, the results are forwarded to the AQ. Unlike previously, if an address instruction must be retried for any reason, address calculation is not redone. If an the address queue is full, but the integer queue has free entries at the time a load/store instruction is decoded, the load/store is sent only to the integer queue. When the address queue has an available entry the calculated address is forwarded to that entry and the remainder of the load/store execution continues. # **B.13** Load/store dependency is speculatively ignored When a load follows a store in program-order, and the address of the load is known to the Address Queue (AQ) before the address of the store, then the AQ may speculatively issue the load to tag-check and data access. When the address of the store is determined, the AQ can undo the effects of the load through the use of the "soft-exception" mechanism. Since almost all loads which are actually dependent on previous stores use the same registers to form their addresses, normally either the two instructions are independent, or their addresses are resolved in program order, so the soft-exception should occur rarely. #### **B.14** DCache set locking relaxed In R10000, when an AQ entry accesses a Dcache line, that line is locked into the cache until the entry graduates, so that the entry will not be removed from the cache until the access completes. If another entry which needs to access exactly the same line arrives in the AQ before the first completes, the two may share the lock. In this way, a line is locked in the cache until all access to it complete. In order to prevent a deadlock from arising, whenever a cache line is locked in this way, only the oldest AQ entry can obtain a lock on the other "way" of the same cache set, thus ensuring that forward progress can be made. This algorithm can cause problems, because often the oldest entry in the AQ is the one which already owns the lock on the first way - thus ensuring that no other entries can access the second way of the cache for that set index. For some algorithms, most notably FFT's, this can cause severe performance degradation. R12000 allows an entry to obtain the lock on the second way of a set if it is the oldest entry which does not already own a lock. Thus, any entries which have already acquired a lock, including those locking the first way, will not prevent another, younger, entry from accessing that second way. #### **B.15** SC refill blocking reduced In R10000, during the time that an SCache line is being refilled from system interface via the "incoming buffer (IB)", no other accesses to the SCache are allowed. If the external interface sees an ACK to a line that is being refilled before the last words of the SCache line are received by R10000, this means that several cycles can elapse during which SCache access is blocked. By breaking the SCache refill transaction into 64-byte blocks, and allowing other requests to proceed during breaks between the blocks, this effect could be reduced. R12000 pulls in SCache lines with two "pause points." This first occurs when R12000 receives the ACK for a request. If the first two quad-words are already valid in the Incoming Buffer at that time, then R12000 will proceed to refill the SCache with those two, and forward the results to the DCache or ICache at the same time as normal. The next two quad-words will be refilled as they return, thus continuing to block any other access to the SCache just as today. If however, when the initial ACK is received, the first two are not valid (i.e., either 0 or 1 quad-words are valid at that time) then R12000 will "pause" the SCache refill and wait for both of them to be brought in to the IB. Once the first half is filled in to the SCache, R12000 will again check the IB to see if an additional 3 quad-words are valid (thus 7 out of the 8 quad-words in the SCache line should have arrived into the IB). Until that is the case, R12000 will again "pause" the SCache refill and allow other accesses to reach the SCache. These two pauses allow for other requests to slip in during an SCache refill. Using only two pauses both simplifies the logic and reduces bus turnarounds. # B.16 Increased the Way Prediction Table (MRU table) to 16K single-bit entries The size of the table has been increased to 16K entries, so that 4MB caches with 128B lines or 2MB caches with 64B lines can be fully mapped. # **B.17** Additional cycles for System Interface transactions All transactions which go through the system interface unit (in particular, SCache refills and writebacks) have one additional CPU-clock of latency added to them. #### **B.18 FP and Integer-Queue Issue Policy** The integer and floating-point queues are altered so that they are now composed of two 8-entry banks. Instructions are issued into the two banks in an alternating fashion. Each bank independently nominates instructions for the functional units. For each FU, the banks nominate the oldest instruction they contain which is ready to execute. If both banks nominate an instruction for a given FU, a winner is chosen by a priority bit which alternates between the two banks on each cycle. #### **B.19** Active List entries are increased to 48 The active list has been enlarged so that it now contains 48 entries. #### **B.20** Cache Error inhibits graduation When a cache error is detected, all instruction graduation is inhibited on the following cycle. Since cache errors are rare, and an exception will occur soon afterwards, this should have minimal impact on performance. #### **B.21** Changed Spare(1, 3) pins to NC (No Connection) The spare(1, 3), shown in the User Manual, Rev 2.0, page 43 tied to Vss through a 100 ohm resister, is used in R12000 for diagnostic purpose and thus for R12000 should not be connected to anything. #### B.22 CacheOp Index Write Back Invalidate(D) also clears Primary Tag As a result of the CacheOp $Index\ Write\ Back\ Invalidate(D)$ instruction, the Primary Tag is also cleared (set to zero) in addition to setting the cache state bits to zeros or (invalid) as described in $V_R5000$ , $V_R10000\ User$ 's Manual INSTRUCTION. #### **B.23** Summary of the differences - Higher operation frequency. - Core operating voltage for R12000 2.6V. - Max case temperature for R12000 70°C. - Less Power consumption. - Increased options for PClk to SysClk and PClk to SCClk ratios. - Added boot-time mode bits to allow processor upgrade without change in system interface and secondary cache interface frequency. - Added a mode in which the side effects of "Speculative Load/ Stores" are avoided. - Speculative load/stores could cause problems in a system with non-coherent I/O. In this mode prevents the behavior that causes the side-effects with some trade-off in performance. This mode is optional and can be selected during boot-time. - Added option to disable "SCData and Tag Corrector". - Processor provides data-response even if the external intervention hits a Clean Exclusive line - (i.e. processor is the owner for both CEx and DEx lines). - Added an optional Branch Target Address Cache to reduce instruction fetch penalty. - Since there are trade-offs, this feature can be disabled. - Added an optional "Global History Table" to improve branch prediction. - Since not all the program benefit from this feature; so the feature can be disabled. - Added an option to eliminate traps for Denorm/NAN FP inputs - This allows the compilers and applications to do more aggressive optimization. The change is optional if IEEE compliance is needed. - Quadrupled the branch prediction table size. - Doubled the MRU table for SCache way prediction to improve SCache hit rate. - Improved performance monitoring system. - Detailed in a seperate document - Increased Active list to 48 entries to improve performance. - Changed the Spare(1,3) pins to NC (No Connection). - Other miscellaneous changes to improve performance and simplify logic. ### [MEMO] # Appendix C Index | Numerics | access privileges, address space 314 | |----------------------------------------|---------------------------------------------| | 16 | ACK completion response 152 | | 16-word, cache refill read sequence 93 | ACK, signal 112 | | write sequence 98 | active list, definition of 359 | | write sequence 76 | add unit, FPU 291 | | 32-bit | | | address space 305 | address 1 205 | | mode, TLB entry format 317 | encodings, mode 305<br>Kernel mode 310 | | 32-word, cache refill | | | read sequence 93 | mapping Kernel mode 310 | | write sequence 98 | Supervisor mode 308 | | 3-state, signal 63 | User mode 306 | | - | mode 305 | | 4-word, cache refill | page 316 | | read sequence 91<br>write sequence 96 | queue 26, 33 | | • | instruction graduation 33 | | 599CLGA, see CLGA | issue ports 33 | | 64-bit | number of entries 33 | | address space 305 | number of instructions written per cycle 33 | | mode, TLB entry format 317 | organized as FIFO 33 | | 8-word, cache refill | sequencing 33 | | read sequence 92 | space | | write sequence 97 | access privileges 314 | | <b>A</b> | kernel 305 | | A | supervisor 305 | | AC electrical specifications 222 | user 305 | | asynchronous inputs 223 | virtual 305 | | delay time 223 | Supervisor mode 308 | | hold time 223 | translation 318 | | maximum operating conditions 222 | User mode 306 | | setup time 223 | Address Error exception 328 | | test specification 222 | Address Space Identifier, see also ASID 318 | | timing | address/data bus signals 61 | | secondary cache 222 | AdEL, indication 328 | | System interface 222 | AdES, indication 328 | | | Aues, mulcation 320 | | algorithms | size | |-----------------------------------------------------------------------------------|------------------------------------------------------------| | cache, five types of 74, 78 | primary data cache 68 | | aliasing, virtual 89 | primary instruction cache 66 | | allocate request number requests, external 156 | secondary cache 71 | | ALU (arithmetic logic unit) | block data transfers 116 | | No. 1 40 | external block data responses 116 | | No. 2 40 | processor block write requests 116 | | ALU1 29, 32 | processor coherency data responses 116 | | ALU2 29, 32 | boundary scan register, JTAG 214 | | ANDES, Architecture with Non-sequential Dynamic Execution Scheduling 24, 364 | BPIdx, field 268 BPMode, field 267 | | arbitration protocol, System interface 130 | BPOp, field 268 | | arbitration rules, System interface 131 | BPState, field 268 | | arbitration signals 61 | branch | | arbitration, cluster bus 104 | determining next address 39 | | Architecture with Non-sequential Dynamic Execution Scheduling, see also ANDES 364 | instruction, limits on execution 39 prediction 36, 51, 362 | | arithmetic instructions, FPU 300 | prediction rates, improving 43 | | arithmetic logic unit, see also ALU 40 | speculative 362 | | array 84 | unit 30, 39<br>BRCH, field 267 | | array, page table entry (PTE) 247 | BRCV, field 267 | | ASID (Address Space Identifier) | BRCW, field 267 | | context switch 318 | Breakpoint exception 338 | | relationship to Global (G) bit in TLB entry 318 | BSIdx, field 267 | | ASID (Address Space Indentifier) | | | stored in EntryHi register 318 | buffer cached request 111 | | ASID, field 251 | cluster request 111 | | asynchronous inputs, AC electrical specification 223 | incoming 111, 112 | | auto-increment read, cache test mode 355 | outgoing 111, 113 | | auto-increment write, cache test mode 353 | uncached 111, 114 | | D. | bus | | В | SysAD 124 | | Bad Virtual Address register (BadVAddr) 250 | SysCmd 117 | | BadVAddr register 247, 265, 328 | SysResp 127 | | BadVPN2, field 247, 265 | SysState 126 | | BD, (branch delay) bit 258, 260 | Bus Error exception 334 | | BE, (memory endianness) bit 262 | busy-bit table 360 | | BEV, (boot exception vector) bit 256 | bypass register, JTAG 213 | | BEV, bit 195, 320 | C | | block | | | instruction cache 29 | C, (coherency attribute) bit 245 | | primary data cache 29 | cache 24 | | secondary cache 31 | algorithms 74 and processor requests 78 | | | cacheable coherent exclusive on write, description of 75 | | | cacheable coherent exclusive, description of 75 | | cacheable noncoherent, description of 75 | ECC 31 | |------------------------------------------|---------------------------------------------------------------| | fields, encoding of 74 | error handling 200 | | for kseg0 address space 74 | index and tag 71 | | for mapped address space 74 | indexing 84 | | for xkphys address space 74 | indexing the data array 84 | | uncached accelerated, description of 76 | indexing the tag array 85 | | uncached, description of 75 | interface frequencies 83 | | where specified 74 | sizes 31 | | associativity 65 | specifying block size 82 | | block ownership 79 | specifying cache size 82 | | misses 47 | state diagram 72 | | nonblocking 45, 47 | states 71 | | ordering constraints 37 | tag 88 | | pages 316 | tag and data array ECC $82$ | | primary 24 | tag array 82 | | primary data 29 | way prediction 86 | | block size 68 | way prediction table 85 | | changing states 69 | write back protocol 71 | | description of 68 | strong ordering | | diagram, state 70 | example of 38 | | error handling 199 | structure, two-level 65 | | index and tag 69 | Cache Error exception 195, 333 | | interleaving 52 | precision 195 | | refill 51 | prioritization 195 | | state diagram 70 | Cache Error handler 195 | | states 69 | CACHE instruction | | subset of secondary cache 69 | support for I/O 174 | | write back protocol 68 | CACHE instructions 196 | | primary instruction 29 | _ | | block size 66 | cache miss stalls 47 | | description of 66 | cache test mode | | diagram, state 67 | entry 349 | | error handling 198 | exit 350 | | error protection 198 | cacheable coherent exclusive on write, cache algorithm 74, 75 | | index and tag 66 | cacheable coherent exclusive, cache algorithm 74, 75 | | refill 51 | cacheable noncoherent, cache algorithm 74, 75 | | state diagram 67 | cached request buffer 111 | | states 66 | CacheErr register 195, 196, 198, 199, 280 | | rules, ownership of a cache block 79 | | | secondary 24 | capacitors, decoupling 225 | | associativity 31, 71 | cause bits, FPU 300 | | block size 71 | Cause register 127, 128, 250, 258, 260 | | block state 89 | Cause, field (FP) 300 | | blocks 31 | CE, bit 255, 256, 258 | | changing states 72 | CH, bit 256 | | clock domain 179 | chip revisions, R10000 261 | | data array 82 | | | data array width 84 | ckseg0 space 314 | | description of 71 | ckseg1 space 314 | | diagram, state 72 | | | 2 214 | CI | |---------------------------------------------|-------------------------------------------------| | ckseg3 space 314 | conflicts | | cksseg space 314 | coherency 165<br>internal 165 | | CLGA (ceramic land grid array) 228 | TLB, avoiding 318 | | electrical characteristics 229 | Context register 247, 265 | | layout 228 | context register 247, 203 | | mechanical characteristics 228 package 228 | | | pinout 232 | control registers, FPU 298 | | thermal characteristics 230 | controller, TAP 212 | | clock | coordinator, cluster 103 | | domain | COP1 instructions 345 | | in secondary cache 179 | COP2 instructions 345 | | internal processor clock domain 177 | Coprocessor 0, see also CP0 241 | | secondary cache clock domain 177 | Coprocessor 1 see also CP1, COP1 257 | | System interface clock domain 177 | Coprocessor 2 see also CP2, COP2 257 | | signal | Coprocessor 3 see also CP3, COP3 257 | | PClk 178 | Coprocessor Unusable exception 340 | | SCCIk 179 | correctable error 192 | | SysClk 177 | | | SysCikret 178 | Count register 128, 250 | | signals, overview of 61 | CP0 (coprocessor 0) 241 instructions 345 | | clock divisor, system interface 102, 348 | registers, list of 242 | | cluster bus 56, 104 | | | operation 170 | csseg space 309 | | cluster coordinator 103, 104 | CT, bit 262 | | cluster request buffer 111 | CTM, mode bit 188, 349, 350 | | coherency conflicts 165 | CU, (coprocessor usability) field 252, 254, 257 | | coherency protocol, directory-based 175 | D | | coherency request, external 160, 162 | | | coherency schemes 56 | D, (dirty) bit 245 | | coherency, System interface | data cache | | external intervention exclusive request 163 | see also cache, primary data 68 | | external intervention shared request 163 | data dependencies 42 | | external invalidate request 163 | data path, secondary cache 31 | | CohPrcReqTar, mode bit 124, 171, 174, 186 | data quality indication 114 | | cold reset 181 | DBRC, field 267 | | sequence 184 | DC characteristics of I/O signals 221 | | Cold Reset exception 320 | DC electrical specifications 218 | | Compare register 128, 250 | input and output 221 | | completing, an instruction 359 | input level sensing 220 | | completion, definition of 361 | maximum operating conditions 219 | | condition bit dependencies 36 | mode definitions 220 | | Condition, field (FP) 300 | power supply levels 218 | | | Vref, voltage reference 220 | | Config register 262 | DC power supply levels 218 | | | DC voltage, reference 220 | | | DC, (data cache size) field 262 | | DCOk, signal 58, 182, 219, 220, 224 | EntryHi register 251, 317 | |----------------------------------------------------|---------------------------------------------------------------------------------------| | DE, bit 196, 256 | ASID field in 318 | | debugging, and Watch registers 264 | EntryLo registers, and FrameMask register 266 | | decoding, an instruction 359 | EntryLo0 register 245, 317 | | decoupling capacitance 225 | EntryLo1 register 245, 317 | | delay times, AC electrical 223 | EPC register 260 | | dependencies | ERL, (error level) bit 195, 255, 304 | | condition bit 36 | ERR completion response 152 | | exception 37 | ERR, signal 112 | | instruction 35 | error | | memory 36 | correctable 192 | | pipeline 35 | handling 191 | | register 36, 363 | protocol 209 | | DevNum, mode bits 186 | levels, in the Status register 304 | | Diagnostic register 267 | protection 191 | | directory-based coherency protocol 175 | schemes used in R10000 197 | | divide unit, FPU 291 | protection schemes, used in R10000 | | division by zero, FP 300 | ECC 197 | | divisor, clock, system interface 102, 348 | parity 197 | | DN, (device number) field 262 | sparse encoding 197 | | | uncorrectable 193 | | Done, bit 32 | handling an 195 | | done, see also completion 361 | limiting the propagation of 194 units that detect and report uncorrectable errors 195 | | DP, (primary data cache parity) field 279 | | | DS, (diagnostic status) field 253, 254, 255 | error correcting code see also ECC 197 | | duplicate tags, external 54 | Error Exception Program Counter (ErrorEPC) register 290 | | dynamic issue 35, 359 | Event, field 271 | | dynamic scheduling 359 | EW, bit in CacheErr register 196 | | | ExcCode, field 258, 259 | | E | exception levels, in the Status register 304 | | EC, field 262 | exception processing, CPU | | ECC (error correcting code) | exception types | | matrix for secondary cache data array 201 | Address Error 328 | | matrix for secondary cache tag array 203 | Breakpoint 338 | | matrix for System interface 207 | Bus Error 334 | | register 279 | Cache Error 195, 333 | | secondary cache 31 | Coprocessor Unusable 340 Floating-Point 341 | | ECC register 91, 96 | Integer Overflow 335 | | ECC, field 279 | Interrupt 343 | | efficiency, program, suggestions for increasing 43 | NMI 327 | | electrical specifications | Reserved Instruction 339 | | AC 222 | Soft Reset 325 | | DC 218 | System Call 337 | | Enable, field (FP) 300 | TLB 329 | | enable/output delay 223 | TLB Invalid 329, 331 | | | TLB Modified 329, 332 | | TLB Refill 329, 330 | external response 102, 109 | |----------------------------------------------------------------|-----------------------------------------------------------------------| | Trap 336 | protocol 149 | | Virtual Coherency 333 | _ | | Watch 342 | $\mathbf{F}$ | | exception vector location Reset 320 | fetch pipeline 26, 39 | | TLB Refill 320 | fetching, an instruction 359 | | exception vector selection 321 | FGR (Floating-Point General register) | | precise handling 37 | 32-bit operations 294 | | priority of 321, 323 | 5-bit select 294 | | TLB refill vector locations 322 | 64-bit operations 294 | | Exception Program Counter (EPC) register 260 | load operations 295 | | executing, an instruction 359 | operations 294 | | execution order 35 | Status register FR bit 294 | | | store operations 295 | | execution pipelines 26 | Fill, field 251 | | execution units, iterative 364 | flag | | execution, speculative 42, 362 | uncorrectable error 112 | | EXL, (exception level) bit 255, 260, 304, 320 | Flag, field (FP) 300 | | external ACK completion response 112, 152 | floating-point | | external agent 54, 55, 101 | adder 40 | | also referred to as cluster coordinator 103 | adder pipeline 26 | | connecting to 103 | divide 40, 292 | | external allocate request number request protocol 156 | multiplier 40 | | external block data response 116, 150 | pipeline 27 | | protocol 149 | queue 26, 33 | | external coherency conflicts 166 | instructions written each cycle 33 | | external coherency request latency 168 | number of allowable entries 33 | | external coherency requests, action taken 164 | ports 33<br>sequencing 33 | | external completion response 153 | registers 294 | | protocol 152 | rounding mode 301 | | • | square root 40 | | external double/single/partial-word data response protocol 151 | Floating-Point exception 341 | | external duplicate tags, support for 174 | Floating-Point Status register <i>see also</i> FSR 299 | | external interface 31 | | | memory accesses 52 | Floating-Point Unit, see also FPU 291 | | priority operations 52 | flow control 115 | | external interrupt request 127 | external data response 115 | | protocol 158 | external request 115 | | external intervention exclusive request 163 | processor coherency data response 115 processor eliminate request 115 | | external intervention request 155 | processor read request 115 | | protocol 155 | processor upgrade request 115 | | external intervention shared request 163 | processor write request 115 | | external invalidate request 163 | signals 61 | | protocol 157 | format, TLB entry 317 | | external NACK completion response 152 | FPU 291 | | external request 102, 109 | Active List, control of FSR 299 | | protocol 154 | add unit 291 | | | / - | | arithmetic instructions 300 | instruction decode and rename 30 | |----------------------------------------------------------|---------------------------------------------| | cause bits, FSR 300 | integer ALU 29 | | changing rounding mode using a CTC1 301 | iterative 29 | | compare 300 | Load/Store Unit 29 | | condition bits 300 | C | | control registers 298 | G | | divide unit 291 | G, (Global) bit in TLB 246, 318 | | FGRs (general registers) 294 | gathering data, in identical mode 114 | | FSR, (Status register in FPU) 299 | gathering data, in sequential mode 114 | | graduation, control of FSR 299 | global processes (G bit in TLB) 318 | | latency 291 | | | logic diagram 292 | graduation definition of 361 | | move to floating-point 297 | of an instruction 359 | | multiply unit 291 | | | operations 292 | Grant parking 130 | | queue | н | | controlling units 293<br>move unit, FPU 292 | п | | read ports 292 | hardware emulation, support for 176 | | register file 292 | hardware interrupts 127 | | repeat rate 291 | hold times, AC electrical 223 | | rounding modes 301 | | | serial dependency circuit 297 | I | | square-root unit 291 | I/O signals, DC characteristics 221 | | FR, field 254 | I/O, support for 174 | | FrameMask register 246, 266 | IC, (instruction cache size) field 262 | | free list 360 | IE, (interrupt enable) bit 255 | | | | | freeing the request number, with completion response 152 | IE, bit 271 | | FSR (Floating-Point Status register) | IM, (interrupt mask) field 253 | | cause bits 300 | implementation number, R10000 processor 261 | | condition bits 300 | incoming buffer 111, 112 | | division by zero $300$ enable bits $300$ | Index Load Tag instruction 94 | | flag bits 300 | Index register 243 | | inexact result 300 | Index Store Data CACHE instruction 96 | | invalid operation 300 | Index Store Tag CACHE instruction 99 | | load exceptions 301 | indexing, the secondary cache 84 | | loading the FSR 301 | inexact result (FP) 300 | | overflow 300 | initialization 181 | | RM, round to minus infinity 301 | | | RN, round to nearest representable value 301 | input voltage levels, maximum 224 | | RP, round to plus infinity 301 | instruction | | RZ, round toward zero 301 | CACHE, see also CACHE instructions 196 | | underflow 300 | completion 42, 359 | | unimplemented operation 300 | COP0 see also CP0 345 | | functional unit 29 | COP1 345 | | branch 30 | COP2 345 | | floating-point adder 29 | decoding 359 | | floating-point multiplier 29 | dependencies 35 | | 200 | 250 | |-----------------------------------------------------------------------------|-------------------------------| | DMFC1 300 | issuing, an instruction 359 | | execution 359 | iterative execution units 364 | | fetching 359 | ITLB (instruction TLB) 318 | | graduation 359 | ITLBM, field 267 | | issue 42, 359 | | | superscalar 42 | J | | latencies 49 | JTAG | | MFC1 297, 300 | boundary scan register 214 | | prefetch 47 | bypass register 213 | | queue 32, 39 | Capture-DR state 214 | | repeat rates 49 | instruction register 213 | | serializing 45<br>SWC1 297 | interface 211 | | SYNC 77, 170 | instruction register 213 | | | JTCK signal 212 | | instruction cache, block size <i>see also</i> cache, primary instruction 66 | JTDI signal 212 | | instruction register, JTAG 213 | JTDO signal 212 | | integer | JTMS signal 212 | | queue 32 | Tap controller 212 | | branch instructions 32 | test access port 212 | | divide instructions 32 | Shift-DR state 213, 214 | | multiply instructions 32 | signals 63 | | ports 32 | Update-DR state 214 | | shift instructions 32 | Update-IR state 213 | | integer ALU pipeline 26 | JTCK, signal 63, 64, 212 | | Integer Overflow exception 335 | JTDI, signal 63, 64, 212, 213 | | integer queue 26 | JTDO, signal 63, 212, 213 | | interface, external 31 | JTLB (joint TLB) 318 | | internal coherency conflicts 165 | JTMS, signal 63, 64, 212 | | internal processor clock domain 178 | | | Interrupt exception 343 | K | | interrupt mask, bit 250 | K0, field 262 | | Interrupt register 127 | Kernel mode 304 | | interrupt request, external 127 | address mapping 310 | | interrupts 127 | ckseg0 space 314 | | hardware 127 | ckseg1 space 314 | | nonmaskable 128 | ckseg3 space 314 | | software 128 | cksseg space 314 | | timer 128 | kseg0 space 311 | | invalid operation, FP 300 | kseg1 space 311 | | invalidate request, external 157 | kseg3 space 311 | | IP, (interrupt pending) bit 258, 279 | ksseg space 311 | | ISA (Instruction Set Architecture) | kuseg space 311 | | MIPS I 22 | operations 310 | | MIPS I 22 | xkphys space 312 | | MIPS II 22<br>MIPS III 22 | xkseg space 314 | | MIPS IV 22, 294 | xksseg space 312 | | | xkuseg space 312 | | issue, dynamic 359 | kseg0 space 311 | | Kseg0CA, mode bits 186 | mode | |----------------------------------------------------|----------------------------------------------------------------------| | kseg1 space 311 | addressing 305 | | kseg3 space 311 | addressing, encodings 305 | | ksseg space 311 | Kernel mode 305 | | KSU, field 253, 255, 320 | Supervisor mode 305 User mode 305 | | kuseg space 311 | operating 304 | | KX, bit 254, 304 | mode bits 186 | | L | CohPrcReqTar 124, 171, 174, 186<br>CTM 189, 349, 350 | | latency 49 | DevNum 186 | | accessing secondary cache 51 | Kseg0CA 186 | | definition of 358 | MemEnd 187 | | external coherency request 168 | ODrainSys 189, 220 | | FPU 291 | PrcElmReq 145, 175, 186 | | least-recently used replacement algorithm (LRU) 29 | PrcReqMax 115, 135, 137, 143, 147, 186<br>SCBlkSize 71, 82, 114, 187 | | level sensing, input 220 | SCCIkDiv 83, 178, 182, 187 | | list, free 360 | SCCIkTap 179, 188 | | LLAddr register 263 | SCCorEn 187, 201, 203 | | load operations, FPU registers 295 | SCSize 71, 82, 187 | | Load/Store Unit pipeline 26 | SysClkDiv 102, 178, 182, 187 | | loads | mode definitions, DC 220 | | nonblocking 361 | MP, field 267 | | logic diagram, FPU 292 | MTC0, instruction 91 | | logical register | multiple matches, in TLB 318 | | initialization (necessity for) 182 | multiplier pipeline 26 | | logical register, see also physical register 363 | multiply unit, FPU 291 | | LRU (least-recently used) replacement algorithm 29 | multiprocessor system 55 arbitration 133 | | M | cluster bus 55 | | mapped, virtual address region 305 | with external agent 55 | | mapping table 363 | multiprocessor system, using dedicated external agents 106 | | Mask, field 248 | multiprocessor system, using the cluster bus 107 | | master state 103 | | | and flow control 115 | N | | matches, multiple, in TLB 318 | NACK completion response 152 | | MemEnd, mode bits 187 | NACK, signal 112 | | memory dependencies 36 | NMI see also nonmaskable interrupt 290 | | memory ordering 37 | NMI, bit 255, 256 | | memory protection 316 | nonblocking cache 47 | | MIPS III ISA, disabled and enabled 246 | nonblocking, loads and stores 361 | | MIPS IV, instruction set see also ISA 344 | Nonmaskable Interrupt (NMI) exception 128, 320, 327 | | miscellaneous system signals 62 | normal read, cache test mode 354 | | mispredicted branch 51 | normal write, cache test mode 352 | | 1 | NT compatibility, LLAddr register 263 | | | - 1 comparisoney, 221 radi regioner 200 | | number, request 109 | phase-locked loop 180 | |---------------------------------------------------------------------|----------------------------------------------------------------------------------------------| | 0 | physical memory addresses 316 | | | physical page frame number 245 | | ODrainSys, mode bit 188, 220 | physical register, see also logical register 363 | | offset, in page address 316 | PIdx, primary cache index 89 | | operating conditions, AC 222 | pipeline 39 | | operating mode Kernel 304, 310 Supervisor 304, 308 User 304, 306 | definition of 358 fetch 26, 39 floating-point 27 floating-point multiplier 26 integer ALU 26 | | operations, FPU 292 | latency 358 | | ordering, memory 37 | Load/Store Unit 26 | | ordering, strong 37 | out of order execution 358 | | out of program order, execution 358 | repeat rate 358 | | outgoing buffer 111, 113, 114 | sequence 358 | | outstanding requests 109 | stage (definition) 358 | | overflow (FP) 300 P | stage 1 39, 40<br>stage 2 39<br>stages 4-6 40 | | | stalls 35 | | package configuration 227 | PLL 180 | | package, see CLGA | PLLDis, signal 63, 64 | | PAddr0, field 264 | PLLRC, capacitor 229 | | PAddr1, field 264 | PM, field 262 | | page address 316 | power interface signals, see also individual signals 58 | | offset 316 | power supply | | size code 316 defined 316 | levels, DC 218 regulation 224 power-on reset 181 | | virtual 316 | sequence 182 | | page table entry (PTE) array 247 | PrcElmReq, mode bit 145, 175, 186 | | PageMask register 248, 316, 317 | PrcReqMax, mode bits 115, 135, 137, 143, 147, 186 | | parity protection 197 | precise exceptions 37 | | PClk, signal 83, 102, 354, 355 | prediction, branch 362 | | PE, bit 262 | prediction, secondary cache, way 85 | | performance 5.1 | prefetch instruction 47 | | branch prediction 51 cache 51 | primary data cache, see also cache, primary data 29 | | R10000 48, 51 | primary instruction cache, see also cache, primary instruction 29 | | Performance Counter interrupt 250 | processor block read request protocol 135 | | Performance Counter register 270 | processor block write request 116 | | permanent register 360 | protocol 139 | | PFN | processor coherency data response 116 | | bits 246 | protocol 161 | | fields, in EntryLo registers 246 | processor coherency state response protocol 160 | | , , , , , , , , , , , , , , , , , , , , | processor double/single/partial-word read request protocol 137 | | processor double/single/partial-word write request protocol 141 | superscalar pipeline 25 | | | |-----------------------------------------------------------------|----------------------------------------------------|--|--| | processor eliminate request protocol 145 | R4000 superpipeline 23 | | | | processor request 102, 108 | Random entries 249 | | | | flow control protocol 147 | Random register 244 | | | | protocol 134 | RE, (reverse endian) bit 253 | | | | processor response 102, 109 | read port, FPU 292 | | | | protocols 159 | read sequences 90 | | | | Processor Revision Identifier (PRId) register 261 | 16-word 93 | | | | processor upgrade request 153 | 32-word 93 | | | | protocol 143 | 4-word 91 | | | | program order 35 | 8-word 92 | | | | dynamic execution 35 | tag 94 | | | | instruction completion 359 | reference voltage 224 | | | | instruction decoding 359 | DC 220 | | | | instruction execution 359 | register | | | | instruction fetching 359 | BadVAddr 247, 250, 265, 328 | | | | instruction graduation 359 | boundary scan, JTAG 214 | | | | instruction issue 359 | bypass, JTAG 213 | | | | protection | CacheErr 195, 196, 198, 199, 280 | | | | ECC 197 | Cause 127, 128, 250, 258, 260 | | | | memory 316 | Compare 128, 250 | | | | parity 197 | Config 262 | | | | secded 197 | Context 247, 265 | | | | sparse encoding 197 | Count 128, 250 | | | | protocol | CP0 (description of) 241 | | | | arbitration, System interface 130 | dependency 36, 363 | | | | error handling 209 | Diagnostic 267 | | | | write back 65 | ECC 91, 96, 279 | | | | write invalidate cache coherency 65 | EntryHi 251 | | | | PTE (page table entry) 247 | EntryLo0 245 | | | | PTEBase, field 247, 265 | EntryLo1 245 | | | | | EPC 260 | | | | Q | Error Exception Program Counter (ErrorEPC) 290 | | | | queue | Exception Program Counter (EPC) 260 | | | | address 26 | file | | | | instruction 39 | FPU 292 | | | | integer 26 | ports 363 | | | | | FrameMask 246, 266 | | | | R | Index 243 | | | | R, (region) field 251, 265 | instruction, JTAG 213 LLAddr 263 | | | | R, bit 264 | logical, <i>see also</i> physical register 39, 363 | | | | | PageMask 248, 316 | | | | R10000 processor | Performance Counter 270 | | | | ANDES architecture 24 | permanent 360 | | | | caches 24 | physical, see also logical register 39, 363 | | | | execution pipelines 26 | Processor Revision Identifier (PRId) 261 | | | | overview 24 | Random 244 | | | | pipeline stages 25 | | | | | renaming 36, 360 | SC(A,B)DWay, signals 59, 84, 92, 97 | |---------------------------------------------|-------------------------------------------------------------------| | Status 195, 196 | SC, bit 262 | | ERL bit 304 | SCADCS, signal 59 | | EXL bit 304 | SCADOE, signal 59 | | SX bit 314 | SCADWr, signal 59 | | TS bit 318 | SCBDCS, signal 59 | | USL field 304 | - | | UX bit 314 | SCBDOE, signal 59 | | TagHi 91, 96, 284 | SCBDWr, signal 59 | | TagLo 91, 96, 284<br>temporary 360 | SCBlkSize, mode bits 71, 82, 114, 187 | | unnamed 361 | SCClk frequency 140, 161 | | WatchHi 264 | SCClk, signal 59, 83, 179 | | WatchLo 264 | SCClkDiv, mode bits 83, 178, 182, 187 | | Wired 244, 249 | SCClkTap, mode bits 179, 188 | | write before reading (necessity for) 182 | SCCorEn, mode bits 187, 201, 203 | | XContext 265 | SCData, signal 59 | | renaming, register 360 | SCDataChk, bus 200, 203 | | repeat rate 49 | SCDataChk, signal 59 | | accessing secondary cache 51 | scheduling, dynamic 359 | | definition of 358 | SCSize, mode bits 71, 82, 187 | | FPU 291 | | | replacement algorithm, cache 29 | SCTag, signals 60, 88 | | request cycle 102 | SCTagChk, bus 203 | | request number 109 | SCTagChk, signal 60 | | freeing with completion response 152 | SCTagLSBAddr, signal 59, 85 | | request, outstanding 109 | SCTCS, signal 60 | | Reserved Instruction exception 339 | SCTOE, signal 60 | | reset | SCTWay, signal 60, 85, 87, 92 | | cold 181, 184 | SCTWr, signal 60 | | power-on 181, 182 | SECDED 197 | | soft (warm) 181, 185 | secondary cache interface signals, see also individual signals 59 | | response bus signals 62 | secondary cache, see also cache, secondary 71 | | response cycle 102 | SelDVCO, signal 63, 64 | | revision number, R10000 processor 261 | serial operations 45 | | RM, field (FP) 301 | serializing instruction 45 | | RN, field (FP) 301 | setup times, AC electrical 223 | | rounding modes, in FSR 301 | signal integrity 224 | | RP, (reduced power) bit 253 | decoupling capacitance 225 | | | maximum input voltage levels 224 | | RP, field (FP) 301 | power supply regulation 224 | | rules, arbitration for System interface 131 | reference voltage 224 | | RZ, field (FP) 301 | signals | | c c | power interface, <i>see also</i> individual signals 58 | | S | secondary cache interface, <i>see also</i> individual signals 59 | | SB, (secondary cache block size) bit 262 | System interface, <i>see also</i> individual signals 61 | | SC(A,B)Addr, signals 59, 84, 85 | test interface, see also individual signals 63 | | | | | size, page in memory 316 | superscalar processor 35 | |-----------------------------------------|-----------------------------------------------------------------------------------| | SK, bit 262 | Supervisor mode 304 | | slave state 103 | address mapping 308 | | and flow control 115 | csseg space 309 | | soft (warm) reset 181, 185 | operations 308 | | Soft Reset | sseg space 309 | | exception 325 | suseg space 308 | | Soft Reset exception 320 | xsseg space 309 | | software interrupts 128 | xsuseg space 309 | | SP, bit 279 | suseg space 308 | | sparse encoding protection 197 | switch, context 318 | | special interrupt vector 324 | SX, bit 254, 304, 314 | | specifications, test, AC electrical 222 | SYNC 77, 170 | | | instruction 77, 170 | | speculative branching 362 | prevented from graduating 114 | | speculative execution 36, 43, 362 | SysAD, bus signals 61, 117, 122, 124, 205, 206, 348, 350, 351, 352, 353, 354, 355 | | square-root unit, FPU 291 | | | SR, bit 256, 325, 327 | SysAD[20:16] interrupt register 127 | | SS, (secondary cache size) field 262 | SysAD[39:0] | | sseg space 309 | during address cycle 125 | | SSRAM 81, 86 | SysAD[56:40] | | address signals 59 | during address cycle 125 | | clock signals 59 | SysAD[57] | | data signals 59 | secondary cache block way indication 125 | | tag signals 60 | SysAD[59:58] | | stage, definition of 358 | uncached attribute 124 | | stalls, improving performance 35 | SysAD[63:0] | | standard package configuration 227 | address cycle encoding 124 | | state | data cycle encoding 126 | | master 103 | SysAD[63:60] | | slave 103 | address cycle 124 | | state bus signals 62 | interrupt 127 | | Status register 195 | SysADChk, bus 206 | | in FPU, see also FSR 294 | SysADChk, signal 62, 186 | | store operations, FPU registers 295 | SysClk cycle 115, 149, 170 | | stores | SysClk, signal 61, 102, 126, 128, 130, 131, 135, 143 | | and uncached buffer 76 | 147, 176, 177, 222, 223, 354, 355 | | nonblocking 361 | SysClkDiv, mode bits 178, 182, 187 | | strong ordering 37 | SysClkRet, signal 61, 178, 180 | | example of 38 | SysCmd, bus 61, 117, 194, 205, 206 | | superpipeline, architecture 23 | SysCmd[0] 112 | | superpipeline, R4000 23 | ECC 122 | | superscalar | processor data cycles 122 | | pipeline 23 | SysCmd[10:8] 117 | | processor | data response 121 | | definition of 23, 358 | external intervention and invalidate requests 120 | | SysCmd[11:0] | SysState[2:0] | |----------------------------------------------------------------------------------------------------|----------------------------------------------------------------| | map 123 | encoding 126 | | protocol 129 | SysStatePar, signal 62, 208 | | SysCmd[11] 117 | SysStateVal, signal 62, 126 | | SysCmd[2:0] | System Call exception 337 | | processor write requests 120 | system configuration | | SysCmd[2:1] | multiprocessor 55 | | block data response 122 | uniprocessor 54 | | processor requests 119 | System interface 31, 101 | | SysCmd[4:3] | arbitration | | data cycles 122 | in a cluster bus system 104, 133 | | external special requests 121 | in a uniprocessor system 132 | | processor read requests 118 | protocol 130 | | processor upgrade requests 119 | rules 131 | | SysCmd[5] | block write request protocol 139 | | data cycles 121 | buffers 111 | | SysCmd[5], bit 112 | bus encoding | | SysCmd[7:5] | description of buses 117 | | external requests 120 | SysAD 124 | | processor requests 118 | SysCmd 117 | | SysCmdPar, signal 61, 205 | SysResp 127 | | SysCorErr, signal 62, 192, 201, 203, 206 | SysState 126 | | SysCyc, signal 62, 176 | cached request buffer 111 | | SysGblPerf, signal 62, 77, 170 | clock domain 178 | | | cluster bus 104 | | SysGnt, signal 61, 130, 131, 132, 134, 136, 138, 140, 142, 144, 146, 149, 154, 155, 156, 157, 158, | cluster request buffer 111 coherency 163 | | 161, 170, 182, 184, 185, 324, 325, 349, 350 | coherency ross | | SysNMI, signal 62, 128, 327 | connecting to an external agent 103 | | SysRdRdy, signal 61, 131, 135, 137, 143, 147 | connections to various system configurations 105 | | and flow control 115 | directory-based coherency protocol 175 | | SysRel, signal 61, 130, 132, 134, 136, 138, 140, 142, | error handling | | 144, 146, 149, 154, 155, 156, 157, 158, 161, | on buses 205 | | 170 | on SysAD bus 206 | | SysReq, signal 61, 130, 131, 134, 136, 138, 140, 142, | on SysCmd bus 205 | | 144, 146, 161, 170, 184 | on SysResp bus $208$ | | SysReset, signal 62, 182, 184, 185, 212, 223, 324, 325, 326, 349, 350 | on SysState bus 208 schemes 204 | | SysResp, bus 62, 117, 127, 208 | error protection | | SysResp[4:0] | for buses 204 | | external completion response 152 | schemes 204 | | SysResp[4:2] | external agent 101 | | driving completion indication 127 | external allocate request number request protocol 156 | | SysRespPar, signal 62, 208 | external block data response protocol 149 | | SysRespVal, signal 62, 152, 182, 184, 185, 208 | external coherency requests, action taken 164 | | | external completion response protocol 152 | | SysState, bus 62, 117, 126, 194, 208 | external data response flow control 115, 116 | | SysState[0] | external double/single/partial-word data response protocol 151 | | processor coherency data response 168 | 1.01 | | external duplicate tags, support for 174 | signals 61, 103 | |-------------------------------------------------------------|---------------------------------------------------------------| | external interrupt request protocol 158 | slave state 103 | | external intervention exclusive request 163 | split transaction 109 | | external intervention request protocol 155 | support for I/O 174 | | external intervention shared request 163 | uncached attribute 175 | | external invalidate request $16\hat{3}$ | uncached buffer 114 | | protocol 157 | uniprocessor connections 105 | | external request 102, 109 | SysUncErr, signal 62, 193, 194, 198, 199, 203 | | flow control 115 | SysVal, signal 62, 135, 137, 139, 141, 143, 145, 149 | | protocol 154 | 151, 155, 156, 157, 158, 161, 205, 348, 352 | | external response 102, 109 | 353, 354, 355 | | protocol 149 | SysWrRdy, signal 61, 140, 141, 145, 147, 161 | | flow control 115 | and flow control 115 | | frequencies 102 | and now control 110 | | grant parking 130 | T | | hardware emulation, support for 176 | | | I/O 174 | table | | incoming buffer 112 | busy-bit 360 | | internal coherency conflicts 165 | mapping 363 | | interrupts 127 | tag bus, secondary cache, SCTag 88 | | master state 103 | tag read sequence 94 | | multiprocessor connections | tag write sequence 99 | | with cluster bus 107 | TagHi register 91, 96, 284 | | with dedicated external agents 106 | TagLo register 91, 96, 284 | | outgoing buffer 113 | | | outstanding processor requests 109 | tags, external, duplicate 174 | | outstanding requests on the System interface 109 | TAP controller 212, 213 | | port 24 | TCA, signal 63, 64 | | processor block read request protocol 135 | TCB, signal 63, 64 | | processor coherency data response protocol 161 | temporary register 360 | | processor coherency state response protocol 160 | test access port (TAP) 212 | | processor double/single/partial-word read request protocol | test interface signals, <i>see also</i> individual signals 63 | | 137 | - | | processor double/single/partial-word write request protocol | test mode, cache 349, 350 | | 141 | test signals, miscellaneous 63 | | processor eliminate request protocol 145 | Timer interrupt 128 | | processor request 102, 108 | disabling 250 | | flow control protocol 147 | TLB 317 | | protocol 134 | 32-bit-mode entry format 317 | | processor response 102, 109 | 64-bit-mode entry format 317 | | protocols 159 | address | | processor upgrade request protocol 143 | translation, avoiding multiple matches 318 | | register-to-register operation 102 | ASID field 318 | | request 108 | avoiding conflict 318 | | cycle 102 | Cache Algorithm fields 317 | | number field 109 | entry formats 317 | | protocol 134 | exceptions 329 | | response 108 | Global (G) bit 318 | | cycle 102 | ITLB 318 | | protocol 134 | misses 247 | | multiple matches, avoiding 318 | useg space 306, 307 | |-------------------------------------------------|--------------------------------------------| | number of entries 317 | User mode 304 | | page size code 316 | address mapping 306 | | used with Context register 247 | operations 306 | | TLB (Translation Lookaside Buffer) 27 | useg space 307 | | JTLB 318 | xuseg space 307 | | TLB Invalid exception 329, 331 | UX, bit 254, 304, 314 | | TLB Modified exception 329, 332 | V | | TLB Probe (TLBP) instruction 243, 251 | | | TLB Read (TLBR) instruction 243 | V, (valid) bit 245 | | TLB Read Indexed (TLBR) instruction 251 | Vcc, signal 58, 218, 229 | | TLB Refill 321 | VccPa, signal 58 | | TLB Refill exception 329, 330 | VccPd, signal 58 | | TLB Write Indexed (TLBWI) instruction 243, 251 | VccQ, signal 218, 219, 221 | | TLB Write Random instruction 244, 251 | VccQSC, signal 58, 218, 229 | | Translation Look-Aside Buffer, see also TLB 317 | VccQSys, signal 58, 218, 229 | | translation, virtual address 316, 318 | vector locations, TLB refill 322 | | Trap exception 336 | vector, special interrupt 324 | | trap physical address, and Watch registers 264 | virtual address | | TriState, signal 214 | space 305 | | TS, (TLB shutdown) bit 255, 256 | translation 316 | | TS, bit, in Status register 318 | virtual aliasing 89 | | two-level cache structure 65 | Virtual Coherency exception 333 | | | virtual memory addresses 316 | | U | voltage | | UC, (uncached attribute) bit 245 | input, maximum 224 | | uncached | reference 224 | | accelerated | VPN2, field 251 | | blocks, completely gathered 76 | Vref, signal 224 | | blocks, incompletely gathered 76 | VrefByp, signal 58 | | stores 76 | VrefSC, signal 58, 220 | | attribute, support for 175 buffer 111, 114 | VrefSys, signal 58, 220 | | cache algorithm 74, 75 | Vss, signal 58, 229 | | uncached accelerated 246 | VssPa, signal 58 | | uncached accelerated, cache algorithm 74, 76 | VssPd, signal 58 | | uncached attribute 246 | $\mathbf{W}$ | | uncorrectable error 193 | | | detection, suppressed 196 | W, bit 264 | | flag 112, 114 | Watch exception 342 | | underflow (FP) 300 | WatchHi register 264 | | unimplemented operation (FP) 300 | WatchLo register 264 | | uniprocessor system 54, 105 | way prediction table, secondary cache $86$ | | arbitration rules 132 | Wired entries 249 | | unnaming, register 361 | Wired register 244, 249 | | | | ``` write back protocol 65 primary data cache 68 write \ sequences \ 95 16-word 98 32-word 98 4-word 96 8-word 97 tag 99 \mathbf{X} XContext register 265 xkphys decoding virtual address bits VA(61:59) 318 space 312 xkseg space 314 xksseg space 312 xkuseg space 312 xsseg space 309 xsuseg space 309 XTLB \ Refill \ 321 XTLB refill handler, used with XContext register 265 xuseg space 306, 307 XX, (MIPS IV User mode) bit 252, 254, 304, 344 ``` ### [MEMO] # Facsimile Message Organization | Fron | acsimile | Mes | sage | Although NEC has ta<br>to ensure that the do<br>to our customers is<br>and up-to-date, we | cumentation sur<br>s complete, bug | pplied<br>g free | |----------------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------|-----------------|-------------------------------------------------------------------------------------------------------------|------------------------------------------------|------------------------| | Name | | | | errors may occur. De<br>precautions we've<br>encounter problems<br>Please complete t<br>you'd like to repor | e taken, you<br>inthedocument<br>this form whe | may<br>ation.<br>never | | Oomp | arry | | | improvements to us | | | | Tel. | | FAX | | | | | | Addre | ss | | | | | | | | | | | Thank you for y | our kind supp | oort. | | NEC<br>Corpo<br>Fax: | America Electronics Inc. brate Communications Dept. 1-800-729-9288 1-408-588-6130 | Hong Kong, Philip<br>NEC Electronics H<br>Fax: +852-2886-96 | ong Kong Ltd. | Asian Nations except Philippines NEC Electronics Singapore Pte. Ltd. Fax: +65-250-3583 | | | | Techi | pe Electronics (Europe) GmbH nical Documentation Dept. +49-211-6503-274 | Korea<br>NEC Electronics House<br>Seoul Branch<br>Fax: 02-528-4411 | ong Kong Ltd. | Japan<br>NEC Semiconductor Technical Hotline<br>Fax: 044-548-7900 | | | | NEC | n <b>America</b><br>do Brasil S.A.<br>+55-11-6465-6829 | Taiwan<br>NEC Electronics Ta<br>Fax: 02-719-5951 | aiwan Ltd. | | | | | | d like to report the follo | | the following s | uggestion: | | | | | | | | | | | | Docui | ment number: | | | Page number: _ | | | | | | | | | | | | | | | | | | | | If pos | sible, please fax the ref | erenced page or | drawing. | | | | | | Document Rating | Excellent | Good | Acceptable | Poor | ı | | | Clarity | | | • | ۵ | | | | Technical Accuracy | | | | | 1 |