Ruby
Contents
- 1 Ruby
- 1.1 High level components of Ruby
- 1.2 Implementation of Ruby
Ruby
High level components of Ruby
Need to talk about the overview of Ruby and what are the major components.
Rathijit will do it
SLICC + Coherence protocols:
Need to say what is SLICC and whats its purpose. Talk about high level strcture of a typical coherence protocol file, that SLICC uses to generate code. A simple example structure from protocol like MI_example can help here.
Nilay will do it
Protocol independent memory components
- Cache Memory
- Replacement Policies
- Memory Controller
Arka will do it
Interconnection Network
The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together. There are 3 key parts to this:
- Topology specification: These are specified with python files that describe the topology (mesh/crossbar/ etc.), link latencies and link bandwidth. The network topology is thus configurable.
- Cycle-accurate network model: Garnet is a cycle accurate, pipelined, network model that builds and simulates the specified topology. It simulates the router pipeline and movement of flits across the network subject to the routing algorithm, latency and bandwidth constraints.
- Network Power model: The Orion power model is used to keep track of router and link activity in the network. It calculates both router static power and link and router dynamic power as flits move through the network.
More details about the network model implementation are described here
Implementation of Ruby
Directory Structure
- src/mem/
- protocols: SLICC specification for coherence protocols
- slicc: implementation for SLICC parser and code generator
- ruby
- buffers: implementation for message buffers that are used for exchanging information between the cache, directory, memory controllers and the interconnect
- common: frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block, basic types (int32, uint64, etc.)
- eventqueue: Ruby’s event queue mechanism for scheduling events
- filters: various Bloom filters
- network: Interconnect implementation, sample topology specification, network power calculations
- profiler: Profiling for cache events, memory controller events
- recorder: Cache warmup and access trace recording
- slicc_interface: Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
- system: Protocol independent memory components – CacheMemory, DirectoryMemory, Sequencer, RubyPort
SLICC
Explain functionality/ capability of SLICC Talk about AST, Symbols, Parser and code generation in some details but NO need to cover every file and/or functions. Few examples should suffice.
Nilay will do it
Protocols
Need to talk about each protocol being shipped. Need to talk about protocol specific configuration parameters. NO need to explain every action or every state/events, but need to give overall idea and how it works and assumptions (if any).
MI example
This is a simple cache coherence protocol that is used to illustrate protocol specification using SLICC.
Related Files
- src/mem/protocols
- MI_example-cache.sm: cache controller specification
- MI_example-dir.sm: directory controller specification
- MI_example-dma.sm: dma controller specification
- MI_example-msg.sm: message type specification
- MI_example.slicc: container file
Cache Hierarchy
This protocol assumes a 1-level cache hierarchy. The cache is private to each node. The caches are kept coherent by a directory controller. Since the hierarchy is only 1-level, there is no inclusion/exclusion requirement.
Stable States and Invariants
- M : The cache block has been accessed (read/written) by this node. No other node holds a copy of the cache block
- I : The cache block at this node is invalid
Cache controller
- Requests, Responses, Triggers:
- Load, Instruction fetch, Store from the core
- Replacement from self
- Data from the directory controller
- Forwarded request (intervention) from the directory controller
- Writeback acknowledgement from the directory controller
- Invalidations from directory controller (on dma activity)
- Main Operation:
- On a load/Instruction fetch/Store request from the core:
- it checks whether the corresponding block is present in the M state. If so, it returns a hit
- otherwise, if in I state, it initiates a GETX request from the directory controller
- Note: This protocol does not differentiate between load and store requests
- On a load/Instruction fetch/Store request from the core:
- On a replacement trigger from self:
- it evicts the block, issues a writeback request to the directory controller
- it waits for acknowledgement from the directory controller
- Note: writebacks are acknowledged to prevent races
- On a replacement trigger from self:
- On a forwarded request from the directory controller:
- This means that the block was in M state at this node when the request was generated by some other node
- It sends the block directly to the requesting node (cache-to-cache transfer)
- It evicts the block from this node
- On a forwarded request from the directory controller:
- Invalidations are similar to replacements
Directory controller
- Requests, Responses, Triggers:
- GETX from the cores, Forwarded GETX to the cores
- Data from memory, Data to the cores
- Writeback requests from the cores, Writeback acknowledgements to the cores
- DMA read, write requests from the DMA controllers
- Main Operation:
- The directory maintains track of which core has a block in the M state. It designates this core as owner of the block.
- On a GETX request from a core:
- If the block is not present, a memory fetch request is initiated
- If the block is already present, then it means the request is generated from some other core
- In this case, a forwarded request is sent to the original owner
- Ownership of the block is transferred to the requestor
- On a writeback request from a core:
- If the core is owner, the data is written to memory and acknowledgement is sent back to the core
- If the core is not owner, a NACK is sent back
- This can happen in a race condition
- The core evicted the block while a forwarded request some other core was on the way and the directory has already changed ownership for the core
- The evicting core holds the data till the forwarded request arrives
- On DMA accesses (read/write)
- Invalidation is sent to the owner node (if any). Otherwise data is fetched from memory.
- This ensures that the most recent data is available.
Other features
- MI protocols don't support LL/SC semantics.
- This protocol has no timeout mechanisms.
MOESI_hammer
Somayeh will do it
MOESI_CMP_token
Shoaib will do it
MOESI_CMP_directory
In contrast with the MESI protocol, the MOESI protocol introduces an additional Owned state. This enables sharing of a block after modification without needing to write it back to memory first. However, in that case, only 1 node is the owner while the others are sharers. The owner node has the responsibility to write the block back to memory on eviction. Sharers may evict the block without writeback. An overview of the protocol can be found here.
Related Files
- src/mem/protocols
- MOESI_CMP_directory-L1cache.sm: L1 cache controller specification
- MOESI_CMP_directory-L2cache.sm: L2 cache controller specification
- MOESI_CMP_directory-dir.sm: directory controller specification
- MOESI_CMP_directory-dma.sm: dma controller specification
- MOESI_CMP_directory-msg.sm: message type specification
- MOESI_CMP_directory.slicc: container file
Cache Hierarchy
Stable States and Invariants
L1 Cache controller
L2 Cache controller
Directory controller
Other features
Rathijit will do it
MESI_CMP_directory
Arka will do it
Protocol Independent Memory components
System
Arka will do it
Sequencer
Arka will do it
CacheMemory
Arka will do it
DMASequencer
There can be only 1 DMASequencer in the system. There can be a number of DMA controllers that interface with the DMASequencer.
Related Files
- src/mem/ruby/system
- DMASequencer.hh:
- DMASequencer.cc:
Main Operation
Rathijit will do it
Memory Controller
The writeup in this section reuses the documentation in the source files and rubyconfig.defaults file of GEMS.
This module simulates a basic DDR-style memory controller. It models a single channel, connected to any number of DIMMs with any number of ranks of DRAMs each. If you want multiple address/data channels, you need to instantiate multiple copies of this module.
Related Files
- src/mem/ruby/system
- MemoryControl.hh: This file declares the Memory Controller class.
- MemoryControl.cc: This file implements all the operations of the memory controller. This includes processing of input packets, address decoding and bank selection, request scheduling and contention handling, handling refresh, returning completed requests to the directory controller.
Configuration Parameters
- mem_bus_cycle_multiplier: Basic cycle time of the memory controller. This defines the period which is used as the memory channel clock period, the address bus bit time, and the memory controller cycle time. Assuming a 200 MHz memory channel (DDR-400, which has 400 bits/sec data), and a 2 GHz processor clock, mem_bus_cycle_multiplier=10.
- dimms_per_channel: Currently the only thing that matters is the number of ranks per channel, i.e. the product of this parameter and ranks_per_dimm. But if and when this is expanded to do FB-DIMMs, the distinction between the two will matter.
- rank_rank_delay: This is how many memory address cycles to delay between reads to different ranks of DRAMs to allow for clock skew.
- mem_fixed_delay: If this is nonzero, it will disable the memory controller and instead give every request a fixed latency. The nonzero value specified here is measured in memory cycles and is just added to MEM_CTL_LATENCY. It will also show up in the stats file as a contributor to memory delays stalled at head of bank queue.
- mem_random_arbitrate: By default, the memory controller uses round-robin to arbitrate between ready bank queues for use of the address bus. If you wish to add randomness to the system, set this parameter to one instead, and it will restart the round-robin pointer at a random bank number each cycle. If you want additional nondeterminism, set the parameter to some integer n >= 2, and it will in addition add a n% chance each cycle that a ready bank will be delayed an additional cycle. Note that if you are in MEM_FIXED_DELAY mode (see below), mem_random_arbitrate=1 will have no effect, but mem_random_arbitrate=2 or more will.
- tFAW: This is an obscure DRAM parameter that says that no more than four activate requests can happen within a window of a certain size. For most configurations this does not come into play, or has very little effect, but it could be used to throttle the power consumption of the DRAM. In this implementation (unlike in a DRAM data sheet) TFAW is measured in memory bus cycles; i.e. if TFAW = 16 then no more than four activates may happen within any 16 cycle window. Refreshes are included in the activates.
Main Operation
- Address Mapping
- Open/Closed Policy
We are assuming closed bank policy; that is, we automatically close each bank after a single read or write. Adding an option for open bank policy is for future work.
- Interfaces
- Data Structures
- Scheduling and Bank Contention
Each memory request is placed in a queue associated with a specific memory bank. This queue is of finite size; if the queue is full the request will back up in an (infinite) common queue and will effectively throttle the whole system. This sort of behavior is intended to be closer to real system behavior than if we had an infinite queue on each bank. If you want the latter, just make the bank queues unreasonably large.
The head item on a bank queue is issued when all of the following are true:
- The bank is available
- The address path to the DIMM is available
- The data path to or from the DIMM is available
Note that we are not concerned about fixed offsets in time. The bank will not be used at the same moment as the address path, but since there is no queue in the DIMM or the DRAM it will be used at a constant number of cycles later, so it is treated as if it is used at the same time.
We are assuming "posted CAS"; that is, we send the READ or WRITE immediately after the ACTIVATE. This makes scheduling the address bus trivial; we always schedule a fixed set of cycles. For DDR-400, this is a set of two cycles; for some configurations such as DDR-800 the parameter tRRD forces this to be set to three cycles.
We assume a four-bit-time transfer on the data wires. This is the minimum burst length for DDR-2. This would correspond to (for example) a memory where each DIMM is 72 bits wide and DIMMs are ganged in pairs to deliver 64 bytes at a shot.This gives us the same occupancy on the data wires as on the address wires (for the two-address-cycle case).
The only non-trivial scheduling problem is the data wires. A write will use the wires earlier in the operation than a read will; typically one cycle earlier as seen at the DRAM, but earlier by a worst-case round-trip wire delay when seen at the memory controller. So, while reads from one rank can be scheduled back-to-back every two cycles, and writes (to any rank) scheduled every two cycles, when a read is followed by a write we need to insert a bubble. Furthermore, consecutive reads from two different ranks may need to insert a bubble due to skew between when one DRAM stops driving the wires and when the other one starts. (These bubbles are parameters.)
This means that when some number of reads and writes are at the heads of their queues, reads could starve writes, and/or reads to the same rank could starve out other requests, since the others would never see the data bus ready. For this reason, we have implemented an anti-starvation feature. A group of requests is marked "old", and a counter is incremented each cycle as long as any request from that batch has not issued. If the counter reaches twice the bank busy time, we hold off any newer requests until all of the "old" requests have issued.
Refresh
Rathijit will do it
Interconnection Network
Topology specification
Python files specify connections. Shortest path graph traversals program the routing tables.
Network implementation
- SimpleNetwork
- Garnet
Life of a memory request in Ruby
Cpu model generates a packet -> RubyPort converts it to a ruby request -> L1 cache controller converts it to a protocol specific message ...etc.
Arka will do it