Difference between revisions of "Ruby"

Revision as of 17:34, 7 April 2011

High level components of Ruby

Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, and various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:

  1. Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, and the network topology is specified separately from the implementation.
  2. Rich configurability -- almost any aspect affecting the memory hierarchy functionality and timing can be controlled.
  3. Rapid prototyping -- a high-level specification language, SLICC, is used to specify functionality of various controllers.

The following picture, taken from the GEMS tutorial in ISCA 2005, shows a high-level view of the main components in Ruby.

Ruby overview.jpg

SLICC + Coherence protocols:

   Need to say what SLICC is and what its purpose is.
   Talk about the high-level structure of a typical coherence protocol file that SLICC uses to generate code.
   A simple example structure from protocol like MI_example can help here.

SLICC stands for Specification Language for Implementing Cache Coherence. It is a domain-specific language for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine, and SLICC is used to specify the behavior of that state machine. Since the aim is to model the hardware as closely as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can restrict the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also ties together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the interconnection network and queues its output at the output ports of the network, thus tying together the cache / memory controllers with the interconnection network itself.

Slicc overview.jpg
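
The idea of a controller as a state machine sitting between network input and output ports can be sketched in a few lines of Python. This is not SLICC syntax and not code that SLICC generates; the class name, the transition table format and the max_transitions_per_cycle parameter are all assumptions made for illustration, and the toy table jumps straight from I to M without any transient state.

```python
from collections import deque


class ControllerSketch:
    """Toy state machine fed by an input port and draining to an output port."""

    def __init__(self, transitions, initial_state, max_transitions_per_cycle=1):
        # transitions: {(state, event): (next_state, [output messages])}
        self.transitions = transitions
        self.state = initial_state
        self.max_transitions_per_cycle = max_transitions_per_cycle
        self.in_port = deque()   # messages arriving from the network
        self.out_port = deque()  # messages queued for the network

    def cycle(self):
        """Fire at most max_transitions_per_cycle transitions; return the count."""
        fired = 0
        while self.in_port and fired < self.max_transitions_per_cycle:
            event = self.in_port.popleft()
            next_state, outputs = self.transitions[(self.state, event)]
            self.state = next_state
            self.out_port.extend(outputs)
            fired += 1
        return fired


# Hypothetical two-entry transition table, for illustration only.
table = {
    ("I", "Load"): ("M", ["GETX"]),
    ("M", "Load"): ("M", []),
}
ctrl = ControllerSketch(table, "I", max_transitions_per_cycle=1)
ctrl.in_port.extend(["Load", "Load"])
first = ctrl.cycle()   # only one of the two queued events is handled
second = ctrl.cycle()  # the remaining event is handled one cycle later
```

With the per-cycle limit set to 1, the second queued event has to wait for the next call to cycle(), which is the kind of hardware-like constraint the paragraph above describes.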

Protocol independent memory components

  1. Cache Memory
  2. Replacement Policies
  3. Memory Controller

Arka will do it

Interconnection Network

The interconnection network connects the various components of the memory hierarchy (cache, memory, DMA controllers) together.

Interconnection network.jpg

The key components of an interconnection network are:

  1. Topology
  2. Routing
  3. Flow Control
  4. Router Microarchitecture

More details about the network model implementation are described here

Implementation of Ruby

Directory Structure

  • src/mem/
    • protocols: SLICC specification for coherence protocols
    • slicc: implementation for SLICC parser and code generator
    • ruby
      • buffers: implementation for message buffers that are used for exchanging information between the cache, directory, memory controllers and the interconnect
      • common: frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block, basic types (int32, uint64, etc.)
      • eventqueue: Ruby’s event queue API for scheduling events on the gem5 event queue
      • filters: various Bloom filters (stale code from GEMS)
      • network: Interconnect implementation, sample topology specification, network power calculations
      • profiler: Profiling for cache events, memory controller events
      • recorder: Cache warmup and access trace recording
      • slicc_interface: Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
      • system: Protocol independent memory components – CacheMemory, DirectoryMemory, Sequencer, RubyPort
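
As a rough illustration of the address utilities mentioned for common and slicc_interface, the sketch below shows how an address can be reduced to a cache line address and mapped to a directory node. The function names, the 64-byte line size and the line-interleaved directory mapping are assumptions for this example, not the actual Ruby helpers.

```python
BLOCK_SIZE_BITS = 6  # assumed: 64-byte cache lines


def line_address(addr, block_size_bits=BLOCK_SIZE_BITS):
    """Clear the block-offset bits so the address points at the line start."""
    return addr & ~((1 << block_size_bits) - 1)


def block_offset(addr, block_size_bits=BLOCK_SIZE_BITS):
    """Return the byte offset of addr within its cache line."""
    return addr & ((1 << block_size_bits) - 1)


def map_address_to_directory(addr, num_directories,
                             block_size_bits=BLOCK_SIZE_BITS):
    """Interleave consecutive cache lines across directories (assumed policy)."""
    return (addr >> block_size_bits) % num_directories
```

For example, line_address(0x1234) yields 0x1200 and block_offset(0x1234) yields 0x34 under the assumed 64-byte line size.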

Protocols

   Need to talk about each protocol being shipped. Need to talk about protocol specific configuration parameters.
   NO need to explain every action or every state/events, but need to give overall idea and how it works
   and assumptions (if any).

Common Notations and Data Structures

Coherence Messages

These are described in the <protocol-name>-msg.sm file for each protocol.

Message Description
ACK/NACK positive/negative acknowledgement for requests that wait for the direction of resolution before deciding on the next action. Examples are writeback requests, exclusive requests.
GETS request for shared permissions to satisfy a CPU's load or IFetch.
GETX request for exclusive access.
INV invalidation request. This can be triggered by the coherence protocol itself, or by the next cache level/directory to enforce inclusion or to trigger a writeback for a DMA access so that the latest copy of data is obtained.
PUTX request for writeback of cache block. Some protocols (e.g. MOESI_CMP_directory) may use this only for writeback requests of exclusive data.
PUTS request for writeback of cache block in shared state.
PUTO request for writeback of cache block in owned state.
PUTO_Sharers request for writeback of cache block in owned state but other sharers of the block exist.
UNBLOCK message to unblock next cache level/directory for blocking protocols.
AccessPermissions

These are associated with each cache block and determine what operations are permitted on that block. They are closely correlated with coherence protocol states.

Permissions Description
Invalid The cache block is invalid. The block must first be obtained (from elsewhere in the memory hierarchy) before loads/stores can be performed. No action on invalidates (except maybe sending an ACK). No action on replacements. The associated coherence protocol states are I or NP and are stable states in every protocol.
Busy TODO
Read_Only Only operations permitted are loads, writebacks, invalidates. Stores cannot be performed before transitioning to some other state.
Read_Write Loads, stores, writebacks, invalidations are allowed. Usually indicates that the block is dirty.
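
The permission table can be read as a lookup from permission to the CPU operations it admits. The sketch below is a minimal Python rendering under that reading; the enum, the operation names and the is_permitted helper are hypothetical, and Busy is omitted because the table does not describe it yet.

```python
from enum import Enum


class AccessPermission(Enum):
    INVALID = "Invalid"
    READ_ONLY = "Read_Only"
    READ_WRITE = "Read_Write"
    # "Busy" is left out: the table above marks it as TODO.


# CPU-side operations each permission admits, following the table above.
ALLOWED_CPU_OPS = {
    AccessPermission.INVALID: frozenset(),
    AccessPermission.READ_ONLY: frozenset({"load"}),
    AccessPermission.READ_WRITE: frozenset({"load", "store"}),
}


def is_permitted(permission, op):
    """Return True if the CPU operation op may proceed under permission."""
    return op in ALLOWED_CPU_OPS[permission]
```

A store to a Read_Only block is rejected by is_permitted, which corresponds to the rule that the block must first transition to another state.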
Data Structures
  • Message Buffers: TODO
  • TBE Table: TODO
  • Timer Table: This maintains a map of address-based timers. For each target address, a timeout value can be associated and added to the Timer Table. This data structure is used, for example, by the L1 cache controller implementation of the MOESI_CMP_directory protocol to trigger separate timeouts for cache blocks. Internally, the Timer Table uses the event queue to schedule the timeouts. The Timer Table supports a polling-based interface, isReady(), to check whether a timeout has occurred. Timeouts on addresses can be set using the set() method and removed using the unset() method.
Related Files:
src/mem/ruby/system/TimerTable.hh: Declares the TimerTable class
src/mem/ruby/system/TimerTable.cc: Implementation of the methods of the TimerTable class, which deal with setting addresses and timeouts and with scheduling events using the event queue.
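
The polling interface described for the Timer Table can be sketched as follows. This is a simplified stand-in, not the TimerTable class itself: the current time is passed in explicitly instead of coming from the event queue, and the class name, the is_ready/ready_address spellings and the argument lists are assumptions made for illustration.

```python
class TimerTableSketch:
    """Map of address-based timers with a polling interface."""

    def __init__(self):
        self._expiry = {}  # address -> absolute time at which the timeout fires

    def set(self, address, now, timeout):
        """Arm (or re-arm) a timeout for address, timeout ticks from now."""
        self._expiry[address] = now + timeout

    def unset(self, address):
        """Remove the timeout for address, if one is armed."""
        self._expiry.pop(address, None)

    def is_ready(self, now):
        """Poll: has any armed timeout expired by time now?"""
        return any(t <= now for t in self._expiry.values())

    def ready_address(self, now):
        """Return the address with the earliest expired timeout, or None."""
        expired = [(t, a) for a, t in self._expiry.items() if t <= now]
        return min(expired)[1] if expired else None


timers = TimerTableSketch()
timers.set(0x40, now=0, timeout=10)
timers.set(0x80, now=0, timeout=5)
```

A controller would poll is_ready() each cycle and, when it returns true, handle the timeout for the reported address and then call unset() on it.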
Coherence controller FSM Diagrams
  • The Finite State Machines show only the stable states
  • Transitions are annotated using the notation "Event list" or "Event list : Action list" or "Event list : Action list : Event list". For example, Store : GETX indicates that on a Store event, a GETX message was sent whereas GETX : Mem Read indicates that on receiving a GETX message, a memory read request was sent. Only the main triggers and actions are listed.
  • In the diagrams, the transition labels are associated with the arc that cuts across the transition label or the closest arc.

MI example

Protocol Overview
  • This is a simple cache coherence protocol that is used to illustrate protocol specification using SLICC.
  • This protocol assumes a 1-level cache hierarchy. The cache is private to each node. The caches are kept coherent by a directory controller. Since the hierarchy is only 1-level, there is no inclusion/exclusion requirement.
  • This protocol does not differentiate between loads and stores.
  • This protocol cannot implement the semantics of LL/SC instructions, because external GETS requests that hit a block within a LL/SC sequence steal exclusive permissions, thus causing the SC instruction to fail.
Related Files
  • src/mem/protocols
    • MI_example-cache.sm: cache controller specification
    • MI_example-dir.sm: directory controller specification
    • MI_example-dma.sm: DMA controller specification
    • MI_example-msg.sm: message type specification
    • MI_example.slicc: container file
Stable States and Invariants
States Invariants
M The cache block has been accessed (read/written) by this node. No other node holds a copy of the cache block.
I The cache block at this node is invalid

The notation used in the controller FSM diagrams is described here.

Cache controller
  • Requests, Responses, Triggers:
    • Load, Instruction fetch, Store from the core
    • Replacement from self
    • Data from the directory controller
    • Forwarded request (intervention) from the directory controller
    • Writeback acknowledgement from the directory controller
    • Invalidations from the directory controller (on DMA activity)
MI example cache FSM.jpg
  • Main Operation:
    • On a load/Instruction fetch/Store request from the core:
      • It checks whether the corresponding block is present in the M state. If so, it returns a hit
      • Otherwise, if the block is in the I state, it sends a GETX request to the directory controller
    • On a replacement trigger from self:
      • It evicts the block and issues a writeback request to the directory controller
      • It waits for an acknowledgement from the directory controller (to prevent races)
    • On a forwarded request from the directory controller:
      • This means that the block was in M state at this node when the request was generated by some other node
      • It sends the block directly to the requesting node (cache-to-cache transfer)
      • It evicts the block from this node
    • Invalidations are similar to replacements
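
The main operation of the cache controller can be summarized as a small transition function over the two stable states. This is only a sketch: the event names are hypothetical, the action strings are descriptive labels rather than SLICC actions, and the transient states that the real specification uses while waiting for data or acknowledgements are collapsed.

```python
def cache_transition(state, event):
    """Return (next_state, actions) for the stable states M and I."""
    if event in ("Load", "Ifetch", "Store"):
        if state == "M":
            return "M", ["hit"]
        # Transient states collapsed: assume the data arrives and the
        # block ends up in M.
        return "M", ["send GETX to directory", "wait for Data"]
    if event in ("Replacement", "Invalidate") and state == "M":
        # Invalidations are handled like replacements.
        return "I", ["send writeback to directory", "wait for writeback ack"]
    if event == "Fwd_GETX" and state == "M":
        # Cache-to-cache transfer, then the local copy is evicted.
        return "I", ["send Data to requestor"]
    raise ValueError(f"no transition for {event!r} in state {state!r}")
```

Loads and stores take the same path because the protocol does not differentiate between them.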
Directory controller
  • Requests, Responses, Triggers:
    • GETX from the cores, Forwarded GETX to the cores
    • Data from memory, Data to the cores
    • Writeback requests from the cores, Writeback acknowledgements to the cores
    • DMA read, write requests from the DMA controllers
MI example dir FSM.jpg
  • Main Operation:
    • The directory keeps track of which core has a block in the M state. It designates this core as owner of the block.
    • On a GETX request from a core:
      • If the block is not present, a memory fetch request is initiated
      • If the block is already present, then the request must have come from some other core
        • In this case, a forwarded request is sent to the original owner
        • Ownership of the block is transferred to the requestor
    • On a writeback request from a core:
      • If the core is owner, the data is written to memory and acknowledgement is sent back to the core
      • If the core is not owner, a NACK is sent back
        • This can happen in a race condition
        • The core evicted the block while a forwarded request from some other core was on the way, and the directory had already transferred ownership to that other core
        • The evicting core holds the data till the forwarded request arrives
    • On DMA accesses (read/write)
      • Invalidation is sent to the owner node (if any). Otherwise data is fetched from memory.
      • This ensures that the most recent data is available.
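
The ownership tracking and the writeback race described above can be sketched as follows. The class and method names are hypothetical, messages are represented as plain strings, and DMA handling is omitted to keep the example short.

```python
class DirectorySketch:
    """Tracks which core owns each block in the M state."""

    def __init__(self):
        self.owner = {}  # block address -> id of the core holding it in M

    def getx(self, addr, requestor):
        """Handle a GETX: fetch from memory or forward to the old owner."""
        previous = self.owner.get(addr)
        self.owner[addr] = requestor  # ownership moves to the requestor
        if previous is None:
            return ["fetch from memory", f"Data to core {requestor}"]
        return [f"forward GETX to core {previous}"]

    def writeback(self, addr, core):
        """Handle a writeback: ACK the owner, NACK anyone else."""
        if self.owner.get(addr) == core:
            del self.owner[addr]
            return ["write data to memory", f"ACK to core {core}"]
        return [f"NACK to core {core}"]


directory = DirectorySketch()
log = [
    directory.getx(0x100, requestor=0),  # block not present: memory fetch
    directory.getx(0x100, requestor=1),  # forwarded to core 0, owner is now 1
    directory.writeback(0x100, core=0),  # core 0 lost the race: NACK
    directory.writeback(0x100, core=1),  # current owner: written back, ACK
]
```

The third call models the race: core 0 evicted the block after the directory had already handed ownership to core 1, so its writeback is rejected.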
Other features
    • MI protocols don't support LL/SC semantics. A load from a remote core will invalidate the cache block.
    • This protocol has no timeout mechanisms.

MOESI_hammer

This is an implementation of AMD's Hammer protocol, which is used in AMD's Hammer chip (also known as the Opteron or Athlon 64). The protocol implements both the original HyperTransport protocol and the more recent ProbeFilter protocol. The protocol also includes a full-bit directory mode.

Related Files
  • src/mem/protocols
    • MOESI_hammer-cache.sm: cache controller specification
    • MOESI_hammer-dir.sm: directory controller specification
    • MOESI_hammer-dma.sm: DMA controller specification
    • MOESI_hammer-msg.sm: message type specification
    • MOESI_hammer.slicc: container file
Cache Hierarchy

This protocol implements a 2-level private cache hierarchy. It assigns separate Instruction and Data L1 caches, and a unified L2 cache, to each core. These caches are private to each core and are controlled by one cache controller shared among them. This protocol enforces exclusion between the L1 and L2 caches.

Stable States and Invariants
States Invariants
MM The cache block is held exclusively by this node and is potentially locally modified (similar to conventional "M" state).
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
S The cache line holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The cache line can be read, but not written in this state.
I The cache line is invalid and does not hold a valid copy of the data.
Cache controller

The notation used in the controller FSM diagrams is described here.

MOESI_hammer supports cache flushing. To flush a cache line, the cache controller first issues a GETF request to the directory to block the line until the flushing is completed. It then issues a PUTF and writes back the cache line.

MOESI hammer cache FSM.jpg
Directory controller

The MOESI_hammer memory module, unlike a typical directory protocol, does not contain any directory state and instead broadcasts requests to all the processors in the system. In parallel, it fetches the data from DRAM and forwards the response to the requesters.

probe filter: TODO

  • Stable States and Invariants
States Invariants
NX Not Owner, probe filter entry exists, block in O at Owner.
NO Not Owner, probe filter entry exists, block in E/M at Owner.
S Data clean, probe filter entry exists pointing to the current owner.
O Data clean, probe filter entry exists.
E Exclusive Owner, no probe filter entry.
  • Controller


The notation used in the controller FSM diagrams is described here.

MOESI hammer dir FSM.jpg

MOESI_CMP_token

Protocol Overview
  • This protocol also models a 2-level cache hierarchy.
  • It maintains coherence permission by explicitly exchanging and counting tokens.
  • A fixed number of tokens is assigned to each cache block at the beginning, and this number never changes.
  • To write a block, the processor must hold all the tokens for that block. For reading, at least one token is required.
  • The protocol also supports persistent messages to avoid starvation.
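
The token-counting rule can be stated in a few lines. In this sketch the total of 4 tokens per block, the node names and the transfer helper are assumptions chosen for illustration; only the read and write conditions come from the overview above.

```python
TOTAL_TOKENS = 4  # assumed number of tokens per cache block


def can_read(tokens_held):
    """Reading needs at least one token."""
    return tokens_held >= 1


def can_write(tokens_held, total=TOTAL_TOKENS):
    """Writing needs every token for the block."""
    return tokens_held == total


def transfer(holdings, src, dst, count):
    """Move tokens between nodes; the sum over all nodes never changes."""
    if count < 0 or holdings[src] < count:
        raise ValueError("cannot transfer more tokens than are held")
    updated = dict(holdings)
    updated[src] -= count
    updated[dst] += count
    return updated


before = {"L1-0": TOTAL_TOKENS, "L1-1": 0}
after = transfer(before, "L1-0", "L1-1", 1)
```

Once L1-0 gives up a single token it can still read the block but can no longer write it, and L1-1 gains read permission.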
Related Files
  • src/mem/protocols
    • MOESI_CMP_token-L1cache.sm: L1 cache controller specification
    • MOESI_CMP_token-L2cache.sm: L2 cache controller specification
    • MOESI_CMP_token-dir.sm: directory controller specification
    • MOESI_CMP_token-dma.sm: DMA controller specification
    • MOESI_CMP_token-msg.sm: message type specification
    • MOESI_CMP_token.slicc: container file
Controller Description
  • L1 Cache
States Invariants
MM The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state).
MM_W The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state). Replacements and DMA accesses are not allowed in this state. The block automatically transitions to MM state after a timeout.
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
M_W The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Only loads and stores are allowed. Silent upgrade happens to MM_W state on store. Replacements and DMA accesses are not allowed in this state. The block automatically transitions to M state after a timeout.
S The cache block is held in shared state by 1 or more nodes. Stores are not allowed in this state.
I The cache block is invalid.
  • L2 cache
States Invariants
NP The cache block is not present at this node.
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
S The cache line holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The cache line can be read, but not written in this state.
I The cache line is invalid and does not hold a valid copy of the data.
  • Directory controller
States Invariants
O Owner.
NO Not Owner.
L Locked.

MOESI_CMP_directory

Editing in progress.

Protocol Overview
  • TODO: cache hierarchy
  • In contrast with the MESI protocol, the MOESI protocol introduces an additional Owned state.
  • The MOESI protocol also includes many coalescing optimizations not available in the MESI protocol.
Related Files
  • src/mem/protocols
    • MOESI_CMP_directory-L1cache.sm: L1 cache controller specification
    • MOESI_CMP_directory-L2cache.sm: L2 cache controller specification
    • MOESI_CMP_directory-dir.sm: directory controller specification
    • MOESI_CMP_directory-dma.sm: DMA controller specification
    • MOESI_CMP_directory-msg.sm: message type specification
    • MOESI_CMP_directory.slicc: container file
L1 Cache Controller
  • Stable States and Invariants
States Invariants
MM The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state).
MM_W The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state). Replacements and DMA accesses are not allowed in this state. The block automatically transitions to MM state after a timeout.
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
M_W The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Only loads and stores are allowed. Silent upgrade happens to MM_W state on store. Replacements and DMA accesses are not allowed in this state. The block automatically transitions to M state after a timeout.
S The cache block is held in shared state by 1 or more nodes. Stores are not allowed in this state.
I The cache block is invalid.
  • FSM Abstraction

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory L1cache FSM.jpg
    • Optimizations
States Description
SM A GETX has been issued to get exclusive permissions for an impending store to the cache block, but an old copy of the block is still present. Stores and Replacements are not allowed in this state.
OM A GETX has been issued to get exclusive permissions for an impending store to the cache block, the data has been received, but not all expected acknowledgments have arrived yet. Stores and Replacements are not allowed in this state.

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory L1cache optim FSM.jpg
L2 Cache Controller
  • Stable States and Invariants
States are grouped by intra-chip inclusion category (numbered), then by inter-chip exclusion:

  1. Not in any L1 or L2 at this chip
    • May be present at other chips
      • NP/I: The cache block at this chip is invalid.
  2. Not in L2, but in 1 or more L1s at this chip
    • May be present at other chips
      • ILS: The cache block is not present at L2 on this chip. It is shared locally by L1 nodes in this chip.
      • ILO: The cache block is not present at L2 on this chip. Some L1 node in this chip is an owner of this cache block.
      • ILOS: The cache block is not present at L2 on this chip. Some L1 node in this chip is an owner of this cache block. There are also L1 sharers of this cache block in this chip.
    • Not present at any other chip
      • ILX: The cache block is not present at L2 on this chip. It is held in exclusive mode by some L1 node in this chip.
      • ILOX: The cache block is not present at L2 on this chip. It is held exclusively by this chip and some L1 node in this chip is an owner of the block.
      • ILOSX: The cache block is not present at L2 on this chip. It is held exclusively by this chip. Some L1 node in this chip is an owner of the block. There are also L1 sharers of this cache block in this chip.
  3. In L2, but not in any L1 at this chip
    • May be present at other chips
      • S: The cache block is not present at L1 on this chip. It is held in shared mode at L2 on this chip and is also potentially shared across chips.
      • O: The cache block is not present at L1 on this chip. It is held in owned mode at L2 on this chip. It is also potentially shared across chips.
    • Not present at any other chip
      • M: The cache block is not present at L1 on this chip. It is present at L2 on this chip and is potentially modified.
  4. Both in L2, and 1 or more L1s at this chip
    • May be present at other chips
      • SLS: The cache block is present at L2 in shared mode on this chip. There exist local L1 sharers of the block on this chip. It is also potentially shared across chips.
      • OLS: The cache block is present at L2 in owned mode on this chip. There exist local L1 sharers of the block on this chip. It is also potentially shared across chips.
    • Not present at any other chip
      • OLSX: The cache block is present at L2 in owned mode on this chip. There exist local L1 sharers of the block on this chip. It is held exclusively by this chip.


  • FSM Abstraction

The controller is described in 2 parts. The first picture shows transitions between all "intra-chip inclusion" categories and within categories 1, 3, 4. Transitions within category 2 (Not in L2, but in 1 or more L1s at this chip) are shown in the second picture.

The notation used in the controller FSM diagrams is described here. Transitions involving other chips are annotated in brown.

MOESI CMP directory L2cache FSM part 1.jpg

The second picture below expands the central hexagonal portion of the above picture to show transitions within category 2 (Not in L2, but in 1 or more L1s at this chip).

The notation used in the controller FSM diagrams is described here. Transitions involving other chips are annotated in brown.

MOESI CMP directory L2cache FSM part 2.jpg
Directory Controller
  • Stable States and Invariants
States Invariants
M The cache block is held in exclusive state by only 1 node (which is also the owner). There are no sharers of this block. The data is potentially different from that in memory.
O The cache block is owned by exactly 1 node. There may be sharers of this block. The data is potentially different from that in memory.
S The cache block is held in shared state by 1 or more nodes. No node has ownership of the block. The data is consistent with that in memory (Check).
I The cache block is invalid.
  • FSM Abstraction

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory dir FSM.jpg
Other features
  • Timeouts:

Rathijit will do it

MESI_CMP_directory

Protocol Overview
  • This protocol models a two-level cache hierarchy. The L1 cache is private to a core, while the L2 cache is shared among the cores. The L1 cache is split into instruction and data caches.
  • Inclusion is maintained between the L1 and L2 caches.
  • At a high level the protocol has four stable states: M, E, S and I. A block in the M state is writable (i.e. held with exclusive permission) and has been dirtied (i.e. it is the only valid copy on-chip). The E state represents a cache block held with exclusive permission (i.e. writable) that has not been written yet. The S state means the cache block is only readable, and multiple copies of it may exist in several private caches as well as in the shared cache. I means that the cache block is invalid.
  • On-chip cache coherence is maintained through a directory coherence scheme, where the directory information is co-located with the corresponding cache blocks in the shared L2 cache.
  • The protocol has four types of controllers -- L1 cache controller, L2 cache controller, Directory controller and DMA controller. The L1 cache controller is responsible for managing the L1 instruction and L1 data caches. The number of L1 cache controller instances equals the number of cores in the simulated system. The L2 cache controller is responsible for managing the shared L2 cache and for maintaining coherence of on-chip data through the directory coherence scheme. The Directory controller acts as the interface to the memory controller/off-chip main memory and is also responsible for coherence across multiple chips and for external coherence requests from the DMA controller. The DMA controller is responsible for satisfying coherent DMA requests.
  • One of the primary optimizations in this protocol is that when an L1 cache requests a data block, even for read permission only, the L2 cache controller returns the block with exclusive permission if it finds that no other core has the block. This anticipates that a block that has been read will soon be written by the same core, and thus saves an extra request. This is exactly why the E state exists (i.e. for a cache block that is writable but not yet written).
  • The protocol supports silent eviction of clean cache blocks from the private L1 caches. This means that a cache block that has not been written to and has only read permission can be dropped from the private L1 cache without informing the L2 cache. This optimization helps reduce write-back traffic to the L2 cache controller.
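
The exclusive-grant optimization and the silent-eviction rule from the overview can be sketched as two small decision functions. Both function names and the use of a plain set of L1 identifiers are assumptions for illustration; forwarding to an exclusive owner and all transient states are left out.

```python
def l2_reply_to_gets(requestor, l1_holders):
    """Pick the permission the L2 returns for a read (GETS) request.

    l1_holders is the set of L1 caches the L2 believes hold the block.
    """
    others = set(l1_holders) - {requestor}
    if not others:
        return "E"  # no other core has the block: grant exclusive permission
    return "S"      # other holders exist: grant read-only permission


def can_evict_silently(l1_state):
    """Only clean, read-only (S) blocks may be dropped without telling the L2."""
    return l1_state == "S"
```

Because of silent evictions, the set of holders seen by the L2 may include caches that have already dropped the block, so in this sketch a stale entry can cause an S grant where E would have been possible.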
Related Files
  • src/mem/protocols
    • MESI_CMP_directory-L1cache.sm: L1 cache controller specification
    • MESI_CMP_directory-L2cache.sm: L2 cache controller specification
    • MESI_CMP_directory-dir.sm: directory controller specification
    • MESI_CMP_directory-dma.sm: DMA controller specification
    • MESI_CMP_directory-msg.sm: coherence message type specifications. This defines the fields of the different message types used by the protocol
    • MESI_CMP_directory.slicc: container file
Controller Description
  • L1 cache controller
States Invariants and Semantic/Purpose of the state
M The cache block is held in exclusive state by only one L1 cache. There are no sharers of this block. The data is potentially the only valid copy in the system. The copy of the cache block is both writable and readable.
E The cache block is held with exclusive permission by exactly one L1 cache. The difference from the M state is that the cache block is writable (and readable) but not yet written.
S The cache block is held in shared state by 1 or more L1 caches and/or by the L2 cache. The block is only readable. No cache can have the cache block with exclusive permission.
I / NP The cache block is invalid.
IS It is a transient state. A GETS (Read) request has been issued for the cache block and the cache is awaiting the response. The cache block is neither readable nor writable.
IM It is a transient state. A GETX (Write) request has been issued for the cache block and the cache is awaiting the response. The cache block is neither readable nor writable.
SM It is a transient state. The cache block was originally in the S state, an UPGRADE (Write) request was then issued to get exclusive permission for the block, and the cache is awaiting the response. The cache block is readable.
IS_I It is a transient state. While in the IS state, the cache controller received an Invalidation from the L2 cache's directory. This happens because of a race in which another core wrote to the cache block while the given core was trying to get the same block for reading. The cache block is neither readable nor writable.
M_I It is a transient state. The cache is trying to replace a cache block in the M state, and the write-back (PUTX) to the L2 cache's directory has been issued, but the write-back acknowledgement has not yet arrived.
SINK_WB_ACK It is a transient state. This state is reached when, while waiting for a write-back acknowledgement from the L2 cache's directory, the L1 cache receives an intervention (a forwarded request from another core). This indicates a race between the issued write-back to the directory and a request from another cache. It also indicates that the write-back has lost the race (i.e. another core's request reached the L2 before the write-back did). This state is essential to avoid the complicated race conditions that could arise if write-backs were silently dropped at the directory.
  • L2 cache controller

Recall that the on-chip directory is co-located with the corresponding cache blocks in the L2 cache. Thus the following states of an L2 cache block encode the status and permissions of the cache block in the L2 cache as well as the coherence status of copies that may be present in one or more private L1 caches. Beyond the coherence states, two more fields per cache block help in taking the proper coherence actions. The first is the Sharers field, which can be thought of as a bit-vector indicating which of the private L1 caches potentially have the given cache block. The other is the Owner field, which is the identity of the private L1 cache that holds the cache block with exclusive permission, if any.

States Invariants and Semantic/Purpose of the state
NP The cache blocks is not present in the on-chip cache hierarchy.
SS The cache block is present in potentially multiple private caches in only readable mode (i.e.in "S" state in private caches). Corresponding "Sharers" vector with the block should give the identity of the private caches which possibly have the cache block in its cache. The cache block in the L2 cache is valid and readable.
M The cache block is present ONLY in the L2 cache and has exclusive permission. L1 Cache's read/write requests (GETS/GETX) can be satisfied directly from the L2 cache.
MT The cache block is in ONE of the private L1 caches with exclusive permission. The data in the L2 cache is potentially stale. The identity of the L1 cache which has the block can be found in the "Owner" field associated with the cache block. Any read/write request (GETS/GETX) from other cores/private L1 caches needs to be forwarded to the owner of the cache block. The L2 cannot service such requests itself.
M_I It is a transient state. This state indicates that the cache is trying to replace the cache block and that the write-back (PUTX/PUTS) to the Directory controller (which acts as the interface to main memory) has been issued, but the write-back acknowledgement has not yet arrived. The data is neither readable nor writable.
MT_I It is a transient state. This state indicates that the cache is trying to replace a cache block that is in the MT state. An invalidation has been issued to the current owner (a private L1 cache) of the cache block, and the L2 is awaiting the write-back from the owner L1 cache. Note that this invalidation (called back-invalidation) is instrumental in maintaining inclusion between the L1 and L2 caches. The data is neither readable nor writable.
MCT_I It is a transient state. This state is the same as MT_I, except that the data in the L2 cache is known to be clean. The data is neither readable nor writable.
I_I It is a transient state. The L2 cache is trying to replace a cache block in the SS state, and the cache block in the L2 is clean. Invalidations have been sent to all potential sharers (L1 caches) of the cache block. The L2 cache's directory is waiting for all the required acknowledgements to arrive from the L1 caches. Note that this invalidation (called back-invalidation) is instrumental in maintaining inclusion between the L1 and L2 caches. The data is neither readable nor writable.
S_I It is a transient state. Same as I_I, except that the data in the L2 cache for the cache block is dirty. This means that, unlike in the case of I_I, the data needs to be sent to main memory. The cache block is neither readable nor writable.
ISS It is a transient state. The L2 has received a GETS (read) request from one of the private L1 caches for a cache block that is not present in the on-chip caches. A read request has been sent to main memory (the Directory controller), and the L2 is waiting for the response. This state is reached only when the request is for a data cache block (not an instruction cache block). The purpose of this state is that, if only one L1 cache turns out to have requested the cache block, the block is returned to the requester with exclusive permission (although only read permission was requested). The cache block is neither readable nor writable.
IS It is a transient state. This state is similar to ISS, but it is reached instead of ISS if the requested cache block is an instruction cache block, or if more than one core requests the same cache block while the response from memory is pending. Once the requested cache block arrives from main memory, the block is sent to the requester(s) with read-only permission. The cache block is neither readable nor writable in this state.
IM It is a transient state. This state is reached when the L2 cache receives an L1 GETX (write) request for a cache block that is not present in the on-chip cache hierarchy. The request for the cache block in exclusive mode has been issued to main memory, but the response is yet to arrive. The cache block is neither readable nor writable in this state.
SS_MB It is a transient state. In general, any state whose name ends with "B" (like this one) is also a blocking coherence state. This means the directory is waiting for some response from a private L1 cache, and until it receives that response no other request for the block is serviced (i.e. requests are effectively serialized). This particular state is reached when an L1 cache requests a cache block with exclusive permission (i.e. GETX or UPGRADE) and the cache block is in the SS state, which means that readable copies of the block potentially exist in the private L1 caches. Before exclusive permission can be given to the requester, all readable copies in the L1 caches need to be invalidated. This state indicates that the required invalidations have been sent to the potential sharers (L1 caches) and that the requester has been told how many Invalidation Acknowledgements it needs to collect before it can have exclusive permission for the cache block. Once the requester L1 cache has received the required number of Invalidation Acknowledgements, it informs the directory with an UNBLOCK message, which allows the directory to leave this blocking state and resume servicing other requests for the given cache block. The cache block is neither readable nor writable in this state.
MT_MB It is a transient state and also a blocking state. This state is reached when the L2 cache's directory has sent out a cache block with exclusive permission to a requester L1 cache but has yet to receive the UNBLOCK from that L1 cache acknowledging the receipt of exclusive permission. The cache block is neither readable nor writable in this state.
MT_IIB It is a transient state and also a blocking state. This state is reached when a read request (GETS) is received for a cache block that is currently held with exclusive permission in another private L1 cache (i.e. the directory state is MT). On such a request, the L2 cache's directory forwards the request to the current owner L1 cache and transitions to this state. Two events need to happen before this cache block can be unblocked (and further requests for it can be serviced). The current owner L1 cache needs to send a write-back to the L2 cache to update the L2's copy with the latest value. The requester L1 cache also needs to send an UNBLOCK to the L2 cache, indicating that it has received the requested cache block with the desired coherence permissions. The cache block is neither readable nor writable in this state in the L2 cache.
MT_IB It is a transient state and also a blocking state. This state is reached when, in the MT_IIB state, the L2 cache controller receives the UNBLOCK from the requester L1 cache but has yet to receive the write-back from the previous owner L1 cache of the block. The cache block is neither readable nor writable in this state in the L2 cache.
MT_SB It is a transient state and also a blocking state. This state is reached when, in the MT_IIB state, the L2 cache controller receives the write-back from the previous owner L1 cache of the block but has yet to receive the UNBLOCK from the current requester of the cache block. The cache block is neither readable nor writable in this state in the L2 cache.

Network_test

This is a dummy cache coherence protocol that is used to operate the ruby network tester. The details about running the network tester can be found here.

Related Files
  • src/mem/protocols
    • Network_test-cache.sm: cache controller specification
    • Network_test-dir.sm: directory controller specification
    • Network_test-msg.sm: message type specification
    • Network_test.slicc: container file
Cache Hierarchy

This protocol assumes a 1-level cache hierarchy. The role of the cache is simply to send messages from the CPU to the appropriate directory (based on the address), in the appropriate virtual network (based on the message type). It does not track any state. In fact, unlike in other protocols, no CacheMemory is created. The directory receives the messages from the caches but does not send any back. The goal of this protocol is to enable simulation/testing of just the interconnection network.

Stable States and Invariants
States Invariants
I Default state of all cache blocks
Cache controller
  • Requests, Responses, Triggers:
    • Load, Instruction fetch, Store from the core.

The network tester (in src/cpu/testers/networktest/networktest.cc) generates packets of the type ReadReq, INST_FETCH, and WriteReq, which are converted into RubyRequestType:LD, RubyRequestType:IFETCH, and RubyRequestType:ST, respectively, by the RubyPort (in src/mem/ruby/system/RubyPort.hh/cc). These messages reach the cache controller via the Sequencer. The destination for these messages is determined by the traffic type, and embedded in the address. More details can be found here.

  • Main Operation:
    • The goal of the cache is only to act as a source node in the underlying interconnection network. It does not track any states.
    • On a LD from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Control (8 bytes) in the request vnet (0).
      • Note: vnet 0 could also be made to broadcast, instead of sending a directed message to a particular directory, by uncommenting the appropriate line in the a_issueRequest action in Network_test-cache.sm
    • On an IFETCH from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Control (8 bytes) in the forward vnet (1).
    • On a ST from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Data (72 bytes) in the response vnet (2).
    • Note: request, forward and response are just used to differentiate the vnets, but do not have any physical significance in this protocol.
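The request-to-message mapping listed above can be summarized in a short Python sketch. It is not the SLICC code from Network_test-cache.sm: the names ROUTING, OutMsg, issue and map_address_to_directory are hypothetical, and the modulo-based directory mapping is only an assumption used to make the example self-contained. The vnet numbers and message sizes are the ones given in the list.

```python
from typing import Dict, NamedTuple, Tuple


class OutMsg(NamedTuple):
    msg_type: str
    vnet: int
    size_bytes: int
    directory: int


# (vnet, message size in bytes) for each request type, as listed above.
ROUTING: Dict[str, Tuple[int, int]] = {
    "LD": (0, 8),      # request vnet, Control-sized message
    "IFETCH": (1, 8),  # forward vnet, Control-sized message
    "ST": (2, 72),     # response vnet, Data-sized message
}


def map_address_to_directory(addr: int, num_dirs: int, block_bits: int = 6) -> int:
    """Assumed mapping: cache-block number modulo the number of directories."""
    return (addr >> block_bits) % num_dirs


def issue(request_type: str, addr: int, num_dirs: int) -> OutMsg:
    """Build the single MSG that the cache would inject for this request."""
    vnet, size = ROUTING[request_type]
    return OutMsg("MSG", vnet, size, map_address_to_directory(addr, num_dirs))


print(issue("ST", 0x1C0, num_dirs=4))
```

Every request type produces a message of type MSG; only the virtual network and the size differ, which is what lets the tester exercise each vnet separately.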
Directory controller
  • Requests, Responses, Triggers:
    • MSG from the cores
  • Main Operation:
    • The goal of the directory is only to act as a destination node in the underlying interconnection network. It does not track any states.
    • The directory simply pops its incoming queue upon receiving the message.
Other features
    • This protocol assumes only 3 vnets.
    • It should only be used when running the ruby network test.

Protocol Independent Memory components

System

This is a high-level container for a few of the important components of Ruby that may need to be accessed from various parts of Ruby. Only ONE instance of this class is created, and it is globally available through a pointer named g_system_ptr. It holds a pointer to Ruby's profiler object, which allows any component of Ruby to get hold of the profiler through g_system_ptr and collect statistics in a central location. It also holds important information about the memory hierarchy, such as the cache block size and the physical memory size, and makes it available to all parts of Ruby as and when required. It holds the pointer to the on-chip network in Ruby as well. Another important object it points to is Ruby's wrapper for the simulator event queue (called RubyEventQueue). It also contains a pointer to the simulated physical memory (the variable m_mem_vec_ptr). In sum, the System class in Ruby acts as a container for pointers to some important objects of Ruby's memory system and makes them available globally within Ruby by exposing itself through g_system_ptr.

Parameters
  1. random_seed is the seed used to randomize delays in Ruby. This allows simulating multiple runs with slightly perturbed delays/timings.
  2. randomization is the parameter that, when turned on, asks Ruby to randomly delay messages. This flag is useful when stress testing a system to expose corner cases. This flag should NOT be turned on when collecting simulation statistics.
  3. clock is the parameter for setting the clock frequency (on-chip).
  4. block_size_bytes specifies the size of cache blocks in bytes.
  5. mem_size specifies the physical memory size of the simulated system.
  6. network gives the pointer to the on-chip network for Ruby.
  7. profiler gives the pointer to the profiler of Ruby.
  8. tracer is the pointer to Ruby's memory request tracer. The tracer is primarily used to play back a memory request trace in order to warm up Ruby's caches before the actual simulation starts.
Related files
  • src/mem/ruby/system
    • System.hh/cc: contains code for the System
    • RubySystem.py: the corresponding python file with parameters.

Sequencer

Sequencer is one of the most important classes in Ruby. Every memory request must pass through it at least twice: once before being serviced by the cache coherence mechanism and once just after being serviced by it. There is one instance of the Sequencer class for each hardware thread being simulated. For example, if we are simulating a 16-core system in which each core has a single hardware thread context, then there are 16 Sequencer objects in the system, each responsible for managing the memory requests (loads, stores, atomic operations etc.) from one hardware thread context. The i-th Sequencer object handles requests only from the i-th hardware context (in the above example, the i-th core). Each Sequencer assumes that it has access to the L1 Instruction and Data caches attached to its core.

Following are the primary responsibilities of Sequencer:

  1. Injecting memory requests into the underlying cache hierarchy (and coherence mechanism) and accounting for them.
  2. Resource allocation and accounting (e.g. making sure a particular core/hardware thread does not have more than the specified number of outstanding memory requests).
  3. Making sure atomic operations are handled properly.
  4. Making sure that the underlying cache hierarchy and coherence protocol are making forward progress.
  5. Once a request is serviced by the underlying cache hierarchy, returning the result to the corresponding port of the frontend (i.e. M5).
Parameters
  1. icache is the parameter where the L1 Instruction cache pointer is passed.
  2. dcache is the parameter where the L1 Data cache pointer is passed.
  3. max_outstanding_requests is the parameter that specifies the maximum allowed number of outstanding memory requests from a given core or hardware thread.
  4. deadlock_threshold is the parameter that specifies the number of cycles (Ruby cycles) after which, if a given memory request has not been satisfied by the cache hierarchy, a possible deadlock (or lack of forward progress) is declared.
Related files
  • src/mem/ruby/system
    • Sequencer.hh/cc: contains code for the Sequencer
    • Sequencer.py: the corresponding python file with parameters.
More detailed operation description

In this section, we describe the operations of the Sequencer in more detail.

  • Injection of memory request to cache hierarchy, request accounting and resource allocation:
The entry point for a memory request to the Sequencer is a method called makeRequest. This method is called from the corresponding RubyPort (Ruby's wrapper for an M5 port). Once it gets the request, the Sequencer checks whether any resource limitation needs to be enforced. This is done by calling a method called getRequestStatus. This method makes sure that a given Sequencer (i.e. a core or hardware thread context) does NOT issue multiple simultaneous requests to the same cache block. If the current request is for a cache block for which another request from the same Sequencer is still pending, the current request is not issued to the cache hierarchy; instead, it waits for the previous request to the same cache block to be satisfied first. In the code, this is done by checking the two request accounting tables, m_writeRequestTable and m_readRequestTable, and setting the status to RequestStatus_Aliased. The method also makes sure that the number of outstanding memory requests from a given Sequencer does not exceed the allowed maximum.
If the current request does not violate any of the above constraints, it is registered for accounting purposes. This is done by calling the method named insertRequest. In this method, depending on the type of the request, an entry for the request is created in either m_writeRequestTable or m_readRequestTable. These two tables keep records of the write requests and read requests, respectively, that the given Sequencer has issued to the cache hierarchy but that are yet to be satisfied. The memory request is then pushed to the cache hierarchy by calling the method named issueRequest. This method is responsible for creating the request structure understood by the underlying SLICC-generated coherence protocol implementation and cache hierarchy. This is done by setting the request type and mode accordingly and creating an object of class RubyRequest for the current request. Finally, the L1 Instruction or L1 Data cache access latency is accounted for and the request is pushed to the cache hierarchy and the coherence mechanism for processing. This is done by enqueuing the request on the mandatory queue (through the pointer m_mandatory_q_ptr). Every request is passed to the corresponding L1 cache controller through this mandatory queue. The cache hierarchy is then responsible for satisfying the request.
  • Deadlock/lack of forward progress detection:
As mentioned earlier, another responsibility of the Sequencer is to make sure that the cache hierarchy is making progress in servicing the memory requests that have been issued. This is done by periodically waking up and scanning the m_writeRequestTable and m_readRequestTable tables, which hold the currently outstanding requests from the given Sequencer, to find requests that have been issued but not yet satisfied by the cache hierarchy. If it finds any unsatisfied request that was issued more than m_deadlock_threshold (parameter) cycles ago, it reports a possible deadlock in the cache hierarchy. Note that, although it reports a possible deadlock, what it actually detects is lack of forward progress. Thus there may be false positives in this deadlock detection mechanism.
  • Send back result to the front-end and making sure Atomic operations are handled properly:
Once the cache hierarchy satisfies a request, it calls the Sequencer's readCallback or writeCallback method, depending on the type of the request. Note that the time taken to service the request is automatically accounted for through event scheduling, as readCallback or writeCallback is called only after the number of cycles required to satisfy the request has elapsed. In these two methods, the corresponding record of the request is removed from m_readRequestTable or m_writeRequestTable. Also, if the request was part of an atomic operation (e.g. RMW or LL/SC), appropriate actions are taken to make sure that the semantics of atomic operations are respected.
After that, a method called hitCallback is called. In this method, some statistics are collected by calling functions on Ruby's profiler. For write requests, the data is actually updated in this function (and not while simulating the request through the cache hierarchy and coherence protocol). Finally, ruby_hit_callback is called, which ultimately sends the packet back to the front-end, signifying that Ruby has completed processing the memory request.
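The accounting described in this section can be condensed into a toy Python model. This is a sketch under stated assumptions, not the Sequencer implementation: the class ToySequencer, its method names and the status strings are hypothetical, 64-byte cache blocks are assumed, and atomic operations and latencies are left out. It only shows the aliasing check, the outstanding-request limit, the two request tables, and the scan for requests that have been pending for too long.

```python
from typing import Dict, List, Tuple

BLOCK_BITS = 6  # assumption: 64-byte cache blocks


def line_address(addr: int) -> int:
    """Clear the block-offset bits so aliasing is checked per cache block."""
    return (addr >> BLOCK_BITS) << BLOCK_BITS


class ToySequencer:
    def __init__(self, max_outstanding_requests: int, deadlock_threshold: int):
        self.max_outstanding = max_outstanding_requests
        self.deadlock_threshold = deadlock_threshold
        self.read_table: Dict[int, int] = {}   # line address -> issue cycle
        self.write_table: Dict[int, int] = {}  # line address -> issue cycle
        self.mandatory_queue: List[Tuple[int, bool]] = []  # to the L1 controller

    def get_request_status(self, addr: int) -> str:
        line = line_address(addr)
        if line in self.read_table or line in self.write_table:
            return "Aliased"     # another request to this block is pending
        if len(self.read_table) + len(self.write_table) >= self.max_outstanding:
            return "BufferFull"  # too many outstanding requests
        return "Ready"

    def make_request(self, addr: int, is_write: bool, now: int) -> str:
        status = self.get_request_status(addr)
        if status != "Ready":
            return status
        table = self.write_table if is_write else self.read_table
        table[line_address(addr)] = now  # record the request (insertRequest step)
        self.mandatory_queue.append((line_address(addr), is_write))  # issue step
        return "Issued"

    def callback(self, addr: int, is_write: bool) -> None:
        """Called when the cache hierarchy has satisfied the request."""
        table = self.write_table if is_write else self.read_table
        del table[line_address(addr)]

    def possible_deadlocks(self, now: int) -> List[int]:
        """Line addresses pending for more than deadlock_threshold cycles."""
        return sorted(line
                      for table in (self.read_table, self.write_table)
                      for line, issued in table.items()
                      if now - issued > self.deadlock_threshold)


seq = ToySequencer(max_outstanding_requests=2, deadlock_threshold=100)
print(seq.make_request(0x1000, is_write=False, now=0))  # accepted
print(seq.make_request(0x1020, is_write=True, now=1))   # same 64-byte block
```

As in the text, a long-pending request is only evidence of a lack of forward progress; the toy model cannot distinguish a slow request from a real deadlock either.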

CacheMemory and Cache Replacement Polices

This module can model any set-associative cache structure with a given associativity and size. Each instance of this module models a single bank of a cache. Thus each type of cache in the system (e.g. L1 Instruction, L1 Data, L2 etc.) and every bank of a cache needs a separate instance of this module. This module can also model a fully associative cache when the associativity is set equal to the number of cache lines in the bank (i.e. there is a single set); an associativity of 1 gives a direct-mapped cache. In the Ruby memory system, this module is primarily expected to be accessed by the SLICC-generated code of the coherence protocol being modeled.

Basic Operation

This module models the set-associative structure as a two-dimensional (2D) array. Each row of the 2D array represents a set in the set-associative cache structure, while the columns represent ways. The number of columns is equal to the given associativity (parameter), while the number of rows is determined by the desired size of the structure (parameter), the associativity (parameter) and the size of the cache line (parameter). This module exposes six important functions which coherence protocols use to manage the caches.

  1. It allows querying whether a given cache line address is present in the set-associative structure being modeled, through a function named isTagPresent. This function returns true iff the given cache line address is present.
  2. It allows a lookup operation which returns the cache entry for a given cache line address (if present), through a function named lookup. It returns NULL if the block with the given address is not present in the set-associative cache structure.
  3. It allows allocating a new cache entry in the set-associative structure, through a function named allocate.
  4. It allows deallocating the cache entry of a given cache line address, through a function named deallocate.
  5. It can be queried to find out whether allocating an entry for a given cache line address would require replacing another entry in the designated set (derived from the cache line address). This functionality is provided by the cacheAvail function, which, for a given cache line address, returns true if NO replacement of another entry in the same set is required to make space for a new entry with the given address.
  6. The function cacheProbe is used to find the cache line address of a victim line, in case placing a new entry would require victimizing another cache block in the same set. Given the address of the new cache line that would have to be allocated, this function returns the cache line address of the victim line.
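The six operations can be illustrated with a toy Python model of one cache bank. The class ToyCacheMemory is hypothetical and much simpler than CacheMemory: entries are plain dictionaries, the set index is assumed to be the block number modulo the number of sets, and cache_probe always picks way 0 instead of consulting a replacement policy. Treat it as a sketch of the interface, not of the implementation.

```python
from typing import Dict, List, Optional


class ToyCacheMemory:
    def __init__(self, num_sets: int, assoc: int, block_bits: int = 6):
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_bits = block_bits
        # Rows are sets and columns are ways; None marks an empty way.
        self.cache: List[List[Optional[dict]]] = [
            [None] * assoc for _ in range(num_sets)]
        self.tag_index: Dict[int, int] = {}  # line address -> way

    def _set_of(self, line_addr: int) -> int:
        return (line_addr >> self.block_bits) % self.num_sets

    def is_tag_present(self, line_addr: int) -> bool:
        return line_addr in self.tag_index

    def lookup(self, line_addr: int) -> Optional[dict]:
        way = self.tag_index.get(line_addr)
        return None if way is None else self.cache[self._set_of(line_addr)][way]

    def cache_avail(self, line_addr: int) -> bool:
        """True if the line can be placed without evicting another line."""
        ways = self.cache[self._set_of(line_addr)]
        return line_addr in self.tag_index or None in ways

    def allocate(self, line_addr: int) -> dict:
        assert not self.is_tag_present(line_addr) and self.cache_avail(line_addr)
        ways = self.cache[self._set_of(line_addr)]
        way = ways.index(None)  # first empty way in the set
        ways[way] = {"addr": line_addr}
        self.tag_index[line_addr] = way
        return ways[way]

    def deallocate(self, line_addr: int) -> None:
        way = self.tag_index.pop(line_addr)
        self.cache[self._set_of(line_addr)][way] = None

    def cache_probe(self, line_addr: int) -> int:
        """Address of the line to evict; this toy always picks way 0."""
        assert not self.cache_avail(line_addr)
        return self.cache[self._set_of(line_addr)][0]["addr"]


bank = ToyCacheMemory(num_sets=2, assoc=2)
bank.allocate(0x000)
bank.allocate(0x080)            # maps to the same set as 0x000
print(bank.cache_avail(0x100))  # the set is full, so a victim is needed
print(hex(bank.cache_probe(0x100)))
```

A protocol would typically call cache_avail first, and only when it returns False use cache_probe to pick a victim, deallocate it, and then allocate the new line.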
Parameters

There are four important parameters for this class.

  1. size is the parameter that provides the size of the set-associative structure being modeled in units of bytes.
  2. assoc specifies the set-associativity of the structure.
  3. replacement_policy is the name of the replacement policy that would be used to select victim cache line when there is conflict in a given set. Currently, only two possible choices are available (PSEUDO_LRU and LRU).
  4. Finally, the start_index_bit parameter specifies the bit position in the address from which indexing into the cache starts. This is a tricky parameter: if it is not set properly, only a portion of the cache capacity ends up being used. How to choose its value is therefore explained through a couple of examples. Assume a cache line size of 64 bytes and a single-core machine with an L1 cache that has only one bank and an L2 cache that has 4 banks. The CacheMemory module that models the L1 cache should have start_index_bit set to log2(64) = 6 (this is the default value, which assumes 64-byte cache lines). This is required because addresses passed around in Ruby are full addresses (i.e. as wide as needed to reach any address in the physical address range), and because the caches are accessed at cache line granularity (here 64 bytes), the low-order 6 bits of the address are essentially 0. So the last 6 bits of the address should be discarded when calculating which set (index) of the set-associative structure the address maps to. Now consider the more complicated case of the L2 cache, which has 4 banks. As mentioned previously, this module models a single bank of a set-associative cache, so there will be four instances of the CacheMemory class to model the whole L2 cache. Assuming the bank a request goes to is statically decided by the low-order log2(4) = 2 bits of the cache line address, the address bits at positions 6 and 7 are the same for all accesses arriving at a given bank (i.e. at a given instance of CacheMemory). Indexing within the CacheMemory instance modeling a bank should therefore use address bits 8 and higher to find the set a cache block goes to, so start_index_bit should be set to 8 for the L2 banks in this example. If it is erroneously set to 6, only a fourth of the desired L2 capacity would be utilized.
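The effect of start_index_bit in the 4-bank L2 example can be checked with a few lines of Python. The helper names and the choice of 16 sets per bank are assumptions made for illustration, and the number of sets must be a power of two for the masking to work; the sketch simply counts how many distinct sets of one bank are ever indexed.

```python
from typing import Set


def address_to_cache_set(addr: int, start_index_bit: int, num_sets: int) -> int:
    """Set index taken from the address bits starting at start_index_bit."""
    return (addr >> start_index_bit) & (num_sets - 1)


def sets_used_by_bank(bank: int, start_index_bit: int, num_sets: int,
                      num_banks: int = 4, block_bits: int = 6,
                      num_lines: int = 4096) -> Set[int]:
    """Sets touched in one bank when lines are spread over banks by low bits."""
    used = set()
    for line in range(num_lines):
        if line % num_banks == bank:  # bank chosen by low-order line-address bits
            addr = line << block_bits
            used.add(address_to_cache_set(addr, start_index_bit, num_sets))
    return used


good = sets_used_by_bank(bank=1, start_index_bit=8, num_sets=16)
bad = sets_used_by_bank(bank=1, start_index_bit=6, num_sets=16)
print(len(good), len(bad))  # sets reachable with each setting
```

With start_index_bit set to 6, the two lowest index bits are constant for every access that reaches the bank, so only one in four sets is reachable.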
More detailed description of operation

As mentioned previously, the set-associative structure is modeled as a 2D array in the CacheMemory class. The variable m_cache is this all-important 2D array containing the set-associative structure. Each element of this 2D array is of a type derived from the AbstractCacheEntry class. Besides the minimal required functionality and contents of each cache entry, the entry type can be extended inside the coherence protocol files. This makes CacheMemory generic enough to hold any type of cache entry desired by a given coherence protocol, as long as it derives from the AbstractCacheEntry interface. The m_cache 2D array has a number of rows equal to the number of sets in the set-associative structure being modeled, while the number of columns is equal to the associativity.

As in any set-associative structure, the set (row) in which a cache entry resides is decided by the part of the cache block address used for indexing. The function addressToCacheSet calculates this index for a given address. The way in which a cache entry resides within its designated set (row) is recorded in a hash_map object named m_tag_index. So to access a cache entry in the set-associative structure, first the set number where the cache block should reside is calculated, and then m_tag_index is looked up to find the way in which the required cache block resides. If a way holds an invalid entry or is empty, it is set to NULL or its permission is set to NotPresent.

One important aspect of Ruby's caches is the separation of the set-associative structure of the cache from its replacement policy. This allows a modular design in which the structure of the cache is independent of the replacement policy. When a victim needs to be selected to make space for a new cache block (by calling the cacheProbe function), the getVictim function of the class implementing the replacement policy is called for the given set. getVictim returns the way number of the victim. The replacement policy is informed about accesses through calls to its touch function, which allows it to update the access recency. Currently two replacement policies are supported: LRU and PseudoLRU. The LRU policy has a straightforward implementation: it keeps track of the absolute time at which each way within each set was last accessed, and it always victimizes the entry that was last accessed furthest back in time. PseudoLRU implements a binary-tree based Non-Recently-Used policy. It arranges the ways in each set in an implicit binary tree structure. Every node of the binary tree encodes which of its two subtrees was accessed more recently. During victim selection, it starts from the root of the tree and traverses down, at each node choosing the subtree that was touched less recently. Traversal continues until it reaches a leaf node, and the id of that leaf node is returned.
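The two policies can be contrasted with a minimal Python sketch. The classes ToyLRU and ToyTreePLRU are hypothetical and only mirror the touch/getVictim split described above; the heap-style node numbering and the meaning of each node bit (1 means the right subtree was used more recently) are assumptions of this sketch, and the associativity must be a power of two.

```python
from typing import List


class ToyLRU:
    def __init__(self, num_sets: int, assoc: int):
        self.last_use: List[List[int]] = [[0] * assoc for _ in range(num_sets)]

    def touch(self, set_idx: int, way: int, time: int) -> None:
        self.last_use[set_idx][way] = time

    def get_victim(self, set_idx: int) -> int:
        times = self.last_use[set_idx]
        return times.index(min(times))  # way accessed furthest back in time


class ToyTreePLRU:
    def __init__(self, num_sets: int, assoc: int):
        assert assoc >= 2 and assoc & (assoc - 1) == 0, "power-of-two ways only"
        self.assoc = assoc
        # assoc - 1 internal nodes per set, in heap order (children 2n+1, 2n+2).
        # bit == 1: right subtree used more recently; bit == 0: left subtree.
        self.bits: List[List[int]] = [[0] * (assoc - 1) for _ in range(num_sets)]

    def touch(self, set_idx: int, way: int, time: int = 0) -> None:
        node = way + self.assoc - 1  # heap position of the leaf for this way
        while node > 0:
            parent = (node - 1) // 2
            self.bits[set_idx][parent] = 1 if node == 2 * parent + 2 else 0
            node = parent

    def get_victim(self, set_idx: int) -> int:
        node = 0
        while node < self.assoc - 1:  # descend until a leaf is reached
            went_right_recently = self.bits[set_idx][node] == 1
            node = 2 * node + 1 if went_right_recently else 2 * node + 2
        return node - (self.assoc - 1)


lru, plru = ToyLRU(1, 4), ToyTreePLRU(1, 4)
for t, way in enumerate([0, 1, 2, 3, 0], start=1):
    lru.touch(0, way, t)
    plru.touch(0, way, t)
print(lru.get_victim(0), plru.get_victim(0))
```

After the access sequence 0, 1, 2, 3, 0 the two policies disagree: true LRU evicts way 1, while the tree only remembers that the left half was used more recently and picks a way from the right half.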

Related files
  • src/mem/ruby/system
    • CacheMemory.cc: contains CacheMemory class which models a cache bank
    • CacheMemory.hh: Interface for the CacheMemory class
    • Cache.py: Python configuration file for CacheMemory class
    • AbstractReplacementPolicy.hh: Generic interface for Replacement policies for the Cache
    • LRUPolicy.hh: contains LRU replacement policy
    • PseudoLRUPolicy.hh: contains Pseudo-LRU replacement policy
  • src/mem/ruby/slicc_interface
    • AbstractCacheEntry.hh: contains the interface of Cache Entry

DMASequencer

This module implements handling for DMA transfers. It is derived from the RubyPort class. There can be a number of DMA controllers that interface with the DMASequencer. The DMA sequencer has a protocol-independent interface and implementation. The DMA controllers are described with SLICC and are protocol-specific.

TODO: Fix documentation to reflect latest changes in the implementation.

Note:

  1. At any time there can be only 1 request active in the DMASequencer.
  2. Only ordinary load and store requests are handled. No other request types such as Ifetch, RMW, LL/SC are handled
Related Files
  • src/mem/ruby/system
    • DMASequencer.hh: Declares the DMASequencer class and structure of a DMARequest
    • DMASequencer.cc: Implements the methods of the DMASequencer class, such as request issue and callbacks.
Configuration Parameters

Currently there are no special configuration parameters for the DMASequencer.

Basic Operation

A request for data transfer is split up into multiple requests, each transferring a cache-block-size chunk of data. A request remains active until all the smaller transfers have completed. During this time, the DMASequencer is in a busy state and cannot accept any new transfer requests.

DMA requests are made through the makeRequest method. If the sequencer is not busy and the request is of the correct type (LD/ST), it is accepted. A sequence of requests for smaller data chunks is then issued. The issueNext method issues each of the smaller requests. A data/acknowledgment callback signals completion of the last transfer and triggers the next call to issueNext until all of the original data transfer is complete. There is no separate event scheduler within the DMASequencer.
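The splitting and the busy flag can be sketched in Python as follows. The function split_dma_request and the class ToyDMASequencer are hypothetical names, 64-byte cache blocks are assumed, and handling of an unaligned start address by a shorter first chunk is an assumption of this sketch; it only illustrates the one-request-at-a-time behaviour and the callback-driven issue of the next chunk.

```python
from typing import List, Tuple


def split_dma_request(start_addr: int, length: int,
                      block_size: int = 64) -> List[Tuple[int, int]]:
    """Split a transfer into (address, bytes) chunks within block boundaries."""
    chunks = []
    addr, remaining = start_addr, length
    while remaining > 0:
        room = block_size - (addr % block_size)  # bytes left in this block
        n = min(room, remaining)
        chunks.append((addr, n))
        addr += n
        remaining -= n
    return chunks


class ToyDMASequencer:
    def __init__(self, block_size: int = 64):
        self.block_size = block_size
        self.busy = False
        self.pending: List[Tuple[int, int]] = []
        self.issued: List[Tuple[int, int]] = []

    def make_request(self, start_addr: int, length: int) -> bool:
        if self.busy or length <= 0:
            return False  # only one active request at a time
        self.pending = split_dma_request(start_addr, length, self.block_size)
        self.busy = True
        self.issue_next()
        return True

    def issue_next(self) -> None:
        self.issued.append(self.pending.pop(0))

    def callback(self) -> None:
        """Completion of the last chunk: issue the next one or go idle."""
        if self.pending:
            self.issue_next()
        else:
            self.busy = False


print(split_dma_request(0x30, 200))
```

Because each chunk is issued only from the callback of the previous one, no separate scheduler is needed in this toy model either.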

Memory Controller

This module simulates a basic DDR-style memory controller. It models a single channel, connected to any number of DIMMs with any number of ranks of DRAMs each. The following picture shows an overview of the memory organization and connections to the memory controller. General information about memory controllers can be found here.

Mc overview.jpg

Note:

  1. The product of the memory bus cycle multiplier, memory controller latency, and clock cycle time (= 1/processor frequency) gives a first-order approximation of the latency of memory requests in time. The Memory Controller module refines this further by considering bank and bus contention, queueing effects of finite queues, and refreshes.
  2. Data sheet values for some components of the memory latency are specified in time (nanoseconds), whereas the Memory Controller module expects all delay configuration parameters in cycles. The parameters should be set appropriately taking into account the processor and bus frequencies.
  3. The current implementation does not consider pin-bandwidth contention. Infinite bandwidth is assumed.
  4. Only closed bank policy is currently implemented; that is, each bank is automatically closed after a single read or write.
  5. The current implementation handles only a single channel. If you want multiple address/data channels, you need to instantiate multiple copies of this module.
  6. This is the only controller that is NOT specified in SLICC, but in C++.
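Notes 1 and 2 amount to two small unit conversions, sketched below in Python. The function names are hypothetical, the numeric values (a multiplier of 10, a controller latency of 12 memory cycles, a 2 GHz processor clock) are illustrative assumptions rather than recommended settings, and rounding a data-sheet delay up to whole memory cycles is a conservative choice made for this sketch.

```python
import math


def first_order_latency_ns(mem_bus_cycle_multiplier: int,
                           mem_ctl_latency: int,
                           cpu_freq_ghz: float) -> float:
    """Multiplier x controller latency (memory cycles) x processor cycle time."""
    cpu_cycle_ns = 1.0 / cpu_freq_ghz
    return mem_bus_cycle_multiplier * mem_ctl_latency * cpu_cycle_ns


def ns_to_memory_cycles(delay_ns: float,
                        mem_bus_cycle_multiplier: int,
                        cpu_freq_ghz: float) -> int:
    """Convert a data-sheet delay in ns to whole memory cycles, rounding up."""
    mem_cycle_ns = mem_bus_cycle_multiplier / cpu_freq_ghz
    return math.ceil(delay_ns / mem_cycle_ns)


# Illustrative values only: 2 GHz processor, multiplier 10 => 5 ns memory cycle.
print(first_order_latency_ns(10, 12, 2.0))  # approximate latency in ns
print(ns_to_memory_cycles(12.5, 10, 2.0))   # a 12.5 ns delay in memory cycles
```

The first number ignores contention, queueing and refresh, which is exactly what the Memory Controller module adds on top of this estimate.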

Documentation source: Most (but not all) of the writeup in this section is taken verbatim from documentation in the gem5 source files, rubyconfig.defaults file of GEMS, and a ppt created by Andy Phelps on Jan 18, 2008.

Related Files
  • src/mem/ruby/system
    • MemoryControl.hh: This file declares the Memory Controller class.
    • MemoryControl.cc: This file implements all the operations of the memory controller. This includes processing of input packets, address decoding and bank selection, request scheduling and contention handling, handling refresh, returning completed requests to the directory controller.
    • MemoryControl.py: Configuration parameters
Configuration Parameters
  • dimms_per_channel: Currently the only thing that matters is the number of ranks per channel, i.e. the product of this parameter and ranks_per_dimm. But if and when this is expanded to do FB-DIMMs, the distinction between the two will matter.
  • Address Mapping: This is controlled by configuration parameters banks_per_rank, bank_bit_0, ranks_per_dimm, rank_bit_0, dimms_per_channel, dimm_bit_0. You could choose to have the bank bits, rank bits, and DIMM bits in any order. For the default values, we assume this format for addresses:
    • Offset within line: [5:0]
    • Memory controller #: [7:6]
    • Bank: [10:8]
    • Rank: [11]
    • DIMM: [12]
    • Row addr / Col addr: [top:13]

If you get these bits wrong, then some banks won't see any requests; you need to check for this in the .stats output.
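
The default layout above can be checked offline before running a simulation. The following is a minimal sketch, assuming exactly the default bit positions listed above; the names FIELDS, decode_address, and banks_touched are illustrative and are not gem5 identifiers.

```python
# Default field layout from the list above: (name, high bit, low bit).
# FIELDS, decode_address and banks_touched are hypothetical helper names.
FIELDS = [
    ("offset", 5, 0),
    ("controller", 7, 6),
    ("bank", 10, 8),
    ("rank", 11, 11),
    ("dimm", 12, 12),
]
ROW_COL_LOW_BIT = 13  # bits [top:13] form the row/column address


def decode_address(addr):
    """Split a physical address into the default fields."""
    fields = {}
    for name, high, low in FIELDS:
        width = high - low + 1
        fields[name] = (addr >> low) & ((1 << width) - 1)
    fields["row_col"] = addr >> ROW_COL_LOW_BIT
    return fields


def banks_touched(addresses):
    """Return the set of (dimm, rank, bank) tuples hit by an address trace."""
    seen = set()
    for addr in addresses:
        f = decode_address(addr)
        seen.add((f["dimm"], f["rank"], f["bank"]))
    return seen
```

Running banks_touched over a representative address trace gives a quick indication of whether a chosen bit assignment leaves banks idle, which can then be confirmed in the .stats output.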

  • mem_bus_cycle_multiplier: Basic cycle time of the memory controller, expressed in processor clock cycles. This defines the period which is used as the memory channel clock period, the address bus bit time, and the memory controller cycle time. Assuming a 200 MHz memory channel (DDR-400, which transfers 400 Mbit/s per data pin), and a 2 GHz processor clock, mem_bus_cycle_multiplier=10.
  • mem_ctl_latency: Latency for returning a read request or writeback acknowledgement. Measured in memory address cycles. This equals tRCD + CL + AL + (four bit times) + (round trip on channel) + (memory control internal delays). It's going to be an approximation, so pick what you like. Note: The fact that latency is a constant, and does not depend on two low-order address bits, implies that our memory controller either: (a) tells the DRAM to read the critical word first, and sends the critical word first back to the CPU, or (b) waits until it has seen all four bit times on the data wires before sending anything back. Either is plausible. If (a), remove the "four bit times" term from the calculation above.
  • rank_rank_delay: This is how many memory address cycles to delay between reads to different ranks of DRAMs to allow for clock skew.
  • read_write_delay: This is how many memory address cycles to delay between a read and a write. This is based on two things: (1) the data bus is used one cycle earlier in the operation; (2) a round-trip wire delay from the controller to the DIMM that did the reading. Usually this is set to 2.
  • basic_bus_busy_time: Basic address and data bus occupancy. If you are assuming a 16-byte-wide data bus (pairs of DIMMs side-by-side), then the data bus occupancy matches the address bus occupancy at 2 cycles. But if the channel is only 8 bytes wide, you need to increase this bus occupancy time to 4 cycles.
  • mem_random_arbitrate: By default, the memory controller uses round-robin to arbitrate between ready bank queues for use of the address bus. If you wish to add randomness to the system, set this parameter to one instead, and it will restart the round-robin pointer at a random bank number each cycle. If you want additional nondeterminism, set the parameter to some integer n >= 2, and it will in addition add a n% chance each cycle that a ready bank will be delayed an additional cycle. Note that if you are in mem_fixed_delay mode (see below), mem_random_arbitrate=1 will have no effect, but mem_random_arbitrate=2 or more will.
  • mem_fixed_delay: If this is nonzero, it will disable the memory controller and instead give every request a fixed latency. The nonzero value specified here is measured in memory cycles and is just added to mem_ctl_latency. It will also show up in the stats file as a contributor to memory delays stalled at head of bank queue.
  • tFAW: This is an obscure DRAM parameter that says that no more than four activate requests can happen within a window of a certain size. For most configurations this does not come into play, or has very little effect, but it could be used to throttle the power consumption of the DRAM. In this implementation (unlike in a DRAM data sheet) tFAW is measured in memory bus cycles; i.e. if tFAW = 16 then no more than four activates may happen within any 16 cycle window. Refreshes are included in the activates.
  • refresh_period: This is the number of memory cycles between refresh of row x in bank n and refresh of row x+1 in bank n. For DDR-400, this is typically 7.8 usec for commercial systems; after 8192 such refreshes, this will have refreshed the whole chip in 64 msec. If we have a 5 nsec memory clock, 7800 / 5 = 1560 cycles. The memory controller will divide this by the total number of banks, and kick off a refresh to somebody every time that amount is counted down to zero. (There will be some rounding error there, but it should have minimal effect.)
  • Typical Settings for configuration parameters: The default values are for DDR-400 assuming a 2GHz processor clock. If instead of DDR-400, you wanted DDR-800, the channel gets faster but the basic operation of the DRAM core is unchanged. Busy times appear to double just because they are measured in smaller clock cycles. The performance advantage comes because the bus busy times don't actually quite double. You would use something like these values:
mem_bus_cycle_multiplier: 5
bank_busy_time: 22
rank_rank_delay: 2
read_write_delay: 3
basic_bus_busy_time: 3
mem_ctl_latency: 20
refresh_period: 3120
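
The two values above that follow directly from clock frequencies can be derived mechanically. The sketch below is a hedged illustration: the helper names cycles and derive are hypothetical, and rounding up to a whole cycle is an assumption (a conservative choice), not something the Memory Controller module prescribes.

```python
def cycles(time_ns, clock_period_ns):
    """Convert a data-sheet time to whole clock cycles, rounding up.

    Rounding up is an assumption made for this sketch.
    """
    # Work in integer picoseconds to avoid floating-point rounding surprises.
    time_ps = round(time_ns * 1000)
    period_ps = round(clock_period_ns * 1000)
    return -(-time_ps // period_ps)  # ceiling division


def derive(cpu_mhz, mem_channel_mhz, refresh_interval_ns=7800):
    """Derive the frequency-dependent parameters for one configuration."""
    period_ns = 1000.0 / mem_channel_mhz
    return {
        "mem_bus_cycle_multiplier": cpu_mhz // mem_channel_mhz,
        "refresh_period": cycles(refresh_interval_ns, period_ns),
    }
```

With a 2 GHz processor, derive(2000, 200) reproduces the DDR-400 figures quoted earlier (multiplier 10, refresh period 1560), and derive(2000, 400) reproduces the DDR-800 values listed above (multiplier 5, refresh period 3120). The remaining parameters depend on DRAM data-sheet timings and are not derived here.
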
Basic Operation
Mc data struct.jpg
  • Data Structures

Requests are enqueued into a single input queue. Responses are dequeued from a single response queue. There is a single bank queue for each DRAM bank (the total number of banks is the number of DIMMs per channel x number of ranks per DIMM x number of banks per rank). Each bank also has a busy counter. tFAW shift registers are maintained per rank.
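
To make the per-rank tFAW shift register concrete, here is a minimal sketch of one way such a window could be kept. The class name TfawWindow and its methods are hypothetical; the actual data layout in MemoryControl.cc may differ.

```python
from collections import deque


class TfawWindow:
    """One-rank sliding window: at most max_activates in any tfaw_cycles cycles."""

    def __init__(self, tfaw_cycles, max_activates=4):
        # One slot per cycle; the rightmost slot is the current cycle.
        self.window = deque([0] * tfaw_cycles, maxlen=tfaw_cycles)
        self.max_activates = max_activates

    def tick(self):
        """Advance one memory cycle: the oldest slot falls out of the window."""
        self.window.append(0)

    def can_activate(self):
        """True if another activate (or refresh) fits in the current window."""
        return sum(self.window) < self.max_activates

    def record_activate(self):
        """Mark an activate in the current cycle's slot."""
        self.window[-1] = 1
```

With tFAW = 16, four activates issued in cycles 0 to 3 block further activates on that rank until cycle 16, when the activate from cycle 0 leaves the window.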

  • Timing
Mc addr command timing.jpg
Mc addr command timing back to back.jpg

The “Act” (Activate) and “Rd” (Read) commands (or activate and write) always come as a pair, because we are modeling posted-CAS mode. (In non-posted-CAS, the read or write command would be scheduled separately later.) We do not explicitly model the separate commands; we simply say that the address bus occupancy is 2 cycles.

Since the data bus is also occupied for 2 cycles at a fixed offset in time as shown above, we do not need to explicitly model it; memory channel occupancy is still 2 cycles.

For back-to-back requests the data for the 2nd request could be delayed due to the following reasons:

    • Read happens from a different rank
    • Read is followed by a write
    • 2nd request has a busy bank
    • Basic request time > 2 (e.g. needs 8 data phits)
  • Scheduling and Bank Contention

The wakeup function, and in turn the executeCycle function, is triggered once every memory clock cycle.

Each memory request is placed in a queue associated with a specific memory bank. This queue is of finite size; if the queue is full the request will back up in an (infinite) common queue and will effectively throttle the whole system. This sort of behavior is intended to be closer to real system behavior than if we had an infinite queue on each bank. If you want the latter, just make the bank queues unreasonably large.
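
The throttling effect of finite bank queues can be illustrated with a small model. This is a sketch under an assumption: it interprets "back up in a common queue" as in-order, head-of-line blocking, so that one full bank queue stalls every request behind it. BankQueues and its methods are hypothetical names.

```python
from collections import deque


class BankQueues:
    """Unbounded common input queue feeding finite per-bank queues."""

    def __init__(self, num_banks, bank_queue_size):
        self.input_queue = deque()  # (bank, request) pairs, unbounded
        self.bank_queues = [deque() for _ in range(num_banks)]
        self.bank_queue_size = bank_queue_size

    def enqueue(self, bank, request):
        """Accept a new request into the common queue."""
        self.input_queue.append((bank, request))

    def drain_input(self):
        """Move requests into bank queues in arrival order.

        Stops at the first request whose bank queue is full, so requests
        for other banks queued behind it wait as well (assumed behavior).
        Returns the number of requests moved.
        """
        moved = 0
        while self.input_queue:
            bank, request = self.input_queue[0]
            if len(self.bank_queues[bank]) >= self.bank_queue_size:
                break
            self.bank_queues[bank].append(request)
            self.input_queue.popleft()
            moved += 1
        return moved
```

Setting bank_queue_size to a very large number in this model removes the blocking, which mirrors the advice above for approximating infinite per-bank queues.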

The head item on a bank queue is issued when all of the following are true:

  1. The bank is available
  2. The address path to the DIMM is available
  3. The data path to or from the DIMM is available

Note that we are not concerned about fixed offsets in time. The bank will not be used at the same moment as the address path, but since there is no queue in the DIMM or the DRAM it will be used at a constant number of cycles later, so it is treated as if it is used at the same time.

We are assuming "posted CAS"; that is, we send the READ or WRITE immediately after the ACTIVATE. This makes scheduling the address bus trivial; we always schedule a fixed set of cycles. For DDR-400, this is a set of two cycles; for some configurations such as DDR-800 the parameter tRRD forces this to be set to three cycles.

We assume a four-bit-time transfer on the data wires. This is the minimum burst length for DDR-2. This would correspond to (for example) a memory where each DIMM is 72 bits wide and DIMMs are ganged in pairs to deliver 64 bytes at a shot. This gives us the same occupancy on the data wires as on the address wires (for the two-address-cycle case).

The only non-trivial scheduling problem is the data wires. A write will use the wires earlier in the operation than a read will; typically one cycle earlier as seen at the DRAM, but earlier by a worst-case round-trip wire delay when seen at the memory controller. So, while reads from one rank can be scheduled back-to-back every two cycles, and writes (to any rank) scheduled every two cycles, when a read is followed by a write we need to insert a bubble. Furthermore, consecutive reads from two different ranks may need to insert a bubble due to skew between when one DRAM stops driving the wires and when the other one starts. (These bubbles are parameters.)
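
The bubble rules in the paragraph above reduce to a small decision function. The sketch below is illustrative only: data_bus_bubble is a hypothetical name, requests are represented as (is_read, rank) tuples, and cases the text does not mention (such as a write followed by a read) are assumed to need no bubble.

```python
def data_bus_bubble(prev, nxt, rank_rank_delay, read_write_delay):
    """Extra cycles to insert between two back-to-back requests.

    prev and nxt are (is_read, rank) tuples for the earlier and later request.
    """
    prev_is_read, prev_rank = prev
    next_is_read, next_rank = nxt
    if prev_is_read and not next_is_read:
        # A write uses the data wires earlier than a read does.
        return read_write_delay
    if prev_is_read and next_is_read and prev_rank != next_rank:
        # Allow for skew between one DRAM releasing the wires and the next driving them.
        return rank_rank_delay
    # Same-rank reads and write-after-write schedule back to back.
    return 0
```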

This means that when some number of reads and writes are at the heads of their queues, reads could starve writes, and/or reads to the same rank could starve out other requests, since the others would never see the data bus ready. For this reason, we have implemented an anti-starvation feature. A group of requests is marked "old", and a counter is incremented each cycle as long as any request from that batch has not issued. If the counter reaches twice the bank busy time, we hold off any newer requests until all of the "old" requests have issued.
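
The anti-starvation rule can be sketched as a small state machine. This is a simplified illustration, not the code in MemoryControl.cc: the class AntiStarvation, its method names, and the use of request ids are all assumptions made for the example.

```python
class AntiStarvation:
    """Hold back newer requests once an "old" batch has waited too long."""

    def __init__(self, bank_busy_time):
        self.limit = 2 * bank_busy_time  # threshold stated in the text
        self.old = set()                 # ids of requests in the current old batch
        self.counter = 0

    def mark_old(self, request_ids):
        """Start a new batch, but only when no batch is outstanding."""
        if not self.old:
            self.old = set(request_ids)
            self.counter = 0

    def tick(self):
        """Called once per cycle; counts while any old request is unissued."""
        if self.old:
            self.counter += 1

    def issued(self, request_id):
        """Record that a request has issued."""
        self.old.discard(request_id)
        if not self.old:
            self.counter = 0

    def may_issue(self, request_id):
        """Once the counter reaches the limit, only old requests may issue."""
        if self.old and self.counter >= self.limit:
            return request_id in self.old
        return True
```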

Interconnection Network

The various controllers of the memory hierarchy (L1/L2 caches, Directory, DMA etc), specified by the coherence protocol, are connected together via an interconnection network. The primary components of this network are

  1. Topology
  2. Routing
  3. Flow-Control
  4. Router Microarchitecture

Topology

The connections between the various controllers are specified via Python files.

  • Related Files:
    • src/mem/ruby/network/topologies/Pt2Pt.py
    • src/mem/ruby/network/topologies/Crossbar.py
    • src/mem/ruby/network/topologies/Mesh.py
    • src/mem/ruby/network/topologies/MeshDirCorners.py
    • src/mem/ruby/network/Network.py
  • Topology Descriptions:
    • Pt2Pt: Each controller (L1/L2/Directory) is connected to every other controller via a direct link. This can be invoked from command line by --topology=Pt2Pt.
    • Crossbar: Each controller (L1/L2/Directory) is connected to every other controller via one switch (modeling the crossbar). This can be invoked from command line by --topology=Crossbar.
    • Mesh: This topology requires the number of directories to be equal to the number of cpus. The number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present), and one Directory. It can be invoked from command line by --topology=Mesh. The number of rows in the mesh has to be specified by --mesh-rows. This parameter enables the creation of non-symmetrical meshes too.
    • MeshDirCorners: This topology requires the number of directories to be equal to 4. The number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1 and one L2 (if present). Each corner router/switch is connected to one Directory. It can be invoked from command line by --topology=MeshDirCorners. The number of rows in the mesh has to be specified by --mesh-rows.
Topology overview.jpg
  • Optional parameters specified by the topology files (defaults in Network.py):
    • latency: latency of traversal within the link.
    • weight: weight associated with this link. This parameter is used by the routing table while deciding routes, as explained next in Routing.
    • bw_multiplier: used by simple network to model different link bandwidths. This parameter is specified in 1000th of a byte, and the individual link bandwidth = bw_multiplier x endpoint_bandwidth (specified in Network.py).


Routing

Based on the topology, shortest path graph traversals are used to populate routing tables at each router/switch. The default routing algorithm tries to choose the route with minimum number of link traversals. Links can be given weights in the topology files to model different routing algorithms. For example, in Mesh.py and MeshDirCorners.py Y-direction links are given weights of 2, while X-direction links are given weights of 1, resulting in XY traversals in a mesh. adaptive_routing (in src/mem/ruby/network/SimpleNetwork.py) can be enabled to make the simple network choose routes based on occupancy of queues at each output port.
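
The effect of link weights can be illustrated on a small mesh. Note that with X weight 1 and Y weight 2, every minimal-hop path between two routers has the same total weight, so the sketch below makes an explicit assumption: at each router, among the outgoing links that lie on a shortest path, the lowest-weight link is preferred. That tie-break is an assumption of this example, and all function names are hypothetical rather than taken from the Ruby sources.

```python
import itertools


def build_mesh(cols, rows, x_weight=1, y_weight=2):
    """Return {(src, dst): weight} for a cols x rows mesh; routers are (x, y)."""
    links = {}
    for x, y in itertools.product(range(cols), range(rows)):
        if x + 1 < cols:
            links[((x, y), (x + 1, y))] = x_weight
            links[((x + 1, y), (x, y))] = x_weight
        if y + 1 < rows:
            links[((x, y), (x, y + 1))] = y_weight
            links[((x, y + 1), (x, y))] = y_weight
    return links


def shortest_distances(nodes, links):
    """All-pairs weighted shortest distances (Floyd-Warshall)."""
    inf = float("inf")
    dist = {(a, b): (0 if a == b else links.get((a, b), inf))
            for a in nodes for b in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return dist


def route(src, dst, links, dist):
    """Walk hop by hop, taking the lowest-weight link on a shortest path."""
    path = [src]
    here = src
    while here != dst:
        candidates = [(w, nxt) for (cur, nxt), w in links.items()
                      if cur == here and w + dist[nxt, dst] == dist[here, dst]]
        _, here = min(candidates)  # assumed tie-break: lowest weight first
        path.append(here)
    return path
```

Under this assumption, a packet exhausts its X-direction hops before taking any Y-direction hop, which is the XY traversal described above.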


Flow-Control and Router Microarchitecture

Ruby supports two network models, Simple and Garnet, which favor simulation speed and detailed modeling, respectively.

  • Related Files:
    • src/mem/ruby/network/Network.py
    • src/mem/ruby/network/simple
    • src/mem/ruby/network/garnet/BaseGarnetNetwork.py
    • src/mem/ruby/network/garnet/fixed-pipeline
    • src/mem/ruby/network/garnet/flexible-pipeline
Configuration and Setup

The default network model in Ruby is the simple network. Garnet fixed-pipeline or flexible-pipeline networks can be enabled by adding --garnet-network=fixed, or --garnet-network=flexible on the command line, respectively.

  • Configuration:

Some of the network parameters specified in Network.py are:

    • number_of_virtual_networks: This is the maximum number of virtual networks. The actual number of active virtual networks is determined by the protocol.
    • control_msg_size: The size of control messages in bytes. Default is 8. m_data_msg_size in Network.cc is set to the block size in bytes + control_msg_size.
    • link_latency: Latency of each link in cycles. This can be specified for each link from the topology files. Default is 1.


Simple Network

The simple network models hop-by-hop network traversal, but abstracts out detailed modeling within the switches. The switches are modeled in simple/PerfectSwitch.cc while the links are modeled in simple/Throttle.cc. The flow-control is implemented by monitoring the available buffers and available bandwidth in output links before sending.

Simple network.jpg
  • Configuration:

Simple network uses the generic network parameters in Network.py. Additional parameters are specified in SimpleNetwork.py:

    • buffer_size: Size of buffers at each switch input and output ports. A value of 0 implies infinite buffering.
    • endpoint_bandwidth: Bandwidth at the end points of the network, in 1000ths of a byte.
    • bw_multiplier: Bandwidth specified in 1000ths of a byte. The individual link bandwidth becomes bw_multiplier x endpoint_bandwidth.
    • adaptive_routing: This enables adaptive routing based on occupancy of output buffers.


Garnet Networks

Garnet is a detailed interconnection network model inside gem5. It consists of a detailed fixed-pipeline model, and an approximate flexible-pipeline model.

The fixed-pipeline model is intended for low-level interconnection network evaluations and models the detailed micro-architectural features of a 5-stage Virtual Channel router with credit-based flow-control. Researchers interested in investigating different network microarchitectures can readily modify the modeled microarchitecture and pipeline. Also, for system level evaluations that are not concerned with the detailed network characteristics, this model provides an accurate network model and should be used as the default model.

The flexible-pipeline model is intended to provide a reasonable abstraction of all interconnection network models, while allowing the router pipeline depth to be flexibly adjusted. A router pipeline might range from a single cycle to several cycles. For evaluations that wish to easily change the router pipeline depth, the flexible-pipeline model provides a neat abstraction that can be used.

If your use of Garnet contributes to a published paper, please cite the research paper which can be found here.

  • Configuration

Garnet uses the generic network parameters in Network.py. Additional parameters are specified in BaseGarnetNetwork.py:

    • flit_size: flit size in bytes. Flits are the granularity at which information is sent from one router to the other. Default is 16 (=> 128 bits). [This default value of 16 results in control messages fitting within 1 flit, and data messages fitting within 5 flits].
    • vcs_per_class: number of virtual channels (VC) per message class (Note: message class = virtual network). Default is 4.

The following are only valid for fixed-pipeline:

    • buffers_per_data_vc: number of flit-buffers per VC in the data message class. Since data messages occupy 5 flits, this value can lie between 1-5. Default is 4.
    • buffers_per_ctrl_vc: number of flit-buffers per VC in the control message class. Since control messages occupy 1 flit, and a VC can only hold one message at a time, this value has to be 1. Default is 1.

Note: Garnet assumes that ctrl messages are 1 flit wide. If ctrl messages occupy more than one flit (due to a smaller flit_size or a larger control_msg_size), all VCs are given buffers_per_data_vc flit-buffers.
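
The flit counts and the buffer rule in the note above can be worked through numerically. The sketch below assumes a 64-byte cache block (consistent with the 5-flit data messages quoted above, but not stated in this section); flits_per_message and buffers_per_vc are hypothetical helper names.

```python
import math


def flits_per_message(msg_bytes, flit_size=16):
    """Number of flits needed to carry a message of msg_bytes bytes."""
    return math.ceil(msg_bytes / flit_size)


def buffers_per_vc(control_msg_size=8, block_size=64, flit_size=16,
                   buffers_per_data_vc=4, buffers_per_ctrl_vc=1):
    """Apply the rule from the note: multi-flit control messages make
    control VCs use buffers_per_data_vc flit-buffers as well.

    block_size=64 is an assumption made for this example.
    """
    data_msg_size = block_size + control_msg_size
    ctrl_flits = flits_per_message(control_msg_size, flit_size)
    data_flits = flits_per_message(data_msg_size, flit_size)
    ctrl_buffers = buffers_per_data_vc if ctrl_flits > 1 else buffers_per_ctrl_vc
    return {
        "ctrl_flits": ctrl_flits,
        "data_flits": data_flits,
        "ctrl_vc_buffers": ctrl_buffers,
        "data_vc_buffers": buffers_per_data_vc,
    }
```

With the defaults, an 8-byte control message fits in one 16-byte flit and a 72-byte data message needs five. Shrinking flit_size to 4 makes control messages span two flits, so control VCs receive buffers_per_data_vc buffers in this model.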

The following are only valid for flexible-pipeline:

    • number_of_pipe_stages: number of pipeline stages in each router in the flexible-pipeline model. Default is 4.


  • Additional features
    • Routing: Currently, garnet only models deterministic routing using the routing tables described earlier.
    • Modeling variable link bandwidth: The flit size specifies the link bandwidth as the number of bytes per cycle per network link. Links which have lower bandwidth than this (for instance some off-chip links) can be modeled by specifying a longer latency across them in the topology file (as explained earlier).
    • Multicast messages: The network modeled does not have hardware multi-cast support within the network. A multi-cast message gets broken into multiple uni-cast messages at the interface to the network.


  • Garnet fixed-pipeline network

The garnet fixed-pipeline models a classic 5-stage Virtual Channel router. The 5-stages are:

  1. Buffer Write (BW) + Route Compute (RC): The incoming flit gets buffered and computes its output port.
  2. VC Allocation (VA): All buffered flits allocate for VCs at the next routers. [The allocation occurs in a separable manner: First, each input VC chooses one output VC, using input arbiters, and places a request for it. Then, each output VC breaks conflicts via output arbiters]. All arbiters in ordered virtual networks are queueing to maintain point-to-point ordering. All other arbiters are round-robin.
  3. Switch Allocation (SA): All buffered flits try to reserve the switch ports for the next cycle. [The allocation occurs in a separable manner: First, each input chooses one input VC, using input arbiters, which places a switch request. Then, each output port breaks conflicts via output arbiters]. All arbiters in ordered virtual networks are queueing to maintain point-to-point ordering. All other arbiters are round-robin.
  4. Switch Traversal (ST): Flits that won SA traverse the crossbar switch.
  5. Link Traversal (LT): Flits from the crossbar traverse links to reach the next routers.

The flow-control implemented is credit-based.
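
The separable allocation used in the SA stage can be sketched as two rounds of round-robin arbitration. This is a simplified illustration, not Garnet's implementation: RoundRobinArbiter and switch_allocate are hypothetical names, queueing arbiters for ordered virtual networks are not modeled, and an input arbiter's pointer advances even when its chosen VC loses at the output.

```python
class RoundRobinArbiter:
    """Grants the first requester at or after a rotating priority pointer."""

    def __init__(self, size):
        self.size = size
        self.pointer = 0  # index with highest priority in the next round

    def pick(self, requesters):
        """Return the winning index from the set requesters, or None."""
        for offset in range(self.size):
            candidate = (self.pointer + offset) % self.size
            if candidate in requesters:
                self.pointer = (candidate + 1) % self.size
                return candidate
        return None


def switch_allocate(requests, input_arbiters, output_arbiters):
    """One cycle of separable switch allocation.

    requests[in_port][vc] is the output port wanted by the flit in that VC.
    Returns {out_port: (in_port, vc)} for the winners.
    """
    # Stage 1: each input port selects one of its VCs.
    stage1 = {}  # in_port -> (vc, out_port)
    for in_port, vcs in requests.items():
        vc = input_arbiters[in_port].pick(set(vcs))
        if vc is not None:
            stage1[in_port] = (vc, vcs[vc])
    # Stage 2: each output port selects one of the inputs requesting it.
    grants = {}
    for out_port, arbiter in output_arbiters.items():
        contenders = {i for i, (_, o) in stage1.items() if o == out_port}
        winner = arbiter.pick(contenders)
        if winner is not None:
            grants[out_port] = (winner, stage1[winner][0])
    return grants
```

Here input_arbiters maps each input port to an arbiter sized by the number of VCs, and output_arbiters maps each output port to an arbiter sized by the number of input ports.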

Garnet router.jpg


  • Garnet flexible-pipeline network

The garnet flexible-pipeline model should be used when one desires a router pipeline different than 5 stages (the 5 stages include the link traversal stage). All the components of a router (buffers, VC and switch allocators, switch etc) are modeled similar to the fixed-pipeline design, but the pipeline depth is not modeled, and comes as an input parameter number_of_pipe_stages. The flow-control is implemented by monitoring the availability of buffers at each output port before sending.


Life of a memory request in Ruby

In this section we provide a high-level overview of how a memory request is serviced by Ruby as a whole and which components of Ruby it passes through. For the detailed operation of each component, refer to the previous sections describing each component in isolation.

  1. A memory request from a core or hardware context of M5 enters the jurisdiction of Ruby through the RubyPort::recvTiming interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of RubyPort instances in the simulated system is equal to the number of hardware thread contexts or cores (in the case of non-multithreaded cores). A port on the side of each core is tied to a corresponding RubyPort.
  2. The memory request arrives as an M5 packet, and RubyPort is responsible for converting it to a RubyRequest object that is understood by the various components of Ruby. It also determines whether the request is a PIO access and, if so, steers the packet to the correct PIO. Once it has generated the corresponding RubyRequest object and ascertained that the request is a normal memory request (not a PIO access), it passes the request to the Sequencer::makeRequest interface of the Sequencer object attached to the port (the variable ruby_port holds the pointer to it). Observe that the Sequencer class is itself derived from the RubyPort class.
  3. As mentioned in the section describing the Sequencer class of Ruby, there are as many Sequencer objects in a simulated system as there are hardware thread contexts (which also equals the number of RubyPort objects in the system), and there is a one-to-one mapping between Sequencer objects and hardware thread contexts. Once a memory request arrives at Sequencer::makeRequest, the Sequencer performs accounting and resource allocation for the request and then pushes it into Ruby's coherent cache hierarchy, while accounting for the delay in servicing it. The request is pushed into the cache hierarchy by enqueueing it on the mandatory queue after accounting for the L1 cache access latency. The mandatory queue (variable name m_mandatory_q_ptr) effectively acts as the interface between the Sequencer and the SLICC-generated cache coherence files.
  4. The L1 cache controller (generated by SLICC according to the coherence protocol specification) dequeues the request from the mandatory queue, looks up the cache, makes the necessary coherence state transitions, and/or pushes the request to the next level of the cache hierarchy as required. The different controllers and components of the SLICC-generated Ruby code communicate with each other through instances of Ruby's MessageBuffer class (src/mem/ruby/buffers/MessageBuffer.cc/hh), which can act as ordered or unordered buffers or queues. The delays incurred at the different steps of satisfying a memory request are accounted for by scheduling the enqueue and dequeue operations accordingly. If the requested cache block is found in the L1 cache with the required coherence permissions, the request is satisfied and returned immediately. Otherwise the request is pushed to the next level of the cache hierarchy through a MessageBuffer. A request can go all the way up to Ruby's Memory Controller (also called the Directory in many protocols). Once the request is satisfied, the response is pushed back through the hierarchy via MessageBuffers.
  5. Once the requested cache block is available at the L1 cache with the desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its readCallback or writeCallback method, depending on the type of the request. Note that by the time these methods are called on the Sequencer, the latency of servicing the request has been implicitly accounted for.
  6. The Sequencer then clears the accounting information for the corresponding request and calls the RubyPort::ruby_hit_callback method. This ultimately returns the result of the request to the corresponding port of the core/hardware context of the frontend (M5).
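
The six steps above can be condensed into a toy model that shows only the order of hand-offs. It is a sketch under heavy assumptions: the Toy* classes are hypothetical stand-ins, no timing or coherence protocol is modeled, a miss is simply treated as filled on the spot, and composition is used where the real Sequencer derives from RubyPort.

```python
from collections import deque


class ToyL1Controller:
    """Stand-in for a SLICC-generated L1 controller."""

    def __init__(self, sequencer):
        self.mandatory_q = deque()  # interface between Sequencer and controller
        self.sequencer = sequencer
        self.cache = set()          # block addresses held with permission

    def wakeup(self):
        """Dequeue requests, look up the cache, and call back the Sequencer."""
        while self.mandatory_q:
            req = self.mandatory_q.popleft()
            req["hit"] = req["addr"] in self.cache
            # A miss would travel down the hierarchy; here it is filled at once.
            self.cache.add(req["addr"])
            if req["is_write"]:
                self.sequencer.write_callback(req)
            else:
                self.sequencer.read_callback(req)


class ToySequencer:
    """Tracks outstanding requests and feeds the mandatory queue."""

    def __init__(self, hit_callback):
        self.outstanding = {}
        self.hit_callback = hit_callback
        self.controller = ToyL1Controller(self)

    def make_request(self, req):
        self.outstanding[req["addr"]] = req          # accounting
        self.controller.mandatory_q.append(req)      # hand-off to the controller

    def read_callback(self, req):
        self._finish(req)

    def write_callback(self, req):
        self._finish(req)

    def _finish(self, req):
        del self.outstanding[req["addr"]]            # clear accounting
        self.hit_callback(req)


class ToyRubyPort:
    """Entry and exit point between the core and the toy hierarchy."""

    def __init__(self):
        self.completed = []
        self.sequencer = ToySequencer(self.ruby_hit_callback)

    def recv_timing(self, packet):
        if packet.get("pio"):
            return "pio"                             # steered away from Ruby
        req = {"addr": packet["addr"], "is_write": packet["is_write"]}
        self.sequencer.make_request(req)
        return "ruby"

    def ruby_hit_callback(self, req):
        self.completed.append(req)                   # result returns to the core
```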