Ruby

From gem5
Revision as of 00:29, 31 March 2011 by ArkapravaBasu (talk | contribs) (MESI_CMP_directory)
Jump to: navigation, search

Contents

Ruby

High level components of Ruby

Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:

  1. Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
  2. Rich configurability -- almost any aspect affecting the memory hierarchy functionality and timing can be controlled.
  3. Rapid prototyping -- a high-level specification language, SLICC, is used to specify functionality of various controllers.

The following picture, taken from the GEMS tutorial in ISCA 2005, shows a high-level view of the main components in Ruby.

Ruby overview.jpg

SLICC + Coherence protocols:

   Need to say what is SLICC and whats its purpose. 
   Talk about high level strcture of a typical coherence protocol file, that SLICC uses to generate code. 
   A simple example structure from protocol like MI_example can help here.

SLICC stands for Specification Language for Implementing Cache Coherence. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.

Slicc overview.jpg

The language it self is syntactically similar to C/C++. The SLICC compiler generates C++ code for the state machine. Additionally it can also generate HTML documentation for the protocol.

Nilay will do it

Protocol independent memory components

  1. Cache Memory
  2. Replacement Policies
  3. Memory Controller

Arka will do it

Interconnection Network

The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together. There are 3 key parts to this:

  1. Topology specification: These are specified with python files that describe the topology (mesh/crossbar/ etc.), link latencies and link bandwidth. The following picture shows a high level view of several well-known network topologies, all of which can be easily specified.
Topology overview.jpg
  1. Cycle-accurate network model: Garnet is a cycle accurate, pipelined, network model that builds and simulates the specified topology. It simulates the router pipeline and movement of flits across the network subject to the routing algorithm, latency and bandwidth constraints.
  2. Network Power model: The Orion power model is used to keep track of router and link activity in the network. It calculates both router static power and link and router dynamic power as flits move through the network.

More details about the network model implementation are described here

Implementation of Ruby

Directory Structure

  • src/mem/
    • protocols: SLICC specification for coherence protocols
    • slicc: implementation for SLICC parser and code generator
    • ruby
      • buffers: implementation for message buffers that are used for exchanging information between the cache, directory, memory controllers and the interconnect
      • common: frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block, basic types (int32, uint64, etc.)
      • eventqueue: Ruby’s event queue API for scheduling events on the gem5 event queue
      • filters: various Bloom filters (stale code from GEMS)
      • network: Interconnect implementation, sample topology specification, network power calculations
      • profiler: Profiling for cache events, memory controller events
      • recorder: Cache warmup and access trace recording
      • slicc_interface: Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
      • system: Protocol independent memory components – CacheMemory, DirectoryMemory, Sequencer, RubyPort

SLICC

   Explain functionality/ capability of SLICC
   Talk about
   AST, Symbols, Parser and code generation in some details but NO need to cover every file and/or functions. 
   Few examples should suffice.

Nilay will do it

Protocols

   Need to talk about each protocol being shipped. Need to talk about protocol specific configuration parameters.
   NO need to explain every action or every state/events, but need to give overall idea and how it works
   and assumptions (if any).
Common Notations and Data Structures
Coherence Messages

These are described in the <protocol-name>-msg.sm file for each protocol.

Message Description
ACK/NACK positive/negative acknowledgement for requests that wait for the direction of resolution before deciding on the next action. Examples are writeback requests, exclusive requests.
GETS request for shared permissions to satisfy a CPU's load or IFetch.
GETX request for exclusive access.
INV invalidation request. This can be triggered by the coherence protocol itself, or by the next cache level/directory to enforce inclusion or to trigger a writeback for a DMA access so that the latest copy of data is obtained.
PUTX request for writeback of cache block. Some protocols (e.g. MOESI_CMP_directory) may use this only for writeback requests of exclusive data.
PUTS request for writeback of cache block in shared state.
PUTO request for writeback of cache block in owned state.
PUTO_Sharers request for writeback of cache block in owned state but other sharers of the block exist.
UNBLOCK message to unblock next cache level/directory for blocking protocols.
AccessPermissions

These are associated with each cache block and determine what operations are permitted on that block. It is closely correlated with coherence protocol states.

Permissions Description
Invalid The cache block is invalid. The block must first be obtained (from elsewhere in the memory hierarchy) before loads/stores can be performed. No action on invalidates (except maybe sending an ACK). No action on replacements. The associated coherence protocol states are I or NP and are stable states in every protocol.
Busy TODO
Read_Only Only operations permitted are loads, writebacks, invalidates. Stores cannot be performed before transitioning to some other state.
Read_Write Loads, stores, writebacks, invalidations are allowed. Usually indicates that the block is dirty.
Data Structures
  • Message Buffers:TODO
  • TBE Table: TODO
  • Timer Table: This maintains a map of address-based timers. For each target address, a timeout value can be associated and added to the Timer table. This data structure is used, for example, by the L1 cache controller implementation of the MOESI_CMP_directory protocol to trigger separate timeouts for cache blocks. Internally, the Timer Table uses the event queue to schedule the timeouts. The TimerTable supports a polling-based interface, isReady() to check if a timeout has occurred. Timeouts on addresses can be set using the set() method and removed using the unset() method.
Related Files:
src/mem/ruby/system/TimerTable.hh: Declares the TimerTable class
src/mem/ruby/system/TimerTable.cc: Implementation of the methods of the TimerTable class, that deals with setting addresses & timeouts, scheduling events using the event queue.
Coherence controller FSM Diagrams
  • The Finite State Machines show only the stable states
  • Transitions are annotated using the notation "Event list" or "Event list : Action list" or "Event list : Action list : Event list". For example, Store : GETX indicates that on a Store event, a GETX message was sent whereas GETX : Mem Read indicates that on receiving a GETX message, a memory read request was sent. Only the main triggers and actions are listed.
  • In the diagrams, the transition labels are associated with the arc that cuts across the transition label or the closest arc.
MI example

This is a simple cache coherence protocol that is used to illustrate protocol specification using SLICC.

Related Files
  • src/mem/protocols
    • MI_example-cache.sm: cache controller specification
    • MI_example-dir.sm: directory controller specification
    • MI_example-dma.sm: dma controller specification
    • MI_example-msg.sm: message type specification
    • MI_example.slicc: container file
Cache Hierarchy

This protocol assumes a 1-level cache hierarchy. The cache is private to each node. The caches are kept coherent by a directory controller. Since the hierarchy is only 1-level, there is no inclusion/exclusion requirement. This protocol does not differentiate between loads and stores.

Stable States and Invariants
States Invariants
M The cache block has been accessed (read/written) by this node. No other node holds a copy of the cache block
I The cache block at this node is invalid

The notation used in the controller FSM diagrams is described here.

Cache controller
  • Requests, Responses, Triggers:
    • Load, Instruction fetch, Store from the core
    • Replacement from self
    • Data from the directory controller
    • Forwarded request (intervention) from the directory controller
    • Writeback acknowledgement from the directory controller
    • Invalidations from directory controller (on dma activity)
MI example cache FSM.jpg
  • Main Operation:
    • On a load/Instruction fetch/Store request from the core:
      • it checks whether the corresponding block is present in the M state. If so, it returns a hit
      • otherwise, if in I state, it initiates a GETX request from the directory controller
    • On a replacement trigger from self:
      • it evicts the block, issues a writeback request to the directory controller
      • it waits for acknowledgement from the directory controller (to prevent races)
    • On a forwarded request from the directory controller:
      • This means that the block was in M state at this node when the request was generated by some other node
      • It sends the block directly to the requesting node (cache-to-cache transfer)
      • It evicts the block from this node
    • Invalidations are similar to replacements
Directory controller
  • Requests, Responses, Triggers:
    • GETX from the cores, Forwarded GETX to the cores
    • Data from memory, Data to the cores
    • Writeback requests from the cores, Writeback acknowledgements to the cores
    • DMA read, write requests from the DMA controllers
MI example dir FSM.jpg
  • Main Operation:
    • The directory maintains track of which core has a block in the M state. It designates this core as owner of the block.
    • On a GETX request from a core:
      • If the block is not present, a memory fetch request is initiated
      • If the block is already present, then it means the request is generated from some other core
        • In this case, a forwarded request is sent to the original owner
        • Ownership of the block is transferred to the requestor
    • On a writeback request from a core:
      • If the core is owner, the data is written to memory and acknowledgement is sent back to the core
      • If the core is not owner, a NACK is sent back
        • This can happen in a race condition
        • The core evicted the block while a forwarded request some other core was on the way and the directory has already changed ownership for the core
        • The evicting core holds the data till the forwarded request arrives
    • On DMA accesses (read/write)
      • Invalidation is sent to the owner node (if any). Otherwise data is fetched from memory.
      • This ensures that the most recent data is available.
Other features
    • MI protocols don't support LL/SC semantics. A load from a remote core will invalidate the cache block.
    • This protocol has no timeout mechanisms.
MOESI_hammer

This is an implementation of AMD's Hammer protocol, which is used in AMD's Hammer chip (also know as the Opteron or Athlon 64).

Related Files
  • src/mem/protocols
    • MOESI_hammer-cache.sm: cache controller specification
    • MOESI_hammer-dir.sm: directory controller specification
    • MOESI_hammer-dma.sm: dma controller specification
    • MOESI_hammer-msg.sm: message type specification
    • MOESI_hammer.slicc: container file
Cache Hierarchy

This protocol implements a 2-level private cache hierarchy. It assigns separate Instruction and Data L1 caches, and a unified L2 cache to each core. These caches are private to each core and are controlled with one shared cache controller. This protocol enforce exclusion between L1 and L2 caches.

Stable States and Invariants
States Invariants
MM The cache block is held exclusively by this node and is potentially locally modified (similar to conventional "M" state).
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
S The cache line holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The cache line can be read, but not written in this state.
I The cache line is invalid and does not hold a valid copy of the data.
Cache controller

The notation used in the controller FSM diagrams is described here.

MOESI_hammer supports cache flushing. To flush a cache line, the cache controller first issues a GETF request to the directory to block the line until the flushing is completed. It then issues a PUTF and writes back the cache line.

MOESI hammer cache FSM.jpg
Directory controller

MOESI_hammer memory module, unlike a typical directory protocol, does not contain any directory state and instead broadcasts requests to all the processors in the system. In parallel, it fetches the data from the DRAM and forward the response to the requesters.

probe filter: TODO

  • Stable States and Invariants
States Invariants
NX Not Owner, probe filter entry exists, block in O at Owner.
NO Not Owner, probe filter entry exists, block in E/M at Owner.
S Data clean, probe filter entry exists pointing to the current owner.
O Data clean, probe filter entry exists.
E Exclusive Owner, no probe filter entry.
  • Controller

TODO

Other features

TODO

MOESI_CMP_token

Shoaib will do it

MOESI_CMP_directory

In contrast with the MESI protocol, the MOESI protocol introduces an additional Owned state.

Related Files
  • src/mem/protocols
    • MOESI_CMP_directory-L1cache.sm: L1 cache controller specification
    • MOESI_CMP_directory-L2cache.sm: L2 cache controller specification
    • MOESI_CMP_directory-dir.sm: directory controller specification
    • MOESI_CMP_directory-dma.sm: dma controller specification
    • MOESI_CMP_directory-msg.sm: message type specification
    • MOESI_CMP_directory.slicc: container file
Cache Hierarchy
L1 Cache
  • Stable States and Invariants
States Invariants
MM The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state).
MM_W The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state). Replacements and DMA accesses are not allowed in this state. The block automatically transitions to MM state after a timeout.
O The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
M The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
M_W The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Only loads and stores are allowed. Silent upgrade happens to MM_W state on store. Replacements and DMA accesses are not allowed in this state. The block automatically transitions to M state after a timeout.
S The cache block is held in shared state by 1 or more nodes. Stores are not allowed in this state.
I The cache block is invalid.
  • Controller

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory L1cache FSM.jpg
    • Optimizations
States Description
SM A GETX has been issued to get exclusive permissions for an impending store to the cache block, but an old copy of the block is still present. Stores and Replacements are not allowed in this state.
OM A GETX has been issued to get exclusive permissions for an impending store to the cache block, the data has been received, but all expected acknowledgments have not yet arrived. Stores and Replacements are not allowed in this state.

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory L1cache optim FSM.jpg
L2 Cache
  • Stable States and Invariants
States Invariants
NP/I TODO
ILS TODO
ILX TODO
ILO TODO
ILOX TODO
ILOS TODO
ILOSX TODO
S TODO
O TODO
OLS TODO
OLSX TODO
SLS TODO
M TODO
  • Controller
Directory
  • Stable States and Invariants
States Invariants
M The cache block is held in exclusive state by only 1 node (which is also the owner). There are no sharers of this block. The data is potentially different from that in memory.
O The cache block is owned by exactly 1 node. There may be sharers of this block. The data is potentially different from that in memory.
S The cache block is held in shared state by 1 or more nodes. No node has ownership of the block. The data is consistent with that in memory (Check).
I The cache block is invalid.
  • Controller

The notation used in the controller FSM diagrams is described here.

MOESI CMP directory dir FSM.jpg
Other features
  • Timeouts:

Rathijit will do it

MESI_CMP_directory
Important features of the protocol
  • This protocol models two-level cache hierarchy. The L1 cache is private to a core, while the L2 cache is shared among the cores.
  • Inclusion is maintained between the L1 and L2 cache.
  • At high level the protocol has four stable states, M, E, S and I. A block in M state means the blocks is writable (i.e. has exclusive permission) and has been dirtied (i.e. its the only valid copy on-chip). E state represent a cache block with exclusive permission (i.e. writable) but is not written yet. S state means the cache block is only readable and possible multiple copies of it exists in multiple private cache and as well as in the shared cache. I means that the cache block is invalid.
  • The on-chip cache coherence is maintained through Directory Coherence scheme, where the directory information is co-located with the corresponding cache blocks in the shared L2 cache.
  • The protocol has four types of controllers -- L1 cache controller, L2 cache controller, Directory controller and DMA controller. L1 cache controller is responsible for managing L1 Instruction and L1 Data Cache. Number of instantiation of L1 cache controller is equal to the number of cores in the simulated system. L2 cache controller is responsible for managing the shared L2 cache and for maintaining coherence of on-chip data through directory coherence scheme. The Directory controller act as interface to the Memory Controller/Off-chip main memory and also responsible for coherence across multiple chips/and external coherence request from DMA controller. DMA controller is responsible for satisfying coherent DMA requests.
  • One of the primary optimization in this protocol is that if a L1 Cache request a data block even for read permission, the L2 cache controller if finds that no other core has the block, it returns the cache block with exclusive permission. This is an optimization done in anticipation that a cache blocks read would be written by the same core soon and thus save an extra request with this optimization. This is exactly why E state exits (i.e. when a cache block is writable but not yet written).
  • The protocol supports silent eviction of clean cache blocks from the private L1 caches. This means that cache blocks which have not been written to and has readable permission only can drop the cache block from the private L1 cache without informing the L2 cache. This optimization helps reducing write-back traffic to the L2 cache controller.
Related Files
  • src/mem/protocols
    • MESI_CMP_directory-L1cache.sm: L1 cache controller specification
    • MESI_CMP_directory-L2cache.sm: L2 cache controller specification
    • MESI_CMP_directory-dir.sm: directory controller specification
    • MESI_CMP_directory-dma.sm: dma controller specification
    • MESI_CMP_directory-msg.sm: coherence message type specifications
    • MESI_CMP_directory.slicc: container file
Network_test

This is a dummy cache coherence protocol that is used to operate the ruby network tester. The details about running the network tester can be found here.

Related Files
  • src/mem/protocols
    • Network_test-cache.sm: cache controller specification
    • Network_test-dir.sm: directory controller specification
    • Network_test-msg.sm: message type specification
    • Network_test.slicc: container file
Cache Hierarchy

This protocol assumes a 1-level cache hierarchy. The role of the cache is to simply send messages from the cpu to the appropriate directory (based on the address), in the appropriate virtual network (based on the message type). It does not track any state. Infact, no CacheMemory is created unlike other protocols. The directory receives the messages from the caches, but does not send any back. The goal of this protocol is to enable simulation/testing of just the interconnection network.

Stable States and Invariants
States Invariants
I Default state of all cache blocks
Cache controller
  • Requests, Responses, Triggers:
    • Load, Instruction fetch, Store from the core.

The network tester (in src/cpu/testers/networktest/networktest.cc) generates packets of the type ReadReq, INST_FETCH, and WriteReq, which are converted into RubyRequestType:LD, RubyRequestType:IFETCH, and RubyRequestType:ST, respectively, by the RubyPort (in src/mem/ruby/system/RubyPort.hh/cc). These messages reach the cache controller via the Sequencer. The destination for these messages is determined by the traffic type, and embedded in the address. More details can be found here.

  • Main Operation:
    • The goal of the cache is only to act as a source node in the underlying interconnection network. It does not track any states.
    • On a LD from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Control (8 bytes) in the request vnet (0).
      • Note: vnet 0 could also be made to broadcast, instead of sending a directed message to a particular directory, by uncommenting the appropriate line in the a_issueRequest action in Network_test-cache.sm
    • On a IFETCH from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Control (8 bytes) in the forward vnet (1).
    • On a ST from the core:
      • it returns a hit, and
      • maps the address to a directory, and issues a message for it of type MSG, and size Data (72 bytes) in the response vnet (2).
    • Note: request, forward and response are just used to differentiate the vnets, but do not have any physical significance in this protocol.
Directory controller
  • Requests, Responses, Triggers:
    • MSG from the cores
  • Main Operation:
    • The goal of the directory is only to act as a destination node in the underlying interconnection network. It does not track any states.
    • The directory simply pops its incoming queue upon receiving the message.
Other features
    • This protocol assumes only 3 vnets.
    • It should only be used when running the ruby network test.


Protocol Independent Memory components

System

This is a high level container for few of the important components of the Ruby which may need to be accessed from various parts and components of Ruby. Only ONE instance of this class is created. The instance of this class is globally available through a pointer named g_system_ptr. It holds pointer to the Ruby's profiler object. This allows any component of Ruby to get hold of the profiler and collect statistics in a central location by accessing it though g_system_ptr. It also holds important information about the memory hierarchy like Cache blocks size, Physical memory size and makes them available to all parts of Ruby as and when required. It also holds the pointer to the on-chip network in Ruby. Another important objects that it hold pointer to is the Ruby's wrapper for the simulator event queue (called RubyEventQueue). It also contains pointer to the simulated physical memory (pointed by variable name m_mem_vec_ptr). Thus in sum, System class in Ruby just acts as a container to pointers to some important objects of Ruby's memory system and makes them available globally to all places in Ruby through exposing it self through g_system_ptr.

Parameters
  1. random_seed is seed to randomize delays in Ruby. This allows simulating multiple runs with slightly perturbed delays/timings.
  2. randomization is the parameter when turned on, asks Ruby to randomly delay messages. This falg is useful when stress testing a system to expose corner cases. This flag should NOT be turned on when collecting simulation statistics.
  3. clock is the parameter for setting the clock frequency (on-chip).
  4. block_size_bytes specifies the size of cache blocks in bytes.
  5. mem_size specifies the physical memory size of the simulated system.
  6. network gives the pointer to the on-chip network for Ruby.
  7. profiler gives the pointer to the profiler of Ruby.
  8. tracer is the pointer to the Ruby 's memory request tracer. Tracer is primarily used to playback memory request trace in order to warm up Ruby's caches before actual simulation starts.
Related files
  • src/mem/ruby/system
    • System.hh/cc: contains code for the System
    • RubySystem.py :the corresponding python file with parameters.
Sequencer

Sequencer is one of the most important classes in Ruby, through which every memory request must pass through at least twice -- once before getting serviced by the cache coherence mechanism and once just after being serviced by the cache coherence mechanism. There is one instantiation of Sequencer class for each of the hardware thread being simulated. For example, if we are simulating a 16-core system with each core has single hardware thread context, then there would be 16 Sequencer object in the system, with each being responsible for managing memory requests (Load, Store, Atomic operations etc) from one of the given hardware thread context. ith Sequencer object handles request from only ith hardware context (in case of above example it is ith core). Each Sequencer assumes that it has access to the L1 Instruction and Data caches that are attached to a given core.

Following are the primary responsibilities of Sequencer:

  1. Injecting and accounting for all memory request to the underlying cache hierarchy (and coherence).
  2. Resource allocation and accounting (e.g. makes sure a particular core/hardware thread does not have more than specified number of outstanding memory requests).
  3. Making sure atomic operations are handled properly.
  4. Making sure that underlying cache hierarchy and coherence protocol is making forward progress.
  5. Once a request is serviced by the underlying cache hierarchy, Sequencer is responsible for returning the result to the corresponding port of the frontend (i.e.M5).
Parameters
  1. icache is the parameter where the L1 Instruction cache pointer is passed.
  2. dcache is the parameter where the L1 Data cache pointer is passed.
  3. max_outstanding_requests is the parameter for specifying maximum allowed number of outstanding memory request from a given core or hardware thread.
  4. deadlock_threshold is the parameter that specifies number of cycle (ruby cycles) after which if a given memory request is not satisfied by the cache hierarchy, a possibility of deadlock (or lack of forward progress) is declared.
Related files
  • src/mem/ruby/system
    • Sequencer.hh/cc: contains code for the Sequencer
    • Sequencer.py :the corresponding python file with parameters.
More detailed operation description

In this section, we will describe the operations of Sequencer in more details.

  • Injection of memory request to cache hierarchy, request accounting and resource allocation:
The entry point for a memory request to the Sequencer is method called makeRequest. This method is called from the corresponding RubyPort (Ruby's wrapper for M5 port). Once it gets the request, Sequencer checks for whether required resource limitations need to be enforced or not. This is done by calling a method called getRequestStatus. In this method, it is made sure that a given Sequencer (i.e. a core or hardware thread context) can does NOT issue multiple simultaneous requests to same cache block. If the current request indeed for a cache block for which another request is still pending from the same Sequencer then the current request is not issued to the cache hierarchy and instead the current request wait for the previous request from the same cache block to satisfied first. This is done in the code by checking for two requesting accounting table called m_writeRequestTable and m_readRequestTable and setting the status to RequestSatus_Aliased. It also makes sure that number of outsatnding memory request from a given Sequencer does not overshoot.
If it is found that the current request won't violate any of the above constrained then the request is registered for accounting purposes. This is done by calling the method named insertRequest. In this method, depending upon the type of the request, an entry for the request is created in the either m_writeRequestTable or m_readRequestTable. These two tables keep record for write request and read requests, respectively, that are issued to the cache hierarchy by the given Sequencer but still to be satisfied. Finally the memory request is finally pushed to the cache hierarchy by calling the method named issueRequest. This method is responsible of creating the request structure that is understood by the underlying SLICC generated coherence protocol implementation and cache hierarchy. This is done by setting the request type and mode accordingly and creating an object of class RubyRequest for the current request. Finally, L1 Instruction or L1 Data cache accesses latencies are accounted for an the request is pushed to the Cache hierarchy and the coherence mechanism for processing. This is done by enqueue-ing the request to the pointer to the mandatory queue (m_mandatory_q_ptr). Every request is passed to the corresponding L1 Cache controller through this mandatory queue. Cache hierarchy is then responsible for satisfying the request.
  • Deadlock/lack of forward progress detection:
As mentioned earlier, one other responsibility of the Sequencer is to make sure that Cache hierarchy is making progress in servicing the memory requests that have been issued. This is done by periodically waking up and scanning through the m_writeRequestTable and m_readRequestTable tables. which holds the currently outstanding requests from the given Sequencer and finds out which requests have been issued but have not satisfied by the cache hierarchy. If it finds any unsatisfied request that have been issues more than m_deadlock_threshold (parameter) cycles back, it reports a possible deadlock by the Cache hierarchy. Note that, although it reports possible deadlock, it actually detects lack of forward progress. Thus there may be false positives in this deadlock detection mechanism.
  • Send back result to the front-end and making sure Atomic operations are handled properly:
Once the Cache hierarchy satisfies a request it calls Sequencer's readCallback or writeCallback method depending upon the type of the request. Note that the time taken to service the request is automatically accounted for through event scheduling as the readCallback or writeCallback'' are called only after number of cycles required to satisfy the request has been accounted for. In these two methods the corresponding record of the request from the m_readRequestTable or m_writeRequestTable is removed. Also if the request was found to be part of a Atomic operations (e.g. RMW or LL/SC), then appropriate actions are taken to make sure that semantics of atomic operations are respected.
After that a method called hitCallback is called. In this method, some statics is collected by calling functions on the Ruby's profiler. For write request, the data is actually updated in this function (and not while simulating the request through the cache hierarchy and coherence protocol). Finally ruby_hit_callback is called ultimately sends back the packet to the front-end, signifying completion of the processing of the memory request from Ruby's side.
CacheMemory and Cache Replacement Polices

This module can model any Set-associative Cache structure with a given associativity and size. Each instantiation of the following module models a single bank of a cache. Thus different types of caches in system (e.g. L1 Instruction, L1 Data , L2 etc) and every banks of a cache needs to have separate instantiation of this module. This module can also model Fully associative cache when the associativity is set to 1. In Ruby memory system, this module is primarily expected to be accessed by the SLICC generated codes of the given Coherence protocol being modeled.

Basic Operation

This module models the set-associative structure as a two dimensional (2D) array. Each row of the 2D array represents a set of in the set-associative cache structure, while columns represents ways. The number of columns is equal to the given associativity (parameter), while the number of rows is decided depending on the desired size of the structure (parameter), associativity (parameter) and the size of the cache line (parameter). This module exposes six important functionalities which Coherence Protocols uses to manage the caches.

  1. It allows to query if a given cache line address is present in the set-associative structure being modeled through a function named isTagPresent. This function returns true, iff the given cache line address is present in it.
  2. It allows a lookup operation which returns the cache entry for a given cache line address (if present), through a function named lookup. It returns NULL if the blocks with given address is not present in the set-associative cache structure.
  3. It allows to allocate a new cache entry in the set-associative structure through a function named allocate.
  4. It allows to deallocate a cache entry of a given cache line address through a function named deallocate.
  5. It can be queried to find out whether to allocate an entry with given cache line address would require replacement of another entry in the designated set (derived from the cache line address) or not. This functionality is provided through cacheAvail function, which for a given cache line address, returns True, if NO replacement of another entry the same set as the given address is required to make space for a new entry with the given address.
  6. The function cacheProbe is used to find out cache line address of a victim line, in case placing a new entry would require victimizing another cache blocks in the same set. This function returns the cache line address of the victim line given the address of the address of the new cache line that would have to be allocated.
Parameters

There are four important parameters for this class.

  1. size is the parameter that provides the size of the set-associative structure being modeled in units of bytes.
  2. assoc specifies the set-associativity of the structure.
  3. replacement_policy is the name of the replacement policy that would be used to select victim cache line when there is conflict in a given set. Currently, only two possible choices are available (PSEUDO_LRU and LRU).
  4. Finally, start_index_bit parameter specifies the bit position in the address from where indexing into the cache should start. This is a tricky parameter and if not set properly would end up using only portion of the cache capacity. Thus how this value should be specified is explained through couple of examples. Let us assume the cache line size if 64 bytes and a single core machine with a L1 cache with only one bank and a L2 cache with 4 banks. For the CacheMemory module that would model the L1 cache should have start_index_bit set to log2(64) = 6 (this is the default value assuming 64 bytes cache line). This is required as addresses passed around in the Ruby is full address (i.e. equal to the number of bits required to access any address in the physical address range) and as the caches would be accessed in granularity of cache line size (here 64 bytes), the lower order 6 bits in the address would be essentially 0. So we should discard last 6 bits of the given address while calculating which set (index) in the set associative structure the given address should go to. Now let's look into a more complicated case of L2 cache, which has 4 banks. As mentioned previously, this modules models a single bank of a set-associative cache. Thus there will be four instantiation of the CacheMemory class to model the whole L2 cache. Assuming which cache bank a request goes to is statically decided by the low oder log2(4) = 2 bits of the cache line address, the value of the bits in the address at the position 6 and 7 would be same for all accesses coming to a given bank (i.e. a instance of CacheMemory here). Thus indexing within the set associative structure (CacheMemory instance) modeling a given bank should use address bits 8 and higher for finding which set a cache block should go to. Thus start_index_bit parameter should be set to 8 for the banks of L2 in this example. If erroneously if this is set 6, only a fourth of desired L2 capacity would be utilized !!!
More detailed description of operation

As mentioned previously, the set-associative structure is modeled as a 2D array in the CacheMemory class. The variable m_cache is this all important 2D array containing the set-associative structure. Each element of this 2D array is derived from type AbstractCacheEntry class. Beside the minimal required functionality and contents of each cache entry, it can be extended inside the Coherence protocol files. This allows CacheMemory to be generic enough to hold any type of cache entry as desired by a given Coherence protocol as long as it derives from AbstractCacheEntry interface. The m_cache 2D array has number of rows equal to the number of sets in the set-associative structure being modeled while the number of columns is equal to associativity.

As should happen in any set-associative structure, which set (row) a cache entry should reside is decided by part of the cache block address used for indexing. The function addressToCacheSet calculates this index given an address. The way in which a cache entry reside in its designated set (row) is noted in the a hash_map object named m_tag_index. So to access an cache entry in the set-associative structure, first the set number where the cache block should reside is calculated and then m_tag_index is looked-up to find out the way in which the required cache block resides. If an cache entry holds invalid entry or its empty then its set to NULL or its permission is set to NotPresent.

One important aspect of the Ruby's caches are the segregation of the set-associative structure for the cache and its replacement policy. This allows modular design where structure of the cache is independent of the replacement policy in the cache. When a victim needs to be selected to make space for a new cache block (by calling cacheProbe function), getVictim function of the class implementing replacement policy is called for the given set. getVictim returns the way number of the victim. The replacement policy is updated about accesses by calling touch function of the replacement policy, which allows it to update the access recency. Currently there are two replacement policies are supported -- LRU and PseudoLRU. LRU policy has a straight forward implementation where it keeps track of the absolute time when each way within each set is accessed last time and it always victimizes the entry which was last accessed furthest back in time. PseudoLRU implements a binary-tree based Non-Recently-Used policy. It arranges the ways in each set in an implicit binary tree like structure. Every node of the binary tree encodes the information which of its two subtrees was accessed more recently. During victim selection process, it starts from the root of the tree and traverse down such that it chooses the subtree which was touched less recently. Traversal continues until it reaches a leaf node. It then returns the id of the leaf node reached.

Related files
  • src/mem/ruby/system
    • CacheMemory.cc: contains CacheMemory class which models a cache bank
    • CacheMemory.hh: Interface for the CacheMemory class
    • Cache.py: Python configuration file for CacheMemory class
    • AbstractReplacementPolicy.hh: Generic interface for Replacement policies for the Cache
    • LRUPolicy.hh: contains LRU replacement policy
    • PseudoLRUPolicy.hh: contains Pseudo-LRU replacement policy
  • src/mem/ruby/slicc_interface
    • AbstarctCacheEntry.hh: contains the interface of Cache Entry
DMASequencer

This module implements handling for DMA transfers. It is derived from the RubyPort class. There can be a number of DMA controllers that interface with the DMASequencer. The DMA sequencer has a protocol-independent interface and implementation. The DMA controllers are described with SLICC and are protocol-specific.

TODO: Fix documentation to reflect latest changes in the implementation.

Note:

  1. At any time there can be only 1 request active in the DMASequencer.
  2. Only ordinary load and store requests are handled. No other request types such as Ifetch, RMW, LL/SC are handled
Related Files
  • src/mem/ruby/system
    • DMASequencer.hh: Declares the DMASequencer class and structure of a DMARequest
    • DMASequencer.cc: Implements the methods of the DMASequencer class, such as request issue and callbacks.
Configuration Parameters

Currently there are no special configuration parameters for the DMASequencer.

Basic Operation

A request for data transfer is split up into multiple requests each transferring cache-block-size chunks of data. A request is active as long as all the smaller transfers are not completed. During this time, the DMASequencer is in a busy state and cannot accepts any new transfer requests.

DMA requests are made through the makeRequest method. If the sequencer is not busy and the request is of the correct type (LD/ST), it is accepted. A sequence of requests for smaller data chunks is then issued. The issueNext method issues each of the smaller requests. A data/acknowledgment callback signals completion of the last transfer and triggers the next call to issueNext as long as all of the original data transfer is not complete. There is no separate event scheduler within the DMASequencer.

Memory Controller

This module simulates a basic DDR-style memory controller. It models a single channel, connected to any number of DIMMs with any number of ranks of DRAMs each. The following picture shows an overview of the memory organization and connections to the memory controller. General information about memory controllers can be found here.

Mc overview.jpg

Note:

  1. The product of the memory bus cycle multiplier, memory controller latency, and clock cycle time(=1/processor frequency) gives a first-order approximation of the latency of memory requests in time. The Memory Controller module refines this further by considering bank & bus contention, queueing effects of finite queues, and refreshes.
  2. Data sheet values for some components of the memory latency are specified in time (nanoseconds), whereas the Memory Controller module expects all delay configuration parameters in cycles. The parameters should be set appropriately taking into account the processor and bus frequencies.
  3. The current implementation does not consider pin-bandwidth contention. Infinite bandwidth is assumed.
  4. Only closed bank policy is currently implemented; that is, each bank is automatically closed after a single read or write.
  5. The current implementation handles only a single channel. If you want multiple address/data channels, you need to instantiate multiple copies of this module.
  6. This is the only controller that is NOT specified in SLICC, but in C++.

Documentation source: Most (but not all) of the writeup in this section is taken verbatim from documentation in the gem5 source files and rubyconfig.defaults file of GEMS.

Related Files
  • src/mem/ruby/system
    • MemoryControl.hh: This file declares the Memory Controller class.
    • MemoryControl.cc: This file implements all the operations of the memory controller. This includes processing of input packets, address decoding and bank selection, request scheduling and contention handling, handling refresh, returning completed requests to the directory controller.
    • MemoryControl.py: Configuration parameters
Configuration Parameters
  • dimms_per_channel: Currently the only thing that matters is the number of ranks per channel, i.e. the product of this parameter and ranks_per_dimm. But if and when this is expanded to do FB-DIMMs, the distinction between the two will matter.
  • Address Mapping: This is controlled by configuration parameters banks_per_rank, bank_bit_0, ranks_per_dimm, rank_bit_0, dimms_per_channel, dimm_bit_0. You could choose to have the bank bits, rank bits, and DIMM bits in any order. For the default values, we assume this format for addresses:
    • Offset within line: [5:0]
    • Memory controller #: [7:6]
    • Bank: [10:8]
    • Rank: [11]
    • DIMM: [12]
    • Row addr / Col addr: [top:13]

If you get these bits wrong, then some banks won't see any requests; you need to check for this in the .stats output.

  • mem_bus_cycle_multiplier: Basic cycle time of the memory controller. This defines the period which is used as the memory channel clock period, the address bus bit time, and the memory controller cycle time. Assuming a 200 MHz memory channel (DDR-400, which has 400 bits/sec data), and a 2 GHz processor clock, mem_bus_cycle_multiplier=10.
  • mem_ctl_latency: Latency to returning read request or writeback acknowledgement. Measured in memory address cycles. This equals tRCD + CL + AL + (four bit times) + (round trip on channel) + (memory control internal delays). It's going to be an approximation, so pick what you like. Note: The fact that latency is a constant, and does not depend on two low-order address bits, implies that our memory controller either: (a) tells the DRAM to read the critical word first, and sends the critical word first back to the CPU, or (b) waits until it has seen all four bit times on the data wires before sending anything back. Either is plausible. If (a), remove the "four bit times" term from the calculation above.
  • rank_rank_delay: This is how many memory address cycles to delay between reads to different ranks of DRAMs to allow for clock skew.
  • read_write_delay: This is how many memory address cycles to delay between a read and a write. This is based on two things: (1) the data bus is used one cycle earlier in the operation; (2) a round-trip wire delay from the controller to the DIMM that did the reading. Usually this is set to 2.
  • basic_bus_busy_time: Basic address and data bus occupancy. If you are assuming a 16-byte-wide data bus (pairs of DIMMs side-by-side), then the data bus occupancy matches the address bus occupancy at 2 cycles. But if the channel is only 8 bytes wide, you need to increase this bus occupancy time to 4 cycles.
  • mem_random_arbitrate: By default, the memory controller uses round-robin to arbitrate between ready bank queues for use of the address bus. If you wish to add randomness to the system, set this parameter to one instead, and it will restart the round-robin pointer at a random bank number each cycle. If you want additional nondeterminism, set the parameter to some integer n >= 2, and it will in addition add a n% chance each cycle that a ready bank will be delayed an additional cycle. Note that if you are in mem_fixed_delay mode (see below), mem_random_arbitrate=1 will have no effect, but mem_random_arbitrate=2 or more will.
  • mem_fixed_delay: If this is nonzero, it will disable the memory controller and instead give every request a fixed latency. The nonzero value specified here is measured in memory cycles and is just added to MEM_CTL_LATENCY. It will also show up in the stats file as a contributor to memory delays stalled at head of bank queue.
  • tFAW: This is an obscure DRAM parameter that says that no more than four activate requests can happen within a window of a certain size. For most configurations this does not come into play, or has very little effect, but it could be used to throttle the power consumption of the DRAM. In this implementation (unlike in a DRAM data sheet) TFAW is measured in memory bus cycles; i.e. if TFAW = 16 then no more than four activates may happen within any 16 cycle window. Refreshes are included in the activates.
  • refresh_period: This is the number of memory cycles between refresh of row x in bank n and refresh of row x+1 in bank n. For DDR-400, this is typically 7.8 usec for commercial systems; after 8192 such refreshes, this will have refreshed the whole chip in 64 msec. If we have a 5 nsec memory clock, 7800 / 5 = 1560 cycles. The memory controller will divide this by the total number of banks, and kick off a refresh to somebody every time that amount is counted down to zero. (There will be some rounding error there, but it should have minimal effect.)
  • Typical Settings for configuration parameters: The default values are for DDR-400 assuming a 2GHz processor clock. If instead of DDR-400, you wanted DDR-800, the channel gets faster but the basic operation of the DRAM core is unchanged. Busy times appear to double just because they are measured in smaller clock cycles. The performance advantage comes because the bus busy times don't actually quite double. You would use something like these values:
mem_bus_cycle_multiplier: 5
bank_busy_time: 22
rank_rank_delay: 2
read_write_delay: 3
basic_bus_busy_time: 3
mem_ctl_latency: 20
refresh_period: 3120
Basic Operation
  • Data Structures

Requests are enqueued into a single input queue. Responses are dequeued from a single response queue. There is a single bank queue for each DRAM bank (the total number of banks is the number of DIMMs per channel x number of ranks per DIMM x number of banks per rank). Each bank also has a busy counter. tFAW shift registers are maintained per rank.

  • Scheduling and Bank Contention

The wakeup function, and in turn, the executeCycle function is tiggered once every memory clock cycle.

Each memory request is placed in a queue associated with a specific memory bank. This queue is of finite size; if the queue is full the request will back up in an (infinite) common queue and will effectively throttle the whole system. This sort of behavior is intended to be closer to real system behavior than if we had an infinite queue on each bank. If you want the latter, just make the bank queues unreasonably large.

The head item on a bank queue is issued when all of the following are true:

  1. The bank is available
  2. The address path to the DIMM is available
  3. The data path to or from the DIMM is available

Note that we are not concerned about fixed offsets in time. The bank will not be used at the same moment as the address path, but since there is no queue in the DIMM or the DRAM it will be used at a constant number of cycles later, so it is treated as if it is used at the same time.

We are assuming "posted CAS"; that is, we send the READ or WRITE immediately after the ACTIVATE. This makes scheduling the address bus trivial; we always schedule a fixed set of cycles. For DDR-400, this is a set of two cycles; for some configurations such as DDR-800 the parameter tRRD forces this to be set to three cycles.

We assume a four-bit-time transfer on the data wires. This is the minimum burst length for DDR-2. This would correspond to (for example) a memory where each DIMM is 72 bits wide and DIMMs are ganged in pairs to deliver 64 bytes at a shot.This gives us the same occupancy on the data wires as on the address wires (for the two-address-cycle case).

The only non-trivial scheduling problem is the data wires. A write will use the wires earlier in the operation than a read will; typically one cycle earlier as seen at the DRAM, but earlier by a worst-case round-trip wire delay when seen at the memory controller. So, while reads from one rank can be scheduled back-to-back every two cycles, and writes (to any rank) scheduled every two cycles, when a read is followed by a write we need to insert a bubble. Furthermore, consecutive reads from two different ranks may need to insert a bubble due to skew between when one DRAM stops driving the wires and when the other one starts. (These bubbles are parameters.)

This means that when some number of reads and writes are at the heads of their queues, reads could starve writes, and/or reads to the same rank could starve out other requests, since the others would never see the data bus ready. For this reason, we have implemented an anti-starvation feature. A group of requests is marked "old", and a counter is incremented each cycle as long as any request from that batch has not issued. If the counter reaches twice the bank busy time, we hold off any newer requests until all of the "old" requests have issued.

Interconnection Network

The various controllers of the memory hierarchy (L1/L2 caches, Directory, DMA etc), specified by the coherence protocol, are connected together via an interconnection network. The primary components of this network are

  1. Topology
  2. Routing
  3. Router Microarchitecture and Flow-Control
Topology

The connection between the various controllers are specified via python files.

  • Related Files:
    • src/mem/ruby/network/topologies/Pt2Pt.py
    • src/mem/ruby/network/topologies/Crossbar.py
    • src/mem/ruby/network/topologies/Mesh.py
    • src/mem/ruby/network/topologies/MeshDirCorners.py
  • Topology Descriptions:
    • Pt2Pt: Each controller (L1/L2/Directory) is connected to every other controller via a direct link. This can be invoked from command line by --topology=Pt2Pt.
    • Crossbar: Each controller (L1/L2/Directory) is connected to every other controller via one switch (modeling the crossbar). This can be invoked from command line by --topology=Crossbar.
    • Mesh: This topology requires the number of directories to be equal to the number of cpus. The number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present), and one Directory. It can be invoked from command line by --topology=Mesh. The number of rows in the mesh has to be specified by --mesh-rows. This parameter enables the creation of non-symmetrical meshes too.
    • MeshDirCorners: This topology requires the number of directories to be equal to 4. number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present). Each corner router/switch is connected to one Directory. It can be invoked from command line by --topology=MeshDirCorners. The number of rows in the mesh has to be specified by --mesh-rows.


Routing

Based on the topology, shortest path graph traversals are used to populate routing tables at each router/switch. The default routing algorithm tries to choose the route with minimum number of link traversals. Links can be given weights to model different routing algorithms. For example, in Mesh.py and MeshDirCorners.py Y-direction links are given weights of 2, while X-direction links are given weights of 1, resulting in XY traversals.


Router Microarchitecture and Flow-Control

Ruby supports two network models, Simple and Garnet, which trade-off detailed modeling versus simulation speed respectively.

  • Related Files:
    • src/mem/ruby/network/Network.py
    • src/mem/ruby/network/simple
    • src/mem/ruby/network/garnet/BaseGarnetNetwork.py
    • src/mem/ruby/network/garnet/fixed-pipeline
    • src/mem/ruby/network/garnet/flexible-pipeline


Simple Network

TODO

Garnet Fixed Pipeline Network

TODO

Garnet Flexible Pipeline Network

TODO


Life of a memory request in Ruby

In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.

  1. A memory request from a core or hardware context of M5 enters the jurisdiction of Ruby through the RubyPort::recvTiming interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of non-multithreaded cores). A port from the side of each core is tied to a corresponding RubyPort.
  2. The memory request arrives as a M5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a normal memory request (not PIO access), it passes the request to the Sequencer::makeRequest interface of the attached Sequencer object with the port (variable ruby_port holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
  3. As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the Sequencer::makeRequest, it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the mandatory queue after accounting for L1 cache access latency. The mandatory queue (variable name m_mandatory_q_ptr) effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
  4. L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the mandatory queue and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of MessageBuffer class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through MessageBuffer. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols). Once the request get satisfied it is pushed upwards in the hierarchy through MessageBuffers.
  5. Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its readCallback or 'writeCallback method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
  6. The Sequencer then clears up the accounting information for the corresponding request and then calls the RubyPort::ruby_hit_callback method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (M5).