Platform Integration
To add a new platform to Beethoven, you must essentially implement a shim between the Beethoven layer and the device. The full Platform interface is defined here and is intended to provide a comprehensive backbone for Beethoven to generate interconnects and floorplans. We will go through the entire interface below.
Front Bus
The front bus is intended to receive commands from the host over some form of memory-mapped IO (MMIO) and to service response requests. Because this communication may be exposed in many different ways, we leave it quite open-ended and define FrontBusProtocol as an interface that
- Defines Diplomacy nodes for delivering RoCC commands
- Defines top-level IOs
Accordingly, FrontBusProtocol is defined as:
abstract class FrontBusProtocol {
  def deriveTLSources(implicit p: Parameters): Config
  def deriveTopIOs(tlChainObj: Any, withClock: Clock, withActiveHighReset: Reset)(implicit p: Parameters): Unit
}
Diplomacy Node Exposure
As an example of how we implement a protocol using this interface, we look to our AXIFrontBusProtocol implementation.
class AXIFrontBusProtocol(withDMA: Boolean) extends FrontBusProtocol {
  ...
  override def deriveTLSources(implicit p: Parameters): Config = {
    // which floorplanning object (e.g., die) contains the front bus interfaces
    // this allows us to place any hardware attached to these interfaces close to the interfaces
    val frontInterfaceID = platform.physicalInterfaces.find(_.isInstanceOf[PhysicalHostInterface]).get.locationDeviceID
    // We're going to expose an AXI4 interface to the device so we need to declare it here
    DeviceContext.withDevice(frontInterfaceID) {
      val axi_master = AXI4MasterNode(Seq(AXI4MasterPortParameters(
        masters = Seq(AXI4MasterParameters(
          name = "S00_AXI",
          aligned = true,
          maxFlight = Some(1),
          id = IdRange(0, 1 << 16)
        ))
      )))
      // instantiate the front-hub module which converts AXI4 MMIOs into RoCC commands
      val fronthub =
        DeviceContext.withDevice(frontInterfaceID) {
          val fronthub = LazyModuleWithFloorplan(new FrontBusHub(), "zzfront6_axifronthub")
          fronthub.axi_in := axi_master
          fronthub
        }
      // if there is a DMA port from the host, we can instantiate that
      val (dma_node, dma_front) = if (withDMA) {
        val node = AXI4MasterNode(Seq(AXI4MasterPortParameters(
          masters = Seq(AXI4MasterParameters(
            name = "S01_AXI",
            maxFlight = Some(1),
            aligned = true,
            id = IdRange(0, 1 << 6)
          ))
        )))
        // convert it to TileLink
        val dma2tl = TLIdentityNode()
        DeviceContext.withDevice(frontInterfaceID) {
          dma2tl :=
            make_tl_buffer() :=
            LazyModuleWithFloorplan(new LongAXI4ToTL(64)).node :=
            AXI4UserYanker(capMaxFlight = Some(1)) :=
            AXI4IdIndexer(1) :=
            AXI4Buffer() := node
        }
        (Some(node), Some(dma2tl))
      } else (None, None)
      // finally, expose the RoCC interface to Beethoven
      val rocc_xb = DeviceContext.withDevice(frontInterfaceID) { RoccFanout("zzfront_7roccout") }
      rocc_xb := fronthub.rocc_out
      // We return the following 3 keys to Beethoven
      // OptionalPassKey lets us pass arbitrary objects to ourselves when we expose the Top-level IOs
      // Since we declared the AXI diplomacy nodes for the host and dma, we'll need to connect them
      // to the IOs when we construct them
      new Config((_, _, _) => {
        case OptionalPassKey => (axi_master, dma_node)
        case RoccNodeKey => rocc_xb
        case DMANodeKey => dma_front
        // debug only
        case DebugCacheProtSignalKey => fronthub.module.io.cache_prot
      })
    } // end of the outer DeviceContext.withDevice block; its value is the Config
  }
}
There's quite a bit going on here but we can break it down into two categories.
Host Commands
We declare a diplomacy node, axi_master, that corresponds to the communication channel connecting us to the host. Because this front-end is intended to facilitate MMIO between us and the host, we have to implement that functionality. For this reason, we instantiate a FrontBusHub, which implements this MMIO slave port. The hub produces a RoCC diplomacy node, rocc_out. Below the DMA instantiation, we connect this node to a RoCC crossbar node to add a buffer between the hub and Beethoven. Finally, we pass the crossbar node to Beethoven by placing it in a Config object under the RoccNodeKey key.
But how are we going to connect axi_master to the top-level IOs? Notice that we haven't actually instantiated them yet. We do that in the second FrontBusProtocol function, deriveTopIOs. To pass axi_master to that function, we put it inside an object under the OptionalPassKey key in our Config.
Host DMA
If the device supports DMA from the host, we can add an additional AXI port for servicing these reads and writes, which are functionally different from the MMIOs we receive on our axi_master port.
This time, instead of connecting it to a FrontBusHub, we convert it to the TileLink protocol, which Beethoven uses internally for elaborating the memory interconnect. As before, this construction yields two objects: a diplomacy node that corresponds to our top-level DMA IOs (AXI4) and a TileLink diplomacy node that will issue memory transactions into our memory interconnect.
As with host commands, we need to pass the diplomacy node corresponding to our top-level IOs to ourselves for later use, so we put that into the OptionalPassKey key as well. To expose the TileLink node, we pass it to Beethoven under the DMANodeKey key as an optional type. If you are not using DMA, simply pass None to this field.
Top-Level IO Exposure
To tie the aforementioned diplomacy nodes to top-level IOs, we implement the deriveTopIOs function for FrontBusProtocol.
def deriveTopIOs(tlChainObj: Any, withClock: Clock, withActiveHighReset: Reset)(implicit p: Parameters): Unit
- tlChainObj: Any - This is the object that you created and passed to OptionalPassKey. As you can see, this is an Any type, giving you freedom to pass arbitrary objects/information between these functions.
- withClock: Clock, withActiveHighReset: Reset - In case you need to instantiate any Module types, you would do it in this function and use these clocks and resets. Be wary of the active-high reset signal.
Here is how we use this function to connect our diplomacy nodes to the AXI nodes in Beethoven.
override def deriveTopIOs(tlChainObj: Any, withClock: Clock, withActiveHighReset: Reset)(implicit p: Parameters): Unit = {
  val (port_cast, dma_cast) = tlChainObj.asInstanceOf[(AXI4MasterNode, Option[AXI4MasterNode])]
  val ap = port_cast.out(0)._1.params
  // instantiate the top-level IO
  val S00_AXI = IO(Flipped(new AXI4Compat(MasterPortParams(
    base = 0,
    size = 1L << p(PlatformKey).frontBusAddressNBits,
    beatBytes = ap.dataBits / 8,
    idBits = ap.idBits))))
  // connect it to our diplomacy node
  AXI4Compat.connectCompatSlave(S00_AXI, port_cast.out(0)._1)
  if (withDMA) {
    val dma = IO(Flipped(new AXI4Compat(MasterPortParams(
      base = 0,
      size = platform.extMem.master.size,
      beatBytes = dma_cast.get.out(0)._1.r.bits.data.getWidth / 8,
      idBits = 6))))
    AXI4Compat.connectCompatSlave(dma, dma_cast.get.out(0)._1)
  }
}
We use AXI4Compat because it provides better naming compared to more Chisel-friendly module types and, as a result, maps to an AXI4 port when you instantiate a BeethovenTop module in Vivado. We provide connectCompatSlave to connect these Vivado-friendly ports with Diplomacy-friendly AXI4 ports.
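Because tlChainObj is typed as Any, the asInstanceOf cast in deriveTopIOs is unchecked: a mismatch between what deriveTLSources stores and what deriveTopIOs expects only surfaces when an element is used. The following is a minimal, library-free sketch of recovering the passed object with a pattern match instead. HostNode, pack, and unpack are hypothetical stand-ins for illustration (HostNode plays the role of AXI4MasterNode); none of them are part of Beethoven.

```scala
object PassKeySketch {
  // Hypothetical stand-in for AXI4MasterNode, so the sketch needs no Chisel dependency.
  final case class HostNode(name: String)

  // What deriveTLSources would place under OptionalPassKey: an untyped tuple.
  def pack(host: HostNode, dma: Option[HostNode]): Any = (host, dma)

  // Recover the tuple, checking each element's runtime type instead of casting blindly.
  def unpack(obj: Any): Either[String, (HostNode, Option[HostNode])] = obj match {
    case (h: HostNode, None)              => Right((h, None))
    case (h: HostNode, Some(d: HostNode)) => Right((h, Some(d)))
    case other                            => Left(s"unexpected pass-through object: $other")
  }
}
```

A caller would match on the result and fail elaboration with a readable message in the Left case, rather than hitting a ClassCastException later.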
Platform Parameters
In addition to these functions, the platform should also declare the following values to facilitate C++ code generation.
// these are the default values we use for the AXIFrontBusProtocol
override val frontBusBaseAddress: Long = 0
override val frontBusAddressNBits: Int = 16
override val frontBusAddressMask: Long = 0xFFFF
override val frontBusBeatBytes: Int = 4
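The default mask above is consistent with an all-ones mask over frontBusAddressNBits bits. As a small sketch of that relationship, assuming the mask is always meant to cover exactly the low frontBusAddressNBits bits (the source only shows the default values), the helper names below are hypothetical and not part of Beethoven:

```scala
object FrontBusParamsSketch {
  // All-ones mask covering the low `nBits` address bits.
  def addressMask(nBits: Int): Long = {
    require(nBits > 0 && nBits < 63, "nBits must fit in a positive Long")
    (1L << nBits) - 1
  }

  // True if `addr` falls inside the front-bus window that starts at `base`.
  def inWindow(addr: Long, base: Long, nBits: Int): Boolean =
    addr >= base && addr < base + (1L << nBits)
}
```

With the defaults, addressMask(16) equals 0xFFFF, and the window spans addresses 0x0 through 0xFFFF.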
Alternative Usage Patterns
While we believe this exposure of AXI4 ports will likely be the typical usage-pattern, we have internally used these functions to test other integrations.
For instance, we tested and verified ChipKIT integration in this way.
Instead of AXI4 slave ports for communicating with the host, we instantiated an ARM M0 CPU core inside of deriveTopIOs and connected it to external UART IOs using the ChipKIT IPs. Because the M0 only has a single AHB port for communicating with instruction SRAM, data SRAM, external memory, and Beethoven, we did the following:
- Implemented an AHB filter for these domains.
- Instantiated the SRAMs inside deriveTopIOs and connected these to the AHB filter slave side.
- Converted the AHB Beethoven slave to AXI4 and connected it to a FrontBusHub module.
- Converted the AHB external memory slave to TileLink and exposed it to Beethoven as a DMA node.
This integration was one of our preliminary efforts towards test-chip integration and, while it was functional, it had some issues. Our current test-chip integration is a work in progress, and we will work towards making it usable by others in a future release. Our emphasis here is that, with these interfaces, you can implement a reasonably sophisticated integration.
Memory
Currently, Beethoven exposes AXI4 interfaces to the external memory. While this will be the common case for FPGAs, it may not be universal. Beethoven instantiates the AXI4 interfaces according to the following parameters as part of your platform declaration:
// these are the default values for the AWS F2
override val hasDiscreteMemory: Boolean = true
override val physicalMemoryBytes: Long = 0x400000000L
override val memorySpaceAddressBase: Long = 0x0
override val memorySpaceSizeBytes: BigInt = BigInt(1) << 34
override val memoryControllerIDBits: Int = 16
override val memoryControllerBeatBytes: Int = 64
override val memoryNChannels: Int = 1
- hasDiscreteMemory - this is only used for C++ header generation. It informs the Beethoven runtime whether to use the system allocator or to instantiate an allocator for allocating regions in the FPGA's discrete address space.
- physicalMemoryBytes - the size of the physical memory space. This quantity will be generated into the C++ bindings.
- memorySpaceAddressBase - the base address offset for the external memory.
- memorySpaceSizeBytes - the size of the full address space. That is, if you are on an embedded FPGA and the physical address space is much smaller than the virtual address space, you should provide the size of the virtual address space. This determines the address width of the memory interconnect.
- memoryControllerIDBits - the number of usable ID bits on the external memory interface. Specifying 0 here will still elaborate the ID field but it will always be driven to 0. Otherwise, Beethoven is free to generate memory requests with IDs in the range [0, 1 << memoryControllerIDBits). If the number of bits of the field is greater than the number of supported IDs, then set this parameter to support the latter.
- memoryControllerBeatBytes - the data width of the external memory bus, in bytes.
- memoryNChannels - if you wish to generate multiple different memory channels, increase this parameter. The memory channels are discrete, as is required by diplomacy. That is, each memory channel is non-overlapping with the other memory channels. The above parameters apply to a single channel. So if you have two channels, each with 8GB of capacity, you would specify memorySpaceSizeBytes = BigInt(1) << 33.
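The arithmetic behind these parameters can be sketched in plain Scala. This is only an illustration under stated assumptions: it takes the address width to be the ceiling of log2 of memorySpaceSizeBytes and assumes total capacity is split evenly across channels. The helper names are hypothetical and do not exist in Beethoven.

```scala
object MemoryParamsSketch {
  // Address bits needed to cover `sizeBytes` bytes (ceiling of log2).
  def addressBits(sizeBytes: BigInt): Int = {
    require(sizeBytes > 0, "address space must be non-empty")
    (sizeBytes - 1).bitLength
  }

  // Exclusive upper bound of the ID range [0, 1 << idBits).
  def idUpperBound(idBits: Int): Int = {
    require(idBits >= 0 && idBits < 31, "idBits must fit in an Int shift")
    1 << idBits
  }

  // Per-channel size when total capacity is split evenly over `nChannels` channels.
  def perChannelSize(totalBytes: BigInt, nChannels: Int): BigInt = {
    require(nChannels > 0, "need at least one channel")
    totalBytes / nChannels
  }
}
```

For the AWS F2 defaults, addressBits(BigInt(1) << 34) gives 34, and splitting 16GB over two channels gives BigInt(1) << 33 per channel, matching the two-channel example above.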
Floorplanning
Beethoven will attempt to generate device-aware floorplans using the information provided to it about the device's topology. The topology presented to Beethoven can correspond directly to the device topology or, for more sophisticated floorplanning, the developer could specify a more precise topology on top of a device. For simplicity, we'll first cover device topologies.
The basic interface is shown below for a single-die device. In such cases, there is really nothing more to do and we let our backend tool handle placement by itself.
// default settings - corresponds to a single-die device
def placementAffinity: Map[Int, Double] = Map.from(physicalDevices.map { dev => (dev.identifier, 1.0 / physicalDevices.length) })
val physicalDevices: List[DeviceConfig] = List(DeviceConfig(0, ""))
val physicalInterfaces: List[PhysicalInterface] = List(PhysicalHostInterface(0), PhysicalMemoryInterface(0, 0))
val physicalConnectivity: List[(Int, Int)] = List()
However, if we are deploying designs on an AWS F2 FPGA, for instance, those FPGAs are constructed from three silicon dies connected with Through-Silicon Vias (TSVs). The delay through the TSVs is high and the number of TSVs is limited, so we attempt to minimize crossovers by pinning cores onto specified dies and elaborating the interconnects to minimize die crossings.
On the AWS F2 instances, we expose three dies using unique integral IDs corresponding to a name that will appear in the floorplanning files.
override val physicalDevices: List[DeviceConfig] = List(
  DeviceConfig(0, "pblock_CL_bot"),
  DeviceConfig(1, "pblock_CL_mid"),
  DeviceConfig(2, "pblock_CL_top")
)
Next, we tell Beethoven which dies contain which physical interfaces. For the AWS F2 instances, the host AXI interface is on die 0 and the memory interface is on die 1. In the future, when we add support for the many on-chip HBM interfaces, they will be added here to die 0.
override val physicalInterfaces: List[PhysicalInterface] = List(
  PhysicalHostInterface(0),
  PhysicalMemoryInterface(1, 0)
)
Next, we tell Beethoven the connectivity between the dies. The connectivity need not be a DAG, but it often is.
override val physicalConnectivity: List[(Int, Int)] = List((0, 1), (1, 2))
Here, we specify that the connectivity is 0 - 1 - 2.
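To make the cost of a placement concrete, the connectivity list can be read as an undirected graph and searched for the minimum number of die crossings between two dies. The sketch below is a plain breadth-first search over pairs shaped like physicalConnectivity. Treating links as undirected is an assumption, and DieTopologySketch and its methods are hypothetical names, not part of Beethoven's floorplanner.

```scala
object DieTopologySketch {
  // Undirected adjacency map built from (a, b) pairs such as physicalConnectivity.
  def neighbors(links: List[(Int, Int)]): Map[Int, Set[Int]] =
    links
      .flatMap { case (a, b) => List(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (die, pairs) => die -> pairs.map(_._2).toSet }

  // Minimum number of die crossings from `from` to `to`, or None if unreachable.
  def crossings(links: List[(Int, Int)], from: Int, to: Int): Option[Int] = {
    val adj = neighbors(links)

    @annotation.tailrec
    def bfs(frontier: Set[Int], seen: Set[Int], depth: Int): Option[Int] =
      if (frontier.contains(to)) Some(depth)
      else if (frontier.isEmpty) None
      else {
        val next = frontier.flatMap(d => adj.getOrElse(d, Set.empty[Int])) -- seen
        bfs(next, seen ++ next, depth + 1)
      }

    bfs(Set(from), Set(from), 0)
  }
}
```

With the F2 connectivity List((0, 1), (1, 2)), a core pinned to die 2 is two crossings from the host interface on die 0 and one crossing from the memory interface on die 1.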
Platform Fine-Tuning
Finally, Beethoven allows the platform developer to tweak interconnect generation parameters to be more or less aggressive based on their needs.
// only really used in ASIC mode
val clockRateMHz: Int = 100
// suggest 64 for AWS
val prefetchSourceMultiplicity: Int = 16
val defaultReadTXConcurrency: Int = 4
val defaultWriteTXConcurrency: Int = 4
val xbarMaxDegree = 2
val maxMemEndpointsPerSystem = 1
val maxMemEndpointsPerCore = 1
val interCoreMemReductionLatency = 1
val interCoreMemBusWidthBytes = 4
/**
* These default values should typically be fine.
*/
val net_intraDeviceXbarLatencyPerLayer = 1
val net_intraDeviceXbarTopLatency = 1
val net_fpgaSLRBridgeLatency = 2
val memEndpointsPerDevice = 1
- prefetchSourceMultiplicity - the longest single transaction, in beats, that a reader/writer should emit. On AWS platforms, the DDR controller recommends 64 to achieve maximum DDR efficiency. However, this value linearly impacts the amount of buffering that is required inside of the reader/writer.
- defaultReadTXConcurrency / defaultWriteTXConcurrency - how many concurrent transactions a reader/writer can have in flight. This also linearly impacts the amount of necessary buffering and the complexity of some of the logic. This can be set manually per reader/writer in their configurations for modules that require especially high throughput on only some interfaces.
- xbarMaxDegree - the maximum input/output degree for any crossbar in our interconnect generation.
- maxMemEndpointsPerSystem / maxMemEndpointsPerCore / memEndpointsPerDevice - memory nodes are reduced in degree before leaving each system and core. These parameters can be tweaked to impact the shape of the interconnect.
- interCoreMemReductionLatency / interCoreMemBusWidthBytes - these parameters impact inter-core communication latencies.
- Intra-device crossbar latencies - these parameters determine latencies of interconnects on die boundaries.
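Since both the transaction length and the concurrency scale buffering linearly, a back-of-the-envelope estimate is their product with the bus width. The sketch below is only that rough estimate, assuming one full transaction of data is buffered per in-flight transaction. It is not how Beethoven actually sizes its buffers, and approxBufferBytes is a hypothetical helper.

```scala
object BufferingSketch {
  // Rough data buffering for one reader/writer, in bytes:
  // (beats per transaction) x (bytes per beat) x (transactions in flight).
  def approxBufferBytes(beatsPerTx: Int, beatBytes: Int, txConcurrency: Int): Long = {
    require(beatsPerTx > 0 && beatBytes > 0 && txConcurrency > 0, "all factors must be positive")
    beatsPerTx.toLong * beatBytes * txConcurrency
  }
}
```

Under this estimate, with a 64-byte memory bus and a concurrency of 4, raising prefetchSourceMultiplicity from the default 16 to the suggested 64 grows per-port buffering from 4096 to 16384 bytes.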