An FPGA Reading List¶

Version History

Date	Description
Aug 26, 2020	Add two ISCA‘20 papers to the Host Virtual Memory section
Nov 30, 2019	Add a lot of security papers
Oct 22, 2019	Shuffle scheduling section. More focused. Add two more recent fpga-virt papers
Oct 5, 2019	More on scheduling. Add NoC. Add Security.
Oct 4, 2019	Add more papers extracted from AmorphOS
Oct 3, 2019	Initial version from Github

A list of related papers I came across while doing FPGA-related research. If you’d like to contribute, please comment below or create a PR here.

Virtualization
Languages, Runtime, and Framework
Applications
FPGA Internal

Virtualization¶

Scheduling¶

Scheduling is a big topic for FPGAs. Unlike traditional CPU scheduling, there are more aspects to consider, e.g., 1) Partial reconfiguration (PR), 2) Dynamic self PR, 3) Preemptive scheduling, 4) Relocation, 5) Floorplanning, and so on.

Preemptive Scheduling¶

Preemptive multitasking on FPGAs, 2000
Multitasking on FPGA Coprocessors, 2000
Context saving and restoring for multitasking in reconfigurable systems, 2005
ReconOS Cooperative multithreading in dynamically reconfigurable systems, FPL‘09
Block, drop or roll(back): Alternative preemption methods for RH multi-tasking, FCCM‘09
Hardware Context-Switch Methodology for Dynamically Partially Reconfigurable Systems, 2010
On-chip Context Save and Restore of Hardware Tasks on Partially Reconfigurable FPGAs, 2013
HTR: on-chip Hardware Task Relocation For Partially Reconfigurable FPGAs, 2013
Preemptive Hardware Multitasking in ReconOS, 2015

Preemptive Reconfiguration¶

Preemption of the Partial Reconfiguration Process to Enable Real-Time Computing, 2018

Bitstreams¶

Github 7-series bitmap reverse engineering
PARBIT: A Tool to Transform Bitfiles to Implement Partial Reconfiguration of Field Programmable Gate Arrays (FPGAs), 2001
BIL: A Tool-Chain for Bitstream Reverse-Engineering, 2012
BITMAN: A Tool and API for FPGA Bitstream Manipulations, 2017

Relocation¶

Context saving and restoring for multitasking in reconfigurable systems, 2005
REPLICA2Pro: Task Relocation by Bitstream Manipulation in Virtex-II/Pro FPGAs, 2006
Relocation and Automatic Floor-planning of FPGA Partial Configuration Bit-Streams, MSR 2008
Internal and External Bitstream Relocation for Partial Dynamic Reconfiguration, 2009
PRR-PRR Dynamic Relocation, 2009
HTR: on-chip Hardware Task Relocation For Partially Reconfigurable FPGAs, 2013
AutoReloc, 2016

Others¶

hthreads: A hardware/software co-designed multithreaded RTOS kernel, 2005
hthreads: Enabling a Uniform Programming Model Across the Software/Hardware Boundary, FCCM‘16
Tartan: Evaluating Spatial Computation for Whole Program Execution, ASPLOS‘06
A virtual hardware operating system for the Xilinx XC6200, 1996
The Swappable Logic Unit: a Paradigm for Virtual Hardware, FCCM‘97
Run-time management of dynamically reconfigurable designs, 1998
- All of the above are early work on FPGA scheduling.
- Worth a read, but don’t take their assumptions at face value. Some no longer hold after so many years.
S1. Reconfigurable Hardware Operating Systems: From Design Concepts to Realizations, 2003
S2. Operating Systems for Reconfigurable Embedded Platforms: Online Scheduling of Real-Time Tasks, 2004
- Very fruitful discussion. The paper schedules bitstreams inside the FPGA, following a real-time scheduling policy (deadline).
- Unlike CPU scheduling, FPGA scheduling needs to consider “areas”. The chip is a rectangular box, so allocating areas needs great care to avoid fragmentation!
Context saving and restoring for multitasking in reconfigurable systems, FPL‘05
- Optimizes deschedule performance.
- This paper discusses ways to save and restore the state of a hardware task. There are generally three approaches: a) adding indirection, so the app uses a system API to read/write its state; b) a yield-type API; c) using the PR controller to read back the bitstream.
- This paper uses ICAP to read the bitstream back and extract the necessary state information that must be present when the bitstream resumes.
Scheduling intervals for reconfigurable computing, FCCM‘08
Hardware context-switch methodology for dynamically partially reconfigurable systems, 2010
Online Scheduling for Multi-core Shared Reconfigurable Fabric, DATE‘12
Multi-shape Tasks Scheduling for Online Multitasking on FPGAs, 2014
AmorphOS, OSDI‘18
Hardware context switching on FPGAs, 2014
Efficient Hardware Context-Switch for Task Migration between Heterogeneous FPGAs, 2016
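
The note above on area allocation and fragmentation (under the 2004 online-scheduling paper) can be made concrete with a toy model. The following is only a minimal sketch, not the algorithm from any paper listed here: it assumes the fabric is a uniform grid of tiles and uses a plain first-fit scan, and the names `Fabric`, `place`, `remove`, and `free_tiles` are hypothetical.

```python
from typing import List, Optional, Tuple


class Fabric:
    """Toy model (assumption): the FPGA is a rows x cols grid of identical tiles."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        # None means the tile is free; otherwise it holds the owning task id.
        self.grid: List[List[Optional[str]]] = [[None] * cols for _ in range(rows)]

    def _fits(self, r: int, c: int, h: int, w: int) -> bool:
        """True if the h x w rectangle with top-left corner (r, c) is entirely free."""
        return all(
            self.grid[i][j] is None
            for i in range(r, r + h)
            for j in range(c, c + w)
        )

    def place(self, task: str, h: int, w: int) -> Optional[Tuple[int, int]]:
        """First-fit: scan row-major and claim the first free h x w rectangle."""
        for r in range(self.rows - h + 1):
            for c in range(self.cols - w + 1):
                if self._fits(r, c, h, w):
                    for i in range(r, r + h):
                        for j in range(c, c + w):
                            self.grid[i][j] = task
                    return (r, c)
        return None  # no contiguous rectangle of that shape is free

    def remove(self, task: str) -> None:
        """Release every tile owned by the given task."""
        for row in self.grid:
            for j, owner in enumerate(row):
                if owner == task:
                    row[j] = None

    def free_tiles(self) -> int:
        return sum(owner is None for row in self.grid for owner in row)


fabric = Fabric(rows=4, cols=4)
for name in ("A", "B", "C", "D"):
    fabric.place(name, 4, 1)      # each task takes one full column
fabric.remove("A")
fabric.remove("C")

# Half of the fabric is free, but only as two separate 4x1 columns,
# so a 4x2 task cannot be placed unless B or D is relocated.
blocked = fabric.place("E", 4, 2)
```

In this sketch `blocked` ends up as `None` even though half of the tiles are free, which is the fragmentation problem that motivates the relocation and floorplanning work listed earlier.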

NoC¶

Network-on-Chip on FPGA.

Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs, 2002
- I like the idea of separating computation from communication.
- Also a lot of discussion about possible NoC designs within an FPGA.
LEAP Soft connections: Addressing the hardware-design modularity problem, DAC‘09
- Virtual channel concept. Latency-insensitive.
Leveraging Latency-Insensitivity to Ease Multiple FPGA Design, FPGA‘12
CONNECT: Re-Examining Conventional Wisdom for Designing NoCs in the Context of FPGAs, FPGA‘12
Your Programmable NIC Should be a Programmable Switch, HotNets‘18

Memory Hierarchy¶

Papers dealing with BRAM, registers, on-board DRAM, and host DRAM.

LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic, FPGA‘11
- Main design hierarchy: use BRAM as the L1 cache, on-board DRAM as the L2 cache, and host memory as the backing store. Everything is abstracted away through their interface (similar to load/store). Programming is pretty much the same as writing for a CPU.
- According to Sec. 2.2.2, its scratchpad controller uses a simple segment-based mapping scheme, like the one in AmorphOS.
LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories, FCCM‘14
- Follow-up work on LEAP Scratchpads that extends it with cache coherence between multiple FPGAs.
- Coherent Scratchpads with the MOSI protocol.
MATCHUP: Memory Abstractions for Heap Manipulating Programs, FPGA‘15
CoRAM: An In-Fabric Memory Architecture for FPGA-Based Computing
- CoRAM provides an interface for managing the on- and off-chip memory resources of an FPGA. It uses “control threads” to enforce low-level control over data movement.
- Seriously, CoRAM is just like a processor’s L1-L3 caches.
CoRAM Prototype and evaluation of the CoRAM memory architecture for FPGA-based computing, FPGA‘12
- Prototype on FPGA.
Sharing, Protection, and Compatibility for Reconfigurable Fabric with AMORPHOS, OSDI‘18
- Hull: provides memory protection for on-board DRAM using segment-based address translation.
Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access‘17
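
Several notes above mention segment-based mapping for on-board DRAM (the LEAP scratchpad controller and the AmorphOS hull). As a rough illustration only, and not the actual implementation of either system, segment-based translation boils down to a base-and-bound check. The names `Segment`, `SegmentTable`, `assign`, and `translate`, as well as the addresses used, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Segment:
    base: int   # start of the segment in on-board DRAM (physical address)
    limit: int  # segment size in bytes


class SegmentTable:
    """One segment per application: translation is a bounds check plus an add."""

    def __init__(self) -> None:
        self._segments: Dict[str, Segment] = {}

    def assign(self, app: str, base: int, limit: int) -> None:
        self._segments[app] = Segment(base, limit)

    def translate(self, app: str, offset: int) -> int:
        seg = self._segments[app]
        if not 0 <= offset < seg.limit:
            # Protection: an app can never reach outside its own segment.
            raise PermissionError(f"{app}: offset {offset:#x} is outside its segment")
        return seg.base + offset


table = SegmentTable()
table.assign("app0", base=0x0000_0000, limit=0x1000_0000)  # first 256 MiB
table.assign("app1", base=0x1000_0000, limit=0x1000_0000)  # next 256 MiB

phys = table.translate("app1", 0x40)  # 0x1000_0040
```

The appeal is that the hardware cost is a comparator and an adder per access. The downside is that each application needs one contiguous region, so the same external fragmentation concerns as in area allocation apply to DRAM.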

Dynamic Memory Allocation¶

malloc() and free() for FPGA on-board DRAM.

Integrate with Host Virtual Memory¶

Papers dealing with the OS virtual memory system (VMS). Note that all of these papers introduce some form of MMU into the FPGA so that it can work with the host VMS. This added MMU is similar to a CPU’s MMU and an RDMA NIC’s internal cache. Note also that the VMS itself still runs inside Linux (including pgfault handling, swapping, TLB shootdown, and so on), except in one recent ISCA‘20 paper.

Virtual Memory Window for Application-Specific Reconfigurable Coprocessors, DAC‘04
- Early work that adds a new MMU to the FPGA to let FPGA logic access on-chip DRAM. Note that this is not the system main memory, so the translation pgtable is different.
- Has some insights on prefetching and MMU CAM design.
Seamless Hardware Software Integration in Reconfigurable Computing Systems, 2005
- Follow-up summary of the previous DAC‘04 Virtual Memory Window paper.
A Reconfigurable Hardware Interface for a Modern Computing System, FCCM‘07
- This work adds a new MMU, which includes a 16-entry TLB, to the FPGA. The FPGA and CPU share the same user virtual address space and use the same physical memory. They share memory at cacheline granularity, so the FPGA is just another core in this sense. Upon a TLB miss at the FPGA MMU, the FPGA sends an interrupt to the CPU to let software handle the miss. Handling TLB misses in software is not efficient, but they did make cache coherence between the FPGA and CPU easy.
Low-Latency High-Bandwidth HW/SW Communication in a Virtual Memory Environment, FPL‘08
- This work actually adds a new MMU to the FPGA, which works just like a CPU MMU. It’s similar to an IOMMU, in some sense.
- But I think they missed one important aspect: cache coherence between the CPU and FPGA. There is not much information about this in the paper; it seems they do not have a cache on the FPGA. Anyhow, this is why CCIX and OpenCAPI were proposed recently.
Memory Virtualization for Multithreaded Reconfigurable Hardware, FPL‘11
- Part of the ReconOS project
- They implemented a simple MMU inside the FPGA that includes a TLB. On a protection violation or an access to an invalid page, their MMU just hands over to the CPU pgfault routines. How is this different from the FPL‘08 one? Actually, IMO, they are the same.
S4 Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access‘17
- This paper also implemented a hardware MMU, but the virtual memory system still runs on Linux.
- Also listed in Cloud Infrastructure part.
Lightweight Virtual Memory Support for Many-Core Accelerators in Heterogeneous Embedded SoCs, 2015
Lightweight Virtual Memory Support for Zero-Copy Sharing of Pointer-Rich Data Structures in Heterogeneous Embedded SoCs, IEEE‘17
- Part of the PULP project.
- Essentially a software-managed IOMMU. The control path runs as a Linux kernel module. The datapath is a lightweight AXI transaction translation.
Flick: Fast and Lightweight ISA-Crossing Call for Heterogeneous-ISA Environments, ISCA‘20
- This paper adds an MMU/TLB to the FPGA-side RISC-V to fetch and translate host pgtable entries. Its goal is to migrate threads between different ISAs, and the key is VM. But what’s new?
A Case for Hardware-Based Demand Paging, ISCA‘20
- This paper is not FPGA-based, but it does augment the host MMU with pgfault handling capability.
- This paper targets file-backed pgfaults, more specifically, files backed by ultra-low-latency SSDs. It adds several HW units so that the CPU MMU can handle and resolve such pgfaults (essentially offloading VFS->FS->BLK->NVMe driver functionality into HW; some parts are done via mmap() beforehand).
- Its async free page list and async LRU handling are used by our work as well.
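
The FCCM‘07 entry above describes an FPGA-side MMU with a 16-entry TLB whose misses are handled by software on the CPU. The sketch below only models that control flow and is not taken from the paper: the 4 KiB page size, the LRU replacement policy, and the names `FpgaTlb`, `cpu_miss_handler`, and `page_table` are all assumptions made for illustration, and the Python callback stands in for the interrupt to the host.

```python
from collections import OrderedDict
from typing import Callable, Dict

PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class FpgaTlb:
    """Small fully-associative TLB; misses are delegated to a software handler."""

    def __init__(self, miss_handler: Callable[[int], int], entries: int = 16) -> None:
        self._entries = entries
        self._map = OrderedDict()     # vpn -> pfn, ordered from LRU to MRU
        self._miss_handler = miss_handler
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        if vpn in self._map:
            self._map.move_to_end(vpn)            # hit: mark as most recently used
        else:
            self.misses += 1
            pfn = self._miss_handler(vpn)         # stands in for the interrupt to the CPU
            if len(self._map) >= self._entries:
                self._map.popitem(last=False)     # evict LRU entry (policy is an assumption)
            self._map[vpn] = pfn
        return (self._map[vpn] << PAGE_SHIFT) | offset


# Hypothetical host page table: virtual page n maps to physical frame n + 0x100.
page_table: Dict[int, int] = {vpn: vpn + 0x100 for vpn in range(32)}


def cpu_miss_handler(vpn: int) -> int:
    """Host software walks its page table; a KeyError here would be a page fault."""
    return page_table[vpn]


tlb = FpgaTlb(cpu_miss_handler)
first = tlb.translate(0x3ABC)    # miss: the handler is invoked, the entry is cached
second = tlb.translate(0x3010)   # hit: same page, no handler call
```

Every miss in this model costs a round trip to host software, which is why the hardware page-table walkers in the later papers in this section are attractive.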

Integrate with Host OSs¶

A Virtual Hardware Operating System for the Xilinx XC6200, FPL‘96
Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks, IEEE‘04
hthreads: a hardware/software co-designed multithreaded RTOS kernel, 2005
Reconfigurable computing: architectures and design methods, IEE‘05
BORPH: An Operating System for FPGA-Based Reconfigurable Computers. PhD Thesis.
FUSE: Front-end user framework for O/S abstraction of hardware accelerators, FCCM‘11
ReconOS – an Operating System Approach for Reconfigurable Computing, IEEE Micro‘14
- Invoke kernel from FPGA. They built a shell in FPGA and delegation threads in CPU to achieve this.
- They implemented their own MMU (using pre-established pgtables) to let FPGA logic access system memory. Ref.
- Read the “Operating Systems for Reconfigurable Computing” sidebar, nice summary.
LEAP Soft connections: Addressing the hardware-design modularity problem, DAC‘09
- Channel concept. Good.
LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic, FPGA‘11
- BRAM/on-board DRAM/host DRAM layering. Caching.
LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories
- Add cache-coherence on top of previous work.
- Also check out my note on Cache Coherence.
LEAP FPGA Operating System, FPL‘14.
A Survey on FPGA Virtualization, FPL‘18
ZUCL 2.0: Virtualised Memory and Communication for ZYNQ UltraScale+ FPGAs, FSP‘19

Security¶

If I were to recommend a starting point, I’d suggest:

Recent Attacks and Defenses on FPGA-based Systems, 2019
Physical Side-Channel Attacks and Covert Communication on FPGAs: A Survey, 2019
FPGA security: Motivations, features, and applications, 2014

The whole list:

FPGAhammer: Remote Voltage Fault Attacks on Shared FPGAs, suitable for DFA on AES
FPGA-Based Remote Power Side-Channel Attacks
Characterization of long wire data leakage in deep submicron FPGAs
Protecting against cryptographic Trojans in FPGAs
FPGA Side Channel Attacks without Physical Access
FPGA security: Motivations, features, and applications
FPGA side-channel receivers
Security of FPGAs in data centers
Secure Function Evaluation Using an FPGA Overlay Architecture
Mitigating Electrical-level Attacks towards Secure Multi-Tenant FPGAs in the Cloud
The Costs of Confidentiality in Virtualized FPGAs
Temporal Thermal Covert Channels in Cloud FPGAs
Characterizing Power Distribution Attacks in Multi-User FPGA Environments
FASE: FPGA Acceleration of Secure Function Evaluation
Securing Cryptographic Circuits by Exploiting Implementation Diversity and Partial Reconfiguration on FPGAs
Measuring Long Wire Leakage with Ring Oscillators in Cloud FPGAs
Physical Side-Channel Attacks and Covert Communication on FPGAs: A Survey
Leaky Wires: Information Leakage and Covert Communication Between FPGA Long Wires
Using the Power Side Channel of FPGAs for Communication
An Inside Job: Remote Power Analysis Attacks on FPGAs
Leakier Wires: Exploiting FPGA Long Wires for Covert- and Side-channel Attacks
Voltage drop-based fault attacks on FPGAs using valid bitstreams
Moats and Drawbridges: An Isolation Primitive for Reconfigurable Hardware Based Systems
Sensing nanosecond-scale voltage attacks and natural transients in FPGAs
Holistic Power Side-Channel Leakage Assessment:
Hiding Intermittent Information Leakage with Architectural Support for Blinking
Examining the consequences of high-level synthesis optimizations on power side-channel
Register transfer level information flow tracking for provably secure hardware design
A Protection and Pay-per-use Licensing Scheme for On-cloud FPGA Circuit IPs
Recent Attacks and Defenses on FPGA-based Systems
PFC: Privacy Preserving FPGA Cloud - A Case Study of MapReduce
A Pay-per-Use Licensing Scheme for Hardware IP Cores in Recent SRAM-Based FPGAs
FPGAs for trusted cloud computing

Summary¶

Summary of the current state of FPGA virtualization. Prior art mainly focuses on: 1) How to virtualize on-chip BRAM (e.g., CoRAM, LEAP Scratchpads). 2) How to work with the host, specifically, how to use the host DRAM and host virtual memory. 3) How to schedule bitstreams inside an FPGA chip. 4) How to provide certain services that make FPGA programming easier (mostly working with the host OS).

Languages, Runtime, and Framework¶

Innovations in the toolchain space.

Xilinx HLS¶

Design Patterns for Code Reuse in HLS Packet Processing Pipelines, FCCM‘19
- A very good HLS library from Mellanox folks.
Templatised Soft Floating-Point for High-Level Synthesis, FCCM‘19
ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA, FCCM‘18
HLScope+: Fast and Accurate Performance Estimation for FPGA HLS, ICCAD‘17
Separation Logic-Assisted Code Transformations for Efficient High-Level Synthesis, FCCM‘14
- An HLS design aid that analyzes the original program at compile time and performs automated code transformations. The tool analyzes pointer-manipulating programs and automatically splits heap-allocated data structures into disjoint, independent regions.
- The tool is for C++ heap operations.
- To put it another way: the tool looks at your BRAM usage, finds any false dependencies, and splits the data into multiple independent regions, which improves your II (initiation interval).
MATCHUP: Memory Abstractions for Heap Manipulating Programs, FPGA‘15
- This is an HLS toolchain aid.
- Follow-up work to the above FCCM‘14 one. This time they use LEAP Scratchpads as the underlying caching block.

Xilinx CAD¶

Maverick: A Stand-alone CAD Flow for Partially Reconfigurable FPGA Modules, FCCM‘19

High-Level Languages and Platforms¶

Just-in-Time Compilation for Verilog, ASPLOS‘19
Chisel: Constructing Hardware in a Scala Embedded Language, DAC‘12
- Chisel is being actively improved and used by UCB folks.
Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs, FPGA‘18
From JVM to FPGA: Bridging Abstraction Hierarchy via Optimized Deep Pipelining, HotCloud‘18
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, FPGA‘19
LINQits: Big Data on Little Clients, ISCA‘13
- From Microsoft. Used to express SQL-like functions (thus big data), and runs on ZYNQ (thus little clients).
- You write C#, LINQits translates it to Verilog, and the whole thing runs on a ZYNQ (ARM+FPGA) board.
Lime: a Java-Compatible and Synthesizable Language for Heterogeneous Architectures, OOPSLA‘10
- Lime is a Java-based programming model and runtime from IBM which aims to provide a single unified language to program heterogeneous architectures, from FPGAs to conventional CPUs
A line of work from Stanford
- Generating configurable hardware from parallel patterns, ASPLOS‘16
- Plasticine: A Reconfigurable Architecture For Parallel Patterns, ISCA‘17
- Spatial: A Language and Compiler for Application Accelerators, PLDI‘18
  - Spatial generates Chisel code along with C++ code which can be used on a host CPU to control the execution of the accelerator on the target FPGA.
  - Academic papers like this have a lot of good ideas. But the truth is that the result may not be reliable, because it comes from academic labs.

Integrate with Frameworks¶

Map-reduce as a Programming Model for Custom Computing Machines, FCCM‘08
- This paper proposes a model to translate MapReduce code written in C to code that could run on FPGA and GPU. Many details are omitted, and they don’t really have the compiler.
- Single-host framework, everything is in FPGA and GPU.
Axel: A Heterogeneous Cluster with FPGAs and GPUs, FPGA‘10
- A distributed MapReduce Framework, targets clusters with CPU, GPU, and FPGA. Mainly the idea of scheduling FPGA/GPU jobs.
- Distributed Framework.
FPMR: MapReduce Framework on FPGA, FPGA‘10
- A MapReduce framework on a single host’s FPGA. You need to write Verilog/HLS for the processing logic to hook into their framework. The framework mainly includes a data transfer controller and a simple scheduler that enables certain blocks at certain times.
- Single-host framework, everything is in FPGA.
Melia: A MapReduce Framework on OpenCL-Based FPGAs, IEEE‘16
- Another framework, written in OpenCL, and users can use OpenCL to program as well. Similar to previous work, it’s more about the framework design, not specific algorithms on FPGA.
- Single-host framework, everything is in FPGA. But they have a discussion on running on multiple FPGAs.
- Four MapReduce FPGA papers here; I believe there are more. The marriage between MapReduce and FPGA is not hard to understand: an FPGA can be viewed as another core with different capabilities. The question is, given the FPGA’s reprogramming time and limited on-board memory, how to design a good scheduling algorithm and good data moving/caching mechanisms. Those papers give some hints on this.
UCLA: When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration, HotCloud‘16
UCLA: Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale, SoCC‘16
- A system that hooks FPGA with Spark.
- There is a line of work that hooks FPGAs into big data processing frameworks (Spark), so that the FPGA implementation and the scale-out software can be separated. Spark can schedule FPGA jobs to different machines and take care of scale-out, failure handling, etc. But I personally think this line of work is really just an extension of the ReconOS/FUSE/BORPH line of work. The main reason is that both lines of work try to integrate jobs running on the CPU with jobs running on the FPGA, so that the CPU and FPGA have an easier way to talk or, put another way, a better division of labor. Whether it’s single-machine (like ReconOS, Melia) or distributed (like Blaze, Axel), they are essentially the same.
UCLA: Heterogeneous Datacenters: Options and Opportunities, DAC‘16
- Follow up work of Blaze. Nice comparison of big and wimpy cores.

Cloud Infrastructure¶

Huawei: FPGA as a Service in the Cloud
UCLA: Customizable Computing: From Single Chip to Datacenters, IEEE‘18
UCLA: Accelerator-Rich Architectures: Opportunities and Progresses, DAC‘14
- Reminds me of OmniX. Disaggregation at a different scale.
- This paper actually targets single-machine case. But it can reflect a distributed setting.
Enabling FPGAs in the Cloud, CF‘14
- The paper raises four important aspects of enabling FPGAs in the cloud: Abstraction, Sharing, Compatibility, and Security. The FPGA itself requires a shell (the paper calls it service logic) and is partitioned into multiple slots. The things discussed in the paper are straightforward, but worth reading. They did not solve the FPGA sharing issue, which is solved by AmorphOS.
FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack, FCCM‘14
- Uses OpenStack to manage FPGA resources. The FPGA is partitioned into multiple regions, and each region can use PR. The FPGA shell includes: 1) a basic MAC and packet dispatcher, 2) a memory controller with a segment-based partition scheme, 3) a soft processor used for runtime PR control. One very important aspect of this project is that they envision input to the FPGA coming from Ethernet, which is very true nowadays. This also makes their project quite similar to Catapult. It’s a very solid paper, though the evaluation is a little weak. What could be added: migration and different-sized regions.
- The above CF and FCCM papers are similar in that both build a SW framework and a HW shell to provide a unified cloud management system. They differ in their shell design: the CF one takes inputs from a DMA engine (that is, from local system DRAM), while the FCCM one takes inputs from Ethernet. The things after the DMA or MAC are essentially similar.
- It seems all of them use a simple segment-based memory partition for user FPGA logic. What are the pros and cons of using paging here?
S1 DyRACT: A partial reconfiguration enabled accelerator and test platform, FPL‘14
S2 Virtualized FPGA Accelerators for Efficient Cloud Computing, CloudCom‘15
S3 Designing a Virtual Runtime for FPGA Accelerators in the Cloud, FPL‘16
S4 Virtualized Execution Runtime for FPGA Accelerators in the Cloud, IEEE Access‘17
- The above four papers came from the same group of folks. S1 developed a framework that uses PCIe to do PR. S2 is a follow-up to S1; read S2’s chapter IV on hardware architecture for many implementation details, such as the internal FPGA switch and the AXI stream interface, but there is no discussion of memory virtualization. S3 is a two-page short paper. S4 is the realization of S3. I was particularly interested in whether S4 implemented its own virtual memory management. The answer is NO. S4 leverages on-chip Linux and just builds a customized MMU that uses BRAM to store page tables. This approach is similar to the papers listed in Integrate with Host Virtual Memory. Many things discussed in S4 had been proposed multiple times in previous cloud FPGA papers since 2014.
MS: A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA‘14
MS: A Cloud-Scale Acceleration Architecture, Micro‘16
- Catapult is unique in its shell, which includes the Lightweight Transport Layer (LTL) and the Elastic Router (ER). The cloud management part, which the paper mentions only briefly, should actually include everything the above CF‘14 and FCCM‘14 papers have. The LTL has congestion control, packet loss detection/resend, and ACK/NACK. The ER is a crossbar switch used by FPGA-internal modules, which is essential for connecting the shell and roles.
- These two Catapult papers are simply a must read.
MS: A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA‘18
MS: Azure Accelerated Networking: SmartNICs in the Public Cloud, NSDI‘18
MS: Direct Universal Access: Making Data Center Resources Available to FPGA, NSDI‘19
- Catapult is just sweet, isn’t it?
ASIC Clouds: Specializing the Datacenter, ISCA‘16
Virtualizing FPGAs in the Cloud, ASPLOS‘20, to appear.

Misc¶

A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems, FPGA‘16

Applications¶

Programmable Network¶

MS: ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware, SIGCOMM‘16
MS: Multi-Path Transport for RDMA in Datacenters, NSDI‘18
MS: Azure Accelerated Networking: SmartNICs in the Public Cloud, NSDI‘18
Mellanox. NICA: An Infrastructure for Inline Acceleration of Network Applications, ATC‘19
The Case For In-Network Computing On Demand, EuroSys‘19
Fast, Scalable, and Programmable Packet Scheduler in Hardware, SIGCOMM‘19
HPCC: high precision congestion control, SIGCOMM‘19
Offloading Distributed Applications onto SmartNICs using iPipe, SIGCOMM‘19
- Not necessarily FPGA-based, but SmartNICs. The actor programming model seems like a good fit. There is another paper from ATC‘19 that optimizes a distributed actor runtime.

Database and SQL¶

On-the-fly Composition of FPGA-Based SQL Query Accelerators Using A Partially Reconfigurable Module Library, 2012
Accelerating database systems using FPGAs: A survey, FPL‘18

Storage¶

Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval, ATC‘19
INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive, ATC‘19
LightStore: Software-defined Network-attached Key-value Drives, ASPLOS‘19
FIDR: A Scalable Storage System for Fine-Grain Inline Data Reduction with Efficient Memory Handling, MICRO‘19
CIDR: A Cost-Effective In-line Data Reduction System for Terabit-per-Second Scale SSD Array, HPCA‘19

Machine Learning¶

TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning, HPCA‘16
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA‘15
From High-Level Deep Neural Models to FPGAs, ISCA‘16
Deep Learning on FPGAs: Past, Present, and Future, arXiv‘16
Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, FPT‘16
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, FPGA‘17
In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA‘17
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, FPGA‘17
A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA‘18
- Microsoft Project Brainware. Built on Catapult.
A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, MICRO‘18
DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs, ICCAD‘18
FA3C: FPGA-Accelerated Deep Reinforcement Learning, ASPLOS‘19
Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval, ATC‘19

Graph¶

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA‘15
Energy Efficient Architecture for Graph Analytics Accelerators, ISCA‘16
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search, FPGA‘17
FPGA-Accelerated Transactional Execution of Graph Workloads, FPGA‘17
An FPGA Framework for Edge-Centric Graph Processing, CF‘18

Key-Value Store¶

Achieving 10Gbps line-rate key-value stores with FPGAs, HotCloud‘13
Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA‘13
An FPGA Memcached Appliance, FPGA‘13
Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory, HotStorage‘15
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC, SOSP‘17
- This link is also useful for better understanding: Morning Paper
Ultra-Low-Latency and Flexible In-Memory Key-Value Store System Design on CPU-FPGA, FPT‘18

Bio¶

When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration, HotCloud‘16
FPGA Accelerated INDEL Realignment in the Cloud, HPCA‘19

Consensus¶

Consensus in a Box: Inexpensive Coordination in Hardware, NSDI‘16

Video Processing¶

Quantifying the Benefits of Dynamic Partial Reconfiguration for Embedded Vision Applications, FPL‘19
Time-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration, FPL‘18

FPGA Internal¶

FPGA20: Highlighting Significant Contributions from 20 Years of the International Symposium on Field-Programmable Gate Arrays (1992–2011)

General¶

FPGA and CPLD architectures: a tutorial, 1996
Reconfigurable computing: a survey of systems and software, 2002
Reconfigurable computing: architectures and design methods
FPGA Architecture: Survey and Challenges, 2007
- Read the first two paragraphs of each section and then come back to read all of that if needed.
RAMP: Research Accelerator For Multiple Processors, 2007
Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology, IEEE‘15

Partial Reconfiguration¶

FPGA Dynamic and Partial Reconfiguration: A Survey of Architectures, Methods, and Applications, CSUR‘18
- Must read.
DyRACT: A partial reconfiguration enabled accelerator and test platform, FPL‘14
A high speed open source controller for FPGA partial reconfiguration
Hardware context-switch methodology for dynamically partially reconfigurable systems, 2010