An FPGA Reading List


Version History
Oct 22, 2019: Shuffle scheduling section. More focused. Add two more recent FPGA virtualization papers.
Oct 5, 2019: More on scheduling. Add NoC. Add Security.
Oct 4, 2019: Add more papers extracted from AmorphOS.
Oct 3, 2019: Initial version from GitHub.

This is a list of academic papers covering a wide range of FPGA-related topics, mostly from a systems researcher’s point of view.



Scheduling is a big topic for FPGAs. Unlike traditional CPU scheduling, there are more aspects to consider, e.g., 1) partial reconfiguration (PR), 2) dynamic self PR, 3) preemptive scheduling, 4) relocation, 5) floorplanning, and so on.

Preemptive Scheduling

  • Preemptive multitasking on FPGAs, 2000
  • Multitasking on FPGA Coprocessors, 2000
  • Context saving and restoring for multitasking in reconfigurable systems, 2005
  • ReconOS: Cooperative multithreading in dynamically reconfigurable systems, FPL‘09
  • Block, drop or roll(back): Alternative preemption methods for RH multi-tasking, FCCM‘09
  • Hardware Context-Switch Methodology for Dynamically Partially Reconfigurable Systems, 2010
  • On-chip Context Save and Restore of Hardware Tasks on Partially Reconfigurable FPGAs, 2013
  • HTR: on-chip Hardware Task Relocation For Partially Reconfigurable FPGAs, 2013
  • Preemptive Hardware Multitasking in ReconOS, 2015

Preemptive Reconfiguration

  • Preemption of the Partial Reconfiguration Process to Enable Real-Time Computing, 2018


  • GitHub 7-series bitmap reverse engineering
  • PARBIT: A Tool to Transform Bitfiles to Implement Partial Reconfiguration of Field Programmable Gate Arrays (FPGAs), 2001
  • BITMAN: A Tool and API for FPGA Bitstream Manipulations, 2017


  • Context saving and restoring for multitasking in reconfigurable systems, 2005
  • REPLICA2Pro: Task Relocation by Bitstream Manipulation in Virtex-II/Pro FPGAs, 2006
  • Relocation and Automatic Floor-planning of FPGA Partial Configuration Bit-Streams, MSR 2008
  • Internal and External Bitstream Relocation for Partial Dynamic Reconfiguration, 2009
  • PRR-PRR Dynamic Relocation, 2009
  • HTR: on-chip Hardware Task Relocation For Partially Reconfigurable FPGAs, 2013
  • AutoReloc, 2016



Network-on-Chip on FPGA.

Memory Hierarchy

Papers that deal with BRAM, registers, on-board DRAM, and host DRAM.

Dynamic Memory Allocation

malloc() and free() for FPGA on-board DRAM.
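
To make the idea concrete, here is a minimal sketch of what a malloc()/free() pair over a flat on-board DRAM range could look like. It is a plain first-fit free-list allocator and is not taken from any paper in this list; the class name `DramAllocator`, the 64-byte alignment, and the assumption that the base address is already aligned are all illustrative choices.

```python
class DramAllocator:
    """First-fit allocator over a flat address range (a stand-in for on-board DRAM)."""

    def __init__(self, base, size, align=64):
        self.align = align                 # hypothetical allocation granule in bytes
        self.free_list = [(base, size)]    # sorted list of (start, length) holes
        self.allocated = {}                # start address -> allocated length

    def malloc(self, nbytes):
        if nbytes <= 0:
            raise ValueError("allocation size must be positive")
        # Round the request up to a multiple of the alignment granule.
        need = -(-nbytes // self.align) * self.align
        for i, (start, length) in enumerate(self.free_list):
            if length >= need:
                if length == need:
                    del self.free_list[i]  # hole is consumed entirely
                else:
                    self.free_list[i] = (start + need, length - need)
                self.allocated[start] = need
                return start
        return None                        # no hole is large enough

    def free(self, addr):
        # Raises KeyError on a double free or an address that was never returned.
        length = self.allocated.pop(addr)
        self.free_list.append((addr, length))
        self.free_list.sort()
        # Coalesce holes that are adjacent in the address space.
        merged = [self.free_list[0]]
        for start, ln in self.free_list[1:]:
            last_start, last_len = merged[-1]
            if last_start + last_len == start:
                merged[-1] = (last_start, last_len + ln)
            else:
                merged.append((start, ln))
        self.free_list = merged
```

A real design would also have to decide where this logic lives (host driver, soft core, or hardware state machine), which this sketch does not address.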

Integrate with Host Virtual Memory

Papers that deal with the OS virtual memory system (VMS). Note that all of these papers introduce some form of MMU into the FPGA so that the FPGA can work with the host VMS. This added MMU is similar to a CPU’s MMU and an RDMA NIC’s internal cache. Note that the VMS itself still runs inside Linux (including page-fault handling, swapping, TLB shootdown, and so on). What would really stand out is implementing the VMS inside the FPGA.
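
The division of labor described here can be sketched as a toy model: a small translation cache on the FPGA side, with every miss and every fault forwarded to the host. This is only an illustration under assumed names (`HostPageTable`, `FpgaMmu`), an assumed 4 KiB page size, and an assumed LRU replacement policy; none of these come from a specific paper.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class HostPageTable:
    """Stand-in for the host OS, which owns the mappings and handles faults."""

    def __init__(self):
        self.mappings = {}   # virtual page number -> physical frame number
        self.next_frame = 0
        self.faults = 0

    def walk(self, vpn):
        if vpn not in self.mappings:   # "page fault": the host allocates a frame
            self.faults += 1
            self.mappings[vpn] = self.next_frame
            self.next_frame += 1
        return self.mappings[vpn]


class FpgaMmu:
    """Small LRU TLB on the FPGA side, in front of the host page table."""

    def __init__(self, host, entries=4):
        self.host = host
        self.entries = entries
        self.tlb = OrderedDict()       # vpn -> frame, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            self.tlb[vpn] = self.host.walk(vpn)   # the host does the real work
            if len(self.tlb) > self.entries:
                self.tlb.popitem(last=False)      # evict the LRU entry
        return self.tlb[vpn] * PAGE_SIZE + offset

    def shootdown(self, vpn):
        """Invalidate one entry when the host changes or removes a mapping."""
        self.tlb.pop(vpn, None)
```

In this model the FPGA only caches translations; allocation, faults, and invalidation decisions all stay on the host.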

Integrate with Host OSs


Power and timing.


Summary of the current state of FPGA virtualization. Prior art mainly focuses on: 1) how to virtualize on-chip BRAM (e.g., CoRAM, LEAP Scratchpad); 2) how to work with the host, specifically how to use host DRAM and host virtual memory; 3) how to schedule bitstreams inside an FPGA chip; 4) how to provide certain services that make FPGA programming easier (mostly by working with the host OS).

Languages, Runtime, and Framework

Innovations in the toolchain space.

Xilinx HLS

Xilinx CAD

High-Level Languages and Platforms

Integrate with Frameworks

  • Map-reduce as a Programming Model for Custom Computing Machines, FCCM‘08
    • This paper proposes a model for translating MapReduce code written in C into code that can run on FPGAs and GPUs. Many details are omitted, and they do not actually have a compiler.
    • Single-host framework, everything is in FPGA and GPU.
  • Axel: A Heterogeneous Cluster with FPGAs and GPUs, FPGA‘10
    • A distributed MapReduce framework that targets clusters with CPUs, GPUs, and FPGAs. The main idea is the scheduling of FPGA/GPU jobs.
    • Distributed framework.
  • FPMR: MapReduce Framework on FPGA, FPGA‘10
    • A MapReduce framework on a single host’s FPGA. You need to write Verilog/HLS for the processing logic and hook it into their framework. The framework mainly consists of a data transfer controller and a simple scheduler that enables certain blocks at certain times.
    • Single-host framework, everything is in FPGA.
  • Melia: A MapReduce Framework on OpenCL-Based FPGAs, IEEE‘16
    • Another framework, written in OpenCL; users can program in OpenCL as well. Similar to previous work, it’s more about the framework design than about specific algorithms on FPGA.
    • Single-host framework, everything is in FPGA, but they do discuss running on multiple FPGAs.
    • There are four MapReduce FPGA papers here, and I believe there are more. The marriage between MapReduce and FPGA is not hard to understand: an FPGA can be viewed as another core with different capabilities. The question is, given the FPGA’s reprogramming time and limited on-board memory, how to design a good scheduling algorithm and good data movement/caching mechanisms. These papers give some hints.
  • UCLA: When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration, HotCloud‘16
  • UCLA: Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale, SoCC‘16
    • A system that hooks FPGAs into Spark.
    • There is a line of work that hooks FPGAs into big data processing frameworks (Spark), so that the FPGA implementation and the scale-out software can be separated. Spark can schedule FPGA jobs onto different machines and take care of scale-out, failure handling, etc. But I personally think this line of work is really just an extension of the ReconOS/FUSE/BORPH line of work. The main reason is that both lines of work try to integrate jobs that run on the CPU with jobs that run on the FPGA, so that the CPU and FPGA have an easier way to talk, or, put another way, a better division of labor. Whether it’s single-machine (like ReconOS, Melia) or distributed (like Blaze, Axel), they are essentially the same.
  • UCLA: Heterogeneous Datacenters: Options and Opportunities, DAC‘16
    • Follow-up work to Blaze. Nice comparison of big and wimpy cores.
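
The scheduling question raised in the MapReduce notes above (is the FPGA still worth it once reprogramming time is counted?) can be sketched as a simple cost comparison. This is a back-of-the-envelope model, not the algorithm of any paper listed here; the function `pick_device`, its parameters, and the microsecond units are all assumptions, and it ignores data transfer and on-board memory limits.

```python
def pick_device(task_kernel, n_items, loaded_kernel,
                cpu_us_per_item, fpga_us_per_item, reconfig_us):
    """Return (device, estimated_microseconds) for one batch of work.

    The FPGA pays a one-time reconfiguration cost only when the kernel it
    currently holds differs from the one this batch needs.
    """
    cpu_cost = n_items * cpu_us_per_item
    fpga_cost = n_items * fpga_us_per_item
    if task_kernel != loaded_kernel:
        fpga_cost += reconfig_us
    if fpga_cost < cpu_cost:
        return ("fpga", fpga_cost)
    return ("cpu", cpu_cost)


# Small batch, wrong kernel loaded: reconfiguration dominates, so stay on the CPU.
small = pick_device("wordcount", 1_000, "sort", 1_000, 100, 2_000_000)

# Same batch, kernel already loaded: the FPGA wins.
warm = pick_device("wordcount", 1_000, "wordcount", 1_000, 100, 2_000_000)

# Large batch: the reconfiguration cost is amortized, so the FPGA wins again.
large = pick_device("wordcount", 10_000, "sort", 1_000, 100, 2_000_000)
```

Even this toy model shows why batching work for the same kernel matters: the break-even batch size grows with the reconfiguration time.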

Cloud Infrastructure



Programmable Network



Machine Learning

  • TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning, HPCA‘16
  • Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA‘15
  • From High-Level Deep Neural Models to FPGAs, ISCA‘16
  • Deep Learning on FPGAs: Past, Present, and Future, arXiv‘16
  • Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, FPT‘16
  • FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, FPGA‘17
  • In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA‘17
  • Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, FPGA‘17
  • A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA‘18
    • Microsoft Project Brainware. Built on Catapult.
  • A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, MICRO‘18
  • DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs, ICCAD‘18
  • FA3C: FPGA-Accelerated Deep Reinforcement Learning, ASPLOS‘19
  • Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval, ATC‘19


  • A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA‘15
  • Energy Efficient Architecture for Graph Analytics Accelerators, ISCA‘16
  • Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search, FPGA‘17
  • FPGA-Accelerated Transactional Execution of Graph Workloads, FPGA‘17
  • An FPGA Framework for Edge-Centric Graph Processing, CF‘18

Key-Value Store

  • Achieving 10Gbps line-rate key-value stores with FPGAs, HotCloud‘13
  • Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA‘13
  • An FPGA Memcached Appliance, FPGA‘13
  • Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory, HotStorage‘15
  • KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC, SOSP‘17
  • Ultra-Low-Latency and Flexible In-Memory Key-Value Store System Design on CPU-FPGA, FPT‘18



  • Consensus in a Box: Inexpensive Coordination in Hardware, NSDI‘16

Video Processing

  • TODO


  • TODO


  • TODO

FPGA Internal

FPGA20: Highlighting Significant Contributions from 20 Years of the International Symposium on Field-Programmable Gate Arrays (1992–2011)


Partial Reconfiguration

Logical Optimization and Technology Mapping

Place and Route