|Jul 3, 2022||Add Pathways|
|Apr 2, 2022||Ported from Craft. Recently, I switched to Craft for technical writing. I’m very happy I made that transition. Craft is great at exporting things to Markdown format.|
This document is organized as follows.
- MLIR at 10,000 feet - an overview.
- Why was MLIR created? - a study of the original MLIR paper.
- How is MLIR used in the wild? - case studies.
Without further ado, let’s get started.
MLIR At 10,000 Feet¶
MLIR is short for Multi-Level Intermediate Representation. MLIR helps build reusable compiler infrastructure and reduce duplicated code.
I drew the following figure to show MLIR’s workflow at a very high level.
- MLIR’s input: applications, compilers, C program, etc
- Within MLIR, we can implement multiple Dialects for distinct inputs. For instance, we could use a Dialect to deal with tensors. Further, we can deploy a shared optimization layer to unify things.
- Once we have an optimized IR, MLIR can lower it to backends such as LLVM for CPUs or CIRCT for FPGAs. If you are targeting specialized hardware such as an FPGA or TPU, you still need vendor tools for final compilation (e.g., use Vivado to synthesize Verilog).
I have found several excellent primer readings.
- MLIR Paper from Google, 2020. This paper describes the rationale behind MLIR. Chris Lattner is one of the authors.
- LLVM MLIR Tutorial
- I didn’t understand this image when I first read it. But now it all makes sense. MLIR is something that lies across language AST and LLVM IR.
- ScaleHLS, HPCA’22 can compile HLS C/C++ programs or PyTorch models into optimized HLS C/C++ using MLIR.
Motivation from the Google MLIR Paper¶
This is a really nice intro; pay close attention to how they lay out the storyline. If you are new to PL just like me, I strongly recommend going through the MLIR Toy Example (covered below) for a better understanding, and then coming back to read through this again.
- A common characteristic of popular systems is their “one size fits all” approach - a single abstraction level to interface with the system: the LLVM Intermediate Representation (IR) is roughly “C with vectors”, and JVM provides an “object-oriented type system with a garbage collector” abstraction. This “one size fits all” approach is incredibly valuable - and in practice, the mapping to these domains from ubiquitous source languages (C/C++ and Java respectively) is straightforward. (Praise the unified LLVM IR)
- At the same time, many problems are better modeled at a higher- or lower-level abstraction, e.g. source-level analysis of C++ code is very difficult on LLVM IR. We observe that many languages (including e.g. Swift, Rust, Julia, Fortran) develop their own IR in order to solve domain-specific problems, like language/library-specific optimizations, flow-sensitive type checking. Similarly, machine learning systems typically use “ML graphs” as a domain-specific abstraction in the same way. (Point out the issues about LLVM IR)
- While the development of domain specific IRs is a well studied art, their engineering and implementation cost remains high. … this can lead to lower quality compiler systems. (Point out that developing customized IR framework is challenging)
- The MLIR project aims to directly tackle these programming language design and implementation challenges - by making it very cheap to define and introduce new abstraction levels, and provide “in the box” infrastructure to solve common compiler engineering problems. MLIR does this by
- standardizing the Static Single Assignment (SSA)-based IR data structures
- providing a declarative system for defining IR dialects (demonstrated below using the Toy example)
- providing a wide range of common infrastructure (including documentation, parsing and printing logic, location tracking, multithreaded compilation support, pass management, etc).
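To get a feel for what "declarative" buys you, here is a minimal Python sketch. It is not MLIR's actual mechanism (MLIR uses TableGen-based Operation Definition Specifications); it only illustrates the idea that a dialect can be described as data, with generic infrastructure (here, a verifier) driven by that description. `OpSpec`, `TOY_DIALECT`, and `verify` are hypothetical names, and the operand/result counts are my simplified reading of a few Toy ops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpSpec:
    """Declarative description of one operation (hypothetical, simplified)."""
    num_operands: int
    num_results: int
    summary: str


# A dialect is just a table of op descriptions; no per-op code is written.
TOY_DIALECT = {
    "toy.constant": OpSpec(0, 1, "materialize a constant tensor"),
    "toy.transpose": OpSpec(1, 1, "transpose a tensor"),
    "toy.add": OpSpec(2, 1, "element-wise addition"),
    "toy.print": OpSpec(1, 0, "print a tensor"),
}


def verify(name, operands, results, dialect) -> Optional[str]:
    """Generic verifier shared by every dialect: returns an error string or None."""
    spec = dialect.get(name)
    if spec is None:
        return f"unknown op {name}"
    if len(operands) != spec.num_operands:
        return f"{name}: expected {spec.num_operands} operand(s), got {len(operands)}"
    if len(results) != spec.num_results:
        return f"{name}: expected {spec.num_results} result(s), got {len(results)}"
    return None


print(verify("toy.transpose", ["%0"], ["%1"], TOY_DIALECT))
print(verify("toy.transpose", [], ["%1"], TOY_DIALECT))
```

Adding a new op, or a whole new dialect, only means adding table entries; the verifier is reused unchanged.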
This image shows that most high-level languages have their own AST and associated infrastructure for transforming etc. Though language-specific, these are modules doing similar things. MLIR is a general framework to facilitate the development of such language-specific modules. It allows developers to use a unified codebase/framework to do their optimizations and develop some common, shared optimizations for multiple inputs.
I recommend reading Toy Example Tutorial for a deep understanding.
This image is MLIR’s original motivation. They found that ML graphs have a lot of different compilers. The compilation process is fragmented and some compilers are not following the best practices.
Example 1: MLIR Toy Example¶
While reading through its documentation, I’m starting to get a sense of what problem MLIR is trying to solve. The MLIR paper for sure describes the problem at a high level, but being able to read through the code example and its documentation helps a lot.
The following quote is the same motivation described in the MLIR paper.
Other compilers, like LLVM (see the Kaleidoscope tutorial), offer a fixed set of predefined types and (usually low-level / RISC-like) instructions.
It is up to the frontend for a given language to perform any language-specific type-checking, analysis, or transformation before emitting LLVM IR. <- also mentioned in the MLIR paper.
For example, Clang will use its AST to perform not only static analysis but also transformations, such as C++ template instantiation through AST cloning and rewrite. Finally, languages with construction at a higher level than C/C++ may require non-trivial lowering from their AST to generate LLVM IR.
Consequently, multiple frontends end up reimplementing significant pieces of infrastructure to support the need for these analyses and transformations. MLIR addresses this issue by being designed for extensibility. There are few pre-defined instructions (operations in MLIR terminology) or types.
Each language (C, Swift, Rust, etc.) has its own AST optimizers that do language-specific transformations and analyses. This is quite tedious to do. So:
MLIR is designed to allow all IR elements, such as attributes, operations, and types, to be customized. At the same time, IR elements can always be reduced to the above fundamental concepts. This allows MLIR to parse, represent, and round-trip IR for any operation.
This is EXACTLY what I want to say for the APSys submission.
Through dialects, MLIR allows for the representation of many different levels of abstraction; the Toy dialect that we have previously defined is one such example.
Though these different dialects may represent different abstractions, there is often a set of common transformations and analyses that we would like to perform.
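The claim that MLIR can "parse, represent, and round-trip IR for any operation" is easier to appreciate with a toy model. The Python sketch below is my own simplification, not MLIR's implementation: every op, whatever its dialect, is reduced to a name, SSA operands, SSA results, and attributes, so one printer and one parser serve them all. The textual form only loosely mimics MLIR's generic syntax (types are omitted, and attribute values must not contain commas); `Op`, `print_op`, and `parse_op` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str                 # "dialect.mnemonic", e.g. "toy.transpose"
    operands: list[str]       # SSA values consumed, e.g. ["%0"]
    results: list[str]        # SSA values defined, e.g. ["%1"]
    attrs: dict[str, str] = field(default_factory=dict)


def print_op(op: Op) -> str:
    """Print any op in one generic form, regardless of its dialect."""
    prefix = f"{', '.join(op.results)} = " if op.results else ""
    attrs = ", ".join(f"{k} = {v}" for k, v in op.attrs.items())
    attr_text = f" {{{attrs}}}" if attrs else ""
    return f'{prefix}"{op.name}"({", ".join(op.operands)}){attr_text}'


_OP_RE = re.compile(
    r'^(?:(?P<results>[^="]+) = )?'      # optional "%1, %2 = "
    r'"(?P<name>[\w.]+)"'                # quoted op name
    r'\((?P<operands>[^)]*)\)'           # operand list
    r'(?: \{(?P<attrs>[^}]*)\})?$'       # optional attribute dictionary
)


def _split(text: str) -> list[str]:
    return [part.strip() for part in text.split(",") if part.strip()]


def parse_op(text: str) -> Op:
    """Parse the generic form back; needs no knowledge of the dialect."""
    m = _OP_RE.match(text)
    if m is None:
        raise ValueError(f"cannot parse: {text!r}")
    attrs = dict(item.split(" = ", 1) for item in _split(m["attrs"] or ""))
    return Op(m["name"], _split(m["operands"]), _split(m["results"] or ""), attrs)


op = Op("toy.transpose", ["%0"], ["%1"], {"note": "demo"})
text = print_op(op)
print(text)
print(parse_op(text) == op)
```

Note that `parse_op` would happily accept an op from a dialect it has never heard of, which is the point: the structure is uniform even when the semantics are not.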
The blog builds the Toy Example following these steps:
- It first defines the semantics of this toy language and some simple operations. It then defines an IR for the Toy language in an MLIR dialect. MLIR can transform the source code into its internal IR using the above dialect.
- It then performs “High-level Language-Specific Analysis and Transformation” and other optimizations on the generated IR within MLIR. The transformations are pretty straightforward, such as eliminating duplicated ops. These optimizations, however, would be difficult for LLVM to carry out.
- It then discusses an MLIR internal interface infrastructure that facilitates the above transformations. The rationale is that most transformations used by distinct languages are similar, hence a framework can reduce code duplication and also allow developers to design a set of shared common optimizations/passes.
- Then, the interesting part. It converts this Dialect into other MLIR built-in dialects (e.g., affine, arithmetic), thereby lowering the toy Dialect into more concrete memory accesses, and arithmetic ops, etc.
- Finally, it again lowers the above partially-lowered IR onto the LLVM IR. Once we are here, we can invoke LLVM to generate code (e.g., for x86 or ARM CPUs) or run with the LLVM JIT. Of course, instead of lowering it onto the LLVM IR, one can also lower it onto another IR, e.g., TPU IR (what TensorFlow does).
- MLIR is a generic framework that allows you to define your customized IR using MLIR’s generic primitives (i.e., an indirection layer). From MLIR’s perspective, your IR is just one of the many dialects it supports.
- More importantly, a dialect can fully or partially convert into other dialects. For instance, if you convert your IR into the LLVM IR, you can immediately take advantage of the LLVM’s code-generation framework for CPUs. If you convert your IR into the TPU IR, you can then generate code running on TPUs.
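The "high-level language-specific transformation" step above can be sketched in a few lines. The Toy tutorial uses redundant-transpose elimination as one of its examples; the Python below is a rough analogue under my own simplified IR (a flat list of ops), not the tutorial's C++ rewrite pattern. `Op`, `fold_double_transpose`, and `drop_dead` are hypothetical names, and treating `toy.print` as the only side-effecting op is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    result: str                  # SSA name defined by the op ("" if none)
    name: str                    # e.g. "toy.transpose"
    operands: tuple[str, ...]    # SSA names consumed


def fold_double_transpose(ops):
    """Rewrite transpose(transpose(x)) into x."""
    producer = {}   # SSA name -> op that defines it
    rename = {}     # folded SSA name -> value that replaces it
    out = []
    for op in ops:
        operands = tuple(rename.get(v, v) for v in op.operands)
        op = Op(op.result, op.name, operands)
        inner = producer.get(operands[0]) if operands else None
        if op.name == "toy.transpose" and inner and inner.name == "toy.transpose":
            rename[op.result] = inner.operands[0]   # uses of op now refer to x
            continue
        producer[op.result] = op
        out.append(op)
    return out


def drop_dead(ops, side_effecting=("toy.print",)):
    """Remove ops whose results are never used (walks the list backwards)."""
    used, live = set(), []
    for op in reversed(ops):
        if op.name in side_effecting or op.result in used:
            live.append(op)
            used.update(op.operands)
    return live[::-1]


program = [
    Op("%0", "toy.constant", ()),
    Op("%1", "toy.transpose", ("%0",)),
    Op("%2", "toy.transpose", ("%1",)),
    Op("", "toy.print", ("%2",)),
]
for op in drop_dead(fold_double_transpose(program)):
    print(op)
```

The rewrite needs to know that a transpose of a transpose is the identity. That knowledge is obvious at the Toy level but much harder to recover once the ops have been lowered to loops and memory accesses.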
Say I want to build some P4 or FPGA stuff using MLIR, I would do:
- I would first define a language model together with a new IR using MLIR primitives.
- Then, within MLIR, I would do all sorts of language-specific optimizations, transformations, etc. I can also do some conversions among other dialects.
- After all that, say I’ve got an optimized IR. What should I do next? I cannot fully lower it to the LLVM IR, because there is no P4/FPGA backend in the LLVM framework.
- If I target FPGA, I could generate FIRRTL, which CIRCT can take as input (FIRRTL is also what Chisel emits).
- If I target P4, I could lower the MLIR IR into something like a P4 IR/backend, which would then do the vendor-specific compilation into deployable binaries.
- Is this P4 IR thing already part of the p4 compiler chain? If so, why should I go through all this trouble adding a new MLIR dialect, why not directly use the p4 compile chain? What benefits are we getting out of MLIR though?
- Answer: we benefit from MLIR only if we target multiple backends at the same time, so that they can share the same optimization infrastructure. Specifically, one piece of code can then run on a set of heterogeneous devices, with all the optimizations done within the MLIR layer.
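A minimal Python sketch of that answer: one target-independent optimization pass, written once, feeds several backends. Everything here is hypothetical and only for illustration, including the `demo` dialect, the `cpu`/`fpga` backend names, and the output strings, which are placeholders rather than real LLVM IR or FIRRTL.

```python
# An op is (result, name, operands); operands are SSA names or integer literals.
Op = tuple[str, str, tuple]


def fold_constant_adds(ops: list[Op]) -> list[Op]:
    """Shared pass: evaluate adds whose inputs are all known constants."""
    known: dict[str, int] = {}
    out: list[Op] = []
    for result, name, operands in ops:
        operands = tuple(known.get(v, v) for v in operands)
        if name == "demo.const":
            known[result] = operands[0]
        elif name == "demo.add" and all(isinstance(v, int) for v in operands):
            known[result] = sum(operands)
        else:
            out.append((result, name, operands))
    return out


def emit_cpu(ops: list[Op]) -> list[str]:
    # Placeholder text standing in for CPU code generation.
    return [f"call {name}({', '.join(map(str, args))})" for _, name, args in ops]


def emit_fpga(ops: list[Op]) -> list[str]:
    # Placeholder text standing in for hardware generation.
    return [f"module {name}({', '.join(map(str, args))})" for _, name, args in ops]


BACKENDS = {"cpu": emit_cpu, "fpga": emit_fpga}


def compile_for(ops: list[Op], target: str) -> list[str]:
    """Run the shared optimizations once, then hand off to the chosen backend."""
    return BACKENDS[target](fold_constant_adds(ops))


program = [
    ("%0", "demo.const", (2,)),
    ("%1", "demo.const", (3,)),
    ("%2", "demo.add", ("%0", "%1")),
    ("", "demo.send", ("%2",)),
]
print(compile_for(program, "cpu"))
print(compile_for(program, "fpga"))
```

With a single backend, `fold_constant_adds` could just as well live inside that backend's own toolchain; the shared layer only pays off once `BACKENDS` has more than one entry.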
Example 2: Google IREE¶
I’m not exactly sure what IREE is doing. Overall, it takes an ML program and tries to transform it into scheduling and computation modules that run on various hardware components.
- The bottom right part is interesting. You can see that it can lower onto the LLVM IR and then generate code for various CPUs; it can also lower onto SPIR-V, a special IR defined for GPUs. I’m not sure what VMVX is.
Example 3: LLVM CIRCT¶
- Chisel’s FIRRTL
- MLIR’s output
- CIRCT implements its own FIRRTL parser, so it can take an FIR file to generate RTL
- Other than that, CIRCT could also take MLIR outputs to generate RTL.
- Apparently, CIRCT also uses the Dialect concept.
Example 4: TensorFlow/PyTorch with MLIR¶
- It compiles some Torch operations into a newly defined torch-dialect in MLIR.
- Within MLIR, the torch-dialect is further lowered onto built-in dialects such as affine
Example 5: ScaleHLS, HPCA’22¶
The whole system is implemented on top of MLIR. They introduced a new HLSCPP dialect. They take HLS C programs, or PyTorch/ONNX graph-level programs, and produce highly optimized HLS C/C++ programs.
It is a very interesting read. The following image shows its workflow.
https://raw.githubusercontent.com/hanchenye/scalehls/master/docs/ScaleHLS.svg
My thought: I think we will continue seeing more MLIR-based solutions to help DSA development. It’ll be interesting to see some higher-level, or higher-order, primitives constructed in MLIR to help, say, FPGA-based SQL development (or rather, any type of FPGA-based computation). In general, MLIR helps to raise the abstraction, hence we are able to raise the programmability further.
Example 6: EQueue, HPCA’22¶
Compiler-Driven Simulation of Reconfigurable Hardware Accelerators, HPCA’22.
- The goal is to help simulation.
- Add a new dialect in MLIR to model different accelerators.
- There are two general approaches to simulation: 1) RTL-level simulation, which is very precise but also very slow; 2) high-level simulators, something like gem5, which are fast but far away from the hardware.
- The goal of this paper is to use MLIR to build something in the middle. It introduces a new dialect, which can describe various accelerator structures (e.g., how many processors, memories, DMA engines, etc.). Since MLIR can lower IR, their system can model the accelerator at different levels. At one extreme, they can do very high-level simulation (probably just using their new IR). At the other extreme, they can lower their IR to be close to actual hardware.
- Check their Fig3-Fig5 to understand how they can model different accelerators!
Example 7: Pathways, Google¶
From the paper:
“Sec 4.2: The client then constructs a device location-agnostic PATHWAYS intermediate representation (IR) for the program, expressed as a custom MLIR (Lattner et al., 2021) dialect. The IR is progressively “lowered” via a series of standard compiler passes, which eventually output a low-level representation that includes the physical device locations. This low-level program takes into account the network connectivity between physical devices and includes operations to transfer outputs from a source computation shard to the locations of its destination shards, including scatter and gather operations when a data exchange is required.”
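To picture the "lowering" described in that quote, here is a small Python sketch. It is my own guess at the flavor of such a pass and is in no way Pathways' actual IR or algorithm: a device-agnostic program plus a placement map is lowered into a program where every op carries a device, and a `transfer` op is inserted whenever a value crosses devices. All names (`lower_with_placement`, `transfer`, `tpu0`, `tpu1`) are hypothetical.

```python
def lower_with_placement(ops, placement):
    """Lower (result, name, operands) ops to (result, name, operands, device).

    `placement` maps each result to a device. When an operand lives on a
    different device than its consumer, a transfer op is inserted once and
    the consumer is rewired to the transferred copy.
    """
    lowered = []
    moved = set()   # copies that already exist, named "<value>@<device>"
    for result, name, operands in ops:
        device = placement[result]
        new_operands = []
        for value in operands:
            if placement[value] == device:
                new_operands.append(value)
                continue
            copy = f"{value}@{device}"
            if copy not in moved:
                lowered.append((copy, "transfer", (value,), device))
                moved.add(copy)
            new_operands.append(copy)
        lowered.append((result, name, tuple(new_operands), device))
    return lowered


# Device-agnostic program: %b and %c both consume %a.
program = [
    ("%a", "compute_a", ()),
    ("%b", "compute_b", ("%a",)),
    ("%c", "compute_c", ("%a", "%b")),
]
placement = {"%a": "tpu0", "%b": "tpu1", "%c": "tpu1"}

for op in lower_with_placement(program, placement):
    print(op)
```

The real system also has to account for network connectivity and sharded scatter/gather exchanges, which this sketch ignores entirely.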
Alpa, arXiv’21 proposes a set of methods to partition an ML training process to best utilize pipeline, data, and model parallelism. This seems to be the first one doing all three at once. There are, of course, similar earlier papers doing two out of three (e.g., a paper from the Stanford folks).
TVM IP stuff is also highly related.
General PL Related Readings¶
- Saw this paper on twitter today (03/25/2022). It won the ICSE influential award. https://people.inf.ethz.ch/suz/publications/natural.pdf