Triton: OpenAI's Programming Language (everything you need to know)

Prominent artificial intelligence research lab OpenAI LLC released Triton, a specialized programming language that it says will enable developers to create high-speed machine learning algorithms more easily.

This open-source programming language is Python-compatible and enables researchers to write highly efficient GPU code for AI workloads. According to OpenAI, Triton makes it possible to reach peak hardware performance with relatively little effort: a user can write a kernel in as few as 25 lines of code and achieve results on par with what an expert could produce.

Why Do We Need This New Language?

According to the OpenAI blog, the motivation for developing the language, first presented in 2019, is twofold: ML frameworks, on the one hand, are not powerful enough for large artificial neural networks (ANNs), while hardware-level GPU programming, for example with Nvidia's CUDA, has a high entry hurdle and places heavy demands on manual optimization.

The problem Tillet wanted to solve was how to create a language more expressive than vendor-specific AI libraries such as Nvidia's cuDNN, meaning one able to handle a wide range of operations on the matrices involved in neural networks, while also being portable and delivering performance comparable to cuDNN and similar vendor libraries.

Who Is Triton Aimed At?

Triton is aimed specifically at researchers and developers who are well versed in programming but are not specialized in writing efficient code directly for graphics cards - still the most important hardware for AI training. In its working principle of lowering the barrier to entry, it is broadly similar to GitHub Copilot.

Triton has made it possible for OpenAI researchers with no GPU experience to write screaming-fast GPU code. Making GPU programming easier and lowering its barrier to entry is genuinely useful in practice.

The History of Triton

Two years ago, OpenAI scientist Philippe Tillet presented the first version of Triton in an academic paper. For today's launch, the language has been further enhanced with optimizations aimed at enterprise machine learning projects.

Tailored to the GPU

In order to run the software as efficiently as possible on GPUs, the code must be tailored to the architecture. Memory coalescing ensures that memory transfers from DRAM are combined into large transactions. It is also important to optimize the shared memory management for the data stored in the SRAM. Finally, the calculations must be partitioned and scheduled both within individual streaming multiprocessors (SMs) and across SMs.

While developers have to carry out the associated optimizations manually under CUDA, the Triton compiler takes care of memory coalescing, shared memory management, and scheduling within the SMs automatically. Manual adjustments are only required for the overall scheduling of work across SMs.
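
To give a sense of this division of labor, here is a minimal sketch of a Triton kernel for element-wise vector addition, modeled on the publicly available Triton tutorials; the function names and block size are illustrative rather than taken from OpenAI's announcement. The kernel body never mentions coalescing or shared memory - the compiler handles those - while the launch grid, i.e. how the work is split across SMs, remains the developer's decision.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    # The Triton compiler handles memory coalescing and any shared-memory
    # staging behind these block-level loads and stores.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # The launch grid - how the work is partitioned across SMs - is the part
    # that remains in the developer's hands.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```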

OpenAI Triton: Expanding Access to AI Development

OpenAI promises that, with Triton, even developers with no GPU experience can coax top AI performance out of graphics cards, on par with the results that experienced CUDA developers otherwise achieve - and with fewer lines of code.

"Triton made it possible for OpenAI researchers with no GPU experience to write blazing-fast GPU code," writes OpenAI co-founder Greg Brockman.

OpenAI Promises Two Main Benefits for Software Teams

The first is that Triton can speed up AI projects, since developers have to spend less time optimizing their code.

The other, according to OpenAI, is that Triton is relatively simple, allowing software teams without extensive CUDA programming experience to create more efficient algorithms than they normally could.

The Challenges of GPU Programming

The architecture of modern GPUs can be roughly divided into three major components—DRAM, SRAM, and ALUs—each of which must be considered when optimizing CUDA code:

  • Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
  • Data must be manually stashed to SRAM prior to being re-used and managed so as to minimize shared memory bank conflicts upon retrieval.
  • Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
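
To make the first of these points concrete, here is a small back-of-the-envelope sketch using assumed but typical figures (32 threads per warp, 128-byte memory transactions); the numbers are illustrative and not taken from the article.

```python
# Back-of-the-envelope illustration of memory coalescing (assumed, typical figures).
threads_per_warp = 32      # threads that issue memory requests together
bytes_per_float = 4        # size of a float32 element
transaction_size = 128     # bytes moved per DRAM transaction on typical GPUs

# If every thread in the warp reads a consecutive float, the 128 contiguous
# bytes can be served by a single coalesced transaction ...
coalesced = (threads_per_warp * bytes_per_float) // transaction_size   # -> 1

# ... whereas a widely strided access pattern can require one transaction per thread.
worst_case = threads_per_warp                                          # -> 32
print(coalesced, worst_case)
```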

Automate Machine-Learning Code

With Triton 1.0, OpenAI wants to automate the optimal adjustment of this code as much as possible. This should save developers a lot of time. In addition, this new programming language, which resembles Python, should allow more developers without specific knowledge of the CUDA framework to develop efficient algorithms more easily.

According to OpenAI, the software allows AI developers to deliver high performance without too much effort. Among other things, they can write so-called FP16 matrix multiplication kernels that match the performance of cuBLAS in under 25 lines of code - something many GPU programmers cannot do, according to OpenAI. Matrix multiplication kernels are the software routines that machine learning algorithms rely on for much of their computation.
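
As an illustration of what such a kernel can look like, below is a condensed FP16 matrix multiplication sketch in the spirit of Triton's public matmul tutorial. It is not the exact kernel OpenAI benchmarked against cuBLAS; for brevity it assumes M, N, and K are multiples of the block sizes and omits boundary masking and autotuning.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)          # FP16 tile of A
        b = tl.load(b_ptrs)          # FP16 tile of B
        acc += tl.dot(a, b)          # mapped onto tensor cores by the compiler
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))
```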

How Does Triton Improve AI Performance?

Triton improves AI performance by optimizing three core steps of the workflow with which a machine-learning algorithm running on an Nvidia chip processes data. GPUs remain incredibly challenging to optimize, especially when it comes to executing instructions in parallel.

The first step is the task of moving data between a GPU’s DRAM and SRAM memory circuits. GPUs store information in DRAM when it’s not actively used and transfer it to the SRAM memory to carry out computations.

The optimization process consists of merging the blocks of data moving from DRAM to SRAM into large units of information. Triton performs the task automatically, OpenAI says, thereby saving time for developers.

The second computational step Triton optimizes is the task of distributing the incoming data blocks across the GPU's SRAM circuits in a way that allows them to be processed as quickly as possible.

The main challenge in this step is memory bank conflicts, which occur when two pieces of software accidentally try to write data to the same memory segment. Such conflicts hold up calculations until they are resolved, so reducing their likelihood speeds up AI algorithms.
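
Triton's block-level programming model is what lets the compiler manage SRAM automatically: the developer loads a whole row or tile once, computes on it, and writes it back, while the compiler decides how to stage it in shared memory and avoid bank conflicts. The sketch below, a row-wise softmax modeled on Triton's fused-softmax tutorial, assumes each row is contiguous with n_cols elements and that BLOCK_SIZE is a power of two at least as large as n_cols.

```python
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance per row: the row is loaded into on-chip memory once,
    # normalized there, and written back, instead of round-tripping through DRAM.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float('inf'))
    x = x - tl.max(x, axis=0)        # subtract the row max for numerical stability
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, num / denom, mask=mask)
```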

The third and final task Triton helps automate involves not GPUs’ memory cells but rather their CUDA cores. A single Nvidia data center GPU has thousands of such circuits. They allow the chip to perform a large number of calculations at the same time.

Triton configures them so that calculations are spread out across multiple CUDA cores and can be carried out at the same time rather than one after another. It automates this task only partly, however, because OpenAI sought to give developers the flexibility to manually customize the process for their projects as needed.
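
In practice, that manual flexibility shows up mostly at the launch site rather than inside the kernel: the developer picks the grid that spreads program instances across the GPU and can pass tuning knobs such as num_warps. A hypothetical launch of the add_kernel sketched earlier might look like this.

```python
# Hypothetical launch, reusing add_kernel, x, y, out and n_elements from the
# earlier sketch. The grid determines how many program instances are spread
# across the GPU's SMs; num_warps is one of the knobs Triton leaves to the
# developer for manual tuning.
grid = (triton.cdiv(n_elements, 1024),)
add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024, num_warps=4)
```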