Yesterday I wrote about the first part of Krste Asanović's presentation Accelerating AI: Past, Present, and Future, although that post only covered the past. Today, it's time for the present and future.

Graphics Processing Units (GPUs) were originally dedicated fixed-function units, cramming workstation-class graphics onto PCs. Gradually, in the early 2000s, more programmability was added. The architectures were naturally massively parallel, but with a very constrained programming model. However, people figured out how to do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shaders. But it was next to impossible to program, even with the rudimentary languages that started to be developed.

The big change came in 2006, when NVIDIA introduced the GeForce 8800 and the new programming language CUDA (for Compute Unified Device Architecture). The industry as a whole subsequently pushed for OpenCL, a vendor-neutral language with similar capabilities to CUDA (CUDA is still proprietary to NVIDIA). The idea was to take advantage of GPU computational performance and memory bandwidth to accelerate some general-purpose computing. This used the "attached processor" model: a general-purpose host CPU handles the housekeeping, and a specialized data-parallel processor runs the functions that can be highly parallelized. In particular, over time, it became the fastest way of performing neural network training, short of designing a specialized chip. These became known as GP-GPUs (general-purpose graphics processing units).

Attached Processors

This table, from Hennessy & Patterson (for more about them, see my post Hennessy and Patterson Receive the Turing Award), shows how processor performance has improved, first riding Moore's Law and then running into the power wall. From 1985 to 2003, performance improved 52% per year (about 0.8% per week). Then until 2011 it increased 23% per year. But since then, the performance increase has fallen to 10% per year. Looking forward, the number is anywhere from zero to 2%. This means, in effect, that general-purpose computers will never get (much) faster: every architectural trick we know has been used, and the semiconductor tricks are limited by power. Going from serial to parallel (with effectively as many cores as you want) is a one-time kicker, though.

The one flaw in the argument is "general purpose". We know how to make faster processors for almost any given function by optimizing down to the silicon and building a specialized offload device. If you want to see a more detailed exposition, see my post Are General Purpose Microprocessors Over?

As Krste pointed out in his talk, there are 50-60 specialized neural network accelerator startups, and "the list would be out of date by the time I finished writing it." Or, as Chris Rowen likes to say, an AI startup is any startup created since 2015. Chris even has a chart of them all (February 2018 is the latest iteration).

Accelerating AI in the Future

Algorithms Change Quickly but Patterns Endure

There is a famous 2006 paper about the Berkeley Dwarfs, officially The Landscape of Parallel Computing Research: A View from Berkeley, of which Krste is the lead author by dint of his name coming first alphabetically. A dwarf is an essential element of any computational problem, such as sparse and dense linear algebra, convolutions, spectral transforms, graph traversal, dynamic programming, and more (13 in all).
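To make the "pattern" idea concrete, here is a minimal sketch (my own illustration, not code from the talk or the paper) of one dwarf, a direct 2D convolution, written as a CUDA kernel. It also shows the attached-processor model described above: the host CPU does the housekeeping and data movement, while the data-parallel loop runs on the GPU. The kernel name, image and filter sizes, and launch configuration are all arbitrary.

```cpp
// Minimal sketch (illustrative, not from the talk): the "convolution" dwarf as a
// CUDA kernel, using the attached-processor model -- the host CPU does the
// housekeeping and data movement, the GPU runs the data-parallel part.
#include <cuda_runtime.h>
#include <cstdio>

// Each thread computes one output pixel of a single-channel HxW image convolved
// with a KxK filter (valid padding, so the output is (H-K+1) x (W-K+1)).
__global__ void conv2d(const float *in, const float *filt, float *out,
                       int H, int W, int K) {
    int ox = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int oy = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int outH = H - K + 1, outW = W - K + 1;
    if (ox >= outW || oy >= outH) return;
    float acc = 0.0f;
    for (int ky = 0; ky < K; ky++)
        for (int kx = 0; kx < K; kx++)
            acc += in[(oy + ky) * W + (ox + kx)] * filt[ky * K + kx];
    out[oy * outW + ox] = acc;
}

int main() {
    const int H = 64, W = 64, K = 3;                  // arbitrary illustrative sizes
    const int outH = H - K + 1, outW = W - K + 1;

    // Host-side housekeeping: allocate and initialize the data on the CPU.
    float *h_in = new float[H * W], *h_filt = new float[K * K], *h_out = new float[outH * outW];
    for (int i = 0; i < H * W; i++) h_in[i] = 1.0f;
    for (int i = 0; i < K * K; i++) h_filt[i] = 1.0f / (K * K);   // averaging filter

    // Copy to the attached processor (the GPU), launch the parallel kernel, copy back.
    float *d_in, *d_filt, *d_out;
    cudaMalloc((void **)&d_in,   H * W * sizeof(float));
    cudaMalloc((void **)&d_filt, K * K * sizeof(float));
    cudaMalloc((void **)&d_out,  outH * outW * sizeof(float));
    cudaMemcpy(d_in,   h_in,   H * W * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_filt, h_filt, K * K * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((outW + block.x - 1) / block.x, (outH + block.y - 1) / block.y);
    conv2d<<<grid, block>>>(d_in, d_filt, d_out, H, W, K);
    cudaMemcpy(h_out, d_out, outH * outW * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %f\n", h_out[0]);   // expect 1.0: all-ones input through an averaging filter
    cudaFree(d_in); cudaFree(d_filt); cudaFree(d_out);
    delete[] h_in; delete[] h_filt; delete[] h_out;
    return 0;
}
```

A tuned GPU library or a dedicated accelerator would tile the data into local memory and fuse operations rather than use this naive form, but the underlying data-parallel pattern is the same, which is Krste's point.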
If you build parallel hardware that can run the dwarfs efficiently and fast, then it can run any future program fast. They are the periodic table of parallel algorithms. Or, as Krste put it: "I don't know what future AI algorithms will look like, but they will use these patterns. So design for flexible composition of instances of these patterns."

Moore's Law is Dead but Amdahl's Law Lives

Amdahl's Law was an observation by Gene Amdahl that the speedup a parallel computer can get is limited by the part that cannot be parallelized. For example, if 5% of the code cannot be parallelized, then the maximum speedup in the limit is 20X: the parallel stuff runs infinitely fast, leaving just the 5% that could not be parallelized, so the speedup can never exceed 1/0.05 = 20X. Amdahl's lesson was that scalar performance in a vector processor is really, really important. The CNS-1 proposal, back in 1992, made the same point, and it remains true 25 years later: whatever you don't accelerate will constrain your performance and energy efficiency. The faster the acceleration, the greater this effect. So it is important to work hard on scalar performance and control latencies.

Software Matters Most but You Can Never Finish It

Domain-specific languages such as TensorFlow are a big boon to new hardware, but you still have to actually map the backend onto your hardware. You have to do this without access to all the software, since maybe less than 1% of the software will be finished before tapeout. There is a tendency to code just that 1% of kernels and not the other 99%, but then Amdahl's Law can bite you. But the bigger message is this: if the system is difficult to program, then you will not have software. And if you don't have software, then you don't have an accelerator.

RISC-V

A SiFive presentation by Krste would not be complete without a mention of RISC-V. Originally it was designed as the basis for custom accelerators "so we didn't have to beg MIPS not to sue us." But you can simplify the software by having just one ISA for all cores, whether out-of-order, interrupt-responsive, vector-extended, and so on. Every core will then have the same memory model, same synchronization primitives, same compiler tool flow (with stuff like C-struct packing the same down to the bit level), debugger, tracing, and more. The benefit of doing something slightly better is just not there when you lose all this.

Krste's Top 3 List

So, distilling an hour of presentation into three bullet points, if you are designing some sort of accelerator for AI then:

- Design for flexible composition of instances of the dwarf patterns
- Work hard on scalar performance and control latencies
- Make it easy to program or nobody will

Video

Yesterday, Krste forwarded me a link, and there is a video of the entire talk: https://youtu.be/8n2HLp2gtYs

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.