
Ploughing 1 TB of RAM with Twenty x86 Oxen and 10,000 RISC-V Chickens

OK, that wins the prize for best title of a presentation in the recent RISC-V workshop, or pretty much any workshop. I couldn't resist using it as a title for this post. Can you say click-bait? You'll have to read almost to the end to find out about RISC-V chickens. That's after finding out about RISC-V minions. This is the second post from the RISC-V workshop. If you missed it, the rather less interestingly titled RISC-V Workshop, Milpitas was yesterday's post. You should probably start there if you haven't read it already. This picks up where that post left off.

Esperanto

I should probably have just used Dave Ditzel as the subheading, not the name of the company that he heads up. He is a legendary name in computer architecture, especially in getting good performance at very low power, and in using "code morphing" software (running a different instruction set from the underlying hardware). He did a stint of several years at Intel, but he is probably most famous for starting Transmeta (not to mention persuading Linus Torvalds to come and work for him there). Before that, he was at Sun for years as the CTO of the SPARC side of the house, and the lead architect for the 64-bit version.

Esperanto is a startup that came out of stealth mode during the workshop (it was founded three years ago, in November 2014). The motivation for Esperanto is that there is no RISC-V alternative to high-end Arm processors. Without that, it is hard for many companies to make an architectural switch, and business arrangements seem to become less favorable if a company doesn't show complete loyalty to the Arm architecture. Dave feels that a high-end core in a leading-edge process is the missing piece of the puzzle needed to make the switch across the entire range.

So that's what Esperanto is doing. They are designing a high-performance RISC-V core comparable to the best IP alternatives (I guess that is a veiled way of saying they will have as good performance as Arm but be a little behind Intel). They are not just doing one core, but two or three. The focus is companies with high teraflop computing needs. The aim is not just parity, but to make RISC-V more compelling than other high-end alternatives, with the best single-thread performance and the best TFLOPS per watt. The first cores will be in 7nm, going straight to the most advanced process available. Customers he has talked to want to see it in a standard Verilog flow, not a Chisel flow. Dave likes Chisel, but it takes total commitment since the Verilog it produces is not human-comprehensible, and it is hard to make use of tools that work on Verilog: linters, formal verification, and so on. This takes a very big physical design effort, but there is a big payoff. Energy efficiency needs careful tradeoffs across architecture, circuit design, and physical design.

Dave was careful to point out that he wasn't making any product introductions...but he pretty much did. They are building actual chips that they will sell. Of the three cores that he talked about, the first is the ET-Maxion, which is obviously the big one. ET-Maxion will be the highest-performance 64-bit RISC-V processor. There will no longer be a hole at the high end. The second core is the ET-Minion. The goal of the Minion is to do all the floating-point work. It will be full 64-bit, but with an integrated vector floating-point unit as well as specialized instructions for machine learning and tensor operations, and some support for graphics operations. There will be multiple hardware threads so that processors can do more while waiting for accelerators to finish on a given thread. The third core is BOOM (see below; Chris Celio, BOOM's developer, is joining Esperanto when he graduates from Berkeley). It will be optimized for 7nm CMOS (currently it is an academic project) and made available as a licenseable core. Later he said that BOOM would continue to be free, so I'm not sure precisely what the business model is going to be.

He showed a simulation of an "AI supercomputer on a chip" with 16 Maxions and 4096 Minions, each with their own vector floating-point unit. That is a lot of teraflops on a single chip. There is a NoC that allows all the processors to reside in the same address space. And it will, of course, use energy-efficient design techniques - that's what Dave does.
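Just how many teraflops was left as an exercise. As a rough illustration (and nothing more), here is a back-of-envelope estimate in Python; the clock frequency and FLOPS-per-cycle-per-Minion figures are assumptions I have made up for the sketch, not numbers from the talk.

    # Back-of-envelope peak-throughput estimate for a chip with 4096 Minions.
    # ASSUMED_CLOCK_GHZ and ASSUMED_FLOPS_PER_CYCLE are illustrative guesses,
    # not Esperanto figures.
    MINIONS = 4096
    ASSUMED_CLOCK_GHZ = 1.0        # assumption
    ASSUMED_FLOPS_PER_CYCLE = 16   # assumption: e.g. an 8-lane FMA unit per Minion

    peak_tflops = MINIONS * ASSUMED_CLOCK_GHZ * 1e9 * ASSUMED_FLOPS_PER_CYCLE / 1e12
    print(f"{peak_tflops:.0f} TFLOPS peak under these assumptions")  # ~66 TFLOPS

Whatever the real clock and datapath width turn out to be, 4096 vector-equipped cores add up quickly.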
Graphics is another area of interest. They need an on-die graphics solution...but they have thousands of cores, and a shader isn't that different. They would need a shader compiler, so they went ahead and built one, along with code to distribute the computation. He had a demo of the video games Adam and Crysis running on RTL. In a sense this is the complement of all those NVIDIA GPUs being used for neural network training despite being originally designed for graphics.

BOOM

BOOM is the Berkeley Out-of-Order Machine. No prizes for guessing that it is a RISC-V implementation that takes the Berkeley Rocket design and adds out-of-order execution. It consists of 16,000 lines of Chisel. The original BOOM was taped out, but he couldn't show a picture due to academic pre-publication rules. He mostly talked about BOOM v2, the second version, done in TSMC 28nm. Why even do a second version? The first version was very academic: no commercial memories, no LVT-based standard cells. For example, the BTB (branch target buffer) was just 30 entries built from flops; put it in SRAM and it could have 200 entries at the same frequency.
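If you have not met a branch target buffer before, a toy software model makes the idea concrete: it is essentially a small lookup table from a branch's PC to its last-seen target, and the entry count determines how many branches it can remember before they start evicting each other. This is purely an illustrative sketch of the concept, not BOOM's actual BTB, which also tracks tags, prediction state, and more.

    # Toy direct-mapped branch target buffer (BTB): maps a branch PC to the
    # target address it last jumped to. Illustrative only - not BOOM's design.
    class ToyBTB:
        def __init__(self, num_entries):
            self.num_entries = num_entries
            self.table = [None] * num_entries      # each slot holds (pc, target) or None

        def _index(self, pc):
            return (pc >> 2) % self.num_entries    # word-aligned PCs, simple modulo index

        def predict(self, pc):
            entry = self.table[self._index(pc)]
            return entry[1] if entry and entry[0] == pc else None

        def update(self, pc, target):
            self.table[self._index(pc)] = (pc, target)

    # 30 entries (flop-based) vs 200 entries (SRAM-based): the larger table
    # simply remembers more branches before collisions evict old ones.
    small, large = ToyBTB(30), ToyBTB(200)
    small.update(0x8000_0010, 0x8000_0400)
    print(hex(small.predict(0x8000_0010)))         # 0x80000400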
In some ways the most impressive aspect of BOOM v2 is the subtitle: "taping out an out-of-order processor with a 2-person team in 4 months." The biggest challenge in this sort of agile development is that "I can write bugs faster than I can find them." This is even more complicated in an out-of-order core, since sometimes it is genuinely doing something wrong (such as executing speculatively down a branch that turns out to be mispredicted) but it will be fixed up correctly later. His conclusion about this style of agile hardware development: anything that doesn't make a big change to the physical design is fast, but changes to the physical design are painful. Like everyone else, he is hoping that wrapping machine learning around physical design will result in a system where "you can go on vacation for a week and come back and there are a dozen good designs to choose from." I picked this presentation as one to write about because of the incredibly short schedule, more than for the processor itself. Since Chris is joining Esperanto, BOOM is going there too, but it will remain open source and is not going away.

Celerity

Michael Taylor of the University of Washington talked about Celerity. Again, I picked this to write about because it is on a dramatic scale in an era when we are told all the time that it costs $100M to design a state-of-the-art SoC. This is in TSMC 16nm with 511 cores (yes, that is the correct number: 496 cores in a square array, 10 more special low-voltage cores, and five Linux-capable cores). On a full-sized reticle they could put 35,000 cores. In fact, they increased the number of cores dramatically at the last minute, since the magic is in the NoC that makes everything a single address space.

Part of what they were doing was the Celerity chip itself; part was building the DNA for open-source ASIC designs. Everyone notices that the more ASICs you build, the easier it gets, since you can just take the same stuff you used last time and re-use it. But that requires there to be a last time. As part of Celerity they also created several hundred modules, all the conceptual hardware blocks plus testing code. This was funded by DARPA for $1.3M (you'll have to wait to find out what DARPA's Linton Salmon said about open-source IP in his keynote, coming up early in the new year).

PicoChip

PicoChip is another design that I selected more for its design flow. Tim Edwards of efabless corporation presented "PicoSoC: How we created a RISC-V based ASIC processor using a full open source foundry-targeted RTL-to-GDS flow, and how you can, too!" That pretty much says it all. It is a full-chip ASIC implementation of the PicoRV32-based PicoSoC, and it was done entirely using open-source EDA tools. It is in a 180nm process, which Tim considers the sweet spot for analog design. As he put it, "it is not so far off the leading edge that the microprocessor takes up the whole chip." By the way, PicoChip has nothing to do with the Bath-based fabless semiconductor company PicoChip, which made chips for small base-stations and ended up, after a couple of acquisitions, inside Intel. But if you want a microprocessor connection, Jamie Urquhart, one of the designers of the first ARM processors and at one time COO of Arm, was on its board.

The tool flow used was:
Synthesis: yosys and ABC
Static Timing: vesta
Placement: graywolf
Routing: groute
Layout, DRC, LVS: magic
Verilog Simulation: iverilog
Co-simulation: ngspice with iverilog
Mask Generation: magic
There are more details on opencircuitdesign.com for many of these.
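To give a flavor of what driving such a flow looks like, here is a minimal Python sketch that shells out to two of the tools above: yosys for synthesis and Icarus Verilog (iverilog) for simulation. The file and module names are hypothetical placeholders, only the basic documented invocations are used, and the real PicoSoC flow involves many more steps (placement, routing, DRC/LVS, mask generation), so treat this as an illustration rather than the efabless flow itself.

    # Minimal sketch of scripting two steps of an open-source RTL flow.
    # Assumes yosys and Icarus Verilog are installed and on PATH.
    # "picosoc.v", "picosoc_tb.v" and the top module name are hypothetical.
    import subprocess

    def synthesize(rtl="picosoc.v", top="picosoc", out="picosoc_synth.v"):
        # yosys: read the RTL, run generic synthesis, write gate-level Verilog
        subprocess.run(
            ["yosys", "-p", f"read_verilog {rtl}; synth -top {top}; write_verilog {out}"],
            check=True)

    def simulate(testbench="picosoc_tb.v", rtl="picosoc.v"):
        # iverilog compiles the testbench plus RTL; vvp runs the compiled simulation
        subprocess.run(["iverilog", "-o", "picosoc_sim", testbench, rtl], check=True)
        subprocess.run(["vvp", "picosoc_sim"], check=True)

    if __name__ == "__main__":
        synthesize()
        simulate()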
I know very little about the true capabilities of these tools, but I suspect their limitations are severe. Magic (the layout system) was written in the 1980s...but it is free, and researchers like free. Graywolf was forked off TimberWolf, which was written only a little bit later. It has always seemed to me that open source doesn't really work for leading-edge design tools. First, the leading-edge designers are not experts on algorithms, and vice versa, and open source clearly works best when people are writing tools for their own use. Secondly, the other area where open source wins is where a closed-source product is cloned, but that requires the product being cloned to sit still. I think it is interesting that there are no popular open-source video games (although there is lots of open-source infrastructure for video games). By the time a game is clearly popular, it is too late to clone it; gamers will have moved on to something else once the clone is ready. Leading-edge IC design is like that. However, as Moore's Law slows, non-leading-edge processes become more important. The foundries are even going back and updating their processes with the learning they have gained from advanced processes, making cheaper and lower-power versions. So maybe those processes are basically sitting still for long enough for good open source design to get some traction in the commercial world.

Ploughing 1 TB of RAM with Twenty x86 Oxen and 10,000 RISC-V Chickens

OK, I picked this one mostly for the title. It is not even the whole title. That would be Work in Progress Update: GRVI Phalanx.2 on Amazon AWS EC2 F1: Ploughing 1 TB of RAM with Twenty x86 Oxen and 10,000 RISC-V Chickens. It was presented by Jan Gray of Gray Research. You probably missed the subtlety in the full title that it was running on Amazon AWS EC2 F1. That F1 is really significant: it means that those servers have huge Xilinx FPGAs for use as offload processors (but I like the minion terminology for this).

He has built what he calls a phalanx, GRVI (pronounced "groovy"). It is an overlay for the FPGA accelerator that makes it easy to build a design with hundreds of RISC-V cores using a simple 5-second recompile of the code, versus about 5 hours for SP&R on the FPGA. The overlay architecture is fixed as a Hoplite 2D torus NoC. The simpler RISC-V processor gives more processing engines (PEs) per die and so more parallelism. Each PE is just 300 LUTs with a handcrafted datapath. The NoC has a 40-bit address and a 32-byte payload. It runs on a large AWS instance with eight FPGAs, each a Virtex UltraScale with 1M LUTs (that's big), plus DSP blocks and more. He thinks the first time he built it and laid it out was the first time anyone had built a 1K+ core RISC-V design (it had 1680 cores). He has a bigger one too, with multiple FPGAs (hence the title), but he can't yet send messages between the FPGAs, so he admits he is cheating a little on his claims. Today, it has to be programmed by hand, something "only the designer could love", but he is getting OpenCL to work to make it more accessible. There is an SDK coming for 8-, 80-, 800- and 10,000-core designs. Everything has been enabled by the RISC-V ecosystem.

Next Workshop

The next workshop will be May 7-10, 2018 in Barcelona (presumably still part of Spain) at the Barcelona Supercomputing Center, co-hosted by Universitat Politècnica de Catalunya and a company yet to be announced. Every RISC-V workshop has been sold out, so if you plan on going, don't leave it until the last minute. I can't resist using this as an excuse to point out that it has to be the most beautiful supercomputing center in the world, combining classical architecture with cutting-edge technology.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
