The New Tensilica DNA 100 Deep Neural-network Accelerator

Today, at the beautiful Tegernsee resort outside Munich in Germany, Cadence announced their latest processor IP, the Tensilica DNA 100 Deep Neural-network Accelerator. This is a highly scalable processor with a range from 0.5 TMACS (tera multiply-accumulates per second) up to hundreds of TMACS.

Neural Network Development

I have heard it said that there has been more progress in deep learning and neural networks in the last 3 years than in all the years before. I rejoined Cadence 3 years ago. Coincidence? I think not. Joking aside, neural networks have become increasingly important, and I have found myself writing about various aspects of them many times. At first, it was mostly about using 32-bit floating point in the cloud, probably with GPUs, too. It has been a hot area in universities for both undergraduates (instant hire) and research.

Just as a datapoint, I happened to see a tweet from Yann LeCun about the most cited authors in the whole of computer science over the last 3 years. The top researchers are all neural network researchers. Note the units these are measured in: this is not citations over the whole year, it is citations per day. I don't actually know if this should be multiplied by 260 (weekdays) or 365 to get annual numbers. Even if we use the lower number, in 2018 (annualized) Yoshua Bengio was cited over 34,000 times. Also look at the citation growth over the three years: the rates all basically doubled.

Once research had found effective ways to use the cloud and GPUs for training, an important new area for research was how to do on-device inference. There are lots of drivers for this, such as wanting more responsive systems than is possible when sending everything up to the cloud and back. But the biggest is that some systems need to operate without permanent connectivity. Most obviously, an autonomous vehicle cannot depend on cellular connectivity being good before it decides whether a traffic light is red or green. Another driver is the need for privacy: people are uncomfortable with, for example, their smart TV uploading all their conversations to the cloud to find the occasional command of relevance to the TV among the everyday conversation.

On-device inference means doing inference with limited resources. There are two big aspects to this: how to compress the network (the weight data) without losing accuracy, and how to architect hardware that can handle on-device inference using the compressed weight data. At the recent HOT CHIPS conference in Cupertino, one of the tutorials was on how to do the compression. I won't cover that ground again here; you can read my post HOT CHIPS Tutorial: On-Device Inference. The bottom line is to reduce everything from 32-bit floating point to 8-bit, and to use techniques that make as many of the weight values as possible zero, so that the matrices involved are as sparse as possible. Surprisingly, instead of this being a difficult tradeoff between size and accuracy, the reduced networks seem to end up with slight increases in accuracy. The compression ratios can be as high as 50 times.

Having made many of the weights zero, the next step is to build optimized hardware that delivers a huge number of MACs and deals with all those zeros specially. The reason optimizing the zeros is so important is that zero times anything is zero. So not only is it unnecessary to explicitly load the zero into a register or do the multiply, the other value in the calculation does not need to be loaded either.
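Since the compression story boils down to pruning weights to zero, dropping from 32-bit floating point to 8-bit, and then skipping the zeros, a small sketch may make it concrete. This is only an illustration in plain Python/NumPy under my own assumptions (a toy layer size, an arbitrary pruning threshold, a simple per-tensor scale); it is not Cadence's compression flow, and the zero-skipping loop is just a software stand-in for what the DNA 100's sparse compute engine does in hardware.

```python
# Illustrative sketch only: toy magnitude pruning, int8 quantization, and a
# zero-skipping multiply-accumulate loop. Thresholds and scale choices are
# arbitrary assumptions, not Cadence's algorithm or the DNA 100's hardware scheme.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)  # fp32 layer weights
activations = rng.normal(0.0, 1.0, size=64).astype(np.float32)

# 1. Prune: force small-magnitude weights to exactly zero to create sparsity.
threshold = 0.06                        # arbitrary cut-off for the sketch
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size

# 2. Quantize: map the surviving fp32 weights to signed 8-bit with a per-tensor scale.
scale = np.abs(pruned).max() / 127.0
w_int8 = np.round(pruned / scale).astype(np.int8)

print(f"sparsity: {sparsity:.0%}, storage: 32-bit float -> 8-bit int (4x) "
      f"before any sparse encoding of the zeros")

# 3. Zero-skipping MAC: when a weight is zero, neither operand is fetched and no
#    multiply is issued -- the saving described above, done here in software.
def sparse_mac_row(w_row_int8, acts_fp32, scale):
    acc = 0.0
    for j, w in enumerate(w_row_int8):
        if w == 0:
            continue                    # skip loading acts_fp32[j] and the multiply
        acc += (w * scale) * acts_fp32[j]
    return acc

out0 = sparse_mac_row(w_int8[0], activations, scale)
dense0 = float(weights[0] @ activations)
print(f"row 0: sparse/quantized = {out0:.4f}, original fp32 = {dense0:.4f}")
```

On real hardware the zero weights would additionally be encoded away so they cost neither storage nor memory bandwidth; combining that sparse encoding with the 4x saving from 8-bit storage is roughly how compression ratios can approach the 50-times figure mentioned above.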
The values involved can also be compressed so that fewer bits need to be transferred to and from memory. Every memory transfer uses power, and it is not hard to end up with the data transfers consuming more power than the calculations themselves, as happens with the Google TPU.

DNA Architecture

The computational requirements (and the power and silicon budgets to pay for them) vary a lot depending on the end market. For example:

- IoT: less than 0.5 TMACS
- Mobile: 0.5 to 2 TMACS
- AR/VR: 1 to 4 TMACS
- Smart surveillance: 2 to 10 TMACS
- Autonomous vehicles: 10s to 100s of TMACS

Every application of a processor like the DNA 100 is going to be different, but one high-end use case is perception and decision making in automotive, with cameras, radar, lidar, and ultrasound. A typical architecture is to have local pre-processing of the different types of data, and then bring it all together to analyze it (is that a pedestrian?) and act on it (apply the brakes). Cadence has application-specific Tensilica processors such as the Vision C5 suitable for handling the pre-processing, and the new DNA 100 is powerful enough to handle all the decision making.

The DNA 100 processor architecture is shown in the block diagram above. The left-hand gray-background block is a sparse compute engine with high MAC utilization. The block on the right is a tightly coupled Tensilica DSP that controls the flow of processing, and also future-proofs designs by providing programmability. You can think of these two blocks as the orchestra and the conductor.

The DNA 100 architecture is scalable internally, mostly by how many MACs are included. It can easily scale from 0.5 to 12 TMACS. The next level of scaling is to put several DNA 100 cores on the same chip, communicating via some sort of network-on-chip (NoC). If that is not enough, multiple chips (or boards) can be grouped into a huge system. Autonomous driving has been described as requiring a supercomputer in the trunk. This is how you build a supercomputer like that.

Performance

ResNet50 is a well-known network for image classification. A DNA 100 processor in a 4K MAC configuration, running at 1 GHz, can handle 2550 frames per second. This high number is enabled by both sparse compute and high MAC utilization. It is also extremely power-efficient: in 16nm it delivers 3.4 TMACS/W (in a 4 TMACS configuration, with all the network pruning).

Software

A complex processor like the DNA 100 is not something where you would consider programming "on the bare metal." Frameworks such as Caffe, TensorFlow, and TensorFlow Lite are popular for creating the neural networks. Cadence has the Tensilica Neural Network Compiler, which takes the output from these frameworks and maps it onto the DNA 100, performing all the sparsity optimization and eventually generating the DNA 100 code. Another popular approach is the Android Neural Networks API, which handles some levels of mapping before passing on to the Tensilica Neural Network Driver that produces the DNA 100 code.

And in a bit of late-breaking news from last Thursday: at Facebook's 2018 @Scale conference in San Jose, California, the company announced broad industry backing for Glow, its machine learning compiler designed to accelerate the performance of deep learning frameworks. Cadence, Esperanto, Intel, Marvell, and Qualcomm committed to supporting Glow in future silicon products. The DNA 100 doesn't support Glow yet...but watch this space.
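To make the framework end of that flow concrete, here is a minimal sketch of producing a fully 8-bit model with standard TensorFlow Lite post-training quantization. It uses the current TensorFlow 2.x converter API, a randomly initialized MobileNetV2, and random calibration data purely for illustration; the hand-off to the Tensilica Neural Network Compiler or Neural Network Driver, and the actual mapping onto the DNA 100, are not shown.

```python
# Sketch of the framework side only: stock TensorFlow Lite post-training int8
# quantization. The downstream Tensilica tooling for the DNA 100 is not shown here.
import numpy as np
import tensorflow as tf

# Any Keras model would do; a MobileNetV2 with random weights is used as an example.
model = tf.keras.applications.MobileNetV2(weights=None, input_shape=(224, 224, 3))

def representative_data():
    # A real flow would feed a few hundred genuine input samples so the converter
    # can calibrate quantization ranges; random data keeps this sketch runnable.
    for _ in range(8):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()      # fully 8-bit weights and activations
open("model_int8.tflite", "wb").write(tflite_int8_model)
```

The resulting .tflite file is the kind of 8-bit artifact that a back-end compiler for a processor like the DNA 100 would then prune, compress for sparsity, and map onto the hardware.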
For higher-level functionality, Cadence partners with specialists such as ArcSoft for facial recognition, or MulticoreWare for face detection.

Summary

The Tensilica DNA 100:

- Can run all neural network layers, including convolution, fully connected, LSTM, LRN, and pooling.
- Can easily scale from 0.5 to 12 effective TMACS. Further, multiple DNA 100 processors can be stacked to achieve 100s of TMACS for use in the most compute-intensive on-device neural network applications.
- Also incorporates a Tensilica DSP to accommodate any new neural network layer not currently supported by the hardware engines inside the DNA 100 processor.
- Has a complete software compilation flow, including compression for sparsity.
- Will be available to select customers in December 2018, with general availability in Q1 2019.

For more details, see the product page.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
