
Author: Tronserve admin

Thursday 5th August 2021 01:17 PM

Nvidia Chip Takes Deep Learning to the Max Level



There's no doubt that GPU powerhouse Nvidia would like to have a solution for every scale of AI, from massive data-center jobs down to the always-on, low-power neural networks that listen for wake words in voice assistants.

 

Right now, that would take several different technologies, because none of them scales up or down particularly well. It's clearly preferable to deploy one technology rather than several. So, according to Nvidia chief scientist Bill Dally, the company has been trying to answer the question: “Can you build something scalable... while still maintaining competitive performance-per-watt across the entire spectrum?”

 

It looks like the answer is yes. Last month at the VLSI Symposia in Kyoto, Nvidia detailed a tiny test chip that can work on its own to handle low-end jobs or be linked tightly with up to 36 of its kin in a single module to do deep learning's heavy lifting. And it does all of this while maintaining roughly the same top-class performance.

 

The individual accelerator chip is designed to perform the inference side of deep learning rather than training. Engineers usually measure the abilities of such “inferencing” chips in terms of how many operations they can do per joule of energy or per square millimeter of area. A single one of Nvidia's prototype chips peaks at 4.01 tera-operations per second (trillions of operations per second) and 1.29 TOPS per square millimeter. Compared with prior prototypes from other groups operating at the same precision, the single chip was at least 16 times as area efficient and 1.7 times as energy efficient. Linked together into a 36-chip system, it achieved 127.8 TOPS, a 32-fold performance boost. (Admittedly, some of the efficiency comes from not having to handle higher-precision math, certain DRAM issues, and forms of AI other than convolutional neural nets.)
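The scaling claim above can be sanity-checked with a little arithmetic: 36 chips at 4.01 TOPS each would yield about 144 TOPS if scaling were perfectly linear, so the measured 127.8 TOPS works out to roughly a 32-fold speedup, or about 89 percent scaling efficiency. A quick sketch, using only the figures quoted in the article:

```python
# Sanity check of the 36-chip scaling numbers quoted in the article.
single_chip_tops = 4.01    # peak tera-operations/s for one chip
num_chips = 36
module_tops = 127.8        # measured peak for the full 36-chip module

ideal_tops = single_chip_tops * num_chips     # ~144.4 TOPS if scaling were linear
speedup = module_tops / single_chip_tops      # ~31.9, the "32-fold boost"
efficiency = speedup / num_chips              # ~0.89 scaling efficiency

print(f"ideal: {ideal_tops:.1f} TOPS, "
      f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```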

 

Companies have mostly been tuning their technologies to work best in their particular niches. For example, the Irvine, Calif.-based startup Syntiant uses analog processing in flash memory to boost performance for very-low-power, low-demand applications. Google's original tensor processing unit, meanwhile, would have its powers wasted on anything other than the data center's high-performance, high-power environment.

 

With this research, Nvidia is trying to demonstrate that one technology can perform well in all of those situations. Or at least it can if the chips are linked together with Nvidia's mesh network in a multichip module. These modules are small printed circuit boards or slivers of silicon that hold multiple chips in a way that lets them be treated as one large IC. They are becoming increasingly popular, because they allow systems to be composed of several smaller chips, often called chiplets, instead of a single larger and more expensive chip.

 

“The multichip module option has a lot of advantages not just for future scalable [deep learning] accelerators but for building versions of our products that have accelerators for various functions,” explains Dally.

 

Key to the Nvidia multichip module's ability to bind together the new deep learning chips is an interchip network that uses a technology called ground-referenced signaling. As its name implies, GRS uses the difference between a voltage signal on a wire and a common ground to transfer data, while avoiding many of the known pitfalls of that approach. It can transmit 25 gigabits per second over a single wire, whereas most technologies would need a pair of wires to reach that speed. Using single wires boosts how much data you can stream off each millimeter of the chip's edge to a whopping terabit per second. What's more, GRS's power consumption is a mere picojoule per bit.
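Those figures imply a couple of back-of-envelope numbers worth spelling out: at 25 Gb/s per wire, reaching a terabit per second per millimeter of edge takes about 40 wires per millimeter, and at roughly one picojoule per bit, a full terabit per second of traffic costs on the order of one watt. A rough sketch, derived only from the article's quoted values:

```python
# Back-of-envelope implications of the GRS figures quoted in the article.
gbps_per_wire = 25          # single-ended signaling rate per wire
tbps_per_mm_edge = 1.0      # quoted edge bandwidth density
pj_per_bit = 1.0            # quoted energy per bit

# 1 Tb/s = 1000 Gb/s, so dividing by the per-wire rate gives wire density.
wires_per_mm = (tbps_per_mm_edge * 1000) / gbps_per_wire

# 1e12 bits/s at 1e-12 J/bit is about 1 W per Tb/s of traffic.
watts_per_tbps = (tbps_per_mm_edge * 1e12) * (pj_per_bit * 1e-12)

print(f"{wires_per_mm:.0f} wires per mm of edge, "
      f"~{watts_per_tbps:.1f} W per Tb/s")
```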

 

“It’s a technology that we developed to essentially give the option of building multichip modules on an organic substrate, as opposed to on a silicon interposer, which is a much more expensive technology,” says Dally.

 

The accelerator chip presented at VLSI is hardly the last word on AI from Nvidia. Dally says the team has already developed a version that improves on this chip's TOPS/W. “We believe we can do better than that,” he says. His team aspires to find inference-accelerating techniques that blow past the VLSI prototype's 9.09 TOPS/W and reach 200 TOPS/W while still being scalable.



This article was originally posted on IEEESpectrum.com.


Tags:
gpu powerhouse neural networks