Intel Shipping Nervana Neural Network Processor First Silicon Before Year End
by Nate Oh on October 18, 2017 8:00 AM EST

This week at the Wall Street Journal’s D.Live 2017, Intel unveiled their Nervana Neural Network Processor (NNP), formerly known as Lake Crest, and announced plans to ship first silicon before the end of 2017. As a high-performance ASIC custom-designed and optimized for deep learning workloads, the NNP is the first generation of a new Intel product family, oriented for neural network training. From the beginning, the NNP and its Nervana Engine predecessor have aimed at displacing GPUs in the machine learning and AI space, where applications can range from weather prediction and autonomous vehicles to targeted advertising on social media.
Under development for the past three and a half years, the NNP originated as the Nervana Engine deep learning ASIC, which was announced in May 2016 and had all the marquee features of the NNP: HBM2, FlexPoint, cacheless software memory management, and a high-speed interconnect. Not too long after, Nervana was acquired by Intel in August 2016. Later that November during Intel’s AI Day, the Nervana Engine rematerialized to the public as Lake Crest, with first silicon due in 1H 2017. In that sense, the product has been delayed, although Intel noted that preliminary silicon exists today. Nevertheless, Intel commented that the NNP will be initially delivered to select customers, of which Facebook is one. In fact Intel has outright stated that they collaborated with Facebook in developing the NNP.
In terms of the bigger picture, while the past year has seen many announcements on neural network hardware accelerators, it is important to note that these processors and devices operate at different performance levels with different workloads and scenarios, and consequently machine learning performance consists of more than a single operation or metric. Accelerators may be on the sensor module or device itself (also known as on the ‘edge’) or farther away in the datacenters and the ‘cloud.’ Certain hardware may be training deep neural network models, a computationally intensive task, and/or running inference, applying these trained network models and putting them into practice. For Intel's NNP today, the coprocessor is aimed at the datacenter training market, competing with solutions like NVIDIA’s high-performance Volta-based Tesla products.
This segmentation can be seen in Intel’s own AI product stack, which includes Movidius hardware for computer vision and Altera for FPGAs, as well as Mobileye for automotive. The stack is segmented further within the datacenter, which formally encompasses Xeon, Xeon Phi, Arria FPGAs, and now the NNP. For the NNP family, although the product announced today is a discrete accelerator, the in-development successor Knights Crest will be a bootable Xeon processor with integrated Nervana technology. While Intel referred to an internal NNP product roadmap and mentioned multiple NNP generations in the pipeline, it is not clear whether the next-generation NNP will be based on Knights Crest or an enhanced Lake Crest.
On the technical side of matters, the details remain the same from previous reports. Intel states that the NNP does not have a "standard cache hierarchy"; however, it does still have on-chip memory for performance reasons (I expect serving as registers and the like). Managing that memory is done by software, taking advantage of the fact that in deep learning workloads the operations and memory accesses are mostly known before execution. In turn, the lack of cache controllers and coherency logic frees up die space. Otherwise, for off-die memory, the processor has 32GB of HBM2 (four 8-Hi stacks of 1GB dies, or 8GB per stack) on the shared interposer, resulting in 8 terabits/s of access bandwidth.
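For a quick sanity check on that bandwidth figure, here is a minimal back-of-the-envelope sketch, assuming standard HBM2 parameters (a 1024-bit interface per stack and a 2 Gbps per-pin data rate, neither of which Intel has confirmed for the NNP):

```python
# Rough HBM2 bandwidth arithmetic (assumed parameters, not confirmed by Intel).
STACKS = 4              # four HBM2 stacks on the shared interposer
BUS_WIDTH_BITS = 1024   # standard HBM2 interface width per stack
PIN_RATE_GBPS = 2.0     # assumed per-pin data rate

total_gbps = STACKS * BUS_WIDTH_BITS * PIN_RATE_GBPS   # aggregate gigabits per second
print(f"{total_gbps / 1000:.1f} Tb/s, ~{total_gbps / 8000:.1f} TB/s")
# -> 8.2 Tb/s, ~1.0 TB/s -- in line with the quoted 8 terabits/s
```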
Bringing to mind Google's TPU and NVIDIA's Tensor Cores, the NNP's tensor-based architecture is another example of how optimizations for deep learning workloads are reflected in the silicon. The NNP also utilizes Nervana’s numerical format called FlexPoint, described as in-between floating point and fixed point precision. Essentially, a shared exponent is used for blocks of data so that scalar computations can be implemented as fixed-point multiplications and additions. In turn, this allows the multiply-accumulate circuits to be shrunk and the design made denser, increasing the NNP’s parallelism while reducing power. And according to Intel, the cost of lower precision is mitigated by the inherent noise in inferencing.
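Intel has not published the full FlexPoint specification, but the description above is essentially block floating point. The sketch below is a generic illustration of that idea, not Intel's actual format: each block of values shares a single exponent, the mantissas are stored as fixed-point integers, and the multiply-accumulate runs entirely in integer arithmetic before a single rescale at the end.

```python
import numpy as np

def to_block_fixed(x, mantissa_bits=16):
    """Quantize a block of floats to integer mantissas plus one shared exponent.

    Generic block-floating-point sketch; FlexPoint's real encoding is not public.
    """
    max_int = 2 ** (mantissa_bits - 1) - 1
    # Pick the shared exponent so the largest magnitude in the block fits the mantissa range.
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30))) - (mantissa_bits - 1)
    mantissas = np.clip(np.round(x * 2.0 ** -shared_exp), -max_int, max_int).astype(np.int64)
    return mantissas, shared_exp

def block_dot(a, b):
    """Dot product of two blocks using only integer multiply-accumulates."""
    ma, ea = to_block_fixed(a)
    mb, eb = to_block_fixed(b)
    acc = int(np.dot(ma, mb))          # pure fixed-point MAC, as on the hardware
    return acc * 2.0 ** (ea + eb)      # one rescale using the shared exponents

a, b = np.random.randn(64), np.random.randn(64)
print(block_dot(a, b), np.dot(a, b))   # the two results should agree closely
```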
The focus on parallelism continues with the NNP’s proprietary high-bandwidth low-latency chip-to-chip interconnect in the form of 12 bi-directional links. Additionally, the interconnect uses a fabric on the chip that includes the links, such that inter-ASIC and intra-ASIC communications are functionally identical "from a programming perspective." This permits the NNP to support true model parallelism as compute can be effectively distributed, taking advantage of the parallel nature of deep neural networks. Additional processors can combine to act as a single virtual processor with near linear speedup, where, for example, 8 ASICs could be combined in a torus configuration as shown above.
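As a toy illustration of the model parallelism being described, the sketch below splits one dense layer's weight matrix column-wise across several "devices", runs the partial matrix multiplies independently, and gathers the results. The devices here are just NumPy slices and the interconnect is elided, so this shows only the partitioning idea, not the NNP's actual programming model.

```python
import numpy as np

def split_layer_forward(x, weight, num_devices=8):
    """Column-wise model parallelism for a single dense layer (toy sketch).

    Each 'device' owns a slice of the output columns; on real hardware the
    shards would live on separate chips and the gather would run over the
    chip-to-chip links rather than inside one process.
    """
    shards = np.array_split(weight, num_devices, axis=1)   # one weight shard per device
    partials = [x @ shard for shard in shards]             # independent local matmuls
    return np.concatenate(partials, axis=1)                # gather step over the interconnect

x = np.random.randn(32, 512)        # a batch of activations
w = np.random.randn(512, 2048)      # the full layer weights
assert np.allclose(split_layer_forward(x, w), x @ w)       # matches the single-device result
```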
Presumably, the NNP will be fabricated on the TSMC 28nm process that Lake Crest was intended for; just after the acquisition, the Nervana CEO noted that production of the 28nm TSMC processors was still planned for Q1 2017. In any case, 16nm was explicitly mentioned as a future path when the Nervana Engine was first announced, and the CEO had also expressed interest in not only Intel’s 14nm processes, but also its 3D XPoint technology.
Source: Intel
25 Comments
Yojimbo - Wednesday, October 25, 2017 - link
"The big players (Intel, AMD, and Nvidia) already have optimized backends for popular frameworks like Caffe, Caffe2, TensorFlow, etc. The smaller players (Movidius, for one) have proprietary frameworks that let you import models from the standard frameworks."I'm talking about development libraries for training, not about inferencing. TensorFlow has to call libraries to perform the math operations involved with the training algorithms. The implementation of these libraries can make or break the performance of TensorFlow on a particular architecture. It takes expertise, hard work, and time to make math libraries that optimally take advantage of a particular architecture in a wide range of use cases. You need someone who has the technical knowledge of this type of programming, intimately knows the architecture, and intimately knows the mathematics.
XiroMisho - Saturday, October 21, 2017 - link
What are its HPS capabilities, and when this comes out will miners finally start buying these and leave our damn video cards alone?! XD

marvdmartian - Monday, October 23, 2017 - link
Neural Net Processors? Isn't that what they put in the Terminators??

So Intel is actually Cyberdyne Systems???
Time to stock up on ammo!
mode_13h - Wednesday, October 25, 2017 - link
No, it's not time to stock up on ammo. Generalized AI is still a ways off. We're unlikely to see it in less than 10 years, but I'd give it a bit longer.
Yojimbo - Thursday, October 26, 2017 - link
I think generalized AI will take a new technology. I don't think we will achieve abstract reasoning or complex contextual understanding with current methods. Current methods are good at pattern matching. So it could be 10 years later or it could be 100 years later.