bit.ly/3PA2a87
Preview of the meta tags behind this bit.ly short link, which resolves to an IEEE Xplore article page.
Linked Hostnames: 2

Search Engine Appearance
Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads
In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, a functionally-sliced microarchitecture with memory units interleaved with vector and matrix deep learning functional units in order to take advantage of the dataflow locality of deep learning operations. The TSP is built on two key observations: (1) machine learning workloads exhibit abundant data parallelism, which can be readily mapped to tensors in hardware, and (2) a simple, deterministic processor with a producer-consumer stream programming model enables precise reasoning about and control of hardware components, achieving good performance and power efficiency. The TSP is designed to exploit the parallelism inherent in machine-learning workloads, including instruction-level parallelism, memory concurrency, and data and model parallelism, while guaranteeing determinism by eliminating all reactive elements in the hardware (e.g., arbiters and caches). Early ResNet50 image classification results demonstrate 20.4K processed images per second (IPS) at a batch size of one, a $4\times$ improvement over other modern GPUs and accelerators [44]. Our first ASIC implementation of the TSP architecture yields a computational density of more than 1 TeraOp/s per square mm of silicon for its $25 \times 29$ mm 14 nm chip operating at a nominal clock frequency of 900 MHz. The TSP demonstrates a novel hardware-software approach to achieving fast yet predictable performance on machine-learning workloads within a desired power envelope.
Bing
Same title and description as shown under Search Engine Appearance above.
DuckDuckGo
Same title and description as shown under Search Engine Appearance above.
General Meta Tags (12)
- title: Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | IEEE Conference Publication | IEEE Xplore
- google-site-verification: qibYCgIKpiVF_VVjPYutgStwKn-0-KBB6Gw4Fc57FZg
- Description: In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, a functionally-sliced microarchitecture with memory units interleaved with vector …
- Content-Type: text/html; charset=utf-8
- viewport: width=device-width, initial-scale=1.0
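Reconstructed as markup, these general tags would sit in the page's <head> roughly as follows. This is a sketch assembled from the name/value pairs above; attribute ordering is assumed, and the Description value is shortened here just as it is in the report.

```html
<head>
  <!-- <title> is an element rather than a meta tag, but preview tools list it alongside them -->
  <title>Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | IEEE Conference Publication | IEEE Xplore</title>
  <meta name="google-site-verification" content="qibYCgIKpiVF_VVjPYutgStwKn-0-KBB6Gw4Fc57FZg">
  <meta name="Description" content="In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, ...">
  <!-- Content-Type is declared via http-equiv rather than name -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
```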
Open Graph Meta Tags (3)
- og:image: https://ieeexplore.ieee.org/assets/img/ieee_logo_smedia_200X200.png
- og:title: Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads
- og:description: In this paper, we introduce the Tensor Streaming Processor (TSP) architecture … (full abstract, identical to the description shown under Search Engine Appearance above)
Twitter Meta Tags (1)
- twitter:card: summary
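Taken together, the Open Graph and Twitter Card entries above correspond to markup along these lines (a sketch; the og:description value is the same abstract text and is abbreviated here):

```html
<meta property="og:image" content="https://ieeexplore.ieee.org/assets/img/ieee_logo_smedia_200X200.png">
<meta property="og:title" content="Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads">
<meta property="og:description" content="In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, ...">
<meta name="twitter:card" content="summary">
```

Since only twitter:card is set, crawlers that understand both vocabularies would typically fall back to the og:title, og:description, and og:image values when rendering a summary card.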
Link Tags (9)
- canonical: https://ieeexplore.ieee.org/document/9138986/
- icon: /assets/img/favicon.ico
- stylesheet: https://ieeexplore.ieee.org/assets/css/osano-cookie-consent-xplore.css
- stylesheet: /assets/css/simplePassMeter.min.css?cv=20250520_00000
- stylesheet: /assets/dist/ng-new/styles.css?cv=20250520_00000
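The link tags would likewise appear in the <head> along these lines (a sketch covering the five of nine entries shown in the report; the rel values follow the labels above):

```html
<link rel="canonical" href="https://ieeexplore.ieee.org/document/9138986/">
<link rel="icon" href="/assets/img/favicon.ico">
<link rel="stylesheet" href="https://ieeexplore.ieee.org/assets/css/osano-cookie-consent-xplore.css">
<link rel="stylesheet" href="/assets/css/simplePassMeter.min.css?cv=20250520_00000">
<link rel="stylesheet" href="/assets/dist/ng-new/styles.css?cv=20250520_00000">
```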
Links (17)
- http://www.ieee.org/about/help/security_privacy.html
- http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html
- https://bit.ly/Xplorehelp
- https://bit.ly/Xplorehelp/overview-of-ieee-xplore/about-ieee-xplore
- https://bit.ly/Xplorehelp/overview-of-ieee-xplore/accessibility-statement