dc.contributor.author: Al Hasib, Abdullah
dc.contributor.author: Natvig, Lasse
dc.contributor.author: Kjeldsberg, Per Gunnar
dc.contributor.author: Cebrian, Juan Manuel
dc.date.accessioned: 2018-08-22T11:09:58Z
dc.date.available: 2018-08-22T11:09:58Z
dc.date.created: 2018-01-09T14:49:03Z
dc.date.issued: 2017
dc.identifier.issn: 2079-9268
dc.identifier.uri: http://hdl.handle.net/11250/2558826
dc.description.abstract: Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel® Core™ i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel® Xeon Phi™ many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction sets. Results using a single-threaded execution on the Haswell and Sandy Bridge systems show that performance and EDP (Energy Delay Product) can be improved through data reuse transformations on the scalar code by a factor of ≈3× and ≈6×, respectively. Compared to scalar code without data reuse optimization, the SSE/AVX2 version achieves ≈10×/17× better performance and ≈92×/307× better EDP, respectively. These results can be improved by 10% to 15% using data reuse techniques. Finally, the most optimized version using data reuse and AVX512 achieves a speedup of ≈35× and an EDP improvement of ≈1192× on the Xeon Phi system. While single-threaded execution serves as a common reference point for all architectures to analyze the effects of data reuse on both scalar and vector codes, scalability with thread count is also discussed in the paper. [nb_NO]
dc.language.iso: eng [nb_NO]
dc.publisher: MDPI [nb_NO]
dc.rights: Navngivelse 4.0 Internasjonal
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/deed.no
dc.title: Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study [nb_NO]
dc.type: Journal article [nb_NO]
dc.type: Peer reviewed [nb_NO]
dc.description.version: publishedVersion [nb_NO]
dc.source.volume: 7 [nb_NO]
dc.source.journal: Journal of Low Power Electronics and Applications [nb_NO]
dc.source.issue: 1 [nb_NO]
dc.identifier.doi: 10.3390/jlpea7010005
dc.identifier.cristin: 1538999
dc.description.localcode: (C) 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). [nb_NO]
cristin.unitcode: 194,63,10,0
cristin.unitcode: 194,63,35,0
cristin.unitname: Institutt for datateknologi og informatikk
cristin.unitname: Institutt for elektroniske systemer
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1
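
Note on the abstract above: the EDP (Energy Delay Product) figures it reports are the product of energy consumed and execution time, so lower is better. The sketch below is a minimal, hypothetical illustration in C with SSE2 intrinsics, not code from the paper; the function name sad_16x16 and the stride parameter are assumptions for illustration. It shows the innermost operation of a full-search motion estimation kernel, the sum of absolute differences (SAD) between a 16×16 block of the current frame and one candidate block in the reference frame, which is the kind of computation the paper vectorizes with SSE, AVX2, and AVX-512 and restructures through data reuse transformations (overlapping search-window data is a typical source of the temporal locality the abstract mentions).

/* Hypothetical sketch, not the authors' code: SAD of one 16x16 block,
 * the innermost operation of full-search motion estimation.
 * Uses only SSE2 intrinsics (emmintrin.h); wider AVX2/AVX-512 registers
 * would process more pixels per instruction in the same pattern. */
#include <emmintrin.h>
#include <stdint.h>

static inline uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        /* one 16-pixel row of the current block and of the candidate reference block */
        __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * stride));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * stride));
        /* _mm_sad_epu8 yields two 64-bit lanes holding partial sums of absolute differences */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
    }
    /* fold the two partial sums into a single scalar SAD value */
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

In a full-search kernel a routine like this would be evaluated for every candidate motion vector in the search window; how often the current block and the overlapping reference data can be reused from registers or cache, rather than re-fetched from memory, is what the data reuse transformations studied in the paper target.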


Attribution 4.0 International (Navngivelse 4.0 Internasjonal)
Except where otherwise noted, this record is licensed under Attribution 4.0 International (Navngivelse 4.0 Internasjonal).