dc.contributor.advisor: Natvig, Lasse
dc.contributor.author: Jahre, Magnus
dc.date.accessioned: 2014-12-19T13:32:03Z
dc.date.available: 2014-12-19T13:32:03Z
dc.date.created: 2010-09-03
dc.date.issued: 2007
dc.identifier: 347565
dc.identifier: ntnudaim:3263
dc.identifier.uri: http://hdl.handle.net/11250/250592
dc.description.abstract: Chip Multiprocessors (CMPs), or multi-core architectures, are a new class of processor architectures in which multiple processing cores are placed on the same physical chip. To realize the performance potential of these architectures with a single application, the application must be multi-threaded. In such applications, the processing cores cooperate to solve a single task, and in many cases this requires a large amount of inter-processor communication. Consequently, CMPs need to support this communication efficiently. Investigating inter-processor communication in CMPs requires a good understanding of the state of the art in CMP design options, interconnect network design and cache coherence protocol solutions. Furthermore, a good computer architecture simulator is needed to evaluate both new and conventional architectural solutions. The M5 simulator is used for this purpose and has been extended with a generic split-transaction bus, a crossbar based on the IBM Power 5 crossbar, a butterfly network and an ideal interconnect. The unrealistic ideal interconnect provides an upper bound on the performance improvement available from enhancing the interconnect. In addition, a directory-based coherence protocol proposed by Stenström has been implemented.

The performance of 2-, 4- and 8-core CMPs with crossbar and bus interconnects, private L1 caches and shared L2 caches is investigated. The bus and the crossbar are the conventional ways of implementing the L1-to-L2 cache interconnect. These configurations have been evaluated with multiprogrammed workloads from the SPEC2000 benchmark suite and parallel, scientific benchmarks from the SPLASH-2 benchmark suite. With multiprogrammed workloads, the crossbar interconnect configurations perform nearly as well as a configuration with an ideal interconnect. However, the performance of the crossbar CMPs is similar to that of the bus CMPs when there is intensive L1-to-L1 cache communication, the reason being limited L1-to-L1 bandwidth. The bus CMPs experience a severe performance degradation with some benchmarks for all processor counts and workload classes. A butterfly interconnect is proposed to alleviate the L1-to-L1 communication bottleneck. With 8 processor cores, the butterfly CMP performs on average 3.9 times better than the bus CMP and 3.8 times better than the crossbar CMP. These numbers are based on the WaterNSquared, Raytrace, Radix and LUNoncontig benchmarks, as the other SPLASH-2 benchmarks had issues with the M5 thread implementation for these configurations. For the multiprogrammed workloads, the butterfly CMPs are slightly slower than the crossbar CMPs.
dc.language: eng
dc.publisher: Institutt for datateknikk og informasjonsvitenskap
dc.subject: ntnudaim
dc.subject: SIF2 datateknikk
dc.subject: Komplekse datasystemer
dc.title: Improving the Performance of Parallel Applications in Chip Multiprocessors with Architectural Techniques
dc.type: Master thesis
dc.source.pagenumber: 312
dc.contributor.department: Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap

