Auto-tunable GPU BLAS

Steinsland, Jarle Erdal

dc.contributor.advisor	Elster, Anne Cathrine	nb_NO
dc.contributor.author	Steinsland, Jarle Erdal	nb_NO
dc.date.accessioned	2014-12-19T13:37:29Z
dc.date.available	2014-12-19T13:37:29Z
dc.date.created	2011-09-28	nb_NO
dc.date.issued	2011	nb_NO
dc.identifier	444222	nb_NO
dc.identifier	ntnudaim:5840	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/252550
dc.description.abstract	OpenCL is fast becoming the preferred framework for writing programs for heterogeneous platforms consisting of at least one CPU and one or more accelerators. Because GPUs are readily available in almost all computers, they are the most common accelerator in use. Good libraries are important to reduce development time and to make particular development environments, such as OpenCL, useful for the masses. An OpenCL program can execute on any device that supports OpenCL; however, to achieve optimal performance, it must be optimized for a specific device. Auto-tuning is a strategy for automatically generating candidate programs and finding one that performs well on a specific device, without requiring the user to perform optimizations manually. BLAS contains routines that are useful for many algorithms suited to GPUs, and is a good candidate for a library that can prove useful to many OpenCL programmers. In this thesis, we have chosen to implement the matrix multiplication routine from BLAS, because a fast matrix multiplication is important for the performance of many higher-level linear algebra algorithms. The exact operation we have implemented is $C = \alpha A B + \beta C$, where A, B and C are $M \times K$, $K \times N$ and $M \times N$ matrices, respectively. We implement an auto-tuning framework that generates source code for OpenCL kernels and finds the best one for the device it is being executed on. We compare our version with ViennaCL, an OpenCL BLAS library, and with the vendor BLAS libraries provided by AMD and NVIDIA. In general, our version provides approximately 85% of the performance of the vendor-specific library provided by NVIDIA, and gives a speedup, usually between 1.5 and 2, over the native library provided by AMD. On both platforms our version outperforms ViennaCL by a large margin.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.subject	ntnudaim:5840	no_NO
dc.subject	MTDT datateknikk	no_NO
dc.subject	Komplekse datasystemer	no_NO
dc.title	Auto-tunable GPU BLAS	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	76	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO
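
The operation named in the abstract, C = alpha * A * B + beta * C, and the idea of auto-tuning can be illustrated with a minimal sketch. This is plain Python rather than OpenCL and is not the thesis's framework: the function names `gemm`, `gemm_tiled` and `autotune`, the tile sizes and the sample matrices are all hypothetical, chosen only for illustration. The sketch assumes that the tunable parameter is a tile size and that a candidate is judged by a single timed run.

```python
import time

def gemm(alpha, A, B, beta, C):
    """Reference C = alpha*A*B + beta*C for list-of-lists matrices."""
    M, K, N = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * C[i][j]
             for j in range(N)] for i in range(M)]

def gemm_tiled(alpha, A, B, beta, C, tile):
    """Same operation computed tile by tile; `tile` is the tunable parameter."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(N)] for i in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = 0
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        out[i][j] += alpha * acc
    return out

def autotune(candidates, run):
    """Time each candidate once and return the fastest one."""
    timings = {}
    for c in candidates:
        start = time.perf_counter()
        run(c)
        timings[c] = time.perf_counter() - start
    return min(timings, key=timings.get)

A = [[1, 2, 3], [4, 5, 6]]      # M x K = 2 x 3
B = [[1, 0], [0, 1], [1, 1]]    # K x N = 3 x 2
C = [[1, 1], [1, 1]]            # M x N = 2 x 2

result = gemm(2, A, B, 3, C)    # [[11, 13], [23, 25]]
best_tile = autotune([1, 2, 4], lambda t: gemm_tiled(2, A, B, 3, C, t))
```

Every candidate computes the same result; only the running time differs, which is why the selection can be automated. On matrices this small the chosen tile size is essentially noise, and a real tuner would use realistic problem sizes and repeated measurements on the target device.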

Files in this item

Name: 444222_FULLTEXT01.pdf
Size: 1.756Mb
Format: PDF

Locked

Name: 444222_COVER01.pdf
Size: 46.83Kb
Format: PDF

Locked

This item appears in the following Collection(s)

Institutt for datateknologi og informatikk [6552]
