Show simple item record

dc.contributor.advisor: Gligoroski, Danilo
dc.contributor.advisor: Kralevska, Katina
dc.contributor.author: Midthaug, Ingvild Hovdelien
dc.date.accessioned: 2018-12-06T15:02:36Z
dc.date.available: 2018-12-06T15:02:36Z
dc.date.created: 2018-06-05
dc.date.issued: 2018
dc.identifier: ntnudaim:19764
dc.identifier.uri: http://hdl.handle.net/11250/2576513
dc.description.abstract: The global amount of digital data is increasing rapidly. Internet-connected devices generate massive amounts of data through interactions such as digital communication and file sharing, and together these interactions produce Big Data sets. The data sets can be analyzed and used for purposes such as personalized marketing and health research. In order to analyze and utilize the data, it has to be transferred and stored reliably. Failures in storage systems happen frequently, so mechanisms for reliable data storage are needed. The Hadoop software provides a distributed file system that achieves reliable data storage through different coding techniques. This thesis presents different mechanisms for reliable data storage in Hadoop, namely erasure coding (Reed-Solomon codes) and triple-replication, and gives a practical implementation of an experimental Hadoop environment. The performance of the mechanisms is then tested and compared. The performance parameters considered are the time of file recovery and the amount of network traffic during file recovery. Factors affecting the performance, such as file size and block size, are also considered. The test setup includes a wired Ethernet connection, a configured multi-node Hadoop cluster, a managed network switch, and a network analysis tool. The obtained results show the impact of the different factors on the Hadoop cluster's performance during node failure. In general, the results confirm the theory: both the time of recovery and the network traffic during recovery increase with the file size, the time of recovery for erasure coding increases with the code length, and a block size of 128 MB gives the best overall performance. Finally, optimized erasure coding variants that could further improve cluster performance are presented in related work and suggested as future work.
dc.language: eng
dc.publisher: NTNU
dc.subject: Communication Technology, Information Security
dc.title: Hadoop, and Its Mechanisms for Reliable Storage
dc.type: Master thesis
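For context, the two storage mechanisms the abstract compares correspond directly to settings in the Hadoop 3 client API. The following is a minimal sketch, assuming a reachable HDFS 3.x cluster and the Hadoop client libraries on the classpath; the directory paths and the specific RS-6-3-1024k policy (6 data blocks, 3 parity blocks per stripe) are illustrative assumptions, not details taken from the thesis.

    // Minimal sketch, not taken from the thesis: applying triple-replication
    // and a Reed-Solomon erasure coding policy to HDFS paths. Paths and the
    // RS-6-3-1024k policy name are example choices.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ReliableStorageSketch {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS in core-site.xml points at the NameNode.
            Configuration conf = new Configuration();
            DistributedFileSystem dfs =
                    (DistributedFileSystem) new Path("/").getFileSystem(conf);

            // Triple-replication: each block of the file is stored on three
            // DataNodes, tolerating two node failures at 3x storage overhead.
            dfs.setReplication(new Path("/data/replicated-file"), (short) 3);

            // Reed-Solomon erasure coding: 6 data + 3 parity blocks per stripe
            // tolerate three lost blocks at only 1.5x storage overhead. The
            // policy must first be enabled cluster-wide, e.g. with
            // "hdfs ec -enablePolicy -policy RS-6-3-1024k".
            dfs.setErasureCodingPolicy(new Path("/data/ec-dir"), "RS-6-3-1024k");

            dfs.close();
        }
    }

The two settings trade storage overhead against recovery cost: a lost replicated block is restored by copying a surviving replica, whereas a lost erasure-coded block must be recomputed from the remaining blocks in its stripe, which is what drives the differences in recovery time and network traffic that the thesis measures.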

