Show simple item record

dc.contributor.advisor: Soylu, Ahmet
dc.contributor.advisor: Roman, Dumitru
dc.contributor.advisor: Matskin, Mihhail
dc.contributor.advisor: Prodan, Radu Aurel
dc.contributor.author: Khan, Akif Quddus
dc.date.accessioned: 2022-07-16T17:21:28Z
dc.date.available: 2022-07-16T17:21:28Z
dc.date.issued: 2022
dc.identifier: no.ntnu:inspera:106263327:50172406
dc.identifier.uri: https://hdl.handle.net/11250/3006199
dc.description.abstract: Big data pipelines are developed to process big data and turn it into useful information. They are designed to support one or more of the three big data characteristics commonly known as the three Vs (Volume, Velocity, and Variety). Unlike smaller data pipelines, big data pipelines can extract, transform, and load massive volumes of data. Implementing a big data pipeline involves several aspects of the computing continuum, such as computing resources, data transmission channels, triggers, data transfer methods, integration of message queues, etc., which makes the implementation process difficult. The design becomes even more complex if a data pipeline is coupled to data storage, such as a distributed file system, which brings additional challenges such as data maintenance, security, and scalability. In contrast, many cloud storage services, such as Amazon Simple Storage Service (S3), offer nearly unlimited storage with strong fault tolerance, providing a possible solution to big data storage concerns. Moving data to cloud storage, or storage-as-a-service (StaaS), thus shifts the overhead of data redundancy, backup, scalability, security, etc. to the cloud service provider, which makes implementing a big data pipeline considerably easier. The work presented in this thesis aims to 1) realize big data pipelines with hybrid infrastructure, i.e., computation on a local server combined with storage-as-a-service; 2) develop a ranking algorithm that finds the most suitable storage facility in real time based on the user's requirements; and 3) develop and demonstrate the use of a domain-specific language to deploy a big data pipeline using StaaS. A novel architecture is proposed to realize the big data pipeline with hybrid infrastructure.
In addition, an evaluation matrix is proposed to rank all available storage options based on five parameters: cost, physical distance, network performance, impact of server-side encryption, and user-defined weights. Further, to simplify the deployment process, a new domain-specific language has been developed with an extensive vocabulary covering all major aspects of the cloud continuum in general, and cloud storage in particular. This thesis demonstrates the effectiveness of cloud storage by implementing the big data pipeline using the newly proposed architecture. Moreover, it justifies the importance of the individual parameters in the evaluation matrix, such as cost, physical distance, network performance, and the impact of server-side encryption, through a series of experiments. It also discusses different approaches for validating the new ranking algorithm and evaluation matrix.
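The ranking described in the abstract combines several storage parameters with user-defined weights. A minimal sketch of one way such a weighted evaluation matrix could work is shown below; the parameter names, min-max normalization, and example values are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch of a weighted evaluation matrix for ranking storage
# options. Parameter names, normalization scheme, and numbers are assumed
# for illustration only.

def rank_storage_options(options, weights):
    """Rank storage options by a weighted sum of normalized parameters.

    For every parameter here (cost, distance, latency, encryption
    overhead) a lower raw value is better, so normalized values are
    inverted and the option with the highest total score ranks first.
    """
    params = ["cost", "distance", "latency", "encryption_overhead"]
    # Min-max bounds for each parameter across all candidate options.
    lo = {p: min(o[p] for o in options) for p in params}
    hi = {p: max(o[p] for o in options) for p in params}

    def score(option):
        total = 0.0
        for p in params:
            span = hi[p] - lo[p]
            norm = (option[p] - lo[p]) / span if span else 0.0
            total += weights[p] * (1.0 - norm)  # invert: lower raw value is better
        return total

    return sorted(options, key=score, reverse=True)

# Illustrative candidate storage regions (all values assumed).
options = [
    {"name": "eu-west",  "cost": 0.023, "distance": 800,  "latency": 35,  "encryption_overhead": 0.05},
    {"name": "us-east",  "cost": 0.021, "distance": 6000, "latency": 110, "encryption_overhead": 0.05},
    {"name": "eu-north", "cost": 0.022, "distance": 300,  "latency": 20,  "encryption_overhead": 0.07},
]
# User-defined weights expressing how much each parameter matters.
weights = {"cost": 0.3, "distance": 0.2, "latency": 0.4, "encryption_overhead": 0.1}

ranked = rank_storage_options(options, weights)
print([o["name"] for o in ranked])  # best-suited option first
```

In this sketch the user's priorities enter only through the weight vector, so re-ranking for a different user (or as live network measurements change) is just a re-scoring pass over the same options, which matches the abstract's goal of real-time storage selection.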
dc.language: eng
dc.publisher: NTNU
dc.title: Smart Data Placement for Big Data Pipelines with Storage-as-a-Service Integration
dc.type: Master thesis


Associated file(s)


This item appears in the following collection(s)
