Collecting massive amounts of location data in a NoSQL database - Identifying best practices for high write throughput

Midtgård, Kristian

dc.contributor.advisor	Bratsberg, Svein Erik
dc.contributor.author	Midtgård, Kristian
dc.date.accessioned	2017-10-25T14:00:59Z
dc.date.available	2017-10-25T14:00:59Z
dc.date.created	2017-07-15
dc.date.issued	2017
dc.identifier	ntnudaim:17407
dc.identifier.uri	http://hdl.handle.net/11250/2462187
dc.description.abstract	The number of internet-connected devices is rapidly expanding. Embedded with various sensors, these devices generate ever-increasing amounts of data, causing a shift towards more write-intensive workloads for the underlying database systems. To see to what extent existing database systems can ingest data at scale, this thesis considers a world where everything is connected and continuously sending its location to a central database. From this world, three applications are defined in order to obtain a set of requirements for the database. The applications are designed so that the rate at which data points are generated exceeds that of giants like Facebook and Google. At peak load, around 600 million write requests to the database are generated every second. This thesis examines the NoSQL landscape, a paradigm focused on high performance, availability and scalability for databases, in an attempt to identify the best practices for achieving high write throughput. Several different types of NoSQL databases are widely used in production today. Some of the most promising ones are examined in more detail and evaluated against the requirements of the applications. In addition to these general-purpose databases, a specific type of database intended for time series data is also considered. The most notable similarity between the top NoSQL databases is the use of the LSM-tree data structure. LSM-trees obtain high write throughput by transforming small random writes into larger sequential writes, minimizing the need for expensive disk operations. These databases scale beyond the capacity of a single machine by partitioning data by rows and automatically distributing the partitions among nodes in a dynamic cluster. Although most NoSQL databases promise linear scalability, most will encounter practical limitations due to the use of a master server for central coordination.
Fully decentralized and eventually consistent NoSQL databases like Cassandra are the candidates most likely to scale to accommodate 600 million writes per second. Time series databases are shown to provide a much higher write throughput per node than general-purpose databases. However, the lack of query functionality and of indexing on attributes other than time makes time series databases less suitable if the applications require a significant portion of reads.
dc.language	eng
dc.publisher	NTNU
dc.subject	Computer Science, Databases and Search
dc.title	Collecting massive amounts of location data in a NoSQL database - Identifying best practices for high write throughput
dc.type	Master thesis
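
The abstract notes that LSM-trees reach high write throughput by turning small random writes into larger sequential writes. The following is a minimal sketch of that idea, not the thesis's implementation or any particular database's design. The class name `TinyLSM`, the `memtable_limit` threshold, and the use of in-memory lists to stand in for on-disk sorted files are all assumptions made for illustration.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM-tree: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # absorbs small random writes
        self.runs = []                        # sorted runs, oldest first

    def put(self, key, value):
        # Writes only touch the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Stand-in for one large sequential write: the whole buffer,
        # sorted by key, appended as a new immutable run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer runs shadow older ones, so search newest first.
        for run in reversed(self.runs):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; later (newer) values overwrite older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("device-7", (63.42, 10.39))
db.put("device-3", (59.91, 10.75))  # buffer is full, so it is flushed
db.put("device-7", (63.43, 10.40))  # newer location stays in the buffer
latest = db.get("device-7")
```

Reads pay for the cheap writes: a lookup may have to consult the buffer and several runs, which is why real systems compact runs in the background.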

Associated file(s)

Filename: 17407_FULLTEXT.pdf
Size: 1.435Mb
Format: PDF

Filename: 17407_COVER.pdf
Size: 1.556Mb
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6552]