Show simple item record

dc.contributor.advisor: Bratsberg, Svein Erik
dc.contributor.author: Midtgård, Kristian
dc.date.accessioned: 2017-10-25T14:00:59Z
dc.date.available: 2017-10-25T14:00:59Z
dc.date.created: 2017-07-15
dc.date.issued: 2017
dc.identifier: ntnudaim:17407
dc.identifier.uri: http://hdl.handle.net/11250/2462187
dc.description.abstract: The number of internet-connected devices is rapidly expanding. Embedded with various sensors, these devices are generating ever-increasing amounts of data, causing a shift towards more write-intensive workloads for the underlying database systems. To see the extent to which existing database systems are able to ingest data at scale, this thesis considers a world where everything is connected and continuously sending its location to a central database. From this world, three applications are defined in order to obtain a set of requirements for the database. The applications are designed so that the rate at which data points are generated exceeds that of giants like Facebook and Google. At peak load, around 600 million write requests to the database are generated every second. This thesis examines the NoSQL landscape, a paradigm focused on high performance, availability and scalability for databases, in an attempt to identify the best practices for achieving high write throughput. Several different types of NoSQL databases are widely used in production today. Some of the most promising ones are examined in more detail and evaluated against the requirements of the applications. In addition to these general-purpose databases, a specific type of database intended for time series data is also considered. The most notable similarity between the top NoSQL databases is the use of the LSM-tree data structure. LSM-trees obtain high write throughput by transforming small random writes into larger sequential writes, minimizing the need for expensive disk operations. The performance of these databases grows beyond the capacity of a single machine by partitioning data by rows and automatically distributing the partitions among nodes in a dynamic cluster. Although most NoSQL databases promise linear scalability, most will encounter practical limitations due to the use of a master server for central coordination.
Fully decentralized and eventually consistent NoSQL databases like Cassandra are the candidates most likely to scale to accommodate 600 million writes per second. Time series databases are shown to provide a much higher write throughput per node than general-purpose databases. However, the lack of query functionality and of indexing on attributes other than time makes time series databases less suitable if the applications require a significant portion of reads.
dc.language: eng
dc.publisher: NTNU
dc.subject: Computer Science, Databases and Search
dc.title: Collecting massive amounts of location data in a NoSQL database - Identifying best practices for high write throughput
dc.type: Master thesis

