Statistical modelling for arithmetic compression of text (Statistisk modellering for aritmetisk kompresjon av tekst)

Morkemo, Johannes Herman Havstad; Søreide, Hanne-Sofie Marie Scisly

dc.contributor.advisor	Hafting, Helge
dc.contributor.author	Morkemo, Johannes Herman Havstad
dc.contributor.author	Søreide, Hanne-Sofie Marie Scisly
dc.date.accessioned	2024-07-02T17:23:06Z
dc.date.available	2024-07-02T17:23:06Z
dc.date.issued	2024
dc.identifier	no.ntnu:inspera:233962665:234001286
dc.identifier.uri	https://hdl.handle.net/11250/3137494
dc.description	Full text not available
dc.description.abstract	This thesis deals with arithmetic compression and how it can make use of statistical models to increase the compression achieved. A simple statistical model for text compression can use the probability of a word in a text. More advanced models also take other information and characteristics of the text into account in order to estimate the probability of a word occurring with greater precision. Here we look at particular patterns in written text and try to exploit them to predict the next word with higher probability, and thereby achieve a greater degree of compression with an arithmetic coder. Predicting the next word is an artificial intelligence problem, which we try to solve here by devising relatively simple algorithms and seeing what effect they can achieve. The work is tied to a given implementation of an arithmetic coder written in the programming language C++, and existing structures in the program, as well as tools from a third-party library, have been important starting points for developing our statistical models. The result is a statistical model that uses a set of different statistics based on the content to be compressed. Examples are statistics on which words typically occur at different positions in a sentence, e.g. last, first, after a comma and after a numeral, and statistics on typical word pairs, where certain words have a relatively high probability of occurring after a given word. We have tested the model on well-known reference files for data compression and achieved an increase in compression of several percent. The effect of relevant model parameters is examined, both in terms of the achievable degree of compression and the cost of the model, given that it must be sent from the encoder to the decoder, and this trade-off is discussed in light of the quantitative results.
dc.description.abstract	This thesis is about arithmetic coding and how this compression method uses statistical models to improve its achievable degree of compression. A simple statistical model for text compression may use the probability of a word in a given text. More advanced models, on the other hand, may make use of additional information and characteristics of the text to estimate the probability of a word with a higher degree of precision. The current work looks into certain patterns of written text and attempts to exploit these to predict the next word in the text with a higher probability, and hence achieve a higher degree of compression using an arithmetic coder. Predicting the next word in a text is a problem of artificial intelligence, which we aim to solve with relatively simple algorithms. The work is tied to an existing implementation of an arithmetic coder written in the programming language C++, and structures included in this program, as well as tools provided by a third-party library, have been indispensable in developing the statistical model presented here. The model makes use of a set of different statistics based on the text that one intends to compress. As an example, these statistics may use positional information about a word in a sentence, e.g. whether it is the first or last word of the sentence, or whether it occurs before or after a comma or a number. Another example applies information about known word pairs, that is, cases where certain words have a stronger tendency to appear directly after a certain word. The model has been tested on well-known benchmark files for data compression, with sizes up to one gigabyte, and it achieves improved compression by several percent.
The effect of relevant model parameters is also studied, in terms of the achievable compression and the overhead introduced by the model, which has to be passed from encoder to decoder in the current implementation. This balancing act is discussed in the light of the quantitative results presented.
dc.language	nob
dc.publisher	NTNU
dc.title	Statistical modelling for arithmetic compression of text (Statistisk modellering for aritmetisk kompresjon av tekst)
dc.type	Bachelor thesis

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6704]
