Unsupervised Anomaly Detection based on Machine Learning

Heggelund, Simone

dc.contributor.advisor	Jørn Vatn
dc.contributor.author	Heggelund, Simone
dc.date.accessioned	2019-10-04T14:00:17Z
dc.date.available	2019-10-04T14:00:17Z
dc.date.issued	2019
dc.identifier.uri	http://hdl.handle.net/11250/2620410
dc.description.abstract	The commitment to digitalization and to implementing the concepts that make up Industry 4.0 has never been greater, with condition monitoring, sensor technology and predictive maintenance strategies in focus. This thesis aims to develop an anomaly detector based on machine learning through the analysis of continuous time series data. The system under study is a three-stage centrifugal compressor for air compression, equipped with a total of 21 sensors that monitor the system continuously. Critical production equipment is monitored continuously with the intention of detecting anomalies before system failure occurs, which means that data describing failure history and degradation mechanisms is very limited. Implementing predictive maintenance therefore requires an anomaly detector that does not rely on failure history or on indicators of the equipment's degradation level, so-called unsupervised anomaly detection. Based on fundamental machine learning concepts, a range of machine learning models and literature covering different approaches to unsupervised anomaly detection, a framework is presented for implementing the following approaches: anomaly detection based on residuals and anomaly detection based on clustering. Both approaches model the normal behavior of the system using machine learning and then flag behavior that deviates from it as anomalous. For the residual-based approach, three state-of-the-art machine learning models are reviewed and implemented: the Decision Tree Model, the Random Forest Model and Feedforward Neural Networks. These models aim to predict the air pressure based on a learned relationship with the other system variables. The size of the residuals between the predicted and actual pressure determines whether the behavior is classified as normal or anomalous, based on a predetermined confidence interval.
For the clustering-based approach, the K-Means clustering algorithm is reviewed and implemented; it aims to group the data into clusters with similar behavior. The clusters form a reference pattern for the normal behavior of the system, and data that does not overlap with the reference pattern is classified as anomalous, depending on a predetermined outlier fraction. In addition, a basic machine learning framework is presented, covering training, validation and testing of machine learning models, as well as basic concepts related to overfitting, underfitting and optimization of the models' hyperparameters. A comparison is then made of the performance achieved by the Decision Tree Model, the Random Forest Model and Feedforward Neural Networks. It is concluded that the Random Forest Model performs best, predicting the pressure with an accuracy of 0.98. The results from this model are then used to calculate the residuals, and for a test period of 6 days a total of 180 anomalies are detected out of 7233 possible. For the clustering-based anomaly detector, the same result is obtained with an outlier fraction of 0.02. A comparison of the two detectors shows that approximately the same segments are classified as anomalous, mainly large irregular segments that deviate strongly from what is assumed to be normal operating behavior. Optimizing the decision rules for classifying anomalies requires domain knowledge and knowledge of possible failure modes and anomalies, since failure history is not available. The decision boundaries must be optimized to reduce false positive and false negative detections, both of which reduce the reliability of the detector and increase the risk of system failure.
This thesis contributes an interdisciplinary approach to the analysis of continuous time series data, in which traditional methods from a RAMS perspective are combined with new research and literature from an IT and Artificial Intelligence perspective.
dc.description.abstract	In the context of Industry 4.0, emerging technologies have enabled a shift within the manufacturing sector, where data extracted from all relevant sources is the key driver of value creation. Deriving value from these streams of continuous time series data is the main focus of this master thesis, with the objective of building an unsupervised anomaly detector based on machine learning. The system under study is a three-stage centrifugal air compressor system, equipped with a total of 21 sensors monitoring the system continuously. The current state of data availability in industry is examined, concluding that failure history and labeled data indicating the health of the equipment are hard to obtain for complex systems. Hence, an unsupervised anomaly detector is required, one that does not rely on labeling or historical failures. Based on machine learning fundamentals, state-of-the-art machine learning models and several anomaly detection surveys, it is concluded that the following approaches will meet the objective: unsupervised anomaly detection based on residuals and unsupervised anomaly detection based on clustering. A framework for implementing these approaches is presented. Both build on the principle of modelling the normal behavior of the system and flagging samples that deviate from this behavior as anomalous. For the residual-based approach, three state-of-the-art machine learning models are reviewed and implemented, namely the Decision Tree Model, the Random Forest Model and a Feedforward Neural Network. These models aim to predict the target variable, which in this case is the pressure, based on a learned relationship with the input features. The magnitude of the residuals between the predicted and actual target variable determines whether a sample is classified as normal or abnormal, depending on a chosen confidence interval.
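The residual decision rule described above can be illustrated with a minimal sketch. It is not the thesis implementation: the pressure values and predictions are hypothetical, the predictions stand in for the output of a regression model such as the Random Forest, and the interval is assumed to be the mean residual plus or minus z standard deviations (z = 1.96, roughly 95 % if residuals are normally distributed).

```python
from statistics import mean, stdev


def residual_bounds(actual, predicted, z=1.96):
    """Estimate the accepted residual range from data assumed to be normal operation."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu, sigma = mean(residuals), stdev(residuals)
    return mu - z * sigma, mu + z * sigma


def flag_anomalies(actual, predicted, bounds):
    """Return True for every sample whose residual falls outside the bounds."""
    lower, upper = bounds
    return [not (lower <= a - p <= upper) for a, p in zip(actual, predicted)]


# Hypothetical pressure readings and model predictions (arbitrary units).
normal_actual = [7.00, 7.02, 6.98, 7.01, 6.99, 7.03, 6.97, 7.00]
normal_pred = [7.01, 7.00, 6.99, 7.00, 7.00, 7.01, 6.99, 7.01]
bounds = residual_bounds(normal_actual, normal_pred)

# New samples: the third one deviates strongly from its prediction.
test_actual = [7.01, 6.99, 7.40, 7.00]
test_pred = [7.00, 7.00, 7.01, 7.01]
flags = flag_anomalies(test_actual, test_pred, bounds)
print(flags)
```

Widening or narrowing z changes how many samples are flagged, which is where the expert judgment mentioned in the abstract would enter.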
For the clustering-based approach, the K-Means clustering algorithm is reviewed and implemented. This model aims to group the data into clusters with similar patterns, forming a reference pattern for the normal behavior of the system. Any new sample falling outside this reference pattern is classified as anomalous, depending on a predefined outlier fraction. Along with these reviews, the basic framework associated with machine learning is presented, involving training, validating and testing the algorithms. The objective of building models that perform well on never-before-seen data is emphasized, mitigating the effects of overfitting and underfitting. For the residual-based approach, a benchmark is performed between the Decision Tree Model, the Random Forest Model and the Feedforward Neural Network. It is concluded that the Random Forest Model has the overall best performance on both the validation and testing sets, predicting the pressure with an accuracy of 0.98. The results from this model are used to calculate the residuals, giving 180 detected abnormal samples out of 7233 in total. For the clustering-based anomaly detector, the same result was obtained using an outlier fraction of 0.02. A comparison of the two detectors shows that approximately the same samples are classified as anomalous, namely large irregular patterns deviating strongly from what is assumed to be normal operating behavior. Due to the lack of failure history, it is concluded that determining the decision boundaries for abnormal classification requires expert judgment and system knowledge, such that the risk of false positives and false negatives is minimized. This thesis aims to bridge the gap between theory and practice, introducing an interdisciplinary collaboration between research promoted within the fields of RAMS and Health Management and of IT and Artificial Intelligence.
Through such interdisciplinary collaboration, it is believed that the value derived from the continuously arriving streams of data can be maximized.
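The clustering-based detector described in the abstract can be sketched in a similar way. This is an illustrative assumption, not the thesis code: a small pure-Python K-Means is fitted on hypothetical two-feature samples of normal operation, and the outlier fraction is interpreted as the share of training samples allowed to lie farther from their nearest centroid than the distance threshold.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain K-Means (Lloyd's algorithm) on a list of equal-length tuples."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def nearest_distance(point, centroids):
    return min(math.dist(point, c) for c in centroids)


def distance_threshold(train, centroids, outlier_fraction):
    """Distance below which (1 - outlier_fraction) of the training samples lie."""
    distances = sorted(nearest_distance(p, centroids) for p in train)
    keep = max(1, len(distances) - round(len(distances) * outlier_fraction))
    return distances[keep - 1]


# Hypothetical normal-operation samples forming two operating regimes.
grid = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
train = [(1.0 + x, 1.0 + y) for x, y in grid] + [(5.0 + x, 5.0 + y) for x, y in grid]

centroids = kmeans(train, k=2)
threshold = distance_threshold(train, centroids, outlier_fraction=0.02)

# New samples: the last one lies between the two regimes.
new_samples = [(1.2, 1.2), (5.1, 5.3), (3.0, 3.0)]
flags = [nearest_distance(p, centroids) > threshold for p in new_samples]
print(flags)
```

In this sketch the number of clusters and the outlier fraction are both free parameters; the abstract reports an outlier fraction of 0.02 but does not state the number of clusters, so k = 2 here is purely illustrative.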
dc.language	eng
dc.publisher	NTNU
dc.title	Unsupervised Anomaly Detection based on Machine Learning
dc.type	Master thesis

Associated file(s)

Filename: no.ntnu:inspera:2529439.pdf
Size: 3.331Mb
Format: PDF

This item appears in the following collection(s)

Institutt for maskinteknikk og produksjon [4019]
