Data Analytics for HUNT: Recognition of Physical Activity on Sensor Data Streams
Human Activity Recognition (HAR) is the field of recognizing activities by analyzing measurements of a subject's movement and environment. A major application of HAR systems is medical research. The Nord-Trøndelag Health Study is one of the largest health studies in the world, containing health data on 120 000 subjects. The fourth version of the study (HUNT4) commenced in the fall of 2017, and for the first time activity data is collected through physical measurements rather than questionnaires. Subjects are asked to wear two accelerometers for a week to record their activities. With such large amounts of data, manual analysis is not feasible, so an efficient and accurate HAR system is needed. Prior studies have developed promising HAR systems that classify activities with a high degree of accuracy (Hessen and Tessem, Vågeskar). This thesis aims to improve the HAR system presented in Vågeskar by increasing the efficiency of the system and adding a sensor no-wear time classifier. Three goals were defined for this thesis. Goal 1 was to explore, through a systematic literature review, the state-of-the-art machine learning methods and datasets commonly used in HAR research. Goal 2 was to increase the efficiency of the HAR system presented in Vågeskar while maintaining its accuracy of 94 percent. This goal builds on the results of the specialization project preceding this thesis (Reinsve), which indicated that the 138 features used to train the HAR classifier in Vågeskar could be significantly reduced while maintaining the accuracy. Goal 3 was to develop a classifier able to detect instances of sensor no-wear time (SNT) by classifying the configuration of sensors attached to a subject at any given time. This thesis presents a systematic literature review of machine learning methods and publicly available datasets used in HAR research.
The feature importances for the 138 different features were presented. When tested on the TFL dataset, a model with the 5 most important features was sufficient to achieve an accuracy of 90.0 percent, while a model using the 27 most important features reached 94.0 percent accuracy. By calculating only the most important features, the feature calculation step of the HAR system became 5.9 times faster with 27 features and 23 times faster with 5 features. The SNT classifier achieved an accuracy of 95.6 percent using 2-minute windows and a random forest classifier when tested on the SNT dataset.
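As an illustration of the feature reduction step, the following minimal Python sketch shows how a subset of features could be chosen once importance scores are available. It is not the implementation used in the thesis: the function name `top_k_features`, the feature names, and the importance values are all hypothetical and serve only to show ranking by importance and keeping the top k.

```python
from typing import Dict, List


def top_k_features(importances: Dict[str, float], k: int) -> List[str]:
    """Return the names of the k features with the highest importance."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by descending importance; ties are broken by name for determinism.
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores and feature names; not values from the thesis.
example_importances = {
    "back_x_mean": 0.30,
    "thigh_x_mean": 0.25,
    "back_z_std": 0.20,
    "thigh_y_energy": 0.15,
    "back_y_median": 0.10,
}

# Keep only the two highest-ranked features for the cheaper model.
selected = top_k_features(example_importances, k=2)
print(selected)
```

Only the selected features would then need to be computed for each window, which is where a saving in the feature calculation step would come from under this approach.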