Human action recognition using attention based LSTM network with dilated CNN features

Muhammad, Khan; Mustaqeem, .; Ullah, Amin; Imran, Ali Shariq; Sajjad, Muhammad; Kiran, Mustafa Servet; Sannino, Giovanna; Albuquerque, Victor Hugo C. de

dc.contributor.author	Muhammad, Khan
dc.contributor.author	Mustaqeem, .
dc.contributor.author	Ullah, Amin
dc.contributor.author	Imran, Ali Shariq
dc.contributor.author	Sajjad, Muhammad
dc.contributor.author	Kiran, Mustafa Servet
dc.contributor.author	Sannino, Giovanna
dc.contributor.author	Albuquerque, Victor Hugo C. de
dc.date.accessioned	2022-10-04T06:27:37Z
dc.date.available	2022-10-04T06:27:37Z
dc.date.created	2021-07-02T13:32:42Z
dc.date.issued	2021
dc.identifier.citation	Future Generation Computer Systems. 2021, 125, 820-830.	en_US
dc.identifier.issn	0167-739X
dc.identifier.uri	https://hdl.handle.net/11250/3023452
dc.description.abstract	Human action recognition in videos is an active area of research in computer vision and pattern recognition. Artificial intelligence (AI) based systems are now needed for human-behavior assessment and security purposes. Existing action recognition techniques mainly use pre-trained weights of different AI architectures for the visual representation of video frames in the training stage, which affects how well discrepancies between features are determined, such as the distinction between visual and temporal cues. To address this issue, we propose a bi-directional long short-term memory (BiLSTM) based attention mechanism with a dilated convolutional neural network (DCNN) that selectively focuses on effective features in the input frame to recognize different human actions in videos. In this network, the DCNN layers extract salient discriminative features, and residual blocks refine them so that they retain more information than the features of a shallow layer. We then feed these features into a BiLSTM to learn long-term dependencies, followed by an attention mechanism that boosts performance and extracts additional high-level, selective action-related patterns and cues. We further combine the center loss with Softmax to improve the loss function, which yields higher performance in video-based action classification. The proposed system is evaluated on three benchmarks, the UCF11, UCF Sports, and J-HMDB datasets, on which it achieved recognition rates of 98.3%, 99.1%, and 80.2%, respectively, a 1%–3% improvement over state-of-the-art (SOTA) methods.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier	en_US
dc.title	Human action recognition using attention based LSTM network with dilated CNN features	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	This version of the article will not be available due to copyright restrictions by Elsevier	en_US
dc.source.pagenumber	820-830	en_US
dc.source.volume	125	en_US
dc.source.journal	Future Generation Computer Systems	en_US
dc.identifier.doi	10.1016/j.future.2021.06.045
dc.identifier.cristin	1920069
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	2
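
The abstract above mentions two generic building blocks: attention over a sequence of per-frame features, and a center loss combined with Softmax. The following is a minimal sketch of those two ideas in plain Python, not the authors' implementation. The function names, the `query` vector (a stand-in for learned attention parameters), the weighting factor `lam`, and the toy inputs are all assumptions made for illustration; the record does not give the paper's exact formulation.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, query):
    """Weight each per-frame feature vector by softmax(dot(feature, query)) and sum.

    frame_features stands in for the sequence outputs of a recurrent layer;
    query is a hypothetical stand-in for learned attention parameters.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_features)) for d in range(dim)]
    return pooled, weights

def softmax_center_loss(logits, feature, label, centers, lam=0.5):
    """Cross-entropy on softmax(logits) plus lam * (1/2) * squared distance
    between the feature vector and the center of its class."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    center_term = 0.5 * sum((x - c) ** 2 for x, c in zip(feature, centers[label]))
    return cross_entropy + lam * center_term

# Toy usage: three frames with 2-dimensional features, two classes.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 0.0])
loss = softmax_center_loss([0.0, 0.0], pooled, label=0, centers=[[1.0, 0.0], [0.0, 1.0]])
```

In this sketch the frames that align with `query` receive larger attention weights, and the center term penalizes pooled features that drift away from their class center.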

Files in this item

Name: HAR+DL.pdf
Size: 2.336Mb
Format: PDF
Description: Muhammad

Locked

This item appears in the following Collection(s)

Institutt for datateknologi og informatikk [6808]
Publikasjoner fra CRIStin - NTNU [38484]
