3D Skeleton-based Action Recognition Using LSTM Network

Mega Cynthia Wishnu, Lukas, Duma Kristina Yanti Hutapea, Wen-Nung Lie

Abstract

Human action recognition is an area of Artificial Intelligence research concerned with
recognizing patterns of human action. It is useful in many applications, such as
video surveillance, human-machine interaction, and video analysis. One data
modality that can be used is 3D skeleton data extracted from a depth sensor. A deep
learning method is chosen to solve this problem; specifically, a Long Short-Term Memory
(LSTM) network is applied to human action recognition. The method is
divided into two parts: data pre-processing and network modelling. Data pre-processing
is applied so that the data can be fed to the neural network. Two networks are
used in this experiment: a 1-layer LSTM and a 2-layer LSTM. Both
networks are followed by a fully connected layer and a softmax classifier to predict the
action label. The networks are evaluated on the NTU RGB-D 120 dataset with two evaluation
benchmarks: cross-setup and cross-subject. In the experiment, the second network
achieves better results than the first. Cross-setup training accuracy is 43.64% and
testing accuracy is 13.69%. Cross-subject training accuracy is 46.97% and testing
accuracy is 26.02%.
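
The architecture described in the abstract (one or two stacked LSTM layers, then a fully connected layer and a softmax classifier) can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. The class names, the hidden size, the random untrained weights, the choice to classify from the last time step, and the 75-value frame (assuming 25 joints with 3 coordinates each for a single body) are all assumptions made for illustration. Only the standard library is used so that the gate arithmetic stays visible.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMLayer:
    """A single LSTM layer with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One stacked weight matrix for the four gates, applied to [x_t ; h_{t-1}].
        self.W = rand_matrix(4 * hidden_size, input_size + hidden_size, rng)
        self.b = [0.0] * (4 * hidden_size)

    def forward(self, sequence):
        H = self.hidden_size
        h, c = [0.0] * H, [0.0] * H
        outputs = []
        for x in sequence:
            z = [a + b for a, b in zip(matvec(self.W, x + h), self.b)]
            i = [sigmoid(v) for v in z[0:H]]              # input gate
            f = [sigmoid(v) for v in z[H:2 * H]]          # forget gate
            g = [math.tanh(v) for v in z[2 * H:3 * H]]    # candidate cell state
            o = [sigmoid(v) for v in z[3 * H:4 * H]]      # output gate
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs  # one hidden vector per frame


class SkeletonLSTMClassifier:
    """Stacked LSTM -> fully connected layer -> softmax (hypothetical sketch)."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes, seed=0):
        rng = random.Random(seed)
        sizes = [input_size] + [hidden_size] * num_layers
        self.layers = [LSTMLayer(sizes[k], sizes[k + 1], rng) for k in range(num_layers)]
        self.fc_W = rand_matrix(num_classes, hidden_size, rng)
        self.fc_b = [0.0] * num_classes

    def predict_proba(self, sequence):
        for layer in self.layers:
            sequence = layer.forward(sequence)
        last_hidden = sequence[-1]  # assumption: classify from the final time step
        logits = [a + b for a, b in zip(matvec(self.fc_W, last_hidden), self.fc_b)]
        return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(1)
    # Assumed input: 20 frames, each flattened to 25 joints x 3 coordinates = 75 values.
    frames = [[rng.uniform(-1.0, 1.0) for _ in range(75)] for _ in range(20)]
    for num_layers in (1, 2):
        model = SkeletonLSTMClassifier(75, 32, num_layers, num_classes=120)
        probs = model.predict_proba(frames)
        print(num_layers, "layer(s):", len(probs), "class probabilities")
```

With untrained weights the predicted probabilities are close to uniform over the 120 classes, so the sketch shows only how data flows through the two network variants, not their accuracy. In practice, a deep learning framework would supply the LSTM layers, training loop, and batching.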