The aim of this project is to automatically recognize human actions by analysing body landmarks obtained through pose estimation.
Analysing people's actions and activities in public and private environments is essential for security. Doing this manually is impractical, since surveillance cameras produce many hours of video footage every day, and detecting and alerting on suspicious activities in real time is equally challenging. Deep learning-based action recognition addresses this problem.
The following are the major tasks performed:
- Implementation of Convolutional Neural Network-based pose estimation for body landmark detection
- Implementation of pose-feature-based action recognition, improved through graphical feature representation and data augmentation of body landmarks
- Preparation and preprocessing of the image datasets
- Fine-tuning of the action recognition model with better feature representation and data augmentation
- Development, error analysis and iterative improvement of the deep learning models
The data is a subset of Frames Labeled in Cinema (FLIC). The training images and their annotations (train_joints_coords.csv and action_joints.csv) cover nine body landmarks: 'left shoulder', 'left elbow', 'left wrist', 'right shoulder', 'right elbow', 'right wrist', 'left eye', 'right eye' and 'nose'. The datasets are available under '/Action_dataset' and '/Pose_dataset'.
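The snippet below is a minimal, hypothetical loading sketch; the exact column layout of the CSV files and their locations inside the dataset folders are assumptions.

```python
# Hypothetical loading sketch; the exact CSV column layout is an assumption
# and may differ from the actual annotation format.
import numpy as np
import pandas as pd

LANDMARKS = ['left shoulder', 'left elbow', 'left wrist',
             'right shoulder', 'right elbow', 'right wrist',
             'left eye', 'right eye', 'nose']

# Assumed layout: one row per training image, an image identifier followed by
# an (x, y) pair per landmark.
joints = pd.read_csv('Pose_dataset/train_joints_coords.csv')
coords = joints.iloc[:, 1:].to_numpy(dtype=np.float32)
coords = coords.reshape(-1, len(LANDMARKS), 2)   # (num_images, 9, 2)

# Action labels paired with joint coordinates for the recognition task
actions = pd.read_csv('Action_dataset/action_joints.csv')
```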
Pose estimation is implemented as a CNN regression model using transfer learning: the pre-trained VGG16 network serves as the convolutional base and a custom top model predicts the joint coordinates. Training takes place in two stages. The model achieved an R² score of 0.92 on the test data, which is accurate enough for the downstream action recognition task.
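Below is a minimal sketch of this transfer-learning setup, assuming Keras/TensorFlow, 224x224 RGB inputs and nine landmarks; the layer sizes, learning rates and the exact split of the two training stages are assumptions rather than the settings actually used.

```python
# Sketch of VGG16-based transfer learning for joint-coordinate regression.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

NUM_LANDMARKS = 9

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(NUM_LANDMARKS * 2)          # (x, y) for each landmark
])

# Stage 1: train only the custom top with the convolutional base frozen
base.trainable = False
model.compile(optimizer=optimizers.Adam(1e-3), loss='mse')
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)

# Stage 2: unfreeze the last convolutional block and fine-tune at a low rate
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False
model.compile(optimizer=optimizers.Adam(1e-5), loss='mse')
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
# model.save('pose_estimation_model.h5')
```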
I created two action recognition models:
- xy_actions_model.h5: trains on the (X, Y) coordinates of the joints.
  This neural network uses the x and y coordinates directly as features and the action label as the target. It reached a validation accuracy of 100%. However, since people can appear at any scale in a real scene, raw coordinates are not a robust feature representation, so I developed the model below.
- dist_actions_model.h5: trains on the Euclidean distances between joint coordinates.
  For this model I extracted the distance between every pair of joints and normalized these distance features before training the action recognition model (both feature representations are sketched after this list). It also reached a validation accuracy of 100%, but it was not stable, i.e. it occasionally dropped to 50% accuracy.
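The following sketch illustrates the two feature representations, assuming NumPy and Keras; the classifier architecture, the number of action classes and the normalization scheme are assumptions, not the exact ones stored in the saved models.

```python
# Sketch of the two pose-feature representations and a small dense classifier.
import numpy as np
from tensorflow.keras import layers, models

NUM_LANDMARKS = 9
NUM_ACTIONS = 2  # assumption: set to the actual number of action classes

def xy_features(coords):
    """Flatten the raw (x, y) joint coordinates (xy_actions_model.h5)."""
    return coords.reshape(len(coords), -1)                       # (N, 18)

def distance_features(coords):
    """Pairwise Euclidean distances between joints, normalized per sample
    (dist_actions_model.h5)."""
    diffs = coords[:, :, None, :] - coords[:, None, :, :]        # (N, 9, 9, 2)
    dists = np.linalg.norm(diffs, axis=-1)                       # (N, 9, 9)
    iu = np.triu_indices(NUM_LANDMARKS, k=1)                     # upper triangle
    feats = dists[:, iu[0], iu[1]]                               # (N, 36)
    return feats / (feats.max(axis=1, keepdims=True) + 1e-8)     # scale-invariant

def build_classifier(input_dim):
    """Small dense network mapping pose features to action probabilities."""
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(NUM_ACTIONS, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# coords has shape (num_samples, 9, 2)
# xy_model   = build_classifier(NUM_LANDMARKS * 2)   # -> xy_actions_model.h5
# dist_model = build_classifier(36)                  # -> dist_actions_model.h5
```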
Therefore, for action recognition on video, I used the first model (xy_actions_model.h5).
'pose_estimation_model.h5' estimates the pose fairly accurately, and 'xy_actions_model.h5' correctly recognizes the action as 'Namaste'.
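A minimal sketch of how the two models could be chained for video inference is shown below, assuming OpenCV and Keras; the input size, label names and file paths are assumptions.

```python
# Sketch: run pose estimation per frame, then classify the action from the
# predicted joint coordinates.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

pose_model = load_model('pose_estimation_model.h5')
action_model = load_model('xy_actions_model.h5')
ACTIONS = ['namaste', 'other']   # assumption: actual label order may differ

cap = cv2.VideoCapture('input_video.mp4')  # hypothetical input path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pose estimation: predict the nine (x, y) joint coordinates for the frame
    img = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
    joints = pose_model.predict(img[None, ...])        # shape (1, 18)
    # Action recognition on the raw coordinates (xy_actions_model.h5)
    probs = action_model.predict(joints)               # shape (1, NUM_ACTIONS)
    label = ACTIONS[int(np.argmax(probs))]
    cv2.putText(frame, label, (20, 40), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 255, 0), 2)
    cv2.imshow('action', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```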