Computer Vision Analysis of the Dynamic Object State

If you have been following our blog closely, you probably already know quite a lot about modern Computer Vision and AI and their capabilities. If not, now is a good time to start: the world of IT moves fast, and it pays to keep a finger on its pulse.

In previous articles, we discussed how Computer Vision in sports can improve both our understanding of a game and our approach to it. Here we will focus on soccer in particular.

Landmark Localization — Computer Vision Analysis

The state of an object in a video can be described by a set of key points. These points are called landmarks, and they give great insight into the structure of the analyzed object. For example, face landmarks are useful for a wide range of applications: face recognition, face animation, emotion recognition, blink detection, photography, and more.

The same applies to body landmarks, which also have a wide range of applications: gesture and posture recognition, contactless game controllers, environment interaction in AR applications, and sports analytics, where they track athletes' movements.

Object Landmark Localization, or Object Alignment, is the process of extracting a set of key points from an image of an object. For face alignment, we are only interested in the landmarks that describe the shape of facial attributes: eyes, eyebrows, nose, mouth, and chin. For a body, it can be the head, shoulders, elbows, hands, hips, knees, and feet.
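To make this concrete, face alignment models often follow the standard 68-point annotation scheme (the iBUG 300-W convention, which dlib's pretrained predictor also uses). The index grouping below sketches that layout:

```python
# Standard 68-point face landmark layout (iBUG 300-W convention),
# grouping landmark indices by the facial attribute they describe.
FACE_LANDMARK_GROUPS = {
    "jaw":           range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "mouth":         range(48, 68),
}

def group_of(index):
    """Return the facial attribute a landmark index belongs to."""
    for name, indices in FACE_LANDMARK_GROUPS.items():
        if index in indices:
            return name
    raise ValueError(f"landmark index out of range: {index}")
```

Body landmark schemes work the same way, only the groups are joints (head, shoulders, elbows, and so on) rather than facial features.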

The main methods of Computer Vision analysis

There are many methods that can detect these points, but let's take a closer look at three of the main ones:

  • The first achieves superior accuracy and robustness by analyzing a 3D face model extracted from a 2D image;
  • The second relies on the power of CNNs (Convolutional Neural Networks) or RNNs (Recurrent Neural Networks);
  • The third, ERT, uses simple but fast features to estimate the location of the points.

Dlib, a well-known machine learning library written in C++, offers the third method. It implements a wide range of algorithms that can be used on desktop or mobile platforms. The face landmark detection algorithm offered by Dlib is an implementation of the Ensemble of Regression Trees (ERT).

ERT is a cascade of high-capacity regression functions learned via gradient boosting. The technique uses pixel intensity differences, a simple and fast feature, to directly estimate landmark positions. These estimates are then refined iteratively by a cascade of regressors, each learned through gradient boosting.

Each regressor produces a new estimate from the previous one, reducing the alignment error of the estimated points at every iteration. The algorithm is fast: aligning a set of 68 landmarks takes 1 to 3 ms on a desktop platform.
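The cascade idea can be illustrated with a toy one-dimensional sketch (this is not dlib's implementation; in real ERT each stage's correction is a function of pixel-difference features, whereas here it is a constant learned from the residual):

```python
# Toy cascade of regressors: each stage nudges the current estimate
# toward the target by fitting a damped correction to the residual
# left by the previous stages.

def train_cascade(target, init, stages=10, shrinkage=0.5):
    """Learn per-stage corrections that move `init` toward `target`."""
    corrections = []
    estimate = init
    for _ in range(stages):
        residual = target - estimate   # what is still wrong
        step = shrinkage * residual    # weak regressor: a damped fix
        corrections.append(step)
        estimate += step               # refine the estimate
    return corrections

def run_cascade(init, corrections):
    """Apply the learned corrections in order, as ERT does at test time."""
    estimate = init
    for step in corrections:
        estimate += step
    return estimate
```

With a shrinkage of 0.5, the remaining error is halved at every stage, so ten stages leave roughly a thousandth of the initial error. This geometric decay is why a short cascade of weak regressors can be both fast and accurate.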

Basically, a shape predictor can be generated from a set of images, annotations, and training options. A single annotation consists of an object region and the labeled points we want to localize. The object region can be obtained by any detection algorithm, but the points have to be labeled manually.
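A single annotation might look like the following sketch. The field names are our own invention for illustration; dlib's imglab tool stores the same information (a box plus labeled parts) in an XML file:

```python
# Hypothetical annotation record: one object region plus the manually
# labeled key points inside it. Field names are illustrative only.

def make_annotation(left, top, width, height, points):
    """Bundle a bounding box with its manually labeled key points."""
    box = {"left": left, "top": top, "width": width, "height": height}
    for name, (x, y) in points.items():
        inside = (left <= x < left + width) and (top <= y < top + height)
        if not inside:
            raise ValueError(f"point {name!r} lies outside the object region")
    return {"box": box, "points": points}
```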

A Haar cascade, a HOG detector combined with an SVM, or a single CNN detector can automatically detect objects such as people or their faces. This stage produces a bounding box around the object.

Using this approach in sports analytics, we can solve the problem of high-precision analysis of soccer players' movements during training and practice.

To do this, we define a set of key points for the human body and for the ball, following the same principle as above. The set of labeled key points is built from training videos; a base of 300 tagged images is enough for high-precision point recognition.

The next step is training the ERT model and evaluating its accuracy. If the results are not accurate enough, the data set should be expanded and the parameters of the training model adjusted.

A fully trained model processes new videos in two steps:

  • First step: the object detection model forms a bounding box around the analyzed objects, the player and the ball, on each new frame.
  • Second step: the ERT model calculates the coordinates of the key points inside this bounding box.
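The two steps above can be sketched as a small pipeline. Here `detect` and `predict_points` are placeholder stubs standing in for a real object detector and a trained ERT shape predictor:

```python
# Two-step inference pipeline: detect a bounding box (step 1), then
# locate the key points inside it (step 2). The callables passed in
# are stand-ins for a real detector and a trained shape predictor.

def analyze_frame(frame, detect, predict_points):
    """Run detection and landmark localization on one frame."""
    results = []
    for box in detect(frame):                # step 1: bounding boxes
        points = predict_points(frame, box)  # step 2: landmarks in box
        results.append({"box": box, "points": points})
    return results
```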

By connecting these key points, we get the skeleton of the player and the position of the ball at the moment of impact.

The movement trajectory of each point of the player's body and of the ball is built for the entire video. The model also calculates their speed throughout the video and their acceleration over specific time windows.
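Given the per-frame coordinates of one tracked point, speed and acceleration can be estimated with simple finite differences. This is a minimal sketch under the assumption of a fixed frame rate; a production pipeline would also smooth the trajectory to suppress detection jitter:

```python
import math

def speeds(trajectory, fps):
    """Per-frame speed (units/s) from a list of (x, y) positions."""
    out = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        out.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return out

def accelerations(trajectory, fps):
    """Per-frame change of speed (units/s^2) via a second difference."""
    v = speeds(trajectory, fps)
    return [(v1 - v0) * fps for v0, v1 in zip(v, v[1:])]
```

The same two functions, applied to every landmark of the skeleton and to the ball, yield the speed and acceleration profiles mentioned above.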

Thanks to this information, coaches can adjust training sessions or exercises to improve players' performance.

Looking ahead

As you can see, the opportunities for Computer Vision analysis in soccer are almost endless; tracking and analyzing a winger's, forward's, or full-back's stats is just the beginning. It can also be used to track a goalkeeper's performance, measure the speed of struck balls, and more. We have several projects based on Computer Vision.

This case is just one of many; you can learn more about our experience from the Softarex portfolio. We are always eager to share our best practices and open to learning something new, so if you have any questions or ideas, feel free to write to us. Let's develop the world together!