In our webinar ‘Machine vision: learning increasingly complex real-world scenarios with limited annotated data’, computer vision expert John Beuving explained how neural networks based on deep learning are increasingly able to understand images and videos. In this summary of the webinar, we outline the key facts and lessons.
John Beuving knows what he is talking about: he has been in the computer vision business for almost two decades. He started at Siemens Mobile in 2003 and then went on to work at Dacolian and DySI Analytics. Finally, he joined SmarterVision, where he develops socially relevant computer vision solutions as a computer vision expert and CTO. He obtained his PhD in Delft with a thesis about model-free tracking; recently, he has been building machine vision solutions for wildlife conservation as a volunteer at the NGO Sensing Clues.
Mission: understanding images and videos
In the webinar, Beuving first explains how SmarterVision uses computer vision to, for example, monitor bridges, detect smuggling into prisons based on camera images, and detect epileptic seizures in young children in a non-invasive way.
The software stack SmarterVision uses for this purpose includes:
- Image classification: classifying an entire image.
- Object detection: identifying objects and their position.
- Semantic segmentation: determining which pixels belong to an object.
- Instance segmentation: identifying different instances of objects.
- Pose estimation: analysing the various poses of objects.
- Tracking: tracing objects within a series of images.
Video presents greater challenges than single images, Beuving explains. ‘You not only have to determine what is happening in a video, but also when exactly it begins and ends.’ The particularly difficult part is building detection software for events that happen infrequently. Smuggling items over a prison wall is a good example: it rarely happens, so there is hardly any footage of it.
Deep learning revolution
Deep learning – a form of machine learning based on artificial neural networks – offers a solution. Deep learning has improved dramatically since 2014, mainly due to better graphics processors, larger available datasets (for example, ImageNet contains millions of images that have been manually annotated for visual object recognition software) and improved training techniques, including dropout and batch normalisation.
Beuving: ‘Deep learning models are data inefficient. Do you want to improve performance? Then you need a lot of annotated data, but there is only a limited amount available. It’s true that gigantic amounts of data are coming in – more than 80 years of video per day on YouTube alone – but that is mainly unannotated. We want to learn from data, but we can’t label everything. Fortunately, the boundaries are being stretched further and further thanks to research. New techniques and data sets allow us to have a better understanding of video with the same amount of data.’
The data paradox
The catch is that a lot of data is still needed. Ideally, we would like to understand even more with even less (or even no) data. This is where the data paradox comes in: the more we want to understand about an image or video, the harder it is to get enough data for it. A complicating factor is that much visual material is unique: it’s difficult to get more data for rare or difficult situations.
Solving a data problem: more data
The first step is basically always to have more annotated data. The go-to method for this at the moment is supervised learning, in which a person labels all data points. A classifier is then trained based on these data points.
Data generation via GANs and game engines
The amount of supervision needed can be reduced by generating data, for example with game engines and generative adversarial networks (GANs). Game engines, the software development environments used to build video games, can generate synthetic data for training foreground detection and tracking, among other things. The game engine Unity in particular delivers very good results.
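The appeal of synthetic data is that the annotation comes for free: whatever renders the scene already knows where each object is. The sketch below illustrates that idea with a toy renderer in plain Python rather than a real game engine; the function name `render_synthetic_frame`, the frame size and the square 'object' are all assumptions made for illustration.

```python
import random

def render_synthetic_frame(width, height, rng, obj_size=3):
    """Draw one bright square on a dark background.

    Returns the frame (a list of pixel rows) and the bounding box of the
    square as (x_min, y_min, x_max, y_max), with the max values exclusive.
    """
    # Pick a position so that the square fits entirely inside the frame.
    x = rng.randrange(0, width - obj_size + 1)
    y = rng.randrange(0, height - obj_size + 1)

    frame = [[0] * width for _ in range(height)]
    for row in range(y, y + obj_size):
        for col in range(x, x + obj_size):
            frame[row][col] = 255  # foreground pixel

    # The label is known exactly because we placed the object ourselves.
    box = (x, y, x + obj_size, y + obj_size)
    return frame, box

def make_dataset(n_frames, seed=0):
    """Generate n_frames (frame, box) pairs without any manual annotation."""
    rng = random.Random(seed)
    return [render_synthetic_frame(8, 6, rng) for _ in range(n_frames)]
```

A real pipeline would replace the toy renderer with rendered 3D scenes, but the structure stays the same: every generated frame arrives together with its ground-truth label.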
A GAN learns to generate images that have the same characteristics as the images in the training set, allowing you to create high-quality images. This is also the underlying technique for making deepfakes.
Semi-automatic supervised learning
Semi-automatic supervised learning methods such as pseudo-labelling, active learning and point annotations do not require users to annotate all data points themselves; they reduce or simplify the supervision. With pseudo-labelling, you first train the model with a batch of labelled data. The trained model then predicts labels for the unlabelled data, and finally the model is trained again with the pseudo-labelled and labelled datasets together.
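The pseudo-labelling loop (train on labelled data, predict labels for unlabelled data, retrain on both) can be sketched as follows. This is a minimal illustration, not the method shown in the webinar: the nearest-centroid classifier stands in for a neural network, and the confidence threshold is a commonly used addition that the summary does not mention. All function names are hypothetical.

```python
import math
from statistics import mean

def fit_centroids(samples):
    """Toy 'training': compute one mean vector per class from (point, label) pairs."""
    by_class = {}
    for point, label in samples:
        by_class.setdefault(label, []).append(point)
    return {label: tuple(mean(dim) for dim in zip(*pts))
            for label, pts in by_class.items()}

def predict(centroids, point):
    """Return (label, confidence) for a point; needs at least two classes.

    Confidence is a margin score in [0, 1]: 0 when the two nearest
    centroids are equally far away, 1 when the point sits on a centroid.
    """
    ranked = sorted((math.dist(point, c), label) for label, c in centroids.items())
    (d_best, label), (d_next, _) = ranked[0], ranked[1]
    total = d_best + d_next
    confidence = (d_next - d_best) / total if total > 0 else 0.0
    return label, confidence

def pseudo_label(labelled, unlabelled, threshold=0.5):
    """One pseudo-labelling round; threshold is an assumed hyperparameter."""
    model = fit_centroids(labelled)                 # 1. train on labelled data
    pseudo = []
    for point in unlabelled:                        # 2. predict on unlabelled data
        label, confidence = predict(model, point)
        if confidence >= threshold:                 #    keep only confident guesses
            pseudo.append((point, label))
    model = fit_centroids(labelled + pseudo)        # 3. retrain on both sets
    return model, pseudo
```

Points that lie between two classes receive a low confidence and are left out of the retraining set, which limits the risk of the model reinforcing its own mistakes.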
With active learning, human experts label only the data points that are most difficult or most informative for the model. The classifier is then retrained with the new data points.
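One common way to decide which data points count as difficult is least-confidence sampling: the samples for which the model's highest class probability is lowest go to the human expert first. The sketch below assumes this selection strategy (the webinar summary does not name one) and represents the expert as a hypothetical `oracle` function.

```python
def select_for_labelling(probabilities, budget):
    """Return the indices of the `budget` most uncertain samples.

    probabilities: one list of class probabilities per unlabelled sample.
    Uncertainty is 1 minus the highest class probability.
    """
    scored = [(1.0 - max(p), i) for i, p in enumerate(probabilities)]
    # Most uncertain first; ties are broken by the original index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:budget]]

def active_learning_round(labelled, pool, probabilities, oracle, budget):
    """Ask the expert (oracle) to label the hardest samples in the pool.

    Returns the extended labelled set and the remaining unlabelled pool.
    Retraining the classifier on the new labelled set is left to the caller.
    """
    chosen = select_for_labelling(probabilities, budget)
    chosen_set = set(chosen)
    new_labelled = labelled + [(pool[i], oracle(pool[i])) for i in chosen]
    remaining = [x for i, x in enumerate(pool) if i not in chosen_set]
    return new_labelled, remaining
```

The `budget` parameter reflects the practical constraint that expert time is limited, so each round only a handful of samples is sent for annotation.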
The point annotations method for video is based on object detection, which allows you to localise actions with limited supervision. All kinds of objects are followed over time in a video, and with one click you can annotate an object over the whole series of separate images. All these semi-automatic methods have one big disadvantage: they don’t work well for imbalanced data sets.
Meta learning is a relatively new concept that allows fast learning with less data. The idea behind it is that people and animals learn so quickly by observing contexts, including other senses and physical properties of objects. This can involve few-shot learning and zero-shot learning.
With few-shot learning, you need only a few samples per class in your data to allow the machine to learn about the class. The model looks for similarities between the different classes, among other things. This is a very promising technique, not least because you can also use the same neural network with other support sets.
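A simple way to picture few-shot classification is with class prototypes: average the few available examples of each class and assign a new sample to the nearest average. The sketch below assumes that a trained network has already turned each image into an embedding vector; the vectors, class names and function names here are made up for illustration.

```python
import math

def prototypes(support):
    """Average the few embedding vectors available for each class.

    support: dict mapping class name -> list of embedding vectors.
    """
    return {cls: [sum(dim) / len(vectors) for dim in zip(*vectors)]
            for cls, vectors in support.items()}

def classify(query, support):
    """Assign the query embedding to the class with the nearest prototype."""
    protos = prototypes(support)
    return min(protos, key=lambda cls: math.dist(query, protos[cls]))

# Two examples per class are enough to define the classes.
wildlife_support = {
    "elephant": [[0.9, 0.1], [1.0, 0.0]],
    "rhino": [[0.1, 0.9], [0.0, 1.0]],
}

# A different support set reuses the same code (and, in practice,
# the same embedding network) for entirely different classes.
traffic_support = {
    "car": [[0.5, 0.5]],
    "boat": [[-1.0, -1.0]],
}
```

Calling `classify([0.8, 0.2], wildlife_support)` picks the class whose prototype lies closest to the query. Swapping in `traffic_support` changes the set of classes without any retraining, which is the reuse across support sets mentioned above.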
Zero-shot learning means that your training set contains no samples at all of the classes you want to recognise. The model has to classify categories it has not yet seen, typically by relying on auxiliary information about those categories, such as descriptions or attributes.
According to Beuving, the hottest trend in computer vision at the moment is self-supervised learning. Unannotated data is automatically annotated so that a model can be trained on an unlabelled dataset in a supervised manner. The core is self-labelling: the data annotates itself, and the model learns from those labels.
This can be done in various ways, such as:
- Requiring the neural network to predict what the missing part of the image looks like.
- The exemplar technique: multiple examples are generated by making adjustments such as scaling, rotating and changing contrast or colours.
- Jigsaw: the image is divided into puzzle pieces; by placing the pieces correctly, the network learns visual concepts.
- With videos, you can predict the future by learning from the past.
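As an illustration of how data can annotate itself, the jigsaw idea from the list above can be sketched in a few lines. This is a simplified version under stated assumptions: the image is a small square grid of numbers, it is cut into only four tiles, and the function names are hypothetical. The training label is simply the index of the permutation that was used to shuffle the tiles.

```python
import itertools
import random

# All 24 possible orderings of four tiles; the index into this list
# is the label the network would have to predict.
PERMUTATIONS = list(itertools.permutations(range(4)))

def split_into_tiles(image):
    """Cut a square image (list of rows) into four quadrants.

    Order: top-left, top-right, bottom-left, bottom-right.
    """
    half = len(image) // 2
    top, bottom = image[:half], image[half:]
    return [
        [row[:half] for row in top],
        [row[half:] for row in top],
        [row[:half] for row in bottom],
        [row[half:] for row in bottom],
    ]

def make_jigsaw_sample(image, rng):
    """Create one (shuffled_tiles, label) training pair from an unlabelled image."""
    tiles = split_into_tiles(image)
    label = rng.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return shuffled, label

def make_jigsaw_dataset(images, samples_per_image=3, seed=0):
    """Turn unlabelled images into a labelled dataset without human annotation."""
    rng = random.Random(seed)
    return [make_jigsaw_sample(img, rng)
            for img in images for _ in range(samples_per_image)]
```

No human is involved in producing the labels, yet the resulting pairs can be fed to an ordinary supervised training loop. To predict the permutation, the network has to learn how the parts of an object relate to each other.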
Anomaly detection, an alternative way of tackling the data problem, focuses on detecting anything that deviates from normal values. In a GAN-based setup, the generator learns concepts from the real world, while the discriminator detects abnormalities based on the input.
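The underlying principle, learning what normal looks like and flagging whatever deviates from it, can be shown with a much simpler stand-in than a neural network. The sketch below uses a plain z-score on a single measured value; it is not the generator and discriminator approach described in the webinar, and the threshold of three standard deviations is an assumed default.

```python
from statistics import mean, stdev

def fit_normal_profile(normal_values):
    """Summarise normal behaviour as (mean, standard deviation).

    Only examples of normal situations are needed, which is why this
    kind of approach suits events that are rarely captured on camera.
    """
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, max_z=3.0):
    """Flag a value that lies more than max_z standard deviations from the mean."""
    mu, sigma = profile
    if sigma == 0:
        # All normal values were identical: any other value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > max_z
```

For example, with a profile fitted on the values 10.0, 11.0, 9.0, 10.5 and 9.5, a new value of 10.4 is accepted as normal, while 25.0 is flagged as an anomaly.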
‘What we are currently using a lot is online active learning with a human in the loop’, Beuving concludes in the webinar. ‘The foundation is laid by self-supervised pre-training; then a human expert does the active learning part.’
The computer vision veteran’s standard advice for companies is to always get more data first. ‘If you have enough data available, try self-supervised learning because it’s so promising. Then fine-tune this with meta learning methods or active learning. Anomaly detection can be used as a fallback option.’
Getting started as an organisation
The barrier to entry for organisations that want to get started with machine vision is quite high, Beuving said after the webinar: ‘There are no out-of-the-box solutions, and you need a lot of experience and knowledge. Fortunately, you can start at a relatively high level thanks to Facebook and Google. They do a lot of research on this and make all kinds of open-source standard frameworks for machine vision available on GitHub. This allows you as an organisation to start your own research where Facebook and Google left off. That’s how we do it at SmarterVision and Sensing Clues too.’
The biggest problem you have to tackle as an organisation is getting the maximum results from your data. Beuving: ‘You have to get a feel for it, and you can only do that if you have a lot of experience because every type of data is different.’ To illustrate this, he refers to PyTorch, the open-source framework developed by Facebook, which consists of building blocks that you must combine to create a solution. ‘The difficult part is not only putting that combination together perfectly, but also the need to make the best use of the available data.’
Also read our interview with John Beuving about applications and trends in the field of machine vision.