Since 2014, machine vision has received a huge boost: neural networks understand images and videos better than ever, allowing countless (new) applications to become viable. John Beuving, machine vision specialist and CTO at SmarterVision, outlines the current landscape. What are the most important applications, developments and trends? And what should companies take into account when they want to get started with machine vision?
He has been working with computer vision since 2003, graduated in model-free tracking, and currently develops socially relevant computer vision solutions at SmarterVision – John Beuving can truly be called a ‘computer vision veteran’. On June 23, he gave a webinar for Spilberg: ‘Machine vision: learning increasingly complex real-world scenarios with limited to no annotated data’. During the webinar Beuving explained techniques for enabling neural networks to better understand images and – above all – videos. After the webinar we talked to him about machine vision and the real-life applications of computer vision.
From quality checks to keeping an eye on elephants
There is a growing number of situations where machine vision systems are able to handle tasks faster, cheaper or better than humans, Beuving explains. “These usually involve repetitive tasks, ones that we do regularly and that can be completed in just a second; there are many opportunities for this within healthcare, security and infrastructure. The same goes for self-driving cars and for assembly lines where apples are sorted or quality checks are carried out, all tasks that rely on so-called ‘automatic anomaly detection’. More visible examples are drones that inspect bridges or crops, or the software we develop at Sensing Clues, where I volunteer, with which elephant populations can be observed using satellite images.”
But however rapid progress might be, there are still many limitations to the use of machine vision. “As a rule, you can't use it for tasks that last longer than a second, or for social interactions. Group dynamics are very complicated; first of all because of the amount of data that is involved, but also because of the complex way people communicate with each other. This will improve over the coming years, not necessarily because of the increased amount of annotated data, but because of the improved technology.”
‘Machine vision can replace more and more service professions’
Beuving expects that machine vision systems will be able to take over an increasing number of human tasks – even entire functions – over the next few years. “Consider, for example, taxi drivers and truck drivers, whose work could be made redundant by self-driving cars and trucks. It is also interesting to look at service professions such as hairdressing and helpdesks, where interactions are largely with just one person; these types of interactions can be increasingly replaced.”
Trends: edge computing and combined data
The advance of machine vision is facilitated by edge computing. Because the data is processed at its source, rather than in a data center, response time is improved and bandwidth reduced. This is especially interesting for time-critical machine vision applications in, for example, security, production environments and in the area of self-driving cars. Beuving: “The edge is getting cheaper, with lower power consumption and smaller devices. As a result, more and more is possible on the edge, so on the device itself. With drones, for example, videos are often only processed after the event. But if that can be done on the device itself, this would enable real-time applications.”
The large-scale availability of data other than images and videos also supports the machine vision revolution. “Think, for example, of security or wildlife cameras that are triggered by sounds. Instead of just triggering, other types of data could also be combined with images to serve as input for a neural network. If you have videos with sound when birdwatching, then you know for sure what kind of bird it is. A photo of the Eiffel Tower combined with positioning data makes it clear whether it is the tower in Paris or the replica in Las Vegas. Self-driving cars also combine machine vision with other types of data.”
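The Eiffel Tower example can be sketched in a few lines of Python. This is only an illustration of combining two data sources, not a method described by Beuving: the coordinates are approximate, the image scores are made up, and the names `LANDMARKS`, `haversine_km`, `disambiguate` and the 50 km `scale_km` are assumptions chosen for the example.

```python
import math

# Approximate coordinates (latitude, longitude); illustrative values only.
LANDMARKS = {
    "Eiffel Tower, Paris": (48.8584, 2.2945),
    "Eiffel Tower replica, Las Vegas": (36.1126, -115.1726),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def disambiguate(image_scores, photo_position, scale_km=50.0):
    """Weight each image score by how close the photo was taken to the landmark."""
    fused = {
        name: score * math.exp(-haversine_km(photo_position, LANDMARKS[name]) / scale_km)
        for name, score in image_scores.items()
    }
    return max(fused, key=fused.get)


# A hypothetical image model cannot tell the two towers apart...
image_scores = {"Eiffel Tower, Paris": 0.5, "Eiffel Tower replica, Las Vegas": 0.5}
# ...but the position stored with the photo settles it.
vegas_guess = disambiguate(image_scores, (36.11, -115.17))
paris_guess = disambiguate(image_scores, (48.86, 2.29))
```

In a real system the positioning data would more likely be fed into the neural network together with the image, as the quote suggests; multiplying the two scores afterwards is simply the shortest way to show the idea.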
Deepfakes: both a blessing and a worry
Distance teaching, creating movie characters, reconstructing crime scenes – there are plenty of positive use cases for deepfakes. However, these are currently being drowned out by the potential negative consequences. “Deepfakes are based on machine vision techniques. Because both the technology behind them and the data are improving, it is becoming increasingly difficult to tell what is real and what is a deepfake. I know what to look for, but with the better deepfakes most people don't see the difference anymore,” says Beuving.
Beuving expects that, as the quality of deepfakes increases and it becomes easier to make them, the already heated discussion about deepfakes will become even more intense in the coming years. And there are indeed already countless examples of deepfake incidents, from CEO fraud to revenge porn.
Step one: more data
The emergence of greatly improved graphics processors, as well as learning methods for deep learning, combined with larger data sets, has meant that machine vision has been on the rise since 2014. As a result, investments in machine vision are paying off in an increasing number of situations.
If you as an organization want to get started with machine vision, Beuving has some advice. Despite the spectacular techniques he describes in the webinar for getting more out of data, he always advises companies to try to collect as much data as possible first. “But we're facing a data problem because of the enormous amount of data that is being generated. On YouTube for example, the total time of videos uploaded every day amounts to around 80 years; most of that is unannotated. Algorithms, at least traditionally, have required properly annotated data for an in-depth understanding of video.”
Getting more from less data
Self-supervised learning, according to Beuving, is the go-to method for most companies when it comes to solving the data problem in the best way possible, provided that data is available. In supervised learning, a person has to label all the data points, which are then used to train the neural network. In addition to being slow and expensive, this is also prone to errors. Beuving: “Self-supervised learning, currently a hype in machine vision land, takes a revolutionary approach in that data no longer needs to be labeled; it uses self-labeling, i.e. the data annotates itself and learns from itself. With few resources you get very rich neural networks.”
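To make ‘the data annotates itself’ concrete, here is a minimal sketch of one well-known self-supervised pretext task, rotation prediction. It is not necessarily the technique Beuving uses: the tiny 2×2 ‘image’ and the helper names `rotate90` and `make_rotation_examples` are assumptions for illustration. Each unlabeled image is rotated by 0, 90, 180 and 270 degrees, and the rotation itself becomes the label a network would learn to predict.

```python
from typing import List, Tuple

Image = List[List[int]]  # a tiny grayscale image as rows of pixel values


def rotate90(img: Image) -> Image:
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def make_rotation_examples(img: Image) -> List[Tuple[Image, int]]:
    """Return (rotated image, label) pairs; label k means k * 90 degrees clockwise."""
    examples = []
    current = [list(row) for row in img]
    for k in range(4):
        examples.append((current, k))
        current = rotate90(current)
    return examples


# One unlabeled image yields four labeled training examples, with no human annotator.
unlabeled = [[1, 2],
             [3, 4]]
pairs = make_rotation_examples(unlabeled)
```

A network trained to predict these free labels has to learn something about the content of the images, and those learned features can then be reused for the actual task.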
According to the machine vision expert, the best way to fine-tune the result is with meta-learning or active learning. In meta-learning, the model learns from few or even no samples in the training set. Active learning means that people only label the difficult data points, after which the model is retrained using the new data points. Beuving: “This method offers a solution for organizations that struggle with a limited amount of annotated data because they work with robots for example. But you can also think of the medical world, where images are too diverse and availability is problematic because of privacy regulations.”
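The selection step in active learning can be sketched as follows. This assumes a binary classifier and uses uncertainty sampling, one common way of deciding which data points count as ‘difficult’; the function name `select_for_labeling`, the frame names and the probabilities are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(probabilities: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` samples whose predicted probability is closest to 0.5."""
    ranked = sorted(probabilities, key=lambda name: abs(probabilities[name] - 0.5))
    return ranked[:budget]


# Hypothetical model outputs: the probability that a frame contains the target object.
predictions = {
    "frame_001": 0.98,  # confident: no human needed
    "frame_002": 0.52,  # uncertain
    "frame_003": 0.10,  # confident
    "frame_004": 0.45,  # uncertain
}
to_label = select_for_labeling(predictions, budget=2)
```

Only the selected frames go to a human annotator; the model is then retrained with the new labels and the selection is repeated.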
Harness the power of Facebook and Google
According to Beuving, the entry level for organizations that want to get started with machine vision is quite high: “Out-of-the-box solutions are not available and you need a lot of experience and knowledge to get going. Fortunately, Facebook and Google allow you to step in at a relatively high level. They both carry out a lot of research in this area and make all kinds of open-source standard frameworks for machine vision available on GitHub. As a result, you as an organization can start your own research where Facebook and Google left off. That is how we do it at SmarterVision and Sensing Clues.”
The hardest part of course is to make sure you get the most out of the available data.
“That's just a matter of experience,” says Beuving. “You have to get a feeling for it and you can only sharpen that by doing it often. Every type of data is different. For example, look at Facebook's PyTorch. That is a framework that consists of a kind of Lego blocks that you have to combine into a solution. The difficulty is not only that you have to put that combination together perfectly, but also that you have to make the best possible use of the available data.”