One of the hottest fields in technology is computer vision. Others might call it image processing or machine vision - essentially, it is about giving computers the power of sight so that they can perceive and understand visual information. Our eyesight is something many people take for granted, yet by some estimates, 80-85% of our sensory information comes from vision. Researchers have been trying to help computers perceive the world the way humans and other animals do for over half a century. Recent advances in computer vision have drawn inspiration from biological vision and cognition, leading to a convergence between the fields of computer vision and artificial intelligence / machine learning. In the words of Stanford University Professor Fei-Fei Li, “if we want machines to think, we need to teach them to see.”
Digital images are made up of pixels, where each pixel is one data point carrying brightness or colour information at a particular spatial position. The more pixels we have, the more detail we capture - this level of detail is what we call the image’s “resolution”. More detail means we can display the image at a larger physical size before it starts to look blurry or pixelated. Most importantly, pixels give us a numerical representation of the image, and then we can do maths on those numbers! Once we figure out what maths needs to be done, we can get computers to do a huge number of calculations very quickly, far quicker than any human, and process all the data automatically.
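To make this concrete, here is a minimal sketch in Python of an image as a grid of numbers we can do maths on. It assumes the numpy and Pillow libraries are installed, and “photo.jpg” is a hypothetical filename - substitute any image you have.

```python
# A minimal sketch: an image is just a grid of numbers.
# Assumes numpy and Pillow; "photo.jpg" is a hypothetical filename.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.jpg").convert("L"))  # greyscale: one brightness value per pixel

print(img.shape)   # the resolution: rows and columns of pixels, e.g. (3024, 4032)
print(img[0, 0])   # one pixel: a brightness value from 0 (black) to 255 (white)
print(img.mean())  # "doing maths on the numbers": average brightness of the image

# A simple pixel-wise operation: brighten every pixel by 50, clipped to the valid range.
brighter = np.clip(img.astype(int) + 50, 0, 255).astype(np.uint8)
```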
There are two more related fields that tell us a little more about how computer vision is achieved - academic conferences showcasing the latest computer vision technologies might use the terms “signal processing” or “pattern recognition”. You might normally think of a “signal” as a squiggly line that moves up and down over time, but there’s no requirement that the x-axis has to be time. A signal is just data representing one variable changing over some dimension - in the case of an image, we could measure the brightness changing as we move horizontally or vertically across the image. We can then apply all sorts of signal processing techniques like filtering or compression to our images, which means we don’t have to re-invent the wheel and can borrow from an existing body of knowledge. “Pattern recognition” gives us a hint that a lot of computer vision processing is about finding patterns in the signals. A pattern is just an arrangement of data points that we can reliably identify when it appears. For example, we might have an image of a shoe - the pixels that represent that shoe could be called a pattern. What matters is that if we have another photo of the shoe, we want to find the same pattern so that we can say those pixels correspond to the same object. This turns out to be quite hard, because the patterns are rarely exactly the same - there needs to be some room for variation in lighting conditions, camera angles, lens differences, and so on. So, for example, instead of looking for exact matches of pixel values, we might use signal processing to find the edges of an object, and then match the shapes instead.
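As a rough illustration of that last idea, here is a short Python sketch that treats an image as a 2-D signal and filters it to highlight edges. It assumes numpy and scipy are available, and uses the Sobel filter - one classic choice, not the only way to do it.

```python
# A sketch of treating an image as a 2-D signal and filtering it to find edges.
# Assumes numpy and scipy; the Sobel filter is one classic edge detector.
import numpy as np
from scipy import ndimage

def edge_strength(img: np.ndarray) -> np.ndarray:
    """Return per-pixel edge strength for a greyscale image (2-D float array)."""
    gx = ndimage.sobel(img, axis=1)  # brightness change moving horizontally
    gy = ndimage.sobel(img, axis=0)  # brightness change moving vertically
    return np.hypot(gx, gy)         # combined gradient magnitude

# Toy example: a dark square on a bright background - edges appear at the border.
img = np.full((8, 8), 200.0)
img[2:6, 2:6] = 50.0
edges = edge_strength(img)
print((edges > 100).astype(int))    # rough outline of the square
```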
While the physics of biological perception is relatively well understood, the cognition part is still somewhat of a mystery, and there are a lot of unknowns in neuroscience. Rather than trying to copy exactly how the brain works, researchers have turned to artificial intelligence and machine learning, fields that have exploded over the last decade or so. The good news is that these methods have delivered very impressive results, detecting and recognising objects in images with high levels of accuracy. The bad news is that these methods are often treated as “black boxes”, because the level of complexity is so great that it is practically impossible for any human to understand exactly what is happening inside. These methods optimise or “learn” automatically, modifying an internal model to improve performance without human intervention. Many of the researchers and developers designing these systems work by trial and error, just trying to squeeze out better accuracy rates. After Microsoft won the ImageNet challenge, a top international computer vision competition, in 2015, one of their executives asked the research team why their system had 152 layers of computation instead of 151 or 153 - the lead engineer reportedly said “we don’t know, we tried them all and 152 had the best result”. For many developers, the AI models are abstracted away even further by new tools and APIs that let them deploy a model with a single function call, hiding the complexity even more.
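To illustrate that last point, here is a hedged sketch of what such a one-call deployment can look like, using the Hugging Face transformers library as one example. The model it downloads by default is an internal detail we never have to inspect - the “black box” in practice - and “photo.jpg” is a hypothetical filename.

```python
# A sketch of how modern tooling hides model complexity behind one call.
# Assumes the Hugging Face `transformers` library (plus a backend like PyTorch).
from transformers import pipeline

classifier = pipeline("image-classification")  # downloads a pretrained model
predictions = classifier("photo.jpg")          # "photo.jpg" is a hypothetical file
for p in predictions:
    print(f"{p['label']}: {p['score']:.2f}")   # e.g. a label like "running shoe" with a confidence score
```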
So what does this all mean? The point is that computer vision is a relatively complicated and difficult area of technology, even though it is just about finding patterns in numbers. But we have come a really long way with the technology, and it’s already in our everyday lives. For example, most cameras and smartphones have automatic face detection technology that helps the camera focus on faces rather than the background or other objects. Some cameras even have smile detection to make sure that photos are taken at the best time!
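As a taste of how accessible this has become, here is a minimal Python sketch using OpenCV’s bundled Haar-cascade face detector - one classic approach, though camera and phone makers use their own proprietary variants. The filenames are hypothetical.

```python
# A minimal face-detection sketch using OpenCV's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
img = cv2.imread("group_photo.jpg")                        # hypothetical filename
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)               # detector works on greyscale
faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                                 # one rectangle per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_marked.jpg", img)
```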
Optical character recognition (OCR) is another well-developed and widely integrated application of computer vision - essentially, extracting printed text from images. OCR systems are very common for reading number plates on vehicles (a task known as Automated Number Plate Recognition, or ANPR), converting receipts and other documents into searchable text, and digitising old books. Recognising handwriting is a bit trickier, but is now done with relatively high accuracy - most postal systems around the world automatically read postcodes to help sort the mail.
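For a sense of how little code basic OCR takes today, here is a hedged sketch using the open-source Tesseract engine via the pytesseract wrapper. It assumes Tesseract is installed, “receipt.jpg” is a hypothetical file, and production systems like ANPR use specialised models rather than this generic approach.

```python
# A hedged OCR sketch: extract printed text from an image with Tesseract.
# Assumes the Tesseract engine and the pytesseract + Pillow packages are installed.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("receipt.jpg"))  # hypothetical file
print(text)  # the extracted text, ready to search, index, or store
```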
You’ve probably also seen computer vision used in sports and entertainment. Tracking players or objects like balls can be quite challenging for a computer, but it is now done with sufficient accuracy that computer vision models can be used to inform refereeing decisions. The Hawk-Eye system, installed in major stadiums around the world, uses multiple cameras to localise and track objects, and runs a kinematics simulation to reconstruct events such as a ball landing in or out of court.
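As a purely illustrative sketch (not Hawk-Eye’s actual method), here is the kinematics idea in miniature: fit a simple ballistic model to tracked ball positions and extrapolate the landing point. The numbers are made-up sample data.

```python
# Illustrative only: fit a ballistic model to tracked 3-D ball positions,
# then predict where the ball's height reaches zero - its landing point.
import numpy as np

t = np.array([0.00, 0.02, 0.04, 0.06, 0.08])  # timestamps from camera frames (s)
x = np.array([0.00, 0.40, 0.80, 1.20, 1.60])  # metres along the court
z = np.array([1.00, 0.93, 0.85, 0.75, 0.64])  # height above the ground (m)

a, b, c = np.polyfit(t, z, 2)                 # gravity makes height quadratic in time
t_land = max(np.roots([a, b, c]).real)        # time when height = 0
vx = np.polyfit(t, x, 1)[0]                   # roughly constant horizontal speed
print(f"ball lands at x = {x[0] + vx * t_land:.2f} m after {t_land:.2f} s")
```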
These are just a few examples of how computer vision is already being used successfully in a diverse range of applications. Of course, there are many more - fashion and retail, interactive gaming, augmented reality, autonomous vehicles, person tracking and surveillance, medical imaging, and industrial robotics are just a few of the other areas where there are active efforts to deploy computer vision techniques. The research is ongoing as well, and every year more advanced and sophisticated algorithms appear at international conferences.
If we can teach computers to see as well as humans, we will unlock millions of applications and be able to use the technology in ways that simply cannot be predicted now. But achieving human-level vision is still a while away - in the next article, we’ll explore why computer vision is so hard in a metacognitive way (thinking about thinking), and why biological vision is simply amazing.
Element X is an artificial intelligence agency on a mission to make AI more accessible. We specialise in language, vision and data to accelerate your business, streamline processes and uncover meaningful insights through data.