When computers were first developed, the only way they could interact with the outside world was through the input that people wired or typed into them. Digital devices today often have cameras, microphones and other sensors through which programs can perceive the world we live in automatically. Processing images from a camera, and looking for interesting information in them, is what we call computer vision.
With increases in computer power, the decrease in the size of computers and progressively more advanced algorithms, computer vision has a growing range of applications. While it is commonly used in fields like healthcare, security and manufacturing, we are finding more and more uses for them in our everyday life, too.
For example, here is a sign written in Chinese:
If you can’t read the Chinese characters, there are apps available for smartphones that can help:
Having a small portable device that can "see" and translate characters makes a big difference for travellers. Note that the translation given is only for the second part of the phrase (the last two characters). The first part says “please don’t”, so it could be misleading if you think it’s translating the whole phrase!
Recognising of Chinese characters may not work every time perfectly, though. Here is a warning sign:
My phone has been able to translate the “careful” and “steep” characters, but it hasn’t recognised the last character in the line. Why do you think that might be?
Giving users more information through computer vision is only one part of the story. Capturing information from the real world allows computers to assist us in other ways too. In some places, computer vision is already being used to help car drivers to avoid collisions on the road, warning them when other cars are too close or there are other hazards on the road ahead. Combining computer vision with map software, people have now built cars that can drive to a destination without needing a human driver to steer them. A wheelchair guidance system can take advantage of vision to avoid bumping into doors, making it much easier to operate for someone with limited mobility.
Digital cameras and human eyes fulfill largely the same function: images come in through a lens and are focused onto a light sensitive surface, which converts them into electrical impulses that can be processed by the brain or a computer respectively. There are some differences, however.
Human eyes have a very sensitive area in the centre of their field of vision called the fovea. Objects that we are looking at directly are in sharp detail, while our peripheral vision is quite poor. We have separate sets of cone cells in the retina for sensing red, green and blue (RGB) light, but we also have special rod cells that are sensitive to light levels, allowing us to perceive a wide dynamic range of bright and dark colours. The retina has a blind spot (a place where all the nerves bundle together to send signals to the brain through the optic nerve), but most of the time we don’t notice it because we have two eyes with overlapping fields of view, and we can move them around very quickly.
Digital cameras have uniform sensitivity to light across their whole field of vision. Light intensity and colour are picked up by RGB sensor elements on a silicon chip, but they aren’t as good at capturing a wide range of light levels as our eyes are. Typically, a modern digital camera can automatically tune its exposure to either bright or dark scenes, but it might lose some detail (e.g. when it is tuned for dark exposure, any bright objects might just look like white blobs).
It is important to understand that neither a human eye nor a digital camera --- even a very expensive one --- can perfectly capture all of the information in the scene in front of it. Electronic engineers and computer scientists are constantly doing research to improve the quality of the images they capture, and the speed at which they can record and process them.
One challenge when using digital cameras is something called noise. That’s when individual pixels in the image appear brighter or darker than they should be, due to interference in the electronic circuits inside the camera. It’s more of a problem when light levels are dark, and the camera tries to boost the exposure of the image so that you can see more. You can see this if you take a digital photo in low light, and the camera uses a high ASA/ISO setting to capture as much light as possible. Because the sensor has been made very sensitive to light, it is also more sensitive to random interference, and gives photos a "grainy" effect.
Noise mainly appears as random changes to pixels. For example, the following image has "salt and pepper" noise.
Having noise in an image can make it harder to recognise what's in the image, so an important step in computer vision is reducing the effect of noise in an image. There are well-understood techniques for this, but they have to be careful that they don’t discard useful information in the process. In each case, the technique has to make an educated guess about the image to predict which of the pixels that it sees are supposed to be there, and which aren’t.
Since a camera image captures the levels of red, green and blue light separately for each pixel, a computer vision system can save a lot of processing time in some operations by combining all three channels into a single “grayscale” image, which just represents light intensities for each pixel.
This helps to reduce the level of noise in the image. Can you tell why, and about how much less noise there might be? (As an experiment, you could take a photo in low light --- can you see small patches on it caused by noise? Now use photo editing software to change it to black and white --- does that reduce the effect of the noise?)
Rather than just considering the red, green and blue values of each pixel individually, most noise-reduction techniques look at other pixels in a region, to predict what the value in the middle of that neighbourhood ought to be.
A mean filter assumes that pixels nearby will be similar to each other, and takes the average (i.e. the mean) of all pixels within a square around the centre pixel. The wider the square, the more pixels there are to choose from, so a very wide mean filter tends to cause a lot of blurring, especially around areas of fine detail and edges where bright and dark pixels are next to each other.
A median filter takes a different approach. It collects all the same values that the mean filter does, but then sorts them and takes the middle (i.e. the median) value. This helps with the edges that the mean filter had problems with, as it will choose either a bright or a dark value (whichever is most common), but won’t give you a value between the two. In a region where pixels are mostly the same value, a single bright or dark pixel will be ignored. However, numerically sorting all of the neighbouring pixels can be quite time-consuming!
A Gaussian blur is another common technique, which assumes that the closest pixels are going to be the most similar, and pixels that are farther away will be less similar. It works a lot like the mean filter above, but is statistically weighted according to a normal distribution.
Open the noise reduction filtering interactive below and experiment with the settings.
Mathematically, this process is applying a special kind of matrix called a convolution kernel to the value of each pixel in the source image, averaging it with the values of other pixels nearby and copying that average to each pixel in the new image. In the case of the Gaussian blur, the average is weighted, so that the values of nearby pixels are given more importance than ones that are far away. The stronger the blur, the wider the convolution kernel has to be and the more calculations take place.
For this activity, investigate the different kinds of noise reduction filter and their settings (grid size, type of blur) and determine:
- how well they cope with different levels of noise (you can set this in the interactive or upload your own noisy photos).
- how much time it takes to do the necessary processing (the interactive shows the number of pixels per second that it can process)
- how they affect the quality of the underlying image (a variety of images + camera)
You can take screenshots of the image to show the effects in your writeup. You can discuss the tradeoffs that need to be made to reduce noise.
Another technique that is sometimes useful in computer vision is thresholding. This is something which is relatively simple to implement, but can be quite useful.
If you have a greyscale image, and you want to detect regions that are darker or lighter, it could be useful to simply transform any pixel above a certain level of lightness to pure white, and any other pixel to pure black. In the Data Representation chapter we talked about how pixels are all represented as sets of numbers between 0 and 255 which represent red, green and blue values. We can change a pixel to greyscale by simply taking the average of all three of these values and setting all three values to be this average.
Once we have done this, we can apply a threshold to the image to decide whether certain pixels are in certain regions. The following interactive allows you to do this to an image you upload. Try using different thresholds on different images and observing the results.
The next interactive lets you do the same thing, but on the original colour image. You can set up more complicated statements as your threshold, such as "Red > 127 AND Green < 127 OR Blue > 63". Does applying these complex thresholds gives you more flexibility in finding specific regions of colour?
Thresholding on its own isn't a very powerful tool, but it can be very useful when combined with other techniques as we shall see later.
Recognising faces has become a widely used computer vision application. These days photo album systems like Picasa and Facebook can try to recognise who is in a photo using face recognition --- for example, the following photos were recognised in Picasa as being the same person, so to label the photos with people's names you only need to click one button rather than type each one in.
There are lots of other applications. Security systems such as customs at country borders use face recognition to identify people and match them with their passport. It can also be useful for privacy --- Google Maps streetview identifies faces and blurs them. Digital cameras can find faces in a scene and use them to adjust the focus and lighting.
There are some relevant articles on the cs4fn website that also provide some general material on computer vision.
First let's manually try some methods for recognising whether two photographs show the same person.
- Get about 3 photos each of 3 people
- Measure features on the faces such as distance between eyes, width of mouth, height of head etc. Calculate the ratios of some of these.
- Do photos of the same person show the same ratios? Do photos of different people show different ratios? Would these features be a reliable way to recognise two images as being the same person?
- Are there other features you could measure that might improve the accuracy?
You can evaluate the effectiveness of facial recognition in free software such as Google’s Picasa or the Facebook photo tagging system, but uploading photos of a variety of people and seeing if it recognises photos of the same person. Are there any false negatives or positives? How much can you change your face when the photo is being taken to not have it match your face in the system? Does it recognise a person as being the same in photos taken several years apart? Does a baby photo match of a person get matched with them when they are five years old? When they are an adult? Why or why not does this work?
Try using face recognition on this website to see how well the Haar face recognition system can track a face in the image. What prevents it from tracking a face? Is it affected if you cover one eye or wear a hat? How much can the image change before it isn't recognised as a face? Is it possible to get it to incorrectly recognise something that isn't a face?
A useful technique in computer vision is edge detection, where the boundaries between objects are automatically identified. Having these boundaries makes it easy to segment the image (break it up into separate objects or areas), which can then be recognised separately.
For example, here's a photo where you might want to recognise individual objects:
And here's a version that has been processed by an edge detection algorithm:
Notice that the grain on the table above has affected the quality; some pre-processing to filter that would have helped!
Earlier, we looked at how we could use a convolutional kernel to blur an image. One of the common techniques in edge detection also requires a convolutional kernel. If we multiply the values of pixels on one side of each point on the image by a negative amount, and pixels on the other side by a positive amount, then combine the results, we'll discover a number which represents the difference between pixels on the two sides. This technique is called finding the image gradient. The following interactive allows you to do that, then apply a threshold to the result so that you can begin to spot likely edges in a picture.
There are a few commonly used convolutional kernels that people have come up with for finding edges. After you've had a go at coming up with some of your own, have a look at the Prewitt operator, the Roberts cross and Sobel operator on wikipedia. Try these out in the interactive. What results do you get from each of these?
There are a number of good edge detections out there, but one of the more famous ones is the Canny edge detection algorithm. This is a widely used algorithm in computer vision, developed in 1986 by John F. Canny. You can read more about Canny edge detection on Wikipedia.
You could extend the techniques used in the above interactive by adding a few more processing stages. If you were to apply a gaussian filter to the image first, then do some work to favour edges that were connected to other edges, then you would be on your way to implementing the Canny edge detector.
With the canny edge detection interactive above, try uploading different images and determining how good different convolutional kernels are at detecting boundaries in the image. Capture images to put in your report as examples to illustrate your experiments with the detector.
- Can the detector find all edges in the image? If there are some missing, why might this be?
- Are there any false edge detections? Why did they system think that they were edges?
- Does the lighting on the scene affect the quality of edge detection?
- Does the system find the boundary between two colours? How similar can the colours be and still have the edge detected?
- How fast can the system process the input? Does this change with more complex kernels?
- How well does the system deal with an image with text on it?
The field of computer vision is changing rapidly at the moment because camera technology has been improving quickly over the last couple of decades. Not only is the resolution of cameras increasing, but they are more sensitive for low light conditions, have less noise, can operate in infra-red (useful for detecting distances), and are getting very cheap so that it's reasonable to use multiple cameras, perhaps to give different angles or to get stereo vision.
Despite these recent changes, many of the fundamental ideas in computer vision have been around for a while; for example, the "k-means" segmentation algorithm was first described in 1967, and the first digital camera wasn't built until 1975 (it was a 100 by 100 pixel Kodak prototype).
(More material will be added to this chapter in the near future)
We also have a Viola-Jones Face Detection interactive available, supporting text will be added at a later date.