Machine Vision

The RoboRealm application provides a quick and simple interface for experimenting with Machine Vision using an ordinary webcam. Machine Vision is typically a complex technology that requires insight into how images are represented within digital computers. A deep understanding of lines, edges, colors, textures, etc. is often needed, along with a supporting background in linear algebra, differential equations, eigenvalues, etc. This requirement often dissuades people from getting involved in machine vision technologies. RoboRealm's task is to provide a fun environment for exploring machine vision and related techniques without requiring a significant technical or mathematical background.

Machine Vision is the process of interpreting pixels that are captured using a digital device such as a webcam or digital camera. Typically, machine vision can be broken into the following parts:

1. Image acquisition - how you get the image
2. Image conversion - converting the image into a usable format
3. Image processing - tweaking pixels
4. Statistical Analysis - understanding the image through numbers
5. Machine control - moving, rotating, etc. a hardware device based on the image

Image Acquisition

Image Acquisition is how you get the image into your computer. While machine vision does not strictly require live rather than saved images, it is often more useful to have video capture devices provide images to a machine vision application because machines need to be controlled in realtime. During development or prototyping one can use saved images for processing, which defers much of the difficulty of acquiring images until the final deployment stage.

For experimentation, images in formats such as GIF, JPEG, etc. that can be easily acquired via the web can provide a large test bed of images surrounding a specific topic.

For realtime acquisition of images, Windows provides a standardized technique called "Video For Windows" or VFW. This technique allows you to create a single program that can then interface with most if not all video capture devices that are supported under Windows. RoboRealm supports this technique and we have successfully interfaced to many video capture devices such as webcams, TV capture devices and digitizers.

Realtime acquisition of images in the Unix environment is somewhat more arcane and requires understanding of a specific device's interface. Many capture cards expose their interface via a file-based handle, as many other Unix device drivers do. However, the commands used to configure the device and capture images are device-specific and are usually not transferable to different devices. Nevertheless, many machine vision applications are custom built for a specific purpose and therefore do not need to be interoperable across a large number of devices.

The result of the image acquisition process is a sequence of bytes, in memory or on disk, that represents the image in some format.

Image Conversion

The image conversion process takes the results of the image acquisition and converts the image into a format that is easier to use during the following machine vision phases. The conversion process focuses on how the image is formatted (e.g. is it a compressed image file like GIF or JPEG?) and how the pixel values are stored or packed (e.g. what are the color value ranges, and is each color value an integer or a floating point number?).

Many hundreds of image formats exist but some are more popular and more widely used than others. For web related applications GIF, JPEG and PNG make up the majority of the images accessible on the web. For other machine vision applications TIFF or PPM are more widely used. If your image acquisition process leaves you with an encoded or compressed file you will need to decode the file in order to gain access to the raw pixel color values before proceeding to the next machine vision phase.
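As a rough illustration of what decoding involves, the sketch below reads the simplest of the formats mentioned above, a binary PPM file, into raw pixel bytes. It assumes the "P6" variant with 8 bit samples; the function name parse_ppm is made up for this example, and compressed formats such as JPEG need a real decoding library instead.

```python
def parse_ppm(data: bytes):
    """Decode a binary PPM (magic number P6) into (width, height, pixels)."""
    tokens = []
    pos = 0
    # The header holds four whitespace-separated tokens:
    # magic number, width, height and maximum sample value.
    while len(tokens) < 4:
        # Skip whitespace between tokens.
        while data[pos:pos + 1].isspace():
            pos += 1
        # Skip comment lines, which start with '#'.
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PPM header")
        tokens.append(data[start:pos])
    pos += 1  # a single whitespace byte separates the header from the pixels

    magic = tokens[0]
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != b"P6":
        raise ValueError("not a binary PPM file")
    if maxval != 255:
        raise ValueError("this sketch only handles 8 bit samples")

    pixels = data[pos:pos + width * height * 3]
    if len(pixels) != width * height * 3:
        raise ValueError("pixel data is truncated")
    return width, height, pixels


# A hand-built 2x1 image: one red pixel followed by one blue pixel.
sample = b"P6\n# tiny test image\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])
width, height, pixels = parse_ppm(sample)
```

The returned bytes hold three values per pixel, one each for red, green and blue, which is already the layout the later phases want.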

If you have chosen to skip the image acquisition phase you will most likely be dealing with images that are encoded as JPEG. To access the raw pixel values you will need a package such as FreeImage to load and provide the pixel values to your application. Note that Windows also has built in JPEG decoding routines but FreeImage supports more formats and works in multiple environments.

Traditionally, image acquisition devices will provide you with a sequence of bytes that are packed in a specific way. If you're lucky you will have a sequence of bytes that are stored as 0-255 values in sets of 3. The first byte will traditionally be the red value, followed by the green and then the blue. The fourth byte will then be the next pixel's red value, and so on. This is known as an 8 bit RGB packed pixel. The 8 bit refers to how many bits are allocated for each component of the pixel. This format is also referred to as RGB 24, the 24 being the sum of 8+8+8 for the 3 values that comprise the pixel. This format is usually the easiest to process from a conceptual point of view and thus is usually the desired format at the end of the image conversion process.
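To make the packed layout concrete, here is a minimal sketch of looking up one pixel in an RGB 24 buffer. It assumes the bytes are in red, green, blue order, that rows are stored top to bottom, and that there are no padding bytes at the end of a row; real capture devices may differ on all three points. The name get_pixel is just for illustration.

```python
def get_pixel(buf, width, x, y):
    """Return the (red, green, blue) values of the pixel at column x, row y."""
    # Each pixel takes 3 bytes and each row holds `width` pixels.
    offset = (y * width + x) * 3
    return buf[offset], buf[offset + 1], buf[offset + 2]


# A 2x2 image: red, green on the first row; blue, white on the second.
image = bytes([
    255, 0, 0,     0, 255, 0,
    0, 0, 255,     255, 255, 255,
])
top_right = get_pixel(image, 2, 1, 0)
bottom_left = get_pixel(image, 2, 0, 1)
```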

There are many other pixel packing formats such as I420 or RGB 555, etc. All of the formats just represent the image pixel values in a different way. For example, RGB 555 uses just 5 bits (instead of the previously mentioned 8) to store each color value in an RGB sequence. The main difference is that the RGB 555 format is less precise than the RGB 24 format since fewer bits are used. The benefit is that an image is a third smaller and can be transmitted faster through smaller bandwidth pipes (note that RGB 24 is 3 bytes per pixel whereas RGB 555 is 2 bytes per pixel with one bit being unused). In addition, the loss of image color, while perceptible, is not significant and will usually not alter any machine vision processing results. If image size and transmission are not an issue RGB 24 is normally preferred since it is easier to work with byte aligned (8 bit) values than to split bytes into bits to extract the RGB triplet.
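The bit splitting mentioned above can be sketched as follows. This assumes the common RGB 555 layout in which a 16 bit value has its top bit unused, followed by 5 bits each of red, green and blue; how the two bytes are ordered in memory depends on the device and is not handled here. The function names are made up for this example.

```python
def rgb555_to_rgb24(word):
    """Unpack a 16 bit RGB 555 value into 8 bit (red, green, blue) values."""
    r5 = (word >> 10) & 0x1F
    g5 = (word >> 5) & 0x1F
    b5 = word & 0x1F

    def expand(v):
        # Scale 0-31 up to 0-255 by repeating the top bits in the low bits.
        return (v << 3) | (v >> 2)

    return expand(r5), expand(g5), expand(b5)


def rgb24_to_rgb555(r, g, b):
    """Pack 8 bit (red, green, blue) values into 16 bits, dropping the low 3 bits of each."""
    return ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3)


# Round-tripping a color shows the small precision loss of the 5 bit format.
original = (200, 100, 50)
round_trip = rgb555_to_rgb24(rgb24_to_rgb555(*original))
```

Here round_trip comes back as (206, 99, 49) rather than (200, 100, 50), which is the kind of small color shift described above.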

Image Processing

Once you have the image in a usable format (RoboRealm uses RGB 24) you can start the image processing phase of the machine vision application. Image processing is perhaps the most time consuming and difficult phase of a machine vision application. This stage requires you to transform the image, using any number of the many available techniques, into the desired view.

For example, if you need to track an object then you first need to decide which characteristics of that object can be used to detect it as robustly as possible. These characteristics are often referred to as image features. Note that more than one feature, and sometimes thousands, can be used to detect or track an object. These features can range from simple color or intensity (object brightness) to more complex features such as edges or shapes. Some features to consider when determining your approach are:

color - does the object have a unique color (e.g. neon green, bright purple, etc.)
intensity - is the object brighter or darker than other objects in the image
object location - is the object always in the top of image, right corner of image, etc.
movement - does the object move in a specific way, e.g. does it wiggle, sway, move slowly or stay stationary
texture/pattern - does the object have a unique texture (e.g. tree bark, red bricks, pebble stones)
edges - does the object have well defined edges that are straight or circular
structure - given simpler blobs or parts of the image can the object be composed of simpler objects arranged in a specific manner?

Obviously, many features can be extracted from an image. Most of image processing is about choosing the right features out of the millions that can be extracted. Sometimes this process can be automated, but given limited time and resources a human decision on which features to use can help considerably.
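As one illustration of the color feature from the list above, the sketch below marks every pixel whose color is close to a chosen target color. It assumes an RGB 24 buffer in red, green, blue order; the function name, the target color and the tolerance value are all arbitrary choices for the example rather than anything RoboRealm prescribes.

```python
def color_mask(pixels, width, height, target, tolerance):
    """Return rows of booleans, True where a pixel is within `tolerance` of `target`."""
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            offset = (y * width + x) * 3
            r, g, b = pixels[offset:offset + 3]
            # A pixel matches only if all three channels are close to the target.
            close = (abs(r - target[0]) <= tolerance
                     and abs(g - target[1]) <= tolerance
                     and abs(b - target[2]) <= tolerance)
            row.append(close)
        mask.append(row)
    return mask


# A 3x1 image: gray, a neon green, and a slightly different neon green.
image = bytes([128, 128, 128,   57, 255, 20,   70, 240, 30])
mask = color_mask(image, 3, 1, target=(57, 255, 20), tolerance=20)
```

The resulting mask of matching pixels is the kind of intermediate result that the next phase reduces to a few numbers.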

Statistical Analysis

Processing an image for specific features can enhance and extract aspects of the image that can be used to focus on the final target. However, feature extraction through image processing can result in a very large quantity of numbers that need to be reduced substantially to be useful. For example, when driving a robot based on an image, thousands of image pixels will reveal hundreds of pixel edges. Statistical Analysis of those resulting pixel edges can reduce the hundreds of pixel edges to a single Center of Gravity number that can be used to turn the robot to the left or to the right. There are a number of statistics that can be deduced from the image (some even without any image processing) that reduce an image to a few numbers, or even a single number, that represent it.
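The reduction to a Center of Gravity can be sketched in a few lines. This assumes the image processing phase has produced a mask, represented here as rows of true/false values where true marks a pixel that survived processing; that representation and the function name are choices made for this example.

```python
def center_of_gravity(mask):
    """Return the (x, y) average position of the set pixels, or None if there are none."""
    count = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, is_set in enumerate(row):
            if is_set:
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        return None  # nothing left in the image, so there is no balance point
    return sum_x / count, sum_y / count


# Three edge pixels, all in the right half of a 6 pixel wide image.
mask = [
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
]
cog = center_of_gravity(mask)
```

Only the x value of the result is needed to decide between left and right; comparing it with half the image width says which side of the image the surviving pixels lean toward.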

Useful statistics include:

center of gravity - determines where the resulting pixels could be balanced on a single point. Also referred to as center of mass.
pixel count - count the resulting pixels that are left. A high count can mean an object is present; a low count can mean it is absent.
blob - an area of connected pixels - size, location, number, etc can be computed
max, min, mean, mode - basic pixel value calculations
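The blob statistic from the list above can be sketched with a simple flood fill. The version below treats pixels as connected when they touch horizontally or vertically (diagonal neighbors are not counted, which is an assumption of this example) and reports the size and center of each blob. The function name and the dictionary layout are made up for illustration.

```python
from collections import deque


def find_blobs(mask):
    """Group connected set pixels into blobs and report each blob's size and center."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Start a new blob here and spread out to every connected pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            members = []
            while queue:
                cx, cy = queue.popleft()
                members.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            size = len(members)
            center = (sum(px for px, _ in members) / size,
                      sum(py for _, py in members) / size)
            blobs.append({"size": size, "center": center})
    return blobs


mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
blobs = find_blobs(mask)
pixel_count = sum(blob["size"] for blob in blobs)
```

For this mask the result is two blobs, of 3 and 2 pixels, and adding up the blob sizes gives the pixel count statistic as well.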

Machine Control

Once the appropriate numbers have been calculated they need to be translated into motor or servo movements in order for the robot or machine to react to what it sees. Controlling a machine from a PC typically requires a board that translates serial or parallel commands into PWM (Pulse Width Modulation) signals or into actual electrical current via an H-bridge. Many boards meant to control traditional servos exist. Have a look at Parallax for examples of such boards. Sending commands to such a board requires that the statistics calculated from a processed image be translated into left and right motor commands in the case of steering a robot. This translation can take the form of simple conditional statements that turn a motor on or off based on some threshold value. For example, if the COG (Center of Gravity) of an image is left of the image center, send the servo board a 255 for the left motor. Likewise, if the COG is right of the image center, send a 255 command to the servo controller board for the right servo to move the robot right.
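The conditional statements described above might look like the following sketch. It follows the mapping given in the text (COG on the left drives the left motor, COG on the right drives the right motor); the dead band around the image center and the choice to drive both motors when the COG falls inside it are assumptions added for the example. Which motor actually turns a given robot toward the target depends on its drive layout, so the two values may need to be swapped. Sending the values to a real board is hardware specific and is left out.

```python
def steer(cog_x, image_width, dead_band=10):
    """Translate a COG x position into (left_motor, right_motor) command values."""
    center = image_width / 2
    if cog_x < center - dead_band:
        return 255, 0      # COG is left of center: drive the left motor
    if cog_x > center + dead_band:
        return 0, 255      # COG is right of center: drive the right motor
    return 255, 255        # assumed behavior: roughly centered, drive both


# With a 320 pixel wide image the center is at x = 160.
left_case = steer(50, 320)
right_case = steer(300, 320)
centered_case = steer(165, 320)
```

The dead band keeps the robot from flipping between left and right commands when the COG jitters by a pixel or two around the center.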

Often, controlling several motors based on an image requires inverse kinematics to determine the angular values for each of the servos needed to move an arm or other complex robotic device to its desired position. For example, if a robotic arm has two servos that function like the joints of a human arm, positioning the hand at a certain X, Y point requires the two servos to be set to specific angles. Calculating these angles depends on the target X, Y coordinate and on the lengths of the arm parts involved. Refer to the subject of Inverse Kinematics to find out more on how to calculate the appropriate angular values for the servos.
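For the two servo arm described above, the standard textbook solution for a planar two-link arm can be sketched as follows. It assumes the first servo sits at the origin, that angles are measured in radians, and that the second angle is measured relative to the first arm part; it returns only one of the two possible arm poses. The names used are for illustration, and converting the angles into servo commands depends on the hardware.

```python
import math


def two_link_ik(x, y, upper_len, fore_len):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y)."""
    dist_sq = x * x + y * y
    # Law of cosines gives the elbow angle from the distance to the target.
    cos_elbow = (dist_sq - upper_len ** 2 - fore_len ** 2) / (2 * upper_len * fore_len)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    # Aim at the target, then correct for the bend introduced by the elbow.
    shoulder = math.atan2(y, x) - math.atan2(fore_len * math.sin(elbow),
                                             upper_len + fore_len * math.cos(elbow))
    return shoulder, elbow


def hand_position(shoulder, elbow, upper_len, fore_len):
    """Forward kinematics, used here only to check the result."""
    x = upper_len * math.cos(shoulder) + fore_len * math.cos(shoulder + elbow)
    y = upper_len * math.sin(shoulder) + fore_len * math.sin(shoulder + elbow)
    return x, y


shoulder, elbow = two_link_ik(1.0, 1.0, upper_len=1.0, fore_len=1.0)
check_x, check_y = hand_position(shoulder, elbow, 1.0, 1.0)
```

Feeding the computed angles back through hand_position should land on the requested point, which is a convenient way to sanity check any inverse kinematics routine before connecting it to servos.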