Machine Vision
The RoboRealm application provides a simple and quick interface for
experimenting with Machine Vision using an ordinary webcam. Machine
Vision is typically a complex technology involving insight into how
images are represented within digital computers. A deep understanding
of lines, edges, colors, textures, etc. is often needed, along with a
supporting background in linear algebra, differential equations,
eigenvalues, etc. This requirement often dissuades people from getting
involved in machine vision technologies. RoboRealm's task is to provide
a fun environment for exploring machine vision and related techniques
without requiring a significant technical or mathematical background.
Machine Vision is the process of interpreting pixels that are captured
using a digital device such as a webcam or digital camera. Typically,
machine vision can be broken into the following parts:
1. Image acquisition - how you get the image
2. Image conversion - converting the image into a usable format
3. Image processing - tweaking pixels
4. Statistical Analysis - understanding the image through numbers
5. Machine control - moving, rotating, etc. a hardware device based on the image
Image Acquisition
Image Acquisition is how you get the image into your computer. While machine
vision does not strictly require live rather than saved images, it is often
more useful to have video capture devices provide images for a machine vision
application because of the need to control machines in realtime. During
development or prototyping one can use saved images for processing, which
removes much of the difficulty of acquiring images until the final deployment
stage.
For experimentation, images in formats such as GIF, JPEG, etc. that can be
easily acquired via the web can provide a large test bed of images
surrounding a specific topic.
For realtime acquisition of images, Windows provides a standardized
technique called "Video For Windows" or VFW. This technique allows you to
create a single program that can then interface with most if not all video
capture devices that are supported under Windows. RoboRealm supports this
technique and we have successfully interfaced to many video capture devices
such as webcams, TV capture devices and digitizers.
Realtime acquisition of images in the Unix environment is somewhat more
arcane and requires understanding of a specific device's interface. For
many capture cards the device exposes its interface via a file based
handle, as many other Unix device drivers do. However, the commands
used to configure the device and capture images are device specific
and are usually not transferable to different
devices. Nevertheless, many machine vision applications are custom built
for a specific purpose and therefore do not need to interoperate with
a large number of devices.
The result of the image acquisition process is to have in memory or on disk
a sequence of bytes that represent the image in some format.
Image conversion
The image conversion process takes the results of the image acquisition
and converts the image into a format that is easier to use during the
next couple of machine vision phases. The conversion process focuses on how
the image is formatted (e.g. is it a compressed image file like GIF or JPEG?)
and how the pixel values are stored or packed (e.g. what are the color value
ranges, and is the color value an integer or a floating point number?).
Many hundreds of image formats
exist, but some are more widely used than others. For web
related applications the typical image formats of GIF, JPEG or PNG comprise
the majority of the images accessible on the web. For other machine vision
applications TIFF or PPM are more widely used. If your image
acquisition process leaves you with an encoded or compressed file you will
need to decode the file in order to gain access to the raw pixel color
values before proceeding to the next machine vision phase.
If you have chosen to skip the image acquisition phase you will most likely
be dealing with images that are encoded as JPEG. To access the raw pixel
values you will need a package such as FreeImage to load and provide
the pixel values to your application. Note that Windows also has built in
JPEG decoding routines but FreeImage supports more formats and works in
multiple environments.
Traditionally, image acquisition devices will provide you with a sequence
of bytes that are packed in a specific way. If you're lucky you will have
a sequence of bytes that are stored as 0-255 values in sets of 3. The
first byte will traditionally be the red value, followed by the green and then
blue. The fourth byte will then be the next pixel's red value, and so on.
This is known as an 8 bit RGB packed pixel. The 8 bit refers to how many
bits are allocated for each component of the pixel. This format is also
referred to as RGB 24, the 24 being the sum of 8+8+8 which represents the
3 values that comprise the pixel. This format is usually the easiest from
a conceptual point of view to process and thus is usually the desired
format at the end of the image conversion process.
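As a minimal sketch of what this packing means in practice, the Python below looks up one pixel in an RGB 24 buffer. The function name get_pixel is hypothetical, and the sketch assumes rows are stored top to bottom with no padding bytes at the end of each row, which not every capture device guarantees.

```python
def get_pixel(buf, width, x, y):
    """Return the (r, g, b) values of pixel (x, y) in a packed RGB 24 buffer.

    Assumes 3 bytes per pixel in red, green, blue order, rows stored top to
    bottom, and no padding at the end of a row.
    """
    offset = (y * width + x) * 3  # 3 bytes for every pixel before this one
    return buf[offset], buf[offset + 1], buf[offset + 2]


# A 2 pixel wide, 1 pixel high image: one red pixel followed by one blue pixel.
image = bytes([255, 0, 0, 0, 0, 255])

print(get_pixel(image, 2, 0, 0))  # (255, 0, 0)
print(get_pixel(image, 2, 1, 0))  # (0, 0, 255)
```

If your device pads each row to a fixed number of bytes, the row offset would need to use that row length instead of width * 3.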
There are many other pixel packing formats such as I420, RGB 555, etc. All
of these formats just represent the image pixel values in a different way. For
example, RGB 555 uses just 5 bits (instead of the previously mentioned 8) to
store each color value in an RGB sequence. The main difference is that the
RGB 555 format is less precise than the RGB 24 format since fewer bits are used.
The benefit is that an image is smaller and can be transmitted
faster through lower bandwidth pipes (note that RGB 24 is 3 bytes per pixel
whereas RGB 555 is 2 bytes per pixel with one bit being unused). In
addition, the loss of image
color, while perceptible, is not significant and will usually not alter
any machine vision processing results. If image size and transmission are not
an issue, RGB 24 is normally preferred since it is easier to work with byte
aligned (8 bit) values than to split bytes into bits to
extract the RGB triplet.
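To illustrate the bit splitting involved, here is a small Python sketch that converts one RGB 555 value into an RGB 24 triplet. The function name is hypothetical. The sketch assumes the common layout in which the top bit is unused and red occupies the next 5 bits, followed by green and then blue; it also assumes the two bytes have already been combined into one 16 bit integer in the correct byte order for your device.

```python
def rgb555_to_rgb24(value):
    """Split a 16 bit RGB 555 value into 8 bit (r, g, b) components.

    Assumed bit layout: x RRRRR GGGGG BBBBB (top bit unused).
    """
    r5 = (value >> 10) & 0x1F
    g5 = (value >> 5) & 0x1F
    b5 = value & 0x1F

    def expand(c):
        # Scale 0-31 up to 0-255 by copying the top 3 bits into the low bits.
        return (c << 3) | (c >> 2)

    return expand(r5), expand(g5), expand(b5)


print(rgb555_to_rgb24(0x7C00))  # (255, 0, 0)
print(rgb555_to_rgb24(0x7FFF))  # (255, 255, 255)
```

The shifting and masking shown here is the extra work that byte aligned RGB 24 avoids.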
Image Processing
Once you have the image in a usable format (RGB 24 is in use in RoboRealm)
you can start the image processing phase of the machine vision application.
Image processing is perhaps the most time consuming and difficult phase
of a machine vision application. This stage requires you to transform the
image using any number of techniques until it shows only what you want to
detect.
For example, if you need to track an object then you first need to
decide what characteristics of that object can be used to detect the
object as robustly as possible. These characteristics are often
referred to as image features. Note that more than one feature,
and sometimes thousands, can be used to detect or track an object.
These features can range from simple color or intensity (object brightness)
to more complex features such as edges or shapes. Some features to consider
when determining your approach are
- color - does the object have a unique color (e.g. neon green, bright purple, etc.)
- intensity - is the object brighter or darker than other objects in the image
- object location - is the object always in the top of the image, the right corner of the image, etc.
- movement - does the object move in a specific way, e.g. does it wiggle, sway, move slowly or stay stationary
- texture/pattern - does the object have a unique texture (e.g. tree bark, red bricks, pebble stones)
- edges - does the object have well defined edges that are straight or circular
- structure - can the object be described as simpler blobs or parts of the image arranged in a specific manner?
Obviously, many features can be extracted from an image. Most of image processing is about
using the right features out of the millions that can be extracted. Sometimes this process
can be automated, but given limited time and resources a human decision on which
features to use can help considerably.
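As one illustration of using the color feature, the Python sketch below marks every pixel of a packed RGB 24 image that passes a color test. The function names and the threshold values are hypothetical, chosen only for this example, and this is not necessarily how RoboRealm implements its own color filters.

```python
def color_mask(buf, width, height, is_match):
    """Return rows of 0/1 flags marking the pixels that pass is_match(r, g, b).

    buf is assumed to be a packed RGB 24 buffer with no row padding.
    """
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            o = (y * width + x) * 3
            row.append(1 if is_match(buf[o], buf[o + 1], buf[o + 2]) else 0)
        mask.append(row)
    return mask


def is_bright_green(r, g, b):
    # Illustrative thresholds only; real values depend on lighting and camera.
    return g > 200 and r < 100 and b < 100


# A 2x2 test image: green, dark gray / light gray, green.
image = bytes([
    0, 255, 0,      10, 10, 10,
    200, 200, 200,  20, 230, 30,
])

mask = color_mask(image, 2, 2, is_bright_green)
print(mask)  # [[1, 0], [0, 1]]
```

Passing the test in as a function keeps the pixel loop the same while you try different features such as intensity or another color.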
Statistical Analysis
Processing an image for specific features can enhance and extract aspects of the image
that can be used to focus on the final target. However, feature extraction through
image processing can result in a great many numbers that need to be reduced substantially to
be useful. For example, when driving a robot based on an image, thousands of image pixels
will reveal hundreds of pixel edges. Statistical Analysis of those resulting pixel edges
can reduce the hundreds of pixel edges to a single Center of Gravity value that can be
used to turn the robot to the left or to the right. There are a number of statistics
that can be deduced from the image (some even without any image processing) that can
be used to reduce an image to a few usable numbers, or a single number, that represent the image.
Several such statistics include
- center of gravity - determines where the resulting pixels could be balanced on a single
point. Also referred to as center of mass.
- pixel count - count the resulting pixels that are left. High numbers can mean an object is
present, a low number means it is absent.
- blob - an area of connected pixels - size, location, number, etc can be computed
- max, min, mean, mode - basic pixel value calculations
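The first two statistics in the list can be sketched in a few lines of Python. The sketch assumes the image processing phase has already produced a mask of 0/1 values, one per pixel, where 1 marks a pixel that survived processing; the function names are hypothetical.

```python
def pixel_count(mask):
    """Count the pixels that are set in a 0/1 mask."""
    return sum(sum(row) for row in mask)


def center_of_gravity(mask):
    """Return the mean (x, y) position of the set pixels, or None if none are set."""
    count = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        return None  # nothing left in the image, so there is no balance point
    return sum_x / count, sum_y / count


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

print(pixel_count(mask))        # 4
print(center_of_gravity(mask))  # (1.5, 1.5)
```

Thousands of pixel values are reduced here to three numbers: a count and an X, Y position.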
Machine Control
Once the appropriate numbers have been calculated they need to be translated into
motor or servo movements in order for the robot or machine to react to what it
sees. Controlling a machine from a PC typically requires a board that translates
serial or parallel commands into PWM (Pulse Width Modulation) signals or into actual electrical current via an
H-bridge. Many boards meant to control traditional servos exist. Have a look at Parallax
for examples of such boards. Sending commands to such a board requires that the statistical numbers
calculated from a processed image be translated into left and right motor commands in the case
of steering a robot. This translation can be in the form of simple conditional statements that turn a
motor on or off based on some threshold value. For example, if the COG (Center of Gravity) of an
image is left of the image center, send the servo board a 255 for the left motor. Likewise, if the COG is
right of the image center, send a 255 position command to the servo controller board for the right servo to move the robot right.
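The conditional logic described above might look like the following Python sketch. Everything named here is hypothetical: the dead band width, the function name and the command table are illustrative only. Which motor value actually turns a robot in which direction depends on the board and on how the motors are wired, and the bytes would still need to be sent using whatever serial protocol your servo board defines.

```python
def steering_direction(cog_x, image_width, dead_band=10):
    """Decide which way to steer from the X position of the center of gravity.

    dead_band is an assumed tolerance, in pixels, around the image center
    inside which the robot is treated as already pointing at the target.
    """
    center = image_width / 2
    if cog_x < center - dead_band:
        return "left"
    if cog_x > center + dead_band:
        return "right"
    return "straight"


# Hypothetical (left motor, right motor) command values for each direction.
# The "straight" entry is an assumption; adjust all three to your hardware.
COMMANDS = {
    "left": (255, 0),
    "right": (0, 255),
    "straight": (255, 255),
}

direction = steering_direction(100, 320)
print(direction, COMMANDS[direction])  # left (255, 0)
```

The dead band is one simple way to stop the robot from twitching left and right when the COG sits almost exactly on the image center.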
Often, controlling several motors based on an image may require inverse kinematics to determine the
angular values for each of the servos required to position an arm or other complex robotic device
at its desired position. For example, if a robotic arm has two servos that function like a human
arm, positioning the hand at a certain X, Y point requires the two servos to be set to specific
angles. Calculating the angular values for these servos
depends on the desired X, Y coordinate and on the lengths of the arm parts involved. Refer
to the subject of Inverse Kinematics to find out more on how to calculate the appropriate
angular values for the servos.
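For the two servo arm described above, a minimal Python sketch of the standard two-link planar solution is shown below. It assumes the shoulder sits at the origin, both joints rotate in the same plane, and angles are measured in radians; the function name and argument names are hypothetical. Of the two possible elbow positions it returns only one, and a real arm would also need the angles mapped onto its servo ranges.

```python
import math


def two_link_ik(x, y, upper_len, lower_len):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y).

    shoulder is measured from the X axis; elbow is the bend of the lower
    segment relative to the upper segment. Raises ValueError if the point
    is out of reach.
    """
    dist = math.hypot(x, y)
    if dist > upper_len + lower_len or dist < abs(upper_len - lower_len):
        raise ValueError("target is out of reach")

    # Law of cosines gives the elbow bend.
    cos_elbow = (x * x + y * y - upper_len ** 2 - lower_len ** 2) / (
        2 * upper_len * lower_len
    )
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # guard against rounding error
    elbow = math.acos(cos_elbow)

    # Aim at the target, then subtract the offset caused by the bent elbow.
    shoulder = math.atan2(y, x) - math.atan2(
        lower_len * math.sin(elbow), upper_len + lower_len * math.cos(elbow)
    )
    return shoulder, elbow


shoulder, elbow = two_link_ik(1.0, 1.0, 1.0, 1.0)
print(math.degrees(shoulder), math.degrees(elbow))
```

With both arm parts of length 1 and a target of (1, 1), the upper part lies along the X axis and the elbow bends by about 90 degrees, which can be checked by hand.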