Real-Time Imaging 5, 49–62 (1999). Article No. rtim.1998.0130, available online at http://www.idealibrary.com
Real-Time Object Specific Recognition via Raster Scan Video Processing

Most real-time video recognition systems have historically relied upon region of interest processing based on low level properties such as pixel color to quickly reduce the computational load required for real-time performance. Because these low level properties are quite sensitive to lighting variations, they are limited in their application scope. This paper presents an approach which segments objects of interest via higher level relational properties, overcoming such limitations. These relations are autoassociatively and incrementally constructed in a hierarchical manner through a processing structure called raster scan video processing. Relational properties of interest are then obtained with local object specific processes. The compactness of this processing structure has allowed it to be implemented on one TMS320C40 digital signal processing chip at a processing rate of three frames per second. The approach is applied to the problem of road sign recognition in realistic outdoor scenes. Video frame recognition results for stop and pedestrian signs are presented along with details of the DSP implementation.

© 1999 Academic Press
L. W. Estevez and N. Kehtarnavaz
Dept of Electrical Engineering, Texas A & M University, College Station, TX 77843-3128, USA
Email:
[email protected],
[email protected]
The images for Figures 2, 7 and 17 in this paper are viewable in color at http://ee.tamu.edu/~dsplab/thesis.htm
1077-2014/99/010049 + 14 $30.00

Introduction
Real-time imaging systems have historically been composed of dedicated pre-processing hardware and a parallel processing secondary system. Recent advances in DSP technology are making high frequency single chip solutions realizable. Because video processing systems will still be frequency limited by external memory reads and writes, these single chip video recognition systems will require compact processing structures to take advantage of the higher processing frequencies. This paper presents an approach which is capable of recognizing stop and pedestrian signs in realistic outdoor scenes in real time. This methodology is based on a compact processing structure called raster scan video processing (RSVP) which enables the entire system to be implemented on a single DSP. RSVP requires little memory because it is hierarchically associative. This enables various parts of the image to be at different levels of processing simultaneously. Unlike most conventional approaches, which divide the recognition process into stages through which all image data is processed, RSVP multitasks low and high level processes as the image is scanned.

Hierarchical processing systems are commonly found in real-time applications, since they are characterized by processing on demand at each level. Object specific approaches represent the highest level of abstraction
for object recognition problems and are therefore the most easily formulated. It is no surprise, then, to find the majority of previous work to be the integration of an object specific scene description language (SDL) and some processing hierarchy. Yihong Gong et al. [1] have described an object specific SDL within a four-level hierarchy. The foundational level is based on the prominent colors in the scene; the next level defines regions of interest, the third level defines objects of interest, and the highest level defines the scene where the video properties to be detected are declared. Day et al. [2] have presented an object specific hierarchical approach to video modeling in which directed graphs are used to relate groups of objects into a scene of interest. Del Bimbo [3] has also presented an object specific approach for static and dynamic object recognition. Although this approach is at a lower level of abstraction, automated stages of image processing are still required for pre-processing.

Previous work in traffic sign recognition has focused on color segmentation followed by shape recognition. Kehtarnavaz et al. [4] have proposed an approach to stop sign recognition based on color segmentation followed by shape recognition using the Hough transform technique. Estevez and Kehtarnavaz [5] have presented a real-time approach to traffic sign recognition using color space relationships to segment red signs followed by a histographic mask for shape recognition. A group of researchers from Daimler-Benz have devised a real-time traffic sign recognition system based on a complex color segmentation scheme followed by color contextual shape recognition [6–8]. The methodology presented in this paper differs from these approaches in that shape recognition precedes color processing. This reduces the computational complexity involved in processing each image.
The proposed methodology segments objects of interest via higher level relational properties which are autoassociatively and incrementally constructed through RSVP.
Raster Scan Video Processing (RSVP)
Considering that RSVP represents a processing structure which requires a small amount of memory, it can be implemented entirely on many of today’s DSPs, providing greater throughput to many image processing applications. RSVP applications are constructed with auto associatively segmented object relations and subsequent locally associated object properties. There can be two processing modes: auto associative and object-oriented. Auto associative processing modes are discussed in [9]. Although this paper provides an overview of auto associative segmentation, it mainly addresses object specific processing modes. Object specific processing modes are application dependent and are triggered by object specific classes. The only requirements to be met by these processing modes are that they are locally identifiable by other processing modes and that they comply with the RSVP structural definition.
RSVP structure
The RSVP structure is based on a horizontal line of edge classes spanning the processed image. It is implemented on-chip as a circular buffer and is characterized by the following six properties:
1. Digitized video data is processed in a raster scan sequence with no more than two scan lines of simultaneous image interdependencies.
2. Equidistant color edge classes spanning a single horizontal scan line at subsampled intervals propagate downward as the image is scanned.
3. Each edge class is characterized and associated by type, color, and location.
4. Each edge class also retains both local and regional associative properties.
5. Object recognitions are based on regional associations of locally associated edge classes.
6. Processing modes are dependent upon previous line edge class types.
Figure 1 depicts the propagating edge class structure as it might appear on an image that is being scanned. Each edge class represented by a downward arrow possesses intensive and associative properties. Intensive properties include color, location, and type. Associative properties are application specific but should include some associative structural information such as relative location. As the image is scanned, relative edge class properties are autoassociatively segmented based on some previously defined relational definitions. Once a given relational definition has been auto associatively determined, the associated edge class types may be modified to those of specialized object specific classes.

Figure 1. Raster scan video processing structure.

Auto associative object segmentation
The purpose of auto associative segmentation is to segment potential objects of interest based on higher level properties which are less susceptible to noise. One of the challenges in doing this is to overcome desaturation problems while maintaining a high processing speed. Although many real-time recognition systems achieve high performance by first segmenting objects based on their color, these systems fail when there is little or no color present in the image; Figure 2 illustrates an example of such desaturation.
RSVP uses both intensive and associative information to segment objects with relational properties of interest. Although such an auto associative system would normally be quite computationally intensive, RSVP operates quite quickly through a dynamic hierarchical system of association. To ensure a high performance of auto associative segmentation, recognition problems are formulated to segment relatively unique relational properties auto associatively and then to investigate common properties locally. To reduce the amount of processing required in this auto associative scheme, RSVP hierarchically associates edge classes. Only edge classes enjoying multiple local associations are evaluated for regional associations. An example of how this reduces the effective
Figure 2. Desaturated color image.
Figure 3. Color edge class image.
Figure 5. Double association edge classes.
processing load may be seen in the following figures. Figure 3 depicts an image of color edge classes. Each of these color edge classes is defined by a horizontal sequence of similar and dissimilar colored pixels, with the average of the associated pixels defining the edge class color property and the location of the dissimilar pixel pairs defining the edge class location property. Figure 4 depicts the association of all proximal classes sharing similar color. Figure 5 depicts only the associated edge classes containing at least one double association. Figure 6 shows only the edge classes containing at least one triple association. Notice that edge classes which are multiply associated inherently belong to color classes containing more structural information than those containing edge classes without multiply associated edge classes. This observation is applied to reducing the number of color classes evaluated for vertically propagating structural information at the lowest levels of the recognition scheme.
Object specific processing modes
Object specific processing modes are triggered by previous line edge class types. When an interesting relation is determined, the scan head edge class type is modified to trigger object specific processing on subsequent scan
Figure 4. Single association edge classes.
Figure 6. Triple association edge classes.
Figure 7. Original image.
Figure 8. Edge class image.
lines. As the image is scanned, the previous line’s edge class types are evaluated. If an object specific class type is found, the relevant object specific processing is initiated. This processing mode may be executed instantaneously or continuously until another edge class terminates the process. Because a fixed amount of memory is allocated for the processing structure, edge class properties are mutually exclusive between class types; pointers are used to dynamically allocate memory for object specific properties.
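The triggering of processing modes by the previous line's class types can be sketched as a lookup over a fixed-size line buffer. This is only an illustration under stated assumptions: the class-type codes, the `EdgeClass` fields, the `extra` slot (standing in for a pointer to object specific properties) and the handler functions are hypothetical names, not the paper's assembly implementation. The buffer length follows the 640/4 + 2 count quoted later in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical class-type codes; the paper does not list numeric values.
EMPTY, EDGE, LEFT_OCT, RIGHT_OCT = range(4)


@dataclass
class EdgeClass:
    kind: int = EMPTY                        # class type read on the next scan line
    color: Tuple[int, int, int] = (0, 0, 0)  # RGB color of the class
    x: int = 0                               # horizontal location in pixels
    extra: Optional[dict] = None             # stands in for a pointer to object specific properties


def make_line_buffer(width: int = 640, step: int = 4) -> List[EdgeClass]:
    """One slot per subsampled column plus two spare slots (162 for 640/4)."""
    return [EdgeClass() for _ in range(width // step + 2)]


Handler = Callable[[int, EdgeClass], str]


def scan_line(buffer: List[EdgeClass],
              handlers: Dict[int, Handler],
              default: Handler) -> List[str]:
    """Select the processing mode of each slot from the previous line's class type."""
    return [handlers.get(edge.kind, default)(slot, edge)
            for slot, edge in enumerate(buffer)]
```

In this sketch, a slot whose type was changed to `RIGHT_OCT` on one line is routed to the object specific handler on the next, while all other slots stay in the auto associative (default) mode.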
General Auto Associative Relation (GAAR)
RSVP auto associative relations are incrementally constructed via three levels of association. Edge classes are associated on a local, regional, and interframe level. Local associations are based on vertical propagations between the scan head class and previous line edge classes, as depicted in Figure 1. Regional associations are based on horizontal and vertical associations of compatible type edge classes. Interframe associations are based on catalogued edge classes from previous frames. The GAAR incrementally constructs object relations in the following five levels:
Level 1: Auto associative Edge Detection
Level 2: Local Vertical Edge Class Association
Level 3: Regional Horizontal Edge Class Association
Level 4: Regional Vertical Edge Class Association
Level 5: Interframe Edge Class Association

Figure 9. Double associated edge class image.
The operation of the GAAR through the first four of these levels may be seen by processing the image in Figure 7. The pedestrian sign in this figure is segmented through a series of local and regional associations. The GAAR first performs vertical edge detection on the image by looking for horizontal sequences of three subsampled pixels with similar and dissimilar colored pairs. Each detected edge forms an edge class of a specific subsampled location and color. Edge classes also possess a polarity property to distinguish left and right bounding edges. Figure 8 illustrates the results of this operation with right polarized edges at greater intensities. These edge classes are then vertically associated with proximal classes of similar color on the previous line. If an association is made, the difference in location between the two edges is stored as a structural property of the scan head class along with the number of vertical
Figure 10. Regionally associated edge class image.
associations made for this class. If an association is not made, the edge class may still be vertically propagated for a number of lines until it is overwritten or extinguished. Extinction is based on each class’s association history. Edge classes enjoying multiple vertical associations are evaluated for relevant structural information which might make them candidates for regional association. Figure 9 depicts these edge classes. Edge classes containing positive and negative sloping diagonal structural properties on the same scan line are then regionally associated based on color. Figure 10 depicts the edge classes containing this structural information. The higher intensity pixels in this image represent the regionally associated right positive sloping and left negative sloping edge classes. Notice that these edge classes belong to the top of the sign. Once the top of the sign has been regionally associated, its properties and location are catalogued in a dedicated memory location. When the bottom part of the sign is horizontally associated, the properties of catalogued objects are evaluated to determine whether or not a vertical association can be made.
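The first two GAAR levels described above (edge detection on three subsampled pixels, then vertical association with the previous line) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the similarity measure, the threshold of 30, the `max_shift` proximity limit, the mapping of polarity to "left"/"right", and the choice of edge location are all assumptions made for illustration.

```python
def sad(c1, c2):
    """Sum of absolute differences between two RGB triples."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def detect_edges(row, step=4, threshold=30):
    """Level 1: find three-sample windows with one similar and one dissimilar pair."""
    samples = row[::step]
    edges = []
    for i in range(len(samples) - 2):
        a, b, c = samples[i:i + 3]
        ab_similar = sad(a, b) < threshold
        bc_similar = sad(b, c) < threshold
        if ab_similar == bc_similar:
            continue  # both pairs similar or both dissimilar: no edge here
        pair = (a, b) if ab_similar else (b, c)
        color = tuple((p + q) // 2 for p, q in zip(*pair))  # average of the similar pair
        polarity = "right" if ab_similar else "left"        # assumed polarity convention
        edges.append({"x": (i + 1) * step, "color": color,
                      "polarity": polarity, "dx": None, "count": 0})
    return edges


def associate_vertically(edge, previous_line, max_shift=4, threshold=30):
    """Level 2: link an edge to a proximal, similarly colored edge on the previous line."""
    for prev in previous_line:
        if (abs(edge["x"] - prev["x"]) <= max_shift
                and sad(edge["color"], prev["color"]) < threshold):
            edge["dx"] = edge["x"] - prev["x"]   # structural property: location difference
            edge["count"] = prev["count"] + 1    # number of vertical associations so far
            return True
    return False
```

A run of equal `dx` signs over several lines would then correspond to the positive or negative sloping diagonal structure that the text evaluates for regional association.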
Figure 11. Interframe association.
Interframe association
Interframe associations enable the RSVP structure to recognize objects when only parts of them are visible at a time. This methodology is powerful in recognizing objects which are subject to partial occlusions. Figure 11 depicts an example of interframe association. The lower half of the stop sign is occluded in the first frame and its top half in the second frame. Because RSVP
incrementally associates objects between frames, the object may be recognized in the second frame by associating the top half of the sign from the previous frame. To keep track of how long ago part of an object was identified, frame counters are stored with each catalogued object. Because many applications such as road sign recognition have expected trajectories of objects through the field of view (i.e. left to right), the frame counter can also be used to limit the location of interframe associations. After an object’s frame counter has reached a given permanence limit, the object is deleted from the catalogue.
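The catalogue with frame counters and a permanence limit might look like the sketch below. The class name, the dictionary fields, the permanence limit of 3 frames and the per-frame search window are hypothetical; the paper does not give these values, and the symmetric window used here is a simplification of the expected left-to-right trajectory.

```python
class Catalogue:
    """Catalogued object parts, each carrying a frame counter."""

    def __init__(self, permanence_limit=3):
        self.permanence_limit = permanence_limit
        self.entries = []

    def add(self, part, x):
        """Catalogue a segmented part (e.g. the top of a sign) at column x."""
        self.entries.append({"part": part, "x": x, "frames": 0})

    def next_frame(self):
        """Advance all frame counters and delete entries that reached the limit."""
        for entry in self.entries:
            entry["frames"] += 1
        self.entries = [e for e in self.entries
                        if e["frames"] < self.permanence_limit]

    def find(self, part, x, max_shift_per_frame=20):
        """Return a catalogued part close enough to x, given its age in frames."""
        for entry in self.entries:
            window = max_shift_per_frame * entry["frames"]
            if entry["part"] == part and abs(x - entry["x"]) <= window:
                return entry
        return None
```

Scaling the search window by the frame counter is one way of using the counter to limit where an interframe association may occur; a directional window would model the expected trajectory more closely.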
Object Contour Tracker (OCT)
Object contour trackers are a type of object specific class which enables the contour of an object to be tracked in time. Because the required operation of OCTs for left and right boundary tracking is different, left and right OCTs are assigned different class types. As the image is scanned in the auto associative mode, the previous line’s edge class types are evaluated. If an OCT class type is encountered, auto associative processing is interrupted with object specific processing. Figure 12 illustrates how OCT processing is embedded within the RSVP structure. Since OCT processing is locally associative, the process may be completed immediately without affecting the normal scan sequence of auto associative segmentation. For right bounding OCTs, the associative order is from right to left. For left bounding OCTs, the associative order is from left to right. OCT association also differs from the GAAR in that pixels are associated
with OCTs while only previously detected edges are associated in the GAAR. The object contour is tracked through the associative sequence on corresponding left and right boundaries. Assume, for example, that the object contains a negative sloping right boundary. The OCT will propagate downward vertically until the subsampled pixel to the right is associated with the previous line’s OCT. The OCT will then be propagated to the corresponding edge class location. If this right boundary turns into a positive sloping contour, the OCT will fail to associate to the right and directly below on subsequent scan lines, propagating the OCT to the left. The OCT would therefore "hug" the right boundary, and the record of its movements is propagated as its properties. The path of the OCT would be similar for the left boundary, except that, because the associative sequence is reversed, left boundary OCTs would hug the left boundary.
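The associative order described above can be sketched as a first-match search over three candidate pixels on the next scan line. This is an illustrative sketch only: the function name, the subsampling step of 4, and the single threshold `ast=30` (the detection AST the paper later reports for road signs) are assumptions, and a single threshold is used here for all three candidates for simplicity.

```python
def sad(c1, c2):
    """Sum of absolute differences between two RGB triples."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def propagate_oct(oct_x, oct_color, next_row, side, step=4, ast=30):
    """Return the column the OCT moves to on the next scan line, or None.

    A right bounding OCT tries candidates from right to left; a left
    bounding OCT tries them from left to right. The first candidate whose
    color is close enough to the OCT's color is taken.
    """
    offsets = (step, 0, -step) if side == "right" else (-step, 0, step)
    for dx in offsets:
        x = oct_x + dx
        if 0 <= x < len(next_row) and sad(oct_color, next_row[x]) < ast:
            return x
    return None
```

Because the outermost candidate is tried first, a right OCT follows the boundary outward on a negative slope, stays in place on a vertical boundary, and falls back to the left when the boundary recedes, which is the "hugging" behaviour described in the text.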
OCT extinction and containment
To ensure that the object contour is traced within a specified vertical range, the class type of an OCT is incremented upon each propagation and compared to an extinction threshold to ensure the OCT is confined to the associated object. This extinction threshold is based on the expected maximum size of the objects of interest. If an OCT is not associated over a defined propagation period, it may also become extinct. This associative extinction may be implemented with an extinction counter as one of the class properties. Because OCTs can interfere with the GAAR process, it may be necessary to implement some mechanism for terminating OCTs which are associating beyond their objects of origin. This may be implemented with an outward propagation counter. If the OCT propagates outward a number of consecutive times, it may be safely assumed that it is associating beyond its object of origin and should therefore be terminated.
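The three termination mechanisms above (vertical range, associative extinction, and outward propagation) can be combined in one update step, sketched below. The names and the limits `max_age=60` and `max_outward=3` are hypothetical; `max_missed=5` follows the five-scan-line extinction reported in the results section. A separate `age` counter is used here instead of incrementing the class type itself.

```python
from dataclasses import dataclass


@dataclass
class OctState:
    age: int = 0          # scan lines propagated since the OCT was created
    missed: int = 0       # consecutive scan lines without an association
    outward_run: int = 0  # consecutive outward propagations


def update_oct(state, move, max_age=60, max_missed=5, max_outward=3):
    """Update the counters for one scan line; return True if the OCT survives.

    `move` is "outward", "center", "inward", or None when no candidate
    was associated on this line.
    """
    state.age += 1
    state.missed = state.missed + 1 if move is None else 0
    state.outward_run = state.outward_run + 1 if move == "outward" else 0
    return (state.age < max_age
            and state.missed < max_missed
            and state.outward_run < max_outward)
```

Resetting `outward_run` on any non-outward move is a design choice of this sketch: it terminates only trackers that drift outward continuously, as the text suggests.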
OCT properties
Figure 12. Object contour tracking propagation.
OCT class properties occupy the same memory locations allocated for auto associative edge classes. Three counters corresponding to left, center, and right propagation are used to record the propagation history of the
OCT. The RGB color of the object and the class type are also stored as class properties. As the OCT is vertically associated, its relative position counters are incremented. If the OCT is associated vertically, for example, the center counter is incremented. If it is associated to the left, its left counter is incremented. Recognition of an object’s contour is based on these three counters.
OCT calibration
Local association is based on the sum of absolute color differences (SAD) between the propagating OCT and its candidates. A candidate is elected for propagation when its SAD is less than a specific associative sensitivity threshold (AST). This AST is determined by a calibration procedure. This procedure comprises the following steps:
1. Collect a representative set of digitized images containing the objects to be identified under various lighting conditions.
2. Divide the images into groups based on the objects they contain.
3. Begin with a high detection AST at which none of the objects in the image set are detected and set all of the other associative parameters appropriately (the recognition parameters should already be set).
4. Perform recognition on the image, decreasing the detection AST and all of the other parameters until recognition of the object is made.
5. Continue decreasing the detection AST and associative parameters until recognition of the object is no longer made.
6. If recognition is never made, the image should not be used for calibration.
7. Repeat this same process for each of the images in each object set.
8. Use the midpoint of each of these recognition ranges to determine the detection AST recognition range.
9. Determine the mean of the midpoints as the best detection AST.
Figure 13 illustrates this procedure. Once the best detection AST (DAST) has been determined, OCT ASTs are determined. The center and inside candidate ASTs should be more general than the detection AST while the outside candidate AST should be less general. This is necessary in order to contain the OCT. The inside and center propagation positions should then be associated with an AST of twice the calibrated
Figure 13. AST calibration.
detection AST, and the outside propagation positions should be associated with an AST of half the calibrated detection AST. Although experiments with adaptive trackers looked promising, the computational costs of these techniques were prohibitive.
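The calibration steps and the derived OCT thresholds can be sketched as below. This is a sketch under stated assumptions: each image is represented by a hypothetical callable `recognizes(ast)` that reports whether recognition succeeds at a given detection AST, and only the detection AST is swept (the paper also adjusts the other associative parameters).

```python
from statistics import mean


def recognition_range(recognizes, start, stop=0, step=1):
    """Sweep the AST downward; return (upper, lower) of the recognition range, or None."""
    upper = lower = None
    for ast in range(start, stop - 1, -step):
        if recognizes(ast):
            if upper is None:
                upper = ast   # first AST at which recognition is made
            lower = ast       # last AST at which recognition is still made
        elif upper is not None:
            break             # recognition is no longer made
    return None if upper is None else (upper, lower)


def calibrate(recognizers, start):
    """Mean of the per-image range midpoints, plus the derived OCT ASTs."""
    ranges = [recognition_range(r, start) for r in recognizers]
    valid = [r for r in ranges if r is not None]   # images never recognized are dropped
    dast = mean((hi + lo) / 2 for hi, lo in valid)
    return {"detection": dast,
            "center": 2 * dast,    # twice the calibrated detection AST
            "inside": 2 * dast,
            "outside": dast / 2}   # half the calibrated detection AST
```

For example, two images recognized over the ranges 20 to 40 and 25 to 45 have midpoints 30 and 35, giving a detection AST of 32.5 in this sketch. `mean` raises `StatisticsError` if no image is ever recognized, which corresponds to an unusable calibration set.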
OCT example
Consider the pedestrian sign in Figure 7. Once the GAAR has auto associatively segmented the left and right top boundaries of the sign, the type of the edge class on the right boundary of the sign is modified to identify it as a right OCT. The contour of the right boundary of the sign is traced on subsequent scan lines. The propagation of this OCT can be seen in Figure 14.
Figure 14. Right OCT segmented contour.
Figure 15. GAAR-OCT formulation.
Figure 16. Pedestrian sign edge class image.

OCT road sign recognition results
Both the GAAR and OCTs were applied to the problem of road sign recognition. Relations for stop and pedestrian signs were defined considering that these signs are most important to personal safety. Based on the calibration methodology above, a detection AST of 30 was obtained for this application. The formulation for these two signs is illustrated in Figure 15. From this figure, it can be seen that the top part of the sign was first auto associatively segmented with the GAAR. The right bounding edge class forming this relation was then converted to a right OCT. Inside associations for the right OCT were not made to eliminate potential interference in the auto associative segmentation of the bottom of these signs. The vertical components of stop signs were identified after five consecutive vertical associations. OCTs not associated for five scan lines were extinguished. Several scenes containing these signs under various
weather conditions were acquired with a video camera mounted in the center dash of a moving truck. Figure 7 and Figures 16–18 depict sample pedestrian and stop sign images with their corresponding edge class images.

Failed recognition analysis
To provide a quantitative measure of performance, 76 image frames of the worst case video were processed.
Figure 17. Stop sign original image.
The vehicle in this study was moving between 30 and 45 miles per hour. The failed recognition results of this study are presented in Table 1. The failed recognition images were analysed to determine the cause of failure in each case. These were found to lie in three categories: edge detection, regional association, and OCT containment. The two stop signs missed were due to failed edge detection of signs with excessive motion blurring. Although the GAAR addresses motion discontinuities through its incremental construction of small segments of the object, smaller signs are more susceptible to motion blurring since there is less structural information available. The direction of the motion blurring is also of key importance. The algorithm copes best with horizontal motion because the image data is horizontally processed. Strong vertical vibrations such as those which appear while driving over intersection dips most greatly deform the shapes of processed objects. The four pedestrian signs missed were due to
regional association failures. Although the top and bottom halves of the sign were properly segmented, the pedestrian symbol in the sign also contained structural information which interfered with this association. This problem can be overcome by allocating additional memory for objects to be associated regionally. If multiple memory slots were allocated per regional zone, interfering edges of similar geometry but varying colors could be recorded without replacing each other. Note that only one object of each defined geometry may be defined per regional zone in the proposed system. Dynamic memory allocation would be required to enable a greater number of geometric entries per region. The six stop signs which were recognized as pedestrian signs were the result of OCT containment problems. Because the OCT must be general enough to track the entire contour of the sign, it may overgeneralize to neighboring pixels of similar colors in the scene. Once this occurs, the OCT is no longer tracking the
Figure 18. Stop sign edge class image.
Table 1. Failed recognition analysis

             Edge detection   Object association   OCT
Pedestrian   0                4                    0
Stop         2                0                    6
original object. Although this problem may be improved by intersample investigation, it is not anticipated that OCTs would be capable of performing as well as the GAAR under considerable noise, since OCT association is locally continuous.
Real-Time Considerations
The proposed system was implemented on Spectrum’s MDC40IC, which hosts a TMS320C40 operating with a clock frequency of 40 MHz. The GAAR and OCTs were implemented in assembly language and the algorithm and processing structure were both stored
on-chip. The algorithm processes NTSC video at approximately three frames per second with this configuration. Figure 19 depicts the implementation of the system as it pertains to an image. The most recent color edge information is maintained in the register file. The RSVP processing structure is kept in on-chip cache and an image frame of VRAM is used externally. Two primary features of RSVP enable applications to achieve high processing speeds. These are caching and filtering. Because the processing structure requires such a small amount of memory to operate, it may be stored entirely on-chip. Because RSVP relations are hierarchical, higher levels of processing may be filtered when irrelevant. Because information is conditionally loaded upon edge detection, irrelevant regularly sampled information is also filtered.
Figure 19. DSP implementation.

Caching
An efficient caching system greatly accelerates a processor’s speed, since external memory reads and writes are limited by external bus frequencies. Because the RSVP structure operates without redundant loads, the structure provides an optimal means of caching. Let us consider the amount of memory required to implement the GAAR and OCTs for this application. Each edge class required 7 bytes of memory allocation for the GAAR. A subsampling interval of four pixels across a 640 pixel wide image therefore demanded (640/4 + 2) × 7 = 162 × 7 = 1134 bytes or, assigning a 32-bit word to each piece of information, 162 × 5 = 810 words for an optimal cache (optimal meaning no redundant loads). Considering that most DSPs have at least 2K words of on-chip memory, this left at least 1190 words for code. Since OCTs replaced GAAR edge classes without requiring any additional memory, OCTs were virtually transparent to the processing structure. In addition to data caching, RSVP provides an optimal means for the placement of instruction memory on the system. Because RSVP is a hierarchical system, higher-level processes may be implemented off-chip, since they are guaranteed to be accessed less often.

Filtering
Filtering also plays a major role in the real-time performance of RSVP. Since the system moves up and down the hierarchy for each edge class as the image is scanned, irrelevant higher level processing is foregone for some edge classes. This is realized through the repeated associative segmentation of edge classes at each level of the hierarchy. Data filtering is also used to reduce the amount of external memory reads and writes. Subsampling enables the system to extract high resolution edge information in specific areas of interest. Resolution is not sacrificed, since subsampled edge detections are locally investigated.

Asynchronous processing
Because the RSVP structure is dependent only on local information and data is processed in the same way it is acquired, video data is asynchronously processed. In other words, the video buffer is continuously overwritten at 30 frames per second even though the data is processed at only 3 frames per second. The main advantage of processing video in this way is that multiple RSVP structures may be used to process the same video data at independent rates.
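The on-chip memory budget quoted in the Caching discussion can be checked with a few lines of arithmetic. The function and parameter names below are illustrative; the values are the paper's own, with "2K words" taken as 2000 because that is what reproduces the stated 1190 words left for code (with 2048 words the remainder would be 1238).

```python
def rsvp_cache_budget(width=640, step=4, bytes_per_class=7,
                      words_per_class=5, onchip_words=2000):
    """Memory needed for one line of edge classes and what remains for code."""
    classes = width // step + 2            # 160 subsampled columns plus 2
    words = classes * words_per_class      # one 32-bit word per piece of information
    return {"classes": classes,
            "bytes": classes * bytes_per_class,
            "words": words,
            "words_left": onchip_words - words}


budget = rsvp_cache_budget()
print(budget)  # {'classes': 162, 'bytes': 1134, 'words': 810, 'words_left': 1190}
```

Changing `step` shows how the structure scales: halving the subsampling interval roughly doubles the buffer while leaving it a small fraction of a single image line of raw pixels.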
Conclusions
The RSVP structure provides a high speed implementation which may require little or no external memory, making it an attractive methodology for commercial real-time recognition applications such as on-board warning systems. The recognition results indicate good performance in the presence of a wide range of lighting conditions, motion blurring, and partial occlusion. Because OCTs are auto associatively initiated and disassociatively extinguished, they may be included within the RSVP structure without interfering with auto associative segmentation. It is this transparent property that makes OCTs suitable object-specific RSVP classes. Future research might be directed towards the use of pointers to a dynamically allocated heap of object properties. These dynamic properties might then be used in determining appropriate processing structures for local regions as the image is scanned. Although such a processing structure might require increased processing power, the amount of memory required for a complex recognition system would remain quite small.
Acknowledgements
This work was funded by Texas Instruments as part of the DSP program at Texas A & M University.
References
1. Gong, Y., Chuan, C., Yongwei, Z. & Sakauchi, M. (1996) Generic video parsing system with a scene description language (SDL). Real-Time Imaging, 2: 45–59.
2. Day, Y. F., Dagtas, S., Khokar, A. & Ghafoor, A. (1995) Object-oriented conceptual modeling of video data. 1995 11th International Conference on Data Engineering, pp. 401–408.
3. Del Bimbo, A. (1991) An object-oriented framework for static and dynamic object recognition. COMPUEURO ’91 Advanced Computer Technology, and Reliable Systems and Applications, pp. 58–62.
4. Kehtarnavaz, N., Griswold, N. C. & Kang, D. S. (1993) Stop-sign recognition based on color/shape processing. Machine Vision and Applications, 6: 206–208.
5. Estevez, L. & Kehtarnavaz, N. (1996) A real-time histographic approach to road sign recognition. Southwest Symposium on Image Analysis and Interpretation, pp. 95–100.
6. Ritter, W. (1992) Traffic sign recognition in color image sequences. Intelligent Vehicles Symposium, pp. 12–17.
7. Estable, S., Schick, J., Stein, F., Janssen, R., Ott, R. et al. (1994) A real-time traffic sign recognition system. Intelligent Vehicles Symposium, pp. 213–218.
8. Priese, L., Lakmann, R. & Rehrmann, V. (1995) Ideogram identification in a realtime traffic sign recognition system. Intelligent Vehicles Symposium, pp. 310–314.
9. Estevez, L. (1997) Auto associative object recognition via raster scan video processing. Dissertation, Texas A & M University.