Comput. & Graphics Vol. 16, No. 4, pp. 355-362, 1992 Printed in Great Britain.
0097-8493/92 $5.00 + .00 © 1992 Pergamon Press Ltd.
Computers and Graphics in Spain
AN AUTOMATIC ROTOSCOPY SYSTEM FOR HUMAN MOTION BASED ON A BIOMECHANIC GRAPHICAL MODEL

YUHUA LUO and FRANCISCO J. PERALES LOPEZ
Department of Mathematics and Computer Science, University of Balearic Islands, 07071 Palma de Mallorca, Spain

and

JUAN J. VILLANUEVA PIPAON
Department of Computer Science, Autonomic University of Barcelona, 08193 Bellaterra, Barcelona, Spain

Abstract--A system for the analysis and synthesis of human motion is presented. The whole system consists of an analysis part and a synthesis part. The analysis part includes the preprocessing phase, modeling phase, matching phase, and interpretation phase. A biomechanic graphical model is defined to represent the human body in 3D space; it is matched with the real images to discover the motion. The synthesis part uses the result of the analysis part to reconstruct the same motion of the human body, which can then be viewed from any viewpoint and from any distance.
1. INTRODUCTION
There are many fields where human motion analysis and synthesis are of major interest, such as sports analysis, dancer training, medical rehabilitation, etc. Different kinds of systems exist for this purpose. The most typical systems currently in use manually select points from the motion image sequence [1]. The 3D positions, velocities, accelerations, and trajectories of the manually selected points are then computed [2]. A more sophisticated kind of system attaches markers to the human body and uses template matching to detect the joints automatically. The markers may be of different kinds; normally, semi-spherical markers or infrared sensors [3, 4] are used. In the manual system, the precision depends heavily on the experience of the user who picks the 2D points. Picking points from hundreds or thousands of frames is also an extremely tedious job. In the marker system, the inconvenience the markers cause for the person being tested is obvious. Attaching markers also makes it impossible to extend the system to many other applications, such as real sport competitions. The systems mentioned above are kinematic analysis systems. Other systems include dynamics-based systems, which deal with torques and forces acting on masses [5], and mixed systems combining kinematic and dynamic approaches [6, 7]. Our system is based on kinematic analysis and is noninvasive. The person can move freely without interference from the system; markers are not necessary. It can be considered a rotoscopy system since it traces the actual motion of the human body and then synthesizes it. Detecting the motion of arbitrarily many objects and of arbitrary kinds of motion in image sequences is extremely complicated [8, 9]. In order to
simplify the problem for our system while keeping it suitable for a broad range of applications, the number of objects is reduced to one: a single person. The motion of the human is restricted to a certain kind: we consider only 3D articulated rigid motion with rotation. In addition, the motion should always start from a predefined starting position. The whole system consists of the analysis part and the synthesis part (Fig. 1). The input of the system is a sequence of two-view, simultaneously taken perspective gray level images. The output of the analysis part gives the 3D positions of the body parts at each moment of time, from which the trajectories, speeds, and accelerations of each part are computed. The 3D positions are then visualized by a simple reconstruction of the human body in the synthesis part. The final output is the synthesized motion of the human body, which should be the same motion as in the input image sequence. This can be viewed from any viewpoint, with all the motion parameters computed for further use. The analysis part includes the preprocessing phase, modeling phase, matching phase, and interpretation phase (Fig. 2). The preprocessing phase obtains the binary images from the gray level images in the sequences. Static segmentation and temporal segmentation are used for this purpose. The modeling phase defines a biomechanic graphical model for the person being tested. All the information measured from the person is stored in the model. Two levels of detail are defined for the model. Level one is a stick figure tree with the nodes representing the body segments and the arcs representing the body joints. Level two of the model is a volumetric model in which the body segments are represented as cylinders. The matching phase is to match the projections of
Fig. 1. The system diagram.
the biomechanic model with the real two-view image sequences. Our main strategy is to match the tree model as an entity with the detected binary region of the body. The interpretation phase uses the information of the preceding phases and computes the motion parameters for the different parts of the human body. All the information obtained is stored in the database for future use, such as visualization of the motion, further analysis of the performance, etc. The synthesis part includes data input and display functions that animate and visualize the motion. Section 2 introduces the analysis part of the system in detail. Section 3 briefly describes the synthesis part. The experimental results from a synthetic jump sequence and a real sequence of a walking person are shown in Section 4.
2. THE ANALYSIS PART

2.1. The preprocessing phase

As shown in Fig. 2, there are two image inputs for the analysis part. One is the model image and the other is the image sequence. Both inputs are taken from two different viewpoints, from two different angles, under well-controlled lighting conditions, with a static and uniform background. The model images are actually the starting position of the motion. Both inputs need to be preprocessed before entering the matching process. The main goal of this phase is to segment the object in movement from the image. The result is expected to be only two areas, one corresponding to the object--the human body--and the other to the background. Traditional static segmentation and motion segmentation techniques are applied in this phase. The static segmentation chooses a threshold to obtain the binary image corresponding to the two areas. The motion segmentation offers the possibility to group related regions into one object, which should be the moving parts. In the ideal case, the moving areas between two consecutive frames can be obtained by a difference operation on them [10]. Due to nonuniform illumination conditions and noise, a simple difference operation cannot satisfy this goal. However, our system considers only one object. This means that small non-connected areas can be considered noise and eliminated. Also, intensity changes with small variations can be considered constant. This gives us a satisfactory object-background segmentation result. The result of this phase is a sequence of segmented, binary images.

2.2. The modeling phase

The modeling phase is in charge of representing the 3D human body information in a structure that best suits the needs of the matching process. There are two kinds of information the model should contain: one is the general structural information of a human body that is common to all the persons being tested; the other is the specific information obtained from actual measurements of the human body being tested, which differs from person to person. We have chosen a biomechanic model to organize these two types of information. The biomechanic model of the human body is defined as a set of rigid segments of fixed length and mass distribution, connected by hinges that allow movements in appropriate directions [11]. The first kind of information is represented as a relational tree structure for the biomechanic model. Figure 3 shows the tree structure in detail. The nodes of the tree represent the body parts, while the arcs of the
[Diagram: person in movement → image sequences (thresholding) → binary images; model binary image → modeling process → tree model; binary images and tree model → matching process → stick figure sequence → interpretation process → kinematic information → database]
Fig. 2. The analysis part.
ARCS = JOINTS, NODES = SEGMENTS. LT = Lower Torso, UT = Upper Torso, LUL = Left Upper Leg, RUL = Right Upper Leg, LUA = Left Upper Arm, RUA = Right Upper Arm, LLA = Left Lower Arm, RLA = Right Lower Arm, LH = Left Hand, RH = Right Hand, FLH = Fingers of Left Hand, FRH = Fingers of Right Hand, LLL = Left Lower Leg, LF = Left Foot, TLF = Toes of Left Foot, RLL = Right Lower Leg, RF = Right Foot, TRF = Toes of Right Foot.

Fig. 3. Tree representation of the biomechanic model.

tree represent the joints of the body parts. This tree structure is common to all the persons being tested. The root of the tree has been chosen to represent the
lower torso of the body, considering the stability and the hierarchical structure of the body. The structure of the tree serves as a matching control mechanism. The number of nodes defines how finely we model the human body [12, 13] and the complexity of the system. We have chosen 20 nodes for the system: the human body is roughly modelled as 20 body parts and 19 body joints. The joints are modelled as mechanical spherical joints with three degrees of freedom (the three rotation angles). Although some special joints (the knees, elbows, etc.) have fewer than three degrees of freedom, for homogeneity the same structure has been chosen for all joints; only the values differ. In every node, two levels of information are stored. Level one information includes the name of the node; the 3D positions of the starting point and ending point of the segment in local and global coordinate systems; the transformation matrix to the parent node; and the physical angular and distance restrictions for this body part. Level two information includes the radius of each body part as a cylinder. Some conditions needed in the matching process are also stored in the node. The following shows the structure of a node.
typedef float point[4];
typedef float point3D[4];   /* point in global coordinates */
typedef int point2D[2];
typedef char name[9];
typedef float matrix[4][4];
typedef struct node body;

struct node {
    name Name;           /* name of node */
    body *parent;        /* pointer to the parent */
    short no_of_sons;    /* number of sons */
    body *child1;        /* pointers to sons */
    body *child2;
    body *child3;
    matrix trans;        /* transformation matrix 4 x 4 */
    point start;         /* 3D starting point position in local coordinates */
    point end;           /* 3D ending point position in local coordinates */
    point2D v1;          /* 2D position of the point in view1 */
    point2D v2;          /* 2D position of the point in view2 */
    point3D gstart;      /* 3D position of the starting point of the joint in global coordinates */
    point3D gend;        /* 3D position of the ending point of the joint in global coordinates */
    short in1;           /* inclusion condition in view1 */
    short in2;           /* inclusion condition in view2 */
    short pro1;          /* change indication flag in view1 */
    short pro2;          /* change indication flag in view2 */
    /* angular conditions: minimum and maximum values */
    int anminx; int anmaxx;
    float angx;          /* actual angle value about X axis */
    int anminy; int anmaxy;
    float angy;          /* actual angle value about Y axis */
    int anminz; int anmaxz;
    float angz;          /* actual angle value about Z axis */
};

The tree model is a hierarchical mechanical articulated chain. The 3D position of any node in the global
reference coordinate system is computed by the cascade product of matrix transformations from the root node
down to this node. This process is called the direct kinematics problem [14]. The matching process will be controlled by this hierarchical order. Figure 4 shows level one and level two of the model.

Fig. 4. Level one and level two of the model--the stick figure tree and volumetric tree.

2.3. The matching phase

The main task of this phase is to find all the 3D positions of the body parts at each moment of time at which the image frames are taken in the sequence. From these, the motion of the person is then discovered. The idea of matching is to match the tree model of the human body onto each of the frames in the sequence by changing the perspective, positions, and orientations of the body parts of the model accordingly. When a match is reached on the pair of two-view images at the same moment of time, the 3D position of the model is assumed to represent the position of the human body in the image at that moment. The result of matching the model with each pair of frames is a sequence of model positions that uniquely represents the motion that the person performs. The result is viewpoint independent and can then be displayed or used in any manner. The two-view images are taken by two synchronized static cameras with a fixed angle and distance. The positions of the cameras and their relation are known to the system. There are two choices for the matching order. The first is to match all the nodes of the model in one view first, then confirm the matching of all the nodes in the other view. The matching of the second view is performed without losing the matching result of the first view. To do this, the system computes the projection of the model using the perspective parameters in each view. The second choice is to match each node in both views at the same time, then match the next node in both views, and so on, until all the nodes in the tree are exhausted. The first order has been used in our system. The matching process begins by matching the model with the starting position of the motion.
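The cascade product of transformations from the root down to a node can be sketched as follows. The simplified node structure and helper names here are illustrative assumptions for this sketch, not the system's actual code:

```c
#include <string.h>

typedef float matrix[4][4];
typedef float point[4];

/* Simplified node: each body segment stores its transformation
   relative to its parent (illustrative, not the full node struct). */
struct simple_node {
    matrix trans;                /* local transform w.r.t. parent */
    struct simple_node *parent;  /* 0 for the root (lower torso)  */
};

/* r = a * b for 4x4 homogeneous matrices */
static void mat_mul(matrix r, const matrix a, const matrix b)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            r[i][j] = 0.0f;
            for (int k = 0; k < 4; k++)
                r[i][j] += a[i][k] * b[k][j];
        }
}

/* p_out = m * p_in for a homogeneous point */
static void mat_apply(point p_out, const matrix m, const point p_in)
{
    for (int i = 0; i < 4; i++) {
        p_out[i] = 0.0f;
        for (int k = 0; k < 4; k++)
            p_out[i] += m[i][k] * p_in[k];
    }
}

/* Direct kinematics: accumulate the transforms from the root down to
   this node, then apply the product to a point given in the node's
   local coordinates to obtain its global position. */
static void global_position(point out, const struct simple_node *n,
                            const point local)
{
    matrix acc, tmp;
    memcpy(acc, n->trans, sizeof(matrix));
    /* pre-multiply by each ancestor's transform up to the root */
    for (const struct simple_node *p = n->parent; p; p = p->parent) {
        mat_mul(tmp, p->trans, acc);
        memcpy(acc, tmp, sizeof(matrix));
    }
    mat_apply(out, acc, local);
}
```

For example, a root translated by (1, 0, 0) with a child translated by (0, 2, 0) places the child's local origin at the global point (1, 2, 0).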
The root, lower torso, of the tree is to be matched first.
Then the model tree is matched with the two-view images in a top-down manner. The model nodes are superimposed onto the binary images by applying the perspective transformation until they fit the body part. A set of matching criteria is applied to the matching process. During the adjustment for matching, the system changes the angle for every node hierarchically with a small variation. Although the joints have three degrees of freedom globally, during the change only rotations about the local X and Z axes are done, since rotating about the third axis, that of the segment itself, is meaningless. A set of conditions is defined as matching criteria that must be satisfied by every correct match. The conditions are:
1. Inclusion condition C1: For all the segments of the 3D stick model projected onto the 2D images, it is necessary that in both views all the segments fall within the human body area obtained in the preprocessing phase. If (xi, yi, zi) is the 3D position of a starting or ending point of a segment in 3D space, and (xpi1, ypi1), (xpi2, ypi2) are the projected points of (xi, yi, zi) on view1 and view2 of the model, then

Cview1 = TRUE if for all (xpi1, ypi1): I(xpi1, ypi1) = L;
Cview2 = TRUE if for all (xpi2, ypi2): I(xpi2, ypi2) = L;

where I is the intensity value and L is the intensity value of the body area.

C1 = TRUE <=> Cview1 = TRUE AND Cview2 = TRUE.

If the number of matched joints exceeds a predefined threshold, the configuration of transformations is classified as satisfactory for C1.
2. Distance condition C2: The segments of the body parts should have equal distance to both sides of the boundary of that body part, assuming that the 3D segments are cylinders. This condition is used
Fig. 5. The synthesis part.
to adjust the position of the segment being matched by condition C1. This can avoid accumulation of error when traversing down the tree. In the case of overlapping, we need to detect partial boundaries of some body parts, which are supposed to be straight line segments. In this case, the system goes back to the gray level images to detect the overlapped area.
3. Position condition C3: All the joints (starting points of the segments) should satisfy the physical conditions imposed by the biomechanic model. The system only searches the area within the physical angle constraints. A joint must be within its angle range for a correct match. If joint i has the rotation angle values ax, ay, az, then C3 = TRUE if for every joint

axmin <= ax <= axmax
aymin <= ay <= aymax
azmin <= az <= azmax

4. Temporal condition C4: The position of the same node matched should be within a small area around the position where it was matched in the preceding frame. This is also named the continuity condition. For every frame and segment we can calculate the Euclidean distance from one joint to the same joint in the next frame. The condition C4 is TRUE if

D(ti, ti+1) <= Dlimit

for this joint. Dlimit is specified for a particular kind of movement; in our examples, it is set for jumping and walking, respectively.

Fig. 7. The input image sequence and binary images for example 2, the walking person.
Fig. 6. The input image sequence for example 1, the synthetic jump.
The temporal condition can also be used to track overlapping body segments, assuming the overlapping situation results from a non-overlapping situation.
5. The collision condition: There should be a one-to-one correspondence between nodes and their positions in 3D space. If more than one joint is matched at the same 3D position, a collision is detected in the matching. A match should not have a collision. A simple, straightforward method to detect the collision is proposed: for each segment pair being examined, compute their minimal distance. If this distance, taken between the closest points within the segments, is smaller than the sum of the radii of the two segments, a collision is detected; otherwise, no collision is reported for the segment pair. The system solves a set of linear equations and classifies the different cases. The process is very fast.

A match is reached, or partially reached, if all or some conditions are satisfied. There is no match if the system reports that the conditions are not satisfied. Regarding the matching criteria, please see [15] for more details. The result of the matching process is a sequence of stick figures for the input images with all the 3D information of each body part specified.

2.4. The interpretation phase

This is the last phase in the analysis part. The matching result gives all the 3D positions of each joint of the body at each moment of time that images were taken. In this phase, the system derives all the motion parameters from the matching result. All the correspondences between key points in the sequence are clear, since the matching result records which body part is at which position in each frame. The time information can be obtained from the equal time interval between frames when the images were
Fig. 8. The processing phases for example 1.

Fig. 9. The processing phases for example 2.
taken. The kinematic analysis of the movement of the joints is obtained from the derivatives of the parametric equations. In order to have a complete, continuous description of the motion, motion information between frames is obtained by using B-spline functions for interpolation, which smooth and interpolate the 3D spatial positions of the joints along the time axis. The interpretation phase provides all the necessary information for the motion description, such as the trajectory, speed, and acceleration of each body part.

3. THE SYNTHESIS PART

The emphasis of our system is on the analysis part. However, to form a complete system, the synthesis part has also been developed. In this part, the numerical or graphical information for a particular part of the human body can be visualized. Simple animation is also possible, animating the same motion that the input image sequence represents. This can be a vivid medium for a human observer to analyze the motion in any desired mode, such as from an angle that is not usually available from a camera. Figure 5 shows the major functions of the synthesis part. The functions include displaying the images in the sequence, displaying the model superimposed onto the real images, displaying the model with desired transforms, displaying the synthesized sequence in high-resolution form, and recording the synthesized images. Its functions also include manipulation of the data files and image files in the database. The following modules in the synthesis part realize the above-mentioned functions:
1. movement generation module;
2. viewing control module;
3. model visualization module.

The first module manipulates the 3D positions of the joints in the body structure that controls the motion. The system allows the user to address any specific node by its name and to manipulate the transformation matrices down to this node. The kinematic chain structure makes it necessary to perform the concatenated transformations if any node performs a motion. Our intention here is not to generate a full and complete animation of the human body. The original purpose is to adjust the model in order to do the matching; of course, it can also serve display purposes for movement control. The system does not define a model of the movement, and the user has to control the transformations to reach the desired modification. For more sophisticated human body animation systems, please refer to other references, such as [16]. The second module serves the purpose of controlling the viewpoints of the virtual cameras when displaying the synthesized motion. This set of functions includes the typical modifications of the viewpoint of the camera, the relations between the cameras (the real system uses two, but the synthesis can use more), the projection planes, etc. These facilities allow the synthesized motion to be viewed from any viewpoint and from any angle, which can be very different from the real case. The last module allows the user to choose the appearance of the human body in the reconstruction sequence. It is possible to display the human model with
Fig. 10. The synthesized motion sequence for example 2.
a stick figure or with a volumetric representation using cylinders, etc. The size of the person on the screen can also be controlled. There are more options in the synthesis part. The user can send the wire-frame sequence for high-resolution display, which will do the rendering for the synthesized images. Recording the sequences with a video system is also possible.

4. EXPERIMENTAL RESULTS
The experiments were performed on a Sun 4/280 workstation. Two examples are presented. Figure 6 shows the input sequence of a synthetic jump. The images were taken by two synchronized Panasonic S-VHS video cameras and digitized by a Matrox board. The image size is 512 x 512 pixels. Figure 7 shows the input image sequence and the binary images for the walking person. Figures 8 and 9 show the processing phases for example 1 and example 2, which include the original images, thresholded images, matched images, etc. The final motion detection result is shown as the figures detected for the input images in Fig. 10. The SunTools graphics facility has been used to interface with our system.

5. CONCLUSIONS

An automatic rotoscopy system for human motion detection is presented in this paper. The analysis part of the system analyzes the two-view input image sequence in order to discover the 3D spatial information in each image frame. Matching a predefined, actually measured biomechanic model to the input images is the key approach. The experimental results show that the system can handle simple motion of the human body. The synthesis part displays the detected motion of the human body in any desired way. This makes further analysis of the motion possible, which is suitable for many applications.

REFERENCES
1. Peak Performance Technologies Inc., The Peak Performance System, Englewood, CO (1988-1989).
2. H. Hatze, High-precision three-dimensional photogrammetric calibration and object space reconstruction using a modified DLT-approach. Journal of Biomechanics 21, 533-538 (1988).
3. G. Ferrigno and A. Pedotti, ELITE: A digital dedicated hardware system for movement analysis via real-time TV signal processing. IEEE Trans. Biomed. Eng. BME-32(11), 943-949 (1985).
4. R. Mann and E. Antonsson, Gait analysis--precise, automatic, 3-D position and orientation kinematics and dynamics. Bulletin of the Hospital for Joint Diseases Orthopaedic Institute XLIII(2), 137-146 (1983).
5. J. Wilhelms, Using dynamic analysis for realistic animation of articulated bodies. IEEE Comp. Graphics and Appl., 12-27 (June 1987).
6. Coritel Soluciones Integrales Informaticas, S.A., COMPAMM: Computer Analysis of Machines and Mechanisms. CEIT, Centro de Estudios e Investigaciones Técnicas de Guipúzcoa (1989).
7. P. M. Isaacs and M. F. Cohen, Mixed methods for complex kinematic constraints in dynamic figure animation. The Visual Computer, Springer-Verlag, Berlin, 296-305 (1988).
8. J. O'Rourke and N. Badler, Model-based image analysis of human motion using constraint propagation. IEEE Trans. PAMI 2(6), 522-536 (1980).
9. J. K. Aggarwal and N. Nandhakumar, On the computation of motion from sequences of images--A review. Proceedings of the IEEE 76(8), 917-935 (August 1988).
10. J. K. Aggarwal and W. N. Martin, Image Sequence Processing and Dynamic Scene Analysis, Springer-Verlag, Berlin, 41-73 (1983).
11. D. Miller, Modelling in biomechanics: An overview. Medicine and Science in Sports 11(2), 115-122 (1979).
12. M. Dooley, Anthropometric modeling programs--A survey. IEEE Comp. Graphics and Applications, 17-25 (November 1982).
13. D. Tost and X. Pueyo, Human body animation: A survey. The Visual Computer 3, 254-264 (1988).
14. J. J. Uicker, J. Denavit, and R. S. Hartenberg, An iterative method for the displacement analysis of spatial mechanisms. Journal of Applied Mechanics, 309-314 (June 1964).
15. F. Perales, J. J. Villanueva, and Y. Luo, Matching criteria in a biomechanic model driven human motion perception system. Proceedings of the Sixth International Symposium on Computer and Information Sciences, Elsevier Science Publishers, Amsterdam, 1029-1138 (November 1991).
16. N. I. Badler, Human task animation. NCGA '89 Conference Proceedings, 10th Annual Conference and Exposition Dedicated to Computer Graphics 1, 343-354 (1989).