Computers in Biology and Medicine 43 (2013) 1927–1940
Overall design and implementation of the virtual glove

Giuseppe Placidi a, Danilo Avola a, Daniela Iacoviello b, Luigi Cinque c

a A2VI-Lab, Department of Life, Health and Environmental Sciences, University of L'Aquila, Coppito 2, 67100 L'Aquila, Italy
b Department of Computer, Control and Management Engineering "A. Ruberti", Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
c Department of Computer Science, Sapienza University of Rome, Via Salaria 113, 00198 Rome, Italy
Article history: Received 7 November 2012; accepted 31 August 2013

Abstract
Post-stroke patients and people suffering from hand diseases often need rehabilitation therapy. The recovery of original skills, when possible, is closely related to the frequency, quality, and duration of the rehabilitative therapy. Rehabilitation gloves are tools used both to facilitate rehabilitation and to monitor improvements through an evaluation system. Mechanical gloves have a high cost, are often cumbersome, are not re-usable and, hence, cannot be used on the healthy hand to collect the patient-specific hand mobility information toward which rehabilitation should tend. The approach we propose is the virtual glove, a system that, unlike tools based on mechanical haptic interfaces, uses a set of video cameras surrounding the patient's hand to collect a set of synchronized videos used to track hand movements. The hand tracking is associated with a numerical hand model that is used to calculate physical, geometrical and mechanical parameters, and to implement boundary constraints such as joint dimensions, shape, joint angles, and so on. Besides being accurate, the proposed system aims to be low cost, not bulky (touch-less), easy to use, and re-usable. Previous works described the virtual glove general concepts, the hand model, and its characterization, including the system calibration strategy. The present paper provides the virtual glove overall design, both in real-time and in off-line modalities. In particular, the real-time modality is described and implemented, and a marker-based hand tracking algorithm, including a marker positioning, coloring, labeling, detection and classification strategy, is presented for the off-line modality. Moreover, model based hand tracking experimental measurements are reported, discussed and compared with the corresponding poses of the real hand. An error estimation strategy is also presented and used for the collected measurements. System limitations and future work for system improvement are also discussed.
Keywords: Hand rehabilitation; Rehabilitation glove; Virtual glove; Hand tracking; Numerical hand model
1. Introduction

Patients suffering from residual hand impairments, following a stroke or after surgery, need rehabilitation therapy. Each exercise has several levels of difficulty corresponding to the maximum force that can be applied, its duration, and other parameters. During exercise, an elastic object is often squeezed by the patient to recover strength. It has been demonstrated that training results have to be continuously controlled by a therapist to be effective [1]. Traditional rehabilitation is done one-to-one, namely one therapist (or sometimes several) working with one patient. Costs are high, especially for demanding patients, and monitoring is left to the experience of the therapist, without objective measurements. Regarding the therapy that the patient performs at home, there is currently no monitoring. Recent studies suggested that repetitive and long-duration
training using auxiliary systems and virtual or augmented reality is helpful [2–5]. This context highlights the importance of having an assisting, possibly automatic, rehabilitation system whose data are numerically analyzed and transmitted, through the internet, to healthcare structures. Numerical evaluation and comparison over time could represent a reference strategy to evaluate rehabilitation progress objectively, with a precision of the order of a few millimetres at high temporal resolution, instead of a subjectively evaluated monitoring without comparison with previous data. Moreover, it could provide a useful tool to monitor rehabilitation performed at home. Several of these integrated systems, exploiting haptic glove based interfaces, have been designed for virtual reality simulation, such as the Rutgers Master II-ND glove [6], the CyberGrasp glove [7], the LRP glove [8], integrated versions of them [9], or other pneumatic gloves [10,11]. In these systems, the mechanical unit interacts both with the hand of the patient and with a virtual reality environment during rehabilitation. Despite the undoubted advantages in the use of such rehabilitation systems, haptic gloves have several force feedback terminals per finger with the forces
grounded in the palm or on the back of the hand, which makes them heavy (some hundreds of grams), cumbersome and greatly limiting of natural and spontaneous movements. Moreover, they can be very expensive due to their personalized electronic and mechanical components. In addition, each device has to be constructed specifically for each patient, taking into consideration the patient's residual infirmities. This seriously limits re-use of both the mechanical device and the associated control system. Moreover, the patient may tend to rely too much on the device, rather than using his own hand, once he becomes familiar with the assisting device. An alternative approach is to use virtual glove based interfaces to replace the electro-mechanical gloves [12–17]. The approach we propose, which completes [12–14], is based on this latter kind of interface. The system aims to assist and monitor hand movements with acceptable spatial precision and good temporal resolution, to allow an objective evaluation of rehabilitation progress. The proposed system should overcome most of the limitations of mechanical gloves, in particular: high cost (the cost of a mechanical glove is about 20,000–30,000 USD); difficulty of use (a mechanical glove has to be correctly worn and properly set); lack of re-usability (a mechanical glove is often personalized and cannot be used by different patients); impossibility of setting the optimal recovery level (a mechanical glove cannot be used on the healthy hand to evaluate the personalized optimal recovery level toward which rehabilitation should tend). The proposed system consists of a cubic box in which the hand of the patient is inserted. The hand movements are collected by a set of video cameras positioned at the vertices of the cube and tracking is performed. Tracking is used to determine the position, orientation and articulation of the hand over time. However, building a fast, efficient and effective stereo vision based tracker for the hand is still a challenging issue. This is due to several factors, such as ambiguities due to occlusions, the high dimensionality of the problem, noise in the measurements, lack of visible surface texture, and significant lighting variations due to shading. Monocular vision approaches are even more difficult due to depth ambiguities [18–21]. Approaches based on the analysis of data collected by depth sensing video cameras (3D cameras) [22] have greater occlusion issues than stereo vision approaches if a single 3D camera is used. A huge number of papers dealing with hand and finger tracking has been published [23–32]. A possible classification can be made according to the use, or not, of markers to support tracking:
- marker based hand tracking approaches [27–29]: use passive (i.e. colored or reflective materials) or active (i.e. LEDs providing fixed or modulated lights) markers placed on specific landmarks. The coordinates of these points are used to update the hand model state;
- markerless hand tracking approaches [30,31]: based on the back projection technique [32]; the 3D visual hull of the hand is obtained from the inverse projection of different silhouette views of the hand.
Detailed descriptions of the presented classification can be found in [33,34]. The method we describe uses a set of colored passive markers placed on the fingertips and on the back of the palm. The hand tracking is associated with a numerical hand model [13,14] that is used to calculate some useful parameters without using electro-mechanical haptic interfaces. This design allows a very accurate calculation of physical, geometrical and mechanical parameters, including hand, finger and joint positions, movements, speed, acceleration, produced energy, motion direction, displacements, angles, applied forces, and so on. While previous works described the virtual glove general concepts [12], the hand SimMechanics model [13] and its
characterization, including the system calibration strategy [14], in an off-line modality, the present paper provides the final virtual glove ensemble applied to hand, finger and finger joint tracking, both in real-time and in off-line modalities. The first, the real-time modality, is used to drive the exercises and has to be fast enough to offer the patient a real-time interaction with the virtual environment and a general, qualitative evaluation of how the exercise is going. The second, the off-line modality, is used to measure hand movements and to calculate physical and mechanical parameters, to offer the therapist an objective, analytical, quantitative evaluation of the results. Quantitative calculations can be executed after the rehabilitation session has terminated. In what follows, the real-time modality is described and implemented, and a marker-based hand tracking algorithm, including a marker positioning, coloring, labeling, detection and classification strategy, is presented for the off-line modality. Moreover, preliminary model based hand tracking experimental measurements are reported, discussed and compared with the corresponding poses of the real hand from different views through the reprojection error. A marker-based approach is used because our aim is to recognize the joints of the hand with accurate and reproducible spatial accuracy. Moreover, a marker-based approach can be useful to reduce the number of occlusions by installing rigid protruding markers either directly on the fingertips or inserted into an object to be grasped by the patient (for example a pegboard). The paper is structured as follows. Section 2 reviews the high level description of the virtual glove, illustrating its usage both in real-time and in off-line modality, with particular attention to the accurate hand model used for the off-line modality, providing details of the numerical model constraints and approximations. Section 3 presents the hand tracking strategy used to extract useful information from the recorded video streams coming from the synchronized cameras. Details regarding marker disposition and coloring, the detection approach, and the classification process are also provided. Section 4 reports and discusses experimental hand tracking measurements and error analysis. Section 5 concludes the paper and discusses some issues on which we are currently working to improve the system.
2. The virtual glove design

Fig. 1. High-level description of the virtual glove environment. The Glove Box Module contains the measuring equipment. The local host implements the Measuring Module, and both the Real Time and the Off-Line modalities. A web server, connected to the local host through the internet, can manage a database containing the rehabilitation outcomes for future evaluations.

Fig. 1 presents the high level architecture of the virtual glove system. It has the following structure:
a. The Glove Box Module consists of a cube-shaped glass box, with a side of 60 cm, and a support positioned on the upper face. At the center of two lateral faces of the cube, opposite one another, circular holes are cut to insert the hand into the cube. Holes in opposite faces serve to analyze either the left hand or the right hand (data collected from the healthy hand can be very useful, for reference) if asymmetrical camera configurations are used: most of the useful information is located on the back of the hand (see below). The support on the upper face of the box hosts a 3D depth sensing camera. The camera distance from the center of the box is 100 cm. The depth sensing camera provides a video stream to feed a virtual reality module, implementing the real-time modality. The box is used both to sustain the hand and to house, on some of its vertices, four 2D video cameras. The video cameras supply four synchronized video streams that provide the information to reconstruct, track and analyze hand movements in the off-line modality. The video streams are collected, stored in memory, and processed at a later time. Each video camera is at about 53 cm from the center of the box. In this way, a useful
circular region of about 40 cm in diameter, located at the center of the cube, can be used for the hand movement analysis. The walls of the box are covered with uniformly colored paper sheets to make the background uniform, thus facilitating video processing. For the same reason, the interior of the box is illuminated uniformly to reduce shadows.
b. The Measuring Module consists of two sub-modules, Continuous 3D Camera Stream and Synchronized 2D Video Streams Recording. Both are used to process the video streams collected by the cameras. The first drives a 3D depth sensing camera, the OptriCam 130 [35], a time-of-flight (TOF) camera [36] installed on the external support and oriented toward the center of the box. This allows the construction of a depth map, corresponding to the third dimension, at 30 frames per second (fps). To avoid saturation, a focal length of 100 cm is used (the minimum value). For this reason it is mounted on an external support (for details, see Figure 9 of the supplementary material). The video stream coming from the 3D depth sensing camera is neither recorded nor used in the off-line modality. The information collected by the depth camera is used only to drive the patient exercise, in real time. It is also used to initialize and start/stop the rehabilitation session. The start and stop signals are used to start and stop the recording of the video streams coming from the other synchronized video cameras. This module is usually developed according to specific requirements involving virtual objects or targets (e.g. grasping a virtual sphere, managing virtual tools, and so on) [37–40]. The second sub-module drives the recording of four synchronized videos collected by four USB low-cost cameras (Phenix Q7 USB color cameras, 640×480 pixels in resolution) to be used in the off-line modality. Each camera is oriented toward the center of the box. With this configuration, each point of interest is hopefully covered by at least two views. This is a prerequisite to recover the 3D information from 2D images in a usual stereo vision system [41]; otherwise the position is undetermined. Each camera is linked to the same host which, as a first step, synchronizes and stores the four video streams collected by these cameras. To capture a video
at 30 fps, with 640×480 resolution and 24 bit color, about 7.4 Mbit per frame (640 × 480 × 24 bits) are necessary for each camera, corresponding to a bandwidth of about 221 Mbit/s per camera at 30 fps. With four cameras, two USB controllers are therefore required, the bandwidth of a USB 2.0 controller being 480 Mbit/s. The video stream recording is started by a start signal coming from the Real Time Module, activated by a specific gesture captured by the 3D camera. In the same way, the video stream recording is ended by a specific gesture recognized by the same camera in the same module.
c. The Real Time Module is composed of two sub-modules, Hand Reconstruction and Virtual Reality Environment. The first reconstructs a 3D blob-shaped spatial model of the hand by using the video collected by the 3D depth sensing camera. The second implements the virtual rehabilitation scenario including the blob model of the hand. The video is processed in real time and the reconstructed blob hand model is inserted into the virtual environment (Figure 10 of the supplementary material shows a sequence of the blob model reconstruction), where it can interact with virtual objects. The blob hand model, a raw model (in the sense that it has no cognition of what it represents), can be affected by occlusions because it is based on a single point of view. The blob model is also used to start/stop the exercise by locking some fixed positions in the virtual space.
d. The Off-Line Processing Module is used for the off-line modality. This module is implemented in Simulink and SimMechanics [42]. Simulink has powerful and suitable features to acquire and process synchronized video streams. SimMechanics is a block diagram modeling environment used to design and simulate multibody mechanical systems. It is capable of simulating the mechanical characteristics of the objects it models and of calculating the forces applied on each object. The off-line processing represents the system core: reconstruction and analysis of the patient's hand behavior are performed in this module. It is composed of five different sub-modules. The first, Initialization, calibrates the hand model to the hand of the specific patient. It is necessary to set the size and shape of the
numerical model parts (e.g. if the patient's hand is missing a finger, the model is set accordingly). The calibration of the numerical model is made once: its parameters are stored in memory and used when necessary. The second, Recognition, extracts, from each frame, the 2D coordinates of the colored markers placed on fiducial points of the hand. The third, Classification, performs the 3D reconstruction of the coordinates of each marker from the 2D information and associates these coordinates with the corresponding anatomical landmark on the numerical model. The fourth, Tracking, performs hand and joint tracking. The last, Forces Computation, supports the numerical model of the hand. In this way, a complex object, the hand, is implemented as a set of simpler objects, the phalanges, held together by hinges used to model joint rotation. Rotational constraints, used to eliminate unrealistic hand configurations (e.g. backward bending of the fingers), and dynamic or mechanical parameters (e.g. the mass of each object and the gravity direction) are assigned to the model to increase the accuracy of the force and torque calculation. The off-line modality can be summarized as follows (a structural sketch is given after the module list below):
1. Initially, the hand is placed in a predefined position. The video streams corresponding to the hand placed in this position are used for initialization. This initialization is required to calibrate the numerical model of the hand. The calibrated model is used as input for the Virtual Glove: the initial numerical model represents the first internal state for the calculation process of each exercise.
2. When the acquisition is enabled by the real-time modality, a set of synchronized videos is collected and stored (at the same time, the real-time modality allows the exercise execution in the virtual environment).
3. The memorized synchronized videos are analyzed to detect (Recognition) and classify (Classification) useful points. In this context, it should be noted that the system is also designed to analyze an object grasped by the patient: in this case, besides the numerical model of the human hand, the numerical model of the grasped object has to be provided. The 2D coordinates (in pixels) of the joints, fingertips and grasped object on the image plane of each camera are used to compute the 3D coordinates (in mm) of the joints and fingertips of the hand with respect to a fixed coordinate reference system. The 3D coordinates are used to fit the numerical 3D hand model to the real pose of the hand.
4. The previously fitted pose is used to track and update the following movements of the model (Tracking).
5. The forces applied by each finger on the grasped object (if present) or in response to gravity are calculated.
e. The Remote Storage Module is a remote network web-server where rehabilitation results, coming from the off-line modality, are stored. It can be accessed to evaluate rehabilitation parameters and to perform temporal comparisons.
f. A colored glove (or a set of colored markers) used to help the recognition/classification process.
g. A set of calibrated objects (rubber balls, peg boards, etc.) to be grasped by the patient during rehabilitation.
We dedicate particular attention to the off-line modality: this is due to the difficulty of accurate spatial reconstruction, classification, and tracking of a hand model. For this reason, in the following subsections we describe in detail the chosen hand model and tracking strategy.
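As a purely structural illustration of how the five off-line sub-modules are chained per recorded session, the following Python-style sketch outlines the processing flow. The actual system is implemented in Simulink/SimMechanics; all function and type names below (HandModelState, recognition, classification, tracking, forces_computation) are ours and the bodies are left as stubs.

from dataclasses import dataclass
from typing import Dict, List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

@dataclass
class HandModelState:
    """Calibrated numerical hand model: segment sizes plus the current pose."""
    segment_lengths: Dict[str, float]      # set once by Initialization
    joint_angles: Dict[str, float]         # updated for every frame quadruple
    palm_pose: Tuple[Point3D, Point3D]     # position/orientation of the palm triangle

def initialization(reference_quadruple) -> HandModelState: ...                 # calibrate the model
def recognition(frame_quadruple) -> List[List[Point2D]]: ...                   # 2D marker centroids, one list per view
def classification(markers_2d, state) -> Dict[str, Point3D]: ...               # labeled, triangulated landmarks
def tracking(landmarks_3d, state) -> HandModelState: ...                       # fit the model to the landmarks
def forces_computation(states: List[HandModelState]) -> Dict[str, float]: ...  # forces/torques per finger

def offline_session(reference_quadruple, recorded_quadruples):
    """Off-line modality: calibrate once, then process every stored frame quadruple."""
    state = initialization(reference_quadruple)
    history = [state]
    for quad in recorded_quadruples:
        markers_2d = recognition(quad)
        landmarks_3d = classification(markers_2d, history[-1])
        state = tracking(landmarks_3d, history[-1])
        history.append(state)
    return forces_computation(history)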
2.1. The numerical hand model

The human hand has been studied for many years [43,44]. A general description of the hand structure is given in Fig. 2.
Fig. 2. Hand skeleton details. Bones and joints are indicated (wrist bones are not indicated because they are not used in our case). The palm triangle is also indicated with a dashed line (circles indicate marker positions on the back of the hand).
A total of twenty-seven bones constitute the hand skeleton, grouped into carpals, metacarpals, and phalanges. The wrist is formed by eight carpal bones grouped in two rows with very restricted motion between them. Finger joints are named, according to their location, metacarpophalangeal (MCP), the joints connected to the palm, and interphalangeal (IP), the others. The nine interphalangeal joints have only one degree of freedom (DOF), namely flexion/extension. MCPs are described in the literature as saddle joints with two DOFs: one for abduction/adduction (e.g. spreading the fingers apart) and one for flexion/extension. The thumb is particular: it is the shortest and most mobile finger and is composed of two phalanges. Finally, the wrist movements can be modeled by six DOFs (i.e. three DOFs for translation and three DOFs for rotation). Starting from the anatomical structure, it is necessary to formalize the model. Several models of the hand have been proposed according to the approximation requirements of the specific design [45–50]. The complete hand model has up to 30 DOFs, resulting in a very complex system. In our context, we propose a simplified model approximated by the following constraints:
- the hand is represented as a skeletonized figure and each finger as a kinematic chain, whose base is anchored to the palm and whose fingertip is an end-effector;
- the palm is represented as a rigid body. This means that palm arches are not allowed;
- the hand model is associated with the back of the hand. In this way, the presence of the skin and other soft tissues does not significantly influence the model accuracy.
Volumetric models using geometric objects (e.g. cylinders, cones, cubes) are described in the literature [51]; we use a non-volumetric stick model because it represents, in a manageable and simple way, all the information we are interested in. Fig. 3 represents the used stick model, the generic finger and palm plane representation, and the adduction/abduction angles for the thumb and for one of the other fingers, respectively.

Fig. 3. Hand model (a), finger and palm plane representation (b), and adduction/abduction angles (c) for the thumb (top) and for one of the other fingers (bottom). The world reference frame, the hand reference frame and the finger plane reference frame (indicated with RW, RH, and RF, respectively) are also shown in (b).

More specifically, Fig. 3a reports the stick model, with the DOFs
corresponding to the bending angles. These angles are labeled as follows:
- θ_jointName,i: flexion/extension angle of the "jointName" joint of finger i, i = 1:5;
- φ_jointName,i: adduction/abduction angle of the "jointName" joint of finger i, i = 1:5.
For a 2-DOF joint (MCP for F_i, i = 2:5, or TM for F_1) both θ_jointName,i and φ_jointName,i are variable, whereas for a 1-DOF joint (DIP, PIP for F_i, i = 2:5, or IP, MCP for F_1) only θ_jointName,i is variable. For this reason, each finger is modeled as a 4-DOF mechanical system. It is worth noting that the thumb has a different structure (i.e. it has 5 DOFs: 2 DOFs both for the MCP joint and for the TM joint and 1 DOF for the IP joint) but, to simplify the model, we consider it as having the same structure as the other fingers (i.e. 4 DOFs: 2 for the TM joint and one each for the MCP and IP joints). In conclusion, we assume that:
- the fingers F_i (i = 2:5) have two DOFs for the MCP joints and one DOF each for the DIP and PIP joints;
- the thumb, F_1, has two DOFs for the TM joint (flexion/extension and abduction/adduction movements) and one DOF each for the IP and MCP joints (flexion/extension movement).
The wrist forms a 6-DOF frame representing the position and orientation of the hand in space (i.e. 3 DOFs for translation and 3 DOFs for rotation). With these assumptions, a complete hand model can be represented by 26 DOFs.

2.1.1. Motion constraints

An important issue is the range of motion of the hand joints and the relationships between joint angles. In fact, most of the possible combinations of joint positions yield unrealistic hand configurations. To avoid these configurations, and to simplify the model by reducing the DOFs of the whole system, we introduce some constraints between the angles θ_MCP,i, θ_DIP,i and θ_PIP,i for the fingers F_i, i = 2:5, and between θ_TM, θ_MCP and θ_IP for the thumb [52]. In particular, the used constraints are the following:
θ_MCP,i = θ_PIP,i,   θ_DIP,i = (2/3) θ_PIP,i    (1)

and:

θ_TM = (4/5) θ_MCP,   θ_IP = (1/2) θ_MCP    (2)
The previous constraints allow a substantial reduction of the number of joints to be considered during classification and tracking
processes. The 26-DOFs model becomes a 16-DOFs model (6-DOFs for the wrist and 2-DOFs for each finger remain). The use of a 16 DOFs model reduces occlusions (fewer points have to be measured). With a 16-DOFs model, the points P1, P5, P17, P4, P8, P12, P16, and P20 (8 points) are sufficient to locate the model completely. The finger kinematics can be computed according to three different reference frames (Fig. 3b):
- R_W = [x_W, y_W, z_W]: the world reference frame;
- R_H = [x_H, y_H, z_H]: the hand reference frame, whose origin is at the point P1 and whose axes are defined by the palm orientation;
- R_F = [x_F, y_F, z_F]: the finger plane reference frame, whose origin is at the first flexion/extension joint of a finger and whose axes are defined by the palm and finger planes.
The first angle to be determined is the finger abduction angle. The other three joints, having parallel axes of rotation, move the finger in a single plane. A differentiation between the thumb and the other fingers is made. In particular, the abduction/adduction angles are calculated as follows (Fig. 3c):
- φ_TM (thumb): angle between the projection of the line l0 of the thumb on the plane π and the line identified by the points P1 and P5. The plane π contains the line identified by the points P1 and P5 and is perpendicular to the palm plane;
- φ_MCP,i, i = 2:5 (other fingers): angle between the projection of the line l0 of F_i, i = 2:5, on the palm triangle plane and the line l⊥. l0 is identified by the base and the tip points of a finger; l⊥ is contained in the palm plane and is perpendicular to the line identified by the points P5 and P17.
Once the adduction/abduction angles are calculated, the orientations of the finger planes are defined. The position and orientation of the finger plane reference frame R_F with respect to the world reference frame R_W is defined by the transformation:

T^F_W = T^H_W T^F_H    (3)

where T^H_W represents the position/orientation of the palm plane with respect to the world reference frame and T^F_H represents the position/orientation of the finger with respect to the palm plane. In this way, the kinematic relation of a finger with respect to the finger plane reference frame can be written as (for simplicity we omit
the subscript indicating the finger):

x_tip = 0
y_tip = −l1 sin(θ_MCP) − l2 sin(θ_MCP + θ_PIP) − l3 sin(θ_MCP + θ_PIP + θ_DIP)    (4)
z_tip = l1 cos(θ_MCP) + l2 cos(θ_MCP + θ_PIP) + l3 cos(θ_MCP + θ_PIP + θ_DIP)

By substituting Eqs. (1) and (2) in (4) we obtain:

x_tip = 0
y_tip = −l1 sin(θ_MCP) − l2 sin(2θ_MCP) − l3 sin((8/3)θ_MCP)    (5)
z_tip = l1 cos(θ_MCP) + l2 cos(2θ_MCP) + l3 cos((8/3)θ_MCP)
The system of Eq. (5), in the variable θ_MCP, can be solved by a non-linear optimization method, such as Newton's method [53], to provide the set of angles of Fig. 3b.
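The paper indicates Newton's method [53] for solving Eq. (5); the sketch below uses the closely related Gauss–Newton step on the two residuals of Eq. (5), so that the single unknown θ_MCP is found in a least-squares sense. The link lengths l1, l2, l3, the starting value and the iteration count are illustrative choices of ours; in practice the angle from the previous frame is a natural starting value.

import numpy as np

def solve_theta_mcp(y_tip, z_tip, l1, l2, l3, theta0=0.3, iters=20):
    """Solve Eq. (5) for theta_MCP (coordinates expressed in the finger plane frame R_F)."""
    theta = theta0
    for _ in range(iters):
        a1, a2, a3 = theta, 2.0 * theta, (8.0 / 3.0) * theta
        # residuals of the y and z equations of Eq. (5)
        r = np.array([
            -(l1 * np.sin(a1) + l2 * np.sin(a2) + l3 * np.sin(a3)) - y_tip,
            (l1 * np.cos(a1) + l2 * np.cos(a2) + l3 * np.cos(a3)) - z_tip,
        ])
        # derivatives of the residuals with respect to theta_MCP
        j = np.array([
            -(l1 * np.cos(a1) + 2.0 * l2 * np.cos(a2) + (8.0 / 3.0) * l3 * np.cos(a3)),
            -(l1 * np.sin(a1) + 2.0 * l2 * np.sin(a2) + (8.0 / 3.0) * l3 * np.sin(a3)),
        ])
        theta -= float(j @ r) / float(j @ j)   # Gauss-Newton step for a scalar unknown
    return theta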
3. Hand tracking

We adopted a computer stereo vision based approach that evaluates hand movements by processing, over time, a set of synchronized video frames referring to different perspective views. Each new hand pose is considered as a new step of the calculation process (in the off-line modality). For the reasons explained above, we adopt a marker based method and a stereo vision system composed of four 2D video cameras. The tracking process is supported by the 16-DOF hand kinematic model described above. The logical scheme of the used tracking algorithm is shown in Fig. 4.

Fig. 4. Logical scheme of the off-line tracking algorithm.

The method first treats separately the four corresponding frames of each of the four videos, by instancing 4 threads of the same process, and then merges the obtained information to perform classification and tracking. In particular:

1. The windowing step detects all viewed markers within defined search windows. A search window is a virtual rectangular region, a portion of the image, where a marker is searched: the rest of the image is ignored. Each marker has a specific search window that has to be adjusted through the following video frames.
2. The markers detection step indicates the position of a marker inside the search window. In this context, the initialization stage is important (the hand has to assume a starting position that is always the same for each exercise and for each instance of the same exercise). This action produces the initial setting for each search window. The updating of the search windows, as the frames progress, is ensured by a Kalman Filter [54]. This filter provides the prediction information to detect future marker positions. The search windows must be as small as possible to improve computing efficiency.
3. The blob analysis step is used to calculate blob geometric properties (area and centroid). A blob is a region of connected and homogeneous pixels indicating a feasible region. Blobs that are too small (with area below a fixed threshold) or too big (with area above another fixed threshold) are discarded. Since all markers have a fixed size, the expected blob size can be approximated by using the known camera focal length and the distance of the operating volume from the cameras.
4. In the markers correspondence step, the blobs representing the same marker coming from different cameras (different views) are matched using epipolar geometry [55,56]. In this way, each marker centroid is linked to a list of its corresponding points in the other views. The number of elements of this list can vary from zero (i.e. the marker is viewed by only one camera) to three (i.e. the marker is viewed by all cameras). The epipolar geometry is used to find point correspondences using a pair of different views related to the current time instant. Due to errors in the camera calibration process or to noise, a matched point may not lie exactly on the epipolar line, as expected, but can be very close to it. For this reason a threshold distance is used in all calculations of the epipolar distance (calculated values below this threshold indicate matches). In this way, we obtain a rapid and reliable matching process.
5. The 3D triangulation step computes the 3D triangulation of corresponding centroids from different views (a triangulation sketch is given after this list). The result is a set of 3D coordinates of those markers collected by at least two distinct views.
6. The markers classification step finalizes the matching. This step considers the geometric relationships between markers (i.e. the triangulation process), recovers the color of each marker and associates it with the related point on the stick model of the hand.
7. The 3D Kalman Filter is used to smooth the 3D data coming from the triangulation step. This stage is also used to obtain prediction information about the state of the points (i.e. velocity and position): such information is used when occlusions occur and a position prediction strategy is required [57].
8. The inverse kinematics step is used to estimate the joint angles given the positions of the fingertips [58]. It is performed on the 16-DOF stick model (i.e. on its representation provided by the recognized markers). The outputs of this step are the joint angles and the palm triangle pose at the current time instant.
9. The model state update step updates the pose of the numerical model. Starting from the initial position of the hand, the system obtains a set of time-varying hand poses showing the movements of the hand during the rehabilitation activity.
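As a sketch of the 3D triangulation step (step 5 above), the following function performs a standard linear (DLT) triangulation of one matched centroid pair. The projection matrices are assumed to come from the camera calibration described in [14]; the function name is ours.

import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one marker centroid seen in two views.

    P1, P2: 3x4 projection matrices of two calibrated cameras;
    u1, u2: matched 2D centroids (in pixels) of the same marker.
    Returns the 3D point in the world reference frame.
    """
    P1, P2 = np.asarray(P1, float), np.asarray(P2, float)
    u1, u2 = np.asarray(u1, float), np.asarray(u2, float)
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]          # de-homogenize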
The tracking method strongly relies on the marker positioning and coloring strategies. The following sub-sections describe the tracking steps: marker disposition and coloring, marker detection, the classification procedure, the tracking scheme, and the occlusion solution.

3.1. Markers positioning and coloring

Markers are simple colored patches of about 1 cm² attached to the fingertips and to the back of the hand. We experimentally noted that smaller markers produce higher evaluation errors during windowing and marker detection. We chose ring shaped markers for the fingertips to reduce the probability of occlusions occurring when the hand is bent or closed. The other markers have a rectangular shape. To recover the hand pose with the 16-DOF model, it is necessary to consider both the pose of the palm plane (given by the coordinates
of the three labeled points on the palm plane) and the coordinates of all fingertips. This requires the identification of at least 8 markers. Initially, we adopted a simple and trivial solution based on different colors for different markers. Each marker had its unique color that was recognized and classified without ambiguities. However, the color detection process was complex, since it had to recognize 8 different colors in variable lighting conditions. The robustness of the process was increased by considering the geometric relationships between markers due to the chosen 16-DOF stick model. Specifically, the process considers the relationships between fingers, and between the whole set of fingers and the hand palm. In this way, the number of colors is further reduced. In particular, we adopt the following marker disposition (Fig. 6b):
- palm triangle identified by two specific colors:
  - one used to mark the base of the thumb (P1);
  - one used to mark P5, P17, and an auxiliary point Paux;
- thumb fingertip (P4) marked with a specific color;
- other fingertips identified by two specific colors:
  - one used to mark P8 and P16;
  - one used to mark P12 and P20.
The choice of two different colors to mark P8, P16, P12 and P20 and the usage of the auxiliary point Paux will be clarified below, when marker classification is discussed. It should be noted that the coordinates of the points P9 and P13 are calculated once the hand pose is known. These points belong to the same rigid body and their positions, with respect to the other points, are defined in the initial calibration, where the model is set up in shape and dimensions. This approach requires an additional marker, Paux.

3.2. Markers detection

The detection process is based on a simple color segmentation algorithm [59] that uses a measure of similarity to classify a pixel as belonging, or not, to a specific color range. The used color space is the normalized Red-Green-Blue (nRGB). The developed process analyzes the whole dynamic range to search for the four fixed target colors. Given a set of sample color points representing a target color, the mean value, μ ∈ R³, and the covariance matrix, C ∈ R³ˣ³, are computed. Let z ∈ R³ be an arbitrary point in the RGB space: it can be considered similar to μ if the distance between them is less than a specified threshold D0. The Euclidean distance between z and μ is

D(z, μ) = ||z − μ|| = [(z − μ)^T (z − μ)]^(1/2) = [(zR − aR)² + (zG − aG)² + (zB − aB)²]^(1/2)    (6)

where zR, zG, zB and aR, aG, aB are the components of the vectors z and μ, respectively. The set of points such that D(z, μ) ≤ D0 is a solid sphere of radius D0, as reported in Fig. 5b-I. Points inside the sphere satisfy the specified color criterion, the others are discarded. Coding these two sets of points in a black and white image, the segmented binary image is obtained. A useful generalization of Eq. (6) is the following:

D(z, μ) = [(z − μ)^T C⁻¹ (z − μ)]^(1/2)    (7)

In this case, the set D(z, μ) ≤ D0 describes a solid 3D elliptical body (Fig. 5b-II) with the property that its principal axis is oriented in the direction of maximum data spread. When C = I, Eq. (7) reduces to Eq. (6). Since the values provided by this distance are positive and monotonic, it is possible to work with the square of the distance (the Mahalanobis distance) [60], thus avoiding the square root computation.
Fig. 5. Three approaches for enclosing data regions for RGB vector segmentation.
However, if Eqs. (6) and (7) are calculated on the whole image, they can be computationally expensive even if the square root is not computed. A good compromise is to perform the calculation in a bounding box, as shown in Fig. 5b-III. The box is centered on μ, and its dimension along each color axis is chosen proportional to the standard deviation of the samples along that axis (typically two or three standard deviations are sufficient). Finally, given an arbitrary color point, segmentation is performed by determining whether or not it falls inside the box. Working with bounding boxes is much simpler than working with spherical or elliptical enclosures; however, the accuracy is lower. The approach described for the nRGB color space can easily be used with other color spaces [61]. A detailed review of color spaces is reported in [62].

3.3. Markers classification

Once markers have been detected, a procedure maps each joint or fingertip to the corresponding marker. The classification method depends on the marker positions. By using the disposition shown in Fig. 6a, the classification can be divided into two consecutive steps:
- classification of the markers on the back of the palm;
- classification of the fingertip markers.
The first step is accomplished by using two colors: the first one for the point P1 (hereinafter color1), the second for the points P5, P17 and Paux (hereinafter color2). The distances between them are all known from the hand calibration process. The following steps summarize the classification algorithm for the palm triangle:
Step 1: the marker colored with color1 is identified and classified as P1. Since this marker has an exclusive color, this task consists simply of a color detection, as reported in Section 3.2;
Step 2: the markers colored with color2 are detected (i.e. the markers on P5, P17 and Paux);
Step 3: the distances between the markers found in step 2 are computed;
Step 4: the two markers having in their own list a distance close to d5-aux (i.e. d5-aux ± threshold) are those on P5 and Paux. Between them, the marker with the greater distance from the marker P1 is classified as P5. The other is classified as Paux (see Fig. 6a);
Step 5: the remaining marker is classified as P17.
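The five steps above can be condensed into a few lines; the sketch below assumes the markers have already been detected and triangulated, and that the calibrated P5–Paux distance d5-aux and a tolerance are available. Helper and variable names are ours.

import numpy as np

def classify_palm_markers(p1_candidates, color2_markers, d5_aux, tol):
    """Sketch of the palm-triangle classification steps listed above."""
    P1 = np.asarray(p1_candidates[0])                       # Step 1
    pts = [np.asarray(p) for p in color2_markers]           # Step 2
    # Steps 3-4: find the pair whose mutual distance matches d5_aux
    for i in range(3):
        for j in range(i + 1, 3):
            if abs(np.linalg.norm(pts[i] - pts[j]) - d5_aux) <= tol:
                pair, rest = (pts[i], pts[j]), pts[3 - i - j]
                # the pair member farther from P1 is P5, the other Paux
                p5, paux = sorted(pair, key=lambda p: np.linalg.norm(p - P1), reverse=True)
                return {"P1": P1, "P5": p5, "Paux": paux, "P17": rest}   # Step 5
    raise ValueError("no marker pair matches the calibrated P5-Paux distance")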
The use of an auxiliary marker Paux is necessary because the distances between P1, P5 and P17 are very close to each other. For the fingertip classification process a different procedure is necessary. A differentiation between the thumb fingertip and the other fingertips is made. In particular, for the thumb, the classification procedure reduces to the detection of a color, because the marker P4 has an exclusive color. For the other fingers the procedure requires information about the geometry of the marker disposition. Let us consider a single color, color3, used to mark these fingertips. Once all palm markers have been classified, the position/orientation of the palm with respect to the global reference frame (for example, that of camera1) is known.
Fig. 6. Hand markers. Labeling of the palm triangle and distances (a). The whole hand labeling (b). To avoid close markers being fused and detected as a single marker, two colors are used to mark the fingertips of F_i, i = 2:5.
By defining a reference frame, as shown in Fig. 6b, where x is directed along the line defined by the points P5 and P17, y is directed perpendicularly to the palm plane and z lies on the palm plane, it is possible to make a transformation between the world reference frame (i.e. camera1) and this new reference frame that simplifies the classification process. In fact, once the markers colored with color3 are detected, and their centroid coordinates calculated, it is possible to sort them in ascending order with respect to the x coordinate, thus obtaining the sequence of the markers placed on F2, F3, F4 and F5. However, the given configuration has two drawbacks. The first is that adjacent markers can be confused and detected as a single marker. The second is that, if an occlusion occurs for a point, it is difficult to identify which marker is missing. The introduction of a second color to mark the fingertips of F_i, i = 2:5, alternately solves both problems. The new configuration is shown in Fig. 6b. By indicating with d_i-j the distance between two points Pi and Pj, with Δd a threshold defining a confidence interval (it can be chosen as the maximum of the distances d5-9, d9-13 and d13-17), and with x_i the x coordinate of the point Pi in the reference frame, the classification algorithm is shown in Algorithm 1. Note that the method is identical for the two fingertip pairs F2-F4 and F3-F5; moreover, knowledge of the palm pose is required.
3.4. Markers tracking

In general, once a suitable reference system has been fixed, and provided a marker is viewed by at least two different cameras, it is possible to perform its tracking with respect to this reference system (in our case, the reference system is located on one camera). The computation of the 3D coordinates of a point, starting from a correlated set of 2D coordinates from different views, is a hard task: inaccurate detection of the marker (and of its centroid), bad segmentation (an error of one pixel in the fingertip localization can correspond to several millimetres in the 3D reconstruction), and so on, are problems that can be solved if an accurate marker detection strategy is used. We observed that the search for a marker can be reduced to a small window thanks to the prediction of its position. The goal of temporal tracking is to facilitate the localization of a marker, to smooth the trajectories and to solve occlusions. Our approach is based on the Kalman Filter, supporting marker centroid localization and the determination of its speed. To simplify the explanation of the method, we consider only one marker. Furthermore, we assume that the movement of the hand is uniform and the speed constant, the frame interval ΔT being short. Under these assumptions, we define the state vector x_k ∈ R³ × R³ as follows:

x_k = (x(k), y(k), z(k), v_x(k), v_y(k), v_z(k))    (8)
Algorithm 1. Classification procedure.

n = number of markers with color3 (or color4)
X = the x coordinate of a measured marker
if (n = 2) then
  classify as P5 (or P9) the marker with X closest to the origin; the remaining marker is classified as P13 (or P17)
else {occlusion(s) occurred}
  if (X5 − Δd < X < X5 + Δd) then classify the marker as P5
  else if (X9 − Δd < X < X9 + Δd) then classify the marker as P9
  else if (X13 − Δd < X < X13 + Δd) then classify the marker as P13
  else if (X17 − Δd < X < X17 + Δd) then classify the marker as P17
  end if
end if
Obviously, once P5, P9, P13, and P17 have been classified, P8, P12, P16, and P20 are automatically classified as well.
where (x(k), y(k), z(k)) are the components representing the position, and (v_x(k), v_y(k), v_z(k)) are the components representing the speed of the marker at frame k. The state vector x_k and the measurement z_k, also called the observation vector, are correlated by a state-space relationship expressed by the following equations:

x_{k+1} = A x_k + w_k
z_k = H x_k + ν_k    (9)
where w_k and ν_k are the process and measurement noises, respectively, assumed to be independent Gaussian white noises [63]. A and H are the state transition matrix and the observation matrix, respectively, defined as

A = [ 1  0  0  ΔT 0  0
      0  1  0  0  ΔT 0
      0  0  1  0  0  ΔT    (10)
      0  0  0  1  0  0
      0  0  0  0  1  0
      0  0  0  0  0  1 ]
and

H = [ 1  0  0  0  0  0
      0  1  0  0  0  0    (11)
      0  0  1  0  0  0 ]

where ΔT = 1/FrameRate. By indicating with x_k and x̂_k the a-posteriori and a-priori state estimates, with P_k and P̂_k the a-posteriori and a-priori estimated error covariances, respectively, with Q the process noise covariance, with R the measurement noise covariance, and with K_k the Kalman Filter gain, the following equations are obtained.

Prediction equations:

x̂_k = A x_{k−1}
P̂_k = A P_{k−1} A^T + Q    (12)

Update equations:

K_k = P̂_k H^T (H P̂_k H^T + R)^(−1)
x_k = x̂_k + K_k (z_k − H x̂_k)    (13)
P_k = (I_{6×6} − K_k H) P̂_k

Fig. 7. Details of the applied Kalman Filter.
The three components are assumed to be independent, thus the covariance matrices are diagonal. Since we assume that the velocity is constant, which may not always be true, the process noise covariance is expected to have a large influence on the velocity components and a low influence on the position components. The noise covariance is calculated by measuring a sequence of camera shots in which the hand is kept steady. In our case, we obtained Var(x, y, z) = (0.31, 1.09, 5.18), showing that the measurement error is more significant on the z component than on the x and y components. This is due to the fact that the z component is reconstructed through triangulation: a small error in x or in y implies a greater error along z. Fig. 7 summarizes the filtering process. In particular, starting from the computation of the 3D position with a pair of views, the process predicts the 3D position corresponding to the next pair of frames to facilitate the whole tracking process. The predicted 3D position is projected onto the two frames to obtain a 2D prediction of the marker position. Then the search for a marker in each frame can be reduced to a neighborhood-based search around the predicted marker position.
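For completeness, a minimal numpy implementation of the per-marker filter defined by Eqs. (8)–(13): the constant-velocity transition matrix of Eq. (10), the observation matrix of Eq. (11) and the prediction/update steps of Eqs. (12) and (13). The class name and the way Q and R are passed in are our choices; the diagonal R could be filled with the measured variances (0.31, 1.09, 5.18) reported above.

import numpy as np

class MarkerKalman:
    """Constant-velocity Kalman filter for one marker (sketch of Eqs. (8)-(13)).

    State: (x, y, z, vx, vy, vz); measurement: the triangulated (x, y, z).
    """

    def __init__(self, x0, dt, q_pos, q_vel, r_diag):
        self.x = np.asarray(x0, float)                      # a-posteriori state
        self.P = np.eye(6)
        self.A = np.eye(6)
        self.A[:3, 3:] = dt * np.eye(3)                     # Eq. (10)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])   # Eq. (11)
        self.Q = np.diag([q_pos] * 3 + [q_vel] * 3)
        self.R = np.diag(r_diag)

    def predict(self):                                      # Eq. (12)
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:3]                                   # predicted 3D position

    def update(self, z):                                    # Eq. (13)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                                   # smoothed 3D position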
4. Experimental results

Before starting an extensive test, the system was preliminarily tested on a quadruple of synchronized freehand movement videos, collected for 10 s at 30 fps, of a healthy, right-handed male (age 34). The videos, collected by the four cameras, were stored in memory before their usage. The user performed three flexion/extension finger movements: with one finger, with two fingers and with all fingers. Data were processed by a multi-thread implementation (four instances of the same process executed separately). The adopted hardware (a Pentium D, 3.2 GHz, with 2 GB of RAM) was deliberately chosen to be of middle/low level to ensure, as far as possible, effective processing portability of the system and to verify the robustness of the conceived algorithms. The performance was heavily limited by the tracking process and it took about 30 s to analyze each quadruple of synchronized frames. This was mainly due to the hardware limitations and to the intrinsically limited performance of the Matlab/Simulink/SimMechanics ensemble. Subsets of 35 frames of the four synchronized sequences and the corresponding reconstructed numerical model are shown in Figs. 11, 12, 13, 14, and 15 of the supplementary material, respectively. The visual analysis was satisfactory with the exception of the thumb: it did not represent correctly the pose of the real thumb. This finger has a great freedom of movement and the 16-DOF model is unable to represent it in all poses. To perform an objective error measurement, an analysis was attempted by estimating the global tracking error. The global error is given by the contributions of the measuring error, the detection error and the classification error. The first kind of error was accurately studied and determined to be of the order of 1 mm [13]. On the basis of the performed experimental observations, the detection error is predominant. In fact, although the detection procedure is very simple and effective, the analysis of a detected blob is not always error free. In particular, a small error in the calculation of the blob centroid in one view can lead to a big triangulation error. The error in the centroid calculation is mainly due to partial or total occlusion of the marker. Obviously, a direct consequence of an error in the centroid calculation is a wrong marker positioning. However, occlusions never occurred in the considered preliminary experiment. The evaluation of the absolute error can be very difficult if we cannot use another measuring system for comparison. In this case, the simplest way to provide an error quantification is to analyze the reprojection error. The error can be calculated as the distance between the 2D projection of a 3D point representing a marker and the calculated centroid (after detection). The mean values of the reprojection errors, in pixel units, with the relative standard deviation values for each marker are reported in Table 1, both before (μ, s) and after (μKal, sKal) the application of the Kalman Filter. Numbers are absent for obscured points. The mean value and the standard deviation, regarding a single user, refer to the whole set of 300 quadruples of frames (10 s at 30 fps). The time required to process the whole sequence was about 2.5 h. As shown in Table 1, the reprojection error was reduced when the Kalman Filter was used. The filter supported the process and improved the accuracy by about 40%. In a few cases it did not provide any relevant improvement. The maximum error was for the thumb fingertip (11 pixels, corresponding to about 8 mm for a 40 × 30 cm² field of view). For the other markers the maximum error was about 6 pixels, corresponding to less than 4 mm. After this encouraging result, though the computation time was quite long, the system (always including the Kalman Filter) was used to process the free right hand movements of 10 healthy (right-handed) subjects, for 5 min, at 30 fps (5 min for the duration of the
test was perfectly compatible with the fact that we observed that the number of different movements of the hand, in a confined space, was limited and often repeated). The 10 subjects were 5 men and 5 women (mean age: 33 ± 6). This division was made to test the system with hands of different dimensions and frequency of movement. In fact, we found that the length of the hands (measured from the wrist to the end point of the middle finger) of the men's group ranged between 18.5 cm and 21.5 cm (mean length 19.5 cm) and that of the women's group ranged between 14.0 cm and 16.5 cm (mean length 15.0 cm). Moreover, in the analyzed subjects, women moved their hand faster than men
Table 1
Reprojection error (in pixels) of each marker for the four views, before (μ, s) and after (μKal, sKal) the application of the Kalman Filter. The symbol "–" indicates that the point was not viewed by the corresponding camera.

View     Stat   P1     P4     P5    P8    P12   P16   P17    P20
Camera1  μ      2.01   –      1.5   5.12  5.90  –     10.31  7.45
         s      0.32   –      0.29  0.52  0.59  –     3.38   1.36
         μKal   1.93   –      1.45  4.91  5.97  –     6.38   6.02
         sKal   0.23   –      0.29  0.38  0.55  –     3.38   1.06
Camera2  μ      –      9.90   –     4.99  5.68  3.90  8.45   7.10
         s      –      3.38   –     0.53  0.50  0.38  3.38   1.39
         μKal   –      8.21   –     9.01  4.96  2.86  8.21   6.56
         sKal   –      2.21   –     0.41  0.49  0.12  2.21   1.01
Camera3  μ      3.16   10.66  2.93  5.33  5.17  3.16  10.66  6.97
         s      0.468  4.37   0.20  0.56  0.51  0.33  4.37   2.53
         μKal   1.76   8.61   1.60  4.01  4.79  2.86  8.61   5.93
         sKal   0.18   2.31   0.19  0.44  0.45  0.31  2.31   1.69
Camera4  μ      –      10.16  –     5.99  5.69  4.56  –      7.79
         s      –      4.91   –     0.60  0.55  0.42  –      2.99
         μKal   –      8.35   –     5.01  4.73  3.63  –      6.99
         sKal   –      2.37   –     0.48  0.44  0.37  –      2.01
(in particular, the reciprocal movements of the fingers increased in number and speed). For each subject, four synchronized videos, each composed of 9000 frames, were recorded. The computation time for each exercise corresponded to about 75 h (each exercise was analyzed at the complete frame rate, though a smaller frame rate would be sufficient to represent the movements of the hand accurately): this made it necessary to use two PCs of the same type running in parallel to analyze the whole set of trials. In this way, the time required to elaborate the whole dataset was 12.5 days. Tables 2 and 3 show the reprojection error (min, max, mean, and standard deviation), in pixel units, of each point for each analyzed subject. For each point of each subject, a single dataset was created by including data belonging to different frames and to different cameras. For all subjects, the videos presented occlusions (this information is also contained in Tables 2 and 3, specified for each point belonging to a specific subject). In the men's group the maximum number of occlusions was 2382, the minimum was 1329, and the mean value was 1756, corresponding to about 20% of the collected quadruples of views. In the women's group, the maximum number of occlusions was 2623, the minimum was 1650, and the mean value was 1970, corresponding to about 22% of the collected quadruples of views. The increased number of occlusions in the women's group with respect to the men's group was probably due both to the reduced hand size (the finger joints were closer to each other) and to the faster mobility of their hands. We solved the problem of occlusions by using four different strategies: by performing an interpolation between the positions before and after the occlusion occurred; by using the prediction of the position calculated with the velocity provided by the Kalman Filter; by using the constraints provided by the numerical hand model; by using the information regarding the principal axis of the point provided by a video camera, in the case of a partial occlusion. It is important to note that these strategies were not mutually exclusive but, being different, were used to support one another. The occluded position was calculated as the mean position between those provided
Table 2
Reprojection error (in pixels) of each marker for the hands of the five analyzed men (min, max, μ, s), together with the number of occlusions (occl.).

Subj.  Stat   P1    P4     P5    P8    P12   P16   P17   P20
Man1   min    4.35  7.95   2.80  3.45  4.45  3.60  5.70  6.77
       max    5.10  10.10  3.36  4.13  5.90  4.33  6.67  7.39
       μ      3.33  8.34   3.10  3.88  5.12  4.05  6.28  7.10
       s      0.24  1.45   0.21  0.36  1.02  0.32  0.26  0.38
       occl.  187   456    217   399   276   265   201   381
Man2   min    2.35  6.15   2.83  3.73  3.50  4.02  6.77  7.08
       max    4.77  8.18   3.59  4.70  4.90  4.44  7.56  7.84
       μ      3.70  7.44   3.10  4.17  3.88  4.13  7.23  7.52
       s      1.10  1.17   0.34  0.24  0.44  0.15  0.38  0.22
       occl.  144   207    179   223   205   198   177   182
Man3   min    3.33  7.18   3.05  3.00  4.10  4.10  6.48  7.02
       max    4.29  9.05   4.20  4.17  5.27  4.91  7.35  8.88
       μ      3.78  8.30   3.49  3.22  4.88  4.36  6.98  8.22
       s      0.19  0.88   0.25  0.31  0.28  0.17  0.29  0.31
       occl.  234   321    127   276   225   236   241   302
Man4   min    4.05  7.37   3.88  3.82  4.22  3.97  7.52  5.90
       max    4.88  8.56   4.08  4.61  4.90  5.05  8.39  6.70
       μ      4.35  7.88   4.00  4.22  4.43  4.66  7.82  6.32
       s      0.21  0.24   0.23  0.24  0.23  0.31  0.36  0.39
       occl.  239   241    200   213   219   156   151   175
Man5   min    2.88  5.70   2.20  3.56  3.84  4.55  7.98  6.35
       max    4.19  7.92   3.45  4.00  4.77  4.98  9.87  7.23
       μ      3.56  6.83   2.95  3.76  4.21  4.44  8.65  6.89
       s      0.75  0.63   0.34  0.20  0.24  0.19  0.71  0.83
       occl.  111   214    117   231   152   126   121   257
Table 3. Reprojection error (in pixel units) of each marker related to the hands of the five analyzed women: min, max, mean (μ), standard deviation (s), and number of occlusions (occl.) for each point.

Subj.   Point   min     max     μ       s       occl.
Woman1  P1      3.14    3.95    3.55    0.15    167
        P4      8.62    10.15   9.20    0.63    275
        P5      5.10    6.01    5.56    0.43    133
        P8      5.75    6.16    5.75    0.24    317
        P12     5.86    6.44    6.22    0.33    165
        P16     5.10    6.21    5.77    0.29    134
        P17     7.77    8.85    8.24    0.45    176
        P20     5.91    7.17    6.56    –       386
Woman2  P1      2.80    4.08    3.25    0.72    224
        P4      7.44    8.76    8.12    0.34    212
        P5      5.34    6.18    5.88    0.25    199
        P8      5.71    6.50    6.21    0.24    250
        P12     6.02    7.05    6.52    0.25    166
        P16     5.14    6.31    5.78    0.32    196
        P17     8.24    9.22    8.89    0.37    160
        P20     7.06    8.33    7.41    0.22    143
Woman3  P1      2.95    3.30    3.10    0.64    269
        P4      6.70    7.88    7.36    0.33    345
        P5      4.44    5.66    5.27    0.26    223
        P8      5.20    5.95    5.55    0.24    264
        P12     7.11    7.88    7.51    0.26    237
        P16     5.05    6.35    5.87    0.33    147
        P17     8.10    9.93    9.45    0.29    176
        P20     6.31    7.45    7.12    0.30    331
Woman4  P1      3.22    3.74    3.46    0.26    225
        P4      7.52    8.43    7.99    0.23    442
        P5      4.96    5.55    5.18    0.26    314
        P8      5.35    6.10    5.59    0.28    345
        P12     6.22    7.54    6.62    0.33    329
        P16     5.45    6.33    5.86    0.26    317
        P17     8.66    9.42    9.03    0.22    231
        P20     7.55    8.28    7.88    0.24    420
Woman5  P1      3.06    3.80    3.38    0.17    246
        P4      8.37    9.23    8.22    0.65    236
        P5      4.69    5.28    4.97    0.30    199
        P8      5.86    6.32    6.08    0.24    327
        P12     5.98    6.87    6.55    0.22    226
        P16     5.05    6.29    5.66    0.22    168
        P17     7.60    8.79    8.29    0.29    187
        P20     6.35    7.33    6.88    0.27    242

The occluded position was calculated as the mean of the positions provided by the first three methods, ensuring that the calculated point lay on the principal axis supplied by the single camera, when available. It is important to note that a solution to the occlusions was always found, although the fourth method was not always usable (when complete occlusions occurred, it was simply ignored). However, data calculated by solving the occlusions were not included in the dataset: in these cases it was impossible to calculate the position error after reprojection (there was no real camera view against which the reprojected numerical model could be compared). Fig. 8 summarizes the same results for the two groups, men and women. The data demonstrated that, as in the preliminary test, errors remained confined within acceptable ranges, even though a very large dataset of positions was processed. Moreover, apart from a small increase, the error remained almost the same in the two datasets: this is a good result considering that the hand dimensions and the frequency of movement were quite different between the two groups. A specific comment is needed on the differences in occlusions and errors among the points of the same subject. Points P4, P8, and P20 showed more occlusions than the others. This was mainly due to the hand movements: the little finger (P20), when bent, was easily hidden by the rest of the hand, while the thumb (P4) and the index finger (P8), when moving, often obscured each other. Regarding the errors, larger values were generally found for P4 (the thumb), P17 (the little finger base or, equivalently, one of the palm triangle points), and P20 (the little finger). The thumb, as discussed for the preliminary test, has great freedom of movement, which makes it difficult to indicate its correct position. Regarding the little finger, the error increase is mainly due to the change, in practical situations, of the reciprocal distance between P17 and the other points of the palm triangle (P1 and P5).
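As an illustration of the occlusion-handling scheme described above, the sketch below (again in Python, with purely illustrative function arguments; the real system was implemented in Matlab/Simulink) averages the estimates produced by the first three strategies and, when a partial occlusion leaves one camera view available, constrains the result to lie on the principal axis supplied by that camera.

```python
import numpy as np

def resolve_occlusion(p_interp, p_kalman, p_model, principal_axis=None):
    """Estimate the 3D position of an occluded marker.

    p_interp       : position interpolated between the frames before and after the occlusion.
    p_kalman       : position predicted from the velocity provided by the Kalman filter.
    p_model        : position suggested by the constraints of the numerical hand model.
    principal_axis : optional (origin, direction) pair describing the principal axis
                     provided by the single camera that still sees the marker
                     (partial occlusions only; None for complete occlusions).
    """
    # Mean of the estimates provided by the first three strategies.
    p = np.mean(np.vstack([p_interp, p_kalman, p_model]), axis=0)
    if principal_axis is not None:
        # Partial occlusion: force the combined estimate to lie on the principal axis.
        origin, direction = (np.asarray(v, dtype=float) for v in principal_axis)
        d = direction / np.linalg.norm(direction)
        p = origin + np.dot(p - origin, d) * d
    return p
```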
Since P17 is not rigidly joined to the other palm points (P1 and P5, by contrast, lie on the same bone), its position can vary with respect to them during hand movements, producing a positioning error that is also transmitted to P20. Despite these limitations, the simplified hand model we adopted kept the maximum error within a range acceptable for providing qualitative and quantitative information on the position, amplitude, and frequency of hand and finger movements. Regarding the amplitude and frequency of movements, a final test consisted of evaluating the maximum decimation factor that could be applied without losing the information collected at 30 fps. To this end, we decimated the reconstructed moving models of all analyzed subjects at different ratios (2, 3, 5, 6, and 10) and re-calculated the amplitude and frequency of the model movements. The results showed that, even when the frame rate was reduced by a factor of 6, very little information was lost. This was because the natural fingertip movements (being extremities, the fingertips moved faster than the other points), inside the box where the hand rested on the forearm, never exceeded 3 cm/s. At 5 fps, this corresponded to a maximum displacement of 6 mm between consecutive frames, which matched the amplitude of the mean positioning error. This means that, to be conservative, a frame-rate reduction by a factor of 5 (6 fps instead of 30 fps) would be acceptable for our purposes and, at the same time, would reduce the computation time accordingly.
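The decimation test can be summarized in a few lines of code. The sketch below (with a synthetic sinusoidal fingertip trajectory standing in for the reconstructed model, and the 3 cm/s speed bound and roughly 6 mm mean positioning error quoted above) computes the largest displacement between consecutive retained samples for each decimation factor; as long as this displacement stays within the positioning error, the decimated sequence carries essentially the same information.

```python
import numpy as np

def max_step_after_decimation(trajectory, decimation):
    """Largest displacement between consecutive retained samples of a 3D trajectory
    after keeping one frame out of every `decimation` frames."""
    kept = trajectory[::decimation]
    return np.linalg.norm(np.diff(kept, axis=0), axis=1).max()

# Synthetic fingertip trajectory: 9000 frames at 30 fps, sinusoidal motion with a
# peak speed of about 30 mm/s (the 3 cm/s bound observed in the experiments).
t = np.arange(9000) / 30.0
trajectory = np.stack([20.0 * np.sin(2 * np.pi * 0.24 * t),
                       np.zeros_like(t), np.zeros_like(t)], axis=1)  # positions in mm

for d in (2, 3, 5, 6, 10):
    # Compare each value with the ~6 mm mean positioning error: as long as the
    # inter-sample displacement stays below it, the decimation loses little information.
    print(d, round(max_step_after_decimation(trajectory, d), 2))
```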
5. Conclusion
Different diseases involving the brain, as well as hand surgery, can have a direct impact on hand functionality. In these cases, rehabilitation is necessary. Classical rehabilitation is performed by specialists, and progress monitoring is subjective because data are not numerically comparable.
Fig. 8. Reprojection error, separated by Men (a) and Women (b), for each marker.
Numerical rehabilitation systems are based on mechanical haptic interfaces, but these often turn out to be too expensive (of the order of 20,000–30,000 USD), cumbersome, difficult to use, not re-usable and, hence, not usable with the healthy hand. The present paper provided an alternative approach based on the virtual glove. In particular, the overall virtual glove ensemble has been described. Moreover, experimental trials applied to the tracking of the hand, fingers, and finger joints, both in real-time and in off-line modalities, have been performed and the results reported. Since the off-line modality is critical, having to calculate a series of important parameters regarding hand and finger movements, it was studied more extensively. For the off-line modality, a hand model was proposed and implemented. Our efforts were focused on marker positioning and coloring and on the definition of recognition, classification, and tracking algorithms. An important aspect of the hand-tracking process is the detection, classification, and evaluation of the 3D position of a set of markers placed on specific landmarks; the marker arrangement was in fact crucial for the classification problem. A solution with nine markers and five colors was chosen to track a 16 DOFs numerical hand model. A solution to the occlusion problem was also described. We also provided experimental measurements and an error evaluation strategy and, although a limited-DOFs model was used, promising results were obtained. Errors were mainly due to the detection process, where an error of one pixel led to a large triangulation error. This was reduced by smoothing the trajectory with the 3D Kalman filter; in this case, the reprojection error was reduced to a few millimetres. The system, supported by a numerical model, was able to represent the movements of the hand without using wires, sensors, or mechanical supports. It was wholly based on image processing methods applied to the images collected by four video cameras. For this reason, it can easily be re-used, or used on the healthy hand of the same patient to collect patient-specific hand mobility information. Moreover, it has the great advantage of being a very low-cost system: we estimated a final cost of about 500 USD, about two orders of magnitude lower than mechanical gloves. The limitations of the proposed system are: the long computation time necessary to process each dataset; the use of a reduced-DOFs hand model, which prevents the representation of all possible movements of the hand and fingers; the presence of occlusions; and the use of markers on the hand. The computation time could be reduced by compiling the software, implemented in Matlab, Simulink, and its extension SimMechanics. A further reduction could be obtained by lowering the frame rate (we noted that a reduction by a factor of 5 could be applied without losing precision, which would imply a corresponding factor of 5 reduction in computation time). Another possibility for reducing the computation time would be to adopt low-cost 3D depth-sensing cameras, such as time-of-flight (TOF) cameras [64], instead of web cameras: these should allow a very fast evaluation of the 3D spatial position of the recognized markers, although problems of mutual interference due to the use of infra-red rays would have to be addressed and solved. The second limitation, the use of a 16 DOFs hand model, could be overcome by using a complete, 26 DOFs, hand model but, in this case, other joints would have to be colored and tracked. This would further increase both the computation time and the occlusions to be handled. The occlusions could be reduced by concentrating the cameras in positions from which the back of the hand is better visible. Another strategy could be to install rigid protruding markers either directly on the fingertips or embedded in the grasping objects in correspondence with the fingertip positions. The marker-based localization strategy allowed us to obtain a spatial precision of the order of a few millimetres. We made some attempts at marker-less tracking strategies, but the localization error was very high (greater than 1.5 cm); considerable work remains to identify a suitable marker-less strategy. Besides addressing the previous limitations, we are working on other aspects that can contribute to improving the system. One of these is the implementation of the force-calculation module. As mentioned, we implemented the model with SimMechanics, which implicitly provides force calculation. In this case, a set of calibrated elastic objects (such as pegboards, fingerboards, elastic rings, and so on), and their respective SimMechanics models, have to be realized to use this facility. Moreover, we are currently working on an accurate evaluation of the positioning error. We aim to use a magnetic positioning system (the MicroBird tracker [65]) to better estimate the precision of the obtained computations, especially those regarding the solved occlusions.
Once the primary issues related to operation and performance are solved, some extensions could be made: the implementation of a client-server architecture in which the system and its data can be used and managed remotely, and the extension of the whole system to other body districts (e.g. the legs or the whole body).
Acknowledgements The authors would like to thank the Carispaq Foundation (L'Aquila, Italy) for funding this project.
Appendix A. Supplementary material Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.compbiomed.2013.08.026.
References [1] J.Y. Bouguet, Camera calibration toolbox for matlab, Website, 2011. URL 〈http:// www.vision.caltech.edu/bouguetj/〉. [2] L.E. Kahn, M. Averbuch, W.Z. Rymer, D.J. Reinkensmeyer, Comparison of robotassisted reaching to free reaching in promoting recovery from chronic stroke, in: In Integration of Assistive Technology in the Information Age, Proceedings 7th International Conference on Rehabilitation Robotics, IOS Press, 2001, pp. 39–44. [3] L.E. Kahn, P.S. Lum, W.Z. Rymer, D.J. Reinkensmeyer, Robot-assisted movement training for the stroke-impaired arm: does it matter what the robot does? J. Rehabil. Res. Dev. 43 (5) (2006) 619–630. [4] C.G. Burgar, P.S. Lum, P.C. Shor, H.F.M.V. der Loos, Development of robots for rehabilitation therapy: the Palo Alto, VA/Stanford experience, J. Rehabil. Res. Dev. 37 (6) (2000) 663–673. [5] A.S. Merians, E. Tunik, S.V. Adamovich, Virtual reality to maximize function for hand and arm rehabilitation: exploration of neural mechanisms, Stud. Health Technol. Inf. 145 (2009) 109–125, URL 〈http://www.ncbi.nlm.nih.gov/pubmed/ 19592790〉. [6] A.S. Merians, J. David, B. Rares, M. Tremaine, G.C. Burdea, S.V. Adamovich, M. Recce, H. Poizner, Virtual reality-augmented rehabilitation for patients following stroke, J. Am. Phys. Ther. Assoc. 82 (9) (2002) 898–915. [7] C. Systems, Cyber grasp glove, website, 2011. URL 〈http://www.cyberglovesys tems.com/products/cybergrasp/〉. [8] M. Bouzit, Design, Implementation and Testing of a Data Glove with Force Feedback for Virtual and Real Objects Telemanipulation, Ph.D. Thesis, University of Pierre Et Marie Curie, Paris, France, 1996. [9] X. Luo, T. Kline, H. Fischer, K. Stubblefield, R. Kenyon, D. Kamper, Integration of augmented reality and assistive devices for post-stroke hand opening rehabilitation, in: Proceedings of the 27th Annual International Conference of the Engineering in Medicine and Biology Society, vol. 7, IEEE-EMBS '05, 2005, pp. 6855–6858. [10] L. Connelly, Y. Jia, M.L. Toro, M.E. Stoykov, R.V. Kenyon, D.G. Kamper, A pneumatic glove and immersive virtual reality environment for hand rehabilitative training after stroke, IEEE Trans. Neural Syst. Rehabil. Eng. 18 (5) (2010) 551–559, URL 〈http://www.ncbi.nlm.nih.gov/pubmed/20378482〉. [11] L. Connelly, M.E. Stoykov, Y. Jia, M.L. Toro, R.V. Kenyon, D.G. Kamper, Use of a pneumatic glove for hand rehabilitation following stroke, in: Conference Proceedings of the International Conference of IEEE Engineering in Medicine and Biology Society, 2009, pp. 2434–2437. URL 〈http://www.ncbi.nlm.nih.gov/ pubmed/19965204〉. [12] G. Placidi, A smart virtual glove for the hand telerehabilitation, Comput. Biol. Med. 37 (2007) 1100–1107. [13] D. Franchi, A. Maurizi, G. Placidi, A numerical hand model for a virtual glove rehabilitation system, in: Proceedings of the 2009 IEEE International Workshop on Medical Measurements and Applications, MEMEA '09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 41–44. [14] D. Franchi, A. Maurizi, G. Placidi, Characterization of a simmechanics model for a virtual glove rehabilitation system, in: CompIMAGE, 2010, pp. 141–150. [15] A. Aristidou, J. Lasenby, Motion capture with constrained inverse kinematics for real-time hand tracking, in: Proceedings of the 4th International Symposium on Communications, Control and Signal Processing, IEEE-ISCCSP '10, 2010, pp. 1–5. [16] A. Chaudhary, J.L. Raheja, K. Das, S. Raheja, Intelligent approaches to interact with machines using hand gesture recognition in natural way: a survey, Int. J. Comput. Sci. 
Eng. Survey (IJCSES) 2 (1) (2011) 122–133. [17] Z. Rusak, C. Antonya, I. Horvath, Methodology for controlling contact forces in interactive grasping simulation, Int. J. Virtual Reality 10 (2) (2011) 1–10. [18] M. de La Gorce, N. Paragios, A variational approach to monocular hand-pose estimation, Comput. Vision Image Understand. 114 (3) (2010) 363–372. [19] M. de la Gorce, N. Paragios, A variational approach to monocular hand-pose estimation, Comput. Vision Image Understand. 114 (3) (2010) 363–372. [20] M. de la Gorce, N. Paragios, Monocular hand pose estimation using variable metric gradient-descent, in: BMVC06, 2006, p. III:1269. [21] A. Vacavant, T. Chateau, Realtime head and hands tracking by monocular vision, in: ICIP05, 2005, pp. II: 1302–1305. [22] U. Hahne, M. Alexa, Depth imaging by combining time-of-flight and ondemand stereo, in: A. Kolb, R. Koch (Eds.), Dynamic 3D Imaging, Lecture Notes in Computer Science, vol. 5742, Springer, Berlin/Heidelberg, 2009, pp. 70–83. [23] B.D.R. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Model-based hand tracking using a hierarchical Bayesian filter, Pattern Anal. Mach. Intell. 28 (9) (2006) 1372–1384. [24] B.D.R. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Estimating 3d hand pose using hierarchical multi-label classification, Image Vision Comput. 25 (12) (2007) 1885–1894. [25] B.D.R. Stenger, P.R.S. Mendonca, R. Cipolla, Model-based 3d tracking of an articulated hand, in: CVPR01, 2001, pp. II:310–315.
[26] E. Dente, A.A. Bharath, J. Ng, A. Vrij, S. Mann, A. Bull, Tracking hand and finger movements for behaviour analysis, Pattern Recognition Lett. 27 (15) (2006) 1797–1808. [27] C.S. Chua, H.Y. Guan, Y.K. Ho, Model-based finger posture estimation, in: IEEE Asian Computer Vision Conference, 2000, pp. 43–48. [28] C.S. Chua, H. Guan, Y.K. Ho, Model-based 3d hand posture estimation from a single 2d image, Image Vision Comput. 20 (3) (2002) 191–202. [29] E. Holden, Visual Recognition of Hand Motion, Ph.D. Thesis, Department of Computer Science, University of Western Australia, 1997. [30] P. Kaimakis, J. Lasenby, Gradient-based hand tracking using silhouette data, in: ISVC07, 2007, pp. I: 24–35. [31] P. Kaimakis, J. Lasenby, Physiological modelling for improved reliability in silhouette-driven gradient-based hand tracking, in: CVPR4HB09, 2009, pp. 19–26. [32] S.R. Deans, The Radon Transform and Some of its Applications, John Wiley & Sons, New York, 1983. [33] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, A review on visionbased full dof hand motion estimation, in: Proceedings of the CVPR Workshops Computer Vision and Pattern Recognition - Workshops IEEE Computer Society Conference, 2005. [34] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, Vision-based hand pose estimation: a review, Comput. Vis. Image Underst. 108 (2007) 52–73. [35] S.K. International, OptrimaTM 130, Website, 2012. URL 〈http://www.softkinetic. com/Solutions/〉. [36] R. Lange, P. Seitz, Solid-state time-of-flight range camera, IEEE J. Quantum Electron. 37 (3) (2001) 390–397. [37] V.G. Popescu, G.C. Burdea, M. Bouzit, V.R. Hentz, A virtual-reality-based telerehabilitation system with force feedback, IEEE Trans. Inf. Technol. Biomed. 4 (1) (2000) 45–51. [38] V.G. Popescu, G.C. Burdea, M. Bouzit, Virtual reality simulation modeling for a haptic glove, in: Proceedings of the Computer Animation, CA '99, IEEE Computer Society, Washington, DC, USA, 1999, pp. 195–200. [39] G. Riva, Virtual reality in neuroscience: a survey, Stud. Health. Technol. Inform. 58 (1998) 191–199. [40] K.S. Ong, Y. Shen, J. Zhang, A.Y.C. Nee, Augmented reality in assistive technology and rehabilitation engineering, in: B. Furht (Ed.), Handbook of Augmented Reality, Springer, New York, 2011, pp. 603–630. [41] C.C. Cheng, C.T. Li, L.G. Chen, A novel 2d-to-3d conversion system using edge information, IEEE Trans. Consum. Electron. 56 (3) (2010) 1739–1745. [42] MathWorks, Simmechanics: Model and simulate mechanical systems, website, 2011. URL 〈http://www.mathworks.it/products/simmechanics/index.html〉. [43] R. Tubiana, The Hand, vol. I, W.B. Saunders Company, 1981. [44] R. Tubiana, J. Thomine, E. Mackin, Examination of the Hand and Wrist, 2nd edition, illustrated, reprint Edition, Taylor & Francis, 1998. [45] A. Vardy, Articulated Human Hand Model with Inter-Joint Dependency Constraints, Technical Report Computer Science 6752, 1998. [46] Y. Yasumuro, Q. Chen, K. Chihara, 3D modeling of human hand with motion constraints, in: NRC '97: Proceedings of the International Conference on Recent Advances in 3-D Digital Imaging and Modeling, IEEE Computer Society, Washington, DC, USA, 1997, p. 275. [47] I. Albrecht, J. Haber, H.P. Seidel, Construction and animation of anatomically based human hand models, in: SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, 2003, pp. 98–109. [48] Y. Wu, T.S. 
Huang, Human hand modeling, analysis and animation in the context of HCI, in: IEEE Signal Processing Magazine, Special issue on Immersive Interactive Technology, 1999, pp. 6–10. [49] J.J. Kuch, T.S. Huang, Human computer interaction via the human hand: a hand model, in: Proceedings of the Conference Signals, Systems and Computers Record of the Twenty-Eighth Asilomar Conference, vol. 2, 1994, pp. 1252–1256. [50] W.B. Griffin, R.P. Findley, M.L. Turner, M.R. Cutkosky, Calibration and mapping of a human hand for dexterous telemanipulation, in: ASME IMECE 2000 Conference Haptic Interfaces for Virtual Environments and Teleoperator Systems Symposium, 2000. [51] J.M. Rehg, T. Kanade, Digiteyes: vision-based hand tracking for humancomputer interaction, in: Proceedings of the IEEE Workshop Motion of NonRigid and Articulated Objects, 1994, pp. 16–22. [52] J. Lee, T.L. Kunii, Model-based analysis of hand posture, IEEE Comput. Graph. Appl. 15 (5) (1995) 77–86. [53] J.F. Bonnans, J.C. Gilbert, C. Lemarechal, C.A. Sagastizabal, Numerical Optimization, Theoretical and Numerical Aspects, Springer, 2006. [54] E. Cuevas, D. Zaldivar, R. Rojas, Kalman filter for vision tracking, Measurement 33 (August) (2005) 1–18. [55] G.L. Mariottini, D. Prattichizzo, EGT: a toolbox for multiple view geometry and visual servoing, IEEE Robotics and Automation Magazine 3 (12) (2005) 26–39. [56] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edition, Cambridge University Press, 2004. [57] A. Aristidou, J. Cameron, J. Lasenby, Predicting missing markers to drive realtime centre of rotation estimation, in: Proceedings of the 5th International Conference on Articulated Motion and Deformable Objects, AMDO '08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 238–247. [58] M. Tarokh, K. Mikyung, Inverse kinematics of 7-dof robots and limbs by decomposition and approximation, IEEE Trans. Robotics 23 (3) (2010) 595–600. [59] W. Skarbek, A. Koschan, Colour image segmentation – a survey, 1994. [60] P.C. Mahalanobis, On the generalised distance in statistics, in: Proceedings National Institute of Science, India, vol. 2, 1936, pp. 49–55.
[61] R.C. Gonzalez, R.E. Woods, Digital Image Processing, 3rd ed, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006. [62] D. Pascale, A Review of rgb Color Spaces...from xyz to r’g’b’, The BabelColor Company, 2003. [63] C. Liu, R. Szeliski, S.B. Kang, C.L. Zitnick, T. Freeman, Automatic estimation and removal of noise from a single image, IEEE Trans. Patt. Anal. Mach. Intell. 30 (2) (2008) 299–314.
[64] S.B. Gokturk, H. Yalcin, C. Bamji, A time-of-flight depth sensor – system description, issues and solutions, in: Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04), vol. 3, IEEE Computer Society, Washington, DC, USA, 2004, p. 35. [65] MicroBird, Ascension Technology Corporation, P.O. Box 527, Burlington, VT 05402, USA, 2005.