Subunit sign modeling framework for continuous sign language recognition
Elakkiya R. a,∗, Selvamani K. b
a School of Computing, SASTRA University, Thanjavur, India
b DCSE, Anna University, Chennai, India
Article history: Received 28 July 2018; Revised 12 February 2019; Accepted 12 February 2019
Keywords: Sign language recognition; Subunit sign modeling; Continuous sign recognition; Bayesian parallel hidden Markov model
Abstract
A new framework named three subunit sign modeling (3-SU) is introduced for automatic sign language recognition. It works on continuous video sequences consisting of isolated words and signed sentences under different signer variations and illumination conditions. Three major issues of automatic sign language recognition are addressed, namely: (i) the importance of discriminative feature extraction and selection; (ii) handling epenthesis movements and segmentation ambiguities; and (iii) automatic recognition of large vocabulary sign sentences and signer adaptation within a single subunit sign modeling framework. The proposed work has been evaluated subjectively and quantitatively with real-time signing videos gathered from different corpora and different sign languages. The experimental results show that the proposed subunit sign modeling framework remains scalable as the sign vocabulary grows. Further, the approach is reliable and efficient enough to adapt to real-time constraints and signer independence.
1. Introduction

Sign language (SL) serves as a communication medium for the deaf and hard-of-hearing community. SL often bears considerable resemblance to the corresponding spoken language, but it has its own structure and grammar and varies with the fluency of the signer. SL should not be regarded as body language, i.e., as another form of nonlinguistic communication, because general linguistics treats both signed and spoken languages as types of natural language. Rather, SL communication comprises the visual conveyance of meaning instead of spoken words. This communication involves the simultaneous combination of manual and nonmanual means of expression. Manual parameters include hand shape, hand position, hand orientation, hand trajectories and arm movements, while nonmanual parameters include facial expressions, head and body postures, mouth shapes and gaze directions. Together, these expressions convey the signer's intended meaning and information as a visual projection for deaf and hard-of-hearing people. SL recognition (SLR) poses a particular challenge due to the visual analysis of hand gestures and the multimodal nature of sign gestures. SLs are complex in structure, with their own syntax, grammar, phonology and morphology, and they differ in structure from spoken language. Generally, spoken language uses words in sequential order, whereas SL uses several body movements in parallel. Therefore, the linguistic behavior of SLs also varies due to the existence of several components, including head movements and facial expressions along with hand movements.
Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. A. P. Pandian.
Corresponding author. E-mail address: [email protected] (E. R.).
https://doi.org/10.1016/j.compeleceng.2019.02.012
The proposed research work introduces a novel subunit sign modeling framework, known as 3-SU, to recognize large vocabulary multimodal signs from continuous video sequences drawn from different SLs. The proposed approach involves spatial and temporal modeling of subunits using the Bayesian parallel hidden Markov model (BPaHMM) [6] and constructs the spatiotemporal sequentiality of subunits in an unsupervised manner, without any prior linguistic information. The sequentiality of spatial and temporal subunits is constructed based on the feature extraction and discriminative feature selection strategy discussed in [5], in which the proposed subunit modeling constructs the sign lexicon for one spatial subunit (hand shape) and two temporal subunits (velocity and position). Hence, the constructed sign lexicon is named 3-SU (three subunits) and the proposed approach is called the 3-SU subunit sign modeling framework. A Bayesian network is employed to manage the classification of spatial and temporal subunits. After that, an HMM is adapted to recombine these subunits, where the first state corresponds to spatial subunits and the second state corresponds to temporal subunits. The obtained cues are then clustered, based on the behavior of the spatial and temporal subunits, using minimum entropy clustering (MEC) and dynamic time warping (DTW) techniques, and the sign lexicon is formed following a data-driven approach. After obtaining the subunits, based on the constructed sign lexicon and its corresponding clusters, each subunit is trained using a subunit multi-stream parallel HMM (SMP-HMM) based on the Bakis topology. The Viterbi algorithm is adopted to train these PaHMMs and to find the maximum probability node among all the subunits, i.e., the state of the existing subunit from the gesture base. The Baum–Welch algorithm is employed to recalculate the subunit parameters, and recognition begins with the emission probability of each HMM node.

The rest of the article is organized as follows. Section 2 describes the related work. Section 3 illustrates the overall proposed subunit sign modeling framework and describes its different components. Section 4 gives a detailed evaluation of the experimental results in different dimensions, and Section 5 summarizes the proposed framework.

2. Related work

Although many researchers have focused on using linguistic descriptions of signs and constructing a standard dictionary for writing signs, none of these systems has produced effective results, and the practical implementation of such a linguistic-oriented approach is highly infeasible. To overcome this infeasibility, another, vision-oriented method was adopted: subunits were extracted [7,9] for SLR using the data glove approach. This method relies on HMMs in such a way that there is one HMM for each subunit. A dynamic Bayesian network was employed [17] to model the systematic variations of different signs as parallel cues with independent features, with these cues later combined as subunits. A mapping between phonemic [19] and kinematic visual motions was derived [4] to recognize isolated movement phonemes. A clustering algorithm [1] was applied for self-organizing subunits using spatial features of individual frames (SU-F). This approach ignores the temporal feature, which is one of the essential features for recognizing sign gestures.
Subunits were derived [8,13] from hand actions in time and space, based on hand motion boundaries, and the obtained subunits were correlated with linguistic syllables. However, this approach required more training samples, irrespective of gesture size. To overcome this, a boosted subunit framework [9] was proposed for recognizing isolated signs. Nevertheless, this framework left the problem unsolved for a large vocabulary set. The concept of dynamic and static subunits (2-S-U) [21,22] for deriving subunits achieved an accuracy of around 95% in signer-dependent recognition and 63% in signer-independent recognition. However, this approach required more computation time due to the late integration of hand shape subunits. A subunit sign model [12] was based on subunit and ImageNet hand motion (SU-IMH). However, pretraining for the motion of the hands was required for this approach due to the lack of an information model in convolutional neural networks (CNNs). Linguistically extracted subunits [3] were used without any separation between spatial and temporal features (SU-noST), with the results of three-dimensional tracking information compared using boosted sequential trees for recognizing British SL. The concept of movement using 3D data in HMMs was introduced to recognize continuous signs, with similar phoneme-based concepts later used [16] to remove epenthesis movements. More recently, a scalable approach [15] was presented that simply ignores transition movements, with the corresponding system giving an accuracy of around 87%. However, this approach deals only with a small vocabulary, and more time was required to find the matching pattern. An extended approach [24], designed to overcome the problem of distance computation [23] at each level of the level-building algorithm, made use of a fast HMM. However, this approach used a Kinect sensor for data acquisition. It also did not incorporate the hand shape feature, which is the most essential feature in classifying sign gestures, and more running time was required due to the level-building approach. More recently, an SLR system supporting signer adaptation based on a novel active appearance model (AAM) [11] was developed. The authors evaluated their approach using two datasets: (1) the SIGNUM laboratory-restricted dataset and (2) the RWTH-PHOENIX-Weather unconstrained real-life dataset. Even though their approach showed a 16.4% error rate in the laboratory setup, it showed a 53% error rate in the unrestricted environment, which would have highly affected the recognition performance of SLR. DWT and one-dimensional signals [18] were used for modeling an SLR system. This approach showed an accuracy of over 88% on non-fist-based gestures and over 58% on fist-based gestures. A lexicon-free SLR approach [10] was presented using a segmental conditional random field (SCRF) and a deep neural network (DNN).
Fig. 1. Proposed 3-SU framework and its components for automatic SLR.
However, this approach produced 92% accuracy in the signer-dependent mode and a reduced accuracy of 83% for multi-signer recognition. A Kinect sensor [20] and a leap motion controller (LMC) [14] were used for extracting multimodal features, producing an accuracy of 94.27% for Arabic SL sign words. However, this approach included many features and did not select robust features to reduce the time complexity.

The proposed 3-SU framework contributes to the following four aspects: (1) a data-driven approach for the sequential and parallel breakdown of signs into subunits without any prior knowledge of the gestures; (2) a Bayesian parallel hidden Markov model to combine manual and non-manual subunit features to overcome the problem of movement ambiguities; (3) an intra-gloss sign lexicon to construct the gesture base, in which similar features of different signs are shared and stored; (4) regardless of signer dependence, the work manages various signers under different illuminations in real-world SLR.

3. Subunit modeling

The proposed subunit modeling consists of four components: spatial/temporal (S/T) subunit modeling, construction of S/T subunits, sign lexicon construction, and subunit (SU) training and recognition. All four components play a significant role in enabling subunit modeling to achieve better performance in SLR. The first major contribution of this research work is S/T subunit modeling. Among the essential features of SL are velocity cues; with the help of these velocity features, signs can be recognized with ease even in very difficult situations. The next contribution of this research is the construction of S/T subunits. This is achieved by extracting all the cues, such as hand shape, convexity hull, convexity defects and hand trajectory information. The first three cues are spatial, whereas trajectory cues such as position, orientation and velocity are temporal features. Finally, three SU-level sign lexica for hand shape, velocity and position are constructed based on feature selection strategies, given that these three cues play a vital role in yielding a better recognition rate for signs. 3-SU is constructed for hand shape, velocity and position. Once the hand shape subunit is constructed, all the subunits are recombined to form one corresponding sign lexicon. Similarly, for position and velocity, after the cues are obtained, corresponding lexica are formed without any manual intervention. The final contribution of this research is the training and recognition of the obtained SUs. The proposed work employs an SMP-HMM to integrate and train the hand shape (HS), position (P) and velocity (V) cues. The trained SU models and lexica are obtained separately for HS, P and V. In recognition, the proposed system finds all S/T SUs by identifying the most probable SU per segment among the other S/T SUs. This results in a sequence of S/T SUs needed to recognize continuous signs. The proposed SU sign modeling framework is shown in Fig. 1.

3.1. Spatial/temporal subunit modeling

The major concern in SU sign modeling is segmenting signs into sub-sign segments and classifying such segments as S/T SUs. This classification is based purely on hand movements and postures. The spatial SUs are considered as static postures, which are not involved in any transitions, whereas the temporal subunits are those with transitions.
Fig. 2. Illustration of BPaHMM for S/T SU segmentation.
The spatial subunits are constructed from hand shape, and the temporal subunits are constructed from the position and velocity of the hands during transitions. A Bayesian network is employed to address the classification of spatial and temporal subunits. After that, an HMM is adapted to recombine these subunits, where the first state corresponds to spatial subunits and the second state corresponds to temporal subunits. After obtaining the subunits, the next step is to train those subunits using the Baum–Welch algorithm, while the Viterbi algorithm is used to find the maximum probability, i.e., the most probable state of the existing subunit from the gesture base. Fig. 2 shows the structure of the BPaHMM used to classify the S/T SUs. In this figure: S represents sign gestures, i.e., observations of input sequences; hSt represents the hidden state of spatial cues with respect to observation time t; hTt represents the hidden state of temporal cues with respect to observation time t; and S/T1…n represents the output sequence of S/T SUs after segmentation. Based on the subjective logic of the Bayesian model, the joint probability distribution is denoted as:
P(ω^O_{S‖T}) = P(ω^O_{S|T}) h(T) + P(ω^O_{S|¬T}) h(¬T)    (1)

where O denotes the observation sign sequence, S denotes the spatial manual features and T represents the temporal subunit manual features. The pair (ω^O_{S|T}, ω^O_{S|¬T}) denotes the source S based on binomial conditional opinions, and h(·) denotes the prior probability of its argument (e.g., h(S) denotes the prior probability of the spatial feature S). The pair (ω^Ō_{S|T}, ω^Ō_{S|¬T}) denotes the inverted conditional opinions, and ω^O_{S|T} denotes the conditional opinion that generalizes the probabilistic condition P(S|T).

3.2. Construction of spatial/temporal subunits

S/T SUs are clustered after segmentation and modeled with the multi-stream parallel HMM. Spatial subunits are constructed from the hand shape extracted using Fourier descriptors. Once all the SUs are obtained, all the frames are clustered using MEC, where each cluster corresponds to a different SU. As a result, similar hand shapes of different signs are clustered together. This allows the gesture base to be shared by two or more similar SUs. Temporal subunits are constructed considering the transition movements. Based on the velocity of the transition, the subunit is considered as either a hand velocity or a hand position cue. By calculating the average frequency of frames, i.e., 30 keyframes per second, a frame with a large transition movement is treated as a velocity cue, whereas frames with little transition movement are treated as position SUs. The hand position information is obtained from the trajectories based on Euclidean distance. The first hand position is computed from the starting point of the hand in the first frame and the centroid of the hand trajectories. Similarly, the second hand position is calculated from the distance between the starting point and the current point of the hand coordinates in consecutive frames. Once the hand shape, hand position and hand velocity SUs are obtained, all these SUs are clustered. Since position and velocity are based on temporal information, MEC alone is not enough to cluster the temporal SUs; it is necessary to incorporate the temporal data using the DTW technique. Finally, all these distances are clustered to obtain the stopping criterion for the number of temporal SU clusters. Due to temporal clustering, each SU in the gesture base refers to both velocity and position cues, whereas the hand shape cue is concatenated during training as a separate SU. Before clustering, the direction and scale of the feature vectors need to be normalized with respect to the initial position to obtain normalized segments and incorporate model invariance. Without normalization, modeling the trajectories with respect to position reduces the model invariance; similarly, the absence of scale (amplitude) normalization also affects the model invariance. Therefore, for the effective segmentation and modeling of subunits, 3-SU modeling is performed along with the normalization and scaling of feature vectors with respect to the initial position. Fig. 3a presents the subunit trajectories with only initial position normalization and no scale normalization, while Fig. 3b presents the trajectories with both scale and initial position normalization.
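To make the normalization and temporal clustering steps above concrete, the following minimal NumPy-based sketch normalizes a hand trajectory with respect to its initial position and scale (as illustrated in Fig. 3) and computes a plain DTW distance that could serve as the pairwise distance when clustering temporal SUs. The helper names and example coordinates are illustrative assumptions, not the paper's implementation.

import numpy as np

def normalize_trajectory(traj):
    """Translate a (T, 2) trajectory so its first point is the origin and
    scale it so the largest excursion is 1 (initial position + scale normalization)."""
    traj = np.asarray(traj, dtype=float)
    traj = traj - traj[0]
    scale = np.max(np.linalg.norm(traj, axis=1))
    return traj / scale if scale > 0 else traj

def dtw_distance(a, b):
    """Dynamic time warping distance between two normalized trajectories."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Example: compare two velocity-cue trajectories of different lengths after normalization.
t1 = normalize_trajectory([[10, 10], [12, 14], [15, 19], [18, 25]])
t2 = normalize_trajectory([[50, 40], [53, 46], [57, 53], [60, 61], [62, 66]])
print(f"DTW distance between normalized trajectories: {dtw_distance(t1, t2):.3f}")

The resulting pairwise DTW distances can then be fed to the clustering step that fixes the number of temporal SU clusters.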
Fig. 3. Mapping of trajectories into the two-dimensional signing space. (a) After initial position normalization and (b) after initial position and scale normalization.
Fig. 4. Lexicon construction of the sign for ALONE (ASLLVD).
3.3. Sign lexicon construction

The sign lexicon is constructed without any prior knowledge of phonetic transcriptions, annotations or manual labelling. A data-driven approach is followed to construct the sign lexicon by combining hand shape with position and velocity features. These are obtained via S/T SU construction, after which the constructed lexicon is used for SU training and recognition. After segmenting and clustering the S/T SUs, they are united to obtain the lexicon. In this way, the sign lexicon consists of an entry in the gesture base for each SU. Each SU in the gesture base carries a spatial (S) or temporal (T) SU label along with the assigned cluster ID. For spatial SUs, the label is S along with the cluster ID; for temporal SUs, the label is TV for velocity and TP for position along with the corresponding cluster ID. Hence, this lexicon is named 3-SU, and it is constructed without any explicit or implicit modeling of transition movements and positions. The dominant hand for processing the lexicon is the right hand, whereas the non-dominant hand is considered when both hands are moving or both hands are non-moving during transitions. Fig. 4 presents an example of lexicon construction for S/T SUs. This lexicon starts with spatial subunits (S1–S6), followed by temporal subunits in terms of temporal position (TP3–TP8) and temporal velocity (TV9–TV11), and ends again with spatial subunits (S4–S10).

3.4. Subunit training and recognition

According to the constructed sign lexicon, each sign is composed of a sequence of S/T SUs. For the training and recognition of SUs, an SMP-HMM is used. In recognition, the most probable S/T segment needs to be found from the given set of features and from the SU models that best match the S/T segment. The SMP-HMM finds the S/T segments by using TV cues, TP cues and S cues for the dominant hand. Similarly, it integrates all the cues for the non-dominant hand. As an HMM alone cannot manage all these cues, a multi-stream HMM is used to train this set of features; however, as the features are segmented based on S/T instances, multi-cue parallelism is used to handle these variations. Hence, an SMP-HMM is introduced to manage the 3-SU segments and train these lexica; an illustration of this model is depicted in Fig. 5. The main concern in implementing the SMP-HMM is setting the stream weight for different cues under different segmentations. Here, the cue segments are considered on the basis of S/T instances. Hence, each SU has its own stream weight, set to 1 for every active stream. When the stream follows the spatial SUs, the movement or velocity instances are set to 0 and the position cues are set to 1; similarly, the hand shapes are set to 1, i.e., TV = 0, TP = 1 and HS = 1. In the same way, when the temporal SUs are considered, velocity is set to 1 and position and hand shape are set to 0, i.e., TV = 1, TP = 0 and HS = 0. This interlacement of stream weights is shown in Fig. 6. These cue segments, along with the cluster IDs, are collected to map the training samples and the associated S/T SU models. Based on the Bakis topology, the SMP-HMM is constructed with five states for T SUs and one state for S SUs. The Viterbi algorithm is used to train this HMM, finding the most probable state sequence in a repetitive manner. Once training is completed, the HMM parameters are estimated, i.e., the log likelihood of all the training data.
This process is repeated until all the data are trained without a further increase in the likelihood of the HMM parameters.
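The stream-weight interlacement and the Bakis topology described above can be sketched as follows. The weight values (TV, TP and HS set to 0 or 1) follow the text, while the function names, the uniform initial transition estimates and the skip width are illustrative assumptions rather than the paper's exact implementation.

import numpy as np

def stream_weights(subunit_kind):
    """Stream weights per SU type as described in Section 3.4:
    spatial SUs keep position and hand shape, temporal SUs keep velocity."""
    if subunit_kind == "S":
        return {"TV": 0, "TP": 1, "HS": 1}
    if subunit_kind == "T":
        return {"TV": 1, "TP": 0, "HS": 0}
    raise ValueError("subunit kind must be 'S' or 'T'")

def bakis_transition_matrix(n_states, max_skip=2):
    """Left-to-right (Bakis) transition matrix: from state i the model may stay,
    advance one state, or skip ahead by up to max_skip states."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        k = min(i + max_skip + 1, n_states)
        A[i, i:k] = 1.0 / (k - i)      # uniform initial estimate over reachable states
    return A

# Example: weights for each SU type and a five-state Bakis model for a temporal SU.
print(stream_weights("S"))             # {'TV': 0, 'TP': 1, 'HS': 1}
print(stream_weights("T"))             # {'TV': 1, 'TP': 0, 'HS': 0}
print(bakis_transition_matrix(5))

In an SMP-HMM setting, the per-stream log likelihoods would be combined with these weights before Viterbi decoding.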
Fig. 5. An SMP-HMM for multi-cue parallelism (dark circles = spatial subunits, dotted circles = temporal subunits). (a) S/T sequence; (b) S/T sequence in multi-stream.
Fig. 6. Interlacement across time in an SMP-HMM (dark circles = spatial subunits, dotted circles = temporal subunits, shading = interlacements of S/T sequences).
After this training, the Baum–Welch algorithm is applied to re-estimate the parameter values for both spatial and temporal SUs. When all the training samples have been processed, the accumulated spatial-HMM and temporal-HMM statistics are re-estimated. When the training is complete, sign instances need to be recognized based on the constructed subunit models. Subunit recognition is performed using the trained HMM along with the constructed SU lexicon. Recognition begins with each node being allowed by the HMM to emit along the exact path for a test example t with T frames. All the path values must be calculated and summed to find the individual transition log probability as well as the log probability of the emitting node. The most probable S/T sequence at each time instance is computed by maximizing this log probability, estimated from each path in the recognition network.

4. Experiments and results

4.1. Phonetic subunit construction and 3-SU construction

Phonetics-based SUs can only be used when phonetic transcriptions are available for all the signs, but such low-level annotations are time-consuming. As an alternative strategy, the implementation has to work either towards phonetic adaptation or proceed with new models that account for missing phonetic labels. Further, the usual issue arises when training on an unequal feature space: because of the phonetic labels, the gesture base becomes highly populated. The main characteristic of the proposed approach is the unsupervised recognition of static (S) SUs, which correspond to non-movements, and temporal (T) SUs, which correspond to movements. In this way, the final SU model consists of information on trajectories, which separates out the spatial information and, at the same time, incorporates the data-driven information for the efficient recognition of static (hand shape) and temporal (velocity and position) SUs. In specific terms, the proposed approach decodes each feature sequence using the Viterbi algorithm in the HMM space and generates sequential SU labels with the starting and ending frames of all the signs. With the help of this mechanism, the intra-sign SUs are concatenated as lexica in the gesture base, based on the boundaries of the S/T SUs for all the sign sequences. Each subunit in the lexicon consists of transition-based information along with the position and hand shape information for all the signs in the vocabulary of the American Sign Language Lexicon Video Dataset (ASLLVD). The decomposition of three different signs, i.e., ANY, ABROAD and CHILD, from this dataset into subunits after normalization with respect to initial position and scale is depicted in Fig. 7. The experiments are conducted on the ASLLVD for American Sign Language (ASL) word recognition, which consists of almost 3300 different signs and 9800 distinct sign video instances involving six native ASL signers. Each sign consists of an average of at least five utterances. To evaluate the performance of signed sentences, the RWTH-PHOENIX-Weather dataset is used, which consists of 5356 sentences and 1081 distinct signs performed by seven different signers.
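The decoding step described above yields per-frame SU labels. The small sketch below shows one way such a label sequence could be collapsed into SU segments with starting and ending frames, using the S/TP/TV cluster-ID convention of the 3-SU lexicon; the function and the example label sequence are hypothetical illustrations, not taken from the paper.

def frames_to_subunit_segments(frame_labels):
    """Collapse per-frame SU labels (e.g., from Viterbi decoding) into
    (label, start_frame, end_frame) segments."""
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((frame_labels[start], start, t - 1))
            start = t
    return segments

# Example: a short decoded sequence for one sign.
labels = ["S1", "S1", "S1", "TP3", "TP3", "TV9", "TV9", "TV9", "S4", "S4"]
print(frames_to_subunit_segments(labels))
# [('S1', 0, 2), ('TP3', 3, 4), ('TV9', 5, 7), ('S4', 8, 9)]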
Fig. 7. Decomposition of three signs (ASLLVD) after normalization into SUs (S1 = “any”, S2 = “abroad”, S3 = “child”).
Fig. 8. Comparison of data-driven and linguistics-based approaches.
To evaluate the accuracy of sign detection for a large vocabulary, two metrics are employed: sign accuracy and sign correctness. Sign correctness is measured as the total number of signs (N) minus the sign deletions (D) and substitutions (S), i.e., (N − D − S), divided by the total number of signs and multiplied by 100. Sign accuracy is calculated as the total number of signs (N) minus the sign deletions (D), substitutions (S) and insertions (I), divided by the total number of signs and multiplied by 100.
Correctness = (N − D − S)/N × 100    (2)
Accuracy = (N − D − S − I)/N × 100    (3)
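For clarity, Eqs. (2) and (3) translate directly into the following small Python helpers; the function names and the example counts are illustrative, not from the paper.

def sign_correctness(n, deletions, substitutions):
    """Eq. (2): correctness = (N - D - S) / N * 100."""
    return (n - deletions - substitutions) / n * 100.0

def sign_accuracy(n, deletions, substitutions, insertions):
    """Eq. (3): accuracy = (N - D - S - I) / N * 100."""
    return (n - deletions - substitutions - insertions) / n * 100.0

# Example: 200 reference signs with 6 deletions, 10 substitutions and 4 insertions.
print(f"Correctness: {sign_correctness(200, 6, 10):.1f}%")   # 92.0%
print(f"Accuracy:    {sign_accuracy(200, 6, 10, 4):.1f}%")   # 90.0%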
Data-driven SUs corresponding to spatial and temporal subunits have been decomposed and constructed automatically, based on the velocity-driven segmentation of transition and non-transition movements. These segments are clustered, based on unsupervised learning, to form the shape, position and velocity cues. The SU-based data-driven approach shows reduced complexity and better accuracy, without incorporating phonetic models, than other data-driven approaches and linguistic models. For experimental purposes, an average of 150 isolated words is trained from two signers (Liz and Brady) and 91 signs from a new signer (Tyler) are tested to check cross-signer validation. The results produced by 3-SU are promising for a large vocabulary dataset, showing an improvement of 32.3% with a reduced error rate of 2% for isolated signs. To evaluate the accuracy on signed sentences, the RWTH-PHOENIX-Weather dataset is used, where two signers are used for training and one signer is used for testing the signed sentences. From the corpus, 150 sentences are used in total, including the tested sentences. The proposed approach reduces the error rate from 33.4% to 16.4% and produces an average accuracy of 88.7% on signed sentences without using any linguistic modeling. The experimental results are depicted in Figs. 8 and 9, demonstrating the sign classification accuracy of the proposed 3-SU approach.

4.2. Experiments on the RWTH-PHOENIX-Weather dataset

Recognition experiments on the RWTH-PHOENIX-Weather dataset are conducted to test the accuracy of the proposed 3-SU approach, taking into account the transitions of both hands and the shape of the dominant hand (right hand). The evaluation criteria are set as follows: signer-dependent experiments, i.e., training and testing with the same signer on a large corpus of data, and cross-signer validation, carried out by training with Signer A and testing with Signer B.
Fig. 9. Total error in isolated words and continuous signs.
The RWTH-PHOENIX-Weather dataset employed in the recognition experiments consists of a vocabulary of 1200 signs and almost 5356 sentences with about 600,000 frames signed by seven different signers, although the data are not equally partitioned. All recorded videos are at 25 frames per second and the frame size is 210 × 260 pixels. Based on the experimental results, the video frames are reduced and cropped to 50 × 70 pixels because the hands are exactly cropped within this range. The proposed 3-SU subunit sign modeling is compared with similar approaches to demonstrate its effectiveness. The 2-S-U approach is almost identical to the proposed strategy; the only difference is that hand shape is integrated at a later stage, after accumulating the hand trajectories, whereas the proposed approach initially categorizes the subunits based on spatial and temporal feature cues. The SU-noST approach (segmentation without any separation between spatial and temporal subunits) is almost similar to 2-S-U segmentation; the only difference between these two approaches is the separation into dynamic and static subunits. The next approach in subunit sign modeling is SU-IMH, where pretraining for the motion of the hands is required due to the lack of an information model in the CNNs. The frame-level SU approach (SU-F), designed in the initial stage, constructs subunits and the lexicon at the frame level without considering whole segments, whereas the remaining four approaches are based on whole-segment models. For all the competing approaches, the features and the notation for the dominant and non-dominant hands are as follows: hand shape (HS), position (TP) and velocity (TV) are the general features considered without normalization, whereas, after normalization with respect to initial position and direction, the SU trajectories are denoted as position (TPn) and velocity (TVn). In 3-SU, HS represents the spatial cue and TPn and TVn represent the temporal ones. The combination of multiple features is indicated with "+" (e.g., HS + TPn + TVn). In the remaining approaches, S/T SUs are not segmented, meaning that, to facilitate a fair comparison of all the models, the cues are integrated via the PaHMM. For this experiment, the number of sign sentences is set at six, based on the recognition performance on the training dataset, which does not overlap with the SUs in the test dataset. 3-SU uses 60, 120 and 300 SUs for TPn, TVn and HS, while the highly similar 2-S-U approach uses 120, 180 and 660 SUs for position, velocity and hand shape. For the SU-noST, SU-IMH and SU-F approaches, 960, 660 and 660 SUs are used for position, velocity and hand shape cues, respectively. From this setup, it is observed that the SUs used by 3-SU are much fewer than those of the SU-noST, SU-IMH and SU-F approaches. The similar 2-S-U approach requires more SUs for hand shape because of the late integration of HS cues compared with the 3-SU approach. The 3-SU approach employs a smaller number of subunits because of the discrimination between the spatial and temporal SUs. From the results obtained, it is inferred that 3-SU outperforms the other approaches across different features. From Fig. 10a, it can be seen that the normalized feature vectors produce more accurate results than the non-normalized ones, showing a 2% improvement for cues normalized with respect to initial position, scale and direction over the non-normalized cues.
The proposed 3-SU shows an average increase of 15.6% and 7.9% when compared with the most similar 2-S-U approach and the more recent SU-IMH approach, respectively, for signer-dependent classification. This is due to the discrimination between spatial and temporal subunits in the initial stage. Meanwhile, the lexicon is also constructed based on the three subunits, whereas in the 2-S-U approach the hand shape cue is integrated at a later stage, even though there is a discrimination based on static and dynamic cues. This indicates that feature integration also plays a crucial role in recognition and concerns the integration of different cues, involving the most appropriate features as sequential segments, instead of the blind integration of all cues. Table 1 lists the overall recognition accuracy for the RWTH-PHOENIX-Weather dataset. Generally, as the number of sign glosses increases, the recognition performance is reduced and the error rate increases. One of the most interesting observations from Fig. 10b is the improvement in recognition performance as the number of sign sentences increases. This is due to intra-gloss subunit sharing: when different sentences are used, different subunits are formed for different signs, while similar subunits of different signs are also formed in the sign lexica. This results in the sharing of subunits among different signed sentences, which increases the recognition accuracy and reduces the space of the gesture base in lexicon construction.
Fig. 10. Experimental results for the RWTH-PHOENIX-Weather dataset: (a) comparison of 3-SU with the other four existing approaches and (b) gradual increments in the number of signs.
Table 1
Overall recognition accuracy of the RWTH-PHOENIX-Weather dataset among discriminative features (recognition experiment: features; # of sign sentences: 6).

Approach | Subunit segmentation | S/T integration | Sign accuracy (%)
3-SU | Spatial/temporal | SMP-HMM | 91
2-S-U | Static/dynamic | 2S-Ergodic HMM | 85.3
SU-noST | NA | 2S-Ergodic HMM | 78
SU-IMH | NA | CNN | 85
SU-F | NA | NA | 80
Table 2
Overall recognition accuracy of the RWTH-PHOENIX-Weather dataset with respect to the variation in signed sentences (recognition experiment: # of signed sentences = {20, 45, 60, 85, 100}).

Approach | Subunit segmentation | S/T integration | Average accuracy (%)
3-SU | Spatial/temporal | SMP-HMM | 89.98
2-S-U | Static/dynamic | 2S-Ergodic HMM | 78
SU-noST | NA | 2S-Ergodic HMM | 69
SU-IMH | NA | CNN | 83.2
SU-F | NA | NA | 66.8
Therefore, the proposed 3-SU averages an absolute increase of 6.78% compared to SU-IMH, 13.2% compared to 2-S-U, 20.98% compared to SU-noST and 23.18% compared to SU-F. Table 2 lists the overall recognition accuracy of the RWTH-PHOENIX-Weather dataset for the varying number of sign glosses. Next, the proposed approach is tested for cross-signer validation, i.e., where no data from the test set are employed in the training dataset. In this experiment, Signer 1 performs 304 sentences out of the 5356 sentences without any overlap with the other signers. The proposed 3-SU is trained with the signs of two other signers, Signer 2 and Signer 3, using all repetitions per sign from both signers, while the efficiency of 3-SU is tested on unseen signed sentences from Signer 1. From the obtained experimental results, 3-SU outperforms all the other approaches even under unseen signer conditions, with an average accuracy of 88.10%. There is only a slight (3%) reduction in performance compared with the signer-dependent results for continuous sign sentences. When compared with the two most similar approaches, namely SU-IMH and 2-S-U, 3-SU shows improvements of 19.10% and 40% in accuracy for unseen signers, respectively. Similarly, for cross-signer validation, 3-SU produces an average accuracy of 91% in multi-signer group sign verification; the corresponding increase in average accuracy is 6% and 6.3% compared with the SU-IMH and 2-S-U approaches, and 12% and 13% compared with the SU-F and SU-noST approaches. The proposed approach is also tested against the whole-sign-based (SW-1) approach to demonstrate the efficacy of subunit-level sign modeling. The whole-sign-based approach needs to model the transition movements along with the signs. This causes the system to lose its scalability when considering the sign sentences, involving a large vocabulary, of the RWTH-PHOENIX-Weather dataset. Further, the acquired results show that SU-level sign modeling is superior in terms of scalability and accuracy because of its intra-sign SU sharing for large vocabularies and the fact that no separate modeling is required for transition movements. The proposed 3-SU produces an improved accuracy of 3.4% for cross-signer validation and 10.10% for unseen signer validation when compared to the whole-sign-based SLR approach. The experimental results for cross-signer validation and unseen signer verification are listed in Table 3.
Table 3
Cross-signer validation and unseen signer verification for SU modeling and whole sign modeling.

Signer | Features | 3-SU | 2-S-U | SU-noST | SU-IMH | SU-F | SW-1
Signer 2 & Signer 3 | TV + TP + HS | 89 | 75 | 76 | 81 | 83 | 84.3
Signer 2 & Signer 3 | TPn + HS | 87 | 84 | 77 | 79 | 75 | 85
Signer 2 & Signer 3 | TVn + HS | 89.3 | 85 | 75 | 83 | 78 | 85
Signer 2 & Signer 3 | TPn + TVn + HS | 91 | 85.3 | 78 | 85 | 80 | 87.6
Signer 1 | TV + TP + HS | 85 | 30.1 | 35 | 58 | 11.9 | 72.3
Signer 1 | TPn + HS | 85.3 | 38.8 | 37.10 | 61 | 28.8 | 75
Signer 1 | TVn + HS | 87.4 | 45.3 | 38.8 | 65.3 | 32.7 | 75.10
Signer 1 | TPn + TVn + HS | 88.10 | 48 | 42.6 | 69 | 35.53 | 78
Fig. 11. Experimental results for the ASLLVD: (a) comparison of 3-SU with four other existing approaches and (b) gradual increments in the number of signs.

Table 4
Accuracy of unseen signer verification in the ASLLVD.

Signer | Features | 3-SU | 2-S-U | SU-noST | SU-F | SW-2
Signer Lana | TV + TP + HS | 95 | 85.1 | 65 | 67 | 83
Signer Lana | TPn + TVn + HS | 97.3 | 86.4 | 68 | 71 | 85.4
4.3. Experiments on the ASLLVD

The ASLLVD [2] consists of more than 3300 isolated signs from six different native ASL signers. For this experiment, the number of SUs is set according to the recognition performance on the training dataset, which has no overlap with the SUs in the test dataset. The 3-SU approach uses 10, 10 and 30 SUs for TPn, TVn and HS for six different signs, while the highly similar 2-S-U approach uses 20, 30 and 200 SUs for position, velocity and hand shape. For the SU-noST, SU-F and whole-sign-based (SW-2) approaches, 160, 200 and 200 SUs are used for position, velocity and hand shape cues, respectively. From this setup, it is observed that the SUs used by 3-SU are much fewer than those of the SU-noST, SU-F and SW-2-based approaches. The similar 2-S-U approach requires more SUs for hand shape because of the late integration of the HS cues with the static/dynamic cues compared with the 3-SU approach. The latter also employs a smaller number of subunits because of the discrimination between spatial and temporal SUs for isolated signs. From Fig. 11a, it is inferred that the proposed 3-SU approach reaches a maximum accuracy of 99.3% in recognizing isolated signs. Comparatively, it produces an improvement of 11.9% over the SW-2 approach, while also showing increased accuracy compared with the subunit sign models. The proposed 3-SU approach produces an increased accuracy of 3.4% compared to the SU-F approach, 5.2% compared to the 2-S-U approach and 6.06% compared to the SU-noST approach. Similarly, from Fig. 11b, for a number of isolated signs varying from 20 to 100, the 3-SU approach achieves an average efficiency of 98.64% due to intra-gloss subunit sharing. The 2-S-U approach produces an accuracy of 93.4% on average after obtaining the results of zero to four instances, but the proposed 3-SU approach does not require any instantiation of the development dataset and produces improved recognition results. Hence, it is demonstrated that 3-SU produces effective recognition results compared with other existing approaches and for large vocabulary datasets, and that the scalability of the system remains constant as the vocabulary of signs increases. Table 4 lists the accuracy for unseen signer verification in the ASLLVD. In this experiment, Signer Lana video sequences are used to train the 3-SU approach with one training sample per sign, for the 3300 signs chosen.
Signer Dana is tested without any overlap with the training sequences. The performance of 3-SU is tested with all three feature cues in both the SU-level and SW-2 approaches, showing a relative improvement of 10.9% compared to 2-S-U, 29.3% compared to SU-noST, 26.3% compared to SU-F and 11.9% compared to SW-2. This demonstrates that 3-SU outperforms the other approaches under the unseen signer condition with an accuracy of 97.3% for large vocabulary sign lexica, i.e., only a 2% variation from the multi-signer group verification results.

5. Conclusion

The proposed SLR system introduces a novel 3-SU framework for subunit sign modeling. The 3-SU approach is assessed in various SLR experiments with data from two different SLs and two different large-corpus datasets: (1) German Sign Language (GSL) – the RWTH-PHOENIX-Weather dataset and (2) ASL – the ASLLVD. Extensive comparison experiments are conducted with four SU-level approaches and two whole-sign-based approaches. The average results obtained for the GSL RWTH-PHOENIX-Weather (304 sentences) corpus, with respect to the other approaches, show a relative improvement of 6% compared to 2-S-U, 6.3% compared to SU-IMH, 12% compared to SU-noST and 12.3% compared to SU-F in the case of cross-signer validation and unseen signer verification. Further, 3-SU shows a relative improvement of 10.10% compared to SW-1. The average results obtained for the ASLLVD, involving 3300 isolated signs, with respect to the other approaches show an average improvement of 10.9% compared to 2-S-U, 29.3% compared to SU-noST, 26.3% compared to SU-F and 11.9% compared to SW-2. The overall 3-SU framework produces effective recognition results with constant scalability for an increasing vocabulary of signs.

References

[1] Bauer B, Kraiss KF. Towards an automatic sign language recognition system using subunits. In: Proceedings of the gesture workshop; 2001. p. 64–75.
[2] Neidle C, Vogler C. A new web interface to facilitate access to corpora: development of the ASLLRP data access interface. In: Proceedings of the 5th workshop on the representation and processing of sign languages: interactions between corpus and lexicon, LREC 2012, Istanbul, Turkey; 2012.
[3] Cooper H, Ong EJ, Pugeault N, Bowden R. Sign language recognition using sub-units. In: Gesture recognition. Springer; 2017. p. 89–118.
[4] Derpanis KG, Wildes RP, Tsotsos JK. Definition and recovery of kinematic features for recognition of American sign language movements. Image Vis Comput 2008;26(12):1650–62.
[5] Elakkiya R, Selvamani K. Enhanced dynamic programming approach for subunit modelling to handle segmentation and recognition ambiguities in sign language. J Parallel Distrib Comput 2018;117:246–55.
[6] Elakkiya R, Selvamani K. Extricating manual and non-manual features for subunit level medical sign modelling in automatic sign language classification and recognition. J Med Syst 2017;41(11):175.
[7] Fang G, Gao W, Zhao D. Large vocabulary sign language recognition based on fuzzy decision trees. IEEE Trans Syst Man Cybern Part A Syst Hum 2004;34(3):305–14.
[8] Han J, Awad G, Sutherland A. Automatic skin segmentation and tracking in sign language recognition. IET Comput Vis 2009;3(1):24–35.
[9] Han J, Awad G, Sutherland A. Boosted subunits: a framework for recognising sign language from videos. IET Image Process 2013;7(1):70–80.
[10] Kim T, Keane J, Wang W, Tang H, Riggle J, Shakhnarovich G, Brentari D, Livescu K. Lexicon-free fingerspelling recognition from video: data, models, and signer adaptation. Comput Speech Lang 2017;46:209–32.
[11] Koller O, Forster J, Ney H. Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput Vis Image Underst 2015;141:108–25.
[12] Koller O, Bowden R, Ney H. Automatic alignment of HamNoSys subunits for continuous sign language recognition. In: Proceedings of the 7th workshop on the representation and processing of sign languages: corpus mining, LREC; 2016. p. 121–8.
[13] Kong WW, Ranganath S. Towards subject independent continuous sign language recognition: a segment and merge approach. Pattern Recognit 2014;47(3):1294–308.
[14] Kumar P, Roy PP, Dogra DP. Independent Bayesian classifier combination based sign language recognition using facial expression. Inf Sci 2018;428:30–48.
[15] Li K, Zhou Z, Lee CH. Sign transition modeling and a scalable solution to continuous sign language recognition for real-world applications. ACM Trans Access Comput (TACCESS) 2016;8(2):7–23.
[16] Neidle C, Liu J, Liu B, Peng X, Vogler C, Metaxas D. Computer-based tracking, analysis, and visualization of linguistically significant nonmanual events in American Sign Language (ASL). In: Proceedings of the LREC workshop on the representation and processing of sign languages: beyond the manual channel; 2014.
[17] Ong SC, Ranganath S. Automatic sign language analysis: a survey and the future beyond lexical meaning. IEEE Trans Pattern Anal Mach Intell 2005;1(6):873–91.
[18] Pattanaworapan K, Chamnongthai K, Guo JM. Signer-independence finger alphabet recognition using discrete wavelet transform and area level run lengths. J Visual Commun Image Represent 2016;38:658–77.
[19] Pitsikalis V, Theodorakis S, Vogler C, Maragos P. Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW); 2011. p. 1–6.
[20] Shanableh T, Assaleh K, Al-Rousan M. Spatio-temporal feature-extraction techniques for isolated gesture recognition in Arabic sign language. IEEE Trans Syst Man Cybern Part B (Cybern) 2017;37(3):641–50.
[21] Theodorakis S, Pitsikalis V, Maragos P. Model-level data-driven sub-units for signs in videos of continuous sign language. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP); 2010. p. 2262–5.
[22] Theodorakis S, Pitsikalis V, Maragos P. Dynamic–static unsupervised sequentiality, statistical subunits and lexicon for sign language recognition. Image Vis Comput 2014;32(8):533–49.
[23] Yang R, Sarkar S, Loeding B. Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming. IEEE Trans Pattern Anal Mach Intell 2010;32(3):462–77.
[24] Yang W, Tao J, Ye Z. Continuous sign language recognition using level building based on fast hidden Markov model. Pattern Recognit Lett 2016;78:28–35.

R. Elakkiya is an Assistant Professor in the Department of Computer Science and Engineering, School of Computing, SASTRA University, Thanjavur. She received her PhD from Anna University, Chennai, in 2018. She has been working on various issues concerning machine learning. She has published several research articles on different aspects of sign language classification and recognition.
K. Selvamani is an Assistant Professor in the Department of Computer Science and Engineering, Anna University, Chennai. He received his PhD from Anna University, Chennai, in 2012. He has vast knowledge and experience in the fields of machine learning, web services and network security. Currently, he is guiding several research projects in various domains.