Fast Indoor Scene Description for Blind People with Multiresolution Random Projections
Mohamed L. MEKHALFI(a), Farid MELGANI(a), Yakoub BAZI(b), Naif ALAJLAN(b)
(a) Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 9, I-38123, Trento, Italy E-mail: {mohamed.mekhalfi, melgani}@disi.unitn.it
(b) College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia E-mail: {ybazi, najlan}@ksu.edu.sa
ABSTRACT
Object recognition is a substantial need for blind and visually impaired individuals. This paper proposes a new multiobject recognition framework, which coarsely checks the presence of multiple objects in an image grabbed by a portable camera at a considered indoor site. The outcome is a list of objects that likely appear in the indoor scene. Such a description is meant to raise the awareness of the blind person so as to better sense his/her surroundings. The method relies on a library containing (i) a set of images represented by means of the Random Projections (RP) technique, and (ii) their respective lists of objects, both prepared offline. Given an image shot online, its RP representation is generated and matched to the RP patterns of the library images, and it inherits the object list of the closest library image. Extensive experiments returned promising recognition accuracies and processing times that satisfy real-time requirements.
Keywords:
Assistive technologies, coarse scene description, multiobject recognition, multiresolution processing, random projection, visually impaired people.
I. INTRODUCTION
Blindness and visual disability are among the most detrimental conditions that may befall an individual, limiting his/her social as well as personal experiences. For visually impaired people, it is undoubtedly of great help if a sighted person accompanies them and offers assistance so that they can carry out their daily activities without hurdles. This, however, is impractical in today's society. Traditional aids such as the white cane or the guide dog have been in use for decades and have shown to lessen, at least slightly, the severity of the problem. In recent years, however, so-called 'assistive technologies' have emerged with innovative solutions and offered a major improvement over the traditional means. One of the fastest growing assets in this regard is computer vision, which has proven useful because of its reasonable trade-off between efficiency, implementation cost, and lightweight hardware. Allowing blind individuals to move on their own while avoiding obstacles is the main concern when designing an assistive system or prototype. For instance, a navigation assistant named the GuideCane was proposed in [1]; it accommodates a wheeled housing equipped with ultrasonic sensors, a handle to steer the cane, and a processing core that analyzes the ultrasonic signals emitted by the sensors so as to infer whether an object is present along the walking path. The underlying principle of the ultrasonic sensors is that they simultaneously emit signals which, in the presence of obstacles, are reflected back. The distance to the obstacles is then deduced from the time lapse between emission and reflection (commonly termed time of flight, TOF). The same concept was adopted in the work presented in [2], where the sensors are placed on a wearable belt instead. Another similar work was put forth in [3]. In this work, the sensors were placed on the
shoulders of the user as well as on a guide cane. Another design considers the use of electromagnetic signals instead of ultrasonic ones [4]. A widespread antenna sends an electromagnetic signal that is reflected back in the presence of obstacles. The returned signal is then amplified and the distance to the potential obstacles is estimated from the measured TOF. However, the range of the prototype is limited to 3 meters ahead of the user. Besides object detection, the prototype introduced in [5] was supplemented with pedestrian recognition and global positioning system modules. Departing from the literature, it is worth noting that assisted navigation for blind/visually impaired people has received a reasonable amount of emphasis over the past years. By contrast, aided object recognition has not attracted much attention. In [6], for instance, a food product recognition system for shopping spaces was proposed; it relies on detecting and recognizing the items' QR codes through a portable camera. Another work considers detecting and recognizing bus line numbers for the blind [7]. Banknote recognition has also been addressed in [8]. Staircase, door, and indoor signage detection/recognition have been considered in [9], [10], and [11], respectively. It thus emerges that even the scarce body of work devoted so far to assisted object recognition for the blind emphasizes detecting/recognizing single classes of objects, which raises the question of whether a blind person would be satisfied with information about only one object in his/her surroundings, given that numerous objects usually appear together across an indoor scene. When considering the multiobject recognition issue for the blind, the claimed prototype/design is required to detect as many objects as needed on the spot (i.e., in a very short time lapse). In this context, Mekhalfi et al. posed a concept termed 'coarse scene description' aimed at
recognizing objects in bulk while keeping the processing time under control [12]-[13]. The underlying idea is that the blind person uses a portable camera to capture an image (of the scene to be described) in an indoor space. The grabbed image is then compared to a previously constructed library of images, each stored along with its list of objects. Afterwards, the most similar images (and accordingly their lists of objects) from the library are fused through a majority vote rule in order to deduce the list of objects likely to appear in the test image. In particular, three strategies based on the Scale Invariant Feature Transform (SIFT), the Bag of Words model (BOW), and Principal Component Analysis (PCA) were explored in [12]. It turned out that the PCA and BOW strategies run much faster, while the SIFT strategy scored higher recognition rates. In [13], however, the Compressive Sensing (CS) theory was exploited to promote a compact representation of the images, together with a Gaussian Process Regression (GPR) model for completing the recognition process. The proposed scheme demonstrated good accuracies and a rather short processing time. In both previous works, the authors came to the conclusion that an adequate image representation ought to bring forth good recognition accuracies and a short processing time. In this regard, we propose in this paper a novel image multilabeling approach that derives insights from the so-called Random Projections (RP) technique. Its basic concept is to cast a signal (for instance, an image converted into a vector) onto a matrix of random scalars (basically, a set of vectors stacked column-wise to form an RP matrix) [14-18]. The output is a series of coefficients whose length is smaller than the size of the input signal, shifting thereby the burden of coping with high dimensional signals to dealing with a small set of representative coefficients. RP has been found interesting in different applications, ranging from data clustering [19] and remote sensing [20] to biometrics [21]. The rationale behind adopting the
RP in our work is its notable capability of compactly representing a given image while preserving a sound representativeness, which translates into a very short processing time (only simple matrix operations are needed) and a good accuracy, as validated by the experiments conducted in this paper. The rest of this paper is organized as follows. Section 2 outlines the coarse description concept through the image multilabeling concept. Section 3 gives insights into the RP technique and details how it is exploited for image representation in our work. Section 4 reports the experimental results, and Section 5 draws conclusions and sets future directions.
II. COARSE IMAGE MULTILABELING
Since allowing a blind individual to recognize multiple objects at once is more informative than recognizing only a single class of objects, we undertake in this paper a scheme termed 'image multilabeling' aimed at coarsely describing an indoor scene, whose concept is to deduce the list of objects that most likely show up in an image shot by a portable camera. In this respect, the frequency as well as the location of the objects across the indoor scene are not part of the scope of this paper. The image multilabeling procedure goes through two stages, namely (i) an offline phase, followed by (ii) an online stage. The offline part refers to the construction of a library of training images captured at different locations across the considered indoor space, where each training image is coupled with its corresponding list of objects and its RP-based compact representation, and stored offline. It is noteworthy that the number as well as the type of objects to be recognized are completely customizable. In the online part, given a query image acquired by the user by means of a portable camera (launched by the user through a voice-based interface), the algorithm produces its respective RP sequence, which is then compared to the RPs of the library of training images. The training image that scores the highest match is picked and its corresponding list of objects is ultimately inherited by the query (test) image. For the sake of clarity, Fig. 1 depicts the pipeline of the presented scheme.
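As an illustration of the offline/online pipeline just described, the following Python sketch shows one possible implementation; the function names (e.g., rp_signature), the use of cosine similarity for matching (as adopted later in the experiments), and the data layout are our own assumptions rather than the authors' code.

```python
import numpy as np

def rp_signature(image, templates):
    # Project the image onto each random template (inner product)
    # and concatenate the resulting scalars into a compact signature.
    return np.array([np.sum(image * t) for t in templates])

def build_library(train_images, train_object_lists, templates):
    # Offline stage: store one RP signature and one object list per training image.
    return [(rp_signature(img, templates), objs)
            for img, objs in zip(train_images, train_object_lists)]

def multilabel(query_image, library, templates):
    # Online stage: compute the query signature, find the closest library
    # signature (cosine similarity), and inherit its object list.
    q = rp_signature(query_image, templates)
    best_objs, best_sim = None, -np.inf
    for sig, objs in library:
        sim = np.dot(q, sig) / (np.linalg.norm(q) * np.linalg.norm(sig) + 1e-12)
        if sim > best_sim:
            best_sim, best_objs = sim, objs
    return best_objs
```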
III. COMPACT IMAGE REPRESENTATION
A. Random Projections Concept
From the multilabeling strategy outlined in the previous section, it follows that, in order to satisfy near-real-time standards in terms of processing loads, a compact image representation paradigm needs to be adopted. In the literature, several techniques for dimensionality reduction have been presented, such as principal component analysis (PCA) [22] and linear discriminant analysis (LDA) [23]. The underlying idea of PCA is to construct a set of linearly uncorrelated vectors, called principal components. These principal components, whose number is less than or equal to the number of original vectors in the data, are then used as a basis to represent the data at hand. LDA is a method that searches for the basis vectors (features) that best separate the data into one or more classes. However, such dimensionality reduction methods may yield low performance, as the original data are projected onto a subspace that does not guarantee the best discriminatory representation. Moreover, they require a training stage in order to produce the basis vectors, which often need to be regenerated once the data have been modified (i.e., they are data-dependent). A recently emerged technique, namely random projections (RP), has shown powerful assets for dimensionality reduction while being data-independent. Its concept is rather simple, which also makes it well suited to very fast implementations. It consists in 'multiplying' the input image with randomly generated images serving as filters. The smaller the number of filters, the stronger the dimensionality reduction. In this context, the well-known Johnson-Lindenstrauss
lemma [16]-[24] states that pairwise distances can be well maintained when the data are randomly projected onto a medium-dimensional space. The basis of RP comprises a matrix of random entries that serve as projection atoms for the original data. The entries of the matrix are generated once at the start and then reused even if the dataset is amended. As motivated above, the rationale for adopting RP for image representation in our work is its remarkable ability to concisely narrow down an image into a series of scalars by means of a small-sized projection matrix. Consider a high dimensional signal x of length N, and a projection matrix R of size N×M comprising M random vectors of length N (arranged column-wise). The low dimensional projection y, of length M, of x onto R is expressed by Eq. (1):

$$ y = R^{T} x \qquad (1) $$
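To make Eq. (1) concrete, the short numerical sketch below (our own illustration, not taken from the paper) projects a batch of high dimensional vectors onto an M-dimensional random basis and checks that a pairwise distance is roughly preserved, in the spirit of the Johnson-Lindenstrauss lemma; the dimensions and the 1/sqrt(M) scaling are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, n_samples = 10_000, 300, 5          # original dim, reduced dim, number of vectors

X = rng.normal(size=(n_samples, N))       # high dimensional signals (one per row)
R = rng.choice([-1.0, 1.0], size=(N, M))  # random +/-1 projection matrix (cf. Eq. (3))

Y = X @ R / np.sqrt(M)                    # y = R^T x for each row, rescaled

# Compare one pairwise distance before and after projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
print(f"original distance {d_orig:.1f}, projected distance {d_proj:.1f}")
```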
The projection procedure is detailed in the next subsection.
B. Random Projections for Image Representation
From Eq. (1), the input x is a one-dimensional signal that is projected column-wise onto the random matrix R. Hence, each column of the matrix represents an element of the projection basis. The same result can be obtained if a two-dimensional representation of the input signal and of the projection elements is used. In our case, the input signal meant for an RP-based representation is an image grabbed by the portable camera. Therefore, the projection elements consist of a set of random matrices (filters) of the same size as the image. In more detail, if M filters are adopted, then M inner products (between the image and each filter) are performed, yielding M scalars whose concatenation forms the ultimate compact RP representation of the image. Fig. 2 outlines the routine for generating an RP representation of a generic image.
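The equivalence between the vectorized form of Eq. (1) and the two-dimensional formulation can be verified directly; the snippet below (our own illustration, with an arbitrary toy image size) shows that the inner product of the flattened image with a flattened filter equals the element-wise product sum of their 2-D forms.

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 48, 64                                   # small toy image size for the demo
image = rng.random((h, w))
filt = rng.choice([-1.0, 1.0], size=(h, w))     # one random +/-1 filter

coeff_2d = np.sum(image * filt)                 # 2-D inner product with the filter
coeff_1d = image.ravel() @ filt.ravel()         # same filter seen as a column of R

assert np.isclose(coeff_2d, coeff_1d)
print(coeff_2d)
```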
C. Multiresolution Random Projections
We have stated earlier that the input image undergoes an inner product with the templates (filters) of the adopted random matrix (also referred to as the measurement matrix). Accordingly, the choice of the matrix entries has to be defined. From the literature, it emerges that the most popular matrix configuration is the one presented in [18], where the probability distribution of the random matrix entries is expressed as follows:

$$ r_{ij} = \sqrt{s} \times \begin{cases} +1 & \text{with probability } \frac{1}{2s} \\ 0 & \text{with probability } 1 - \frac{1}{s} \\ -1 & \text{with probability } \frac{1}{2s} \end{cases} \qquad (2) $$

where the indices i and j point to the rows and columns of the random matrix R, and s is a constant which controls the distribution of values across the random matrix. In particular, two popular special cases used in the literature are s = 3 and s = 1 [14-15], [17]. In the case of s = 3, two thirds of the projection template entries are null, whereas if s is set to unity, the projection matrix exhibits a uniform distribution of +1 and -1, which suits our application as it randomly captures the gradients at different positions over the input image. In this paper, the value of s is set to one, and the corresponding uniform distribution is given hereunder:

$$ r_{ij} = \begin{cases} +1 & \text{with probability } \frac{1}{2} \\ -1 & \text{with probability } \frac{1}{2} \end{cases} \qquad (3) $$
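A short sketch of how the random matrix entries of Eq. (2) can be sampled is given below (our own illustration); setting s = 1 reproduces the +/-1 templates of Eq. (3) used in this paper, while s = 3 yields the sparse variant with roughly two thirds of the entries equal to zero.

```python
import numpy as np

def random_projection_matrix(n_rows, n_cols, s=1, seed=0):
    # Sample entries according to Eq. (2): sqrt(s) * {+1, 0, -1}
    # with probabilities 1/(2s), 1 - 1/s, 1/(2s).
    rng = np.random.default_rng(seed)
    probs = [1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)]
    values = np.sqrt(s) * np.array([1.0, 0.0, -1.0])
    return rng.choice(values, size=(n_rows, n_cols), p=probs)

R_dense = random_projection_matrix(100, 50, s=1)   # uniform +/-1 entries (Eq. 3)
R_sparse = random_projection_matrix(100, 50, s=3)  # about 2/3 of the entries are zero
print((R_sparse == 0).mean())                      # close to 0.667
```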
To thoroughly analyze the images across different scales, we propose in this work a multiresolution random projection (MRP) of the input image. The multiresolution concept is meant to analyze a given signal/image at different scales, the aim being to capture richer information and cover finer details of the addressed signal/image. Examples of multiresolution analysis include scene classification [25]-[26], texture analysis [27], and remote sensing [28]. The MRP method consists of casting the input image onto a set of multiresolution random templates generated according to Eq. (3), with patterns that vary gradually. In particular, assuming the images are of size m×n, the first template of the projection matrix consists of four regions of size (m/2)×(n/2) each, where the assignment of +1 and -1 to each region is done according to Eq. (3). The next template consists of '+1'- or '-1'-filled regions of size (m/4)×(n/4). The region size decreases progressively until the smallest region is reached, that is, a single pixel of the image, where each pixel takes the value '+1' or '-1'. In our work, the images have a standard resolution of 640×480 and eight multiresolution levels are adopted for generating the templates. The region sizes as well as the number of templates per level are listed in Table 1. Fig. 3 depicts two samples of each template.
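The following sketch (our own interpretation of the template construction described above and in Table 1) generates block-structured +/-1 templates, where each template is tiled with constant regions of a given size and each region's sign is drawn according to Eq. (3).

```python
import numpy as np

def mrp_template(img_shape, region_shape, rng):
    # One multiresolution template: the image area is tiled with blocks of
    # size region_shape, and each block is filled with a random +1 or -1.
    h, w = img_shape
    rh, rw = region_shape
    blocks = rng.choice([-1.0, 1.0],
                        size=(int(np.ceil(h / rh)), int(np.ceil(w / rw))))
    return np.kron(blocks, np.ones((rh, rw)))[:h, :w]

def mrp_templates(img_shape=(480, 640), seed=0):
    # Region sizes and template counts follow Table 1 (257 templates in total).
    plan = [((240, 320), 2), ((120, 160), 5), ((60, 80), 10), ((30, 40), 20),
            ((15, 20), 30), ((5, 10), 40), ((3, 5), 50), ((1, 1), 100)]
    rng = np.random.default_rng(seed)
    return [mrp_template(img_shape, size, rng)
            for size, count in plan for _ in range(count)]

templates = mrp_templates()
print(len(templates))  # 257
```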
IV. EXPERIMENTS
A. Designed Wearable System
The proposed recognition scheme is part of a complete wearable prototype that comprises two systems, namely (i) a recognition system, which describes the indoor site to the blind user so as to provide a better ability to sense nearby surrounding objects, and (ii) a guidance system, which guides the blind individual across the indoor space from his/her current location to a desired destination while avoiding mobile as well as static objects. In terms of hardware, the system is completely computer-vision based, and mainly accommodates a portable camera (a CMOS camera from IDS Imaging Development Systems, model UI-1240LE-C-HQ with a KOWA LM4NCL lens) mounted on the user's chest, a portable processing unit (a laptop or a tablet/smartphone), and a laser range finder for assessing the distance of objects to the user, all mounted on a wearable jacket. While the navigation system operates all the time once the prototype is launched, the recognition system works only upon request by the user, who grabs an image by means of the camera through a vocal command; the image is then forwarded to the processing unit and subjected to the multilabeling algorithm proposed in this paper. The outcome of the recognition (the list of surrounding objects) is communicated to the user vocally through an earphone. Fig. 4 illustrates the wearable prototype.
B. Dataset Description
The evaluation of the presented image multilabeling approach was performed on four different datasets. The first two datasets (dataset 1 and dataset 2) were acquired at two separate indoor locations at the University of Trento, Italy. The acquisition was run by means of a chest-mounted CMOS camera from IDS Imaging Development Systems, model UI-1240LE-C-HQ with a KOWA LM4NCL lens, carried by a wearable lightweight vest as illustrated in Fig. 4, which shows the multisensor prototype on which we are working for both guiding blind people and helping them recognize objects in indoor sites. It is noteworthy that the images acquired by the portable camera were not compensated for barrel distortion, as our method handles the images as a whole and does not extract any feature from them. Dataset 1 contains a total of 130 images, split into 58 training images and 72 test images. Dataset 2 holds 131 images, divided into 61 images for training and 70 images for testing. As noted above, a list of objects of interest must be predefined. We have therefore selected the objects deemed to be the most important ones across the considered indoor environments. Regarding dataset 1, 15 objects were considered as follows: 'External Window', 'Board',
'Table', 'External Door', 'Stair Door', 'Access Control Reader', 'Office', 'Pillar', 'Display Screen', 'People', 'ATM', 'Chairs', 'Bins', 'Internal Door', and 'Elevator'. As for dataset 2, the list was the following: 'Stairs', 'Heater', 'Corridor', 'Board', 'Laboratories', 'Bins', 'Office', 'People', 'Pillar', 'Elevator', 'Reception', 'Chairs', 'Self Service', 'External Door', and 'Display Screen'. The last two datasets (dataset 3 and dataset 4) were shot at two separate indoor locations at King Saud University, Saudi Arabia, by means of a Samsung Note 3 smartphone. Dataset 3 accommodates 161 training and 159 test images, for a total of 320 images, and its object list is the following: 'Pillar', 'Fire Extinguisher/Hose', 'Trash Can', 'Chairs', 'External Door', 'Hallway', 'Self Service', 'Reception', 'Didactic Service Machine', 'Display Screen', 'Board', 'Stairs', 'Elevator', 'Laboratory', and 'Internal Door'. Dataset 4 comprises 174 images, consisting of 86 training and 88 test images, and its object list contains: 'Board', 'Fire Extinguisher', 'Trash Cans', 'Chairs', 'External Door', 'Didactic Service Machine', 'Self Service', 'Reception', 'Cafeteria', 'Display Screen', 'Pillar', 'Stairs', 'Elevator', 'Prayer Room', and 'Internal Door'. It is noteworthy that the training images of all datasets were selected in such a way as to cover all the predefined objects in the considered indoor environment.
C. Results
Two measures are employed in order to quantify the multilabeling accuracy, namely the sensitivity (SEN) and the specificity (SPE). They respectively convey the accuracies of true positives (existing objects) and true negatives (non-existing objects). We have hinted earlier that a central goal of our work is to point out the list of objects present in the scene in a very short amount of time. In that regard, it is noted that image
resolution has a deep impact on the processing time. In particular, the higher the resolution, the more time is consumed in the projection step (i.e., the inner products), as the RP templates have the same size as the input image. Therefore, reducing the image size also reduces the size of the RP templates, which causes a drastic decrease in the processing time. In this respect, we have considered four resolution levels: the first one is the full-size resolution (640×480), the second one divides it by a ratio of 1/2 (320×240), the third one divides it by a ratio of 1/5 (128×96), and the last one divides it by a ratio of 1/10 (64×48). Given that the projection matrix comprises random '+1' and '-1' distributions, we have taken the average accuracies over ten runs (each run performed with a different set of random projections). The matching step was achieved by means of the cosine distance, as it showed slight improvements with respect to the basic Euclidean distance. We initiate the experimental evaluation by checking the efficiency of the multiresolution RP as compared to a regular RP strategy, in which the projection filters have a region resolution of one pixel. To this end, we ran the experiments with 257 RP templates (the total number of multiresolution templates according to Table 1) with a 1×1 region size. The average accuracies over ten runs are reported in Tables 2-5. The results clearly point out the valuable gain brought by the multiresolution RP over the ordinary scenario, which indicates the capacity of the MRP to investigate further scales of the considered images. From the reported results, it appears that the presented MRP algorithm performs best at a resolution ratio of 1/10 for all the datasets. This may trace back to the fact that reducing the image resolution diminishes the small details as well as the size of the objects while preserving the backgrounds and the large surfaces/objects. In other terms, the tiny details can be considered as outlier noise; hence, reducing the resolution fades that noise and keeps the
dominating spectral content of the images, which is essential for the comparison, as validated by the results. To further assess the proposed multilabeling scheme, we show a comparison against the multilabeling strategy presented in [13], where the authors proposed a CS-based image representation and a GPR-based image similarity prediction. The results for all image resolution ratios (1, 1/2, 1/5, and 1/10) are respectively listed in Tables 2-5. The tables report the SEN, the SPE, their average, and the average processing time per image (the machine is a Dell laptop equipped with an Intel i5 processor and 4 GB of memory). Considering a ratio of 1/10, the GPR-CS method gives a higher SEN than MRP on dataset 1 and dataset 3. This is due to the fact that the GPR method does not rely on the spectral similarity between the images in order to achieve the multilabeling process, but on the semantic content of the images, which seems helpful when the density of objects within the images is not high. For dataset 2 and dataset 4, however, the proposed scheme outperforms the one presented in [13], which is explained by the fact that the indoor spaces where these datasets were shot are relatively small, making the images spectrally close to each other. Moreover, the density of objects is relatively higher in both datasets, which raises the probability of detecting true positives. Both of these factors (i.e., spectrally similar images with a high density of objects) favor the MRP method, as it relies on the apparent similarity between the images in order to perform the multilabeling. Considering the average between SEN and SPE over all image resolutions, it appears clearly that the MRP method outperforms GPR-CS on all datasets. Moreover, MRP exhibits slight improvements as the image resolution decreases, as explained earlier.
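As a concrete illustration of the two accuracy measures used above, the following sketch (our own, hypothetical helper) computes SEN and SPE for one test image from its predicted and ground-truth object lists, given the full set of predefined object classes.

```python
def sen_spe(predicted, ground_truth, all_objects):
    # SEN: fraction of truly present objects that were declared present.
    # SPE: fraction of truly absent objects that were declared absent.
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    tn = len([o for o in all_objects if o not in predicted and o not in ground_truth])
    sen = tp / max(len(ground_truth), 1)
    spe = tn / max(len(all_objects) - len(ground_truth), 1)
    return sen, spe

classes = ["Board", "Office", "Pillar", "Chairs", "Bins", "Internal Door", "Elevator"]
print(sen_spe(["Board", "Pillar", "Chairs"], ["Board", "Chairs", "Bins"], classes))
```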
As regards the processing time per image, MRP is by far faster than the GPR-CS method. In particular, for the smallest resolution (one-tenth ratio), GPR-CS takes over one second per image while MRP revolves around 0.036 sec/image, which is almost 27 times faster, thereby fulfilling real-time processing requirements. The reason CS takes longer is that it operates in an iterative manner to produce a compact image representation, whereas RP requires only a direct projection of the image onto the considered filters. The global classification accuracies as well as the processing times over all four datasets, for a resolution ratio of 1/10, are reported in Table 6, which once again confirms that the MRP method exhibits better performance. Two examples from each dataset are depicted in Fig. 5, Fig. 6, Fig. 7, and Fig. 8, respectively.
D. Comparison with Deep Models
Besides the comparison conducted in the previous subsection, we further evaluate the performance of the proposed scheme against a well-established state-of-the-art image classification model, namely the Convolutional Neural Network (CNN). Technically, the CNN was introduced around 1998 [29], but it did not initially receive much attention due to its demanding processing requirements. Lately, however, owing to the abundance and affordability of relatively powerful machines, steadily increasing emphasis has been directed towards taking advantage of CNNs. In short, the CNN is a variant of feed-forward artificial neural networks whose underlying idea is to apply an ensemble of filters over the input space (typically an image) so as to capture salient information that can be exploited to tell the (object) class pertaining to the input pattern. The CNN has exhibited notable success in various applications, including pattern recognition [30], biomedical engineering [31], action recognition [32], object tracking [33], and speech recognition [34]. On account of its particular recognition assets, the CNN was opted for as a comparison ground to underline the effectiveness of the proposed multilabeling approach. In the literature, there exist a number of CNN models already trained on extremely large image datasets; one of the most popular is the GoogLeNet model, which is employed in our work to extract image representations that are afterwards fed into a Softmax classification layer. In this regard, the Softmax classifier is learned on the CNN representations deduced from the training images and then applied to the test images to produce the final lists of objects. The final classification results as well as the average processing time per image for the proposed scheme and the CNN model are reported in Table 7. From this table, a first observation is that the CNN brings slight improvements over our method (~2%) in terms of the overall average of sensitivity and specificity. In terms of sensitivity, however, our scheme outperforms the CNN model by far (except for dataset 3), while the latter scores a higher specificity. Our interest, however, is in gaining a higher sensitivity, as it expresses more informative content (i.e., the existing objects). As for processing times, our method is roughly twice as fast as the CNN, as it comprises only two basic stages (i.e., projection and matching). As a trade-off between processing time and accuracy, both methods can be viewed as equivalent, with a tendency of our method to favor sensitivity as well as processing time.
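For reference, a minimal sketch of the CNN_Softmax baseline is given below; the feature matrices are assumed to have been extracted beforehand with a pretrained GoogLeNet (e.g., via MatConvNet [36]), and the softmax layer, its dimensions, the use of normalized multi-label targets, and the training loop are our own illustrative choices rather than the exact setup used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(features, targets, lr=0.1, epochs=200):
    # features: (n_images, d) CNN descriptors; targets: (n_images, n_classes)
    # multi-label indicator vectors normalized to sum to one (soft targets).
    n, d = features.shape
    k = targets.shape[1]
    W, b = np.zeros((d, k)), np.zeros(k)
    for _ in range(epochs):
        p = softmax(features @ W + b)
        W -= lr * features.T @ (p - targets) / n   # cross-entropy gradient step
        b -= lr * (p - targets).mean(axis=0)
    return W, b

# Toy usage with random stand-ins for 1024-D GoogLeNet descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(58, 1024))                       # training descriptors
Y = rng.integers(0, 2, size=(58, 15)).astype(float)   # 15 object classes
Y = Y / np.maximum(Y.sum(axis=1, keepdims=True), 1)   # normalize label vectors
W, b = train_softmax(X, Y)
scores = softmax(rng.normal(size=(1, 1024)) @ W + b)  # per-class scores for a test image
```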
V. CONCLUSIONS
This paper presents a multiobject recognition scheme meant to recognize objects in bulk in a given indoor environment for visually impaired people. In particular, the main focus of the paper is the existence/absence of objects rather than their position in the indoor site. The underlying idea is to apply the RP technique in order to compactly represent an image acquired by a portable camera and to compare it to a set of images whose RP signatures and object lists are prepared and stored offline a priori. Based on the comparison of the RP of the test image against all the RPs of the built library, the object list of the closest training image is assigned to the query image. The proposed method has revealed a sound accuracy with a remarkably short processing time, which enables it to detect multiple objects under real-time processing constraints. The presented method is likely to tackle scale changes. In general terms, however, it might lack some robustness to rotation changes, which can be addressed by augmenting the images held in the library. Given that the camera is fixed on the user's chest (and thus does not exhibit significant rotational variability between the images), this issue does not arise in practice. As for further developments, it is believed that the current form of the algorithm can be upgraded. For instance, the image representations have been generated through a projection onto an RP matrix leading to a representation dimensionality smaller than the size of the input images. An alternative is to cast the image onto a higher dimensional space (through a first, very large RP matrix) and then project the obtained representation onto another, smaller RP matrix for compact representation [35]. Another potentially promising direction is to apply the coarse description concept by exploiting the temporal dimension in video sequences rather than single shots extracted from the video. This may benefit the addressed application in terms of accuracy, but at the cost of higher processing overheads. However, recent works suggest that coding video data can be made faster by considering, for instance, multi-core parallel processing platforms [37]-[38].
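A minimal sketch of the two-stage projection alternative mentioned above (inspired by [35]; the toy dimensions and the +/-1 entries are our own assumptions) would look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_high, n_low = 32 * 24, 5_000, 257   # toy input, expanded, and compact dimensions

x = rng.random(n)                                        # vectorized low-resolution image
R_up = rng.choice([-1.0, 1.0], size=(n, n_high))         # first, very large RP matrix
R_down = rng.choice([-1.0, 1.0], size=(n_high, n_low))   # second, smaller RP matrix

y = (x @ R_up) @ R_down                  # expand to a higher dimension, then compress
print(y.shape)                           # (257,)
```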
Acknowledgement This work was supported by the Deanship of Scientific Research at King Saud University through the Local Research Group Program under Project RG-1435-055.
The authors would like to extend their gratitude to A. Vedaldi and K. Lenc for making available the MatConvNet toolbox [36] exploited in the context of this paper.
References
[1] I. Ulrich and J. Borenstein, "The GuideCane - applying mobile robot technologies to assist the visually impaired," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 31, no. 2, pp. 131-136, 2001.
[2] S. Shoval, J. Borenstein and Y. Koren, "Auditory guidance with the Navbelt - a computerized travel aid for the blind," IEEE Trans. Syst., Man, Cybern. C, vol. 28, no. 3, pp. 459-467, 1998.
[3] M. Bousbia-Salah, M. Bettayeb and A. Larbi, "A Navigation Aid for Blind People," Journal of Intelligent & Robotic Systems, vol. 64, no. 3-4, pp. 387-400, 2011.
[4] L. Scalise, V. Primiani, P. Russo, D. Shahu, V. Di Mattia, A. De Leo and G. Cerri, "Experimental Investigation of Electromagnetic Obstacle Detection for Visually Impaired Users: A Comparison With Ultrasonic Sensing," IEEE Trans. Instrum. Meas., vol. 61, no. 11, pp. 3047-3057, 2012.
[5] S. Lee, S. Kang and S. Lee, "A Walking Guidance System for the Visually Impaired," International Journal of Pattern Recognition and Artificial Intelligence, vol. 22, no. 06, pp. 1171-1186, 2008.
[6] D. López-de-Ipiña, T. Lorido, and U. López, "BlindShopping: Enabling accessible shopping for visually impaired people through mobile technologies," in Proc. 9th Int. Conf. Toward Useful Services Elderly People Disabilities, 2011, pp. 266-270.
[7] H. Pan, C. Yi, and Y. Tian, "A primary travelling assistant system of bus detection and recognition for visually impaired people," in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jul. 2013, pp. 1-6.
[8] F. Hasanuzzaman, X. Yang and Y. Tian, "Robust and Effective Component-Based Banknote Recognition for the Blind," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1021-1030, 2012.
[9] T. J. J. Tang, W. L. D. Lui, and W. H. Li, "Plane-based detection of staircases using inverse depth," in Proc. ACRA, 2012, pp. 1-10.
[10] X. Yang and Y. Tian, "Robust door detection in unfamiliar environments by combining edge and corner features," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2010, pp. 57-64.
[11] S. Wang and Y. Tian, "Camera-based signage detection and recognition for blind persons," in Proc. 13th Int. Conf. Comput. Helping People Special Needs, 2012, pp. 17-24.
[12] M. Mekhalfi, F. Melgani, Y. Bazi and N. Alajlan, "Toward an assisted indoor scene perception for blind people with image multilabeling strategies," Expert Systems with Applications, vol. 42, no. 6, pp. 2907-2918, 2015.
[13] M. Mekhalfi, F. Melgani, Y. Bazi and N. Alajlan, "A Compressive Sensing Approach to Describe Indoor Scenes for Blind People," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 7, pp. 1246-1257, 2015.
[14] J. Wang, Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Beijing: Springer, 2012.
[15] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: Applications to image and text data," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2001, pp. 245-250.
[16] E. Candes and T. Tao, "Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?," IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5406-5425, 2006.
[17] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671-687, 2003.
[18] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2006, pp. 287-296.
[19] Â. Cardoso and A. Wichert, "Iterative random projections for high-dimensional data clustering," Pattern Recognition Letters, vol. 33, no. 13, pp. 1749-1755, 2012.
[20] W. Li, S. Prasad and J. Fowler, "Classification and Reconstruction From Random Projections for Hyperspectral Imagery," IEEE Trans. Geosci. Remote Sensing, vol. 51, no. 2, pp. 833-843, 2013.
[21] J. Pillai, V. Patel, R. Chellappa and N. Ratha, "Secure and Robust Iris Recognition Using Random Projections and Sparse Representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1877-1893, 2011.
[22] I. Jolliffe, Principal Component Analysis. New York: Springer, 2002.
[23] X. Shu, Y. Gao and H. Lu, "Efficient linear discriminant analysis with locality preserving for face recognition," Pattern Recognition, vol. 45, no. 5, pp. 1892-1898, 2012.
[24] W. B. Johnson and J. Lindenstrauss, "Extension of Lipschitz mapping into a Hilbert space," in Proc. Conf. Modern Analysis and Probability, 1984, pp. 189-206.
[25] L. Zhou, Z. Zhou and D. Hu, "Scene classification using multi-resolution low-level feature combination," Neurocomputing, vol. 122, pp. 284-297, 2013.
[26] L. Zhou, Z. Zhou and D. Hu, "Scene classification using a multi-resolution bag-of-features model," Pattern Recognition, vol. 46, no. 1, pp. 424-433, 2013.
[27] J. Florindo and O. Bruno, "Texture analysis by multi-resolution fractal descriptors," Expert Systems with Applications, vol. 40, no. 10, pp. 4022-4028, 2013.
[28] Y. Bazi, F. Melgani and H. Al-Sharari, "Unsupervised Change Detection in Multispectral Remotely Sensed Imagery With Level Set Methods," IEEE Trans. Geosci. Remote Sensing, vol. 48, no. 8, pp. 3178-3187, 2010.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[30] S. L. Phung and A. Bouzerdoum, "A pyramidal neural network for visual pattern recognition," IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 329-343, Mar. 2007.
[31] E. J. S. Luz, R. S. William, CH Guillermo, D. Menotti, "ECG-based heartbeat classification for arrhythmia detection: A survey," Computer Methods and Programs in Biomedicine, pp. 2608-2611, 2015.
[32] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[33] J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human Tracking Using Convolutional Neural Networks," IEEE Trans. Neural Networks, vol. 21, no. 10, pp. 1610-1623, Oct. 2010.
[34] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533-1545, Oct. 2014.
[35] L. Liao, Y. Zhang, S. Maybank, Z. Liu and X. Liu, "Image recognition via two-dimensional random projection and nearest constrained subspace," Journal of Visual Communication and Image Representation, vol. 25, no. 5, pp. 1187-1198, 2014.
[36] A. Vedaldi and K. Lenc, "MatConvNet - Convolutional Neural Networks for MATLAB," in Proc. ACM Int. Conf. on Multimedia, 2015.
[37] C. Yan, Y. Zhang, J. Xu, F. Dai, J. Zhang, Q. Dai, and F. Wu, "Efficient parallel framework for HEVC motion estimation on many-core processors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, pp. 2077-2089, 2014.
[38] C. Yan, Y. Zhang, J. Xu, F. Dai, L. Li, Q. Dai, and F. Wu, "A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors," IEEE Signal Processing Letters, vol. 21, pp. 573-576, 2014.
Fig. 1. Flowchart of the coarse scene description framework.
Fig. 2. Diagram outlining the RP-based image representation. A compact form is made possible after applying a set of simple product operations between the input image and an ensemble of randomly generated image filters.
Fig. 3. Two samples of each projection template sorted from top to bottom according to the resolution. Top templates refer to resolutions of half the image size. Bottom templates refer to regions of one pixel. Black color indicates the '-1' whilst the grey color refers to '+1'.
Fig. 4. View of the main components of the designed wearable prototype.

[Fig. 5 content: two query image / closest training image pairs. Detected objects, example 1: Board, Office, Pillar, Chairs, Bins, Internal Door; example 2: External Door, Stair Door.]

Fig. 5. Image multilabeling examples from Dataset 1. The left images show the captured frames, while the right ones report the corresponding closest training images from the library and (at the bottom) the corresponding labels assigned as detected objects to the query images.
[Fig. 6 content: two query image / closest training image pairs. Detected objects, example 1: Stairs, Heater, Corridor, Board; example 2: Board, Bins, External Door.]

Fig. 6. Image multilabeling examples from Dataset 2. The left images show the captured frames, while the right ones report the corresponding closest training images from the library and (at the bottom) the corresponding labels assigned as detected objects to the query images.
[Fig. 7 content: two query image / closest training image pairs. Detected objects, example 1: Trash Can, Chairs, Display Screen, Board, Stairs; example 2: Trash Can, Chairs, Stairs.]

Fig. 7. Image multilabeling examples from Dataset 3. The left images show the captured frames, while the right ones report the corresponding closest training images from the library and (at the bottom) the corresponding labels assigned as detected objects to the query images.
[Fig. 8 content: two query image / closest training image pairs. Detected objects, example 1: Board, Fire Extinguisher, Trash Cans, Internal Door; example 2: Fire Extinguisher, Chairs, External Door, Display Screen.]

Fig. 8. Image multilabeling examples from Dataset 4. The left images show the captured frames, while the right ones report the corresponding closest training images from the library and (at the bottom) the corresponding labels assigned as detected objects to the query images.
Table 1. Characteristics (dimensions and numbers) of the adopted set of multiresolution random projection filters.

Region size    Number of templates
240x320        2
120x160        5
60x80          10
30x40          20
15x20          30
5x10           40
3x5            50
1x1            100
Total          257
Table 2. Multilabeling results in terms of recognition accuracy and computation time on all datasets for a resolution ratio of 1.

                                      Dataset 1  Dataset 2  Dataset 3  Dataset 4
GPR-CS [13]
  SEN                                 79.77      75         73.28      69.42
  SPE                                 66.54      74.09      80.12      75.14
  Average                             73.16      74.55      76.7       72.28
  Time (Sec)                          2.17       2.61       3.21       3.08
Multiresolution Random Projections
  SEN                                 68.69      74.32      68.48      67.48
  SPE                                 82.89      90.88      94.11      91.52
  Average                             75.79      82.6       81.3       79.5
  Time (Sec)                          1.04       1.07       1.06       1.09
  Std (SEN)                           2.24       2.60       1.99       1.92
  Std (SPE)                           1.70       0.61       0.39       0.80
Random Projections
  SEN                                 72.45      68.76      68.06      62.89
  SPE                                 90.73      83.35      94.30      90.30
  Average                             81.59      76.05      81.18      76.59
  Std (SEN)                           1.60       1.09       2.17       3.10
  Std (SPE)                           0.42       0.94       0.42       0.83
Table 3. Multilabeling results in terms of recognition accuracy and computation time on all datasets for a resolution ratio of 1/2.

                                      Dataset 1  Dataset 2  Dataset 3  Dataset 4
GPR-CS [13]
  SEN                                 79.77      75         73.3       67.35
  SPE                                 66.91      74.09      80.13      74.4
  Average                             73.34      74.55      76.72      70.88
  Time (Sec)                          1.39       1.48       1.78       1.53
Multiresolution Random Projections
  SEN                                 68.65      74.5       69.21      67.69
  SPE                                 82.85      90.83      94.24      91.58
  Average                             75.75      82.67      81.73      79.63
  Time (Sec)                          0.25       0.22       0.23       0.29
  Std (SEN)                           2.12       2.67       1.79       1.72
  Std (SPE)                           1.65       0.58       0.39       0.68
Random Projections
  SEN                                 69.18      67.45      63.46      58.72
  SPE                                 81.14      89.88      93.75      90.71
  Average                             75.16      78.67      78.60      74.71
  Std (SEN)                           2.60       3.47       0.92       2.17
  Std (SPE)                           0.96       0.95       0.32       0.61
Table 4. Multilabeling results in terms of recognition accuracy and computation time on all datasets for a resolution ratio of 1/5.

                                      Dataset 1  Dataset 2  Dataset 3  Dataset 4
GPR-CS [13]
  SEN                                 81.64      74.09      72.51      66.53
  SPE                                 67.4       73.73      80.18      74.12
  Average                             74.52      73.91      76.35      70.33
  Time (Sec)                          1.13       1.36       1.67       1.11
Multiresolution Random Projections
  SEN                                 68.84      75.86      70.42      69.55
  SPE                                 82.77      91.08      94.54      91.73
  Average                             75.8       83.47      82.48      80.64
  Time (Sec)                          0.056      0.056      0.061      0.061
  Std (SEN)                           2.43       2.17       1.63       1.92
  Std (SPE)                           1.63       0.53       0.35       0.78
Random Projections
  SEN                                 70.07      64.63      64.63      57.15
  SPE                                 80.12      93.89      93.89      90.35
  Average                             75.10      79.26      79.26      73.75
  Std (SEN)                           2.20       1.98       1.71       2.10
  Std (SPE)                           0.96       0.72       0.37       0.73
Table 5. Multilabeling results in terms of recognition accuracy and computation time on all datasets for a resolution ratio of 1/10.

                                      Dataset 1  Dataset 2  Dataset 3  Dataset 4
GPR-CS [13]
  SEN                                 80.89      75         73.82      64.46
  SPE                                 68.14      73.97      80.53      72.91
  Average                             74.52      74.49      77.18      68.69
  Time (Sec)                          1.033      1.228      1.406      1.056
Multiresolution Random Projections
  SEN                                 71.16      77.18      71.65      70
  SPE                                 83.2       91.41      94.81      92.33
  Average                             77.18      84.3       83.23      81.16
  Time (Sec)                          0.037      0.037      0.035      0.037
  Std (SEN)                           2.32       2.32       1.49       1.90
  Std (SPE)                           0.78       0.49       0.31       0.73
Random Projections
  SEN                                 70.60      67.77      65.16      61.32
  SPE                                 80.92      89.75      94.15      90.73
  Average                             75.76      78.76      79.65      76.03
  Std (SEN)                           1.04       2.32       1.68       2.54
  Std (SPE)                           1.16       0.47       0.23       0.59
Table 6. Comparison in terms of recognition accuracies, averaged over the four datasets, and processing time per image.

             MRP      GPR-CS [13]
SEN          72.49    73.54
SPE          90.43    73.88
Time (Sec)   0.036    1.18
Table 7. Comparison of classification rates and computation times between the MRP and the CNN model.

                                      Dataset 1  Dataset 2  Dataset 3  Dataset 4
Multiresolution Random Projections
  SEN                                 71.16      77.18      71.65      70
  SPE                                 83.2       91.41      94.81      92.33
  Average                             77.18      84.3       83.23      81.16
  Time (Sec)                          0.037      0.037      0.035      0.037
CNN_Softmax
  SEN                                 61.79      67.27      73.56      67.76
  SPE                                 96.55      98.55      97.95      95.91
  Average                             79.17      82.91      85.75      81.84
  Time (Sec)                          0.068      0.066      0.064      0.067
Highlights
A multiresolution random projection for image representation is presented.
An indoor scene description for visually impaired people is proposed.
Experiments are conducted on four different indoor datasets.
Results qualify the framework as a near-real-time blind assistance technology.