Photo sundial: Estimating the time of capture in consumer photos

Tsung-Hung Tsai a, Wei-Cih Jhou a, Wen-Huang Cheng a,*, Min-Chun Hu b, I-Chao Shen a, Tekoing Lim a, Kai-Lung Hua c, Ahmed Ghoneim d, M. Anwar Hossain d, Shintami C. Hidayati a,c
a Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan
b Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
c Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taiwan
d Department of Software Engineering, College of Computer and Information Science (CCIS), King Saud University, Saudi Arabia
Article info

Article history: Received 20 January 2015; Received in revised form 5 September 2015; Accepted 14 November 2015; Communicated by Tao Mei.

Keywords: Photo time estimation; Camera parameters; Sun modeling.

Abstract

The time of capture of consumer photos provides rich information on temporal context and has been widely employed for solving various multimedia problems, such as multimedia retrieval and social media analysis. However, we observed that the recorded time stamp in a consumer photo often does not correspond to the true local time at which the photo was taken. This can greatly damage the robustness of time-aware multimedia applications, such as travel route recommendation. Therefore, motivated by the use of traditional sundials, this work proposes a system, Photo Sundial, for estimating the time of capture by exploiting astronomical theory. In particular, we infer the time by establishing its relations to measurable astronomical factors in a given outdoor photo, i.e. the sun position in the sky and the camera viewing direction at the location where the photo was taken. In practice, since people often take multiple photos in a single trip, we further develop an optimization framework to jointly estimate the time from multiple photos. Experimental results show that the average estimated time error of the proposed approach is less than 0.9 h, a significant 65% relative improvement compared to the state-of-the-art method (2.5 h). To the best of our knowledge, this work is the first study in multimedia research to explicitly address the problem of time of capture estimation in consumer photos, and the achieved performance is highly encouraging for practical applications.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction

Most modern digital cameras can automatically add a time stamp to the photos that users take. The cameras store this information in the digital photo itself, in the exchangeable image file (EXIF) format. Since the time information provides rich contextual cues related to capture conditions, independent of the captured scene contents, it is widely employed in multimedia research to benefit the understanding of content semantics and the creation of various time-aware applications, such as time-based photo clustering [1], semantic scene classification [2,3], automatic image annotation [4], travel route recommendation [5–7], and virtual navigation in photos [8,9]. In the literature [1,5,6], a time stamp is commonly expected to be the local time at which the photo was taken in a geographical location. However, this is often not a valid assumption, especially for consumer photos. For instance, Fig. 1(a) shows three examples downloaded from Flickr. The captured scene contents are obviously inconsistent with the
* Corresponding author. E-mail address: [email protected] (W.-H. Cheng).
associated time stamps. Further, Fig. 1(b) gives the time stamp distributions of two large sets of GPS-geotagged photos that we randomly collected from Flickr using the GPS coordinates of two Sydney attractions, i.e. Bondi Beach and Royal Botanic Gardens. Peculiar phenomena caused by the problematic time stamps can also be observed. For example, according to the time distribution, many photos appear to have been taken in the Royal Botanic Gardens at midnight, when it is closed. Moreover, it is unusual that the number of photos taken between 2 AM and 5 AM is larger than that between 2 PM and 5 PM. In general, a problematic time stamp is due to two main causes: (1) the camera time is set incorrectly; (2) some camera models reset the time to the default setting once the battery pack is removed. Besides, the widespread use of social networking websites brings additional issues. For example, many modern websites allow visitors to upload photographs, and the metadata of these photos is commonly stripped in the uploading process. Moreover, in the past few years, there has been a considerable rise in the availability of digital image editing software as well as photo editing mobile apps. The metadata of photos
Fig. 1. Examples of incorrect time stamps. (a) Sample photos with the corresponding recorded time stamps (the date stamps are in parentheses). The first and second photos show daytime beach scenes with inconsistent nighttime stamps. Similarly, the third scene, a street view at night, has an incorrect time stamp close to 10 AM. (b) Time stamp distributions of two sets of GPS-geotagged photos collected from Flickr using the GPS coordinates of two Sydney attractions, Bondi Beach and Royal Botanic Gardens (RBG), totaling 1662 and 1663 photos, respectively. According to the time distribution, over 14% of the RBG photos were taken around 1–6 AM. However, RBG is closed during those hours.
processed by such editing software is often overwritten. As a result, the correctness of photograph metadata is often in doubt. Therefore, effective techniques for probing the correct time of capture are desirable not only for helping people organize their personal photos efficiently but also for ensuring the robustness of multimedia applications that significantly depend on time information. For example, travel route recommendation is an active topic of multimedia research [5–7]. It often utilizes social media sites to collect time-labelled images and identify "the best time of day" for visiting travel spots. Unreliable time stamps would greatly degrade the recommendation quality. Similarly, advanced time-sensitive services, such as the "Any Time" feature in Google's image search, would suffer if images were indexed by incorrect time stamps.

In this paper, to address the above problem, we propose a system, Photo Sundial, to compute the time information of outdoor photos. Motivated by the idea of traditional sundials [10], we first measure two astronomical factors from a given photo, i.e. the sun position in the sky and the camera viewing direction. According to astronomical theory [11,12], we then infer an initial estimated time by formulating the mathematical relation between the involved astronomical factors. Considering that people tend to take multiple photos in a single trip (possibly including a number of different locations and even indoor environments) [5,6,13], this fact can be used to further optimize the initial estimated times of capture by the proposed joint alignment over each set of multiple photos. To validate the effectiveness of our system, we have collected a dataset of 2102 consumer photos annotated with the ground truth of the time of capture and GPS coordinates. The experimental results show that the average estimated time error of our approach is less than 0.9 h, a significant 65% relative improvement compared to the state-of-the-art method (2.5 h).

Our main contributions are threefold: (1) To the best of our knowledge, this work is the first study in multimedia research to explicitly address the problem of time of capture estimation by exploiting the temporal relationship among multiple photos, and the achieved performance is highly encouraging for practical applications. (2) An approximation method is proposed to effectively calculate the initial estimated time of capture when a measured sun position is inaccurate. Also, general algorithms for assessing the camera viewing direction are developed to help obtain the true position of the sun. (3) The collected dataset is the first to provide reliable contextual information about photo capture conditions, which is lacking in existing image datasets.

In the rest of this paper, Sections 2 and 3 review the related work in the literature and give the prior knowledge of astronomical theory, respectively. The main framework of the proposed approach for estimating the time of capture is presented in Section 4. Section 5
presents the experimental results and discussions. Finally, Section 6 concludes our work and gives directions for future research.
2. Related work

Since research on estimating the time of capture of consumer photos and videos is lacking in the literature, we briefly summarize relevant research on a broader range of aspects, including time-related factor measurement, camera pose determination, and multimedia applications based on geo-temporal contexts, as follows.

2.1. Time-related factors measurement

Lalonde et al. [14] developed a sky model for estimating the sun position in a single outdoor image according to the clues of sky, shadows, and shadings. In addition, Reda et al. [15] presented a solar position algorithm for solar radiation applications, such as the deployment of solar panels. In date stamp estimation, Sunkavalli et al. [16] collected a sequence of outdoor images in a day from cameras fixed in different geographical locations to determine the day of the year by color change analysis. Besides, Garg et al. [17] used optical sensors and video cameras to record the electric network frequency (ENF) signal as a natural time stamp from indoor environments under fluorescent lighting.

2.2. Camera pose determination

Park et al. [18] determined the camera viewing direction of a photo according to its GPS information and utilized the Google street views and aerial images around the GPS location for visual content based analysis. Bansal et al. [19] estimated the position and orientation of a photo taken in urban areas by using aerial image databases. Snavely et al. [9,20] collected an image dataset from photo sharing websites and trained a model that can estimate the camera orientation by visual matching. Moreover, Lalonde et al. [21] used the sun position, time stamp, and a clear sky image to estimate the geographical location where a photo was taken.

2.3. Multimedia applications based on geo-temporal contexts

Time stamps and geo-tags have been widely applied in multimedia tasks. For example, Cooper et al. [1] used time stamps of photos to perform temporal clustering and treated them as important clues for event recognition. Besides, Hays et al. [22] collected more than 6 million GPS-geotagged photos and trained a distributional model that can estimate geographic information from a single image. Shrivastava et al. [23] proposed a cross-domain image matching method that improves the accuracy of the
Fig. 2. Illustrations of astronomical concepts.
GPS coordinate estimation of photos. In traveling applications, Hsieh et al. [8] presented a system that exploits temporal and spatial relations among photos and hence provides users with a new photo browsing experience, as if they were revisiting the trips in their photos. In addition, Lu et al. [6] developed a traveling recommendation system by analyzing the temporal and spatial information of photos collected from photo sharing websites. Similarly, Arase et al. [5] performed trip pattern mining to discover the spatio-temporal relationship of consumer photos.
3. Prior knowledge

In astronomy [11,12], a celestial coordinate system is a metric geometry (a geometry where distance can be measured) for mapping positions on the celestial sphere. As shown in Fig. 2(a), the celestial coordinate system uses two parameters of the spherical coordinate system, i.e. the azimuth angle ϕ_s and the elevation angle θ_s, to express the position of a point of interest, such as the sun or a star. The azimuth angle is the angle between a reference vector and the vector from an observer to the point of interest projected perpendicularly onto the local horizon. To describe the azimuth angle of the sun, the reference vector is defined as pointing from the observer to due north, and the angle is measured in the clockwise direction. On the other hand, the elevation angle is the angle between the point of interest and the local horizon, with respect to the observer. It is expressed between 0° and 90° to describe the altitude of the point of interest.

In addition, the sun declination angle δ and the solar hour angle h_s are two other important astronomical terms [11,12]. The declination of the sun is the angular position of the sun at noon (see footnote 1) with respect to the plane of the Earth's equator, cf. Fig. 2(b). The sun declination angle currently has the range −23.44° ≤ δ ≤ +23.44° during its yearly cycle. One approximation for the declination angle, given by Cooper's equation, is

\delta = -23.44^{\circ} \cos\!\left( \frac{360^{\circ}}{365} (N + 10) \right),    (1)

where N denotes the number of days since January 1. For example, in the Northern Hemisphere, δ varies from −23.44° at the winter solstice, through 0° at the vernal equinox, to +23.44° at the summer solstice.

Moreover, as shown in Fig. 2(c), the solar hour angle of a point on the Earth's surface is the angle between two planes: one containing the Earth's axis and the zenith of the given point (the meridian plane), and the other containing the Earth's axis and the sun. That is, the solar hour angle h_s is an angular displacement of the sun east or west of the local meridian. We represent the apparent displacement of the sun away from solar noon as either a negative or a positive angle:

h_s \in \begin{cases} [-180^{\circ}, 0^{\circ}) & \text{if before noon (morning)}, \\ (0^{\circ}, +180^{\circ}] & \text{if after noon (afternoon)}. \end{cases}    (2)

When h_s equals zero, the sun is at its highest point (noon) for that given day. Corresponding to the Earth's rotation, this angular displacement can be used to measure the time t, i.e. one hour represents 15 angular degrees of travel around the 360° celestial sphere. The measure of the hour angle is h_s = 15°(t − 12), where t = 12 indicates local noon. In other words, by observing the position of the sun with respect to the Earth, we are able to compute the time t according to the solar hour angle.

Given the hour angle h_s and the geographical location (latitude Φ) of an observer, the sun position is represented by a pair of the solar elevation angle θ_s and the solar azimuth angle ϕ_s. According to astronomical theory [11,12], the solar elevation angle θ_s can be approximated by

\sin\theta_s = \sin\delta \sin\Phi + \cos\delta \cos h_s \cos\Phi,    (3)

and the equations related to the solar azimuth angle ϕ_s can be formulated as

\cos\phi_s \cos\theta_s = \sin\delta \cos\Phi - \cos\delta \cos h_s \sin\Phi,    (4)

\sin\phi_s \cos\theta_s = -\sin h_s \cos\delta.    (5)

Therefore, by setting θ_s equal to zero (i.e. the sun exactly on the local horizon), we can obtain the solar hour angles at sunrise and sunset, i.e. h_s^r and h_s^f, from Eq. (3). Since the daily rotation of the Earth is counter-clockwise, the sun rises in the east (at a solar azimuth angle of approximately 90°), moves to the south (at a solar azimuth angle of approximately 180°), and sets in the west (at a solar azimuth angle of approximately 270°), which forms a sun path diagram (see footnote 2), cf. Fig. 3. By further substituting the two solar hour angles (i.e. h_s^r and h_s^f) into Eq. (4), we can uniquely determine the corresponding sunrise and sunset solar azimuth angles ϕ_s^r and ϕ_s^f, respectively. Thus, a valid value of ϕ_s associated with h_s on the given day will be in [ϕ_s^r, ϕ_s^f].
1 In astronomy, noon refers to the solar noon, i.e. the time of day when the sun appears to have reached its highest point in the sky. The solar noon is not necessarily the same as "clock noon", but the difference is bounded within a few minutes and can be ignored.
2 The sun path diagram for any specified location on Earth at any specified time of the year can be obtained from the University of Oregon's Solar Radiation Monitoring Laboratory. Website: http://solardat.uoregon.edu/SunChartProgram.html
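To make these relations concrete, the following minimal Python sketch (not part of the original paper; names and numerical values are ours) evaluates Eq. (1) for the declination and Eqs. (3)–(5) for the solar elevation and azimuth, using the conventions above (azimuth clockwise from due north, negative hour angle before solar noon).

```python
import math

def declination(day_of_year):
    """Cooper's approximation, Eq. (1): delta = -23.44 deg * cos(360/365 * (N + 10))."""
    return -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

def sun_position(day_of_year, latitude_deg, hour_angle_deg):
    """Solar elevation and azimuth (degrees) from Eqs. (3)-(5)."""
    delta = math.radians(declination(day_of_year))
    phi = math.radians(latitude_deg)
    h = math.radians(hour_angle_deg)

    # Eq. (3): sin(theta_s) = sin(delta) sin(Phi) + cos(delta) cos(h_s) cos(Phi)
    sin_theta = math.sin(delta) * math.sin(phi) + math.cos(delta) * math.cos(h) * math.cos(phi)
    theta_s = math.asin(max(-1.0, min(1.0, sin_theta)))

    # Eqs. (4)-(5) give the cosine and sine of the azimuth; atan2 resolves a unique angle.
    cos_az = (math.sin(delta) * math.cos(phi) - math.cos(delta) * math.cos(h) * math.sin(phi)) / math.cos(theta_s)
    sin_az = -math.sin(h) * math.cos(delta) / math.cos(theta_s)
    phi_s = math.degrees(math.atan2(sin_az, cos_az)) % 360.0
    return math.degrees(theta_s), phi_s

# Example: day 220 (around August 8, so delta is roughly 16 deg), latitude 51.5 deg N,
# hour angle 15 * (15 - 12) = 45 deg (3 PM solar time) -> roughly 40 deg elevation, azimuth ~242 deg.
print(sun_position(day_of_year=220, latitude_deg=51.5, hour_angle_deg=15 * (15 - 12)))
```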
4. The proposed method

An overview of the proposed method is shown in Fig. 4. Let P = (P_1, …, P_N) denote the sequence of N given outdoor photos P_i, i ∈ {1, …, N}. Since the sky portions are exploited for estimating the sun position, each P_i is first segmented into sky and non-sky regions by using a pixel-wise labeling algorithm [24]. Next, given the segmented sky regions, we calculate the position of the sun ϕ_c by a machine learning based sky modeling method [14]. Further, the photo's initial estimated time of capture (i.e. T_i^e of P_i) can be calculated based on the estimated sun position as well as two other types of information inferred from the metadata: the geotags, for obtaining the geometric latitude Φ of the place where the photo was taken (cf. Section 4.2.1), and the date of capture, for obtaining the sun declination angle δ by Eq. (1). An approximation method is also proposed to effectively calculate the initial estimated time of capture when a measured sun position is inaccurate, cf. Section 4.1. Note that the sun position obtained here is relative to the camera viewing direction of a photo, not the true one in a geographical location. Hence, general algorithms are also developed for assessing the camera viewing direction ϕ_Δ in order to help obtain the true position of the sun (i.e. ϕ_s = ϕ_c + ϕ_Δ), cf. Section 4.2. In Section 4.3, based on the T_i^e's from multiple photos in P, we then propose our optimization framework, namely joint alignment, to jointly estimate the time of capture of the multiple photos altogether (i.e. T* = (T_1*, …, T_N*)).

Fig. 4. Overview of the proposed method.

4.1. Initial estimated time of capture

In Eq. (4), by removing the term cos θ_s using Eq. (5), we can form a quadratic equation of cos h_s:

(A^2 + B^2)\cos^2 h_s - 2AC\cos h_s + (C^2 - B^2) = 0,    (6)

with

A = \sin\phi_s \cos\delta \sin\Phi, \quad B = \cos\phi_s \cos\delta, \quad C = \sin\phi_s \cos\Phi \sin\delta.    (7)

The solutions of Eq. (6) are given by the quadratic formula:

\cos h_s = \frac{2AC \pm 2B\sqrt{A^2 + B^2 - C^2}}{2(A^2 + B^2)}.    (8)

As a result, the solar hour angle h_s can be calculated by taking the inverse cosine of cos h_s. The initial estimated time of capture is then obtained as T_i^e = h_s/15 + 12, cf. Section 3. Note that T_i^e is represented by a real number in the range [0, 24) (24 h in a day); e.g., T_i^e = 3.14 means a time of 03:08:24 AM.

However, the sun position ϕ_s estimated by [14] may sometimes be inaccurate, causing the solution of Eq. (8) to be imaginary, i.e. cos h_s is equal to a complex number. This happens when the discriminant of Eq. (8) (the part under the radical sign, i.e. A^2 + B^2 − C^2) is negative: if a quadratic equation with real-number coefficients has a negative discriminant, its two solutions are complex conjugates of each other. In Eq. (8), the discriminant is related to three variables, δ, Φ, and ϕ_s. Therefore, to get insight into the imaginary cases, one way is to enumerate all possible combinations of values of the variables and, for each combination, calculate the corresponding discriminant value. Fig. 5 gives an example. By randomly picking a day of the year (August 8th, i.e. δ = 16° according to Eq. (1)), we can find that the discriminant values are negative inside an oval-shaped area centered at Φ = 0° and ϕ_s = 270°. For example, when Φ = 0°, if ϕ_s is inaccurately estimated with a 20° shift from 210° to 230°, the discriminant changes from a positive to a negative value. Overall, the "negative" area is large, taking nearly 50% of the combinations.

In such cases, we thus propose an approximation method to obtain h_s instead. To be more specific, given a solar azimuth angle ϕ_s* that causes complex solutions in Eq. (6), we search for two numbers in [ϕ_s^r, ϕ_s^f], one larger than ϕ_s* and the other smaller than ϕ_s*, denoted as ϕ_s^+ and ϕ_s^−, respectively. Both have real solutions for h_s in Eq. (6), i.e. h_s^+ and h_s^−, and they are chosen as close as possible to ϕ_s*. The resultant solar hour angle h_s is then obtained by performing linear interpolation between the two values, h_s^+ and h_s^−, according to the ratios of ϕ_s* to ϕ_s^+ and ϕ_s^−. In other words, since solar azimuth angles and solar hours are monotone and have a near-linear relationship [11,12], the solar hour (say h_s) corresponding to an estimated solar azimuth angle ϕ_s* can be interpolated from the solar hours (h_s^+ and h_s^−) of two other azimuth angles (ϕ_s^+ and ϕ_s^−).

It is worth noting that, in the above formulation, we purposely ignore the other component of a sun position, i.e. the solar elevation angle θ_s, and choose to use the solar azimuth angle ϕ_s alone because of its reliability. First, in comparison to ϕ_s, the values of θ_s and the time of day are not in a one-to-one relationship. As shown in Fig. 3, the sun reaches the same solar elevation angle twice in a day, once in the morning and once in the afternoon. Second, if a photo was taken by a camera that is not held parallel to the horizon (e.g. the user held the camera upward when making a shot of the upper part of a building), the estimated θ_s from this photo is unreliable and requires extra information for correcting its value.

Fig. 3. Example of the sun position in a day (26 March 2012, in London, United Kingdom).

Fig. 5. A contour map of the discriminant.
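The computation described in this sub-section can be sketched as follows. This is our own illustrative Python code (names are ours): the sign of the hour angle is resolved from the azimuth (eastern sky means morning), the spurious quadratic root is rejected against the unsquared relation from which Eq. (6) is obtained, and the interpolation-based approximation is not reproduced here (the function simply returns None in the imaginary case).

```python
import math

def initial_time_of_capture(delta_deg, lat_deg, phi_s_deg):
    """Initial estimate T^e in hours, [0, 24), from Eqs. (6)-(8).

    delta_deg: sun declination, lat_deg: latitude Phi, phi_s_deg: true solar azimuth
    (clockwise from north). Returns None when the discriminant is negative, i.e. the
    imaginary case handled by the approximation method in the text.
    """
    d, p, a = map(math.radians, (delta_deg, lat_deg, phi_s_deg))
    A = math.sin(a) * math.cos(d) * math.sin(p)   # Eq. (7)
    B = math.cos(a) * math.cos(d)
    C = math.sin(a) * math.sin(d) * math.cos(p)

    disc = A * A + B * B - C * C                  # the discriminant discussed above
    if disc < 0:
        return None

    # Azimuth in the eastern half of the sky means morning, i.e. a negative hour angle.
    sign = -1.0 if (phi_s_deg % 360.0) < 180.0 else 1.0
    denom = 2.0 * (A * A + B * B)
    best_h, best_resid = None, float("inf")
    for root in ((2 * A * C + 2 * B * math.sqrt(disc)) / denom,
                 (2 * A * C - 2 * B * math.sqrt(disc)) / denom):
        h = sign * math.degrees(math.acos(max(-1.0, min(1.0, root))))
        # Keep the root consistent with the linear relation A cos(h) - B sin(h) = C,
        # which Eq. (6) squares; the other root is spurious.
        resid = abs(A * math.cos(math.radians(h)) - B * math.sin(math.radians(h)) - C)
        if resid < best_resid:
            best_h, best_resid = h, resid
    return (best_h / 15.0 + 12.0) % 24.0          # T^e = h_s / 15 + 12

# Example: declination 16 deg (around August 8), latitude 51.5 deg, azimuth 242 deg
# -> roughly 15.0, i.e. 3 PM solar time.
print(initial_time_of_capture(16.0, 51.5, 242.0))
```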
Fig. 6. Determining the camera viewing direction of an outdoor photo (an example with EXIF information is given at the leftmost end) in three cases: (a) the photo has both GPS and geoname tags, (b) the photo has GPS tags only, and (c) the photo has geoname tags only (see Section 4.2 for details).

4.2. Azimuth angle adjustment

In practice, the solar azimuth angle obtained in Section 4.1 is relative to the camera viewing direction, not to due north as defined in a celestial coordinate system, cf. Section 3. In this sub-section, we propose general algorithms for estimating the true camera viewing direction in order to adjust the obtained solar azimuth angle. Specifically, in order to obtain the true azimuth angle of the sun in a geographical location (i.e. ϕ_s, defined by the celestial coordinate system), the camera viewing direction of a photo taken in that location needs to be determined and employed to compensate the solar azimuth angle ϕ_c, which is estimated from the scene contents by assuming that the camera viewing direction is aligned to due north. In other words, we should find the camera viewing direction ϕ_Δ to perform the adjustment as follows:

\phi_s = \phi_c + \phi_\Delta.    (9)
In particular, the camera viewing direction is the direction pointing from the camera to the target that the camera is focused on. Given an outdoor photo, the camera can be located by the associated GPS tags, e.g. the stored GPS coordinates in EXIF. Also, the geographical location of the target photographed in the photo can be inferred from the associated geoname tags [25], e.g. photo tags containing the name of a landmark or location-related distinct objects. According to the availability of the two types of information, we classify the problem of finding a camera viewing direction into three cases: (1) photos with both GPS and geoname tags, (2) photos with GPS tags only, and (3) photos with geoname tags only, as detailed in the following sections.

4.2.1. Photos with both GPS and geoname tags

If a photo has both GPS and geoname tags, we need to know the GPS coordinate of the geoname for calculating the camera viewing direction, as shown in Fig. 6(a). Since people tend
to give various kinds of tags to annotate a photo [25], we first adopt the GeoNames geographical database (see footnote 3) to remove non-geoname annotations. All the remaining geoname tags, e.g. "Big Ben" and "London" in Fig. 6, are then collected as an input query to obtain a unique GPS coordinate in the GeoNames database. For example, the name "Big Ben" can refer to many famous clock towers around the world, but the combination of "Big Ben" and "London" is specific to the great bell of the clock at the north end of the Palace of Westminster in London.

Since the Earth is round, the distance between two points on the Earth's surface is given by the great-circle distance [11,12]. Therefore, after we obtain the GPS coordinate of the photo geonames, the camera viewing direction can be calculated according to the Haversine formula as follows:

\phi_\Delta = \operatorname{atan}\!\left( \frac{\cos\Phi_2 \sin(\lambda_2 - \lambda_1)}{\cos\Phi_1 \sin\Phi_2 - \sin\Phi_1 \cos\Phi_2 \cos(\lambda_2 - \lambda_1)} \right),    (10)

where Φ_1 and λ_1 are the latitude and longitude of the camera location, respectively, and Φ_2 and λ_2 are the latitude and longitude in the GPS coordinate of the photo geonames.
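A small Python sketch of Eq. (10) follows (ours, not the paper's code), using atan2 to resolve the quadrant of the bearing; the example coordinates are approximate and only illustrative.

```python
import math

def camera_viewing_direction(cam_lat, cam_lng, target_lat, target_lng):
    """Initial bearing (degrees clockwise from north) from the camera position
    to the geoname position, following Eq. (10); atan2 resolves the quadrant."""
    phi1, lam1 = math.radians(cam_lat), math.radians(cam_lng)
    phi2, lam2 = math.radians(target_lat), math.radians(target_lng)
    y = math.cos(phi2) * math.sin(lam2 - lam1)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(lam2 - lam1)
    return math.degrees(math.atan2(y, x)) % 360.0

# Example (approximate coordinates): from a viewpoint on Westminster Bridge (51.5007 N, -0.1220 E)
# toward the Big Ben GPS coordinate (51.5007 N, -0.1246 E) -> roughly due west (~270 deg).
print(camera_viewing_direction(51.5007, -0.1220, 51.5007, -0.1246))
```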
3 GeoNames: http://www.geonames.org/
4.2.2. Photos with GPS tags only

If a photo I has GPS tags only, we virtually stand at the camera location and rotate 360° to observe which view looks most like the captured scene content of the photo, as shown in Fig. 6(b). Following an approach similar to [18], we first use the Google Maps API (see footnote 4) to generate a Google Street View panorama at the GPS location of I. Next, we uniformly sample a street view every 15° from the panorama to obtain 24 candidate images for visual matching [26]. The camera viewing direction is then set to the direction from the camera to the candidate image having the highest visual similarity with I.

4.2.3. Photos with geoname tags only

If a photo I has geoname tags only, we first convert the geonames into a unique target location L, as explained in Section 4.2.1. Next, we rotate a virtual camera around L to observe from which location the view looks most like the captured scene content of the photo, and that location is selected as the camera location where I was photographed, as shown in Fig. 6(c). Technically, we use the Google Maps API to generate a sequence of street views at locations on the circumferences of a set of concentric circles, every 15 angular degrees. For each location and virtual camera angle, we obtain one street view image as a candidate view toward the target location L. The camera position and viewing direction are then set to be the same as those of the candidate image having the highest visual similarity with I [26].
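The heading search used in the two cases above can be sketched as follows. This is an illustrative simplification of ours: `fetch_view` stands in for a street-view query at a fixed location (e.g. a thin wrapper around a Street View image service), and the color-histogram similarity is only a placeholder for the near-duplicate matching method of [26].

```python
import numpy as np

def color_histogram_similarity(img_a, img_b):
    """Stand-in similarity measure (normalized histogram intersection); the paper
    uses the near-duplicate keyframe matching of [26] instead."""
    ha, _ = np.histogramdd(img_a.reshape(-1, 3).astype(float), bins=(8, 8, 8), range=[(0, 256)] * 3)
    hb, _ = np.histogramdd(img_b.reshape(-1, 3).astype(float), bins=(8, 8, 8), range=[(0, 256)] * 3)
    return np.minimum(ha / ha.sum(), hb / hb.sum()).sum()

def estimate_heading(photo, fetch_view, step_deg=15, similarity=color_histogram_similarity):
    """Sample one candidate street view every `step_deg` degrees around the camera
    location (24 candidates for 15 degrees) and return the compass heading whose
    view is most similar to the photo.

    `fetch_view(heading_deg)` is assumed to return an H x W x 3 image for the given
    heading; it is a hypothetical helper, not an API of any particular library.
    """
    best_heading, best_score = None, -1.0
    for heading in range(0, 360, step_deg):
        score = similarity(photo, fetch_view(heading))
        if score > best_score:
            best_heading, best_score = heading, score
    return best_heading
```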
Fig. 7. Relationship between the time variables Tt, To, and Te associated with a given set of outdoor photos (See Section 4.3 for details).
4 Google Maps API: http://code.google.com/apis/maps/

Table 1. The collected dataset for experiments.

Latitude region   Location               Label   Photo #
N66.5°–N45°       London                 LON     49
                  Paris                  PAR     15
                  Dresden                DRE     398
N30°–N45°         Mount Fuji             MFJ-1   149
                  Mount Fuji             MFJ-2   1,296
                  Los Angeles            LAX     17
                  California (freeway)   CFF     78
N23.5°–N30°       Taipei                 TPE-1   46
                  Taipei                 TPE-2   54
Total: 2102

4.3. Joint alignment from multiple photos

In practice, since a traveller often takes multiple photos in a single trip, in this sub-section we propose an optimization framework to jointly estimate the time of capture by taking into account the temporal relationship among the photos, in order to further increase the overall estimation accuracy. Considering the naïve case, the given photos from a user are assumed to be taken at the same location; for example, a user from Europe brings his/her camera on a short trip to Mount Fuji in Japan. Let P = (P_1, …, P_N) denote the sequence of N given outdoor photos P_i, i ∈ {1, …, N}. We assume that T_i^t is the true local time associated with P_i, and that the photos in P are presented in chronological order of being taken, i.e. according to the original time stamp T_i^o stored in P_i. Since the photos in P are basically from the same camera, the time difference between T_i^t and T_i^o is the same for every photo P_i in P. For example, if the first photo P_1 has T_1^o = 08:10:00 AM and T_1^t = 08:00:00 AM (T_1^t is 10 min earlier than T_1^o), we will guess that T_2^t is 09:00:00 AM given T_2^o = 09:10:00 AM (based on the same difference between T_1^t and T_1^o). For our analysis, note that each time stamp is represented by its elapsed time in seconds from a given time origin. As shown in Fig. 7(a), if we consider T^o = (T_1^o, …, T_N^o) and T^t = (T_1^t, …, T_N^t) as two time series of P, then T^t is T^o with a time shift ΔT:

T^t = T^o + \Delta T = (T_1^o + \Delta T, \ldots, T_N^o + \Delta T).    (11)

Since the set of independently estimated times of capture T_i^e of P_i (i.e. T^e = (T_1^e, …, T_N^e)) is an estimate of T^t, T^o and T^e also appear to have a linear relationship if the estimation is perfect (with no estimation error). That is, according to Eq. (11), T^o and T^e are in the linear form T^e = T^o + w_0, where w_0 is a time shifting parameter. However, observing that actually T_i^t = T_i^e + ε_i = T_i^o + w_0 + ε_i due to the estimation error ε_i, we can exploit T^o to determine w_0 from T^e in a regression manner [27], and use the w_0 value and T^o to re-estimate each T_i^e. In other words, by taking the linear regression between T^o and T^e, we know that measuring the model parameter w_0 is equivalent to estimating the time shift ΔT, cf. Fig. 7(b). Meanwhile, we are also able to minimize the mean square estimation error (MSE):

e = \sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} (T_i^t - T_i^e)^2 = \sum_{i=1}^{N} (\Delta T - w_0)^2.    (12)

In practice, possible data outliers (the T_i^e's that do not fit the regression model) are removed before determining the resultant time shift ΔT* by employing RANSAC mechanisms [27]. As a result, the time of capture of each P_i (including the removed outliers) can be obtained as T_i* = T_i^o + ΔT*.
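A simplified sketch of the joint alignment follows (our own illustration): the per-photo residuals T_i^e − T_i^o are treated as samples of the shift w_0, outliers are rejected with a minimal RANSAC-style consensus loop rather than the full regression machinery of [27], and every photo (including outliers) is re-timed with the resultant shift.

```python
import random

def joint_alignment(t_orig, t_est, inlier_tol_hours=1.5, n_trials=200, seed=0):
    """Estimate the common time shift between original stamps t_orig and the
    per-photo initial estimates t_est (both in hours), robust to outliers.

    Each trial hypothesizes the shift from one photo, counts photos whose residual
    is within `inlier_tol_hours`, and the final shift is the mean residual over the
    largest consensus set (i.e. a least-squares fit on the inliers, cf. Eq. (12)).
    """
    residuals = [e - o for o, e in zip(t_orig, t_est)]   # samples of w0 = T^e - T^o
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        w0 = rng.choice(residuals)
        inliers = [r for r in residuals if abs(r - w0) <= inlier_tol_hours]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    shift = sum(best_inliers) / len(best_inliers)
    return [(o + shift) % 24.0 for o in t_orig], shift   # corrected times and Delta T*

# Toy example: the camera clock is about 3 h fast; the third initial estimate is an outlier.
t_orig = [11.2, 12.5, 14.0, 15.3]
t_est = [8.3, 9.4, 21.0, 12.4]
print(joint_alignment(t_orig, t_est))   # shift is roughly -3 h despite the outlier
```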
5. Experimental results and discussions

5.1. The dataset

In the experiments, a testing photo is required to have the ground truth of the local time of capture and the GPS coordinate. All this information is easy to obtain using a modern smartphone, but it is not easy to find an existing large photo source (e.g. image databases and social media sites) that provides all of it at once. In particular, it is extremely difficult to ensure whether a recorded time stamp is trustworthy, as discussed in Section 1. Also, for the task of joint alignment from multiple photos (cf. Section 4.3), some photos have to be known to be taken by the same user (camera) in a sequence. Therefore, we manually collected a set of photos to form the testing dataset in three main ways: (1) We asked several subjects who were travelling to record the needed information with their smartphones when taking photos; e.g. built-in smartphone GPS receivers can help us locate the exact camera location. We finally have three photo collections, namely PAR, LAX, and CFF, as part of our dataset in Table 1. The photos in PAR,
Fig. 8. Sample photos in the dataset.
LAX, and CFF are taken in the Paris urban area, the Los Angeles urban area, and on a freeway in California, respectively. (2) We searched for photos containing GPS tags on a photo sharing website, Panoramio (see footnote 5). In order to know the true local time, Big Ben, a famous clock tower in London, was chosen as our search target. We kept only the photos in which the time on the tower clock is visually recognizable to human eyes and adopted it as the corresponding time of capture. This collection is named LON in Table 1. (3) We looked for live online cameras with known geographical information and grabbed the captured images from them. In Table 1, we have 1 camera in the Dresden urban area in Germany (DRE), 2 cameras around Mount Fuji in Japan (MFJ-1 and MFJ-2), and 2 cameras in the Taipei urban area (TPE-1 and TPE-2).

In summary, we have 2102 photos in total in our dataset, and they are all currently from the Temperate Zone of the Northern Hemisphere (23.5° North to 66.5° North in latitude) [11,12]. Some sample photos in the dataset and the statistics of the dataset are shown in Figs. 8 and 9, respectively. Except for LON, in which no photos originally belong to the same owner, all the other photo subsets can be used for the joint alignment task. In addition, it is known that the local time (clock time) everywhere within the same time zone (generally 15° of
5 Panoramio: http://www.panoramio.com/
longitude wide) is set to be the same [11,12]. We then need to correct it to the true local time (solar time) according to the geographical location (GPS coordinates) where a photo was actually taken, i.e. by linear interpolation.

5.2. Evaluations of the initial estimated time of capture

In order to validate the estimation effectiveness of our approach, we first formulate the Expectation of the Estimated Time Error (E-ETE) for a single photo P_i (without the joint alignment operated) in a photo collection. Let X^t and X^e denote random variables of the ground truth and the estimated time associated with P_i, respectively. The E-ETE is defined as:

E\left[\,\lvert X^t - X^e \rvert\,\right] = \int_{t=D_S}^{D_E} \int_{e=D_S}^{D_E} \lvert t - e \rvert \, f_{X^t, X^e}(t, e) \, dt \, de,    (13)
where f_{X^t, X^e}(t, e) is the joint probability of X^t = t and X^e = e, and D_S and D_E are the start and the end of a day, respectively. Thus, if we assume a naive estimator that always guesses the estimated time X^e in a random fashion, we can obtain a performance baseline for each of the collected photo collections, given as the E-ETE_base values in Table 2(a).
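For illustration, the random-guess baseline of Eq. (13) can be approximated numerically for a photo collection with known ground-truth times, assuming the naive estimator guesses uniformly over the day; this sketch is ours and is not necessarily how the values in Table 2(a) were computed.

```python
import random

def baseline_e_ete(ground_truth_hours, day_start=0.0, day_end=24.0, n_samples=100000, seed=0):
    """Monte Carlo approximation of E[|X^t - X^e|] in Eq. (13) for a naive estimator
    that guesses uniformly at random within [day_start, day_end]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.choice(ground_truth_hours)    # draw one photo's true time
        e = rng.uniform(day_start, day_end)   # random guess
        total += abs(t - e)
    return total / n_samples

# Example: three photos taken around noon; a uniform random guesser is off by roughly 6 h on average.
print(baseline_e_ete([11.5, 12.0, 13.25]))
```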
Fig. 9. The statistics (the ground truth time distribution) of our dataset.
Note that an E-ETE value is represented by a real number in units of hours; e.g., the E-ETE_base value of 4.3 for LON means an average estimated time error of 4 h plus 18 min. The statistical results (distributions) of the initial estimated time of capture (cf. Section 4.1) in each photo collection are shown in Fig. 10. The accumulated E-ETE for time errors from 0 to 14 h is given in Fig. 11, and the accumulated E-ETE value is labelled at the end of each curve. In comparison with the E-ETE_base (cf. Table 2(a)), the accumulated E-ETE shows that the average estimation error of our single-photo based initialization is around 2.5 h, competitive with the state-of-the-art results reported by Lalonde et al. [14]. However, our single-photo based initialization is more suitable for practical use than the state-of-the-art method because the proposed approximation method (cf. Section 4.1) still works when a measured sun position is inaccurate. We summarize the total mean of the estimated time error (ETE) and the expectation of the estimated time error (E-ETE), denoted as E_mean^s and E-ETE in Table 2(a), respectively.

In Fig. 9, we observe that the ground truth time distributions affect the individual E-ETE_base values. For example, the time distributions of LON, PAR, DRE, MFJ-1 and MFJ-2 all resemble a uniform distribution and obviously have larger E-ETE_base values (≥ 4.0). On the contrary, the time distributions of LAX, CFF, TPE-1 and TPE-2 cluster in certain time intervals, which results in lower E-ETE_base values. We collected these two types of datasets (uniform and non-uniform) to demonstrate that the results of the initial estimation are not affected by the ground truth time distribution. As observed in Figs. 10 and 11, based on the error distributions of the initial estimation and the accumulated E-ETE curves of each photo collection, we cannot tell which type of ground truth time distribution a photo collection belongs to. Also, in Table 2(a), the E-ETE values of CFF and TPE-1 are 3.2307 and 2.9347, respectively, which are as high as those of PAR and MFJ-2. This shows that there is no difference in the experimental results between uniform and non-uniform ground truth time distributions.

These error distributions can be further categorized into three types. The photo collections of TPE-2 and LAX (Fig. 10(i) and (f)) are
Table 2. Statistics of the experimental results in our dataset.

                 (a) Initial estimation                     (b) Joint alignment
          E_mean^s     E-ETE_base     E-ETE          E_mean^m     RI %
LON       3.8589       4.3019         3.7857         –            –
PAR       2.7223       4.3980         3.0098         0.2195       91.94
DRE       2.4146       4.3641         3.8803         0.3740       84.51
MFJ-1     2.3041       4.1795         2.5805         0.4582       80.11
MFJ-2     3.1116       4.3817         3.2373         1.1611       62.68
LAX       1.0618       3.5000         1.2647         0.9268       12.71
CFF       3.0190       3.6295         3.2307         1.8203       39.71
TPE-1     2.7084       3.8377         2.9347         1.0695       60.51
TPE-2     1.3764       3.7857         1.6574         1.0900       20.81
Avg.      2.5086       4.0420         2.8423         0.8899       64.53
the first type, with a concentration of ETE values around zero. Hence, the E-ETEs shown in Fig. 11(i) and (f) are nearly constant (E-ETE = 1.65 and 1.26, respectively) for accumulated ranges ≥ 2.5 h over all photos, which is also the ideal case generated by the single-photo based approach (cf. Section 4.1). The second type is like the first type but has one more isolated peak, e.g. the TPE-1 collection (Fig. 10). Therefore, the E-ETE in Fig. 11(h) shows a sudden increment (from 1.06 to 2.20) at an accumulated range of 10.5 h. The third type is the worst case generated by the single-photo based approach, including LON, PAR, DRE, MFJ-1, MFJ-2, and CFF. As shown in Fig. 11(a)–(e) and (g), these collections all have an incremental E-ETE. The above results show that the single-photo based initialization is not good enough. Therefore, in the next section, we will demonstrate how the accuracy can be improved using our joint alignment approach.

5.3. Joint alignment from multiple photos

The statistics of the ETE values jointly estimated from multiple photos (cf. Section 4.3) are given in Table 2(b). In order to evaluate
Fig. 10. Time error distributions of the initial estimation (Section 4.1) in each photo collection.
the relative improvement in time estimation performance, we use the Relative Improvement Index formulated as follows:

RI = \frac{\lvert E_{mean}^{m} - E_{mean}^{s} \rvert}{E_{mean}^{s}} \times 100\%,    (14)

where E_mean^m is the average of the ETE values obtained by using the joint alignment. Overall, in Table 2(b), we observe a significant improvement of E_mean^m compared to E_mean^s for all photo collections, from 2.5086 down to 0.8899 on average, i.e. a 64.53% relative improvement. For the collections with a large E_mean^s, the improvement is much more obvious. For example, the average ETE values of PAR and MFJ-1 decrease from 2.7223 and 2.3041 down to as low as 0.2195 (≈ 13 min) and 0.4582 (≈ 27 min), respectively. For those with a smaller E_mean^s in the original, say LAX and TPE-2, we are still able to obtain a 10% to 20% relative improvement. As mentioned in Section 1, since taking multiple photos in a single trip is common in daily life, the results are quite encouraging and also imply the potential of our system for practical applications.

We can observe that the E_mean^m of CFF is relatively higher than those of the other photo collections, although the value is at an acceptable level of 2 h (RI = 39.71%). Several typical CFF photo samples are illustrated in Fig. 12. As shown in Fig. 12(d), it was very late in the afternoon (GT = 17:57). Sunrise and sunset photos are sometimes hard to distinguish even by human eyes (cf. Fig. 13(a)). If a sunset photo is incorrectly recognized as a sunrise one, as in this CFF case (Fig. 12(d)), where the initial estimation time (ES) is 08:25, we would have an ETE value as large as about 10 h. This reveals a limitation of single-photo based estimation. However, such outliers can be corrected by our proposed joint
alignment. The estimation time (EM) of this photo using the joint alignment is corrected to 16:22, and the ETE is largely reduced.
5.4. Comparisons of camera-taking angle estimation strategies

Experiments are further performed to compare and discuss the three camera-taking angle estimation strategies proposed in Section 4.2. We took the LON photo collection in our dataset as the testing benchmark, since its photos are collected from different users and are relatively diverse in capture conditions. For the first strategy (Section 4.2.1), since both the GPS and geoname tags are readily available, the estimation is straightforward based on Eq. (10), and we found that inaccuracy might arise when a geo-tagged landmark deviates from the center of the photo; e.g., in Fig. 13(c) the Big Ben is placed towards the side instead of at the center of the photo. Based on the photo information (i.e. the photo size and the deviation of the landmark from the center in pixels) and the camera parameters (i.e. the focal length and the CCD (charge-coupled device) size), we can calculate the "shifting degrees" [28] in estimating the viewing direction of the camera. The average deviation is 7.05°, i.e. around 28.2 min. For the second strategy (Section 4.2.2), since only the GPS tags are readily available, the inaccuracy is mainly caused by the visual matching phase. The average error is 11.17°, i.e. around 44.68 min. For the third strategy (Section 4.2.3), the inaccuracy is similarly due to visual matching; note that the visual matching task here is different from the one in the second strategy. The average error is 10.79°, i.e. around 43.16 min.
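The "shifting degrees" correction and the degree-to-minute conversion mentioned above can be sketched as follows (our own illustration; the simple pinhole-camera assumption and the example numbers are ours, not values from the paper).

```python
import math

def landmark_offset_degrees(pixel_offset, image_width_px, focal_length_mm, sensor_width_mm):
    """Horizontal angle (degrees) between the camera's optical axis and a landmark
    that is `pixel_offset` pixels away from the image center, under a pinhole model."""
    offset_mm = pixel_offset * sensor_width_mm / image_width_px   # offset on the sensor plane
    return math.degrees(math.atan2(offset_mm, focal_length_mm))

def angle_error_to_minutes(angle_deg):
    """One hour of solar time corresponds to 15 degrees, i.e. 4 minutes per degree."""
    return abs(angle_deg) * 4.0

# Example: a landmark 600 px left of center in a 4000 px wide photo, 24 mm focal length
# on a 23.6 mm wide sensor -> magnitude of about 8.4 deg, i.e. roughly 34 min of solar time.
deg = landmark_offset_degrees(-600, 4000, 24.0, 23.6)
print(deg, angle_error_to_minutes(deg))
```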
Fig. 11. The accumulated E-ETE curves and the expectation of the estimated time error (E-ETE) of each photo collection.
Fig. 12. Sample photos in the CFF dataset, all taken on 24 June, 2011, along a freeway in California, USA. The red circles mark the location where the corresponding photo was taken and the yellow arrows indicate the camera viewing direction. The original time stamps (OT), the true local time (GT), the initial estimation time (ES), and the estimation time using the joint alignment (EM) are also given in 24 h format (HH:MM). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
5.5. Effects of different components of the proposed approach

The purpose of this section is to investigate the effects of the different components of the proposed approach on the performance. For clarity, we label the different components as C1 (Section 4.1), C2-1 (Section 4.2.1), C2-2 (Section 4.2.2), C2-3 (Section 4.2.3), and C3 (Section 4.3). We took the CFF photo collection in our dataset as the testing benchmark, since its photos are captured from a moving camera and the number of photos is relatively large.
Trial#1 – How many photos does the method in [14] produce imaginary solutions for? Based on our experiments, approximately 5% of the photos on average produce imaginary solutions with the method in [14]. For example, there are 4 photos with imaginary solutions in the CFF photo collection.

Trial#2 (C1 only) – How does the proposed method perform when only the approximation method is applied? We obtain E_mean^s = 7.8603 (single-photo based average time estimation error). A large value is expected since the exact camera
Fig. 13. Negative factors to the time estimation.
viewing direction is unknown and we assumed that it is aligned to true north.

Trial#3 (C1 + C2-1) – How does C2-1 improve the performance of Trial#2? We obtain E_mean^s = 3.0190, with a relative improvement of 61.59%.

Trial#4 (C1 + C2-2) – How does C2-2 improve the performance of Trial#2? We obtain E_mean^s = 3.3018, with a relative improvement of 57.99%.

Trial#5 (C1 + C2-3) – How does C2-3 improve the performance of Trial#2? We obtain E_mean^s = 3.2975, with a relative improvement of 58.05%.

Trial#6 (C1 + C2-1 + C3) – How does the combination of C1 + C2-1 + C3 improve the performance of Trial#3? We obtain E_mean^m = 1.8203 (joint-alignment based average time estimation error), with a relative improvement of 39.71%.

5.6. Discussions

Based on the observations in Sections 5.2 and 5.3, the estimation accuracy of our approach is influenced by a variety of factors. In this section, we further divide the influential factors into three categories and briefly discuss their impact on the time estimation, as follows.

5.6.1. Factors related to sky

Sky is the fundamental basis for estimating the sun position (cf. Section 4.1), but its effectiveness is naturally proportional to the sky quality in a given photo. For example, we found that a few TPE-1 and TPE-2 photos contain smog, which reduces visibility and causes inaccurate estimation. Also, the weather condition sometimes decreases the estimation performance. The estimation results of photos with a clear sky are more reliable than those with
a patchy sky, cf. Fig. 13(b). As described in [14], if the sun is behind the camera, the sky models are uncertain about the sun's position and introduce estimation errors. In addition, we observed that some photos collected from photo sharing websites look unreal and synthetic-like, e.g., Fig. 13(c). People might edit these photos by applying special effects before uploading them, such that the statistical properties of a "natural sky" in the photos are altered accordingly. Besides, as described in Section 4, the current segmentation of an image into sky and non-sky regions is based on a pixel-wise labeling mechanism [24]. More advanced algorithms for scene parsing and scene understanding (e.g., [29,30]) could be employed to improve the segmentation quality.

5.6.2. Factors related to astronomy

According to astronomy [11,12], the path of the sun across the sky in a day changes periodically with the seasons. In other words, the length of daylight and the sun's direction and height in the sky at a given clock time vary accordingly. In order to analyze this impact on the accuracy of the estimated time of capture, we picked out a total of 1094 photos from the MFJ-2 photo subset. They were taken exactly at every hour between 06:00 and 18:00 during the whole year of 2011 by the same live camera located at the east end of Mount Fuji. The average ETE values for each pair of month and hour are illustrated in Fig. 14. We can observe that the early morning hours generally have the largest ETE values. One reason is that the viewing direction of the live camera is set to the west. When the sun rises in the east (around the early morning hours), the relative position between the sun and the camera (i.e., the sun behind the camera's viewing direction) causes large ETE values, as mentioned above. In addition, between May and August, it is found that even the late morning hours have relatively large errors. This period is summer in the latitude areas of
Mount Fuji, such that the sun stays longer in the eastern sky, which extends the sun's negative effects on the estimation accuracy.

Fig. 14. Statistics of the average ETE values over the photos taken by a live camera toward Mount Fuji, Japan, during the daytime of the whole year of 2011. Each block indicates the average ETE value of the corresponding month (row) at the corresponding hour (column). The values are encoded by the color spectrum from blue (low values) to red (high values). (Best viewed in color.) (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

5.6.3. Factors related to camera position and pose

Since the camera viewing direction is involved in the determination of the true azimuth angle of the sun, inaccurate estimates of the camera viewing direction cause errors in the estimated time of capture. First, inaccurate GPS introduces computation errors in the Haversine formula (cf. Section 4.2.1). We observed that the GPS tags of some collected photos are unreliable. The reason might be that these photos were downloaded from public photo sharing websites and the associated GPS tags were manually given by the user rather than actually obtained from a GPS receiver. Further, the azimuth angle adjustment methods in Sections 4.2.2 and 4.2.3 rely on Google Street View images to determine the camera viewing direction, and they also suffer from the drawbacks of image matching [26]. For example, if the street views around the location are not up to date or are captured in poor quality (cf. Fig. 13(d)), the matched view for the comparing photo could be wrong.

6. Conclusions and future work

In this work, we proposed a novel system, Photo Sundial, for estimating the time of capture of outdoor photos by exploiting astronomical theory, motivated by the use of traditional sundials. The proposed system was validated on a large dataset of 2102 consumer photos and achieved competitive performance. Our joint alignment approach can effectively reduce the average estimated time error to less than 0.9 h, with a significant 65% relative improvement compared to the state-of-the-art method [14] (2.5 h). Since people often take multiple photos in a single trip in daily life, the results demonstrate that our system is effective for practical applications. We believe that the proposed system is fundamental for probing the temporal context of media and has wide applications in various multimedia tasks. Given that a huge amount of real-world photos has been created and keeps increasing by 375 petabytes annually [31], our system is also able to alleviate the problem that a large part of existing and newly produced photos have problematic time stamps, which degrade the user experience in most time-aware multimedia applications. Further, this work can be a complementary study to the active research on spatial aspects of photos, e.g. geo-location estimation. Also, the proposed approach can be extended to the time estimation task of videos by taking advantage of the possibly richer spatial-temporal information embedded in videos.

Acknowledgements

This work was supported by the Ministry of Science and Technology of Taiwan under Grants MOST-103-2221-E-001-007MY2 and MOST-103-2221-E-011-105. The authors would like to extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for its funding of this International Research Group (IRG14-28).
References

[1] M. Cooper, J. Foote, A. Girgensohn, L. Wilcox, Temporal event clustering for digital photo collections, ACM Trans. Multimed. Comput. Commun. Appl. 1 (3) (2005) 269–288.
[2] M. Boutell, J. Luo, Beyond pixels: exploiting camera metadata for photo classification, Pattern Recognit. 38 (6) (2005) 935–946.
[3] J. Yuan, J. Luo, Y. Wu, Mining compositional features from gps and visual cues for event recognition in photo collections, IEEE Trans. Multimed. 12 (7) (2010) 705–716.
[4] L. Cao, J. Luo, T.S. Huang, Annotating photo collections by label propagation according to multiple similarity cues, in: Proceedings of the ACM International Conference on Multimedia (MM'08), 2008, pp. 121–130.
[5] Y. Arase, X. Xie, T. Hara, S. Nishio, Mining people's trips from large scale geotagged photos, in: Proceedings of the ACM International Conference on Multimedia (MM'10), 2010.
[6] X. Lu, C. Wang, J.-M. Yang, Y. Pang, L. Zhang, Photo2trip: generating travel routes from geo-tagged photos for trip planning, in: Proceedings of the ACM International Conference on Multimedia (MM'10), 2010, pp. 143–152.
[7] C.-Y. Fu, M.-C. Hu, J.-H. Lai, H. Wang, J.-L. Wu, Travelbuddy: interactive travel route recommendation with a visual scene interface, in: MultiMedia Modeling, 2014, pp. 219–230.
[8] C.-C. Hsieh, W.-H. Cheng, C.-H. Chang, Y.-Y. Chuang, J.-L. Wu, Photo navigator, in: Proceedings of the ACM International Conference on Multimedia (MM'08), 2008, pp. 419–428.
[9] N. Snavely, R. Garg, S.M. Seitz, R. Szeliski, Finding paths through the world's photos, ACM Trans. Graph. 27 (3) (2008) 11–21.
[10] R.N. Mayall, M.W. Mayall, Sundials: Their Construction and Use, Dover Publications, New York, NY, USA, 2000.
[11] F.H. Shu, The Physical Universe: An Introduction to Astronomy, University Science Books, Sausalito, CA, USA, 1982.
[12] H. Karttunen, P. Kroger, H. Oja, M. Poutanen, K.J. Donner, Fundamental Astronomy, Springer, New York, NY, USA, 2007.
[13] A.-J. Cheng, Y.-Y. Chen, Y.-T. Huang, W.H. Hsu, H.-Y. M. Liao, Personalized travel recommendation by mining people attributes from community-contributed photos, in: Proceedings of the 19th ACM International Conference on Multimedia, 2011, pp. 83–92.
[14] J.-F. Lalonde, A.A. Efros, S.G. Narasimhan, Estimating natural illumination from a single outdoor image, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV'09), 2009.
[15] I. Reda, A. Andreas, Solar position algorithm for solar radiation applications, Solar Energy 76 (5) (2004) 577–589.
[16] K. Sunkavalli, F. Romeiro, W. Matusik, T. Zickler, H. Pfister, What do color changes reveal about an outdoor scene? in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008, pp. 1–8.
[17] R. Garg, A.L. Varna, M. Wu, Seeing ENF: natural time stamp for digital video via optical sensing and signal processing, in: Proceedings of the ACM International Conference on Multimedia (MM'11), 2011.
[18] M. Park, J. Luo, R.T. Collins, Y. Liu, Beyond gps: determining the camera viewing direction of a geotagged image, in: Proceedings of the ACM International Conference on Multimedia (MM'10), 2010, pp. 631–634.
[19] M. Bansal, H.S. Sawhney, H. Cheng, K. Daniilidis, Geo-localization of street views with aerial image databases, in: Proceedings of the ACM International Conference on Multimedia (MM'11), 2011, pp. 1125–1128.
[20] N. Snavely, S.M. Seitz, R. Szeliski, Photo tourism: exploring photo collections in 3D, ACM Trans. Graph. 25 (3) (2006) 835–846.
[21] J.-F. Lalonde, A.A. Efros, S.G. Narasimhan, What do the sun and the sky tell us about the camera? Int. J. Comput. Vis. 88 (1) (2010) 24–51.
[22] J. Hays, A.A. Efros, IM2GPS: estimating geographic information from a single image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008, pp. 1–8.
[23] A. Shrivastava, T. Malisiewicz, A. Gupta, A.A. Efros, Data-driven visual similarity for cross-domain image matching, ACM Trans. Graph. 30 (6) (2011) 154:1–154:10.
Tsung-Hung Tsai received the B.S. degree from the Department of Computer Science and Information Engineering, National Central University, Taiwan, in 2007, and the M.S. degree in electronics engineering from National Taiwan University in 2009. During 2010–2012, he was a Research Assistant with the Multimedia Computing Laboratory (MCLab), Research Center for Information Technology Innovation (CITI), Academia Sinica, Taipei, Taiwan. His research interests include multimedia content analysis, pattern recognition, and computer vision.
Wei-Cih Jhou received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2009 and 2011, respectively. She is currently a Research Assistant with the Multimedia Computing Laboratory, Research Center for Information Technology Innovation (CITI), Academia Sinica, Taipei, Taiwan. Her current research interests include image rendering, image and video processing, deep learning, and multimedia content analysis.
Wen-Huang Cheng received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2002 and 2004, respectively, and the Ph.D. (Hons.) degree from the Graduate Institute of Networking and Multimedia, National Taiwan University, in 2008. He is an Associate Research Fellow with the Research Center for Information Technology Innovation (CITI), Academia Sinica, Taipei, Taiwan, where he is the Founding Leader of the Multimedia Computing Laboratory (MCLab) and holds a joint appointment as an Assistant Research Fellow with the Institute of Information Science. Before joining Academia Sinica, he was a Principal Researcher with MagicLabs, HTC Corporation, Taoyuan, Taiwan, from 2009 to 2010. His current research interests include multimedia content analysis, multimedia big data, deep learning, computer vision, mobile multimedia computing, social media, and human-computer interaction. He has received numerous research awards, including the Outstanding Youth Electrical Engineer Award from the Chinese Institute of Electrical Engineering in 2015, the Top 10% Paper Award at the 2015 IEEE International Workshop on Multimedia Signal Processing, the Outstanding Reviewer Award at the 2015 ACM International Conference on Internet Multimedia Computing and Service, the Prize Award of the Multimedia Grand Challenge at the 2014 ACM Multimedia Conference, the K. T. Li Young Researcher Award from the ACM Taipei/Taiwan Chapter in 2014, the Outstanding Young Scholar Awards from the Ministry of Science and Technology in 2014 and 2012, the Outstanding Social Youth of Taipei Municipal award in 2014, the Best Reviewer Award at the 2013 Pacific-Rim Conference on Multimedia, and the Best Poster Paper Award at the 2012 International Conference on 3D Systems and Applications. The post-doctoral fellows under his supervision received the Academia Sinica Postdoctoral Fellowship in 2011 and 2013.
Min-Chun Hu, also known as Min-Chun Tien and Ming-Chun Tien, is an Assistant Professor with the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan. She received the B.S. and M.S. degrees in computer science and information engineering from National Chiao-Tung University, Hsinchu, Taiwan, in 2004 and 2006, respectively, and the Ph.D. degree from the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan, in 2011. She was a Post-Doctoral Research Fellow with the Research Center for Information Technology Innovation, Academia Sinica, from 2011 to 2012. Her research interests include digital signal processing, digital content analysis, pattern recognition, computer vision, and multimedia information systems.
I-Chao Shen received the B.B.A. and M.B.A. degrees in information management from National Taiwan University, Taipei City, Taiwan, in 2009 and 2011, respectively, and he is currently working toward the Ph.D. degree in computer science at the University of British Columbia, Vancouver, BC, Canada. His research interests include visual data analysis, geometry processing, and digital fabrication.
Tekoing Lim received the B.S. and M.S. degrees in mathematics from the University of Lille 1, France, in 2005 and 2007, respectively, and the Ph.D. degree in physics from Ecole Polytechnique in 2011 for his research on computational electromagnetics at CEA (Commissariat à l'Énergie Atomique, France). He is currently working on multimedia content analysis at Academia Sinica, Taiwan. His research interests include machine learning, computer vision, and artificial intelligence.
Kai-Lung Hua received the B.S. degree in electrical engineering from National Tsing Hua University in 2000 and the M.S. degree in communication engineering from National Chiao Tung University in 2002, both in Hsinchu, Taiwan, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, in 2010. Since 2010, Dr. Hua has been with National Taiwan University of Science and Technology, where he is currently an Associate Professor in the Department of Computer Science and Information Engineering. He is a member of Eta Kappa Nu and Phi Tau Phi, and a recipient of the MediaTek Doctoral Fellowship. His current research interests include digital image and video processing, computer vision, and multimedia networking.
Ahmed Ghoneim received the M.Sc. degree in software modeling from the University of Menoufia, Egypt, in 1999, and the Ph.D. degree in software engineering from the University of Magdeburg, Germany, in 2007. He is currently an Assistant Professor with the Department of Software Engineering, King Saud University. His research activities address software evolution, service-oriented engineering, software development methodologies, quality of service, net-centric computing, and human-computer interaction (HCI).
M. Anwar Hossain is an Associate Professor in the Department of Software Engineering, College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh. He received the Master's and Ph.D. degrees in Electrical and Computer Engineering from the University of Ottawa, Canada, where he was associated with the Multimedia Computing Research Laboratories. At KSU, Dr. Hossain received the IBM Faculty Award. His current research interests include multimedia cloud, multimedia surveillance and privacy, Internet of Things, smart cities, and ambient intelligence. He has authored or co-authored over 90 research articles, including journal papers, conference papers, and book chapters. Dr. Hossain has co-organized more than ten IEEE/ACM workshops, including IEEE ICME AAMS-PS 2011–13, IEEE ICME AMUSE 2014, ACM MM EMASC-2014, and IEEE ISM CMAS-CITY 2015, and has served on the technical program committees of several other conferences. He served as a guest editor of the Springer journal Multimedia Tools and Applications (MTAP) and currently serves as a guest editor of another MTAP issue and of the International Journal of Distributed Sensor Networks. He has secured several research and innovation grants totaling more than $5 million and is currently supervising a number of research students at KSU. He is a member of the IEEE, ACM, and ACM SIGMM, and a co-editor of the SIGMM Records.
Shintami C. Hidayati received the B.S. degree in informatics from Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, in 2009, and the M.S. degree in computer science and information engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, in 2012, where she is currently working towards her Ph.D. degree. Prior to joining the master's program, she worked at Institut Teknologi Sepuluh Nopember as a research staff member. Her research interests include machine learning and data mining and their applications to multimedia analysis, information retrieval, and computer vision.