Automated open-source data collection and processing: an example of OpenStreetMap and bike-sharing

Transportation Research Procedia 41 (2019) 688–693

www.elsevier.com/locate/procedia

mobil.TUM 2018 "Urban Mobility - Shaping the Future Together" - International Scientific Conference on Mobility and Transport

David Duran-Rodas*, Emmanouil Chaniotakis, Constantinos Antoniou

Technical University of Munich, Arcisstrasse 21, 80333, Munich, Germany

© 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the mobil.TUM18.

Keywords: bike-sharing; influencing factors; data mining; automated

* Corresponding author. Tel.: +49-8928910455. E-mail address: [email protected]

ISSN 2352-1465. doi:10.1016/j.trpro.2019.09.117

1. Introduction

Shared vehicles allow users short-term rental of private transport modes (e.g., car, bike, scooter) on an “as-needed basis” offered in the public space (1; 2). Users usually have to join an organization that maintains a fleet and usually pay a fee for the vehicle’s usage (3; 4). These shared transportation systems have partially mitigated transport’s negative impacts by reducing private car ownership and vehicle kilometers traveled, increasing the efficiency of roads and infrastructure usage, and shifting trips from private vehicles to alternative transport modes (4; 5; 6; 7; 8; 3). These benefits attracted research to identify factors influencing the ridership of shared vehicles (see Table 1), mainly to assist operators and policymakers in the deployment of these shared transport systems and to increase the reliability of implementations and their expansion (9).

On the other hand, the development of information and communications technology (ICT) has allowed mainly public and collaborative organizations to collect and publish large databases as open-source data (19). Open-source data is freely available data that can be used, modified and shared without copyright restrictions (20; 21). Its availability has grown significantly in the domains of traffic, weather, geography and tourism, among others. OpenStreetMap (OSM) (22) is an example of an open-source data platform in the field of geographic information, which is based on data collected by volunteers. In the field of transport, mainly governmental organizations have opened, for example, the access to public transport and shared vehicles data. These databases are large and have presented issues related to their analysis and processing (23). However, they can be managed by different automated and analytic technologies.

So far, and to the best of the authors’ knowledge, the state-of-the-art does not provide a consistent automated method that collects, analyzes and processes spatial built environment factors of multiple cities and correlates them to an external independent variable such as the ridership of shared vehicles. Therefore, this research aims to develop a method to

Table 1. Spatial features influencing ridership of shared vehicles from a literature review

Variables: Population density; Job density; Neighborhood age; Education level share; Vehicle ownership rate; Inhabitants age share; Household size share; Rent prices; Land use type share; Registered vehicles; Bus lines serving the district; Mode to commute; Public transport station; Bike-sharing station; CBD; Railways; Services; Restaurants, Coffee; Commercial enterprises; Roads length; Cycling ways length; Number of stations; On-street parking capacity; Walkability PT stations; Frequency of use; Use of the stations; Traveled distance; Vehicle Availability; Docks per station; Frequency OD pair

Reviewed studies (table columns): A=(10), B=(11), C=(12), D=(13), E=(14), F=(15), G=(16), H=(17), I=(18)

collect, analyze and process spatial open-source information automatically and correlate it to arrivals and departures of shared vehicles, in order to build a model that estimates the ridership and identifies the most influential spatial factors. Multiple cities were considered so that individual shared systems are treated as one, abstracting from geographic boundaries as a step towards estimating the transferability to other cities. We implemented this method by automatically assigning indicators to the built environment features included in the OSM database and correlating them with the ridership data of station-based bike-sharing systems (SBBS) in multiple cities. Furthermore, we built linear and non-linear models to identify the most influential factors of the built environment affecting the ridership of SBBS.

2. Method

The method starts with automated data collection, analysis, and processing, then automatically builds models for different time intervals, and finally identifies the most influential factors. The dataset is downloaded from open-source databases (e.g., OSM). It includes arrivals and departures of shared transportation systems (dependent variable) and the built environment (independent variables).

The arrivals and departures are aggregated in day intervals and zones of influence. The day intervals are time periods within the day based on the hourly ridership. Regarding the spatial unit, for station-based systems the boundaries of the zones of influence correspond to a buffer area around each station, as well as natural and human-made barriers. Since free-floating systems do not have stations, the arrivals and departures are grouped into clusters by the density-based clustering algorithm DBSCAN (24), where the centroid of each cluster represents a station.

Regarding the built environment, two indicators per zone of influence are automatically assigned to each spatial feature, one based on the distance to the station or centroid and the other based on the quantity in the zone of influence (see Figure 1). Four possible families of indicators are derived per zone of influence:

• Density: Frequency per area unit.
• InArea: Boolean, 1 if the variable is present in the zone of influence, or 0 if it is not.
• Distance min: Minimum distance from a station to a spatial feature.
• Distance min all: Minimum distance from a station/centroid to all the variables within a city.
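
The four indicator families listed above can be illustrated with a minimal sketch for one station and one feature type. It rests on several assumptions that are not taken from the paper: coordinates are projected and expressed in meters, the zone of influence is reduced to a plain circular buffer (the 400 m default is the radius reported in the Application section), and density is expressed per square kilometer. The function name `zone_indicators` and the sample coordinates are hypothetical; the placeholder distance 99999 for absent features follows the procedure described in this section.

```python
import math

# Large placeholder distance used when no feature lies inside the zone.
ABSENT_DISTANCE = 99999.0


def zone_indicators(station, features, radius_m=400.0):
    """Return the four indicators for one station and one feature type.

    station  -- (x, y) in meters (projected coordinates are an assumption)
    features -- list of (x, y) points of one feature type in the whole city
    radius_m -- radius of the circular zone of influence (simplification)
    """
    distances = [math.dist(station, f) for f in features]
    inside = [d for d in distances if d <= radius_m]
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return {
        # Density: frequency per area unit (here: features per km^2).
        "density": len(inside) / area_km2,
        # InArea: 1 if at least one feature lies in the zone, else 0.
        "in_area": 1 if inside else 0,
        # Distance min: nearest feature inside the zone, else placeholder.
        "distance_min": min(inside) if inside else ABSENT_DISTANCE,
        # Distance min all: nearest feature anywhere in the city.
        "distance_min_all": min(distances) if distances else ABSENT_DISTANCE,
    }


station = (0.0, 0.0)
bakeries = [(300.0, 0.0), (0.0, 1000.0)]  # hypothetical sample points
result = zone_indicators(station, bakeries)
far_only = zone_indicators(station, [(0.0, 1000.0)])
```

In this toy example the first call finds one bakery inside the buffer, while the second call has none inside, so Distance min falls back to the placeholder and Distance min all still reports the 1000 m distance.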

Once the spatial features are intersected with the zones of influence, the procedure continues by calculating the distance from the stations/centroids to all the spatial features. Then, Distance min is taken as the minimum distance from a station (centroid) to a feature inside the zone of influence. If a feature is not inside the zone of influence, Distance min is set to a significantly large distance (e.g., 99999). If a spatial feature is present in few zones of influence (e.g., less than 3% of the zones), the indicator Distance min all is used instead of Distance min, because such features might influence the mobility of the whole city (e.g., city center, university).

Furthermore, the spatial features are classified into points, lines or polygons. The indicator Density is considered for lines and points. For points, however, the standard deviation (SD) of their frequency in the zones of influence is used to determine whether the features are equally spread in the city. If their frequency is uniform in most of the zones of influence, the presence (i.e., 1 or 0) is considered and not the quantity. Thus, the more equally distributed variables are assigned the indicator InArea, and the others Density. Finally, if some spatial variables are not relevant for the study (e.g., telephone boxes), they are removed from the dataset.

The model building and variable selection procedure (Figure 2) begins by detecting collinearity with a Pearson correlation threshold of 0.7, as in (25), to avoid redundant variables. Then, stepwise Ordinary Least Squares regression (stepwise OLS) (26), Generalized Linear Models (GLM) with lasso selection (27), and Gradient Boosting Machine (GBM) (28) models are automatically fitted, relating the arrivals and departures per day interval to the different indicators of the spatial features.
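
The collinearity screening mentioned above can be sketched as follows. The text only states the Pearson threshold of 0.7; which variable of a correlated pair is dropped is not specified, so the greedy rule used here (keep the first variable, discard later ones that correlate with an already kept one) is an assumption. The names `pearson` and `drop_collinear`, as well as the sample columns, are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (norm_x * norm_y)


def drop_collinear(columns, threshold=0.7):
    """Keep a variable only if |r| <= threshold with every kept variable."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical indicator columns, one value per zone of influence.
columns = {
    "density_shops": [1, 2, 3, 4, 5],
    "density_restaurants": [2, 4, 6, 8, 11],  # almost proportional to shops
    "distance_min_park": [5, 1, 4, 2, 3],
}
kept = drop_collinear(columns)
```

Here `density_restaurants` is discarded because it is almost perfectly correlated with `density_shops`, whereas `distance_min_park` (r = -0.3 with shops) is retained.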
These three models are implemented to test whether the data present a linear or a nonlinear relationship. Diagnostics of the models are carried out to detect and deal with outliers and heteroscedasticity. Finally, after an assessment, the model that fits the data better and selects the lowest number of variables is chosen. The variables selected are those that most influence the models’ outcome.

3. Application

The automated method was applied to the open-source data of the bike-sharing system “Call a bike” (29) in six cities in Germany where these systems are based on stations: Hamburg, Frankfurt am Main, Stuttgart, Kassel, Darmstadt and Marburg (689 stations). These cities were chosen out of fifty because they presented the highest ridership and the best data quality.

Arrivals and departures were aggregated into day intervals representing peak and off-peak periods in the morning, afternoon and night. After a correlation analysis, the days of the week were clustered into one random workday (e.g., Monday), Saturday and Sunday, resulting in 18 time units (6 day intervals on three days of the week). Additionally, they were aggregated into zones of influence represented by the intersection of postal zones, riverbanks, Voronoi diagrams and the buffer area around the stations.

Spatial features were obtained from the OSM dataset including points (points of interest, transport, and traffic-related points), lines (roadways, waterways, railways) and polygons (land-use). After a sensitivity analysis to calculate the indicators, a buffer radius of 400 meters and an SD = 5 were adopted since they best represented the behavior of the variables. Finally, 144 independent variables were considered for the model building out of the initial 800 possible variables. Five cities were selected as a training set to build the models and a sixth city (Kassel) as a test set to validate the models.
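
The aggregation of trips into the 18 time units described in this section can be sketched as below. The hour boundaries of the six day intervals are not given in the text, so the values in `INTERVALS` are purely hypothetical, as are the interval labels, the station identifiers and the choice of Monday as the representative workday (the text mentions Monday only as an example). Trips on other weekdays are simply skipped in this sketch.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (start hour, label) pairs for six day intervals, ascending.
INTERVALS = [
    (0, "night_off_peak"),
    (6, "morning_peak"),
    (9, "morning_off_peak"),
    (12, "afternoon_off_peak"),
    (16, "afternoon_peak"),
    (20, "night_peak"),
]

# Monday stands in for the single workday; Saturday and Sunday are kept.
DAY_TYPES = {0: "workday", 5: "saturday", 6: "sunday"}


def day_interval(ts):
    """Label of the last interval whose start hour is <= the trip hour."""
    label = INTERVALS[0][1]
    for start_hour, name in INTERVALS:
        if ts.hour >= start_hour:
            label = name
    return label


def aggregate(trips):
    """Count trips per (station, day type, day interval)."""
    counts = Counter()
    for station, ts in trips:
        day_type = DAY_TYPES.get(ts.weekday())
        if day_type is None:
            continue  # weekday not among the three day types of the sketch
        counts[(station, day_type, day_interval(ts))] += 1
    return counts


# Hypothetical departures: (station id, departure time).
trips = [
    ("A", datetime(2017, 10, 30, 8, 15)),  # Monday morning
    ("A", datetime(2017, 10, 30, 8, 40)),  # Monday morning
    ("A", datetime(2017, 10, 28, 23, 5)),  # Saturday night
    ("B", datetime(2017, 11, 1, 8, 0)),    # Wednesday, skipped
]
departures = aggregate(trips)
```

Six intervals combined with three day types give the 18 time units; the same function can be applied separately to arrivals and departures.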
Fig. 1. Calculation of indicators

Model diagnostics were carried out, concluding that logarithmic and Box-Cox (30) transformations solved heteroscedasticity issues. In total, 324 models were built including arrivals or departures to the stations and the different day intervals per regression method. Multiple linear regression with a stepwise regression in both directions was the method with the least number of selected variables, in a range from 10 to 25. However, GBM usually presented higher fitting and validation scores, with an average R² of 0.84 for the fitting and 0.47 for the validation test.

The most influential variables in all the built models and through all day intervals were the population of the city and the distance from the city center (old town) to the stations, followed by the distance to bakeries. In general, the most influential variables were related to leisure activity locations, such as parks, green areas, and water bodies on the weekends, pubs, cinemas and clubs at night, shops on Saturdays, and memorials outside of working hours. Just a few transport-related variables significantly influenced the models, such as the distance to bicycle parking and car-sharing stations. Logical relationships between the variables and the time intervals were displayed, such as higher ridership at night for stations close to nightclubs.
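
For reference, the Box-Cox transformation (30) used in the diagnostics has the standard form (y^λ − 1)/λ for λ ≠ 0 and ln y for λ = 0, defined for positive y. The sketch below only evaluates this formula; how λ was chosen in the study is not stated, so the λ values in the example are arbitrary, and the function name `box_cox` is hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox is defined for positive values only")
    if lam == 0:
        return math.log(y)  # limiting case: plain logarithmic transform
    return (y ** lam - 1.0) / lam


# Hypothetical ridership counts per zone and an arbitrary lambda.
counts = [1, 4, 9, 16]
transformed = [box_cox(c, 0.5) for c in counts]
logged = [box_cox(c, 0) for c in counts]
```

With λ = 0 the transform reduces to the logarithmic transformation that the text mentions alongside Box-Cox.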

4. Conclusion

This study presented an automated method to collect, analyze and process open-source data to find relationships between the ridership of shared vehicle systems and the built environment. This method helped to automatically identify influencing factors of SBBS in multiple cities in Germany, such as the distance to the city center (old town). The built models can be used as a first approach to estimate the transferability of SBBS to other cities in Germany considering only open-source built environment features.

Fig. 2. Model building and selection

References

[1] S. Shaheen, N. Chan, A. Bansal, A. Cohen, Shared mobility: Definitions, industry development, and early understanding, Transportation Sustainability Research Center (TSRC), UC Berkeley.
[2] J. Büttner, T. Petersen, Optimising Bike Sharing in European Cities: A Handbook, OBIS, 2011. URL https://books.google.de/books?id=Gqq3kQEACAAJ
[3] S. Shaheen, E. Martin, A. Cohen, R. Finson, Public bikesharing in north america: Early operator and user understanding, mti report 11-19, Tech. rep., Mineta Transportation Institute (2012).
[4] S. A. Shaheen, A. P. Cohen, Carsharing and personal vehicle services: Worldwide market developments and emerging trends, International Journal of Sustainable Transportation 7 (1) (2012) 5–34.
[5] E. Martin, S. Shaheen, J. Lidicker, Impact of carsharing on household vehicle holdings, Transportation Research Record (2143) (2010) 150–158. doi:10.3141/2143-19.
[6] F. Giesel, C. Nobis, The impact of carsharing on car ownership in german cities, Transportation Research Procedia 19 (2016) 215–224, Transforming Urban Mobility. mobil.TUM 2016. International Scientific Conference on Mobility and Transport. Conference Proceedings. doi:10.1016/j.trpro.2016.12.082.
[7] S. Shaheen, N. Chan, Mobility and the sharing economy: Impacts synopsis, Transportation Sustainability Research Center, University of California, Berkeley. http://tsrc.berkeley.edu/sites/default/files/Innovative-Mobility-Industry-Outlook SM-Spring-2015.pdf.
[8] E. Fishman, S. Washington, N. Haworth, Bike shares impact on car use: evidence from the united states, great britain, and australia, Transportation Research Part D: Transport and Environment 31 (2014) 13–20.
[9] K. Kortum, R. Schönduwe, B. Stolte, B. Bock, Free-floating carsharing: City-specific growth rates and success factors, Transportation Research Procedia 19 (2016) 328–340. doi:10.1016/j.trpro.2016.12.092.
[10] C. M. D. Chardon, G. Caruso, I. Thomas, Bicycle sharing system success determinants, Transportation Research Part A: Policy and Practice 100 (2017) 202–214. doi:10.1016/j.tra.2017.04.020.
[11] S. Schmöller, S. Weikl, J. Müller, K. Bogenberger, Empirical analysis of free-floating carsharing usage: The munich and berlin case, Transportation Research Part C: Emerging Technologies 56 (2015) 34–51.
[12] J. Kang, K. Hwang, S. Park, Finding factors that influence carsharing usage: Case study in seoul, Sustainability 8 (2016) 709. doi:10.3390/su8080709.
[13] C. Celsor, A. Millard-Ball, Where does carsharing work?: Using geographic information systems to assess market potential, Transportation Research Record: Journal of the Transportation Research Board (2007) 61–69.
[14] J. Comendador, M. E. Lopez-Lambas, A. Monzon, Urban built environment analysis: Evidence from a mobility survey in madrid, Procedia - Social and Behavioral Sciences 160 (2014) 362–371. doi:10.1016/j.sbspro.2014.12.148.
[15] A. Faghih-Imani, N. Eluru, A. M. El-Geneidy, M. Rabbat, U. Haq, How land-use and urban form impact bicycle flows: evidence from the bicycle-sharing system (bixi) in montreal, Journal of Transport Geography 41 (2014) 306–314. doi:10.1016/j.jtrangeo.2014.01.013.
[16] B. Caulfield, M. O’Mahony, W. Brazil, P. Weldon, Examining usage patterns of a bike-sharing scheme in a medium sized city, Transportation Research Part A: Policy and Practice 100 (2017) 152–161. doi:10.1016/j.tra.2017.04.023.
[17] S. Schmöller, K. Bogenberger, Analyzing external factors on the spatial and temporal demand of car sharing systems, Procedia - Social and Behavioral Sciences 111 (2014) 8–17. doi:10.1016/j.sbspro.2014.01.033.
[18] C. Willing, K. Klemmer, T. Brandt, D. Neumann, Moving in time and space location intelligence for carsharing decision support, Decision Support Systems 99 (2017) 75–85, Location Analytics and Decision Support. doi:10.1016/j.dss.2017.05.005.
[19] M. Janssen, Y. Charalabidis, A. Zuiderwijk, Benefits, adoption barriers and myths of open data and open government, Information Systems Management 29 (4) (2012) 258–268.
[20] S. Sadiq, M. Indulska, Open data: Quality over quantity, International Journal of Information Management 37 (3) (2017) 150–154. doi:10.1016/j.ijinfomgt.2017.01.003.
[21] A. Vetrò, L. Canova, M. Torchiano, C. O. Minotas, R. Iemma, F. Morando, Open data quality measurement framework: Definition and application to open government data, Government Information Quarterly 33 (2) (2016) 325–337. doi:10.1016/j.giq.2016.02.001.
[22] OpenStreetMap-contributors, Planet dump retrieved from https://planet.osm.org, https://www.openstreetmap.org, accessed on: 31.10.2017 (2017).
[23] W. A. Günther, M. H. R. Mehrizi, M. Huysman, F. Feldberg, Debating big data: A literature review on realizing value from big data, The Journal of Strategic Information Systems 26 (3) (2017) 191–209. doi:10.1016/j.jsis.2017.07.003.
[24] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, 1996, pp. 226–231.
[25] J. Zhao, W. Deng, Y. Song, Ridership and effectiveness of bikesharing: The effects of urban features and system characteristics on daily use and turnover rate of public bikes in china, Transport Policy 35 (2014) 253–264. doi:10.1016/j.tranpol.2014.06.008.
[26] S. Chatterjee, A. Hadi, Regression analysis by example, John Wiley and Sons, 2015.
[27] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1) (1996) 267–288. URL http://www.jstor.org/stable/2346178
[28] J. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics 29 (5) (2001) 1189–1232.
[29] Deutsche Bahn AG, Das smarte leihfahrrad der deutschen bahn - call a bike, https://www.callabike-interaktiv.de/de, accessed on: 31.10.2017 (2017).
[30] G. E. P. Box, D. R. Cox, An analysis of transformations, Journal of the Royal Statistical Society. Series B (Methodological) 26 (2) (1964) 211–252. URL http://www.jstor.org/stable/2984418