Preprocessing

The preprocessing module contains a variety of functions to transform mobility and tracking data into richer data sources.

Filtering

trackintel.preprocessing.filter.spatial_filter(source, areas, method='within', re_project=False)[source]

Filter staypoints, locations or triplegs with a geo extent.

Parameters
  • source (GeoDataFrame (as trackintel datamodels)) – The source feature to perform the spatial filtering

  • areas (GeoDataFrame) – The areas used to perform the spatial filtering. Note, you can have multiple Polygons and it will return all the features intersect with ANY of those geometries.

  • method ({'within', 'intersects', 'crosses'}) –

    The method to filter the ‘source’ GeoDataFrame

    • ’within’ : return instances in ‘source’ where no points of these instances lies in the exterior of the ‘areas’ and at least one point of the interior of these instances lies in the interior of ‘areas’.

    • ’intersects’: return instances in ‘source’ where the boundary or interior of these instances intersect in any way with those of the ‘areas’

    • ’crosses’ : return instances in ‘source’ where the interior of these instances intersects the interior of the ‘areas’ but does not contain it, and the dimension of the intersection is less than the dimension of the one of the ‘areas’.

  • re_project (bool, default False) – If this is set to True, the ‘source’ will be projected to the coordinate reference system of ‘areas’

Returns

ret_gdf – A new GeoDataFrame containing the features after the spatial filtering.

Return type

GeoDataFrame (as trackintel datamodels)

Examples

>>> sp.as_staypoints.spatial_filter(areas, method="within", re_project=False)

Positionfixes

As positionfixes are usually the data we receive from a tracking application of some sort, there are various functions that extract meaningful information from it (and in the process turn it into a higher-level trackintel data structure).

In particular, we can generate staypoints and triplegs from positionfixes.

trackintel.preprocessing.positionfixes.generate_staypoints(positionfixes, method='sliding', distance_metric='haversine', dist_threshold=100, time_threshold=5.0, gap_threshold=15.0, include_last=False, print_progress=False, exclude_duplicate_pfs=True, n_jobs=1)[source]

Generate staypoints from positionfixes.

Parameters
  • positionfixes (GeoDataFrame (as trackintel positionfixes)) – The positionfixes have to follow the standard definition for positionfixes DataFrames.

  • method ({'sliding'}) – Method to create staypoints. ‘sliding’ applies a sliding window over the data.

  • distance_metric ({'haversine'}) – The distance metric used by the applied method.

  • dist_threshold (float, default 100) – The distance threshold for the ‘sliding’ method, i.e., how far someone has to travel to generate a new staypoint. Units depend on the dist_func parameter. If ‘distance_metric’ is ‘haversine’ the unit is in meters

  • time_threshold (float, default 5.0 (minutes)) – The time threshold for the ‘sliding’ method in minutes.

  • gap_threshold (float, default 15.0 (minutes)) – The time threshold of determine whether a gap exists between consecutive pfs. Consecutive pfs with temporal gaps larger than ‘gap_threshold’ will be excluded from staypoints generation. Only valid in ‘sliding’ method.

  • include_last (boolean, default False) – The algorithm in Li et al. (2008) only detects staypoint if the user steps out of that staypoint. This will omit the last staypoint (if any). Set ‘include_last’ to True to include this last staypoint.

  • print_progress (boolean, default False) – Show per-user progress if set to True.

  • exclude_duplicate_pfs (boolean, default True) – Filters duplicate positionfixes before generating staypoints. Duplicates can lead to problems in later processing steps (e.g., when generating triplegs). It is not recommended to set this to False.

  • n_jobs (int, default 1) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. See https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation for a detailed description

Returns

  • pfs (GeoDataFrame (as trackintel positionfixes)) – The original positionfixes with a new column [`staypoint_id`].

  • sp (GeoDataFrame (as trackintel staypoints)) – The generated staypoints.

Notes

The ‘sliding’ method is adapted from Li et al. (2008). In the original algorithm, the ‘finished_at’ time for the current staypoint lasts until the ‘tracked_at’ time of the first positionfix outside this staypoint. Users are assumed to be stationary during this missing period and potential tracking gaps may be included in staypoints. To avoid including too large missing signal gaps, set ‘gap_threshold’ to a small value, e.g., 15 min.

Examples

>>> pfs.as_positionfixes.generate_staypoints('sliding', dist_threshold=100)

References

Zheng, Y. (2015). Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology (TIST), 6(3), 29.

Li, Q., Zheng, Y., Xie, X., Chen, Y., Liu, W., & Ma, W. Y. (2008, November). Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems (p. 34). ACM.

trackintel.preprocessing.positionfixes.generate_triplegs(positionfixes, staypoints=None, method='between_staypoints', gap_threshold=15)[source]

Generate triplegs from positionfixes.

Parameters
  • positionfixes (GeoDataFrame (as trackintel positionfixes)) – The positionfixes have to follow the standard definition for positionfixes DataFrames. If ‘staypoint_id’ column is not found, ‘staypoints’ needs to be given.

  • staypoints (GeoDataFrame (as trackintel staypoints), optional) – The staypoints (corresponding to the positionfixes). If this is not passed, the positionfixes need ‘staypoint_id’ associated with them.

  • method ({'between_staypoints'}) – Method to create triplegs. ‘between_staypoints’ method defines a tripleg as all positionfixes between two staypoints (no overlap). This method requires either a column ‘staypoint_id’ on the positionfixes or passing staypoints as an input.

  • gap_threshold (float, default 15 (minutes)) – Maximum allowed temporal gap size in minutes. If tracking data is missing for more than gap_threshold minutes, a new tripleg will be generated.

Returns

  • pfs (GeoDataFrame (as trackintel positionfixes)) – The original positionfixes with a new column [`tripleg_id`].

  • tpls (GeoDataFrame (as trackintel triplegs)) – The generated triplegs.

Notes

Methods ‘between_staypoints’ requires either a column ‘staypoint_id’ on the positionfixes or passing some staypoints that correspond to the positionfixes! This means you usually should call generate_staypoints() first.

The first positionfix after a staypoint is regarded as the first positionfix of the generated tripleg. The generated tripleg will not have overlapping positionfix with the existing staypoints. This means a small temporal gap in user’s trace will occur between the first positionfix of staypoint and the last positionfix of tripleg: pfs_stp_first[‘tracked_at’] - pfs_tpl_last[‘tracked_at’].

Examples

>>> pfs.as_positionfixes.generate_triplegs('between_staypoints', gap_threshold=15)

Staypoints

Staypoints are points where someone stayed for a longer period of time (e.g., during a transfer between two transport modes). We can cluster these into locations that a user frequently visits and/or infer if they correspond to activities.

trackintel.preprocessing.staypoints.generate_locations(staypoints, method='dbscan', epsilon=100, num_samples=1, distance_metric='haversine', agg_level='user', print_progress=False, n_jobs=1)[source]

Generate locations from the staypoints.

Parameters
  • staypoints (GeoDataFrame (as trackintel staypoints)) – The staypoints have to follow the standard definition for staypoints DataFrames.

  • method ({'dbscan'}) –

    Method to create locations.

    • ’dbscan’ : Uses the DBSCAN algorithm to cluster staypoints.

  • epsilon (float, default 100) – The epsilon for the ‘dbscan’ method. if ‘distance_metric’ is ‘haversine’ or ‘euclidean’, the unit is in meters.

  • num_samples (int, default 1) – The minimal number of samples in a cluster.

  • distance_metric ({'haversine', 'euclidean'}) – The distance metric used by the applied method. Any mentioned below are possible: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

  • agg_level ({'user','dataset'}) – The level of aggregation when generating locations: - ‘user’ : locations are generated independently per-user. - ‘dataset’ : shared locations are generated for all users.

  • print_progress (bool, default False) – If print_progress is True, the progress bar is displayed

  • n_jobs (int, default 1) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. See https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation for a detailed description

Returns

  • sp (GeoDataFrame (as trackintel staypoints)) – The original staypoints with a new column [`location_id`].

  • locs (GeoDataFrame (as trackintel locations)) – The generated locations.

Examples

>>> sp.as_staypoints.generate_locations(method='dbscan', epsilon=100, num_samples=1)

Due to tracking artifacts, it can occur that one activity is split into several staypoints. We can aggregate the staypoints horizontally that are close in time and at the same location.

trackintel.preprocessing.staypoints.merge_staypoints(staypoints, triplegs, max_time_gap='10min', agg={})[source]

Aggregate staypoints horizontally via time threshold.

Parameters
  • staypoints (GeoDataFrame (as trackintel staypoints)) – The staypoints must contain a column location_id (see generate_locations function) and have to follow the standard trackintel definition for staypoints DataFrames.

  • triplegs (GeoDataFrame (as trackintel triplegs)) – The triplegs have to follow the standard definition for triplegs DataFrames.

  • max_time_gap (str or pd.Timedelta, default "10min") – Maximum duration between staypoints to still be merged. If str must be parsable by pd.to_timedelta.

  • agg (dict, optional) – Dictionary to aggregate the rows after merging staypoints. This dictionary is used as input to the pandas aggregate function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html If empty, only the required columns of staypoints (which are [‘user_id’, ‘started_at’, ‘finished_at’]) are aggregated and returned. In order to return for example also the geometry column of the merged staypoints, set ‘agg={“geom”:”first”}’ to return the first geometry of the merged staypoints, or ‘agg={“geom”:”last”}’ to use the last one.

Returns

sp – The new staypoints with the default columns and columns in agg, where staypoints at same location and close in time are aggregated.

Return type

DataFrame

Notes

  • Due to the modification of the staypoint index, the relation between the staypoints and the corresponding positionfixes is broken after execution of this function! In explanation, the staypoint_id column of pfs does not necessarily correspond to an id in the new sp table that is returned from this function. The same holds for trips (if generated yet) where the staypoints contained in a trip might be merged in this function.

  • If there is a tripleg between two staypoints, the staypoints are not merged. If you for some reason want to merge such staypoints, simply pass an empty DataFrame for the tpls argument.

Examples

>>> # direct function call
>>> ti.preprocessing.staypoints.merge_staypoints(staypoints=sp, triplegs=tpls)
>>> # or using the trackintel datamodel
>>> sp.as_staypoints.merge_staypoints(triplegs, max_time_gap="1h", agg={"geom":"first"})

Triplegs

Triplegs denote routes taken between two consecutive staypoint. Usually, these are traveled with a single mode of transport.

From staypoints and triplegs, we can generate trips that summarize all movement and all non-essential actions (e.g., waiting) between two relevant activity staypoints.

trackintel.preprocessing.triplegs.generate_trips(staypoints, triplegs, gap_threshold=15, add_geometry=True)[source]

Generate trips based on staypoints and triplegs.

Parameters
  • staypoints (GeoDataFrame (as trackintel staypoints)) –

  • triplegs (GeoDataFrame (as trackintel triplegs)) –

  • gap_threshold (float, default 15 (minutes)) – Maximum allowed temporal gap size in minutes. If tracking data is missing for more than gap_threshold minutes, then a new trip begins after the gap.

  • add_geometry (bool default True) – If True, the start and end coordinates of each trip are added to the output table in a geometry column “geom” of type MultiPoint. Set add_geometry=False for better runtime performance (if coordinates are not required).

  • print_progress (bool, default False) – If print_progress is True, the progress bar is displayed

Returns

  • sp (GeoDataFrame (as trackintel staypoints)) – The original staypoints with new columns [`trip_id`, `prev_trip_id`, `next_trip_id`].

  • tpls (GeoDataFrame (as trackintel triplegs)) – The original triplegs with a new column [`trip_id`].

  • trips ((Geo)DataFrame (as trackintel trips)) – The generated trips.

Notes

Trips are an aggregation level in transport planning that summarize all movement and all non-essential actions (e.g., waiting) between two relevant activities. The function returns altered versions of the input staypoints and triplegs. Staypoints receive the fields [trip_id prev_trip_id and next_trip_id], triplegs receive the field [trip_id]. The following assumptions are implemented

  • If we do not record a person for more than gap_threshold minutes, we assume that the person performed an activity in the recording gap and split the trip at the gap.

  • Trips that start/end in a recording gap can have an unknown origin/destination

  • There are no trips without a (recorded) tripleg

  • Trips optionally have their start and end point as geometry of type MultiPoint, if add_geometry==True

  • If the origin (or destination) staypoint is unknown, and add_geometry==True, the origin (and destination) geometry is set as the first coordinate of the first tripleg (or the last coordinate of the last tripleg), respectively. Trips with missing values can still be identified via col origin_staypoint_id.

Examples

>>> from trackintel.preprocessing.triplegs import generate_trips
>>> staypoints, triplegs, trips = generate_trips(staypoints, triplegs)

trips can also be directly generated using the tripleg accessor >>> staypoints, triplegs, trips = triplegs.as_triplegs.generate_trips(staypoints)

The function generate_trips follows this algorithm:

../_images/tripalgorithm.png

Trips

Trips denote the sequence of all triplegs between two consecutive activities. These can be composed of multiple means of transports. A further aggregation of Trips are Tours, which is a sequence of trips such that it starts and ends at the same location. Using the trips, we can generate tours.

trackintel.preprocessing.trips.generate_tours(trips, staypoints=None, max_dist=100, max_time='1d', max_nr_gaps=0, print_progress=False)[source]

Generate trackintel-tours from trips

Parameters
  • trips (GeoDataFrame (as trackintel trips)) – The trips have to follow the standard definition for trips DataFrames

  • staypoints (GeoDataFrame (as trackintel staypoints, preprocessed to contain location IDs), default None) – The staypoints have to follow the standard definition for staypoints DataFrames. The location ID column is necessary to connect trips via locations to a tour. If None, trips will be connected based only on a distance threshold max_dist.

  • max_dist (float, default 100 (meters)) – Maximum distance between the end point of one trip and the start point of the next trip on a tour. This is parameter is only used if staypoints is None! Also, if max_nr_gaps > 0, a tour can contain larger spatial gaps (see Notes below)

  • max_time (str or pd.Timedelta, default "1d" (1 day)) – Maximum time that a tour is allowed to take

  • max_nr_gaps (int, default 0) – Maximum number of spatial gaps on the tour. Use with caution - see notes below.

  • print_progress (bool, default False) – If print_progress is True, the progress bar is displayed

Returns

  • trips_with_tours (GeoDataFrame (as trackintel trips)) – Same as trips, but with column tour_id, containing a list of the tours that the trip is part of (see notes).

  • tours (GeoDataFrame (as trackintel tours)) – The generated tours

Examples

>>> trips.as_trips.generate_tours(staypoints)

Notes

  • Tours are defined as a collection of trips in a certain time frame that start and end at the same point

  • Tours and trips have an N:N relationship: One tour consists of multiple trips, but also one trip can be part of multiple tours, due to nested tours or overlapping tours.

  • This function implements two possibilities to generate tours of trips: Via the location ID in the staypoints df, or via a maximum distance. Thus, note that only one of the parameters staypoints or max_dist is used!

  • Nested tours are possible and will be regarded as 2 (or more tours).

  • It is possible to allow spatial gaps to occur on the tour, which might be useful to deal with missing data. Example: The two trips home-work, supermarket-home would still be detected as a tour when max_nr_gaps >= 1, although the work-supermarket trip is missing. Warning: This only counts the number of gaps, but neither temporal or spatial distance of gaps, nor the number of missing trips in a gap are bounded. Thus, this parameter should be set with caution, because trips that are hours apart might still be connected to a tour if max_nr_gaps > 0.

Trips and Tours have an n:n relationship: One tour consists of multiple trips, but due to nested or overlapping tours, one trip can also be part of mulitple tours. A helper function can be used to get the trips grouped by tour.

trackintel.preprocessing.trips.get_trips_grouped(trips, tours)[source]

Helper function to get grouped trips by tour id

Parameters
  • trips (GeoDataFrame (as trackintel trips)) – Trips dataframe

  • tours (GeoDataFrame (as trackintel tours)) – Output of generate_tours function, must contain column “trips” with list of trip ids on tour

Returns

trips_grouped_by_tour – Trips grouped by tour id

Return type

DataFrameGroupBy object

Examples

>>> get_trips_grouped(trips, tours)

Notes

This function is necessary because when running generate_tours, one trip only gets the tour ID of the smallest tour it belongs to assigned. Here, we return all trips for each tour, which might contain a nested tour.