Improving Trip Segmentation by reducing DB calls#958
Improving Trip Segmentation by reducing DB calls#958humbleOldSage wants to merge 2 commits intoe-mission:masterfrom
Conversation
The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are : 1. db calls for transition and motion dataframes are moved upstream from `is_tracking_restarted_in_range` function and `get_ongoing_motion_in_range` in `restart_checking.py` to `trip_segmentaiton.py`. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting. 2. All the other changes in `trip_segmentation.py`,`dwell_segmentation_dist_filter.py` and `dwell_segmentation_time_filter.py` are just to support the change in point 1 ( above).
|
Replying to the comment #956 (comment) here :
The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.
I am using the
When the data in The for loop then iterates on these points |
This is in regards to the fact that the db call for "background/filtered_location": Is present in these 3 files This causes an overhead of ~0.2s for androidWrapper and IosWrapper in the current test setup. However, there is nothing else failing here. |
And what was it before? Just reporting the new number does not capture the difference. I don't see a comment in the issue for the time taken in
This is not a sufficient test. |
On second look, for the iOSWrapper, the test reads data from
I believe the IosWrapper is at 0.4s due to lesser trips, which we come to know after running the pipeline. ( 2 in case of iOS vs 8 for android ). I tried other real example files with platform as iOS. The other files with iOS data are : iphone_2016-02-22 , once the example is set (using all of them one by one), there are are either 0 or 2 trips for iOS, giving the similar runtime of ~0.4s - 0.5s. |
|
The runtimes are :
|
|
@JGreenlee scalability improvement |
shankari
left a comment
There was a problem hiding this comment.
I don't think that < 10 trips is going to give us a real sense of the scalability improvement. Please run more complex tests:
TestPipelineRealDatafor a combination of correctness and performance- against a real large dataset, after resetting the pipeline, for a rough estimate of performance.
| transition_df_start_idx=transition_df.ts.searchsorted(start_ts,side='left') | ||
| transition_df_end_idx=transition_df.ts.searchsorted(end_ts,side='right') | ||
| transition_df_for_current=transition_df.iloc[transition_df_start_idx:transition_df_end_idx] |
There was a problem hiding this comment.
Is there a reason you are using this instead of something like
transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]
or
transition_df_for_current=transition_df.query('ts >= start_ts & .ts <= end_ts')
There was a problem hiding this comment.
First was the performance reason, O(log (n)) here vs O (n) in others.
Second, in case of query(), it creates a boolean series that marks rows to keep or discard, which happens internally, which increases temporary memory usage. For very large DataFrames, this can be an issue.
can use this one ,if log(n) vs O(n) is not an issue :
transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]
| motion_df_start_idx=motion_df.ts.searchsorted(start_ts,side='left') | ||
| motion_df_end_idx=motion_df.ts.searchsorted(end_ts,side='right') | ||
| filtered_motion_df=motion_df.iloc[motion_df_start_idx:motion_df_end_idx] |
| from builtins import * | ||
| from builtins import object | ||
| import logging | ||
| import pandas as pd |
There was a problem hiding this comment.
why do we need to add this import?
There was a problem hiding this comment.
Unused currently. Removed.
| return segmentation_points | ||
|
|
||
| def has_trip_ended(self, lastPoint, currPoint, timeseries): | ||
| def has_trip_ended(self, lastPoint, currPoint, timeseries, motion_df): |
There was a problem hiding this comment.
if you are already passing in the motion_df here, why do you need to also pass in the timeseries?
There was a problem hiding this comment.
it's being used here :
timeseries.invalidate_raw_entry(currPoint["_id"])
| speedThreshold = old_div(float(self.distance_threshold * 2), (old_div(self.time_threshold, 2))) | ||
|
|
||
| if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, timeseries): | ||
| if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, self.transition_df): |
There was a problem hiding this comment.
why is transition_df a module field?
| # between two location points, and there is a large gap between the last location and the first | ||
| # motion activity as well, let us just assume that there was a restart | ||
| ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, timeseries) | ||
| ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, motion_df) |
There was a problem hiding this comment.
while motion_df is not. They are both read upfront and should be treated the same way for clarity
| non_still_motions = [ma for ma in motionInRange if ma["data"]["type"] not in ignore_modes_list and ma["data"]["confidence"] == 100] | ||
| logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) | ||
| non_still_motions=motionInRange[~motionInRange['type'].isin(ignore_modes_list) & (motionInRange['confidence'] ==100)] | ||
| #logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) |
There was a problem hiding this comment.
this change doesn't appear to be related to this fix.
| return False | ||
|
|
||
| def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes): | ||
| def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes, transition_df, motion_df): |
There was a problem hiding this comment.
Ditto. if we are passing in the transition_df and motion_df, why do we also need the timeseries?
There was a problem hiding this comment.
Here, it's unused. Cleared.
Code cleanup as per the comments.
a2a9a44 to
e50e9f3
Compare
cd45974 to
30c4273
Compare
The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are :
db calls for transition and motion dataframes are moved upstream from
is_tracking_restarted_in_rangefunction andget_ongoing_motion_in_rangeinrestart_checking.pytotrip_segmentaiton.py. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting.All the other changes in
trip_segmentation.py,dwell_segmentation_dist_filter.pyanddwell_segmentation_time_filter.pyare just to support the change in point 1 ( above).