In essence, the haversine function is:
    # convert all latitudes/longitudes from decimal degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))

    # calculate haversine
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
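For reference, the snippet above can be wrapped into a standalone function roughly as follows. This is a minimal sketch: the function name `haversine`, its argument order, and the value `AVG_EARTH_RADIUS = 6371.0` (kilometres) are assumptions for illustration, since the text does not define them.

```python
from math import radians, sin, cos, asin, sqrt, pi

AVG_EARTH_RADIUS = 6371.0  # km; assumed value, not stated in the text


def haversine(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # convert all latitudes/longitudes from decimal degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))

    # calculate haversine
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
    return 2 * AVG_EARTH_RADIUS * asin(sqrt(d))


# Sanity check: a quarter of the equator should be (pi / 2) * radius.
quarter = haversine(0.0, 0.0, 0.0, 90.0)
```

A quarter turn along the equator is a convenient check because the expected value, `AVG_EARTH_RADIUS * pi / 2`, follows directly from the formula.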
Here is a vectorized approach that uses NumPy broadcasting and NumPy ufuncs in place of the math module functions, so that we operate on entire arrays in one go:
    # Get array data; convert to radians to simulate the 'map(radians,...)' part
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)

    # Get the differences
    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]

    # Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
    # Add into the "sin(lat * 0.5) ** 2" part.
    add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 + add0

    # Get h and assign into dataframe
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance'] = h.min(1)
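To make the broadcasting step explicit: indexing with `a[:, 0, None]` turns the `n` latitudes into a column of shape `(n, 1)`, so subtracting it from the `(m,)` array of reference latitudes yields an `(n, m)` grid of all pairwise differences. The sketch below shows this on plain arrays. The function name `min_haversine`, the reference coordinates, and the 6371.0 km radius are assumptions for illustration; `points` stands in for `df.values`.

```python
import numpy as np

AVG_EARTH_RADIUS = 6371.0  # km; assumed value, not stated in the text


def min_haversine(points_deg, coords_deg):
    """For each point, distance in km to the nearest reference coordinate."""
    a = np.deg2rad(np.asarray(points_deg, dtype=float))  # shape (n, 2)
    c = np.deg2rad(np.asarray(coords_deg, dtype=float))  # shape (m, 2)

    # (m,) minus (n, 1) broadcasts to an (n, m) grid of pairwise differences
    lat = c[:, 0] - a[:, 0, None]
    lng = c[:, 1] - a[:, 1, None]

    d = (np.sin(lat * 0.5) ** 2
         + np.cos(a[:, 0, None]) * np.cos(c[:, 0]) * np.sin(lng * 0.5) ** 2)
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))  # shape (n, m)
    return h.min(axis=1)  # nearest reference per point, shape (n,)


# Two rows of the sample dataframe; the reference coordinates are hypothetical.
points = np.array([[39.989, -89.980], [39.030, -89.931]])
coords = np.array([[40.0, -90.0], [41.0, -88.0], [38.5, -89.5]])
result = min_haversine(points, coords)
```

The intermediate `(n, m)` array is the price of the speedup: memory grows with the product of the two input sizes, so very large inputs may need to be processed in chunks.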
To boost performance further, we could use the numexpr module to replace the transcendental functions.
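One possible shape for the numexpr variant is sketched below. It is not taken from the original answer: the function name `min_haversine_ne`, the 6371.0 km radius, and the sample inputs are assumptions, and the code falls back to plain NumPy when numexpr is not installed. `numexpr.evaluate` compiles the string expression and evaluates it over the broadcast arrays without materializing each intermediate result.

```python
import numpy as np

try:
    import numexpr as ne
except ImportError:  # numexpr is optional in this sketch
    ne = None

AVG_EARTH_RADIUS = 6371.0  # km; assumed value, not stated in the text


def min_haversine_ne(points_deg, coords_deg):
    """For each point, distance in km to the nearest reference coordinate."""
    a = np.deg2rad(np.asarray(points_deg, dtype=float))  # shape (n, 2)
    c = np.deg2rad(np.asarray(coords_deg, dtype=float))  # shape (m, 2)

    lat1 = a[:, 0, None]           # (n, 1)
    lat2 = c[:, 0]                 # (m,)
    lat = lat2 - lat1              # (n, m)
    lng = c[:, 1] - a[:, 1, None]  # (n, m)
    R = AVG_EARTH_RADIUS

    if ne is not None:
        # numexpr picks up lat, lng, lat1, lat2 and R from the local scope
        h = ne.evaluate(
            "2 * R * arcsin(sqrt(sin(lat * 0.5) ** 2"
            " + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2))"
        )
    else:
        d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
        h = 2 * R * np.arcsin(np.sqrt(d))
    return h.min(axis=1)


# Hypothetical inputs for illustration.
points = np.array([[39.989, -89.980], [39.030, -89.931]])
coords = np.array([[40.0, -90.0], [41.0, -88.0], [38.5, -89.5]])
result_ne = min_haversine_ne(points, coords)
```

Any gain from numexpr typically shows up on large arrays; on a five-row dataframe like the one in the timings here, the expression-compilation overhead may outweigh it.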
Runtime test and verification

Approaches:
    def loopy_app(df, coords_list):
        # Note: the loop body ignores `row`, so the same apply() call is
        # repeated once per dataframe row.
        for row in df.itertuples():
            df['Min_Distance1'] = df.apply(min_distance, axis=1)

    def vectorized_app(df, coords_list):
        coords_arr = np.deg2rad(coords_list)
        a = np.deg2rad(df.values)

        lat = coords_arr[:,0] - a[:,0,None]
        lng = coords_arr[:,1] - a[:,1,None]

        add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
        d = np.sin(lat * 0.5) ** 2 + add0

        h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
        df['Min_Distance2'] = h.min(1)
Verification:
    In [158]: df
    Out[158]:
       Latitude  Longitude
    0    39.989    -89.980
    1    39.923    -89.901
    2    39.990    -89.987
    3    39.884    -89.943
    4    39.030    -89.931

    In [159]: loopy_app(df, coords_list)

    In [160]: vectorized_app(df, coords_list)

    In [161]: df
    Out[161]:
       Latitude  Longitude  Min_Distance1  Min_Distance2
    0    39.989    -89.980     126.637607     126.637607
    1    39.923    -89.901     121.266241     121.266241
    2    39.990    -89.987     126.037388     126.037388
    3    39.884    -89.943     118.901195     118.901195
    4    39.030    -89.931      53.765506      53.765506
Timings:
    In [163]: df
    Out[163]:
       Latitude  Longitude
    0    39.989    -89.980
    1    39.923    -89.901
    2    39.990    -89.987
    3    39.884    -89.943
    4    39.030    -89.931

    In [164]: %timeit loopy_app(df, coords_list)
    100 loops, best of 3: 2.41 ms per loop

    In [165]: %timeit vectorized_app(df, coords_list)
    10000 loops, best of 3: 96.8 µs per loop



