Why are NumPy functions so slow on pandas Series/DataFrames?

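Yes, it does seem that np.clip is a lot slower on a pandas.Series than on a numpy.ndarray. That is correct, but it is actually (at least asymptotically) not that bad. 8000 elements is still in the regime where constant factors are the major contributors to the runtime. I think this is a very important aspect of the question, so I am visualizing it (borrowing from another answer):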

# Setup (run in IPython/Jupyter: %timeit and %matplotlib are IPython magics)
import pandas as pd
import numpy as np

def on_series(s):
    return np.clip(s, a_min=None, a_max=1)

def on_values_of_series(s):
    return np.clip(s.values, a_min=None, a_max=1)

# Timing setup
timings = {on_series: [], on_values_of_series: []}
sizes = [2**i for i in range(1, 26, 2)]

# Timing
for size in sizes:
    func_input = pd.Series(np.random.randint(0, 30, size=size))
    for func in timings:
        res = %timeit -o func(func_input)
        timings[func].append(res)

%matplotlib notebook

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)

# Left panel: absolute best time per call
for func in timings:
    ax1.plot(sizes,
             [time.best for time in timings[func]],
             label=str(func.__name__))
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_xlabel('size')
ax1.set_ylabel('time [seconds]')
ax1.grid(which='both')
ax1.legend()

# Right panel: time relative to the baseline function
baseline = on_values_of_series  # choose one function as baseline
for func in timings:
    ax2.plot(sizes,
             [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
             label=str(func.__name__))
ax2.set_yscale('log')
ax2.set_xscale('log')
ax2.set_xlabel('size')
ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
ax2.grid(which='both')
ax2.legend()

plt.tight_layout()

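This is a log-log plot because I think it shows the important features more clearly. For example, it shows that np.clip on a numpy.ndarray is faster, but also that it has a much smaller constant factor in that case. The difference for large arrays is only a factor of ~3! That is still a big difference, but much smaller than the difference on small arrays.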
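To make the "constant factor" argument concrete, here is a toy cost model, not a measurement. It assumes each call costs a fixed overhead plus a per-element cost; the two parameter sets (SLOW and FAST) are hypothetical numbers chosen only so that the asymptotic ratio comes out near 3.

```python
def modeled_time(n, fixed_overhead, per_element):
    """Toy cost model: a fixed per-call overhead plus a cost linear in n."""
    return fixed_overhead + per_element * n


# Hypothetical parameters in arbitrary time units (illustration only).
SLOW = {"fixed_overhead": 1500.0, "per_element": 0.03}  # many Python-level checks
FAST = {"fixed_overhead": 50.0, "per_element": 0.01}    # thin call into compiled code


def ratio(n):
    """How many times slower the SLOW model is than the FAST model at size n."""
    return modeled_time(n, **SLOW) / modeled_time(n, **FAST)


for n in (10, 1_000, 100_000, 10_000_000):
    print(f"n={n:>12,}  slow/fast = {ratio(n):6.2f}")
```

For small n the ratio is dominated by the two fixed overheads; for large n it approaches the ratio of the per-element costs.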

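However, that still does not answer the question of where the time difference comes from.

The solution is actually quite simple: np.clip delegates to the clip method of the first argument: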

>>> np.clip??
Source:
def clip(a, a_min, a_max, out=None):
    """
    ...
    """
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

>>> np.core.fromnumeric._wrapfunc??
Source:
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    # ...
    except (AttributeError, TypeError):
        return _wrapit(obj, method, *args, **kwds)
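The same delegation pattern can be imitated without NumPy. The sketch below is a simplified stand-in, not NumPy's code: wrapfunc, clip, Thin and Checked are hypothetical names, and the two classes only illustrate that the work done depends entirely on which object is passed in.

```python
def wrapfunc(obj, method, *args, **kwargs):
    """Simplified stand-in for the delegation step: call obj.<method>(...)."""
    return getattr(obj, method)(*args, **kwargs)


def clip(a, a_min, a_max):
    """Hypothetical module-level function that only forwards to a.clip()."""
    return wrapfunc(a, "clip", a_min, a_max)


def _clip_value(x, lo, hi):
    if lo is not None and x < lo:
        return lo
    if hi is not None and x > hi:
        return hi
    return x


class Thin:
    """Does nothing except the clipping itself."""

    def __init__(self, data):
        self.data = list(data)
        self.log = []  # records each step so the extra work is visible

    def clip(self, a_min, a_max):
        self.log.append("clip values")
        return [_clip_value(x, a_min, a_max) for x in self.data]


class Checked(Thin):
    """Same result, but with extra validation before the real work."""

    def clip(self, a_min, a_max):
        self.log.append("validate bounds")
        if a_min is not None and a_max is not None and a_min > a_max:
            a_min, a_max = a_max, a_min
        self.log.append("scan for missing values")
        if any(x is None for x in self.data):
            raise ValueError("missing values are not handled in this sketch")
        return super().clip(a_min, a_max)


thin, checked = Thin([0, 1, 2, 3]), Checked([0, 1, 2, 3])
print(clip(thin, None, 1), thin.log)
print(clip(checked, None, 1), checked.log)
```

Both calls go through the same clip function and return the same values, yet the second object performs more steps along the way.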

The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods. Yes, completely different methods:

>>> np.ndarray.clip
<method 'clip' of 'numpy.ndarray' objects>
>>> pd.Series.clip
<function pandas.core.generic.NDFrame.clip>
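The two reprs above hint at different kinds of callables. As a rough, CPython-specific sketch using only the standard library, the inspect module can make the distinction explicit. Here list.sort stands in for a compiled method (its repr has the same "method ... of ... objects" shape), and describe and python_level are hypothetical helper names.

```python
import inspect


def python_level(seq):
    """An ordinary function written in Python."""
    return sorted(seq)


def describe(func):
    """Classify a callable as Python-level or compiled (CPython-specific heuristic)."""
    if inspect.isfunction(func):
        return "Python function"
    if inspect.isbuiltin(func) or inspect.ismethoddescriptor(func):
        return "compiled (C) callable"
    return "other"


for name, func in [("python_level", python_level), ("list.sort", list.sort), ("len", len)]:
    # Only Python functions carry a __code__ object with source line information.
    print(f"{name:<14} {describe(func):<24} has __code__: {hasattr(func, '__code__')}")
```

Line-by-line profiling relies on that per-line information, which compiled callables do not expose.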

Unfortunately, np.ndarray.clip is a C function, so it is hard to profile. pd.Series.clip, however, is a regular Python function, so it is easy to profile. Let's use a Series of 5000 integers here:

s = pd.Series(np.random.randint(0, 100, 5000))

For np.clip on the values I get the following line profile:

%load_ext line_profiler
%lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 2.25641e-05 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 1.51795e-05 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      5.4      try:
    57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
    58
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

But for np.clip on the Series I get a totally different profiling result:

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.000823794 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00081846 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      0.1      try:
    57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
    58
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.000804922 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024
  5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026
  5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
  5033         1            3      3.0      0.2              lower = None
  5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            1      1.0      0.1          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041
  5042                                                   # fast-path for scalars
  5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054
  5055                                                   return result

Total time: 0.000662153 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
  4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924
  4925         1           22     22.0      1.4          result = self.values
  4926         1          571    571.0     35.4          mask = isna(result)
  4927
  4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
  4929         1            1      1.0      0.1              if upper is not None:
  4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
  4931         1           33     33.0      2.0              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1           73     73.0      4.5          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935
  4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
  4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938
  4939         1            2      2.0      0.1          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.1              return result

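I stopped going into the subroutines at that point because this already highlights where pd.Series.clip does much more work than np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) with one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) check (158 timer units). At that point the pandas method has not even started clipping, and it has already taken about 3 times longer.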
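The timer units can be converted into wall-clock time with the "Timer unit" value printed by the profiler. The short calculation below only restates numbers from the two 5000-element profiles; the helper name to_microseconds is made up for this sketch.

```python
TIMER_UNIT = 4.10256e-07  # seconds per timer unit, from the profiler output


def to_microseconds(units):
    """Convert line_profiler timer units into microseconds."""
    return units * TIMER_UNIT * 1e6


ndarray_total = 55    # whole np.clip call on s.values
isnull_check = 158    # `if np.any(pd.isnull(lower)):` inside pandas' clip
series_total = 2008   # whole np.clip call on the Series

for label, units in [("np.clip on values", ndarray_total),
                     ("isnull(lower) check", isnull_check),
                     ("np.clip on Series", series_total)]:
    print(f"{label:<22} {units:>5} units = {to_microseconds(units):8.2f} us")

print(f"isnull check / ndarray call: {isnull_check / ndarray_total:.2f}x")
print(f"Series call  / ndarray call: {series_total / ndarray_total:.1f}x")
```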

However, several of these "overheads" become insignificant when the array is big:

s = pd.Series(np.random.randint(0, 100, 1000000))

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.00593476 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00592779 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            1      1.0      0.0      try:
    57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
    58
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.00591302 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024
  5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026
  5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
  5033         1            2      2.0      0.0              lower = None
  5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            2      2.0      0.0          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041
  5042                                                   # fast-path for scalars
  5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054
  5055                                                   return result

Total time: 0.00575753 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
  4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924
  4925         1          120    120.0      0.9          result = self.values
  4926         1         3525   3525.0     25.1          mask = isna(result)
  4927
  4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
  4929         1            2      2.0      0.0              if upper is not None:
  4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
  4931         1           61     61.0      0.4              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1          283    283.0      2.0          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935
  4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
  4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938
  4939         1            2      2.0      0.0          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.0              return result

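There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (this is the regime where the timing difference is ~3 on my computer).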
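To see which steps scale with the input and which are roughly fixed, the Time column of _clip_with_scalar from the two profiles (5000 and 1,000,000 elements) can be put side by side. The snippet below only rearranges numbers already shown in the profiler output; the variable names are chosen for this sketch.

```python
# Timer units taken from the two _clip_with_scalar profiles.
small = {"isna(result)": 571, "np.where(...)": 141, "self._constructor(...)": 558}
large = {"isna(result)": 3525, "np.where(...)": 9314, "self._constructor(...)": 532}

# Sum of the Time column of each _clip_with_scalar profile.
small_total = 1614   # 5000 elements
large_total = 14034  # 1,000,000 elements


def share(units, total):
    """Percentage of the function's total time spent on one line."""
    return 100.0 * units / total


for step in small:
    growth = large[step] / small[step]
    print(f"{step:<24} {share(small[step], small_total):5.1f}% -> "
          f"{share(large[step], large_total):5.1f}%   (time grew {growth:5.1f}x)")
```

The input grew by a factor of 200, yet the constructor call costs about the same in both runs, while np.where goes from a minor share to the dominant one.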

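The takeaway should probably be:

  • Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
  • Profiling, especially line profiling, can be a great tool for finding where a performance difference comes from.
  • In such cases, always make sure to test objects of different sizes. You could be comparing constant factors that probably do not matter unless you process lots of small arrays.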
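As a sketch of the last point outside a notebook, the standard-library timeit module can time the same functions over several input sizes. Everything here is hypothetical: direct and with_overhead are pure-Python stand-ins for a thin call and a call with fixed bookkeeping, and benchmark is a helper name invented for this example.

```python
import timeit


def direct(data):
    """Clip at an upper bound of 1 with no extra bookkeeping."""
    return [x if x < 1 else 1 for x in data]


def with_overhead(data):
    """Same result, preceded by a fixed amount of unrelated setup work."""
    _settings = {f"option_{i}": i for i in range(2000)}  # size-independent cost
    return direct(data)


def benchmark(funcs, sizes, repeat=3, number=5):
    """Return {function name: [best seconds per call for each size]}."""
    results = {func.__name__: [] for func in funcs}
    for size in sizes:
        data = list(range(size))
        for func in funcs:
            runs = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
            results[func.__name__].append(min(runs) / number)
    return results


sizes = (10, 1_000, 100_000)
results = benchmark([direct, with_overhead], sizes)
for i, size in enumerate(sizes):
    rel = results["with_overhead"][i] / results["direct"][i]
    print(f"size={size:>8,}  with_overhead / direct = {rel:.2f}")
```

Taking the minimum over several repeats is a common way to reduce noise; the relative slowdown would be expected to shrink as the input grows, though the exact numbers depend on the machine.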

Versions used:

Python 3.6.3 64-bit on Windows 10
NumPy 1.13.3
pandas 0.21.1

