Single-Machine, Multi-GPU DistributedDataParallel Training Workflow

How DistributedSampler Works

class DistributedSampler(Sampler[T_co]):

    r"""Sampler that restricts data loading to a subset of the dataset.
> In short, the sampler yields only a subset of the dataset's indices rather than all of them.

    It is especially useful in conjunction with
    :class:`torch.nn.parallel.DistributedDataParallel`. In such a case, each
    process can pass a :class:`~torch.utils.data.DistributedSampler` instance as a
    :class:`~torch.utils.data.DataLoader` sampler, and load a subset of the
    original dataset that is exclusive to it.

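> The sampler is most useful together with torch.nn.parallel.DistributedDataParallel.
> In that setup, each process passes its own DistributedSampler instance to its DataLoader as the sampler,
> and loads only the subset of the original dataset assigned to that process.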
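
To make "a subset that is exclusive to it" concrete, here is a minimal pure-Python sketch of a strided split across processes. The helper name `shard_indices` is hypothetical and not part of PyTorch, and the sketch assumes no shuffling and a dataset length that divides evenly by the number of processes.

```python
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Return the indices one process would read (no shuffle, no padding)."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    indices = list(range(dataset_len))
    # Process `rank` takes every num_replicas-th index, starting at `rank`.
    return indices[rank::num_replicas]


# Illustrative setup: 8 samples split across 4 processes.
shards = [shard_indices(8, num_replicas=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Under these assumptions the shards do not overlap, and together they cover every index exactly once.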

    .. note::
        Dataset is assumed to be of constant size.

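    Args:
        dataset: Dataset used for sampling.
        num_replicas (int, optional): Number of processes participating in
            distributed training. By default, ``world_size`` is retrieved from
            the current distributed group.
        rank (int, optional): Rank of the current process within
            ``num_replicas``. By default, the rank is retrieved from the
            current distributed group.
        shuffle (bool, optional): If ``True`` (default), the sampler shuffles
            the indices.
        seed (int, optional): Random seed used to shuffle the sampler if
            ``shuffle=True``. This number should be identical across all
            processes in the distributed group. Default: ``0``.
        drop_last (bool, optional): If ``True``, the sampler drops the tail of
            the data to make it evenly divisible across the number of
            replicas. If ``False``, the sampler adds extra indices to make the
            data evenly divisible across the replicas. Default: ``False``.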
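
The effect of `drop_last` is easiest to see when the dataset size is not a multiple of the number of replicas. The following pure-Python sketch imitates the two behaviors described above. The function `split_with_drop_last` is a hypothetical helper, shuffling is left out, and padding by repeating indices from the start of the list is an assumption about how the extra indices are chosen.

```python
import math
from typing import List


def split_with_drop_last(dataset_len: int, num_replicas: int,
                         drop_last: bool) -> List[List[int]]:
    """Split range(dataset_len) into equally sized per-rank index lists."""
    indices = list(range(dataset_len))
    if drop_last:
        # Drop the tail so the remainder divides evenly.
        num_samples = dataset_len // num_replicas
        indices = indices[: num_samples * num_replicas]
    else:
        # Pad with repeated indices (taken from the start) until it divides evenly.
        num_samples = math.ceil(dataset_len / num_replicas)
        padding = num_samples * num_replicas - dataset_len
        indices += [indices[i % dataset_len] for i in range(padding)]
    # Strided split: rank r takes positions r, r + num_replicas, ...
    return [indices[r::num_replicas] for r in range(num_replicas)]


padded = split_with_drop_last(10, num_replicas=4, drop_last=False)
dropped = split_with_drop_last(10, num_replicas=4, drop_last=True)
print("drop_last=False:", padded)
print("drop_last=True: ", dropped)
```

With 10 samples and 4 replicas, padding gives every rank 3 indices, with indices 0 and 1 read twice. Dropping gives every rank 2 indices, and samples 8 and 9 are skipped.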

    .. warning::
        In distributed mode, calling the :meth:`set_epoch` method at
        the beginning of each epoch **before** creating the :class:`DataLoader` iterator
        is necessary to make shuffling work properly across multiple epochs. Otherwise,
        the same ordering will be always used.

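> In distributed mode, call set_epoch() at the start of every epoch, before creating the DataLoader iterator,
> so that shuffling works correctly across epochs. If the call is skipped, every epoch uses the same ordering.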
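
Why the ordering repeats without `set_epoch` can be sketched in plain Python, assuming the permutation is seeded from the shared seed plus the current epoch number. Here `random.Random` stands in for the sampler's own random generator, and `epoch_order` is a hypothetical helper rather than a PyTorch function.

```python
import random
from typing import List


def epoch_order(dataset_len: int, seed: int, epoch: int) -> List[int]:
    """Return a permutation that depends only on seed + epoch."""
    rng = random.Random(seed + epoch)  # stand-in for the sampler's generator
    indices = list(range(dataset_len))
    rng.shuffle(indices)
    return indices


# Epoch counter never updated: the stored epoch stays 0, so the order repeats.
stale = [epoch_order(100, seed=0, epoch=0) for _ in range(3)]

# Epoch counter updated each epoch: the order changes from epoch to epoch.
fresh = [epoch_order(100, seed=0, epoch=e) for e in range(3)]

print(stale[0] == stale[1] == stale[2])  # identical orderings
print(fresh[0] != fresh[1])              # different orderings
```

Because the seed is the same in every process, all ranks compute the same permutation for a given epoch and then take their own slice of it, which is why the seed should match across the distributed group.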

    Example::

        >>> sampler = DistributedSampler(dataset) if is_distributed else None
        >>> loader = DataLoader(dataset, shuffle=(sampler is None),
        ...                     sampler=sampler)
        >>> for epoch in range(start_epoch, n_epochs):
        ...     if is_distributed:
        ...         sampler.set_epoch(epoch)
        ...     train(loader)
    """
