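Unfortunately, the best way I found to accomplish this is to do exactly what was stated: check whether the item already exists in the database using `django_model.objects.get`, and then update it.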
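In the settings file, I added the new pipeline:

    ITEM_PIPELINES = {
        # ...
        # Last pipeline, because further changes won't be saved.
        'apps.scrapy.pipelines.ItemPersistencePipeline': 999,
    }

I created some helper methods to handle building a model from the item and, when necessary, creating a new one: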

    from django.forms.models import model_to_dict


    def item_to_model(item):
        # Default to None so a missing attribute reaches the TypeError below
        # instead of raising AttributeError.
        model_class = getattr(item, 'django_model', None)
        if not model_class:
            raise TypeError("Item is not a `DjangoItem` or is misconfigured")

        return item.instance


    def get_or_create(model):
        model_class = type(model)
        created = False

        # Normally, we would use `get_or_create`. However, `get_or_create` would
        # match all properties of an object (i.e. create a new object
        # anytime it changed) rather than update an existing object.
        #
        # Instead, we do the two steps separately.
        try:
            # We have no unique identifier at the moment; use the name for now.
            obj = model_class.objects.get(name=model.name)
        except model_class.DoesNotExist:
            created = True
            obj = model  # DjangoItem created a model for us.

        return (obj, created)


    def update_model(destination, source, commit=True):
        # Remember the stored primary key; the scraped copy does not have one.
        pk = destination.pk

        source_dict = model_to_dict(source)
        for (key, value) in source_dict.items():
            setattr(destination, key, value)

        # Restore the primary key that the loop above may have overwritten.
        setattr(destination, 'pk', pk)

        if commit:
            destination.save()

        return destination

Then the final pipeline is fairly straightforward:

    class ItemPersistencePipeline(object):
        def process_item(self, item, spider):
            try:
                item_model = item_to_model(item)
            except TypeError:
                return item

            model, created = get_or_create(item_model)

            update_model(model, item_model)

            return item
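
To see why the lookup and the update are kept as two separate steps, here is a small sketch that does not touch Django or Scrapy at all. `Record`, `InMemoryTable`, and `upsert` are made-up stand-ins for a model instance, a database table, and the `get_or_create` plus `update_model` pair above; the `price` field is purely illustrative. Like the answer, it assumes `name` is enough to identify an item.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for a Django model instance."""
    name: str
    price: float
    pk: Optional[int] = None


class InMemoryTable:
    """Hypothetical stand-in for a database table keyed by primary key."""

    def __init__(self) -> None:
        self.rows: Dict[int, Record] = {}
        self._next_pk = 1

    def get_by_name(self, name: str) -> Optional[Record]:
        # Plays the role of `objects.get(name=...)`, returning None on a miss.
        for row in self.rows.values():
            if row.name == name:
                return row
        return None

    def save(self, record: Record) -> None:
        # Assign a primary key on first save, as a database would.
        if record.pk is None:
            record.pk = self._next_pk
            self._next_pk += 1
        self.rows[record.pk] = record


def upsert(table: InMemoryTable, scraped: Record) -> Tuple[Record, bool]:
    """Look up by name, then copy fields across while preserving the pk."""
    existing = table.get_by_name(scraped.name)
    created = existing is None
    destination = scraped if created else existing

    pk = destination.pk
    for key, value in asdict(scraped).items():
        setattr(destination, key, value)
    destination.pk = pk  # the scraped copy has no pk; restore the stored one

    table.save(destination)
    return destination, created


table = InMemoryTable()
first, first_created = upsert(table, Record(name="widget", price=9.99))
second, second_created = upsert(table, Record(name="widget", price=7.50))
```

The second call finds the row by name and overwrites its fields, so the table still holds a single row. Matching on every field instead would have missed the stored row once the price changed, and a duplicate would have been inserted.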



