This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code that can also be run in Colab.

Colab notebook

Install txtai and Elasticsearch.
```bash
# Install txtai and elasticsearch python client
pip install txtai elasticsearch

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1
```
Start an Elasticsearch instance.
```python
import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'],
               stdout=PIPE, stderr=STDOUT,
               preexec_fn=lambda: os.setuid(1))  # run as the daemon user

# Give the server time to start up
time.sleep(30)
```
This example works with a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family.

The download below is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.
```bash
wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
```
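The load query in the next section joins a `sections` table to an `articles` table. As a sanity check, the same join and filter can be exercised against a tiny in-memory database; the schema and sample rows below are invented for illustration and only mirror the columns the query touches:

```python
import sqlite3

# Miniature schema mirroring only the columns the load query uses (hypothetical)
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("CREATE TABLE articles (Id TEXT, Title TEXT, Published TEXT, Reference TEXT)")
cur.execute("CREATE TABLE sections (Id INTEGER, Article TEXT, Name TEXT, Text TEXT, Labels TEXT, Tags TEXT)")

# One article with an informative and a non-informative sentence (made-up data)
cur.execute("INSERT INTO articles VALUES ('a1', 'Sample title', '2020-05-01', 'https://example.com')")
cur.execute("INSERT INTO sections VALUES (0, 'a1', 'ABSTRACT', 'Risk factors include age.', NULL, 'COVID-19')")
cur.execute("INSERT INTO sections VALUES (1, 'a1', 'ABSTRACT', 'What are the risks?', 'QUESTION', 'COVID-19')")

# Same filter as the load step: keep tagged sentences without a NLP label
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text "
            "FROM sections s JOIN articles a ON s.article = a.id "
            "WHERE (s.labels IS NULL OR s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags IS NOT NULL")
rows = cur.fetchall()
print(rows)  # only the sentence without a NLP label survives
```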
## Load data into Elasticsearch
The block below copies rows from SQLite into Elasticsearch.
```python
import sqlite3

import regex as re

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")

for row in cur:
    # Build dict of name-value pairs for fields
    article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
    name = article["name"]

    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
        # Bulk action fields
        article["_id"] = article["id"]
        article["_index"] = "articles"

        # Buffer article
        buffer.append(article)
        rows += 1

        # Bulk load every 1000 records
        if len(buffer) == 1000:
            helpers.bulk(es, buffer)
            buffer = []
            print("Inserted {} articles".format(rows), end="\r")

# Load remaining records
if buffer:
    helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))
```
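The buffer-and-flush pattern above (bulk load every 1000 records, then flush the remainder) is generic enough to factor into a helper. This is a sketch with an invented name, not part of txtai or the elasticsearch client:

```python
def chunked(iterable, size):
    """Yields lists of up to size items, flushing any remainder at the end."""
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer

# Each chunk could then be passed to helpers.bulk(es, chunk)
print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```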
## Query the data

Run a query against the newly created index.

```python
import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
    source = result["_source"]
    results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])
display(HTML(df.to_html(index=False)))
```
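The hit-unpacking loop above can be generalized into a small helper that pulls a fixed set of `_source` fields out of an Elasticsearch response. The response dict below is hand-built to mimic the nesting `es.search` returns, and the helper name is my own:

```python
def hits_to_rows(response, fields):
    """Extracts the given _source fields from each hit as a tuple."""
    return [tuple(hit["_source"][f] for f in fields)
            for hit in response["hits"]["hits"]]

# Mocked response with the shape es.search produces (data is invented)
response = {"hits": {"hits": [
    {"_source": {"title": "Sample", "published": "2020-05-01", "text": "Risk factors..."}}
]}}

rows = hits_to_rows(response, ("title", "published", "text"))
print(rows)  # [('Sample', '2020-05-01', 'Risk factors...')]
```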
## Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. The answers are added as derived columns for each article.
```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

# Query template for retrieving all sections of a single article
document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort": ["id"]
}

def sections(article):
    """Retrieves all informative sections for an article."""
    rows = []

    search = document.copy()
    search["query"]["term"]["article"] = article

    for result in es.search(index="articles", body=search)["hits"]["hits"]:
        source = result["_source"]
        name, text = source["name"], source["text"]

        # Only process certain document sections
        if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
            rows.append(text)

    return rows
```
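One subtlety in `sections`: `document.copy()` is a shallow copy, so mutating the nested `term` dict also mutates the shared `document` template. That is harmless here because every call overwrites the same field, but `copy.deepcopy` avoids the aliasing entirely:

```python
import copy

document = {"query": {"term": {"article": None}}}

# Shallow copy: the nested dicts are shared with the original
search = document.copy()
search["query"]["term"]["article"] = "a1"
print(document["query"]["term"]["article"])  # a1 - the template changed too

# Deep copy: nested dicts are duplicated, the template is untouched
document["query"]["term"]["article"] = None
search = copy.deepcopy(document)
search["query"]["term"]["article"] = "a2"
print(document["query"]["term"]["article"])  # None
```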
## References

https://dev.to/neuml/tutorial-series-on-txtai-ibg



