This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code that can also be run in Colab.

Colab notebook

Install txtai and Elasticsearch.
```bash
# Install txtai and elasticsearch python client
pip install txtai elasticsearch

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1
```
Start an Elasticsearch instance.
```python
import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'],
               stdout=PIPE, stderr=STDOUT,
               preexec_fn=lambda: os.setuid(1))  # run as the daemon user

# Give the server time to start up
time.sleep(30)
```
This example works with a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family.

The download below is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.
```bash
wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
```
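The load query in the next section joins a `sections` table to an `articles` table. As a sanity check, the same join and filter can be exercised against a tiny in-memory database; the schema and sample rows below are invented for illustration and only mirror the columns the query touches:

```python
import sqlite3

# Miniature schema mirroring only the columns the load query uses (hypothetical)
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("CREATE TABLE articles (Id TEXT, Title TEXT, Published TEXT, Reference TEXT)")
cur.execute("CREATE TABLE sections (Id INTEGER, Article TEXT, Name TEXT, Text TEXT, Labels TEXT, Tags TEXT)")

# One article with an informative and a non-informative sentence (made-up data)
cur.execute("INSERT INTO articles VALUES ('a1', 'Sample title', '2020-05-01', 'https://example.com')")
cur.execute("INSERT INTO sections VALUES (0, 'a1', 'ABSTRACT', 'Risk factors include age.', NULL, 'COVID-19')")
cur.execute("INSERT INTO sections VALUES (1, 'a1', 'ABSTRACT', 'What are the risks?', 'QUESTION', 'COVID-19')")

# Same filter as the load step: keep tagged sentences without a NLP label
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text "
            "FROM sections s JOIN articles a ON s.article = a.id "
            "WHERE (s.labels IS NULL OR s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags IS NOT NULL")
rows = cur.fetchall()
print(rows)  # only the sentence without a NLP label survives
```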
## Load data into Elasticsearch
The block below copies rows from SQLite into Elasticsearch.
```python
import sqlite3

import regex as re

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")

for row in cur:
    # Build dict of name-value pairs for fields
    article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
    name = article["name"]

    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
        # Bulk action fields
        article["_id"] = article["id"]
        article["_index"] = "articles"

        # Buffer article
        buffer.append(article)
        rows += 1

        # Bulk load every 1000 records
        if len(buffer) == 1000:
            helpers.bulk(es, buffer)
            buffer = []
            print("Inserted {} articles".format(rows), end="\r")

# Load remaining records
if buffer:
    helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))
```
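The buffer-and-flush pattern above (bulk load every 1000 records, then flush the remainder) is generic enough to factor into a helper. This is a sketch with an invented name, not part of txtai or the elasticsearch client:

```python
def chunked(iterable, size):
    """Yields lists of up to size items, flushing any remainder at the end."""
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer

# Each chunk could then be passed to helpers.bulk(es, chunk)
print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```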
## Query the data

Run a query against the newly created index.

```python
import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
    source = result["_source"]
    results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])
display(HTML(df.to_html(index=False)))
```
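The hit-unpacking loop above can be generalized into a small helper that pulls a fixed set of `_source` fields out of an Elasticsearch response. The response dict below is hand-built to mimic the nesting `es.search` returns, and the helper name is my own:

```python
def hits_to_rows(response, fields):
    """Extracts the given _source fields from each hit as a tuple."""
    return [tuple(hit["_source"][f] for f in fields)
            for hit in response["hits"]["hits"]]

# Mocked response with the shape es.search produces (data is invented)
response = {"hits": {"hits": [
    {"_source": {"title": "Sample", "published": "2020-05-01", "text": "Risk factors..."}}
]}}

rows = hits_to_rows(response, ("title", "published", "text"))
print(rows)  # [('Sample', '2020-05-01', 'Risk factors...')]
```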
## Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. The answers are added as derived columns for each article.
```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

# Query template for retrieving all sections of a single article
document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort": ["id"]
}

def sections(article):
    """Retrieves all informative sections for an article."""
    rows = []

    search = document.copy()
    search["query"]["term"]["article"] = article

    for result in es.search(index="articles", body=search)["hits"]["hits"]:
        source = result["_source"]
        name, text = source["name"], source["text"]

        # Only process certain document sections
        if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
            rows.append(text)

    return rows
```
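One subtlety in `sections`: `document.copy()` is a shallow copy, so mutating the nested `term` dict also mutates the shared `document` template. That is harmless here because every call overwrites the same field, but `copy.deepcopy` avoids the aliasing entirely:

```python
import copy

document = {"query": {"term": {"article": None}}}

# Shallow copy: the nested dicts are shared with the original
search = document.copy()
search["query"]["term"]["article"] = "a1"
print(document["query"]["term"]["article"])  # a1 - the template changed too

# Deep copy: nested dicts are duplicated, the template is untouched
document["query"]["term"]["article"] = None
search = copy.deepcopy(document)
search["query"]["term"]["article"] = "a2"
print(document["query"]["term"]["article"])  # None
```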
## References

https://dev.to/neuml/tutorial-series-on-txtai-ibg



