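This approach may be fine for some data sets, with a few caveats:
- The Cardinality aggregation uses the HyperLogLog++ algorithm to *approximate* the cardinality. The approximation may be completely accurate for low-cardinality fields, but it becomes less accurate for high-cardinality ones.
- The Terms aggregation may be computationally expensive for *many* terms, because every bucket has to be built in memory and then serialized into the response.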
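To make the first caveat concrete, here is a minimal sketch of a cardinality request with an explicit precision threshold. It assumes the same NEST 2.x-era client API as the examples further down; the `Question` class and the `stackoverflow` index name are hypothetical stand-ins, and it needs a running cluster to execute. Elasticsearch caps `precision_threshold` at 40000, and counts above the threshold are expected to become progressively less exact.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

// Hypothetical minimal POCO; only the field used here is shown.
public class Question
{
    public int? OwnerUserId { get; set; }
}

public static class CardinalityExample
{
    public static void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));

        // "stackoverflow" is an assumed index name, not taken from the answer.
        var settings = new ConnectionSettings(pool).DefaultIndex("stackoverflow");
        var client = new ElasticClient(settings);

        var response = client.Search<Question>(s => s
            .Size(0) // no hits needed, only the aggregation result
            .Aggregations(a => a
                .Cardinality("unique_users_count", c => c
                    .Field(f => f.OwnerUserId)
                    // Highest value Elasticsearch accepts; trades memory for accuracy.
                    .PrecisionThreshold(40000)
                )
            )
        );

        // Value is an approximation, not an exact distinct count.
        double? approximateCount = response.Aggs.Cardinality("unique_users_count").Value;
        Console.WriteLine($"approximate unique question askers: {approximateCount}");
    }
}
```

Comparing this approximate count with an exact one on your own data is a quick way to see how far off the estimate is for a field of your cardinality.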
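You can probably skip the Cardinality aggregation used to get the size and simply pass `int.MaxValue` as the size of the Terms aggregation. An alternative that is less efficient in terms of speed is to scroll through all documents in scope, using a source filter to return only the field you are interested in. I would expect the Scroll approach to put less pressure on the cluster, but I recommend monitoring whichever approach you take.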
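Here is a comparison of the two approaches on the Stack Overflow data set (taken June 2016, IIRC), looking at unique question askers between two years ago today and one year ago today.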
Terms Aggregation

    // Script-style Main (for example, LINQPad); assumes the Nest and
    // Elasticsearch.Net namespaces are imported.
    void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var connectionSettings = new ConnectionSettings(pool)
            .MapDefaultTypeIndices(d => d
                .Add(typeof(Question), NDC.StackOverflowIndex)
            );

        var client = new ElasticClient(connectionSettings);

        var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
        var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

        var searchResponse = client.Search<Question>(s => s
            .Size(0) // no hits needed, only the aggregation buckets
            .Query(q => q
                .DateRange(c => c
                    .Field(p => p.CreationDate)
                    .GreaterThan(twoYearsAgo)
                    .LessThan(yearAgo)
                )
            )
            .Aggregations(a => a
                .Terms("unique_users", c => c
                    .Field(f => f.OwnerUserId)
                    .Size(int.MaxValue) // ask for every bucket
                )
            )
        );

        var uniqueOwnerUserIds = searchResponse.Aggs.Terms("unique_users").Buckets
            .Select(b => b.KeyAsString)
            .ToList();

        // 3.83 seconds
        // unique question askers: 795352
        Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
    }

Scroll API
    // Script-style Main (for example, LINQPad); assumes the Nest and
    // Elasticsearch.Net namespaces are imported.
    void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var connectionSettings = new ConnectionSettings(pool)
            .MapDefaultTypeIndices(d => d
                .Add(typeof(Question), NDC.StackOverflowIndex)
            );

        var client = new ElasticClient(connectionSettings);

        var uniqueOwnerUserIds = new HashSet<int>();

        var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
        var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

        var searchResponse = client.Search<Question>(s => s
            .Source(sf => sf
                .Include(ff => ff
                    .Field(f => f.OwnerUserId) // return only the field of interest
                )
            )
            .Size(10000)
            .Scroll("1m")
            .Query(q => q
                .DateRange(c => c
                    .Field(p => p.CreationDate)
                    .GreaterThan(twoYearsAgo)
                    .LessThan(yearAgo)
                )
            )
        );

        // Keep scrolling until a page comes back empty.
        while (searchResponse.Documents.Any())
        {
            foreach (var document in searchResponse.Documents)
            {
                if (document.OwnerUserId.HasValue)
                    uniqueOwnerUserIds.Add(document.OwnerUserId.Value);
            }

            searchResponse = client.Scroll<Question>("1m", searchResponse.ScrollId);
        }

        // Release the scroll context on the cluster.
        client.ClearScroll(c => c.ScrollId(searchResponse.ScrollId));

        // 91.8 seconds
        // unique question askers: 795352
        Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
    }

The Terms aggregation is roughly 24 times faster than the Scroll API approach (3.83 seconds versus 91.8 seconds), and both return the same count of 795352 unique question askers.



