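How to get incremental counts of aggregate values in a joined table
Using functions or expressions in JOIN conditions is usually a bad idea. I say usually because some optimizers handle it quite well and make use of indexes anyway. I would suggest creating a table for the weights, something like: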
CREATE TABLE weights
( weight int not null primary key
);
INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);
Make sure you have an index on posts_reasons:
CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);
Then a query like this:
SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
       FROM reasons r
       JOIN posts_reasons pr
           ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
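To make the shape of the result concrete, here is a minimal sketch that runs the same kind of join against an in-memory SQLite database through Python's sqlite3 module. Everything in it is an assumption for illustration: the tiny data set, the index name ix_posts_reasons and the helper names build_demo_db and cumulative_counts are made up, and SQLite stands in for MariaDB. It also uses >= in the join condition, on the assumption that the question's "less than or equal to" wording is what is wanted, whereas the query above uses a strict >.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INT NOT NULL);
        CREATE TABLE posts_reasons (post_id INT NOT NULL, reason_id INT NOT NULL);
        CREATE UNIQUE INDEX ix_posts_reasons ON posts_reasons (reason_id, post_id);
        CREATE TABLE weights (weight INT NOT NULL PRIMARY KEY);
        """
    )
    # Hypothetical reasons: id -> weight
    conn.executemany("INSERT INTO reasons VALUES (?, ?)", [(1, 5), (2, 10), (3, 20)])
    # Hypothetical links (post_id, reason_id); post totals are 5, 15, 30 and 20
    conn.executemany(
        "INSERT INTO posts_reasons VALUES (?, ?)",
        [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 3)],
    )
    # Lookup table of increments: 0, 10, 20, 30, 40
    conn.executemany("INSERT INTO weights VALUES (?)", [(w,) for w in range(0, 50, 10)])
    return conn


def cumulative_counts(conn):
    """Return (weight, post_count) rows, counting posts with total <= weight."""
    return conn.execute(
        """
        SELECT w.weight, COUNT(1) AS post_count
        FROM weights w
        JOIN (SELECT pr.post_id, SUM(r.weight) AS sum_weight
              FROM reasons r
              JOIN posts_reasons pr ON r.id = pr.reason_id
              GROUP BY pr.post_id) AS x
          ON w.weight >= x.sum_weight   -- assumption: inclusive comparison
        GROUP BY w.weight
        ORDER BY w.weight
        """
    ).fetchall()


if __name__ == "__main__":
    print(cumulative_counts(build_demo_db()))
    # prints [(10, 1), (20, 3), (30, 4), (40, 4)]
```

Because this is an inner join, an increment with no matching posts (weight 0 in this data) does not appear in the output at all. If such rows should show up with a count of 0, a LEFT JOIN with COUNT(x.post_id) would be one way to get them.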
My home machine is probably 5-6 years old; it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8 GB of RAM.
uname -a
Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I tested against:
https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5
MariaDB [test3]> select @@version;
+-----------------+
| @@version |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)
SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
       FROM reasons r
       JOIN posts_reasons pr
           ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
+--------+------------+
| weight | post_count |
+--------+------------+
| 0 | 1 |
| 10 | 2591 |
| 20 | 4264 |
| 30 | 4386 |
| 40 | 5415 |
| 50 | 7499 |
[...]
| 1270 | 119283 |
| 1320 | 119286 |
| 1330 | 119286 |
[...]
| 2590 | 119286 |
+--------+------------+
256 rows in set (9.89 sec)
If performance is critical and nothing else helps, you could create a summary table for:
SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id
You can maintain this table via triggers.
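As a rough sketch of what trigger maintenance could look like, the following uses SQLite through Python's sqlite3 module so that it runs on its own. The table name post_weight_totals, the trigger names and the helper functions are hypothetical, and the trigger syntax is SQLite's; MySQL and MariaDB triggers are written differently, so treat this as an illustration of the idea rather than something to paste into production.

```python
import sqlite3

SCHEMA = """
CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INT NOT NULL);
CREATE TABLE posts_reasons (post_id INT NOT NULL, reason_id INT NOT NULL);

-- Hypothetical summary table: one row per post with its total weight.
CREATE TABLE post_weight_totals (
    post_id INTEGER PRIMARY KEY,
    sum_weight INT NOT NULL
);

-- When a reason is attached to a post, add its weight to the total.
CREATE TRIGGER posts_reasons_ai AFTER INSERT ON posts_reasons
BEGIN
    INSERT OR IGNORE INTO post_weight_totals (post_id, sum_weight)
    VALUES (NEW.post_id, 0);
    UPDATE post_weight_totals
       SET sum_weight = sum_weight
           + (SELECT weight FROM reasons WHERE id = NEW.reason_id)
     WHERE post_id = NEW.post_id;
END;

-- When a reason is detached from a post, subtract its weight again.
CREATE TRIGGER posts_reasons_ad AFTER DELETE ON posts_reasons
BEGIN
    UPDATE post_weight_totals
       SET sum_weight = sum_weight
           - (SELECT weight FROM reasons WHERE id = OLD.reason_id)
     WHERE post_id = OLD.post_id;
END;
"""


def make_summary_db():
    """Create the schema plus three made-up reasons (weights 5, 10, 20)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO reasons VALUES (?, ?)", [(1, 5), (2, 10), (3, 20)])
    return conn


def post_totals(conn):
    """Return {post_id: sum_weight} as currently stored in the summary table."""
    return dict(conn.execute("SELECT post_id, sum_weight FROM post_weight_totals"))


if __name__ == "__main__":
    conn = make_summary_db()
    conn.executemany("INSERT INTO posts_reasons VALUES (?, ?)", [(1, 1), (1, 2), (2, 3)])
    print(post_totals(conn))  # prints {1: 15, 2: 20}
    conn.execute("DELETE FROM posts_reasons WHERE post_id = 1 AND reason_id = 1")
    print(post_totals(conn))  # prints {1: 10, 2: 20}
```

This sketch only reacts to inserts and deletes on posts_reasons. A change to reasons.weight, or an update of an existing posts_reasons row, would need additional triggers, and a post whose last reason is removed keeps a row with a total of 0.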
Since a certain amount of work has to be done for each weight in weights, it may be beneficial to limit this table:
    ON w.weight > x.sum_weight
WHERE w.weight <= (select MAX(sum_weights)
                   from (SELECT SUM(weight) as sum_weights
                         FROM reasons r
                         JOIN posts_reasons pr
                             ON r.id = pr.reason_id
                         GROUP BY pr.post_id) a
                  )
GROUP BY w.weight
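Since I had a lot of unnecessary rows in my weights table (weights up to 2590), the restriction above cut the execution time from 9 seconds to 4 seconds.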
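The restriction is shown above only as a fragment, so here is a small sketch of how the full statement might be assembled, again run against in-memory SQLite via Python's sqlite3 module. The data, the constant BOUNDED_QUERY and the helper names are invented for the example, the weights table deliberately runs well past the largest post total, and the join uses >= on the assumption that an inclusive "less than or equal to" count is wanted.

```python
import sqlite3

# Full query with the upper bound on weights added (assumed assembly of the
# fragment; the inclusive >= comparison is also an assumption).
BOUNDED_QUERY = """
SELECT w.weight, COUNT(1) AS post_count
FROM weights w
JOIN (SELECT pr.post_id, SUM(r.weight) AS sum_weight
      FROM reasons r
      JOIN posts_reasons pr ON r.id = pr.reason_id
      GROUP BY pr.post_id) AS x
  ON w.weight >= x.sum_weight
WHERE w.weight <= (SELECT MAX(sum_weights)
                   FROM (SELECT SUM(r.weight) AS sum_weights
                         FROM reasons r
                         JOIN posts_reasons pr ON r.id = pr.reason_id
                         GROUP BY pr.post_id) a)
GROUP BY w.weight
ORDER BY w.weight
"""


def build_bounded_demo_db():
    """In-memory database whose weights table overshoots the largest total."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reasons (id INTEGER PRIMARY KEY, weight INT NOT NULL);
        CREATE TABLE posts_reasons (post_id INT NOT NULL, reason_id INT NOT NULL);
        CREATE TABLE weights (weight INT NOT NULL PRIMARY KEY);
        """
    )
    conn.executemany("INSERT INTO reasons VALUES (?, ?)", [(1, 5), (2, 10), (3, 20)])
    # Made-up links; post totals are 5, 15, 30 and 20, so the maximum is 30.
    conn.executemany(
        "INSERT INTO posts_reasons VALUES (?, ?)",
        [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 3)],
    )
    # Increments 0, 10, ..., 100: everything above 30 is unnecessary work.
    conn.executemany("INSERT INTO weights VALUES (?)", [(w,) for w in range(0, 110, 10)])
    return conn


def bounded_cumulative_counts(conn):
    """Run the bounded query and return its (weight, post_count) rows."""
    return conn.execute(BOUNDED_QUERY).fetchall()


if __name__ == "__main__":
    print(bounded_cumulative_counts(build_bounded_demo_db()))
    # prints [(10, 1), (20, 3), (30, 4)]
```

The increments from 40 to 100 never reach the join, which is where the saving comes from. Note that the bound is computed with the same aggregation as the derived table, so a summary table would let both parts read precomputed totals.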
Original question
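I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, so each post has a total aggregate weight associated with it.
For each increment of 10 weight points (i.e. 0, 10, 20, 30, etc.), I want to get the count of posts whose total weight is less than or equal to that increment. I'd expect the results to look something like this: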
weight | post_count
--------+------------
0 | 0
10 | 5
20 | 12
30 | 18
... | ...
280 | 20918
290 | 21102
... | ...
1250 | 118005
1260 | 118039
1270 | 118040
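The total weights are roughly normally distributed, with a few very low values and a few very high ones (the maximum is currently 1277), but most are in the middle. There are just under 120,000 rows in posts and around 120 rows in reasons. Each post has 5 or 6 reasons on average.
The relevant parts of the tables look like this: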
CREATE TABLE `posts` (
  id BIGINT PRIMARY KEY
);
CREATE TABLE `reasons` (
  id BIGINT PRIMARY KEY,
  weight INT(11) NOT NULL
);
CREATE TABLE `posts_reasons` (
  post_id BIGINT NOT NULL,
  reason_id BIGINT NOT NULL,
  CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
  CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);
So far, I've tried putting the post ID and total weight into a view, then joining that view to itself to get an aggregate count:
CREATE VIEW `post_weights` AS (
  SELECT
    posts.id,
    SUM(reasons.weight) AS reason_weight
  FROM posts
  INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
  INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
  GROUP BY posts.id
);
SELECT
  FLOOR(p1.reason_weight / 10) AS weight,
  COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;
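That, however, is unusably slow: I let it run for 15 minutes without it finishing, which I can't do in production.
Is there a more efficient way to do this?
If you're interested in testing with the entire dataset, it's available for download here. The file is around 60MB and expands to around 250MB. Alternatively, there are 12,000 rows in a GitHub gist here.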