以下数据ID和物理存储相关性非常差。
postgres=# insert into t2 select id,md5(random()::text) from generate_series(1,10000000) as t(id) order by random();
INSERT 0 10000000
查询他们的相关性。显然T2表的物理存储和实际值顺序相关性很差。
postgres=# select correlation from pg_stats where tablename='t1' and attname='id';
correlation
-------------
1
(1 row)
postgres=# select correlation from pg_stats where tablename='t2' and attname='id';
correlation
-------------
0.00805771
(1 row)
创建索引,创建索引的速度明显比BTREE索引快,因为BRIN只需要存储值区间,瘦得很。
postgres=# create index idx_t1_id on t1 using brin (id);
CREATE INDEX
postgres=# create index idx_t2_id on t2 using brin (id);
CREATE INDEX
我们看看索引的大小和表的大小,从BRIN的原理我们可以想象索引肯定很小,表650MB,索引才192K。
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------+-------+----------+-------+--------+-------------
public | idx_t1_id | index | postgres | t1 | 192 kB |
public | idx_t2_id | index | postgres | t2 | 192 kB |
(2 rows)
postgres=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | t1 | table | postgres | 650 MB |
(1 row)
postgres=# \dt+ t2
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | t2 | table | postgres | 650 MB |
(1 row)
来看看实际的查询差别就知道,BRIN有多么适合流式数据了。
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=50.98..9767.60 rows=3803 width=37) (actual time=0.351..13.732 rows=4001 loops=1)
Recheck Cond: ((id >= 1000) AND (id <= 5000))
Rows Removed by Index Recheck: 57567
Heap Blocks: lossy=128
-> Bitmap Index Scan on idx_t1_id (cost=0.00..50.03 rows=3803 width=0) (actual time=0.104..0.104 rows=1280 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.111 ms
Execution time: 14.019 ms
(8 rows)
对于相关性差的,还不如全表扫描。
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2 (cost=49.78..9549.73 rows=3686 width=37) (actual time=2.806..2268.044 rows=4001 loops=1)
Recheck Cond: ((id >= 1000) AND (id <= 5000))
Rows Removed by Index Recheck: 9995999
Heap Blocks: lossy=20791
-> Bitmap Index Scan on idx_t2_id (cost=0.00..48.86 rows=3686 width=0) (actual time=2.019..2.019 rows=208640 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.195 ms
Execution time: 2268.590 ms
(8 rows)
t2全表扫描
postgres=# set enable_bitmapscan=off;
SET
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Seq Scan on t2 (cost=0.00..170791.00 rows=3686 width=37) (actual time=0.593..1881.929 rows=4001 loops=1)
Filter: ((id >= 1000) AND (id <= 5000))
Rows Removed by Filter: 9995999
Planning time: 0.109 ms
Execution time: 1882.397 ms
(5 rows)
接下来BRIN和BTREE索引对比一下。
postgres=# create index idx_t1_id_bt on t1 using btree (id);
CREATE INDEX
postgres=# create index idx_t2_id_bt on t2 using btree (id);
CREATE INDEX
postgres=# set enable_bitmapscan=on;
SET
postgres=# drop index idx_t1_id;
DROP INDEX
postgres=# drop index idx_t2_id;
DROP INDEX
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1_id_bt on t1 (cost=0.43..102.04 rows=3880 width=37) (actual time=0.023..1.048 rows=4001 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.412 ms
Execution time: 1.318 ms
(4 rows)
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2 (cost=53.05..10056.68 rows=3962 width=37) (actual time=1.932..8.304 rows=4001 loops=1)
Recheck Cond: ((id >= 1000) AND (id <= 5000))
Heap Blocks: exact=3642
-> Bitmap Index Scan on idx_t2_id_bt (cost=0.00..52.05 rows=3962 width=0) (actual time=1.143..1.143 rows=4001 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.379 ms
Execution time: 8.621 ms
(7 rows)
我们看到btree索引查询性能是提高了,但是索引大小你看看有多大?
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+----------+-------+--------+-------------
public | idx_t1_id_bt | index | postgres | t1 | 213 MB |
public | idx_t2_id_bt | index | postgres | t2 | 213 MB |
(2 rows)
接下调整brin索引的精度提高查询效率,我们了解到默认的brin是存储128个连续的数据块区间的,这个值越小,精度越高。
postgres=# create index idx_t1_id on t1 using brin (id) with (pages_per_range=1);
CREATE INDEX
postgres=# create index idx_t2_id on t2 using brin (id) with (pages_per_range=1);
CREATE INDEX
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+----------+-------+--------+-------------
public | idx_t1_id | index | postgres | t1 | 672 kB |
public | idx_t1_id_bt | index | postgres | t1 | 213 MB |
public | idx_t2_id | index | postgres | t2 | 672 kB |
public | idx_t2_id_bt | index | postgres | t2 | 213 MB |
(4 rows)
postgres=# drop index idx_t1_id_bt;
DROP INDEX
postgres=# drop index idx_t2_id_bt;
DROP INDEX
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=110.98..9827.60 rows=3803 width=37) (actual time=9.487..10.571 rows=4001 loops=1)
Recheck Cond: ((id >= 1000) AND (id <= 5000))
Rows Removed by Index Recheck: 328
Heap Blocks: lossy=9
-> Bitmap Index Scan on idx_t1_id (cost=0.00..110.03 rows=3803 width=0) (actual time=9.449..9.449 rows=90 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.141 ms
Execution time: 10.853 ms
(8 rows)
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2 (cost=109.78..9609.73 rows=3686 width=37) (actual time=10.407..481.673 rows=4001 loops=1)
Recheck Cond: ((id >= 1000) AND (id <= 5000))
Rows Removed by Index Recheck: 2125867 # 看看精度不高的后果,取4001条数据却额外扫描了2125867条无用数据
Heap Blocks: lossy=4428
-> Bitmap Index Scan on idx_t2_id (cost=0.00..108.86 rows=3686 width=0) (actual time=10.364..10.364 rows=44280 loops=1)
Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.106 ms
Execution time: 482.077 ms
(8 rows)
精度提高后,扫描效率有一定的提升。(对于相关度不高的就不要用BRIN了,精度提高到1都于事无补的,无用功太多)当然相比btree还有差距,不过对于大数据场景,我们还要考虑数据的插入性能,对于btree插入性能好还是brin的插入性能好呢?
我这里简单的测试了一下,并未涉及并发处理,已经可以明显的了解到btree索引对数据插入带来的开销更大。
postgres=# \d t1
Table "public.t1"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
info | text |
Indexes:
"idx_t1_id" brin (id) WITH (pages_per_range=1)
postgres=# \timing
Timing is on.
postgres=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
Time: 2152.527 ms
postgres=# drop index idx_t1_id;
DROP INDEX
Time: 9.527 ms
postgres=# create index idx_t1_id_bt on t1 using btree (id);
CREATE INDEX
Time: 29659.752 ms
postgres=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
Time: 5407.971 ms
pages_per_range
Defines the number of table blocks that make up one block range for each entry of a BRIN index (see Section 60.1 for more details). The default is 128.