关于in、exists、not in、not exists的性能分析简介 - 168大数据

In和exists这两个一直有朋友提到，到底具体用哪一个比较高效，这里我们来借助例子简单的分析下

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select * from table02 b where exists (select 1 from table01 a where a.id=b.object_id);

----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 75413 | 7806K| | 776 (1)| 00:00:10 |
|* 1 | HASH JOIN RIGHT SEMI| | 75413 | 7806K| 1848K| 776 (1)| 00:00:10 |
| 2 | VIEW | VW_SQ_1 | 75443 | 957K| | 78 (2)| 00:00:01 |
| 3 | TABLE ACCESS FULL | TABLE01 | 75443 | 368K| | 78 (2)| 00:00:01 |
| 4 | TABLE ACCESS FULL | TABLE02 | 75443 | 6851K| | 231 (2)| 00:00:03 |
----------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
6335 consistent gets
0 physical reads
0 redo size
4098634 bytes sent via SQL*Net to client
55800 bytes received via SQL*Net from client
5030 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
75425 rows processed

-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 75413 | 7806K| | 776 (1)| 00:00:10 |
|* 1 | HASH JOIN RIGHT SEMI| | 75413 | 7806K| 1848K| 776 (1)| 00:00:10 |
| 2 | VIEW | VW_NSO_1 | 75443 | 957K| | 78 (2)| 00:00:01 |
| 3 | TABLE ACCESS FULL | TABLE01 | 75443 | 368K| | 78 (2)| 00:00:01 |
| 4 | TABLE ACCESS FULL | TABLE02 | 75443 | 6851K| | 231 (2)| 00:00:03 |
-----------------------------------------------------------------------------------------

可能有朋友提到这个例子table01表返回的数据较多，没有有效的过滤条件，导致分别走的全表扫描和hash join，不具有代表性，同样我们随便选择一个过滤条件，这些列上面是有高效索引的
SQL> select * from table02 b where b.object_id in (select a.id from table01 a where a.name='XIAOYU01');

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 212 | 10 (10)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID | TABLE02 | 1 | 93 | 2 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 2 | 212 | 10 (10)| 00:00:01 |
| 3 | VIEW | VW_NSO_1 | 2 | 26 | 5 (0)| 00:00:01 |
| 4 | HASH UNIQUE | | 2 | 54 | | |
| 5 | TABLE ACCESS BY INDEX ROWID| TABLE01 | 2 | 54 | 5 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | IND_TABLE01_NAME | 2 | | 3 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | IND_TABLE02_ID | 1 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
8 consistent gets
0 physical reads
0 redo size
1404 bytes sent via SQL*Net to client
492 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 212 | 10 (10)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID | TABLE02 | 1 | 93 | 2 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 2 | 212 | 10 (10)| 00:00:01 |
| 3 | VIEW | VW_SQ_1 | 2 | 26 | 5 (0)| 00:00:01 |
| 4 | HASH UNIQUE | | 2 | 54 | | |
| 5 | TABLE ACCESS BY INDEX ROWID| TABLE01 | 2 | 54 | 5 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | IND_TABLE01_NAME | 2 | | 3 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | IND_TABLE02_ID | 1 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

这里就要提到下之前网上有过很多文章提到了关于in和exists适用规则：
1当子查询返回数据较少，而主查询中表又比较大而且有高效的索引时，in比较高效
2 当子查询返回数据较多，而主查询的表数据量又较小，适用exists比较高效。

这里还举例下网上有文章摘要了in和exists执行流程，我们来简单看下：
IN的执行流程
SELECT * FROM TABLE01 a WHERE a.x IN (SELECT b.x FROM TABLE02)
table02表首先有个对x去重的操作，然后再和TABLE01进行关联

SELECT * FROM TABLE01 a WHERE EXISTS (SELECT 1 FROM TABLE02 b WHERE a.x = b.x)
--可以理解为：
for x in ( select * from TABLE01 a ) LOOP
if ( exists ( select 1 from TABLE02 b where a.x = b.x )THEN
OUTPUT THE RECORD
end if
end loop
这里也看出是先以外部的表table01作为驱动表，然后遍历每个结果集去和内部的表或者结果集联合

In和exists的处理方式当真如此吗，下面我们再创建两个表进行验证
SQL> create table t1 as select object_id,object_name from dba_objects;
Table created.
SQL> create table t2 as select * from dba_objects where rownum<1000;
Table created.
SQL> create index ind_t1_objid on t1(object_id);
Index created.
SQL> create index ind_t2_objid on t2(object_id);
Index created.

这里看出in写法完全没有必要以子查询的表t2作为驱动表，cbo会选择驱动结果集尽量少的作为驱动表；
SQL> select * from t2 where t2.object_id in (select t1.object_id from t1);
999 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3980768476

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 999 | 185K| 48 (5)| 00:00:01 |
|* 1 | HASH JOIN SEMI | | 999 | 185K| 48 (5)| 00:00:01 |
| 2 | TABLE ACCESS FULL | T2 | 999 | 172K| 5 (0)| 00:00:01 |
| 3 | INDEX FAST FULL SCAN| IND_T1_OBJID | 81275 | 1031K| 42 (3)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
180 consistent gets
4 physical reads
0 redo size
48355 bytes sent via SQL*Net to client
1218 bytes received via SQL*Net from client
68 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
999 rows processed

同样exists一定要以外部的查询作为驱动表吗，其实也不然：
SQL> select * from t1 where exists (select 1 from t2 where t1.object_id=t2.object_id);

Execution Plan
----------------------------------------------------------
Plan hash value: 4255530270

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 999 | 91908 | 86 (4)| 00:00:02 |
|* 1 | HASH JOIN RIGHT SEMI | | 999 | 91908 | 86 (4)| 00:00:02 |
| 2 | INDEX FAST FULL SCAN| IND_T2_OBJID | 999 | 12987 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | T1 | 81275 | 6270K| 81 (2)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

Statistics
----------------------------------------------------------
30 recursive calls
0 db block gets
510 consistent gets
3 physical reads
0 redo size
32764 bytes sent via SQL*Net to client
1218 bytes received via SQL*Net from client
68 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
999 rows processed

这里同样是以内部表t2的表作为驱动表，然后t1表作为被驱动表。

其实in和exists不同的只是他们对null值的处理方式，in对null是敏感的，而exists则不然。

在oracle 8i时候，默认没有unnest特性，也就是子查询不能展开为表连接，此时in操作是先对子查询做操作，然后驱动外部的查询，exists则是先进行外部的查询，然后去驱动子查询，

而在oracle 9I后，默认就启用了unnest特性，子查询可以被展开为表连接，此时in和exists的写法往往可能不能改变其执行计划
下面我们调用一个no_unnest hint写法来进行验证
SQL> select * from t1 where exists (select /*+no_unnest*/1 from t2 where t1.object_id=t2.object_id);

Execution Plan
----------------------------------------------------------
Plan hash value: 895956251

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 93 | 14M (1)| 47:03:35 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL| T1 | 73615 | 6685K| 226 (2)| 00:00:03 |
|* 3 | TABLE ACCESS FULL| T2 | 1 | 4 | 225 (1)| 00:00:03 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
"OBJECT_ID"=:B1))
3 - filter("OBJECT_ID"=:B1)

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
37251283 consistent gets
0 physical reads
0 redo size
4037293 bytes sent via SQL*Net to client
54458 bytes received via SQL*Net from client
4908 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
73600 rows processed

这里t1表是大表，然后t2表是小表，没加no_unnest的hint时，跑出来的逻辑读510，cost则是86，而如果我们加上no_unnest的hint，不让子查询展开为表连接

Cost达到了14M，逻辑读则高达了37251283，这里选择的是t1表为驱动表，然后filter去关联t2表，而在没有unnest特性情况下，都是采用扫描驱动表然后filter循环遍历另一个表或者结果集，而这个执行成本如果驱动结果集较大，或者被驱动表没有高效索引，这个执行成本往往非常的高，小鱼已经预见好几次因为rows估算错误该选择hash join的方式走成了nested loop或者filter，此时cbo估算的成本值往往是错误的。

Unnest特性是可以展示子查询为表连接的，此时in和exists的两种sql写法往往都不会改变其执行计划，个人觉得上面这两个结论是具有片面性的，只有in和exists的两种写法改变了执行计划，比如驱动表的变化，表的关联方式等才能在执行计划和执行效率上产生较大的变化。

下面来简单看下not in和not exists是否有些许的变化，这里需要注意的还是小鱼上面所说的in是对null值敏感的

Execution Plan
----------------------------------------------------------
Plan hash value: 895956251

---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 73614 | 6685K| 14M (1)| 47:03:35 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL| T1 | 73615 | 6685K| 226 (2)| 00:00:03 |
|* 3 | TABLE ACCESS FULL| T2 | 1 | 4 | 225 (1)| 00:00:03 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter( NOT EXISTS (SELECT /*+ */ 0 FROM "T2" "T2" WHERE
LNNVL("OBJECT_ID"<>:B1)))
3 - filter(LNNVL("OBJECT_ID"<>:B1))

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
37245435 consistent gets
0 physical reads
0 redo size
1139 bytes sent via SQL*Net to client
481 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
0 rows processed

SQL> select * from t1 where not exists (select 1 from t2 where t1.object_id=t2.object_id);

Execution Plan
----------------------------------------------------------
Plan hash value: 629543484

-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 30 | 2910 | | 876 (1)| 00:00:11 |
|* 1 | HASH JOIN RIGHT ANTI| | 30 | 2910 | 1152K| 876 (1)| 00:00:11 |
| 2 | TABLE ACCESS FULL | T2 | 73616 | 287K| | 225 (1)| 00:00:03 |
| 3 | TABLE ACCESS FULL | T1 | 73615 | 6685K| | 226 (2)| 00:00:03 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2033 consistent gets
0 physical reads
0 redo size
1702 bytes sent via SQL*Net to client
492 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
15 rows processed

上面看出not in时，oracle并不能将子查询展开成表连接，此时执行计划为filter的关联方式，执行成本较高，而not exists则能将子查询展开为表连接，此时filter改为更优秀的hash join的连接方式

因为in对null值敏感，而oracle 11g之前没有null-aware anti join的算法，此时not in对应的执行计划往往是filter，而filter如果没有合适的驱动表，执行成本往往非常高。

下面我们添加一个not null的约束给t1和t2表分别添加not null的约束
SQL> alter table t1 modify object_id not null;
SQL> alter table t2 modify object_id not null;

如果在对应的列添加not nul约束，再来查看not in执行计划
此时oracle采用了hash join right anti的连接方式，
SQL> select * from t1 where t1.object_id not in (select object_id from t2);

Execution Plan
----------------------------------------------------------
Plan hash value: 629543484

Predicate Information (identified by operation id):
---------------------------------------------------

Statistics
----------------------------------------------------------
285 recursive calls
0 db block gets
2082 consistent gets
0 physical reads
0 redo size
1139 bytes sent via SQL*Net to client
481 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
8 sorts (memory)
0 sorts (disk)
0 rows processed

oracle将关联操作改为了hash join right anti，同样我们上面基本所有的测试用例中都出现了hash join right anti和hash join right semi，这个是oracle对于not in和not exists、in和exists的一种子查询的优化，当然我们必然能够unnest展开子查询为表连接，这个默认就是开启的。

而在oracle 11g中，由于有null-aware anti join的算法，即使没有not nul约束，此时hash join right也能够计算比较null值，所以此时执行计划转变为HASH JOIN RIGHT ANTI NA
[oracle@ora11g ~]$ sqlplus / as sysdba

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t1 as select * from dba_objects;
Table created.
SQL> create table t2 as select * from dba_objects;
Table created.
SQL> select * from t1 where t1.object_id not in (select object_id from t2);

no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 2739594415

----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 95457 | 20M| | 1765 (1)| 00:00:22 |
|* 1 | HASH JOIN RIGHT ANTI NA| | 95457 | 20M| 1952K| 1765 (1)| 00:00:22 |
| 2 | TABLE ACCESS FULL | T2 | 79725 | 1012K| | 339 (1)| 00:00:05 |
| 3 | TABLE ACCESS FULL | T1 | 95457 | 18M| | 340 (1)| 00:00:05 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

Statistics
----------------------------------------------------------
7 recursive calls
0 db block gets
1382 consistent gets
1242 physical reads
0 redo size
1351 bytes sent via SQL*Net to client
509 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
0 rows processed

那么对于not in和not exists而言，在oracle 11g下由于新的算法的引入，执行计划都是hash join right anti na，执行计划都是一样的。

而在oracle 10g而言，如果没有not null约束，not in还是走的filter循环，这个一般对于没有合适的驱动表或者结果集而言执行成本相比是比较大的。而not exists由于对于null值不敏感，还是可以采用高效的hash join right anti算法去联合，所以这种情况下，也就是我们经常所说的not exists可能比not in更加高效一点。

而如果你使用的数据库是oracle 10g，而且又不能指定not null约束，那么可能你需要改写为not exists了，或者改写成下列的sql来进行优化调整：
SQL> select * from t1 minus
2 select * from t1 where t1.object_id in (select object_id from t2);

Execution Plan
----------------------------------------------------------
Plan hash value: 3848269147

-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 73615 | 13M| | 4317 (59)| 00:00:52 |
| 1 | MINUS | | | | | | |
| 2 | SORT UNIQUE | | 73615 | 6685K| 17M| 1807 (1)| 00:00:22 |
| 3 | TABLE ACCESS FULL | T1 | 73615 | 6685K| | 226 (2)| 00:00:03 |
| 4 | SORT UNIQUE | | 73600 | 6971K| 14M| 2510 (1)| 00:00:31 |
|* 5 | HASH JOIN | | 73600 | 6971K| 1152K| 876 (1)| 00:00:11 |
| 6 | TABLE ACCESS FULL| T2 | 73616 | 287K| | 225 (1)| 00:00:03 |
| 7 | TABLE ACCESS FULL| T1 | 73615 | 6685K| | 226 (2)| 00:00:03 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access("T1"."OBJECT_ID"="OBJECT_ID")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
3048 consistent gets
0 physical reads
0 redo size
1139 bytes sent via SQL*Net to client
481 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
0 rows processed