Thursday, 22 November 2012

“IN”, “EXISTS” clause and their performance

Introduction
To improve the performance of the Query, the general guideline is not to prefer the "IN" Clause. The guideline of the MS SQL query performance says that if we needs "IN" clause, instead of using "IN" clause we must use the "EXISTS" clause.   As the "EXISTS" clause improve the performance of the query.
This article is related to "IN", "EXISTS" clause and their performance factors.

Is "IN" and "EXISTS" clause are same
IN Clause
Returns true if a specified value matches any value in a sub query or a list.
EXISTS Clause
Returns true if a sub query contains any rows.
So we see that the "IN" and the "EXISTS" cluse are not same. To support the above definition lets takes an example.
-- Base Tabe
IF OBJECT_ID('emp_DtlTbl') IS NOT NULL
   BEGIN
     DROP TABLE emp_DtlTbl;
   END
GO
IF OBJECT_ID('emp_GradeTbl') IS NOT NULL
   BEGIN
     DROP TABLE emp_GradeTbl;
   END
GO     

CREATE TABLE emp_DtlTbl
       (EMPID   INT        NOT NULL IDENTITY PRIMARY KEY, 
        EMPNAME VARCHAR(50)NOT NULL);
GO

CREATE TABLE emp_GradeTbl
       (EMPID   INT        NOT NULL IDENTITY PRIMARY KEY, 
        GRADE   VARCHAR(1) NOT NULL);
GO

-- Insert Records
INSERT INTO  emp_DtlTbl
       (EMPNAME)
VALUES ('Joydeep Das'), ('Sukamal Jana'), ('Sudip Das');
GO

INSERT INTO  emp_GradeTbl
       (GRADE)
VALUES ('B'), ('B'), ('A');
GO

-- [ IN ] Clause Example-1
SELECT *
FROM   emp_DtlTbl;
WHERE  EMPID IN(SELECT EMPID FROM  emp_DtlTbl);           


-- [ IN ] Clause Example-2
SELECT *
FROM   emp_DtlTbl
WHERE  EMPID IN(1,2,3);


-- [ EXISTS ] Clause Example
SELECT a.*
FROM   emp_DtlTbl a
WHERE  EXISTS(SELECT b.*
              FROM   emp_DtlTbl b
              WHERE  b.EMPID = a.EMPID);
  

Performance Factors
To understand the performance factors let see the actual execution plan for "IN" and "EXISTS" clauses.
Take this example:
-- [ IN ] Clause Exampl
SELECT *
FROM   emp_DtlTbl
WHERE  EMPID =(SELECT EMPID
               FROM   emp_DtlTbl
               WHERE  EMPID = 2);           

-- [ EXISTS ] Clause Example
SELECT a.*
FROM   emp_DtlTbl a
WHERE  EXISTS(SELECT b.*
              FROM   emp_DtlTbl b
              WHERE  b.EMPID = 2
                     AND b.EMPID = a.EMPID);

If we compare the total query costs of the both MS SQL query, we see that the IN clause query cost is higher than the EXISTS clause query costs.

Special notes
Please note that: Here the data of the table is limited, so we cannot measure the performance factors.


Hope you like it.


Posted by: MR. JOYDEEP DAS

13 comments:

  1. In both casess (IN, EXISTS) you are running the subquery for each row.

    It would be better to simply use an INNER JOIN.

    Your examples are confusing. You create an populate emp_DtlTbl and emp_GradeTbl, but the example queries only reference emp_DtlTbl.... effectively a self join. What were you attempting to demonstrate?

    ReplyDelete
    Replies
    1. Thanks "Marc"
      Here i am trying to demonstrate only "IN" or "EXISTS". I know that the JOIN statement is Best approach for that. But in my article it is out of scope.

      Delete
  2. Thanks! Joydeep for giving nice article

    I think if we use EXISTS statement in following way to get more better performance :

    SELECT a.*
    FROM emp_DtlTbl a
    WHERE EXISTS(SELECT 1
    FROM emp_DtlTbl b
    WHERE b.EMPID = 2
    AND b.EMPID = a.EMPID);


    ReplyDelete
    Replies
    1. Thanks "Vikash"
      You are correct. The idle statement is the statement that you mentioned in the remarks.

      Delete
    2. SELECT a.*
      FROM emp_DtlTbl a
      WHERE EXISTS ( SELECT TOP 1
      1
      FROM emp_DtlTbl b
      WHERE b.EMPID = 2
      AND b.EMPID = a.EMPID );

      WITH ( NOLOCK FASTFIRSTROW ) if you want to feel extra special :)

      Delete
    3. FASTFIRSTROW assuming that you KNOW that there is 1 and only 1 entry, else it could perform worse. And NOLOCK assuming that you allow dirty reads.

      Delete
  3. sorry but i didn't get the point from this example .
    i tried it and didn't sow any different between the two:
    CPU cost ,i/o cost ...

    logically and from knowledge that EXISTS is better then in
    because in exists when he find the first value he stops searching
    when in IN clues for every value he start from the beginning to search for the next one .

    THANKS

    ReplyDelete
    Replies
    1. Thanks to "Itzik".

      That execution plans can differ depending on:
      ---------------------------------------------
      1. SQL Server Engine is smart enough to determine optimal
      execution plan for both of the different query and uses
      the same.
      2. Statistics. Execution plan can differ when we run it the
      first time and when the server has accumulated enough
      statistics to make more sensible decision.
      3. How much data there is in the table where SQL Server will
      revert to full index scan instead of index seek.
      4. Hardware. Depends on how many CPUs there are.

      Delete
  4. Marc and Joy.... I'm bit disagree with Inner join vs "IN/Exist" that you said above. I think IN/Exist will be more faster than Inner join as in this case it will go for "semi join" access path which will be much faster than Inner join option... I could see this behavior on Oracle and SQL 2008 R2 both.

    Also in case of IN Vs Exists, There are few cases where IN goes much better than Exists !!! I seen the same on Oracle...

    ReplyDelete
  5. Either way, whether you use "In" or "Exists" (Or their respective inverses) you need to watch for the 'null' or 'unknown' values. SQL Server treats nulls as logical unknowns, which cannot have a value of true for 'not-in' conditions.

    For proper handling of null values, use "Exists".

    ReplyDelete
  6. I see a whole lot of speculation, both pro and con, of performance but I see no proof. My recommendation would be to learn how to build enough test data to do some performance testing. And, no, the execution plan on 3 rows isn't going to tell you much. In fact, execution plans can be 100% incorrect especially on % of Batch.

    ReplyDelete