Tuesday, 31 July 2012

Index and Statistics


Introductions
Microsoft SQL Server collects statistical information related to indexes and column data stored in the database. These statistics are used by the SQL Server query optimizer to choose the most efficient plan for retrieving or updating data.
If we understand the index statistics in better way, it helps us to optimize and fine tune our queries.
As the Statistics play a very important role in the performance of MS SQL Server query.
In this article I am trying discussing related to it. Before writing this article, I read several article and MSDN to gathering facts related to it.

Point to Focus
In this article I am trying to focus related to the following things of the SQL Server Index statistics.
1.    What we mean by MS SQL server Index statistics
2.    When the Index Statistics are Updated
3.    Rules used by AUTO_UPDATE_STATISTICS
4.    Determine that the Index statistics are out of date.
5.    Updating the statistics
6.    How to see the content of Index statistics
7.    New feature for statistics in MS SQL Server 2008

What we mean by MS SQL server Index statistics
Before going to the actual definition of Index statistics, I am going to take an example so that we can understand it easily.

"Suppose you are a football coach and u have 20 players in your team and for paying final tournament you need 11 players out of 20.  Now, you have to know (or maintain a list) all the statistics of each payer before making your final team." 

Index statistics contains information related to the distribution of index key values. It means that the number of rows associated with each key value. To determine what kind of execution plan to be used when processing a query MS SQL Server query optimizer uses this information.

After creating the table Index when we perform the insert, update, or delete operations in the table, the statistics of the table become out of date.  The MS SQL Server doesn't update the statistics information every time we made some changes (Insert/Update/Delete Operations) on the table. Statistics track the distribution of values within an index or within a particular column.

Note that, if a column is not indexed but can benefit from an index, SQL Server will automatically create statistics for that column.

When the Index Statistics are updated
It is very important to understand, when the MS SQL Server update the statistics. The database settings AUTO_UPDATE_STATISTICS controls when the statistics are automatically updated. By default the AUTO_UPDATE_STATISTICS is true that means that the statistics are automatically updated.
MS SQL Server determines that when to update the statistics after creating it. It is based on how old the statics is.  It determines the outdated statistics based on number of Insert, Update and Delete from the date when the statistics is last updated and then recreate the statistics based on a threshold.
This threshold is relative to the number of records in the table. It means that the when the DML operation performs the index statistics slowly get older, until SQL Server determines to update the statistics.
Limitations:
When we have very large table and we are doing bulk Insert/Update/Delete operation the AUTO_UPDATE_STATISTICS can be overhead on our system. Because the                        AUTO_ UPDATE_STATISTICS busy to update the statistics and causes the system overhead. In such type of situation we must turn off the AUTO_UPDATE_STATISTICS and later we must manually update the statistics.
Rules used by AUTO_UPDATE_STATISTICS
"rowmodctr" column of the sysindexes table is used to determine the number of changes made since the last update of the statistics.
When the MS SQL Server updates the statistics it will follow the following rules.
Ø  If our table have 6 or fewer rows, the statistics will be updated after 6 changes.
Ø  If a table has 500 or fewer rows, statistics will be updated after 500 changes
Ø  If a table has more than 500 rows, statistics will be updated after 20% of the total rows    plus 500 rows are changed.  
Syntax related to AUTO_UPDATE_STATISTICS

The syntax is mentioned bellow:

sp_helpdb DBName
GO

ALTER DATABASE DBName SET AUTO_UPDATE_STATISTICS ON
GO


Determine that the Index statistics are out of date
MS SQL Server uses the sampling methods to keep track of the last time when the statistics were updated. This information is used to determine how old your statistics is.
The function STATS_DATE is used to determine when the statistics was last updated. The sample script is uses this function to display index statistic date for all user defined indexes.

SELECT  schema_name(o.schema_id)AS SchemaName,
        OBJECT_NAME(si.object_id)AS TableName,
        si.nameAS IndexName,
        STATS_DATE(i.object_id, i.index_id)AS StatDate
FROM    sys.indexes si
        INNER JOIN sys.objects o
             ON si.object_id= o.object_id
        INNER JOIN sys.indexes i
             ON i.object_id= si.object_id
                AND i.index_id = si.index_id
WHERE   o.type<>'S'
        AND STATS_DATE(i.object_id, i.index_id)IS NOT NULL;

Updating the statistics
As we see that the problem related to MS SQL Server updating statistics automatically, to get the optimal output we need to manually update it when needed.
To update an Index statistics we can drop the Index and then recreate the Index, it will automatically update the statistics information. It works but it is not the good way to update statistics information manually.
The system stored procedure named "sp_updatestats" helps us to update the statistical information.
The syntax is mentioned below:
sp_updatestats[ [ @resample = ] 'resample']

Parameters
@resample =] 'resample'

Specifies that sp_updatestats will use the RESAMPLE option of the UPDATE STATISTICS statement. If 'resample' is not specified, sp_updatestats updates statistics by using the default sampling. Resample is varchar(8) with a default value of NO.
Example:
USE AdventureWorks2012;
GO
EXECsp_updatestats;
The UPDATE STATISTICS is another option to update the statistical information.
The syntax is mentioned bellow:

UPDATE STATISTICS table_or_indexed_view_name
    [
        {
            { index_or_statistics__name }
          | ( { index_or_statistics_name } [ ,...n ] )
                }
    ]
    [    WITH
        [
            FULLSCAN
            | SAMPLE number { PERCENT | ROWS }
            | RESAMPLE
            | <update_stats_stream_option> [ ,...n ]
        ]
        [ [ , ] [ ALL | COLUMNS | INDEX ]
        [ [ , ] NORECOMPUTE ]
    ] ;

<update_stats_stream_option> ::=
    [ STATS_STREAM = stats_stream ]
    [ ROWCOUNT = numeric_constant ]
    [ PAGECOUNT = numeric_contant ]

For more information about syntax, follow the MSDN
Example:
USE AdventureWorks2012;
GO
UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN;
GO

How to see the content of Index statistics
To see the actual contents of the statistics we use the DBCC SHOW_STATISTICS.
The syntax is mentioned bellow.
DBCC SHOW_STATISTICS (Table_Name, Index_Name)

For more information about SHOW_STATISTICS, follow the MSDN
Parameters:
Table_Name : The name of the Table.
Index_Name : The Index name of the Mentioned Table.

New feature for statistics in MS SQL Server 2008
The new feature of statistics in SQL 2008 are mentioned as per MSDN
SQL Server 2008 provides the option to create filtered statistics, which are restricted to a subset of the rows in the table defined by a predicate, also called a filter. Filtered statistics are either created explicitly, or more commonly, through creating a filtered index.
·         Automatic creation: As mentioned, filtered statistics are automatically created as a by-product of filtered indexes. Unlike regular statistics, filtered statistics are never automatically created by the query optimizer as a result of query and update statements.
·         Automatic update: To determine when statistics need to be automatically updated, the query optimizer uses column modification counters. For filtered statistics, the counter is multiplied by the selectivity of the filter, which ranges from 0 to 1, to compensate for the fact that the column counter tracks changes to all rows in the table, not the changes to the rows in the filtered statistics object.
·         Sampling: To build the statistics, SQL Server 2008 reads every nth data page, according to the sampling rate, and then it applies the filter, if any.
·         If you don't specify sampling rate, the default behaviour is to sample based on the number of rows in the table and, for filtered statistics, based on the selectivity of the filter. The more selective the filter, the more rows need to be sampled to gather enough data to build the histogram.
·         Cardinality estimate: As with regular statistics, the optimizer attempts to pick the most relevant statistics when performing cardinality estimate. There is an additional check that the optimizer must make for filtered statistics: whether the statistics predicate contains the query predicate (that is, whether the query predicate must be true for a row if the statistics predicate is true for that row). Out of the possible filtered statistics, the query optimizer favours those that are as close to the query predicate as possible.

For more information about it, follow the MSDN

Related tropics



Referential Sources


Hope you like it.

Posted by: MR. JOYDEEP DAS

Tuesday, 24 July 2012

Scenario based Index type selection


Introduction

When we are thinking about index we have different options in mid to use them, such as Clustered / Non Clustered etc.
It is very important to understand in which scenario what type of Index gives us the performance boost up. As we all know that the too many indexes are not good for performance.
In this article I am trying to demonstrate the Scenario wise index selection process.   
Scenario with Index Type
Here I am trying to create a table type structure, in one side I use the type of index and in opposite side I am using the scenario related to it.

Type
   Scenario
Clustered Index
Clustered Index used in columns in the Query
1.    The query that returns the range of values by using operators such as BETWEEN clause, >, >=, < and <= etc.
2.    When the Query returns the larger result set.
3.    When using JOIN clauses with FOREIGN key columns.
4.    When ORDER BY or GROUP BY clauses in the query.

Use the Clustered Index on the following columns of Table

1.    If the columns have the UNIQUE values.
2.    The columns that are frequently used to retrieve the range of values.
3.    The IDENTITY columns that are guaranteed to be UNIQUE and auto increasing numbers.
4.    Columns that are used to sort data by using ORDER BY clause in SELECT statements.

Clustered Index is NOT good for the following columns of Table

1.    The Columns those are frequently modified or changed.
2.    Columns that use the composite key.
Non Clustered Index
Non Clustered Index used in columns in the Query
1.    Query that use the JOIN or GROUP BY Clause.
2.    Query that does not return large result set or data.
3.    The columns of the query where we frequently use the WHERE clauses.

Use the non Clustered Index on the following columns of Table

1.    Non clustered index is used in the non key columns that are frequently used in the query set.
2.    In the columns where data is not distinct such as combination of columns where the clustered index is used in the others columns.


Covering Non Clustered Index

If we have to use a non-clustered index and if we know that your application will be performing the same query over and over on the same table, consider creating a covering index on the table for the query

Guideline to create Covered Index

·         If the query or queries we run using the covering index are  
      seldom run, then the overhead of the covering index may  
      outweigh the benefits it provides.

·         The covering index should not add significantly to the size of 
      the key. If it does, then it its use may outweigh the benefits it 
      provides.

·         The covering index must include all columns found in the 
      SELECT list, the JOIN clause, and the WHERE clause.


Filtered Non Clustered Index
Use the filtered index in the following scenario

1.    There is a table that has huge amount of data and we often select only a subset of that data.  For an example, we have a lot of NULL values in a column and we want to retrieve records with only non-NULL values. In MS SQL Server 2008 this type of columns is called Sparse Column.

2.    In another scenario we have several categories of data in a particular column, but we often retrieve data only for a particular category value.

By this I mean to say that in the Query of Indexed column the WHERE statement is more or less fixed and the same WHERE statement is used in non clustered Filtered Index.




Related tropics


Hope you like it.

Posted by: MR. JOYDEEP DAS

Friday, 20 July 2012

Index and Sort Order


Introduction
When we write the SQL Query we frequently use the ORDER BY clause to get the output in definite order (Ascending or descending).
But ORDER BY clause has deep impacts in the performance of the query executions. In this article I am trying to discuss related to it.
How it Works
When designing the index we should consider whether the data for the index key column should be stored in ascending or descending order. Ascending is the default order and maintains compatibility with earlier versions of SQL Server.  It is most logical one, that the smallest values in the top and the biggest one is the bottoms.
Bust everything depends on the situations or the way we represent the records or how we constructs the SQL statements.
To understand it properly, I am taking an simple example where I am trying to demonstrate the performance issue with index data ordering.
Step-1 [ First create a Base Table ]
Here in this example I am taking my favorite employee table objects with attributes employee id, employee name and the department.
 -- Base Table

CREATE TABLE my_Emp
       (EmpID     INT NOT NULL IDENTITY(1,1),
        EmpName   VARCHAR(50) NOT NULL,
        EmpDept   VARCHAR(1)  NOT NULL)

Step-2 [ Now Insert some records on my base table ]
-- Record Insertion

INSERT INTO my_Emp
       (EmpName,  EmpDept)
VALUES ('Joydeep Das', 'A'),
       ('Tuhin Shinah', 'B'),
       ('Sangram Jit', 'C'),
       ('Sukamal Jana', 'A'),
       ('Sudip Das', 'B'),
       ('Manishankar Bhowmik', 'C'),
       ('Ratan Das', 'A'),
       ('Sidhu Jetha', 'B'),
       ('Subrata Kar', 'C')  
GO
Step-3 [ Now I am creating a non clustered index on table  ]
-- Index Created 

DROP INDEX ix_nonclus_my_Emp ON my_Emp    
CREATE NONCLUSTERED INDEX ix_nonclus_my_Emp
ON  my_Emp (EmpID, EmpName)

Step-4 [ Now execute the following Query and find the Actual Execution Plan ]
SELECT   EmpID, EmpName
FROM     my_Emp WITH(INDEX(ix_nonclus_my_Emp))
ORDER BY EmpID ASC, EmpName DESC



Step-5 [ Now Drop the Index and Recreate it again by using ASC and DESC clause]
-- Drop Index

DROP INDEX ix_nonclus_my_Emp ON my_Emp
GO
-- Recreate Index

CREATE NONCLUSTERED INDEX ix_nonclus_my_Emp
ON  my_Emp (EmpID ASC, EmpName DESC)


GO

Step-6 [ Now execute the following Query and find the Actual Execution Plan ]
SELECT   EmpID, EmpName
FROM     my_Emp WITH(INDEX(ix_nonclus_my_Emp))
ORDER BY EmpID ASC, EmpName DESC


Step-7 [ Now Compare Both the Execution Plan ]


In this case the SORT operator is found in the first execution plan but not in second hence increase the performance.

Summary
Sort order can be specified only for key columns. The sys.index_columns catalog view and the INDEXKEY_PROPERTY function report whether an index column is stored in ascending or descending order.

Hope you like it.


Posted by: MR. JOYDEEP DAS