Saturday 7 November 2015

Understanding CDC and Implementation in SSIS (PART-I)

Introduction
Change Data Capture (CDC) is introduced by Microsoft from SQL Server 2008. It is most popular in case of Incremental load. We are going to study a scenario and understand why the CDC is so important. The CDC is little complicated in MS SQL Server 2008 but Microsoft improve the functionality and made it easier for us from MS SQL 2012. In this article we are going to discuss CDC related to MS SQL 2012.

The article based on two things

1.    How CDC works on MS SQL Server 2012
2.    Implementation of MS SQL Server 2012 CDC into SSIS

Hope it will be informative and you enjoy the session.

Why CDC is needed a Scenario
We have a table named tbl_EmployeeDetails and it contains EmpID, EmpName, DesigNation and DOJ. The data is loaded from tbl_EmployeeDetails to Production server table named tbl_Employee.

Suppose we have 50 thousand records in the tbl_EmployeeDetails and data of that table is frequently changed by Insert/Update and Delete. So it is not possible to delete the destination table records and load it every time due to performance factors.
So we have to understand which records are Inserted/Updated/Deleted and make the operation on this sets only. Here in this source table tbl_EmployeeDetails we are unable to understand it and hence the CDC comes into the picture.
Before MS SQL 2008 we used alternate approaches that include timestamp columns, triggers, or complex queries often hurt performance and increase complexity.

How CDC Works
To understand it properly here we provide a pictorial diagram.



When any Insert/Update/Delete occurs in the database table it logged into the log file. The Scheduler reads the log file and stores the data into CDC table. This is done by LSN (Log Sequence Number). From this CDC table we can track the changes of the records in the table.


Implementation of CDC in MS SQL Server 2012
Before Working with CDC the SQL Server Agent Services must be STARTED

Step-1 [  Enabling Change Data Capture (CDC) in Database ]
USE TEST
GO

EXEC sys.sp_cdc_enable_db;
GO

To check the CDC is properly activated on the database
SELECT [name], database_id, is_cdc_enabled 
FROM   sys.databases WHERE name = 'TEST';     
GO

name
database_id
is_cdc_enabled
TEST
7
1

Here is_cdc_enabled = 1 means the CDC is enabled in the database named TEST.
                                               
After this execution of the command we find some table objects is created automatically as a system table in the database and we have to understand those table clearly.



                                                           
cdc.captured_columns 
This table returns result for list of captured column.

cdc.change_tables 
This table returns list of all the tables which are enabled for capture.

cdc.ddl_history 
This table contains history of all the DDL changes since capture data enabled.

cdc.index_columns 
This table contains indexes associated with change table.

cdc.lsn_time_mapping 
This table maps LSN number (for which we will learn later) and time.

Step-2 [ Create the Source Table ]

IF OBJECT_ID(N'[dbo].[tbl_EmployeeDetails]', N'U') IS NOT NULL
   BEGIN
       DROP TABLE [dbo].[tbl_EmployeeDetails];
   END
GO

CREATE TABLE [dbo].[tbl_EmployeeDetails]
(
EmpID        INT           NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
EmpName      VARCHAR(50)   NOT NULL,
DesigNation  CHAR(1)       NOT NULL,
DOJ          DATETIME NOT NULL
);

Step-3  [Enabling Change Data Capture (CDC) in One or More Table ]

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name   = N'tbl_EmployeeDetails',
@role_name     = NULL
GO

@source_schema is the table schema where we want to enable the CDC. In our case it is [dbo].

@source_name is the name of the table objects where we want to enable CDC.

@role_name   If there is any restriction of how data should be extracted from database, this option is used to specify any role which is following restrictions and gating access to data to this option if there is one.  If you do not specify any role and, instead, pass a NULL value, data access to this changed table will not be tracked and will be available to access by everybody.

This will create two jobs in the SQL Server Agent.

Job 'cdc.TEST_capture' started successfully.
Job 'cdc.TEST_cleanup' started successfully.

TEST_Capture
These jobs execute a system stored procedure named sys.sp_MScdc_capture_job. This procedure cannot be executed explicitly when a change data capture log scan operation is already active or when the database is enabled for transactional replication. This system SP enables SQL Server Agent, which in facts enable Change Data Capture feature.



TEST_Cleanup 

When this job is executed it runs the system stored procedure sys.sp_MScdc_cleanup_job. This system SP cleans up database changes tables.

After this operation there is another table is automatically created with name of the table that we created earlier (tbl_EmployeeDeatails). The name of the mirror table is [cdc].[dbo_tbl_EmployeeDetails_CT]

The [cdc].[dbo_tbl_EmployeeDetails_CT] table will contains all the changes (Insert/Update/Delete) in the table named [dbo].[ tbl_EmployeeDeatails].

Step – 4 [ Observing the cdc].[dbo_tbl_EmployeeDetails_CT] Table ]

SELECT * FROM [cdc].[dbo_tbl_EmployeeDetails_CT];

Here we got Five new columns and others are the columns from original table



Step- 5 [ Now Make some Operation on Source table and See the CDC changes ]

Insert Records
INSERT INTO [dbo].[tbl_EmployeeDetails]
     (EmpName, DesigNation, DOJ)
VALUES('Joydeep Das', 'A', '2012-09-08'),
      ('Deepasree Das', 'B', '2013-05-01');
GO

SELECT * FROM [dbo].[tbl_EmployeeDetails];
SELECT * FROM [cdc].[dbo_tbl_EmployeeDetails_CT];

EmpID
EmpName
DesigNation
DOJ
1
Joydeep Das
A
2012-09-08 00:00:00.000
2
Deepasree Das
B
2013-05-01 00:00:00.000

__$start_lsn
__$end_lsn
__$seqval
__$operation
__$update_mask
EmpID
0x00000086000000A80020
NULL
0x00000086000000A8001D
2
0x0F
1
0x00000086000000A80020
NULL
0x00000086000000A8001F
2
0x0F
2

EmpName
DesigNation
DOJ
Joydeep Das
A
2012-09-08 00:00:00.000
Deepasree Das
B
2013-05-01 00:00:00.000

Here _$operation = 2 means data is Inserted. Look at the _$start_lsn = 0x00000086000000A80020 means both the records in same transaction.
Update Records
UPDATE [dbo].[tbl_EmployeeDetails]
    SET EmpName = 'Shipra Das'
WHERE EmpID = 1;

SELECT * FROM [dbo].[tbl_EmployeeDetails];
SELECT * FROM [cdc].[dbo_tbl_EmployeeDetails_CT];

EmpID
EmpName
DesigNation
DOJ
1
Shipra Das
A
2012-09-08 00:00:00.000
2
Deepasree Das
B
2013-05-01 00:00:00.000
                                   
__$start_lsn
__$end_lsn
__$seqval
__$operation
__$update_mask
EmpID
0x00000086000000A80020
NULL
0x00000086000000A8001D
2
0x0F
1
0x00000086000000A80020
NULL
0x00000086000000A8001F
2
0x0F
2
0x00000088000001980003
NULL
0x00000088000001980002
3
0x02
1
0x00000088000001980003
NULL
0x00000088000001980002
4
0x02
2

EmpName
DesigNation
DOJ
Joydeep Das
A
2012-09-08 00:00:00.000
Deepasree Das
B
2013-05-01 00:00:00.000
Joydeep Das
A
2012-09-08 00:00:00.000
Shipra Das
A
2012-09-08 00:00:00.000

After Update operation there 2 new records inserted _$start_lsn = 0x00000088000001980003 which have _$operation = 3 which is Value Before Update And _$operation = 4 which is Value After Update

Delete Records
DELETE [dbo].[tbl_EmployeeDetails]
WHERE  EmpID = 1;
                                                           
SELECT * FROM [dbo].[tbl_EmployeeDetails];
SELECT * FROM [cdc].[dbo_tbl_EmployeeDetails_CT];

EmpID
EmpName
DesigNation
DOJ
2
Deepasree Das
B
2013-05-01 00:00:00.000

__$start_lsn
__$end_lsn
__$seqval
__$operation
__$update_mask
EmpID
0x00000086000000A80020
NULL
0x00000086000000A8001D
2
0x0F
1
0x00000086000000A80020
NULL
0x00000086000000A8001F
2
0x0F
2
0x00000088000001980003
NULL
0x00000088000001980002
3
0x02
1
0x00000088000001980003
NULL
0x00000088000001980002
4
0x02
2
0x00000097000001680005
NULL
0x00000097000001680002
1
0x0F
1

EmpName
DesigNation
DOJ
Joydeep Das
A
2012-09-08 00:00:00.000
Deepasree Das
B
2013-05-01 00:00:00.000
Joydeep Das
A
2012-09-08 00:00:00.000
Shipra Das
A
2012-09-08 00:00:00.000
Shipra Das
A
2012-09-08 00:00:00.000

For delete operation a new row is added where _$start_lns = 0x00000097000001680005 and _$operation = 1 which means Deleted Records.

In my next article I am going demonstrate the using of CDC in SSIS component.
Hope you like it.





Posted by: MR. JOYDEEP DAS

2 comments:

  1. Hello Sir grt Post.. I have query regarding the performance related issue.Is it going to effect the performance of the application.?

    ReplyDelete