A Procedure for generating Data Warehouse Update code
Background
Transactional tables exist in Database A. These tables have partner archive tables in the same or another database. For each insert or change of a value in the transactional database a timestamped new row is entered into the counterpart history table. Standard archiving, similar to the temporal archiving offered in SQL Server 2016.
The difficulty is that writing the code to populate the archive table (or a Type 2 Slowly Changing Dimension data warehouse table) is complex and error prone. Particularly since a Merge statement cannot on its own achieve this, so a temporary table has to populated from the Merge Output clause, and so on.
Code Generator
The procedure UTIL_GenerateArchiveProcedureDDL will generate the code required to populate an archive table. The procedure is in the file attached at the end of this article.
It can generate code for a single specified table or all tables in a schema or the database.
The source table(s) must have an explicit primary key.
The archive table(s) must
- have same name as the source tables (TODO add option to add _Archive to the name)
- have temporal date columns named StartDate and EndDate (TODO allow other naming schemes to be used)
- have the schema as the source tables with the addition of two date columns StartDate (not null) and EndDate (nullable).
The generator and generated code requires this Error reporting procedure.
Examples
Usage and description.
Standard Table (single Primary Key)
-- Typical table with a single primary key. CREATE TABLE dbo.Products( ProductId int NOT NULL IDENTITY PRIMARY KEY, ProductName varchar(30) NOT NULL, AvailableFromDate smalldatetime NOT NULL, IntroducerId int NULL);
Run this with results to grid (it may be necessary to set the SSMS option to display carriage returns in the output grid view).
-- Generate Archive procedure for this table only by specifying table name -- (if there is a similarly named table in another schema, provide @SourceSchema too) EXEC UTIL_GenerateArchiveProcedureDDL @WarehouseDatabaseName = Archive, @SourceTable = 'Products';
This is the generated DDL.
- The main action occurs in the MERGE statement. This joins all rows in the LIVE table with all current rows, i.e. rows where the EndDate is Null, in the Warehouse table (selected by CTE TargetCTE).
- The Live and Warehouse tables are joined on all columns of the Primary Key (this example has only one).
- If there is a match it means the same record (“same” here means having the same Primary Key value) exists in live and warehouse databases.
- The procedure then compares the sets of non-PK columns in each table using the EXCEPT set operator. If a row is returned, there is a difference in one of the non-PK columns.
- The THEN clause UPDATES the EndDate column of the warehouse table to obsolete the row in the warehouse table.
- Note that at this stage no new row can be inserted into the warehouse to represent the new value; that is handled after the MERGE statement.
- The remaining two MERGE clauses are simpler
- Not matched by SOURCE: the record no longer exists in live so the warehouse row is obsoleted.
- Not matched by TARGET: the record doesn’t exist in warehouse, so a row is inserted.
- All the SOURCE columns are OUTPUT-ed to a temporary table, together with the MERGE action key.
- We still need to insert rows for records that have changed value since the warehouse was last updated. The Merge Output will contain all rows from the full join of the source and target tables. These can be filtered to only the records that resulted in an UPDATE operation, but even that is too wide since it includes records which have been deleted from the source; we don’t want to insert those again. Therefore we filter to insert only rows that exist in the merge Source – that is, those where the ouput-ed Source columns are not null. To test this, the code chooses the primary key (or the first PK column of a compound key) since this cannot be null if the row exists.
- The Merge and changed-row Insert statements are wrapped in a transaction.
- Row counts follow.
CREATE PROCEDURE [dbo].[UpdateDataWarehouse] /* ** DO NOT EDIT. ** Auto-generated by procedure UTIL_GenerateArchiveProcedureDDL ON Jan 22 2017 3:35AM. PURPOSE This procedure updates an archive warehouse database with the current values in the source database. The warehouse database must contain the same tables with the same columns plus temporal (StartDate, EndDate) columns. CODE GENERATOR HISTORY 18/01/2017 Andrew Ing 1.02. */ AS SET NOCOUNT, XACT_ABORT ON; BEGIN TRY DECLARE @EffectiveDate datetime; DECLARE @ObsoletedRowCount int; DECLARE @InsertedRowCount int; DECLARE @UpdatedRowCount int; DECLARE @TableActionSummary varchar(1000); DECLARE @DebugInfo varchar(2000); SET @EffectiveDate = GETDATE(); ------------------------------------------------------------- -- TABLE: [dbo].[Products] SET @DebugInfo = 'Create Temp Table [#MergeOutput_Products]'; CREATE TABLE [#MergeOutput_Products] ( MergeAction nvarchar(10) ,[ProductId] int NULL ,[ProductName] varchar(30) NULL ,[AvailableFromDate] smalldatetime NULL ,[IntroducerId] int NULL ); SET @DebugInfo = 'Merge [dbo].[Products]'; BEGIN TRANSACTION; -- Select only Current (EndDate IS NULL) warehouse rows for the comparison. WITH TargetCTE AS (SELECT * FROM Archive.[dbo].[Products] WHERE EndDate IS NULL ) MERGE TargetCTE AS [Target] USING [dbo].[Products] AS [Source] ON [Target].[ProductId] = [Source].[ProductId] WHEN MATCHED AND EXISTS ( SELECT [Target].[ProductName] ,[Target].[AvailableFromDate] ,[Target].[IntroducerId] EXCEPT SELECT [Source].[ProductName] ,[Source].[AvailableFromDate] ,[Source].[IntroducerId] ) THEN -- Record has changed. UPDATE SET [EndDate] = @EffectiveDate WHEN NOT MATCHED BY SOURCE THEN -- Record deleted from source. UPDATE SET [EndDate] = @EffectiveDate WHEN NOT MATCHED BY TARGET THEN -- New Record. INSERT ( [StartDate] ,[ProductId] ,[ProductName] ,[AvailableFromDate] ,[IntroducerId] ) VALUES ( @EffectiveDate ,[Source].[ProductId] ,[Source].[ProductName] ,[Source].[AvailableFromDate] ,[Source].[IntroducerId] ) OUTPUT $Action AS MergeAction ,[Source].[ProductId] ,[Source].[ProductName] ,[Source].[AvailableFromDate] ,[Source].[IntroducerId] INTO [#MergeOutput_Products]; -- Insert new rows for records that have changed (old rows were marked as expired by -- the first MATCHED clause. #MergeOutput_ has been populated only with Source columns -- but MERGE is a full outer join so exclude rows with a Null primary key column. SET @DebugInfo = 'Insert for changed rows [dbo].[Products]'; INSERT INTO Archive.[dbo].[Products] ( [StartDate] ,[ProductId] ,[ProductName] ,[AvailableFromDate] ,[IntroducerId] ) SELECT @EffectiveDate ,[ProductId] ,[ProductName] ,[AvailableFromDate] ,[IntroducerId] FROM [#MergeOutput_Products] WHERE MergeAction = 'UPDATE' AND [ProductId] IS NOT NULL; SELECT @UpdatedRowCount = @@ROWCOUNT; COMMIT TRANSACTION SET @DebugInfo = 'Get Row Counts [dbo].[Products]'; SELECT @ObsoletedRowCount = SUM(CASE WHEN MergeAction = 'UPDATE' THEN 1 ELSE 0 END), @InsertedRowCount = SUM(CASE WHEN MergeAction = 'INSERT' THEN 1 ELSE 0 END) FROM [#MergeOutput_Products]; SET @TableActionSummary = CONVERT(varchar(19), GETDATE(), 121) + ' [dbo].[Products] ' + CAST(ISNULL(@ObsoletedRowCount, 0) AS varchar(10)) + ' Rows Obsoleted, ' + CAST(ISNULL(@InsertedRowCount, 0) AS varchar(10)) + ' New Rows Inserted, ' + CAST(@UpdatedRowCount AS varchar(10)) + ' Update Rows Inserted.'; PRINT @TableActionSummary; D ROP TABLE [#MergeOutput_Products]; -- END OF TABLE [dbo].[Products] END TRY BEGIN CATCH EXECUTE dbo.ERROR_HandleCaughtError @DebugInfo; RETURN 5555; END CATCH /* END OF GENERATED SCRIPT */ GO
Many to Many tables
Or any table with just primary key columns.
CREATE TABLE dbo.StudentCourses( StudentId int NOT NULL, CourseId int NOT NULL, CONSTRAINT PK_StudentCourses PRIMARY KEY (StudentId, CourseId) );
If the table only has a primary key then it is not possible to have changed values in the same record. The record either exists in both live and warehouse tables (in which case no action is needed) or it exists in only one, in which case it must be inserted or obsoleted in the warehouse accordingly.
If no non-primary key columns are found, the WHEN MATCHED AND EXISTS… clause is omitted from generated code. Note in the section below that both primary key columns are used in the Merge join statement, and that the Matched And Exists clause is not generated.
... WITH TargetCTE AS (SELECT * FROM Archive.[dbo].[StudentCourses] WHERE EndDate IS NULL ) MERGE TargetCTE AS [Target] USING [dbo].[StudentCourses] AS [Source] ON [Target].[StudentId] = [Source].[StudentId] AND [Target].[CourseId] = [Source].[CourseId] WHEN NOT MATCHED BY SOURCE THEN -- Record deleted from source. UPDATE SET [EndDate] = @EffectiveDate WHEN NOT MATCHED BY TARGET THEN -- New Record. INSERT ( ...
Generating Updates for all Tables
Omitting the SourceTable and SourceSchema parameters will generate DDL for a procedure that will include all tables in the database. This can be restricted by providing SourceSchema (which will include all tables in the named schema) or SourceTable.