In this post we are going to explain how we can merge the content of two or more historical tables in a Sql Server database with some T-SQL queries. I’ll try to be more clear:
- with “historical table” I mean a table where each row has validity period defined by two dates
- by “merging two historical tables” I mean obtain from the two tables a single table that reports both the historicized values and the correct validity periods
Create a basic data structure
Let’s start by creating some tables that we will fill with some fake data, just to have something to work on and check our queries.
Suppose we have a data structure like the following:
DROP TABLE IF EXISTS HistEmpDept
DROP TABLE IF EXISTS HistEmpCompany
DROP TABLE IF EXISTS Employees
GO
CREATE TABLE Employees
(
ID INT PRIMARY KEY CLUSTERED IDENTITY(1,1),
LastName NVARCHAR(50) NOT NULL,
FirstName NVARCHAR(50) NOT NULL,
HireDate DATE NOT NULL,
EndDate DATE
)
CREATE TABLE HistEmpDept
(
ID INT PRIMARY KEY CLUSTERED IDENTITY(1,1),
EmployeeID INT NOT NULL,
DeptName NVARCHAR(50) NOT NULL,
FromDate DATE NOT NULL,
ToDate DATE
)
CREATE INDEX HistEmpDeptIdxEmployeeID ON HistEmpDept(EmployeeID)
CREATE TABLE HistEmpCompany
(
ID INT PRIMARY KEY CLUSTERED IDENTITY(1,1),
EmployeeID INT NOT NULL,
CompanyName NVARCHAR(50) NOT NULL,
FromDate DATE NOT NULL,
ToDate DATE
)
GO
For simplicity, I have not included the creation of foreign keys and indexes but, clearly, it is advisable to use them in production software. Let’s focus on our data structure, where we have three tables:
- Table “Employees”: is the main table that contains the employees of our hypothetical company, each with a validity period defined by the date of hiring and the date of termination of the employment relationship
- Table “HistEmpDept”: is a one-to-many relationship table with table “Employees”; contains the list of departments that the employee has changed during his life in the company
- Table “HistEmpCompany”: is a one-to-many relationship table with table “Employees”; contains the list of companies that the employee has changed during his life in the company
In all three tables above, a NULL ToDate (or EndDate for table Employees), means “infinite”, meaning the value began some time in the past and is still valid today.
Now let’s fill the tables with some test data:
INSERT INTO Employees (LastName, FirstName, HireDate, EndDate) VALUES('Davolio', 'Nancy', '2001-03-31', NULL)
INSERT INTO Employees (LastName, FirstName, HireDate, EndDate) VALUES('King', 'Robert', '2016-05-09', '2023-12-31')
INSERT INTO Employees (LastName, FirstName, HireDate, EndDate) VALUES('Fuller', 'Andrew', '2005-02-01', NULL)
GO
DECLARE @EmpID INT = (SELECT ID FROM Employees WHERE LastName = 'Davolio')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT', '2001-03-31', '2010-07-02')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT2', '2010-07-03', '2015-12-31')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT3', '2017-01-01', NULL)
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany1', '2001-03-31', '2005-04-30')
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany2', '2006-05-01', '2015-08-31')
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany3', '2015-09-01', '2018-06-30')
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany4', '2018-07-01', NULL)
GO
DECLARE @EmpID INT = (SELECT ID FROM Employees WHERE LastName = 'King')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT', '2016-05-09', '2018-03-31')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT2', '2022-10-01', '2022-12-31')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'IT3', '2023-01-01', '2023-12-31')
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany1', '2006-12-01', '2014-12-31')
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany2', '2016-05-09', NULL)
GO
DECLARE @EmpID INT = (SELECT ID FROM Employees WHERE LastName = 'Fuller')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'HR', '2005-02-01', '2014-07-31')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'HR2', '2015-08-01', '2020-12-31')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'HR3', '2021-01-31', '2023-06-30')
INSERT INTO HistEmpDept(EmployeeID, DeptName, FromDate, ToDate) VALUES(@EmpID, 'HR3', '2023-07-01', NULL)
INSERT INTO HistEmpCompany(EmployeeID, CompanyName, FromDate, ToDate) VALUES(@EmpID, 'SubCompany1', '2001-03-31', NULL)
GO
Execute the following queries just to check the content of the tables.
SELECT * FROM Employees ORDER BY ID
SELECT * FROM HistEmpDept ORDER BY EmployeeID, FromDate
SELECT * FROM HistEmpCompany ORDER BY EmployeeID, FromDate
Note that I also inserted gaps in the test data, that will be useful for checking the behavior of the queries we will write. What we want to obtain is something similar the following result:
As you can see, there is a new row every time the employee changes the company or the department. Each row has validity period calculated from historical tables.
Now we are going to write the queries needed to get this result.
A solution…
Let’s see together a solution to this problem.
Before moving on, I will create in the database the following two scalar functions:
CREATE FUNCTION [dbo].[MinDate]
(
@DATE1 DATE,
@DATE2 DATE
)
RETURNS DATE
AS
BEGIN
RETURN CASE WHEN @DATE1 < @DATE2 THEN @DATE1 ELSE @DATE2 END
END
GO
CREATE FUNCTION [dbo].[MaxDate]
(
@DATE1 DATE,
@DATE2 DATE
)
RETURNS DATE
AS
BEGIN
RETURN CASE WHEN @DATE1 > @DATE2 THEN @DATE1 ELSE @DATE2 END
END
GO
Given two dates, the two functions above return the smallest and largest date, respectively. We will use them in the queries we are going to write.
To merge the content of two historical tables we can use an INNER JOIN between the two tables, imposing as a join condition that the validity periods of the two records overlap. So we can write the following query:
DECLARE @MAX_DATE DATE = '99991231'
SELECT E.ID, E.LastName, E.FirstName,
HD.DeptName,
dbo.MaxDate(E.HireDate, HD.FromDate) AS FromDate,
NULLIF(dbo.MinDate(ISNULL(E.EndDate, @MAX_DATE), ISNULL(HD.ToDate, @MAX_DATE)), @MAX_DATE) AS ToDate
FROM Employees E
INNER JOIN HistEmpDept HD
ON E.ID = HD.EmployeeID
AND (E.HireDate BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR ISNULL(E.EndDate, @MAX_DATE) BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR HD.FromDate BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
OR ISNULL(HD.ToDate, @MAX_DATE) BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
)
Let’s look at the join condition:
...
ON
E.ID = HD.EmployeeID
AND
(E.HireDate BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR ISNULL(E.EndDate, @MAX_DATE) BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR HD.FromDate BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
OR ISNULL(HD.ToDate, @MAX_DATE) BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
)
...
Basically we are saying to join the rows that belong to the same employee (this is obvious!) and in which the validity period of the two records overlaps even only partially.
Regarding the SELECT part of the query, we have calculated FromDate and ToDate as follow:
...
dbo.MaxDate(E.HireDate, HD.FromDate) AS FromDate,
NULLIF(dbo.MinDate(ISNULL(E.EndDate, @MAX_DATE), ISNULL(HD.ToDate, @MAX_DATE)), @MAX_DATE) AS ToDate
...
Why this? I will try to explain with the picture below:
We have two tables, Table 1 and Table2, each containing some data. Based on the figure above, this is the content of table 1 (I won’t bore you with the content of table 2…):
Value | FromDate | ToDate |
V1 | Date 0 | (Date 1) – 1 |
V2 | Date 1 | (Date 4) – 1 |
V3 | Date 4 | (Date 5) – 1 |
and so on…. |
Look for example at V6: its validity period overlaps the one of V1 and V2, so we have the following pairs:
- V1, V6:
- From date: Date 0
- To date: Date 1
- V2, V6:
- From date: Date 1
- To Date: Date 2
So are valid the following rules:
- From date = the greater of the start date of the two records
- To date = the smaller of the end date of the two records
You can apply the same reasoning to all pairs of records generated by the INNER JOIN.
If you execute the query above, you should see the following result:
In our example we have to merge three historical tables. So we need an additional step to complete the job. This is the complete SQL script:
DECLARE @MAX_DATE DATE = '99991231'
DROP TABLE IF EXISTS #TEMP_STEP1
SELECT E.ID, E.LastName, E.FirstName,
HD.DeptName,
dbo.MaxDate(E.HireDate, HD.FromDate) AS FromDate,
NULLIF(dbo.MinDate(ISNULL(E.EndDate, @MAX_DATE), ISNULL(HD.ToDate, @MAX_DATE)), @MAX_DATE) AS ToDate
INTO #TEMP_STEP1
FROM Employees E
INNER JOIN HistEmpDept HD ON E.ID = HD.EmployeeID
AND (E.HireDate BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR ISNULL(E.EndDate, @MAX_DATE) BETWEEN HD.FromDate AND ISNULL(HD.ToDate, @MAX_DATE)
OR HD.FromDate BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
OR ISNULL(HD.ToDate, @MAX_DATE) BETWEEN E.HireDate AND ISNULL(E.EndDate, @MAX_DATE)
)
SELECT S1.ID, S1.LastName, S1.FirstName,
HC.CompanyName, S1.DeptName,
dbo.MaxDate(S1.FromDate, HC.FromDate) AS FromDate,
NULLIF(dbo.MinDate(ISNULL(S1.ToDate, @MAX_DATE), ISNULL(HC.ToDate, @MAX_DATE)), @MAX_DATE) AS ToDate
FROM #TEMP_STEP1 S1
INNER JOIN HistEmpCompany HC ON S1.ID = HC.EmployeeID
AND (S1.FromDate BETWEEN HC.FromDate AND ISNULL(HC.ToDate, @MAX_DATE)
OR ISNULL(S1.ToDate, @MAX_DATE) BETWEEN HC.FromDate AND ISNULL(HC.ToDate, @MAX_DATE)
OR HC.FromDate BETWEEN S1.FromDate AND ISNULL(S1.ToDate, @MAX_DATE)
OR ISNULL(HC.ToDate, @MAX_DATE) BETWEEN S1.FromDate AND ISNULL(S1.ToDate, @MAX_DATE)
)
ORDER BY S1.ID, S1.FromDate
In short, we merged the first two tables and inserted the result into a temporary table. Then we merged the temporary table with our third table using a similar query. Here you can see the result of the final query:
In conclusion
In this post we saw how to merge two or more historical tables. You may have noticed that there were gaps in the validity periods in the source data and these gaps have been reported in the final result. This is the case, for example, of Nancy Davolio at row number 4 and number 5.
I would also like to point out another detail. Try to execute the following update:
UPDATE HistEmpCompany SET ToDate = '20181231' WHERE EmployeeID = 1 AND FromDate = '20180701'
Then execute the full SQL script again. Here you can see the result:
As you can see, Nancy Davolio’s row number 6 has been closed with date 2018-12-31. But, if you look at her records in the tables Employees and HistEmpDept, they have NULL in the field ToDate, meaning that she is still an active employee with an active department. This situation could be compared to a gap, since from date 2019-01-01 the employee has not been assigned to any company. Is this correct? It depends on what you need… In the next post we will see a different way to manage the presence of gaps in the validity periods.
Stay tuned!