
Bulk upsert data to MSSQL

March 01, 2024 - Drygast
C# Database How To MS SQL

Recently, I worked on a project where we had to keep a second database in "reasonable" sync with another. However, the first database was not ours, and we only had limited read-access to it. To make matters even worse, we did not have a reliable way to select outdated or updated rows due to an "interesting" design of the first database. Setting up a replica relationship was out of the question (due to other reasons), so we had to do our best to find another way.

We were essentially restricted to polling the first database and transferring the data to a second database. However, without a reliable "updated" flag, we were compelled to retrieve the data from database one and execute calls to database two to INSERT or UPDATE (UPSERT) in chunks. It wasn't ideal, but sometimes you have to accept that things are less than optimal.

Anyway, a previous solution existed that used Azure Service Bus, but when it was discovered that the flag it depended on was unreliable, we had to change the entire setup. Also, due to changes by the owners of database one, it was not impossible that every row could be updated in an instant - meaning that we had to be able to re-sync everything as fast as possible. A complete re-sync using the original setup was calculated to take more than 2 weeks to complete, so that was out of the question.

That is where this method came to good use. The entire setup was simplified, and a small service was created (written in C#). We polled the first database using its autogenerated primary key and then dumped every row to the second database. If the service was allowed to restart immediately after handling a chunk of data, this process took less than 15 minutes for a database of a few million rows. I was pretty happy about the solution and its performance and wanted to write a short article about it.
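
Just to illustrate the idea, the polling boiled down to a keyset-style query against database one, something along these lines (a sketch only - the table and column names here are made up):

-- Illustrative sketch: read the next chunk of rows from database one,
-- ordered by its autogenerated primary key. @lastSeenId is the highest
-- key value that was handled in the previous chunk.
SELECT TOP (10000) Id, SomeColumn, SomeOtherColumn
FROM dbo.SourceTable
WHERE Id > @lastSeenId
ORDER BY Id;

Each chunk was then written to database two using the bulk UPSERT technique described below, and the highest Id in the chunk became the starting point for the next poll.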

Multiple INSERT commands

For comparison, I wanted to show the amount of time multiple consecutive INSERT commands took. I created a simple test program in C# and made it loop through a small text file containing the 10k most commonly used English words.

for (var i = 0; i < commonWords.Length; i++) // commonWords is an array of strings loaded from file
{
    var commonWord = commonWords[i];
    string commandText = $"INSERT INTO CommonWords (sortOrder, commonWord, wordLength) VALUES ({i},'{commonWord}',{commonWord.Length});";
    SqlCommand cmd = new SqlCommand(commandText, conn);
    cmd.ExecuteNonQuery();
}

Not surprisingly, this approach is not very fast. In this test, I used a local, almost empty database on a fast computer, and it still took 00:00:02.4034310 to complete the operation of inserting 10,000 rows. That might not seem like much, but it will only get worse as the database grows. It also would not suffice for our needs since it only performs INSERT operations, and if I were to run the program again (with rows already existing in the table), it would crash on duplicate primary keys.

Multiple UPSERT commands with SP

Let's create a second example to see how much time it takes to call a stored procedure instead. I created a simple stored procedure that UPDATES or INSERTS depending on if the row already exists or not. I'm using the sortOrder of the most common word as a primary key here, but this is only test data, so I'm not too worried about that.

CREATE PROCEDURE [dbo].[USP_UpsertCommonWord]
	@sortOrder INT,
	@commonWord NVARCHAR(32),
	@wordLength INT
AS
BEGIN
	IF EXISTS ( SELECT 1 FROM CommonWords WHERE sortOrder=@sortOrder) 
	BEGIN
		UPDATE CommonWords SET commonWord=@commonWord, wordLength=@wordLength WHERE sortOrder=@sortOrder;
	END
	ELSE
	BEGIN
		INSERT INTO CommonWords (sortOrder, commonWord, wordLength) VALUES (@sortOrder, @commonWord, @wordLength);
	END
END

After that, I added some code to the C# app to test it. It still iterates over each row, and that results in a bunch of separate calls to the database, which is not optimal.

for (var i = 0; i < commonWords.Length; i++)
{
    var commonWord = commonWords[i];
    string commandText = $"EXEC USP_UpsertCommonWord @sortOrder={i},@commonWord=&{commonWord}&,@wordLength={commonWord.Length};";
    SqlCommand cmd = new SqlCommand(commandText, conn);
    cmd.ExecuteNonQuery();
}

Again - I was not surprised when this approach took about the same amount of time as the previous test. I did test the SP in two different cases though - once with an empty table (only INSERTs), and once with the table already filled with data (only UPDATEs). Here are the results:

Separate UPSERT calls to SP completed in: 00:00:02.1925467 (First time with only INSERT)
Separate UPSERT calls to SP completed in: 00:00:02.7963569 (Second time with only UPDATE)

Bulk Upsert with MERGE

OK, so we have determined that iterating through a long list is not the best way to do bulk upserts. It takes time and resources that could be used better.

Instead, what if we just sent the entire list into the database and let it do the UPSERT? For that to work, we first need a "User-Defined Table Type". Creating a type like this allows us to pass a whole dataset as an input parameter when executing a stored procedure.

CREATE TYPE CommonWordTableType AS TABLE 
(
	sortOrder int NOT NULL,
	commonWord nvarchar(32) NULL,
	wordLength int NULL,
	PRIMARY KEY (sortOrder)
)

With the new table type created, we can use it as an input parameter for our stored procedure. In that stored procedure, we are going to use the MERGE statement to UPSERT the data.

CREATE PROCEDURE [dbo].[USP_BulkUpsertCommonWords]
	@commonWords dbo.CommonWordTableType READONLY
AS
BEGIN
	MERGE INTO CommonWords AS Target
	USING @commonWords AS Source
	ON Target.sortOrder = Source.sortOrder
	WHEN MATCHED THEN
		UPDATE SET
			Target.commonWord = Source.commonWord,
			Target.wordLength = Source.wordLength
	WHEN NOT MATCHED THEN
		INSERT (sortOrder, commonWord, wordLength)
		VALUES (Source.sortOrder, Source.commonWord, Source.wordLength);
END

The MERGE command can be expanded a lot more, but I'll keep it to the basics in this article. Depending on your needs, a few small additions might prove important.
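
As one example of such an addition, a WHEN NOT MATCHED BY SOURCE branch can be used to delete target rows that are missing from the incoming data, effectively turning the call into a full table sync. A rough sketch (not part of the procedure above, and only useful when the incoming data really is the complete set of rows):

-- Sketch only: same MERGE as above, with one extra branch that deletes rows
-- from the target that are not present in the incoming table-valued parameter.
MERGE INTO CommonWords AS Target
USING @commonWords AS Source
ON Target.sortOrder = Source.sortOrder
WHEN MATCHED THEN
	UPDATE SET
		Target.commonWord = Source.commonWord,
		Target.wordLength = Source.wordLength
WHEN NOT MATCHED BY TARGET THEN
	INSERT (sortOrder, commonWord, wordLength)
	VALUES (Source.sortOrder, Source.commonWord, Source.wordLength)
WHEN NOT MATCHED BY SOURCE THEN
	DELETE;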

Anyway, with the stored procedure created and ready to be used, we turn to the C# code. In order to pass the data to the server, we have to create a DataTable. I created a small helper function to do just that:

private static DataTable CreateUpsertParam(string[] commonWords)
{
    DataTable table = new DataTable();
    table.Columns.Add("sortOrder", typeof(Int32));
    table.Columns.Add("commonWord", typeof(string));
    table.Columns.Add("wordLength", typeof(Int32));

    for (var i = 0; i < commonWords.Length; i++)
    {
        var commonWord = commonWords[i];
        table.Rows.Add(i, commonWord, commonWord.Length);
    }

    return table;
}

And finally, call the new stored procedure with the DataTable as parameter:

using (SqlCommand cmd = conn.CreateCommand())
{
    cmd.CommandText = "USP_BulkUpsertCommonWords";
    cmd.CommandType = CommandType.StoredProcedure;
    SqlParameter parameter = cmd.Parameters.AddWithValue("@commonWords", CreateUpsertParam(commonWords));
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "CommonWordTableType";
    cmd.ExecuteNonQuery();
}

With all that done, it only took a fraction of the time to UPSERT the data. And yes, I also included the time it took to create the DataTable in the times below:

Using SP with MERGE completed in: 00:00:00.1315941 (First time with only INSERT)
Using SP with MERGE completed in: 00:00:00.1206575 (Second time with only UPDATE)

Now that is a massive amount of time saved! During my limited tests I could see that this way of doing things was roughly 20 times faster. And again - even if this is only a small and uncomplicated sample (that completes pretty quickly even with the slower methods), the difference will get very noticeable when dealing with larger databases.

A few more things to consider:

  • In our case we found that adding OPTION (LOOP JOIN); to the end of the MERGE statement increased performance. It depends on how the database is structured and on the data provided to the SP, but it could be something to look into to possibly improve performance even more.
  • MERGE supports an OUTPUT clause, which makes it possible to separate the rows that were inserted from the rows that were updated for further processing (such as logging the event).
  • It's possible to further narrow WHEN MATCHED THEN with conditions on other columns, such as WHEN MATCHED AND (IsNull(Target.commonWord, '') != IsNull(Source.commonWord, '')) THEN, if needed. Basically, this skips the UPDATE when we have a match but the columns have not changed (and therefore the UPDATE is not needed), which avoids unnecessary operations. A rough sketch combining these additions follows below.
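
As a rough sketch (illustrative only, using the same tables as above), the MERGE in the stored procedure could look something like this with all three additions applied:

-- Sketch combining the additions above: a change filter on WHEN MATCHED,
-- an OUTPUT clause, and the OPTION (LOOP JOIN) hint.
MERGE INTO CommonWords AS Target
USING @commonWords AS Source
ON Target.sortOrder = Source.sortOrder
WHEN MATCHED AND (IsNull(Target.commonWord, '') != IsNull(Source.commonWord, '')) THEN
	UPDATE SET
		Target.commonWord = Source.commonWord,
		Target.wordLength = Source.wordLength
WHEN NOT MATCHED THEN
	INSERT (sortOrder, commonWord, wordLength)
	VALUES (Source.sortOrder, Source.commonWord, Source.wordLength)
-- $action is 'INSERT' or 'UPDATE' for each affected row.
OUTPUT $action, inserted.sortOrder, inserted.commonWord
OPTION (LOOP JOIN);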

There is probably a lot more to discuss about all of this, but for what we needed at the time, this was plenty. I liked that things were structured, fast, and customizable compared to the alternatives we had. One downside we discovered was that expanding the User-Defined Table Type was not entirely painless. The type could not be updated as long as it was in use, so we had to remove it from the stored procedure that used it, modify the type, and then add it back to the stored procedure. Not really that much of an issue, but a small annoyance nonetheless.
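
For reference, the workaround looked roughly like this (a sketch of one way to do it - here by simply dropping and recreating the dependent stored procedure, and with a made-up extra column as the change to the type):

-- Sketch only: the type cannot be dropped or changed while a stored procedure
-- references it, so the dependent procedure has to go first.
DROP PROCEDURE [dbo].[USP_BulkUpsertCommonWords];
DROP TYPE dbo.CommonWordTableType;

CREATE TYPE CommonWordTableType AS TABLE
(
	sortOrder int NOT NULL,
	commonWord nvarchar(32) NULL,
	wordLength int NULL,
	vowelCount int NULL, -- made-up example of a new column
	PRIMARY KEY (sortOrder)
)

-- ...and then re-run the CREATE PROCEDURE script from earlier, adjusted for the new column.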