SQL Windows Functions might be non-deterministic

One of the scariest things that can happen when you develop SQL code is getting indeterminate results. When you run your query multiple times and each time gets a different value.

How can that happen?

SQL functions or queries can be categorized into two groups – deterministic vs. non-deterministic.

Deterministic functions

Deterministic functions always return the same data given the same input. For example, if my table contains 2 rows, selecting count(*) from the table will return 2 no matter how many times I run it.

non-deterministic functions

Non-deterministic means that each time the query or function runs it may return a different value. Getdate() is such a function, it returns the date and time to the parts of a millisecond, and each run might return a different value. Another example is when you use TOP 10 (in SQL server, in Databricks, and other SQL dialects it’s LIMIT 10) SQL will return 10 rows, but it doesn’t guarantee the these rows will always be the same.

Please note that a non-deterministic function might return the same data. But it might not, and we as developers, have to consider that.

What are Windows Functions?

Windows functions calculate a value for each row based on multiple rows. For example, ROW_NUMBER() is a function that calculates the row number based on the location of the row in the query, You need to specify the order column(s), and optionally the partition column(s) (if you want to start counting again at each partition).

But what happens when we specify an order column in a window function, but there are many rows with the same value for that column? Since order is not guaranteed (in other words, it’s non-deterministic) the value of the window function result might change between runs.

Example (this example is on SQL server, but should work similarly on all SQL variants):

Let’s create a table and fill it with a lot of rows (I’ll explain in the end why many rows are required):

DROP TABLE IF EXISTS
	dbo.Transactions;
GO


CREATE TABLE
	dbo.Transactions
(
	TranId		INT		NOT NULL ,
	TranDate	DATE		NOT NULL ,
	Amount		DECIMAL(9,2)	NOT NULL ,

	CONSTRAINT
		pk_Transactions
	PRIMARY KEY CLUSTERED (TranId ASC)
);
GO


INSERT INTO	
	dbo.Transactions
(
	TranId ,
	TranDate ,
	Amount
)
SELECT
	TranId	 = value ,
	TranDate = DATEADD (DAY ,  - ABS (CHECKSUM (NEWID ())) % (5 * 365) , SYSDATETIME ()) ,
	Amount	 = ABS (CHECKSUM (NEWID ())) % 1000
FROM
	GENERATE_SERIES (1 , 1000000 , 1);
GO

Some explanations:

  • GENERATE_SERIES – (new function in SQL server 2022) creates a table with the number of rows based on the parameters it excepts. in this example, a table with 1 million rows, starting with 1 and advancing by 1 on each row.
  • TranDate – this piece of code will assign a random date from the last 5 years
  • Amount – a random amount

The next step would be to read from this table and use a window function to calculate a running_total based on the amount column, order by the TranDate (which has multiple duplicate values). Then I want to ask how many rows have a value larger than 5 million

with cte as (
SELECT
	TranId ,
	TranDate ,
	Amount ,
	SUM (Amount) OVER (ORDER BY TranDate ASC) AS RunningTotal
FROM
	dbo.Transactions )
select 
	count(*) as count
from cte
where RunningTotal > 5000000

I ran this command 3 times, and here are the results.

Try 1

Try 2

Try 3

As you can see the results are different. Why?

RunningTotal is calculated based on the order of TranDate. Since TranDate isn’t unique the order of the rows is not constant, and so the value of RunningTotal might change based on the order of that run. That means that the number of rows above our threshold can change too.

How do we prevent that?

To have a deterministic result, we need to make sure the order of the window function is constant, so we need a tie-breaking value.

for example:

with cte as (
SELECT
	TranId ,
	TranDate ,
	Amount ,
	SUM (Amount) OVER (ORDER BY TranDate ASC , TranId ASC) AS RunningTotal
FROM
	dbo.Transactions )
select 
	count(*) as count
from cte
where RunningTotal > 5000000

In this version of the code, since we added the unique column TranID to the order by clause of the window function, the order is always the same, no matter how many times we run this query, and so the count of rows with RunningTotal larger than X is always the same.

Important note:

The behavior of order by can be influenced by many things, like the number of rows involved, how many CPUs are available for the query engine, and more. When I tried the above code with just a few rows, it always produced the same order and the same number of rows from count(*). So please note that it doesn’t always produce problematic behavior. But still, we need to ensure that under no circumstance our queries might produce changing results.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *