PostgreSQL and SQL Server - Divided by a common language


My database career took off when I was sent on two SQL Server 6.5 training courses.  As a developer and later as a DBA I stayed with SQL Server until 2014.

Since then I have been exposed to a variety of database platforms and technologies.  I have reached the conclusion that, for me, it is hardest to retain new knowledge when the technology is familiar.  It is not the differences that trip me up, it is the similarities.  I should like to show you what I mean by giving you some details of a project I worked on where the chosen database was an AWS RDS instance of PostgreSQL.

Background for the project

The project was to present a website visitor with a view of products based on a data science model of their propensity to buy.  Those products that the model showed they were likely to buy were emphasised in some way.  That could affect their position on the screen or the size of the tile on which they were presented.  The goal was to make sure the website visitor didn't miss the product they were looking for.  Like I miss things in the house that turn out to be right where I was told they were.

From a technical perspective, what we knew was as follows:

  • The data science model represented approximately 100 million records and rising
  • The model had a high degree of churn, so a refresh could represent up to a 60% change.  These changes would be either inserts or updates.
  • The model was produced by a regular batch process
  • The model was represented as a simple tabular structure.
  • The personalisation system, like the website, would have to be available 24/7/365

Quite early on we had worked out that we had to control the size of the model.  The cost of producing, ingesting and processing a data set of 100+ million records, and growing, was going to be prohibitive.  We also found that we would need to split the model file into a set of smaller files to prevent a constipated ingestion pipe.  We did look to see whether Amazon's DynamoDB would fit the bill (pun intended), but the characteristics described above ruled it out quite quickly.

Solution design

The diagram below shows the design for these requirements.

Loading propensity models without interrupting website activity

As model files became available, their fully qualified pathname would be stored in the ModelFileHopper table.  Everything else would be handled by a batch process.

At the start of the batch process, the first check would be to see if there were any files still remaining in the ModelWIP table.  We abort the process if this is the case.  This allows for the overall process running longer than a repeating schedule would allow.  If the ModelWIP table were empty then every data model file name would be moved from the ModelFileHopper to the ModelWIP table.
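Assuming both tables hold the fully qualified pathname in a column (called file_path here purely for illustration), the move from ModelFileHopper to ModelWIP can be sketched as a single data-modifying CTE in PostgreSQL:

```sql
-- Delete from the hopper and insert into the WIP table in one statement,
-- so a failure can never leave a file name in both tables or in neither
WITH moved AS (
    DELETE FROM Propensity.ModelFileHopper
    RETURNING file_path
)
INSERT INTO Propensity.ModelWIP (file_path)
SELECT file_path FROM moved;
```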

The ModelWIP table contains a list of files we wish to bulk insert into the Model_Load table.  Regardless of whether there is 1 or 100 file names listed in ModelWIP we bulk load them all into Model_Load.

Having populated Model_Load, we first apply any of its records that represent updates to existing Model_Stage records.  We do the updates before inserting new records because we do not want the update to evaluate records we already know to be new.  Once both the update and the insert have completed, Model_Stage represents what we wish Model_Live to be.  We do the equivalent of sp_rename to swap Model_Stage with Model_Live.

Model_Live is now the most up-to-date model and Model_Stage is stale so we run the Update/Insert again to make sure they are the same.  Only after this has completed successfully do we truncate ModelWIP so the process is ready to receive new files.
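PostgreSQL has no sp_rename; the equivalent swap uses ALTER TABLE ... RENAME.  A minimal sketch, assuming both tables sit in a Propensity schema:

```sql
-- DDL is transactional in PostgreSQL, so readers of Model_Live see either
-- the old table or the new one, never a missing table
BEGIN;
ALTER TABLE Propensity.Model_Live  RENAME TO Model_Swap;
ALTER TABLE Propensity.Model_Stage RENAME TO Model_Live;
ALTER TABLE Propensity.Model_Swap  RENAME TO Model_Stage;
COMMIT;
```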

Security considerations

The service that reads Model_Live does not need access to any of the preceding tables.  Only the batch process needs the rights to insert, update, truncate and rename tables.

Permissions should be granted to a role and not an individual account.  If ever an account is compromised then we can change the account and assign it to a role without worrying about migrating permissions.
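As a sketch in PostgreSQL terms, with hypothetical role and account names:

```sql
-- Reader role: the website-facing service only ever needs Model_Live
CREATE ROLE model_reader;
GRANT USAGE ON SCHEMA Propensity TO model_reader;
GRANT SELECT ON Propensity.Model_Live TO model_reader;

-- The service logs in with its own account, which is simply a member of the role
CREATE ROLE svc_personalisation LOGIN PASSWORD 'change_me';
GRANT model_reader TO svc_personalisation;

-- If svc_personalisation is ever compromised, drop it and grant the role
-- to a replacement account; no object-level permissions need migrating
```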

Load considerations

Experience with SQL Server suggests that we probably don't want to run a massive update and a massive insert to bring Model_Stage up to date.  We would want to break these actions up into smaller chunks.  Through experimentation we found that the best batch size was 250,000.

Ease of development considerations

The process fits an ELT pattern.  With that in mind, the development team asked whether the various steps could be represented as stored procedures, ideally marshalled by a single stored procedure calling all the others.  From their perspective they wanted to put the names of the files to be processed into the ModelFileHopper table and let a scheduled batch process take it from there.

Where the similarities between SQL Server and PostgreSQL tripped us up

These are a few areas that were similar between the SQL Server and PostgreSQL platforms.

Database roles

Both SQL Server and PostgreSQL have the concept of user-defined roles.  You grant permissions on objects to these roles in much the same way in both products.  However, we missed a crucial difference in the way the two treat database roles.

We created the equivalent of a db_datareader role in PostgreSQL for a data science database application.  We then discovered we were blocked from tearing down and rebuilding the database because our role was in use in another database.

  • In SQL Server a database role exists within the database
  • In PostgreSQL a database role is at the PostgreSQL instance level
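A quick way to see this difference, with a throwaway role name (the error text shown is representative):

```sql
-- Roles belong to the PostgreSQL cluster, not to a single database
CREATE ROLE data_reader;   -- visible from every database on the instance

-- Granting the role privileges in another database creates a dependency
-- there, so this fails until those grants are removed:
DROP ROLE data_reader;
-- ERROR:  role "data_reader" cannot be dropped because some objects depend on it

-- The fix is to run this in each database where the role has privileges...
DROP OWNED BY data_reader;
-- ...after which the role can finally be dropped
DROP ROLE data_reader;
```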

Stored Procedure permissions

Both SQL Server and PostgreSQL have stored procedures.  These allow us to encapsulate complex set-based operations as a securable unit of code.

In SQL Server you don't need to grant the caller permissions on the underlying tables or views.  We can demonstrate this by setting up an experiment with a low-privilege user.

USE master
GO
-- Create a login for our experiment
CREATE LOGIN Binky WITH PASSWORD='Binky',
DEFAULT_DATABASE = Master,
CHECK_POLICY = OFF,
CHECK_EXPIRATION = OFF ;
/*
Switch to a copy of AdventureWorks 2014
*/
USE AW2014
GO
CREATE ROLE proc_execute  authorization dbo;
-- Create our low-privilege user, Binky, and make them a member of our proc_execute role
CREATE USER Binky FOR LOGIN Binky;
EXECUTE sp_addrolemember 'proc_execute', 'Binky';
GO
-- Create a simple procedure that we want Binky to execute
CREATE PROCEDURE Person.ListContactType
AS
SET NOCOUNT ON
SELECT ContactTypeId,Name 
FROM Person.ContactType
GO
-- As Binky is a member of our proc_execute role they will gain the ability 
-- to execute our Person.ListContactType procedure
GRANT EXECUTE ON Person.ListContactType TO proc_execute;

Binky can run Person.ListContactType, but if they try to perform the SELECT directly they get an error message:

Msg 229, Level 14, State 5, Line 24
The SELECT permission was denied on the object 'ContactType', database 'AW2014', schema 'Person'.

In PostgreSQL this is not the case.  By default a stored procedure runs with the permissions of the caller.  You have two options to resolve this.

  • Grant permissions to the underlying tables/views.  This negates a major security advantage of stored procedures
  • Explicitly create the stored procedure with the SECURITY DEFINER clause.  The stored procedure executes with the permissions of the creator/owner of the procedure.

It isn't much work to specify SECURITY DEFINER stored procedures (and functions); however, you do need to follow certain practices to make sure you don't accidentally create a security hole.  The article Abusing SECURITY DEFINER functions explains the problem and the steps you can take to protect yourself.
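As a sketch of the pattern that article recommends (the function and table names here are illustrative): pin search_path in the definition so a caller cannot hijack the definer's privileges through a schema they control, and grant EXECUTE rather than table access.

```sql
CREATE OR REPLACE FUNCTION Propensity.list_contact_type()
RETURNS TABLE (contact_type_id integer, name text)
LANGUAGE sql
SECURITY DEFINER
-- Pinning search_path stops a malicious caller substituting their own objects
SET search_path = Propensity, pg_temp
AS $$
    SELECT contact_type_id, name FROM Propensity.contact_type;
$$;

-- Callers need EXECUTE only; they never touch the underlying table
REVOKE ALL ON FUNCTION Propensity.list_contact_type() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION Propensity.list_contact_type() TO proc_execute;
```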

Default schemas and objects

We qualify objects with their schema in SQL Server.  You should continue to do this in PostgreSQL, too.

The Abusing SECURITY DEFINER functions article also describes the SET search_path command.  With it you can tell PostgreSQL the precedence in which database schemas should be searched for a given object when no schema is specified, and you can list multiple schemas in the command.  You can specify this at four levels:

  • Database level
  • Role level
  • User level
  • In stored procedures and functions
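For illustration, with hypothetical database, role, and user names, those four levels look like this:

```sql
-- Database level: applies to new sessions in this database
ALTER DATABASE personalisation SET search_path = Propensity, public;

-- Role level: applies to sessions that log in as this role
ALTER ROLE model_reader SET search_path = Propensity, public;

-- User level (a user is just a login role), optionally per database
ALTER ROLE binky IN DATABASE personalisation SET search_path = Propensity, public;

-- Stored procedure/function level: the SET clause on the definition
CREATE OR REPLACE FUNCTION Propensity.demo() RETURNS integer
LANGUAGE sql
SET search_path = Propensity, pg_temp
AS $$ SELECT 1; $$;
```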

You cannot alter SQL Server's precedence.  When SQL Server receives a call to an unqualified object it follows a fixed precedence:

  • Try and use the object in the user's default schema if it exists
  • Try and use the object in the dbo schema if it exists

This means we can get an unpleasant surprise in the results if objects in different schemas share a name.  We can demonstrate this in the script below.

USE AW2014
go
CREATE PROC dbo.SimplePrint
AS
SET NOCOUNT ON
SELECT 'This is the dbo schema'
GO
CREATE PROC Person.SimplePrint
AS
SET NOCOUNT ON
SELECT 'This is the Person schema'
GO

GRANT EXECUTE ON dbo.SimplePrint TO proc_execute;
GRANT EXECUTE ON Person.SimplePrint TO proc_execute;
DROP USER Binky;
CREATE USER Binky FOR LOGIN Binky WITH DEFAULT_SCHEMA=Person;
EXECUTE sp_addrolemember 'proc_execute', 'Binky';
exec SimplePrint
exec Person.SimplePrint

When we execute the stored procedures as the database owner, the calls give the following:

This is the dbo schema
This is the Person schema

When Binky executes the same stored procedures they get the following

This is the Person schema
This is the Person schema

If we were to drop Person.SimplePrint, then when Binky ran exec SimplePrint they would get the "This is the dbo schema" version.

Stored procedures and transactions

In our solution design we update and insert the Model_Stage table from Model_Load.

  • Update first because there are fewer records to update
  • Insert last

The number of inserts and updates is quite large, so we want to carry out those operations in batches.  In SQL Server our stored procedure for inserting records would be akin to the following:

CREATE PROC Propensity.Populate_Model_Stage @BatchSize INT=50000
AS
SET NOCOUNT ON
DECLARE @InsertedRecords INT =1,
@TotalRecords INT = 0
WHILE @InsertedRecords > 0
BEGIN
INSERT INTO Propensity.Model_Stage({Column names})
SELECT TOP (@BatchSize) SRC.{Column names}
FROM Propensity.Model_Load AS SRC
LEFT JOIN Propensity.Model_Stage AS DEST
ON SRC.model_id = DEST.model_id
WHERE DEST.model_id IS NULL
SET @InsertedRecords = @@ROWCOUNT
SET @TotalRecords = @TotalRecords + @InsertedRecords
RAISERROR('Total records inserted = %i',10,1,@TotalRecords) WITH NOWAIT
END
GO

By looking at the SQL Server transaction log we would see that we would have a transaction for each batch size chunk of records.

CHECKPOINT
exec Propensity.Populate_Model_Stage 50000
SELECT  [Current LSN],[Previous Page LSN],[Operation],[Transaction ID] ,AllocUnitName
FROM sys.fn_dblog(NULL,NULL)

The first time we ran the equivalent stored procedure in PostgreSQL, it took far longer to run and produced no output until the entire procedure had finished.  A PostgreSQL stored procedure automatically opens a transaction when it is called.  Without an explicit COMMIT after each loop iteration, everything in the stored procedure is wrapped up in a single transaction.

The equivalent PostgreSQL stored procedure is as shown below.

CREATE OR REPLACE PROCEDURE Propensity.Populate_Model_Stage(batch_size integer DEFAULT 50000)
LANGUAGE plpgsql SECURITY DEFINER
AS
$$
DECLARE
    inserted_records INTEGER := 1;
BEGIN
    -- A SET clause on the procedure definition would block COMMIT, so set
    -- search_path inside the body (all objects are schema-qualified anyway)
    SET search_path = Propensity, pg_temp;

    LOOP
        INSERT INTO Propensity.Model_Stage({Column Names})
        SELECT SRC.{Column Names}
        FROM Propensity.Model_Load AS SRC
            LEFT JOIN Propensity.Model_Stage AS DEST
                ON SRC.model_id = DEST.model_id
        WHERE DEST.model_id IS NULL
        LIMIT batch_size;

        GET DIAGNOSTICS inserted_records := ROW_COUNT;

        RAISE WARNING '% BATCH INSERT: Propensity.Model_Stage: Records inserted % on batch size %',
            clock_timestamp(), inserted_records, batch_size;

        COMMIT;

        EXIT WHEN inserted_records = 0;
    END LOOP;
END;
$$;

As a result of this transaction behaviour, our team found it was not possible to have a single controlling procedure call all the others.  To overcome this they wrote a small scheduled task to call the stored procedures in the desired order.

Stored procedure language

In SQL Server, SQL is SQL.  In PostgreSQL an equivalent stored procedure can be written in SQL or PL/pgSQL.

  • SQL for simple DML
  • plpgsql for procedural processes
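As a sketch, a step that is pure DML can be declared LANGUAGE sql, whereas the batched insert earlier needed plpgsql for its loop and COMMIT:

```sql
-- A pure-DML step needs no procedural features, so plain SQL will do.
-- Truncating ModelWIP mirrors the final step of the batch process.
CREATE OR REPLACE PROCEDURE Propensity.Truncate_ModelWIP()
LANGUAGE sql
AS $$
    TRUNCATE TABLE Propensity.ModelWIP;
$$;
```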

Closing thoughts

The examples given show where a database concept is implemented differently in two different database platforms.  I just happened to pick SQL Server and PostgreSQL.  Similar challenges would be faced if context switching from Google BigQuery to SnowflakeDB, PostgreSQL to MySQL or most other pairs of databases.

I found features in each database platform that I wish were available in the other.  That is the nature of products in a competitive marketplace.
