
SQL Saturday Baton Rouge 2025


SQL Saturday Baton Rouge is a great event hosted on LSU's campus that has some of the best food, presenters, and attendees in the country! I got some incredible questions during my session and learned a ton throughout the conference. Thank you to all the sponsors and volunteers who make this event happen year after year (16 years running!). It's one of the first events I ever spoke at (https://dataonwheels.wordpress.com/2022/08/23/sql-saturday-baton-rogue-recap/), and I'm so excited I was able to speak here again!

Here’s a link to the event where you can find other incredible speakers and sessions! Pictured below are some of my SML (Saturday Morning Learning) buddies – Dominick Raimato, Jason Romans, Kevin Pereira, and me! Thanks Sheila Romans for the great picture of us nerds!

My session – Mastering Microsoft Fabric Data Warehouses Tips & Tricks

Thanks to everyone who came out to my session and came with me on a journey to kill Pippin's long-running query, create case-insensitive warehouses, and monitor our warehouse and lakehouse queries on a single pane of glass. You were such an engaged audience, and I loved being able to laugh and commiserate about our SQL experiences together!

You can find the presentation and resources here – https://github.com/Anytsirk12/DataOnWheels/tree/main/2025%20Presentations/2025%20SQL%20Saturday%20Baton%20Rouge

Jason Romans – How to Diagnose a Slow Power BI Report

Jason gave an awesome session about how to use external tools to diagnose and even proactively solve a slow Power BI report. Check out his session live if you get a chance; there are lots of great slides and demos! Below is the abstract from his session. Be sure to follow the DAX Shepherd:
https://thedaxshepherd.com/presentations/

Users complain that a Power BI report is so slow that it is not worth using. A Power BI report may not perform well for many reasons, but how do you know where to start? In this session, I will show you some of the main tools I use to dive deep into the common and the not-so-common performance issues I run into. Starting from a slow report, we will learn how to diagnose where the problem is. Beginning with Performance Analyzer in Power BI, we can start to figure out where the issue is. Is it the visual that is slow, or are the DAX queries taking too long?

Next, we will use DAX Studio to tune the DAX queries that are being generated. Along the way, we will look beneath the visuals to see how Power BI is architected so we can better understand how to write better code from the start. We will examine Query Plans and Server Timings to evaluate which of our changes are more beneficial.

Lastly, we will examine methods for detecting and preventing performance issues before they occur. Using Tabular Editor, we will learn how Best Practice Analyzer can detect performance issues with your semantic models.

Kevin Feasel – Applying Forensic Accounting Techniques using SQL and Python

Below are my stream-of-consciousness notes from this session. It was a ton of fun to forensically investigate a SQL environment using Python, the session was packed with great examples, and it was an absolute blast to play detective. Kevin runs Curated SQL, which is an awesome site to get a "curated" list of recent blogs with incredible insights from across the data community.
https://curatedsql.com/

Forensic accounting deals primarily with investigation and litigation. The goal is to connect the dots on potential fraud in a way that will stand up in court.

Step 1 – Summary Analysis

  1. Find the largest table in terms of records or size
  2. For that largest table, review the data types and keys
  3. Look at the foreign keys off that largest table and repeat the same analysis over and over.

You end up with a database model. That gives you a high-level view of the database.
There will be some Python notebooks available in the future that walk through how to explore a sample database.
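
As a rough sketch of that first step, you could pull row counts straight from the SQL Server system views with pandas (the connection string and database name here are placeholders, not from the session):

  import pandas as pd
  import pyodbc

  # Placeholder connection string - swap in your own server and database.
  conn = pyodbc.connect(
      "Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=SampleDb;Trusted_Connection=yes;"
  )

  # Largest tables by row count, straight from the system views.
  largest = pd.read_sql(
      """
      SELECT t.name AS table_name, SUM(ps.row_count) AS row_count
      FROM sys.tables AS t
      JOIN sys.dm_db_partition_stats AS ps ON ps.object_id = t.object_id
      WHERE ps.index_id IN (0, 1)
      GROUP BY t.name
      ORDER BY row_count DESC;
      """,
      conn,
  )
  print(largest.head(10))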

Pandas has a method, df.describe(include='all'), that will give you summary stats on a dataframe. It's a pretty neat way to easily get some info, especially on numeric columns. Note – you should convert the date fields to actual dates within the dataframe (they likely default to objects); then you'll get similar stats for the date columns.
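
A minimal sketch of that (the file and the date column name are made up for illustration):

  import pandas as pd

  df = pd.read_csv("invoices.csv")                        # hypothetical sample data
  df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])   # convert dates so they aren't summarized as plain objects

  print(df.describe(include="all"))                       # counts, uniques, min/max, and quartiles for every column
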
Pandas also has good ways to easily group, count, and sort (you can also do this in SQL), so you can easily see what the top artifacts are within certain categories. One example is the number of vendors within certain expense categories. For example, maybe only one vendor is selling cleaning supplies. There may be a reason for it, but in an auditing scenario, that is a good place to start looking for preferential treatment and fraud.
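
Continuing with the hypothetical invoices dataframe from above, that vendor-per-category count might look like this:

  # How many distinct vendors bill under each expense category?
  vendors_per_category = (
      df.groupby("ExpenseCategory")["Vendor"]
        .nunique()
        .sort_values()
  )
  print(vendors_per_category.head(10))   # categories with only one vendor are worth a closer look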

Step 2 – Growth Analysis

This looks at changes in ratios over time. The goal is to look for major changes that appear to be far outside the norm. Typically this is performed against important additive measures like revenue, cost, number of customers, number of orders, and even active users as a percentage of total users (a "fixed cap" ratio).

We can perform an eyeball test at this phase, and we can also do more comprehensive analysis using KS (Kolmogorov-Smirnov) and PPS tests. The Seaborn library in Python allows you to easily create line plots and other visuals to see how additive measures behave over time. This lets you investigate and gut-check growth over time, which can be particularly interesting if you expect a linear increase or decrease but see something more random.
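
A small Seaborn sketch of that kind of gut-check plot, again using the hypothetical invoices dataframe and column names:

  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns

  # Total spend per month and category - an additive measure we can eyeball over time.
  monthly = (
      df.groupby([pd.Grouper(key="InvoiceDate", freq="MS"), "ExpenseCategory"])["Amount"]
        .sum()
        .reset_index()
  )
  sns.lineplot(data=monthly, x="InvoiceDate", y="Amount", hue="ExpenseCategory")
  plt.title("Monthly spend by expense category")
  plt.show()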

The MinMaxScaler inside of sklearn's preprocessing module can rescale everything onto a range from 0 to 1, which lets you really compare the growth of two very differently scaled items.
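
For example, a self-contained sketch with made-up monthly totals:

  import pandas as pd
  from sklearn.preprocessing import MinMaxScaler

  # Two measures on very different scales (hypothetical monthly totals).
  trends = pd.DataFrame({
      "revenue": [120_000, 135_500, 128_250, 190_000, 410_000],
      "order_count": [310, 342, 330, 355, 361],
  })

  scaler = MinMaxScaler()   # rescales each column onto a 0-1 range
  scaled = pd.DataFrame(scaler.fit_transform(trends), columns=trends.columns)
  print(scaled)             # the two growth curves are now directly comparable - revenue's jump stands out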

Gap analysis – sometimes the lack of a value can be more interesting. It has two meanings – a review of controls and where things can go wrong, and a look at what data is missing.
Controls – do people have access to edit logs? Who can do what? Does data change, and do we know when it does? Are we tracking failures? Double-check calculations on reports.
Gaps and islands – gaps in data. Gaps are elements in a sequence which are missing. Islands are contiguous sets of data up to and not including the gap.
Accountants care about sequences that should be complete, like receipt numbers, purchase orders, check numbers, invoice numbers, etc. For example, if check 1003 is missing, your accountant wants to see the voided check.

How do you solve gaps and islands? Use DENSE_RANK() to build an ordering, then calculate the difference between the actual value and the DENSE_RANK() value; wherever that difference changes, a gap pops up and a new island gets created.
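
Here's the same dense-rank trick sketched in pandas instead of T-SQL (the check numbers are made up):

  import pandas as pd

  checks = pd.DataFrame({"check_number": [1001, 1002, 1004, 1005, 1008]})

  # Value minus its dense rank is constant within an island, so it works as an island identifier.
  checks["island_id"] = checks["check_number"] - checks["check_number"].rank(method="dense").astype(int)

  islands = checks.groupby("island_id")["check_number"].agg(island_start="min", island_end="max")
  print(islands)   # the gaps (1003, 1006, 1007) sit between consecutive islands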

A warning with gaps – SQL Server will typically have gaps in IDENTITY columns. This happens often with rollbacks, where the identity value gets burned, and the same applies to SEQUENCE. Building a truly gapless sequence is a huge hit on capacity, which is why it's not the default: it forces single-row insertion and a serializable isolation level.

Regression Analysis

A regression contains a dependent variable – the thing we want to explain. For example, with a count of invoices per month, the count of invoices is the dependent variable. The dependent variable must be a continuous or countable feature, and there can only be one dependent variable in a regression (multivariate multiple regression is a thing, but it's really a series of regressions). We also have one or more independent variables that we think will explain the dependent variable (time, people, customer, location, etc.). We assume the independent variables drive the dependent variable but not the other independent variables. The data should ideally have a good amount of variety in the independent variables; a rule of thumb is 30-100 data points per combination of independent variables, and more data is better. We also assume that past behavior is similar to future behavior.

In Python, sklearn.linear_model's LinearRegression and sklearn.metrics' mean_squared_error make conducting a linear regression super easy to implement. You need to fit the model to see how the dependent variable is impacted by the provided independent variables; if you fit the model with statsmodels instead, sm_model.summary() will give you the OLS regression results. Be sure to use what you learned from your gut-check line charts. For example, if there's a significant deviation from the line in a certain year, try building the linear regression for before and during the period where that significant difference occurs and compare that to what you get after. This may also help you find the true outliers that led to the craziness overall.
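
A bare-bones sketch of that workflow with made-up data, showing both the scikit-learn fit and the statsmodels summary:

  import numpy as np
  import statsmodels.api as sm
  from sklearn.linear_model import LinearRegression
  from sklearn.metrics import mean_squared_error

  # Hypothetical data: month number explaining a monthly invoice count.
  X = np.arange(1, 25).reshape(-1, 1)                                   # independent variable (months 1-24)
  y = 50 + 3 * X.ravel() + np.random.default_rng(0).normal(0, 5, 24)    # dependent variable with some noise

  model = LinearRegression().fit(X, y)
  print("MSE:", mean_squared_error(y, model.predict(X)))

  # statsmodels produces the full OLS regression results table via .summary().
  sm_model = sm.OLS(y, sm.add_constant(X)).fit()
  print(sm_model.summary())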

Cohort analysis – looking at top-level aggregates or by a single slice might help identify problems, and differences should be able to be explained by domain experts. Things to look at include (a quick pandas check for the first item is sketched after this list):

  • Duplication of entries
  • Frequency of one thing versus another
  • Changes in frequency over time
  • Shape of numeric distributions
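
A minimal pandas sketch for spotting duplicated entries, again using the hypothetical invoices dataframe and column names:

  # Flag rows where the same vendor billed the same amount under the same invoice number.
  dupes = df[df.duplicated(subset=["Vendor", "InvoiceNumber", "Amount"], keep=False)]
  print(dupes.sort_values(["Vendor", "InvoiceNumber"]))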

Intuition on duplication

  • People don't like to perform more work than necessary
  • When given an option between more work and less work, people will generally choose less work
  • Duplication is usually considered more work
  • If the low-work option is easy to understand, people are more likely to gravitate to it, ceteris paribus

These are not rules of nature, just intuitions.

Digit analysis – you should expect a uniform distribution of the last digit in a sequence. There are exceptions: budgets are usually rounded off and end with 5s or 0s, estimates in general are also rounded to 5 or 0, and less precise measurements often end in 0.

When people try to fake data, they tend to go with "more random" numbers like 2, 3, 7, and 8. Benford's law – unlike the final digit, the first digit of a sequence often follows a power law: 1 is more likely to occur than 2, and so on. It works best when the data spans 2-3 orders of magnitude or more. Pretty neat, definitely look up Benford's law.
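
A quick sketch of checking leading digits against Benford's expected distribution (the amounts are made up):

  import numpy as np
  import pandas as pd

  amounts = pd.Series([1203.5, 184.2, 97.1, 1450.0, 23.9, 310.0, 118.6, 2750.3, 41.8, 1999.9])

  # Observed share of each leading digit.
  first_digit = amounts.astype(str).str.lstrip("0.").str[0].astype(int)
  observed = first_digit.value_counts(normalize=True).sort_index()

  # Benford's law: P(d) = log10(1 + 1/d) for d = 1..9.
  expected = pd.Series({d: np.log10(1 + 1 / d) for d in range(1, 10)})

  print(pd.DataFrame({"observed": observed, "expected": expected}).fillna(0))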

Python library is called beautiful soup for this. Pretty cool.

Keep records of your findings. As you dig, you will want to have that data readily available for future reference.

Jonathan Stewart – Building scalable data warehouses in Fabric

Jonathan walks through the basics of building a data warehouse that can easily grow without taking down your capacity. He covers the best data integration options, star schema principles, and some great data modeling best practices, with a focus on column selection, JOIN, and aggregation optimization techniques.

V-Ordering – when should you do it? It's enabled by default for read optimization on the data warehouse but off by default for lakehouses to prioritize data timeliness. Consider turning it on for:

  • Direct Lake semantic models
  • When user experience is a priority over data timeliness

Size optimization strategy – choose the smallest data type that supports your data requirements. Plan for scaling, but realistically: a column that holds yes/no will never have more than 3 characters, so it can be a VARCHAR(3).

His abstract:

Learn how to design and implement scalable data warehouses using Microsoft Fabric. This session will cover best practices for data storage, management, and retrieval, as well as the use of Fabric’s data warehousing features. Attendees will gain insights into creating efficient data pipelines, optimizing performance, and ensuring data consistency.
