
Power BI: Data Quality Checks Using Python & SQL


Picture this: you have a report in Power BI that someone passes off to you for data quality checks. There are a few ways to make sure your measures match what is in the source data system, but for this demo we are going to use Python and Excel to perform our data quality checks in one batch. To do that, we are going to build a Python script that can call the Power BI REST APIs, connect to a SQL Server, and connect to Excel to grab the formulas and push the quality check results back into Excel for final review. To find a sample Excel file and the final Python script, please refer to my GitHub.

To build this, you’ll need a few pieces:

  1. A way to run Power BI REST APIs
    • Requires an Azure AD app – if you can’t create one, you will need someone in your company to create one and give you its client ID and credentials
  2. Ability to run Python scripts (Visual Studio Code will be used in this demo)
  3. Access to a Power BI dataset that is published to the Power BI service (Premium is not needed)
  4. Access to a SQL Server
  5. A list of DAX formulas that you want to test, stored in Excel
  6. A list of SQL queries that should return the same results as your DAX formulas, stored in Excel

Authentication

To start, you’ll need to register an application for Power BI within Azure. First, create & register your app in the Azure portal. Microsoft’s documentation will take you through the manual process step-by-step, including how to add permissions so the application can access the Power BI service. You’ll need to give your app the delegated permission of Dataset.ReadWrite.All or Dataset.Read.All as well as Tenant.Read.All or Tenant.ReadWrite.All, which will require an admin to grant (check the required scope in the API documentation). One important limitation: the account you use to access the Azure AD app must not have MFA enabled. If that makes your IT security team squeamish, remind them that the app can be given read-only access to only Power BI’s metadata. No proprietary data is at risk, and write back is not necessary to accomplish our goals.

Let’s test out running the REST API using Python! We will need to generate an access token, then call the REST API using that token.

Adding Libraries in Python

If you have not used some of the libraries that will be imported for this demo, you will need to run “pip install [library name]” in your command prompt. If you don’t have pip, follow the directions in the pip documentation and/or modify your Python installation to include pip. Newer versions often include it by default, but triple-check that you have added Python to your environment variables, or else you will get errors saying Python isn’t installed when you try to run commands in the command prompt.
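If you want to grab everything used in this post in one shot, the command below should cover it (pypiwin32 is the package that provides the win32com module we will use later):

pip install adal requests pymssql pandas openpyxl pypiwin32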

Getting the REST API Token

Alrighty, let’s go to Visual Studio Code and build out the authentication portion of our Python. We are going to use the adal library to make a REST API call that will get the token for us. Make sure to put single or double quotes around your Client ID, Username, and Password to format them as text. Keep in mind, our final code will get these values from the excel file instead of hard coding them into our code. That means you can also get these values from a separate Excel file if your IT team would prefer to keep these values out of the final code (which is highly recommended).

import adal
authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
client_id = '<INSERT CLIENT ID>'
username = '<INSERT USERNAME>'
password = '<INSERT PASSWORD>'
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,
                                                     client_id=client_id,
                                                     username=username,
                                                     password=password)
access_token = token.get('accessToken')
print(access_token)

Awesome, now we have a way to generate an access token to use for our future API calls!

Alrighty, let’s try this out with a simple REST API call – getting a list of the workspaces (called groups in the API for some reason) in my tenant. I only have a couple workspaces, but this will let us test out passing our token to get back a response. Add the following lines to the end of the code above:

import requests
get_workspaces_url = 'https://api.powerbi.com/v1.0/myorg/groups'
header = {'Authorization': f'Bearer {access_token}'}
r = requests.get(url=get_workspaces_url, headers=header)
r.raise_for_status()
print(r.text)

You should see a JSON response listing your workspaces. You’ll notice there are a lot of elements in this response, but we can look at the name values and confirm that the code is working as expected. Exciting stuff!
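If you just want the workspace names, you can also let requests parse the response as JSON. This is a quick sketch assuming the standard response shape, where the workspaces sit in a “value” array:

workspaces = r.json() #requests can parse the response body as JSON for us
for workspace in workspaces['value']:
    print(workspace['name']) #print just the workspace names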

Connecting to a SQL Server database

To connect to SQL, we are going to use the pymssql library. See below for the format we will need to connect to our SQL Server database and run a sample query.

#SQL Server Authentication
import pymssql
sql_server = 'server.database.windows.net'
sql_user = 'Sample'
sql_password = 'Password'
sql_database = 'DatabaseName'
sql_con = pymssql.connect(sql_server, sql_user, sql_password, sql_database)
sql_cursor = sql_con.cursor(as_dict=True)
#execute SQL script
sql_cursor.execute('SELECT TOP 1 SalesKey FROM dbo.FactSales')
for row in sql_cursor: 
    print(row)

Your result will be a dictionary that maps each column name to its value. This will be important when we compare it to our Power BI DAX query result.
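For example, a result row from the query above might look like the sketch below (the SalesKey value is made up), and you can grab any value by its column name:

#row is a dictionary, e.g. {'SalesKey': 1}
print(row['SalesKey']) #returns just the value for that column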

Comparing DAX Query with SQL Query

Alrighty, now to the main event! We will run a DAX query against our Power BI data model and compare it to a SQL query to ensure our measure is working as expected. For this demo, I’ll keep it pretty simple. I have a measure in my dataset, Sales Amount = SUM(FactSales[SalesAmount]). This should equal the result of the SQL query SELECT SUM(SalesAmount) 'Sales Amount' FROM FactSales. If you get errors using the Execute Queries REST API, please review the limitations section of the Microsoft Execute Queries documentation.

Python Code to Run SQL Script

#SQL Script Execution
sql_cursor.execute('SELECT SUM(SalesAmount) as SalesAmount FROM dbo.FactSales')
for row in sql_cursor: 
    print(row["SalesAmount"])

Python Code to Run PBI Execute Queries REST API

#PBI Execute Queries
import json
pbi_dataset_id = '10c3f357-9b2a-42f4-8438-748295b47b9b'
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbi_dataset_id}/executeQueries'
dax_query_json_body = {
    "queries": [
        {
            "query": 'EVALUATE ROW("Sales_Amount", \'_Measures\'[Sales Amount])'
        }
    ]
}
eq = requests.post(url=execute_queries_url, headers=header, json=dax_query_json_body)
eq.raise_for_status()
query_json = eq.text.encode().decode('utf-8-sig')
print("JSON Response ", json.loads(query_json))

Awesome! Looks like now I can visually confirm that my Power BI DAX measure returns the same result as my SQL query. But what if I want to do more than one at a time? In the next section, we will flip the script to use an Excel file as our input and our output. That way, all you would need to do is have two columns for input – one with the SQL query and one with the corresponding DAX formula. Then the output would be a third column that returns the difference between the two (ideally, 0).
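As a side note, if you ever want to compare the two numbers directly in Python instead of eyeballing them, you can parse the response rather than reading the raw text. This is a quick sketch assuming the documented executeQueries response shape (results > tables > rows) and the variables from the snippets above:

#parse the response and dig out the single measure value
pbi_json = json.loads(query_json)
pbi_result = list(pbi_json['results'][0]['tables'][0]['rows'][0].values())[0]
print(pbi_result) #compare this against row['SalesAmount'] from the SQL snippet; ideally they match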

Use Excel File as Input and Output

To start, we are going to build an Excel template to use with this script. For ease of use, we are going to use formatted tables so end users can append new measures as needed (a copy is available on our GitHub for download). We will make two sheets: one for credentials and one for the data quality checks. On the credentials sheet, make one table that holds both the SQL credentials and the PBI credentials. I’m going to name the table “Creds”.

On the Quality Check tab, we will make one more table with columns for the Power BI DAX formula, the SQL query, the result of the DAX formula, the result of the SQL query, and the difference between the two. Let’s avoid using spaces in our column and table names to keep our coding simple. We’ll call this table “Quality”.

Alrighty, now to pull these values into python! Let’s start with grabbing our creds and pulling them in. For this section, we need to import the openpyxl library and use it to pull in our tables. To start, we are going to grab our file then go to the worksheet named “Credentials”. From there, we will see what tables are in that worksheet.

#Grab data from our excel file
from openpyxl import load_workbook
#Read file
wb = load_workbook(r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx")
#we need the r at the beginning since "\" is a special character in python
sheets = wb.sheetnames
print(sheets) #see the sheets in our workbook
#access specific sheet called Credentials
cred_ws = wb['Credentials']
print(cred_ws.title)
cred_tables = cred_ws.tables.values()
for table in cred_tables:
    print(table.displayName)

Next, we are going to loop through the tables and build a mapping dictionary that pairs each table name with a dataframe created from our excel file. If you’re not familiar with python, a dataframe is essentially Python’s version of a SQL view or Excel table that can store and manipulate tabular data. Once we get our two tables into dataframes, we can start to pull out values as inputs for our earlier equations.

Don’t worry, at the end of this blog is the entire final script.

cred_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in cred_ws.tables.items():
    #parse the data within the ref boundary
    data = cred_ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] #each ent is one row of cells in the table
               for ent in data
          ]
    header = content[0]
    #the contents excluding the header (aka column names)
    rest = content[1:]
    #create dataframe with the column names
    #and pair table name with dataframe
    cred_df = pandas.DataFrame(rest, columns = header)
    cred_mapping[entry] = cred_df
print(cred_df)
server = cred_df.at[0,"SQL_Server"]
db = cred_df.at[0,"SQL_Database"]
print(server)
print(db)

Sweet, now we can get our SQL_Server and SQL_Database values directly from the spreadsheet instead of hard coding them in Python! To code out the rest of the variables, simply copy the line above and replace “server” and “SQL_Server” with the variable name and column name that you want to return.
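For example, pulling the SQL user would look like this (the column name assumes the headers in the sample template):

sql_user = cred_df.at[0,"SQL_User"] #the same pattern works for every column in the Creds table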

Alrighty, now we need to pull in our data quality spreadsheet and set up some for loops to iterate through the lists of Power BI and SQL queries and write the outputs back to our Excel sheet.

Note: Your SQL query will need to have every column named, or else you will get an error when you run the cursor over it (see the example below). Also, for this section we will need to import a couple more libraries. If you don’t already have win32com installed, you’ll need to run pip install pypiwin32 in your command prompt.
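For example, the AS alias in this query gives the dictionary cursor a key to return, so it is safe to run through the loop below:

#the AS alias names the column so the dictionary cursor has a key for it
sql_cursor.execute('SELECT SUM(SalesAmount) AS SalesAmount FROM dbo.FactSales')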

This script is pretty awesome to see work because we will finally get some values back into our Excel file! The first section uses the win32com library to essentially open excel in edit mode instead of read-only like we have been doing. After we open the worksheet to edit it, we will grab our SQL queries, then iterate through them to get the results and put them back into Excel. If that doesn’t get the nerd in you excited, I’m not sure what will!

from win32com.client import Dispatch
#Using the win32com library, we can open excel in edit mode
xl = Dispatch("Excel.Application") #opens excel for us to edit
xl.Visible = True
excel_file = r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx" #same file we loaded earlier
edit_wb = xl.Workbooks.Open(excel_file)
edit_qc_sh = edit_wb.Worksheets("Quality Check")
#Running SQL query and putting result back into Excel
#quality_df is built from the Quality Check table the same way we built cred_df (the final script at the end shows this step)
sql_df = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sql_queries = sql_df.loc[:,"SQL_Query"]
#the first argument (:) selects every row, and the second argument selects the column we want
#print(sql_queries) #this returns the rows in our sql query column
rownumber = 1
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0]
        rownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1]
        quality_df.at[rownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{rownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values
        
edit_wb.Save() #saving our values back to excel

Okay, time to do the same thing with our DAX queries and wrap this project up. You may have noticed that the DAX queries in the sample data are not exactly the same as what you would use in your measure. What we are doing is telling DAX how we want the data back; in this case, we want it to return one row (hence the EVALUATE ROW at the beginning) based on measures I have in a table called “_Measures”. If you aren’t sure how to make the DAX query work for this API, feel free to put the measure you want to test in a card visual, then use the performance analyzer to get the query. A good way to see if your DAX query will work is to test it out in DAX Studio.
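To make that concrete, here is the kind of value that would go in the PBI_DAX_Query column of the Excel file – a single EVALUATE ROW wrapped around the measure you want to test (the measure and table names match my earlier example; swap in your own):

EVALUATE ROW("Sales_Amount", '_Measures'[Sales Amount])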

Python time! We are going to tweak our code from earlier by renaming our variable “header” to “pbiheader” so we can reference it more clearly in our JSON request code. Again, feel free to skip to the end of this post for the final code if you have any questions (or just need this to work, fast). First, we will need all the pieces to execute our POST request. Once we have the URL, the DAX query, and the JSON body, we can iterate through the list of queries from our Excel file.

#Running queries and putting results back into Excel
qdf = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sqlrownumber = 1
pbirownumber = -1
#Running SQL Queries
sql_queries = qdf.loc[:,"SQL_Query"] 
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0] #gets the first item from the row dictionary
        sqlrownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1] #grabs just the result of our query
        quality_df.at[sqlrownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{sqlrownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values
        
#Running PBI Queries
pbi_queries = quality_df.loc[:,"PBI_DAX_Query"]
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbidatasetid}/executeQueries'
for items in pbi_queries:
    pbirownumber += 1
    pbiexcelrownumber = pbirownumber+2
    list_pbiquery = list(pbi_queries.items())[pbirownumber]
    #print(list_pbiquery)
    item_pbiquery = list_pbiquery[1]
    dax_query_json_body = {
        "queries": [
            {
                "query": item_pbiquery
            }
        ]
    }
    pbi_response = requests.post(url=execute_queries_url, headers=pbiheader, json=dax_query_json_body)
    query_json = pbi_response.text.encode().decode('utf-8-sig') #allows us to read the json response
    pbi_values = query_json.replace("}]}]}]}","") #strips the trailing brackets from the json text
    #print(query_json)
    pbi_result = pbi_values.split(":")[4] #grabs just the result of our query
    print(pbi_result) #prints just the result of our query
    quality_df.at[pbiexcelrownumber,"PBI_Result"] = pbi_result #this will put our results into the proper cell in our dataframe
    edit_qc_sh.Range(f"C{pbiexcelrownumber}").Value = pbi_result #this will put our results in the right excel cell
edit_wb.Save() #saving our values back to excel

Pretty great! Now we have a functioning data quality testing tool where Excel holds all our inputs and outputs and all we need to do in Python is hit Run.

Final Script

'''---------------README------------------
Full instructions available: https://dataonwheels.wordpress.com/?s=Power+BI%3A+Data+Quality+Checks+Using+Python+%26+SQL
The goal of this script is to compare DAX measures to corresponding SQL queries. 
To accomplish this, it uses an excel file containing authentication variables and query values.
This idea came from seeing the Execute Queries REST API in Power BI and wanting to use python to perform our data quality checks efficiently.
To connect to a Power BI Data Model, we need to pass an authentication token generated through an Azure AD App. 
To run a DAX query and SQL query, we need to loop through our excel file then put the queries into the API and SQL calls respectively.
Finally, this script takes those query results and puts it back into the excel file.
I have left some print() lines commented out, feel free to uncomment them to troubleshoot or walk through the code step-by-step.
You will need to swap out the excel_file variable with your excel file path. Other than that, the rest of the variables are managed inside your excel file. 
Once you have your variables and queries in excel, hit run and you're good to go. 
'''
#---------------------------------------#
#       Import libraries needed  
#---------------------------------------#
import requests
import adal
import json
import pymssql 
import pandas
import openpyxl
from win32com.client import Dispatch
#---------------------------------------#
#    Build out authentication steps  
#---------------------------------------#
#----- Authentication Variables from Excel -----#
#Grab authentication data from our excel file
from openpyxl import load_workbook
#we need the r at the beginning since "\" is a special character in python
excel_file = r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx"
wb = load_workbook(excel_file) 
sheets = wb.sheetnames
#print(sheets) #see the sheets in our workbook
cred_ws = wb['Credentials'] #access specific sheet called Credentials
cred_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in cred_ws.tables.items():
    #parse the data within the ref boundary
    data = cred_ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] #each ent is one row of cells in the table
               for ent in data
          ]
    header = content[0] #the column names from the first row of the table
    rest = content[1:] #the contents excluding the header row
    cred_df = pandas.DataFrame(rest, columns = header) #this is the dataframe we will use to get our creds
    cred_mapping[entry] = cred_df
#Use the dataframe to set up credential variables we can use later in the script
sqlserver = str(cred_df.at[0,"SQL_Server"])
sqldb = str(cred_df.at[0,"SQL_Database"])
sqluser = str(cred_df.at[0,"SQL_User"])
sqlpassword = str(cred_df.at[0,"SQL_Password"])
pbiclientid = str(cred_df.at[0,"PBI_ClientID"])
pbiusername = str(cred_df.at[0,"PBI_Username"])
pbipassword = str(cred_df.at[0,"PBI_Password"])
pbidatasetid = str(cred_df.at[0,"PBI_DatasetID"])
#check to make sure your variables are correct by uncommenting the next line
#print(sqlserver,sqldb,sqluser,sqlpassword,pbiclientid,pbiusername,pbipassword,pbidatasetid)
#----- Power BI REST API Authentication -----#
authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,                                                     
                                                     username=pbiusername,
                                                     password=pbipassword,
                                                     client_id=pbiclientid)
access_token = token.get('accessToken')
pbiheader = {'Authorization': f'Bearer {access_token}'}
#try out the get workspaces REST API using our access token by uncommenting the next few lines
#get_workspaces_url = 'https://api.powerbi.com/v1.0/myorg/groups'
#r = requests.get(url=get_workspaces_url, headers=pbiheader)
#r.text will give us the response text for our get request, pretty neat!
#print(r.text)
#----- SQL Server Authentication -----#
try:
    sql_con = pymssql.connect(sqlserver,sqluser,sqlpassword,sqldb)
    sql_cursor = sql_con.cursor(as_dict=True)
except Exception as e:
    raise Exception(f"Could not connect to SQL Server: {e}")
#---------------------------------------#
#  Build out data quality check steps 
#---------------------------------------#
#----- Read excel to get quality check queries into a dataframe -----#
quality_ws = wb['Quality Check']
quality_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in quality_ws.tables.items(): #grabs data dynamically from our table
    data = quality_ws[data_boundary] #parse the data within the ref boundary
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] #each ent is one row of cells in the table
               for ent in data
          ]
    header = content[0] #the column names from the first row of the table
    rest = content[1:] #the contents excluding the header row
    qualitydf = pandas.DataFrame(rest, columns = header) #this is the dataframe we will use to get our quality check queries
    quality_df = qualitydf.fillna(' ') #replaces empty cells with a space so blank excel rows don't break the loops below
    quality_mapping[entry] = quality_df
#print(quality_df)
#----- Open excel file in edit mode -----#
xl = Dispatch("Excel.Application") #opens excel for us to edit
xl.Visible = True
edit_wb = xl.Workbooks.Open(excel_file)
edit_qc_sh = edit_wb.Worksheets("Quality Check")
#----- Set variables to use in our iterators -----#
qdf = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sqlrownumber = 1
pbirownumber = -1
#----- Run SQL queries and put results back into excel -----#
sql_queries = qdf.loc[:,"SQL_Query"] 
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0] #gets the first item from the row dictionary
        sqlrownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1] #grabs just the result of our query
        quality_df.at[sqlrownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{sqlrownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values
#----- Run PBI DAX queries and put results back into excel -----#
pbi_queries = quality_df.loc[:,"PBI_DAX_Query"]
#print(pbi_queries)
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbidatasetid}/executeQueries'
for items in pbi_queries:
    pbirownumber += 1
    pbiexcelrownumber = pbirownumber+2
    list_pbiquery = list(pbi_queries.items())[pbirownumber]
    #print(list_pbiquery)
    item_pbiquery = list_pbiquery[1]
    dax_query_json_body = {
        "queries": [
            {
                "query": item_pbiquery
            }
        ]
    }
    pbi_response = requests.post(url=execute_queries_url, headers=pbiheader, json=dax_query_json_body)
    query_json = pbi_response.text.encode().decode('utf-8-sig') #allows us to read the json response
    pbi_values = query_json.replace("}]}]}]}","") #strips the trailing brackets from the json text
    #print(query_json)
    pbi_result = pbi_values.split(":")[4] #grabs just the result of our query
    #print(pbi_result) #prints just the result of our query
    quality_df.at[pbiexcelrownumber,"PBI_Result"] = pbi_result #this will put our results into the proper cell in our dataframe
    edit_qc_sh.Range(f"C{pbiexcelrownumber}").Value = pbi_result #this will put our results in the right excel cell
#----- Save our changes back to the excel file -----#
edit_wb.Save() #saving our values back to excel
print("All done")

Additional Resources

This was my first experience with Python, and a number of great resources helped me build out this demo and learn more about how Python operates. I hope this walkthrough is as useful to you as those resources were to me!

