Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers.
I’ve got a database of NBA statistics with data like this for players. I downloaded a CSV and loaded it into SQL Server.
I decided to play with the data a bit and at one point wanted to see who scored the most points for a team and year. So I ran this query:
year = ‘1972’
AND team = ‘LAL’
The result was 705. That’s a decent number of points, and if I weren’t careful, this might seem fine. 1972 was a long time ago, and they didn’t score as many points as they do today in games.
In fact, if I were putting this in a summary report with lots of data, it might be the case that someone glancing at this would make a poor decision based on the data.
Let’s look at the data.
Even a quick glance would let me know this seems funny. There are values of 1575 and 1084 in there, but the MAX() I returned was 705. If I look deeper at the import, I can see why.
Anything stand out there? If you look, pts is a varchar, not a numerical value. In the character world, 705 beats 1575. I really need this query:
Always be aware of the datatypes you work with and manipulate. Knowing a little bit about the meaning and use of the data can help you spot anomalies like this. As much as I like random test data, I’d also be sure you have some real data cases when you have users check your work. It’s easy for them to miss problems like this without good reference cases.
Or use good test data that you’ve setup and unit tests.
Filed under: Blog Tagged: SQLNewBlogger, syndicated, T-SQL