Statistical Analysis

  • Comments posted to this topic are about the item Statistical Analysis

  • I think this is one of the areas where our profession will grow more and more over the next decade. As we deal with lots of data of varying types, and our organizations look to gain some strategic advantage through deeper insight into their information, we will have lots of chances to experiment and learn more about complex data analysis.

    First, I wanted to reply not so much because I have very much to contribute but because I'm extremely interested in this thread - I can't wait to see where this goes. Second, if you have not read the SSC article Big Data for SQL Folks, do it. (I'm guessing you have; it's more of a plug.)

    I loved stats in college but my degree was in business and I have not really used it since I graduated in 2002. My company (Slalom) is putting a lot of pressure on their BI Consultants (of which I am one) to learn more about Big Data, Data Science, Hadoop, AWS, NoSQL, etc... One of my professional goals for the year was to get up-to-speed on the AWS family of big data/cloud/DB as a Service products. I just wrapped up a project with Google and can say that they are going after AWS. I'm currently reading a Hadoop book and trying to wrap my head around Pig, Hive and HiveQL. I'm also reading Statistics for Dummies (I got it for like $5 at Half Priced Books in Chicago).

    I have nothing much to report on the advanced functions front except to say that I dig text mining and analysis. I spend 10+ hours a week screwing around with that. Check out the N-Gram Wikipedia page - I think it's absolutely fascinating stuff. It's commonly used in protein sequencing and DNA sequencing. SQL Server 2008 R2+ includes an nGrams function (mdq.NGrams) in the collection of CLRs used by MDS and DQS. I have been jerking around with n-grams for a year+ now (note the code at the end of my comment).

    I was having a great discussion today with one of our Data Scientist/Predictive Analytics people and I hear that creating useful R libraries is the big thing in the R/Data Science community.

    IF OBJECT_ID('dbo.nGrams8K') IS NOT NULL DROP FUNCTION dbo.nGrams8K
    GO
    CREATE FUNCTION dbo.nGrams8K (@string varchar(8000), @k int)
    /********************************************************************
    Created by: Alan Burstein
    Created on: 3/10/2014
    Last Updated on: 5/22/2015

    n-gram defined:
    In the fields of computational linguistics and probability,
    an n-gram is a contiguous sequence of n items from a given
    sequence of text or speech. The items can be phonemes, syllables,
    letters, words or base pairs according to the application.
    For more information see: http://en.wikipedia.org/wiki/N-gram

    Use:
    Outputs a stream of tokens based on an input string.
    Similar to mdq.NGrams:
    http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx
    Except that it only returns tokens exactly @k characters long.
    nGrams8K also includes the position of the "gram" in the string.
    ********************************************************************/
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    WITH
    E1(N) AS (SELECT 1 FROM (VALUES (null),(null),(null),(null),(null)) x(n)),
    E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
    iTally(N) AS
    (
      SELECT TOP (LEN(@string)-(@k-1)) CHECKSUM(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
      FROM E3 a CROSS JOIN E3 b
    )
    SELECT
      position = N,
      token    = SUBSTRING(@string,N,@k)
    FROM iTally;
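
    A quick usage sketch (just an illustration, not part of the original function definition): splitting a short string into its trigrams with the function above.

    -- Returns each 3-character token and its starting position:
    -- (1,'abc'), (2,'bcd'), (3,'cde'), (4,'def'), (5,'efg')
    SELECT position, token
    FROM dbo.nGrams8K('abcdefg', 3);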

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Can't say that I've used them directly beyond the aforementioned SUM and AVG, etc., as I prefer to create a data set for the people who want to do analysis and let them work the formulas in Excel or whatever.

    However I did recently engage two of my colleagues to evaluate a solution I developed about a year and a half ago to track certain labor data. They're both very well-versed in process improvement and statistical analysis and I understand enough about both to be dangerous.

    It's exciting and a little scary at the same time (this project is my baby) but I'm looking forward to the experience. Since the database it's built on is in Full Recovery with hourly t-log backups, I've toyed with the idea of "going back in time" and rolling it forward (in a non-production environment, of course) by intervals to simulate changes over time (hmm...interesting idea for a script...). I also took an on-line statistics course last year that had an optional module on R which I took but never really had a reason to use until now.

    It'll be fun.

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • It's not just for data scientists: basic statistical analysis can also be incorporated into a database at the operational level. For a dashboard application that monitors ETL processes, I leverage the STDEV() function to automatically identify runs with unusually high or low record volumes, records-per-minute rates, and various data quality issues like miscodes. Runs that deviate from the historical trend are flagged and held back in staging, and an email alert is sent to the data analysts. At that point they manually review the dataset before choosing to either reject or release the run to the production data warehouse.
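
    A minimal sketch of that kind of check (the dbo.EtlRunLog table and its columns are hypothetical, not my actual schema): flag recent runs whose record count sits more than two standard deviations from the 90-day mean.

    -- Hypothetical log table: dbo.EtlRunLog(RunId, RunDate, RecordCount)
    WITH history AS
    (
        SELECT AVG(1.0 * RecordCount)   AS MeanCount,
               STDEV(1.0 * RecordCount) AS StdevCount
        FROM dbo.EtlRunLog
        WHERE RunDate >= DATEADD(DAY, -90, GETDATE())   -- 90-day historical window
    )
    SELECT r.RunId,
           r.RecordCount,
           Flagged = CASE WHEN ABS(r.RecordCount - h.MeanCount) > 2 * h.StdevCount
                          THEN 1 ELSE 0 END
    FROM dbo.EtlRunLog AS r
    CROSS JOIN history AS h
    WHERE r.RunDate >= DATEADD(DAY, -1, GETDATE());     -- runs from the last day

    Flagged rows would then drive the hold-in-staging and email steps described above.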

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Did a little bit earlier this week. We'd encountered an issue in which certain orders got stuck in a "paying" state. This state applies when a customer has opened a payment/card details modal, and doesn't change until a payment success initiates a call to confirm the order. The issue we'd had was a load problem that prevented a certain number of these calls from occurring, so I was asked to write a check to identify future instances of this happening.

    However simply looking at the existence of this status wouldn't cut it because a number of customers will inevitably change their mind at checkout and simply close the browser window or click back -- leaving their order in the same status. This check was eventually going to be plugged into monitoring software (and therefore callouts), so I wanted to keep false positives to a minimum. To compound this, we have a number of clients using different instances of the same software, with differing "busy" and "quiet" times (read: different countries), differing order volumes, and different customer behaviours.

    What I wrote is pretty simple, but gave me a nice refresher in some basic statistical concepts. Initially I went back n minutes (I was working with between 15 and 30 for testing) and looked at the order statuses over that window. The ratio is easy: just (paying / (paying + confirmed)). However, in order to get a gauge of the expected percentage for any given client over n minutes at that time of day, I also went back d days (tested between 30 and 90) to retrieve the ratio for the same time period on each day.
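
    To make the ratio part concrete, a minimal sketch (the dbo.Orders table and status values are hypothetical, and @n stands in for the window size):

    DECLARE @n int = 30;   -- look back n minutes

    -- Proportion of orders still in 'paying' vs. confirmed over the last @n minutes
    SELECT PayingRatio =
           1.0 * SUM(CASE WHEN Status = 'paying' THEN 1 ELSE 0 END)
               / NULLIF(COUNT(*), 0)
    FROM dbo.Orders
    WHERE Status IN ('paying', 'confirmed')
      AND CreatedAt >= DATEADD(MINUTE, -@n, GETDATE());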

    I'm often writing similar things, and normally stick an AVG on it and roll my own attempt to minimise false positives. But this time I started looking into whether STDEV was suitable for this. I played around with STDEV and STDEVP and got some fairly decent results (and luckily had the data from the aforementioned incident to test on too), but I could still see a jarring number of false positives from the past 90 days.

    I have the immense fortune of sitting next to a PhD mathematician and so explained what I was doing and asked if I was using standard deviations correctly (I remembered the words from school, but had to read the Wikipedia article to remind myself what it was). He agreed that I was but indicated that I might want to compare the mean to the median in order to see if this data was distributed normally.

    I then discovered that there's no median function built into SQL Server (2008 R2, at least). Simple enough to write. I then heard the words "interquartile range" (75th percentile - 25th percentile) for the first time since before my GCSEs (the mandatory exams we sit in the UK at around 16). As I was now anticipating also needing the 25th and 75th percentiles I threw NTILE(4) into the mix and did it that way. Sure enough, the median and mean were far enough apart for the IQR method to win me over, once I'd adjusted the multiplier to my liking.
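
    A minimal sketch of that NTILE(4) approach (dbo.OrderRatios is hypothetical, holding one paying ratio per historical day, e.g. collected with a query like the one earlier): take approximate quartile boundaries, derive the IQR, and build lower and upper fences with an adjustable multiplier.

    WITH quartiled AS
    (
        SELECT Ratio,
               NTILE(4) OVER (ORDER BY Ratio) AS Quartile
        FROM dbo.OrderRatios
    ),
    bounds AS
    (
        SELECT MAX(CASE WHEN Quartile = 1 THEN Ratio END) AS Q1,     -- approx. 25th percentile
               MAX(CASE WHEN Quartile = 2 THEN Ratio END) AS Median, -- approx. 50th percentile
               MAX(CASE WHEN Quartile = 3 THEN Ratio END) AS Q3      -- approx. 75th percentile
        FROM quartiled
    )
    SELECT Q1, Median, Q3,
           IQR        = Q3 - Q1,
           LowerFence = Q1 - 1.5 * (Q3 - Q1),   -- 1.5 is the adjustable multiplier
           UpperFence = Q3 + 1.5 * (Q3 - Q1)
    FROM bounds;

    The current ratio is then compared against the fences rather than against a fixed threshold.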

    The final thing I wanted to do was identify a "buffer time" for any given client. The data we have available allows us to retrieve a rough estimate of the average time between a person beginning an order and it being confirmed (i.e. how long it takes them to pay), so I incorporated that and shifted n by that many minutes (unfortunately this meant having two statements in the check procedure, which I'd managed to avoid until then through prudent use of CTEs). I left multipliers, thresholds, overrides etc. to my client process so that those can be adjusted on-the-fly should false positives crop up again, but parametrised n, d and the buffer via the stored proc.

    So the quick request (came in on Tuesday, implemented Wednesday) turned into a bigger job than expected, but I've now:

    * had a refresher course in standard deviation and interquartile ranges

    * learnt what normal (Gaussian) distribution is (if I ever knew beforehand, I'd forgotten)

    * talked to a skilled mathematician about topics that interest us both

    * exceeded my usual quota of 5 hours per week complaining about all the window functions and framing abilities in 2012 and 2014 (and 2016 now, I guess) that I don't get to play with

    * learnt about the PERCENTILE_DISC, PERCENTILE_CONT, CUME_DIST and PERCENT_RANK functions in 2012+

    And through further reading about the PERCENTILE_* functions I was able to not feel so bad about not being able to use them, because a) I got to use window functions instead, and I loves me some window functions; and b) Aaron Bertrand's observation that even a handwritten median in 2000 (!) outperformed PERCENTILE_CONT in his testing.
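
    For reference, on 2012+ the same quartiles can be read directly with PERCENTILE_CONT (a minimal sketch against the same hypothetical dbo.OrderRatios table as above):

    SELECT DISTINCT
           PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Ratio) OVER () AS Q1,
           PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY Ratio) OVER () AS Median,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Ratio) OVER () AS Q3
    FROM dbo.OrderRatios;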

  • Knowing that my more advanced maths was either lacking or nonexistent, I decided to do the edX.com Calculus 1A: Differentiation course. Unfortunately, the course is full. The syllabus suggests that I need it, and the test questions did too.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • simon.barnes (5/29/2015)


    I then discovered that there's no median function built into SQL Server (2008 R2, at least). Simple enough to write. I then heard the words "interquartile range" (75th percentile - 25th percentile) for the first time since before my GCSEs (the mandatory exams we sit in the UK at around 16). As I was now anticipating also needing the 25th and 75th percentiles I threw NTILE(4) into the mix and did it that way. Sure enough, the median and mean were far enough apart for the IQR method to win me over, once I'd adjusted the multiplier to my liking.

    Take a look at this link: What is the fastest way to calculate the median? Note the comments and links posted by Dwain Camps. Interesting stuff.

    Itzik Ben-Gan covers median and mode in his SQL 2012 window functions book. There's a great section on inverse distribution functions too, with a lot of discussion of the PERCENTILE_ functions.
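
    For anyone stuck on pre-2012 versions, one common pattern for a median (a sketch against a hypothetical dbo.Measurements(Value) table, not taken from the book or the linked article) is to average the middle one or two values by row number:

    WITH ordered AS
    (
        SELECT Value,
               ROW_NUMBER() OVER (ORDER BY Value) AS rn,
               COUNT(*) OVER () AS cnt
        FROM dbo.Measurements
    )
    SELECT AVG(1.0 * Value) AS Median
    FROM ordered
    WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2);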

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I use stats and such intensively, though not so much in SQL Server. We're a manufacturing company and I've designed and built from the bottom up custom SPC (statistical process control) applications for all aspects -- raw materials to packaged product. We also have a sensory lab, and I designed and built the app for experimental design, flights, data collection and analysis. Some of this was done using math libraries imported into Python, but for others we just rolled our own. I originally used some heavy artillery type programs like SPSS and SAS (later JMP, which for our purposes was more than enough), but found that rolling my own was 1) easier to give the end user what they needed and 2) way more fun.

    We also use a data viz product where it is a bit easier to analyze data, but as with all the other projects it works best when done in close collaboration with the database. By that I mean being able to craft a custom, tight dataset makes the final analysis MUCH easier. I kind of grew these skills side by side, so I've come to appreciate both analysis and db skills even more. This board is mostly SQL pros, but if you can learn that, you can learn stats.

  • I was a math major in university but never really got an opportunity to use more than simple business math in programming for several years. However, I was working in manufacturing and we had an issue with the volume of work required to maintain the MRP system. Planners were working to answer planning messages rather than actually talking to vendors. We needed a better solution.

    I worked a couple hours a day developing the custom algorithms before I ran into issues and had to dig up my undergrad statistics book and my grad course probability book. I realized that I had re-invented the calculation for trend factor (such a common sense name that I had even named it the same).

    That got me to thinking that I was re-inventing the entire stone wheel. I started looking around for libraries and found SmartForecasts (this was almost 20 years ago) as a complete package and integrated it into my code. This not only gave me excellent forecasting capabilities, but also error analysis on how good my forecast was.

    We started using the new package and never looked back at the old MRP system.

    I can honestly say that working on the custom algorithms was some of the most intensive use of my capabilities, not only from a mathematical standpoint, but also design and analysis, as this code had to be highly optimized to run through the volume of data. However, it was also some of the most wasted time, as I realized that there were good uses of my time and bad -- there are times when I should pay for an expert in certain areas. That was a hard lesson to learn.

  • Another thought on this: Thinking, Fast and Slow, written by Daniel Kahneman (who received the Nobel prize for economics for more or less 'inventing' the field of behavioral economics), is a fascinating overview of how we think, and how we often think badly. It is tangential to this topic, but many of his experiments rely on very simple statistical principles that we assume we have an intuitive understanding of, yet consistently mis-apply, and for this reason it is a great intro into statistics. I can't recommend it highly enough, as it is highly readable and at the same time very dense with ideas and implications.

  • I used them before I understood them. The problem is that if your audience don't understand maths and/or don't trust your figures, they go with gut feel.

    A million to one shot comes off nine times out of ten!

  • Gary Varga (5/29/2015)


    Knowing that my more advanced maths was either lacking or nonexistent, I decided to do the edX.com Calculus 1A: Differentiation course. Unfortunately, the course is full. The syllabus suggests that I need it, and the test questions did too.

    Check Khan Academy; it has several courses in statistics, plus calculus and higher. It might keep you occupied until you can get a place on the edX course.

    Myself, I've never used SQL statistically for more than the basic functions. That may change once we accumulate a few years of data in my new system. I'm making a tracking system (of sorts) for blind/visually impaired students. I really wish we had zip code of birth; it'd be interesting to see if there are any identifiable clusters by eye condition. We know there are clusters near mines (lots of mining, past and present, in New Mexico), and I'd like to see if we could identify more. But that specifically is geobase stuff, an overlapping but different field.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Alan.B (5/29/2015)

    Take a look at this link: What is the fastest way to calculate the median? Note the comments and links posted by Dwain Camps. Interesting stuff.

    That's actually the same link I mentioned at the end of my post 🙂 and another reminder that just because it's a built-in function that went through Microsoft's QA doesn't mean it's necessarily the most efficient ("YMMV" caveats notwithstanding).

    I have actually been working through one of Ben-Gan's books; I don't have it to hand but believe it's the 2012 T-SQL querying book that complements the foundation of most of the current MS DB qualifications. I'm competent but stuck on 2008R2 so wanted a full run through of 2012's new querying capabilities and to check off anything that I hadn't come across before. I've been slogging through the XML chapters for completeness's sake (ugh) but haven't seen mention of the PERCENTILE_ functions yet so might pick up the functions book as I imagine it would be more in-depth.

    I also just noticed that Dwain put up a post specifically on percentiles a couple of weeks ago which I missed while I was away; it didn't come up through my Googling earlier in the week but it's on the SSC front page at the moment. Wish I had more of a head for maths!

  • I have been a DBA for many years. I liked the stats unit I did at uni but that was mainly SPSS. I linked SQL Server to R in 2012 as an exercise. I also implemented this Forecasting with SQL system once, to give forecast reports on database growth.
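
    Not the article's method, but as a minimal sketch of the idea: database growth can be estimated from the full-backup sizes already recorded in msdb (LAG needs 2012+, and the projection assumes roughly one full backup per day).

    SELECT database_name,
           AVG(growth)      AS AvgGrowthPerBackupBytes,
           AVG(growth) * 90 AS Rough90DayProjectionBytes   -- assumes daily full backups
    FROM
    (
        SELECT database_name,
               backup_size - LAG(backup_size) OVER (PARTITION BY database_name
                                                    ORDER BY backup_finish_date) AS growth
        FROM msdb.dbo.backupset
        WHERE type = 'D'   -- full backups only
    ) AS g
    WHERE growth IS NOT NULL
    GROUP BY database_name;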

  • DavidL (5/29/2015)


    Another thought on this: Thinking, Fast and Slow, written by Daniel Kahneman (who received the Nobel prize for economics for more or less 'inventing' the field of behavioral economics), is a fascinating overview of how we think, and how we often think badly. It is tangential to this topic, but many of his experiments rely on very simple statistical principles that we assume we have an intuitive understanding of, yet consistently mis-apply, and for this reason it is a great intro into statistics. I can't recommend it highly enough, as it is highly readable and at the same time very dense with ideas and implications.

    The book looks interesting. I'll give it a try.

