• I think this is one of the areas where our profession will grow more and more across the next decade. As we deal with lots of data of varying types, and our organizations look to gain some strategic advantage through deeper insight into their information, we will have lots of chances to experiment and learn more about complex data analysis.

    First, I wanted to reply not so much because I have much to contribute but because I'm extremely interested in this thread - I can't wait to see where this goes. Second, if you have not read the SSC article Big Data for SQL Folks, do it. (I'm guessing you have; it's more of a plug.)

    I loved stats in college, but my degree was in business and I have not really used it since I graduated in 2002. My company (Slalom) is putting a lot of pressure on its BI consultants (of which I am one) to learn more about Big Data, Data Science, Hadoop, AWS, NoSQL, etc. One of my professional goals for the year was to get up to speed on the AWS family of big data/cloud/DB-as-a-Service products. I just wrapped up a project with Google and can say that they are going after AWS. I'm currently reading a Hadoop book and trying to wrap my head around Pig, Hive and HiveQL. I'm also reading Statistics for Dummies (I got it for like $5 at Half Price Books in Chicago).

    I have nothing much to report on the advanced functions front except to say that I dig text mining and analysis. I spend 10+ hours a week screwing around with that. Check out the N-Gram Wikipedia page - I think it's absolutely fascinating stuff. N-grams are commonly used in protein sequencing and DNA sequencing. SQL Server 2008 R2+ includes an nGrams function in its collection of CLRs that are used by MDS and DQS. I have been jerking around with n-grams for a year+ now (note the code at the end of my comment).

    I was having a great discussion today with one of our Data Scientist/Predictive Analytics people, and I hear that creating useful R libraries is the big thing in the R/Data Science community.

    IF OBJECT_ID('dbo.nGrams8K') IS NOT NULL DROP FUNCTION dbo.nGrams8K;
    GO
    CREATE FUNCTION dbo.nGrams8K (@string varchar(8000), @k int)
    /********************************************************************
    Created by: Alan Burstein
    Created on: 3/10/2014
    Last Updated on: 5/22/2015

    n-gram defined:
    In the fields of computational linguistics and probability,
    an n-gram is a contiguous sequence of n items from a given
    sequence of text or speech. The items can be phonemes, syllables,
    letters, words or base pairs according to the application.

    For more information see: http://en.wikipedia.org/wiki/N-gram

    Use:
    Outputs a stream of tokens based on an input string.
    Similar to mdq.nGrams:
    http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
    Except it only returns tokens that are exactly @k characters long.
    nGrams8K also includes the position of the "Gram" in the string.
    ********************************************************************/
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    WITH
    E1(N) AS (SELECT 1 FROM (VALUES (null),(null),(null),(null),(null)) x(n)), -- 5 rows
    E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),             -- 125 rows
    iTally(N) AS
    (
      -- One row per token start position; E3 x E3 supplies up to 15,625 rows,
      -- which is enough for varchar(8000).
      -- The CASE returns 0 rows instead of raising a TOP error when @k < 1,
      -- @k > LEN(@string), or either input is NULL.
      SELECT TOP (CASE WHEN @k BETWEEN 1 AND LEN(@string) THEN LEN(@string)-(@k-1) ELSE 0 END)
             CHECKSUM(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
      FROM E3 a CROSS JOIN E3 b
    )
    SELECT
      position = N,
      token = SUBSTRING(@string,N,@k)
    FROM iTally;
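For anyone who wants to try it, here is a minimal usage sketch. It assumes the function above has already been created as dbo.nGrams8K in the current database; the input strings and the derived table named s are made-up sample values for illustration only.

```sql
-- Assumes dbo.nGrams8K (defined above) already exists in the current database.
-- All input strings below are made-up sample values.

-- 1. Split a string into bigrams (k = 2), with each token's start position.
SELECT position, token
FROM dbo.nGrams8K('abcd', 2)
ORDER BY position;
-- position  token
-- 1         ab
-- 2         bc
-- 3         cd

-- 2. Count how often each bigram occurs in a string.
SELECT token, occurrences = COUNT(*)
FROM dbo.nGrams8K('banana', 2)
GROUP BY token
ORDER BY occurrences DESC, token;
-- token  occurrences
-- an     2
-- na     2
-- ba     1

-- 3. Tokenize every row of a set with CROSS APPLY
--    (s is a hypothetical inline sample set, not a real table).
SELECT s.txt, ng.position, ng.token
FROM (VALUES ('cat'), ('dog')) s(txt)
CROSS APPLY dbo.nGrams8K(s.txt, 2) ng
ORDER BY s.txt, ng.position;
```

Because it is an inline table-valued function, it can be applied to a whole column with CROSS APPLY rather than being called row by row in a loop.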

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001