Combining R and SQL Server to diagnose performance bottlenecks

  • Hi Grant, thank you for your eagle eyes! Well spotted: mistakes in all of these cases, namely the missing RowCount, the missing closing quote in read.csv(), and the need to load the ggplot2 library.

    Many thanks!

    Nick

  • Hi Barnaby, thank you for putting it all together in one script. I think you have pretty much answered everyone's questions in one great post! 🙂 Do you think it is worth adding a line to explicitly close the ODBC connections?

    Cheers,

    Nick

  • nick.dale.burns (12/7/2015)


    Hi Barnaby, thank you for putting it all together in one script. I think you have pretty much answered everyone's questions in one great post! 🙂 Do you think it is worth adding a line to explicitly close the ODBC connections?

    Cheers,

    Nick

    No worries. It is my first time using R so wanted to create the script for myself. I haven't worked out how to close the connection yet 😀 but yeah, probably a good idea!

  • Your first time? Wow - you are flying! This should close your connection:

    odbcClose(dbhandle)
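    For reference, a minimal open-query-close round trip with RODBC looks something like this (the connection string and query below are just placeholders, not the script from the article):

    library(RODBC)

    # hypothetical connection string; substitute your own server and database
    dbhandle <- odbcDriverConnect("driver={SQL Server};server=MYSERVER;database=MyDatabase;trusted_connection=true")

    # run a query and pull the results into a data frame
    results <- sqlQuery(dbhandle, "select name from sys.tables")

    # explicitly release the connection once you are finished with it
    odbcClose(dbhandle)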

    Regards,

    Nick

  • Great post. R is an outstanding language and tool. I changed the query as shown below:

    NOTE: see the line comment about filtering out hidden system tables; they may occasionally provide some useful information.

    select
    db_name() as database_name
    , object_name(ops.object_id) as [Table Name]
    , idx.name as [Index name]
    , idx.type_desc as description
    , sysidx.rowcnt as row_count
    , sum(ops.range_scan_count) as [Range Scans]
    , sum(ops.singleton_lookup_count) as [Singleton Lookups]
    , sum(ops.row_lock_count) as [Row Locks]
    , sum(ops.row_lock_wait_in_ms) as [Row Lock Waits (ms)]
    , sum(ops.page_lock_count) as [Page Locks]
    , sum(ops.page_lock_wait_in_ms) as [Page Lock Waits (ms)]
    , sum(ops.page_io_latch_wait_in_ms) as [Page IO Latch Wait (ms)]
    from
    sys.dm_db_index_operational_stats(null, null, null, null) as ops
    inner join sys.indexes as idx
    on idx.object_id = ops.object_id
    and idx.index_id = ops.index_id
    inner join sys.sysindexes as sysidx
    on idx.object_id = sysidx.id
    where
    ops.object_id > 100
    and
    sysidx.indid not in (1, 2) -- get rid of hidden system tables
    group by ops.object_id, idx.name, idx.type_desc, sysidx.rowcnt
    order by row_count desc

    Also, my query is not returning decimal numbers < 1 like the ones in your first chart. Am I doing something wrong?


    Bill Bergen

  • Hi Bill, nice changes, adding in additional identifying data such as the database name and index name. The exact values you get will obviously vary from one database to another, but I also suspect that I might have normalised this data (scaled it so that all the measures were directly comparable). I would have to double-check that.

    But ultimately, this is an attempt to guide further investigation towards critical bottlenecks and exclude non-critical tables. It is perhaps best described as an exploratory method and certainly not fixed in stone.
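    If it is useful, the kind of scaling I have in mind is roughly the sketch below; it assumes the query results are already in a data frame called 'data' with the table names in the first column (both of those are assumptions, not the exact script from the article):

    # set the identifying column aside and keep only the numeric measures
    measures <- data[, -1]

    # scale() centres each measure and divides by its standard deviation,
    # so range scans, lookups and lock waits become directly comparable
    scaled <- scale(measures)

    # the scaled matrix is then what feeds into PCA and KMeans
    pca <- prcomp(scaled)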

    Regards,

    Nick

  • We've just started charting performance data with ggplot in R using the RODBC library to pull the data directly into RStudio from SQL Server. Of course you can chart in Excel but it's soooo much more laborious.

    Plotting time-series of performance data is proving useful, but finding problems using machine learning techniques is a fantastic idea. Thank you!

  • Hi Matthew - thank you for your response, glad you like the article. Love your idea of pulling out time-series metrics for performance! What a great opportunity to use R for real-time anomaly detection!!!

    Cheers,

    Nick

  • matthew.x.greer (12/7/2015)


    We've just started charting performance data with ggplot in R using the RODBC library to pull the data directly into RStudio from SQL Server. Of course you can chart in Excel but it's soooo much more laborious.

    Plotting time-series of performance data is proving useful, but finding problems using machine learning techniques is a fantastic idea. Thank you!

    If you end up having any examples of your machine learning techniques, I'm sure they would be welcome by the community. That's really what monitoring software should be doing - learning from baselines and real-world scenarios.

  • If you don't have the ggplot2 library installed, you can install it in RStudio like this:

    > install.packages("ggplot2") (source: http://ggplot2.org/)

    I believe the intended code execution for

    # load GGPLOT library(ggplot2) plot.data <- data.frame(pca$x[, 1:2]) g <- ggplot(plot.data, aes(x=PC1, y=PC2)) + geom_point(colour=alpha("steelblue", 0.5), size=3) + geom_text(label=1:102, colour="darkgrey", hjust=1.5) + theme_bw() print(g)

    is

    > # load GGPLOT
    > library(ggplot2)
    > plot.data <- data.frame(pca$x[, 1:2])
    > g <- ggplot(plot.data, aes(x=PC1, y=PC2)) + geom_point(colour=alpha("steelblue", 0.5), size=3) + geom_text(label=1:102, colour="darkgrey", hjust=1.5) + theme_bw()
    > print(g)

    If you get this error:

    Error in do.call("layer", list(mapping = mapping, data = data, stat = stat, : could not find function "alpha"

    then load the "scales" library: library(scales)

    Once I did all that, I got this error:

    Error: Incompatible lengths for set aesthetics: colour, hjust, label

    which is where I find myself stuck

    :pinch:

    EDIT:

    Just saw the post made by barnaby.self. That works!

    :hehe:

  • danechambrs (12/7/2015)

    Once I did all that, I got this error:

    Error: Incompatible lengths for set aesthetics: colour, hjust, label

    which is where I find myself stuck

    I hit those errors with the original code as well. I just removed the formatting and it worked out. I'd love to know how to fix it and keep the formatting.

    So all this is great, as I said, but now my question is, how the heck do I figure out which algorithm to use? I've got some ideas for how to put this to work in some different places, but I'm a little stuck.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • That is the hardest part for me as well: figuring out which models to use. I can't really answer that question beyond some old-fashioned trial and error. However, I did take a class on some basic data modeling which helped explain some of these concepts to me. Slides can be found here: http://www.cogsci.ucsd.edu/~desa/109/

    Some slides even contain pictures! Note that any code you find on this site is probably in Matlab.

    I hope this helps guide the googling. Aside from that, there are two books I'd recommend:

    Pattern Recognition and Machine Learning http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738

    This book is very math heavy (read: collegiate knowledge of linear algebra and statistics is required) but has pretty much all of the "technical" information about these topics.

    Data Mining: Practical Machine Learning Tools and Techniques http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0123748569

    This book takes a more practical approach and encourages you to try out different algorithms to see which one fits your needs.

    You can also watch Andrew Ng's videos on Machine Learning here: https://youtu.be/UzxYlbK2c7E?list=PLJ_CMbwA6bT-n1W0mgOlYwccZ-j6gBXqE

    I was once told by a graduate student at my university that "if you want to know about machine learning, just watch these videos and follow along in Bishop's book. No other classes or lectures really compare to Andrew's video series". This was his opinion, but one that I personally do respect.

    Good luck!

  • Hi Grant,

    So glad to hear there are lots of other people out there as excited about the marriage of R with SQL Server as I am! The following will plot table names alongside the points:

    ## Assuming that the table names are stored in a variable 'tables':
    g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
      geom_point() +
      geom_text(label=tables, colour="darkgrey", hjust=1.5) +
      theme_bw()
    print(g)

    There are some tricks around GGPLOT naming and scoping (for example using 'color' vs. 'colour' and whether to wrap things in aes()), but the above will hopefully work fine. I wasn't sure what you meant about "which algorithm to use"? Did you mean specifically whether to use PCA or KMeans? Or more generally, other methods? I have very roughly described both PCA and KMeans below (as they were used here anyway).

    Cheers,

    Nick

    Principal Components Analysis (PCA): rotates high-dimensional data onto lower dimensions while preserving the maximum possible variation in the data. It is often used to visualise high-dimensional data in two dimensions, which is really why we used it here.

    KMeans: groups "similar" observations together. For KMeans, "similar" simply means "objects that are near to one another".
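    Purely as a rough illustration of how those two steps fit together in R (the variable names below are placeholders, not the exact code from the article):

    # 'measures' is assumed to be a numeric matrix of per-table statistics
    pca <- prcomp(measures, scale. = TRUE)   # PCA: rotate onto the principal components
    plot.data <- data.frame(pca$x[, 1:2])    # keep the first two components for plotting

    # KMeans: group tables with similar behaviour; three clusters is an arbitrary choice
    km <- kmeans(measures, centers = 3)
    plot.data$clusters <- factor(km$cluster)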

  • Hi All, re. "incompatible lengths for label, colour and hjust":

    I just had a realisation; in the original post I had simply removed the table names for privacy. But, the original data had 102 tables - and I replaced their names with an index from 1 to 102. The length of the labels (1, 2, ..., 102) should be the same length as the number of tables you have in your data. If you use the table names themselves, this is easiest (and most informative!):

    So... ORIGINAL:

    g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
      geom_point(size=3) +
      geom_text(label=1:102, colour="darkgrey", hjust=1.5) +
      theme_bw()
    print(g)

    Becomes:

    ## where 'tables' is a list with your table names
    g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
      geom_point(size=3) +
      geom_text(label=tables, colour="darkgrey", hjust=1.5) +
      theme_bw()
    print(g)
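    If your table names came back as a column of the query results, something like this will give you a suitable 'tables' value (assuming the results are in a data frame called 'data' and the column is called [Table Name], as in Bill's query above):

    ## pull the table names straight out of the query results
    tables <- as.character(data[["Table Name"]])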

    Hope that is really clear! 🙂

    Nick

  • Hi Nick,

    Yeah, it's making that call on what calculation against the data is going to result in... anything.

    For example, if you have two columns of data, KMeans would let you get an idea of how much that data groups; basically, how strongly the two columns are related. For example, Salesman and Territory have a high correlation, but Customer and SaleValue have a low correlation (most of the time, although that example could open a can of worms).

    Anyway, it's understanding how to apply the right algorithm that seems like the hardest part. Most of the rest seems to be reading documentation on how to apply methods.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
