SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Word analysis with SQL


Word analysis with SQL

Author
Message
Sam Vanga
Sam Vanga
SSC-Enthusiastic
SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)

Group: General Forum Members
Points: 148 Visits: 502
I did a blog post earlier called PASS Summit Twitter Dashboard. It shows all tweets by people at the PASS summit.

I now want to expand that by measuring the sentiment score. Logic to measure the score is below:

there are a list of pre defined good and bad words with a score of 1 and -1 respectively.
for each tweet, remove punctuation marks from text.
compare words from each tweet with the predefined words list.
get score based on matched words.
sentiment score is the sum of scores from the above match.

The logic may be inaccurate, but from what i learnt this model is currently being researched and is acceptable.

Here is the sample DDL.

Predefined words and score:

CREATE TABLE #Words
(
Id int identity(1,1) primary key
, Word char(10)
, Score int) ;

INSERT #Words
(Word, Score)
VALUES ('Good',1)
, ('Awesome', 1)
, ('Super', 1)
, ('Bad', -1)
, ('Fail', -1)
, ('Dirty', -1) ;



Tweets:

CREATE TABLE #Text
(Id int identity(1,1) primary key
, [Text] varchar(140))

INSERT #Text
([Text])
VALUES
('New Bond movie is #awesome!')
, ('I hear dirty reviews. Product X is a fail. #fail')
, ('I am neutral!!!')



Result:

CREATE TABLE #Result
([Text] varchar(140), Score int)

INSERT #Result
VALUES
('New Bond movie is #awesome!',1)
, ('I hear dirty reviews. Product X is a fail. #fail',3)
, ('I am neutral!!!',0)

SELECT *
FROM #Result



For example, score for 'New Bond movie is #awesome!' is 1 because after removing punctuation mark (!) word awesome matches with a word in the Words table and score is 1.
Score for 'I hear dirty reviews. Product X is a fail. #fail' = 3 because of the words dirty, fail, and fail (after removing #).

Query should be able to perform with a huge data set, approximately 100K rows.

Many thanks for your time and input.

Sam.

Sam Vanga
http://SamuelVanga.com
mister.magoo
mister.magoo
SSCarpal Tunnel
SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)SSCarpal Tunnel (4.1K reputation)

Group: General Forum Members
Points: 4112 Visits: 7865
you might want to take a look at the proc dm_fts_parser

It might just provide the word analysis you are looking for...

Link : http://msdn.microsoft.com/en-us/library/cc280463(v=sql.100).aspx

MM


select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);




  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

  • dwain.c
    dwain.c
    SSCertifiable
    SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)

    Group: General Forum Members
    Points: 7247 Visits: 6431
    If Mr. Magoo's link doesn't give you any ideas, you might want to take a look at the 4th article in my signature links to retrieve the FUNCTION PatternSplitCM and then try it this way.


    SELECT a.ID, a.[Text], TotalScore=ISNULL(SUM(c.Score), 0)
    FROM #Text a
    CROSS APPLY PatternSplitCM([Text], '[A-Za-z]') b
    LEFT JOIN #Words c ON b.Item = c.Word
    GROUP BY a.ID, a.[Text]




    If you're working in a case sensitive database, you'll want to UPPER the Item and Word in the ON clause of the LEFT JOIN.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.
    Splitting strings based on patterns can be fast!
    My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
    Sam Vanga
    Sam Vanga
    SSC-Enthusiastic
    SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)SSC-Enthusiastic (148 reputation)

    Group: General Forum Members
    Points: 148 Visits: 502
    Dwain - Thank you very much. It works like a charm!

    Sam Vanga
    http://SamuelVanga.com
    sqlbi.vvamsi
    sqlbi.vvamsi
    SSC-Enthusiastic
    SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)SSC-Enthusiastic (102 reputation)

    Group: General Forum Members
    Points: 102 Visits: 412
    @sam, can you test this one too, just curious, not sure if this helps performance wise..


    ;with cte1(part,leftover,textid) as
    (
    select left([text],CHARINDEX(' ',[text])),RIGHT([text],len([text])-CHARINDEX(' ',[text])), id from #Text
    union all
    select left(leftover,case when CHARINDEX(' ',leftover)=0 then LEN(leftover) else CHARINDEX(' ',leftover) end ),RIGHT(leftover,len(leftover)-case when CHARINDEX(' ',leftover)=0 then len(leftover) else CHARINDEX(' ',leftover) end ),textid from cte1 where leftover!=''
    ),
    cte2(trimpart,id) as
    (
    select cast(part as varchar(140)),textid from cte1
    union all
    select cast(replace(trimpart,substring(trimpart,patindex('%[^a-z]%',rtrim(ltrim(trimpart))),1),'') as varchar(140)),id from cte2 where patindex('%[^a-z]%',rtrim(ltrim(trimpart)))!=0
    )
    select t.id,t.[text],isnull(recurrence,0) from(select id,COUNT(id) as recurrence from(select id,trimpart from cte2 where patindex('%[^a-z]%',rtrim(ltrim(trimpart)))=0 and trimpart in (select [word] from #Words))temp
    group by id)temp1
    right join
    #Text t
    on t.id=temp1.id
    dwain.c
    dwain.c
    SSCertifiable
    SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)

    Group: General Forum Members
    Points: 7247 Visits: 6431
    sqlbi.vvamsi (1/23/2013)
    @sam, can you test this one too, just curious, not sure if this helps performance wise..


    It is a good suggestion to do a performance test. I like general tools like PatternSplitCM for the ease with which they allow you to solve a problem quickly. Most of the time a focused, working solution will outperform a general tool. But the general tool may be a good, bootstrap approach to get your code working.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.
    Splitting strings based on patterns can be fast!
    My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
    Go


    Permissions

    You can't post new topics.
    You can't post topic replies.
    You can't post new polls.
    You can't post replies to polls.
    You can't edit your own topics.
    You can't delete your own topics.
    You can't edit other topics.
    You can't delete other topics.
    You can't edit your own posts.
    You can't edit other posts.
    You can't delete your own posts.
    You can't delete other posts.
    You can't post events.
    You can't edit your own events.
    You can't edit other events.
    You can't delete your own events.
    You can't delete other events.
    You can't send private messages.
    You can't send emails.
    You can read topics.
    You can't vote in polls.
    You can't upload attachments.
    You can download attachments.
    You can't post HTML code.
    You can't edit HTML code.
    You can't post IFCode.
    You can't post JavaScript.
    You can post emoticons.
    You can't post or upload images.

    Select a forum

































































































































































    SQLServerCentral


    Search