Similar strings.
Posted Tuesday, April 23, 2013 5:33 AM
Sometime in the past week I came across a string similarity function.
(I think I saw it in my 'spare' time, when I am not focused on my work but am still reading about SQL Server to keep up, so I did not note the link, the book or whatever.)

It returned a value between 0 and 1 based on the longest string shared between the two strings. The value depended on the length of the string(s) and on how similar they were. A threshold was set to get a ?? good starting point ??

One example where it was used was comparing chemical formulas, to find the same formula written in different ways.

Does anybody recognise where I could have seen this?
Are there other (or the same) techniques to find similar strings?

We want to use this technique to find questions that are similar but formulated differently. It does not have to be perfect; it is only meant as an aid. (I know the SOUNDEX and DIFFERENCE functions.)

Thanks in advance,
ben brugman
Post #1445366
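For anyone trying to picture the measure described in the post above, here is a minimal Python sketch of one way it could work: take the longest contiguous substring the two strings share and scale its length to 0..1. The original function was never identified, so dividing by the longer string and the 0.6 threshold are assumptions made purely for illustration, and all function names are made up.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of a and at b[j - 1].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def substring_similarity(a: str, b: str) -> float:
    """Scale the shared length to 0..1 by the longer input (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return longest_common_substring_length(a, b) / longest


# Hypothetical cut-off; a real value would have to be tuned on actual data.
THRESHOLD = 0.6


def is_similar(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    return substring_similarity(a, b) >= threshold
```

The nested loop does work proportional to the product of the two string lengths, so comparing every question against every other question this way will get expensive quickly.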
Posted Wednesday, April 24, 2013 10:03 AM


There are plenty of user-defined functions posted on the web that will parse a string into a result set of keywords, compare two strings, etc. They may provide the functionality you're looking for, but understand that parsing keywords from varchar or text columns tends to perform very poorly, because such functions typically rely on looping constructs and their results cannot be indexed.

However, perhaps you were reading an article on SQL Server 2012's new Semantic Search service.

... provides deep insight into unstructured documents stored in SQL Server databases by extracting and indexing statistically relevant key phrases. Then it also uses these key phrases to identify and index documents that are similar or related ...

http://msdn.microsoft.com/en-us/library/gg492075.aspx
https://www.simple-talk.com/sql/database-administration/exploring-semantic-search-key-term-relevance/

If you don't have SQL Server 2012, then look into the Full-Text Search service, which is robust and has been available in many previous releases of SQL Server.
http://msdn.microsoft.com/en-us/library/ms142571%28v=sql.105%29.aspx
Post #1446093
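As a rough illustration of the keyword-parsing approach mentioned in the post above, the Python sketch below splits two questions into keyword sets and scores their overlap with the Jaccard index (shared keywords divided by all distinct keywords). This is a swapped-in technique chosen for illustration, not what Semantic Search or Full-Text Search do internally, and the function names and tokenising rule are assumptions.

```python
import re


def keywords(text: str) -> set:
    """Lower-case the text and return its distinct runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Shared keywords divided by all distinct keywords, in the range 0..1."""
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 1.0  # nothing to compare, so treat the inputs as identical
    return len(ka & kb) / len(ka | kb)
```

Because the keywords are compared as sets, word order is ignored, which suits questions that are "similar but formulated differently". Common words such as "how" or "the" still count as keywords here; a stop-word list would be a natural refinement.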
Posted Wednesday, April 24, 2013 11:41 AM


You could also be thinking about the SOUNDEX function.



Jack Corbett

Applications Developer

Don't let the good be the enemy of the best. -- Paul Fleming

Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
How to Post Performance Problems
Crosstabs and Pivots or How to turn rows into columns Part 1
Crosstabs and Pivots or How to turn rows into columns Part 2
Post #1446143
Posted Wednesday, April 24, 2013 4:17 PM


Jack Corbett (4/24/2013)
You could also be thinking about the SOUNDEX function.


Agreed, but that's a terrible function. It stops scanning at the first non-alpha character, to name just one serious disadvantage.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1446253
Posted Wednesday, April 24, 2013 4:20 PM




Perhaps you brushed against the following?
http://en.wikipedia.org/wiki/Levenshtein_distance



Post #1446254
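For reference, the Levenshtein distance linked above counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. Below is a minimal Python sketch of the usual dynamic-programming calculation. The second function, which rescales the distance to a 0..1 similarity by dividing by the longer length, is an assumption added to match the original question and is not part of the distance definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Rescale the distance to 0..1, where 1.0 means identical (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.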
Posted Wednesday, April 24, 2013 4:24 PM


Jeff Moden (4/24/2013)
Jack Corbett (4/24/2013)
You could also be thinking about the SOUNDEX function.


Agreed but that's a terrible function. It stops scanning at the first non-alpha character to name just one serious disadvantage.


I didn't say it was any good, I just said it was out there.

I've tried it and discarded it as not having any use.




Post #1446255
Posted Thursday, April 25, 2013 2:47 AM
Jeff Moden (4/24/2013)
Perhaps you brushed against the following?
http://en.wikipedia.org/wiki/Levenshtein_distance



Thanks for the link (losing an 'h'):
http://en.wikipedia.org/wiki/Levenstein_distance

This was not what I saw, but 'solves' the same problem. So maybe this is even better than what I saw. (Have to try it out).

As for the SOUNDEX and DIFFERENCE functions, which I mentioned in my first message, they are not useful for my situation.

Thanks for the link,
ben brugman
Post #1446357