Where can I download a Thesaurus file?

  • So we're looking at implementing FTS for a few term type lookup things in one of ours apps.

    Of course I remembered that the Thesaurus that comes with SQL Server is essentially empty.

    Is there a Thesaurus that can be downloaded for US English so I won't have to sit there with Roget's doing data entry for a few weeks?



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • Google this: Thesaurus Database Download and it will give you this: http://wordnet.princeton.edu/

    Jared
    CE - Microsoft

  • And when you add SQL Server Full Text onto that search you get an entirely useless set of links πŸ™‚

    Thanks... I take it the file is in SQL Server ready to use format?



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • I have no idea, I just googled it and found it. I guess you could have done that for yourself :hehe: Sorry I'm not much help.

    Jared
    CE - Microsoft

  • here you go Mark;

    dunno if this will help, i took a thesaurus file from the Gutenberg project, and bulk inserted it into a table, then built the table of synonyms.

    it's probably a bit too much: 30K words with 2.5 million synonyms.but it was one of those proof of concept items i slap together now and then:

    the file is here:

    --Edit: replaced 24 meg txt with zip:

    mthesaur.zip

    you'll have to change the path on the bulk insert to match your working folder, but the rest is paste and GO:

    IF OBJECT_ID('[dbo].[Thesaurus]') IS NOT NULL

    DROP TABLE [dbo].[Thesaurus]

    GO

    CREATE FUNCTION dbo.DelimitedSplit8K

    --===== Define I/O parameters

    (

    @pString VARCHAR(7999),

    @pDelimiter CHAR(1)

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table” produces values up to

    -- 10,000... enough to cover VARCHAR(8000)

    WITH

    E1(N) AS ( --=== Create Ten 1's

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 --10

    ),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000

    cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)

    --===== Do the split

    SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,

    SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item

    FROM cteTally

    WHERE N < LEN(@pString) + 2

    AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter

    ;

    GO

    GO

    create table [Thesaurus](

    definition varchar(max) )

    BULK INSERT [Thesaurus] FROM 'C:\_PA\Workspace\Workspace_Dictionary\Thesaurus\mthesaur.txt'

    WITH (

    DATAFILETYPE = 'char',

    FIELDTERMINATOR = ',',

    ROWTERMINATOR = '\n',

    FIRSTROW = 1

    )

    --pulls out the keyword for the thesaurus

    select LEFT([definition],CHARINDEX(',',[definition])-1)AsTheKeyWord,[definition] As TheDefinition from [Thesaurus]

    --gets the split for a given string

    select * from dbo.DelimitedSplit8K('A-bomb,H-bomb,atomic bomb,atomic warhead,clean bomb',',')

    --this is erroring out, not getting what i was expecting

    select top 500 myAlias.* ,[Thesaurus].*

    from (select row_number() over( order by definition) AS RW,definition from [Thesaurus])[Thesaurus]

    cross apply dbo.DelimitedSplit8K(Thesaurus.definition,',') myAlias

    where RW = 1

    --where idx=1

    --where myAlias.item = LEFT([definition],CHARINDEX(',',[definition])-1)

    CREATE TABLE TheThesaurus(

    ThesaurusID int identity(1,1) NOT NULL PRIMARY KEY,

    Word varchar(100),

    TheDefinition varchar(max) )

    CREATE TABLE TheSynonyms(

    SynonymsID int identity(1,1) NOT NULL PRIMARY KEY,

    ThesaurusID int REFERENCES TheThesaurus(ThesaurusID),

    Word varchar(100),

    TheDefinition varchar(max) )

    insert into TheThesaurus(Word,TheDefinition)

    select LEFT([definition],CHARINDEX(',',[definition])-1)As TheKeyWord,[definition] As TheDefinition from [Thesaurus]

    INSERT INTO TheSynonyms(ThesaurusID,Word,TheDefinition)

    select ThesaurusID,Word,Item

    from TheThesaurus

    cross apply dbo.DelimitedSplit8K(TheDefinition,',') myAlias

    where Word <> Item

    and rtrim(Item) <> ''

    select top 50000 * from TheSynonyms

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell, with that size thesaurus, did you see a significant performance hit on FTS queries? Or did you test it that far?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • didn't go that far as to play with Full Text.;

    it was purely from a Text-Dictionary to SQL Server Table project I fiddled with a few years ago; Full text wasn't part of the project at the time,w hich is what makes me wonder if it will help Mark, but it was omething I at least had handy in my scripts.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks.

    I'm about to implement a fairly large first-names thesaurus, and was curious if your experience covered performance hits. I'll be testing it, of course, but data from others with prior experience is always good.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • It's pretty cool.. but it only seems to load one row from the file.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • I think thats due to the ROWTERMINATOR parameter not being supplied to the BULK INSERT procedure. (probably because WYSIWYG not displaying the character \ n)

    Try something like:

    BULK INSERT [Thesaurus] FROM 'C:\mthesaur.txt'

    WITH (

    FIELDTERMINATOR = ',',

    ROWTERMINATOR = '\ n'

    )

    You'll need to remove the space between the '\' and the 'n' in the ROWTERMINATOR.

    Good post guys, thanks for lending a hand!

    -----------------
    ... Then again, I could be totally wrong! Check the answer.
    Check out posting guidelines here for faster more precise answers[/url].

    I believe in Codd
    ... and Thinknook is my Chamber of Understanding

  • Actually it was the commented section in the query that says "this errors out and isn't what I was expecting"

    I originally just changed the path to the file and ran it... and it errored out.

    Running it step by step worked... Now to see if TPTB really want to impliment a full english language thesaurus, like they originally stated... probably not πŸ™‚

    Which of course moves back to figuring out terms and determining synonymns within the business... fun fun... getting management's views regarding this sort of stuff is generally like pulling teeth.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • Hi,

    I tried to download the mthesaurs at the link provided, but got error file not found or unavailable. Is it possible you could provide a link to me?

    lyleholc@gmail.com

    Thank you so much!!

    Lyle

  • lyleholc (8/17/2015)


    Hi,

    I tried to download the mthesaurs at the link provided, but got error file not found or unavailable. Is it possible you could provide a link to me?

    lyleholc@gmail.com

    Thank you so much!!

    Lyle

    I updated my old post with a corrected path, but repalced the old txt file with a zip, and also retested my example code, which still works perfectly, although a little slow for my taste.

    for a one time ETL, it's fine.

    http://www.stormrage.com/SQLStuff/mthesaur.zip

    ;

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • The file seems to come from a 1911 Roget's Thesaurus compiled by the Project Gutenberg team. Here's one place I found the unzipped version of that file.

    https://github.com/statico/dotfiles/blob/master/.vim/mthes10

    It is not in tsenu.xml format, but it seems straightforward to convert it. The key gotcha will be unicode byte ordering as mentioned in https://docs.microsoft.com/en-us/sql/relational-databases/search/configure-and-manage-thesaurus-files-for-full-text-search?view=sql-server-2017

  • Hi,

    I tried to download the thesaurus, but the latest link http://www.stormrage.com/SQLStuff/mthesaur.zip leads to a 404 error.

    Is it possible to update it or send the thesaurus to me privately via email?

    Thank you!

     

     

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply