Term Extraction Tokens

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 715912

    Comments posted to this topic are about the item Term Extraction Tokens

  • HappyGeek

    SSCoach

    Points: 18666

    Very good question for the end of the week. I had to do a bit of research on this, so I learnt something new. Thank you.

    ...

  • Iulian -207023

    SSCertifiable

    Points: 7507

    Nice topic. Thank you for the question.

  • paul s-306273

    SSChampion

    Points: 10589

    Never seen it before - no points to end the week!

  • paul s-306273

    SSChampion

    Points: 10589

    paul s-306273 (4/22/2016)


    Never seen it before - no points to end the week!

    Having now read the tech. article, the answer is obvious.

  • mjagadeeswari

    SSC-Addicted

    Points: 454

    Nice question. Thanks, Steve. Good timing, I'm going through some of the SSIS topics on transformations this week.

  • Ed Wagner

    SSC Guru

    Points: 286958

    I'd never even heard of it before. Not being an SSIS user, I'm not surprised I missed it.

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    I am confused.

    I have never used this transformation and I cannot test it, so I hope someone else will provide the definitive answer. However, I expected the answer to be two, based on reading exactly the article that is mentioned in the explanation.

    The article describes the phases of the process. Splitting on word boundaries is the first, followed by tagging. In that phase "The" and "is" are discarded; "date" is tagged as a noun; "January" is tagged as a proper noun; and "4" and "2015" are tagged as numbers.

    But these four words are not all returned. What is returned is determined in a later phase, depending on the configuration of the component: nouns only, noun phrases only, or both.

    The word "date" clearly qualifies as a noun. "January", a proper noun, does the same. I am not sure if "4" and "2015" are added to make this a noun phrase or not, as all the examples focus on adjectives. So the second returned term might be either the noun "January" or the noun phrase "January 4, 2015".

    Perhaps I am misreading something. Perhaps numbers are also considered nouns, in which case the article is misleading. I do not have an SSIS installation at hand so I cannot test it, but I hope someone else will, and then share the results here.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • FridayNightGiant

    SSCertifiable

    Points: 7091

    I was confused as well, so I tested it quickly: only Date and January are returned (see attachment).

    Also, with the default settings nothing is returned. The frequency threshold needs to be changed from 2 to 1.

  • Scott Arendt

    SSCertifiable

    Points: 7671

    I had not seen this transformation before, but I happen to be taking a course with an element of natural language processing right now. There are actually 6 tokens here, and I would have gotten it wrong if 6 had been a choice. So I had to think through the options to get a different answer. I correctly guessed that the stopwords ("The" and "is") were ignored and correctly guessed 4.

    Good timing for me personally with this question.

  • akljfhnlaflkj

    SSC Guru

    Points: 76202

    Thanks for the question.

  • Koen Verbeeck

    SSC Guru

    Points: 258942

    Interesting question, thanks.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Joshua M Perry

    SSCrazy

    Points: 2655

    I have been using this one for a while, and only date and January will be returned. The numbers will be tokenized and discarded.
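
The behaviour described in this thread (split the text into tokens, discard stopwords, tag what is left, return only the nouns that meet the frequency threshold) can be illustrated with a toy sketch. This is not how SSIS implements Term Extraction: the word lists, the function names and the `frequency_threshold` parameter are assumptions invented for the illustration, and the sample sentence is reconstructed from the words quoted in the posts rather than taken from the original question.

```python
import re
from collections import Counter

# Assumption: tiny hand-written lookup tables stand in for a real
# part-of-speech tagger. They only cover the words quoted in the thread.
STOPWORDS = {"the", "is"}
NOUNS = {"date"}
PROPER_NOUNS = {"january"}


def tokenize(text):
    """Split on word boundaries, keeping runs of letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|\d+", text)


def tag(token):
    """Return a coarse tag for a single token (hypothetical tag names)."""
    lower = token.lower()
    if lower in STOPWORDS:
        return "stopword"
    if token.isdigit():
        return "number"
    if lower in PROPER_NOUNS:
        return "proper_noun"
    if lower in NOUNS:
        return "noun"
    return "other"


def extract_terms(text, frequency_threshold=2):
    """Return nouns and proper nouns seen at least `frequency_threshold` times."""
    tagged = [(token, tag(token)) for token in tokenize(text)]
    # Stopwords and numbers are tokenized and tagged, but never returned.
    nouns = [t.lower() for t, kind in tagged if kind in ("noun", "proper_noun")]
    counts = Counter(nouns)
    return sorted(term for term, n in counts.items() if n >= frequency_threshold)


sentence = "The date is January 4, 2015"  # assumed sample sentence
print(len(tokenize(sentence)))            # number of raw tokens
print(extract_terms(sentence, 1))         # threshold lowered to 1
print(extract_terms(sentence))            # threshold left at 2
```

With the threshold left at 2 the sketch returns nothing for a single sentence, which mirrors the observation above that the frequency threshold had to be lowered to 1 before any terms came back.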
