Term Extraction Tokens

  • Comments posted to this topic are about the item Term Extraction Tokens

  • Very good question for the end of the week. I had to do a bit of research on this, so I learnt something new. Thank you.

    ...

  • Nice topic.

    Thank you for the question.

  • Never seen it before - no points to end the week!

  • paul s-306273 (4/22/2016)

    Never seen it before - no points to end the week!

    Having now read the tech. article, the answer is obvious.

  • Nice question. Thanks, Steve. Good timing: I'm going through some SSIS topics on transformations this week.

  • I'd never even heard of it before. Not being an SSIS user, I'm not surprised I missed it.

  • I am confused.

    I have never used this transformation and I cannot test it, so I hope someone else will provide the definitive answer. However, I expected the answer to be two, based on reading exactly the article that is mentioned in the explanation.

    The article describes the phases of the process. Splitting on word boundaries is the first, followed by tagging. In that phase "The" and "is" are discarded; "date" is tagged as a noun; "January" is tagged as a proper noun; and "4" and "2015" are tagged as numbers.

    But these four words are not all returned; that is determined in a later phase, depending on the configuration of the component: nouns only, noun phrases only, or both.

    The word "date" clearly qualifies as a noun. "January", a proper noun, does the same. I am not sure if "4" and "2015" are added to make this a noun phrase or not, as all the examples focus on adjectives. So the second returned term might be either the noun "January" or the noun phrase "January 4, 2015".

    Perhaps I am misreading something. Perhaps numbers are also considered nouns, in which case the article is misleading. I do not have an SSIS installation at hand so I cannot test it, but I hope someone else will, and then share the results here.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • I was confused as well, so I tested it quickly: only Date and January are returned (see attachment).

    Also, with the default settings nothing is returned. The frequency threshold needs to be changed from 2 to 1.

  • I had not seen this transformation before, but I happen to be taking a course with an element of natural language processing right now. There are actually 6 tokens here, and I would have gotten it wrong if 6 had been a choice. So I had to think through the options to get a different answer. I correctly guessed that the stopwords ("The", "is") were ignored and correctly guessed 4.

    Good timing for me personally with this question.

  • Thanks for the question.

  • Interesting question, thanks.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I have been using this one for a while, and only date and January will be returned. The numbers will be tokenized and discarded.
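
The phases discussed in this thread (split on word boundaries, tag each token, keep only nouns, then apply the frequency threshold) can be sketched as a toy model in Python. This is not the SSIS Term Extraction algorithm: the stopword list, the capital-letter heuristic for proper nouns, and the function names are all assumptions made for illustration, and the sample sentence is reconstructed from the words quoted in the posts above.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list covering only the words named in the thread.
STOPWORDS = {"the", "is"}

# Assumption: sample sentence rebuilt from the words quoted in the posts.
SENTENCE = "The date is January 4, 2015"


def tokenize(text):
    """Split on word boundaries, keeping runs of letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|\d+", text)


def tag(token):
    """Toy tagger: stopword, number, proper noun (capitalised), or noun."""
    if token.lower() in STOPWORDS:
        return "stopword"
    if token.isdigit():
        return "number"
    if token[0].isupper():
        return "proper_noun"
    return "noun"  # crude fallback: every other word is treated as a noun


def extract_terms(text, frequency_threshold=2):
    """Return nouns and proper nouns seen at least frequency_threshold times."""
    nouns = [t for t in tokenize(text) if tag(t) in ("noun", "proper_noun")]
    counts = Counter(nouns)
    return [term for term, n in counts.items() if n >= frequency_threshold]


if __name__ == "__main__":
    print(tokenize(SENTENCE))                              # all six tokens
    print(extract_terms(SENTENCE, frequency_threshold=1))  # nouns only
    print(extract_terms(SENTENCE))                         # default threshold of 2
```

Under these assumptions the sketch reproduces the counts mentioned above: six raw tokens, four once the stopwords are dropped, two returned terms with a threshold of 1, and none with a threshold of 2, since each word occurs only once.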
