Technical Article

Data Mining Part 35: Text Mining using the term extraction tool

,

Introduction

Text mining is a very interesting topic that is used to detect patterns and analyze the text. In text mining you analyze, summarize and get information from text like comments, feedback or other text sources.

In this chapter, we will use a simple tool called the Term Extraction Task to analyze comments made in the SQLServerCentral and get the most frequently used words. This is a very simple tool used to detect the frequency of the most commonly used terms in a specified text.

This new chapter is independent. If you did not read any of the Data Mining series, it is OK. Even when the article uses SQL Server Integration Services, if you do not have any experience at all, you can still read this article and learn from 0.

If you already read many or maybe all the other articles, they will help you a little bit, specially the ones related to SSIS (part 14 and 15).

Requirements

  1. SQL Server Installed.
  2. SQL Server Integration Services.
  3. The SQL Server Data Tools (SSDT).
  4. The table score, which will be used to measure the frequency of the terms with a score.

  1. The SQLServerCentral_comments table. We attached a file named sqlservercentral_script that will help you to generate both tables with Data. The script is in the resources section.

The sqlservercentral_comments contains comments of this Web site. We will analyze the comments using the term extraction tool from SSIS and detect the most frequently used terms.

Getting Started

1. In order to start, run the SSDT (or BIDS in earlier versions).

2. In the SSDT go to File>New>Project.

3. Select the Integration Service Project.

4. Drag and drop the data flow task and double click on the Data Flow Task in the design pane.

5. In the Data Flow Tab, drag and drop the OLE DB Source.

6. Drag and drop the term extraction task (this is the super tool that we are going to learn in this chapter).

7. The Term extraction has a special data type. For this reason, we will need the Data Conversion Task. Drag and drop it to the design pane. And finally, to store the data, drag and drop the SQL Server Destination on the surface. Finally join all the tasks.

8. Double click on the OLE DB Source Tasks and press the New button to create a new connection.

9.  Connect to the server and database with the tables created on the requirements section. In this example, the database where the tables were created is named db1.

10. The OLE DB Source will analyze the comments stored in the sqlservercentral_comments table. Specify the connection and the table.

11. Once the OLE DB Source is specified, double click on the Term Extraction Task. Select the column that contains the comments. In this example, the column name is text. Check the text column.

The output will be the term and the score. You can change the name of the output columns in this window.

12. As I said before, the term extraction task has an output in a specific data type. We will need to convert the data types using the Data Conversion Transformation Task. Double click on the task and check the Term and Score. I changed the Output Alias to Term_converted and Score converted. You can change the name to any name of your preference. The data types will be converted to Unicode String and four-byte signed integer. What we are doing is to convert the output of the term extraction task to a format compatible with the score table created on the requirements section.

13. Double click the SQL Destination task and connect to the db1 database to the score table created on the requirements. Click on the Mappings page.

14. Remove the existing mappings, and map the term_converted to term and Score_converted  to score.

15. If everything is OK, the SSIS flow will be the following:

16. Now, run the package. If everything is OK, you will receive a message similar to the following:

17. Finally, run a select to see the score table data:

SELECT term
     , score
 FROM db1.dbo.score

Conclusion

As you can see, the most frequent word is article with a score of 5. The second term is nice article. There are other terms like tutorial and SSAS with a score of 3.

You may ask yourself what should I do with these terms? Well, the answer is that these words can be very important to analyze readers' needs. This simple tool is giving you analysis of the most frequent words used and will give you tips about what the readers are looking for. 

This tool can be applied to any web site. With the term extraction task, you can find special words that will help you to summarize what the customer needs. As you can see, it is not very hard to analyze the information and this tool is very simple.

The term extraction task is not related with all the other mining tools used before. It is independent, but very useful.

References

Resources

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating