Data Mining for spam detection

Question

Jonathan Mallia

March 18, 2011 at 4:30 am

Dear all,
I am experimenting with spam detection using Analysis Services. The scenario is that a user posts a comment on a forum or blog, and the system needs to classify that comment as either spam or a valid comment.
Can anyone please shed some light on how to achieve this?
Thanks in advance!
Regards,
J.

Answer 1

Koen Verbeeck

You can use a decision tree algorithm (in SSAS, Microsoft Decision Trees) to create a decision tree (what's in a name :-)). This tree decides whether a message is spam by applying a set of rules learned from your data.

To do this, you'll need some training data: basically usernames, the messages they wrote, maybe how frequently they post and at what time, but most important of all, a flag that indicates whether each message is spam or not. The SSAS data mining algorithm will crawl through this training data and try to create rules that can evaluate whether a given message is spam.
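
To make the idea of "creating rules from flagged training data" a bit more concrete, here is a minimal sketch in plain Python (not SSAS or DMX). The rows, the feature names link_count and posts_per_hour, and the helper functions are all made up for illustration. It only finds the single best rule using Gini impurity, which is one common way for a decision tree to choose a split; the SSAS algorithm uses its own scoring methods, so treat this as the general principle only.

```python
# Hypothetical training rows: (link_count, posts_per_hour, is_spam).
# The last column is the flag that says whether the message was spam.
TRAINING = [
    (0, 1, False),
    (1, 2, False),
    (0, 3, False),
    (4, 20, True),
    (3, 15, True),
    (5, 2, True),
]
FEATURES = ("link_count", "posts_per_hour")


def gini(rows):
    """Gini impurity of the spam flag in rows (0.0 means all one class)."""
    if not rows:
        return 0.0
    p = sum(1 for r in rows if r[-1]) / len(rows)
    return 2 * p * (1 - p)


def best_split(rows):
    """Return (feature_index, threshold) with the lowest weighted impurity."""
    best = None
    best_score = float("inf")
    for f in range(len(FEATURES)):
        for threshold in sorted({r[f] for r in rows}):
            left = [r for r in rows if r[f] <= threshold]
            right = [r for r in rows if r[f] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score = score
                best = (f, threshold)
    return best


feature, threshold = best_split(TRAINING)
print(f"rule: spam if {FEATURES[feature]} > {threshold}")
```

On this toy data the chosen rule is "spam if link_count > 1". A real tree repeats the same search inside each branch until the branches are pure enough or too small to split further.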

After the model has been created, you can test it against another subset of data to see whether messages are correctly labeled as spam or not. This is the test data. It can happen, for example, that the rules fit the training data so closely that they only apply to that specific subset of data. This is overtraining (also called overfitting) your model. You can guard against overtraining by regularly testing the accuracy of the model on data it was not trained on.
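
As a rough illustration of why a separate test set matters, the sketch below (plain Python, not SSAS) compares two hand-written stand-ins for a model. The data, the feature layout (link_count, posts_per_hour) and both "models" are assumptions made up for this example: one memorises the training rows exactly, the other applies a single simple rule.

```python
def accuracy(predict, rows):
    """Fraction of rows whose spam flag the predict function gets right."""
    hits = sum(1 for features, is_spam in rows if predict(features) == is_spam)
    return hits / len(rows)


# Hypothetical labelled rows: ((link_count, posts_per_hour), is_spam)
train = [((0, 1), False), ((1, 2), False), ((4, 20), True), ((3, 15), True)]
test = [((0, 2), False), ((5, 18), True), ((2, 1), True), ((1, 3), False)]

# An overtrained "model": it memorises the training rows exactly and
# calls everything it has not seen before a valid comment.
memorised = dict(train)


def overtrained(features):
    return memorised.get(features, False)


# A simpler rule that generalises: more than one link means spam.
def simple_rule(features):
    return features[0] > 1


for name, model in (("overtrained", overtrained), ("simple rule", simple_rule)):
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both score 1.0 on the training rows, but on the held-out rows the memorising model drops to 0.5 while the simple rule stays at 1.0. That gap between training and test accuracy is the symptom of overtraining to look for.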

When this is done, you can feed actual data to the data mining model, and it will predict whether a message is spam, together with a probability for that prediction (for example, the model is 90% certain that the message is spam).
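
One simple way such a certainty figure can arise in a decision tree is that each leaf remembers what share of its training rows were spam. The sketch below shows that idea in plain Python; it is not how SSAS computes its probabilities internally. The data, the single feature link_count, the threshold and the leaf names are all hypothetical.

```python
# Hypothetical labelled rows: (link_count, is_spam)
TRAINING = [(0, False), (0, False), (1, False), (1, True), (0, False),
            (3, True), (4, True), (2, True), (5, True), (2, False)]
THRESHOLD = 1  # assumed rule: more than one link suggests spam


def leaf_for(link_count):
    """Name of the leaf a message falls into under the assumed rule."""
    return "many_links" if link_count > THRESHOLD else "few_links"


# For each leaf, count [spam rows, total rows] seen during training.
counts = {"few_links": [0, 0], "many_links": [0, 0]}
for link_count, is_spam in TRAINING:
    leaf = counts[leaf_for(link_count)]
    leaf[0] += is_spam  # True counts as 1, False as 0
    leaf[1] += 1


def predict(link_count):
    """Return (is_spam, probability_of_spam) for a new message."""
    spam, total = counts[leaf_for(link_count)]
    p = spam / total
    return p >= 0.5, p


print(predict(4))
print(predict(0))
```

Here a new message with four links lands in a leaf where 4 of 5 training rows were spam, so it is flagged as spam with probability 0.8; a message with no links gets 0.2 and is treated as valid.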

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP