Word2vec Spam Filter

By Doodyparizada
Updated: 18 Dec 2017
149 Stars

Using word vectors to classify spam messages

Overview

The word2vec spam filter project was developed during the Kik hackathon in 2017 to classify spam messages while protecting user privacy. It runs on the client side: the client generates a “hash” from each incoming message and sends that hash to a server for classification. The server compares the hash against a bank of previously reported messages to decide whether the message is spam, and its accuracy improves incrementally as users report more messages.

The filter is built on word vectors and exposes several configurable parameters for tuning its behavior. It also includes a web client for testing the classification in real time, where users can check whether a message is classified as spam and report messages themselves.
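The listing does not say how the “hash” is computed or compared, so the following is only a minimal sketch of one plausible approach: average the word vectors of a message and compare the result to the bank with cosine similarity. The toy vectors, the function names (`message_hash`, `cosine`, `is_spam`), and the 0.95 threshold are all assumptions made for illustration, not the project's actual code or settings.

```python
import math

# Hypothetical 3-dimensional word vectors. A real system would load
# trained word2vec vectors instead of this hand-written table.
TOY_VECTORS = {
    "free":  [0.90, 0.10, 0.00],
    "money": [0.80, 0.20, 0.10],
    "win":   [0.85, 0.15, 0.05],
    "lunch": [0.10, 0.90, 0.20],
    "today": [0.00, 0.70, 0.60],
}


def message_hash(text, vectors=TOY_VECTORS):
    """Average the vectors of the known words in a message.

    Words missing from the vocabulary are skipped. Returns None when
    no word in the message is known.
    """
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_spam(msg_hash, spam_bank, threshold=0.95):
    """Flag a message whose hash is close enough to any bank entry."""
    if msg_hash is None:
        return False
    return any(cosine(msg_hash, entry) >= threshold for entry in spam_bank)


# Server side: the bank holds hashes of previously reported messages.
spam_bank = [message_hash("win free money")]

# Client side: hash the incoming message, then ask for a classification.
print(is_spam(message_hash("free money"), spam_bank))
print(is_spam(message_hash("lunch today"), spam_bank))
```

In this sketch the server sees an averaged vector rather than the message text, which is one way a client-side hash could limit what the server learns about a message.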

Features

  • Privacy Protection: The system generates a hash of the message, ensuring that user privacy is maintained while determining spam status.
  • Dynamic Spam Bank: A message is added to a central spam bank once it has been reported multiple times, so detection improves as reports accumulate.
  • Customizable Hyper-Parameters: Users can adjust parameters such as confidence thresholds and vector sizes to optimize the spam filtering process according to their needs.
  • User-Friendly Web Client: The project includes a web interface with multiple view modes, making it easy for users to interact with the spam filter.
  • Interactive Testing: Users can input their messages to check if they are classified as spam or report them accordingly, aiding in the training of the spam detection system.
  • Extensive Configurations: Supports configurations for handling different message types, including non-English words and various punctuation marks, increasing accuracy.
  • Real-Time Feedback Loop: Users can instantly report messages as spam, which contributes to the ongoing refinement of the spam bank and overall system performance.
  • Quick Setup: A single makefile handles initialization and dependency installation, so developers can get the project running and start contributing quickly.
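The spam bank and feedback loop described in the list above can be sketched as follows. This is an illustrative assumption about how such a bank might work, not the project's implementation: the class name `SpamBank`, the parameter names `report_threshold` and `confidence_threshold`, and their default values are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpamBank:
    """Hypothetical bank that promotes a message after enough reports."""

    def __init__(self, report_threshold=3, confidence_threshold=0.9):
        self.report_threshold = report_threshold          # reports needed to promote
        self.confidence_threshold = confidence_threshold  # minimum similarity to match
        self.pending = []  # reported entries not yet promoted
        self.bank = []     # vectors of confirmed spam

    def _matches(self, vec, other):
        return cosine(vec, other) >= self.confidence_threshold

    def is_spam(self, vec):
        """True if the vector is close enough to any confirmed spam entry."""
        return any(self._matches(vec, entry) for entry in self.bank)

    def report(self, vec):
        """Record one user report; return True once the message counts as spam."""
        if self.is_spam(vec):
            return True
        entry = next(
            (e for e in self.pending if self._matches(vec, e["vector"])), None
        )
        if entry is None:
            entry = {"vector": list(vec), "reports": 0}
            self.pending.append(entry)
        entry["reports"] += 1
        if entry["reports"] >= self.report_threshold:
            # Enough independent reports: move the entry into the bank.
            self.pending.remove(entry)
            self.bank.append(entry["vector"])
            return True
        return False


bank = SpamBank(report_threshold=3, confidence_threshold=0.9)
vector = [0.85, 0.15, 0.05]  # stand-in for a message hash
for _ in range(3):
    bank.report(vector)
print(bank.is_spam(vector))
```

Raising `report_threshold` in this sketch makes the bank slower to accept new spam but harder to poison with a single false report; lowering `confidence_threshold` makes matching looser at the cost of more false positives.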