Keyword-Topic Categorizer Python Script
Overview:
The Keyword-Topic Categorizer is an advanced script designed to categorize a list of keywords based on their relevance to a set of predefined topics. Whether you have a dataset of search terms from your website or a list of product keywords and want to categorize them under specific topics, this tool is tailored to streamline the process.
Features:
- Multilingual Support: Categorizes keywords in English, German, French, Spanish, Russian, and Chinese.
- Dynamic Model Loading: Uses the appropriate language model based on detected keyword language for optimal accuracy.
- Batch Processing: Efficiently processes large datasets in manageable batches.
- Debug Mode: Provides detailed insights into the categorization process for each keyword.
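As a rough illustration of the model selection and batching features listed above, the sketch below maps a detected language code to one of the spaCy model names from the installation steps and splits a keyword list into fixed-size batches. The helper names `model_name_for` and `batches` are hypothetical and are not taken from the script; the handling of regional codes such as "zh-cn" is also an assumption.

```python
from typing import Iterator, List, Sequence

# spaCy model names as listed in the installation steps of this README.
MODEL_BY_LANGUAGE = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "ru": "ru_core_news_sm",
    "zh": "zh_core_web_sm",
}
DEFAULT_LANGUAGE = "en"  # the README states that the script falls back to English


def model_name_for(lang_code: str) -> str:
    """Map a detected language code to a model name (hypothetical helper)."""
    # Assumption: regional variants such as "zh-cn" reduce to their base code.
    base = (lang_code or "").lower().split("-")[0]
    return MODEL_BY_LANGUAGE.get(base, MODEL_BY_LANGUAGE[DEFAULT_LANGUAGE])


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

In a full pipeline, the language code would presumably come from `langdetect.detect(keyword)` and the returned name would be passed to `spacy.load(...)`; both calls are left out here so the sketch runs without third-party packages.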
Installation:
Prerequisites:
- Python 3.x
- pip (Python package installer)
Steps:
- Clone or download the repository containing the script.
- Navigate to the script's directory in the terminal or command prompt.
- Install the required Python packages using the following command:
pip install pandas tqdm spacy langdetect
- Download the necessary spaCy language models:
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
python -m spacy download ru_core_news_sm
python -m spacy download zh_core_web_sm
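Because spaCy installs each downloaded model as a regular Python package, one way to confirm that the installation steps succeeded is to check whether each model name can be found by the import system. The sketch below is an optional check, not part of the script; `missing_packages` is a hypothetical helper name.

```python
import importlib.util
from typing import Iterable, List

# Model names from the download commands in the installation steps.
REQUIRED_MODELS = [
    "en_core_web_sm",
    "de_core_news_sm",
    "es_core_news_sm",
    "fr_core_news_sm",
    "ru_core_news_sm",
    "zh_core_web_sm",
]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that are not importable as top-level packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODELS)
    if missing:
        print("Missing spaCy models:", ", ".join(missing))
    else:
        print("All spaCy models are installed.")
```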
Usage:
Prepare Your Data:
- Ensure you have two text files: keywords.txt (one keyword per line) and topics.txt (one topic per line).
- Place both files in the same directory as the script.
Run the Script:
- Navigate to the script's directory in the terminal or command prompt.
- Execute the script:
python match.py
View Results:
- Once the script completes its execution, a file named results.csv will be generated in the same directory. This file contains two columns: "keyword" and "category". Each keyword from keywords.txt is paired with its closest matching topic from topics.txt.
Debug Mode (Optional):
- If you wish to see a detailed breakdown of how each keyword is categorized, set the DEBUG_MODE variable in the script to True. When you run the script in this mode, it will print diagnostic information for each keyword.
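The input and output files described in the usage steps can be handled with the standard library alone. The sketch below reads a one-entry-per-line file such as keywords.txt or topics.txt and counts how many keywords landed in each category of results.csv. It assumes that results.csv starts with a header row naming the "keyword" and "category" columns; `read_lines` and `count_by_category` are hypothetical helper names, not functions from the script.

```python
import csv
from collections import Counter
from pathlib import Path
from typing import Dict, List


def read_lines(path: str) -> List[str]:
    """Read one entry per line, dropping blank lines and surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_by_category(results_path: str) -> Dict[str, int]:
    """Count keywords per category (assumes a 'keyword,category' header row)."""
    with open(results_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return dict(Counter(row["category"] for row in reader))


if __name__ == "__main__":
    # File name as given in the usage steps.
    for category, count in sorted(count_by_category("results.csv").items()):
        print(f"{category}: {count}")
```

A quick count like this makes it easy to spot topics that attract no keywords at all, which may indicate that a topic is worded too narrowly.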
Note: This tool provides categorizations based on token overlaps between keywords and topics. It's essential to ensure that your topics are representative of the categories you want to create. The script defaults to English when a keyword's language can't be determined or if the language isn't supported. Adjustments may be necessary based on specific use cases or domain-specific requirements.
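To make the token-overlap idea in the note concrete, here is a minimal sketch that scores each topic by the number of tokens it shares with a keyword and returns the best match. It is a simplification under stated assumptions: tokens are lowercased whitespace-separated words rather than spaCy tokens, ties go to the earlier topic, and a keyword with no overlap gets `None`. The script's actual scoring and tie-breaking may differ, and `tokens` and `closest_topic` are hypothetical names.

```python
from typing import Iterable, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (assumption: stands in for spaCy tokens)."""
    return set(text.lower().split())


def closest_topic(
    keyword: str, topics: Iterable[str], debug: bool = False
) -> Optional[str]:
    """Return the topic sharing the most tokens with the keyword, or None."""
    keyword_tokens = tokens(keyword)
    best_topic: Optional[str] = None
    best_score = 0
    for topic in topics:
        score = len(keyword_tokens & tokens(topic))
        if debug:
            # Comparable in spirit to the per-keyword diagnostics of debug mode.
            print(f"{keyword!r} vs {topic!r}: {score} shared token(s)")
        if score > best_score:  # strict comparison keeps the earlier topic on ties
            best_topic, best_score = topic, score
    return best_topic


if __name__ == "__main__":
    print(closest_topic("cheap running shoes", ["running shoes", "winter hats"], debug=True))
```

The sketch also shows why representative topics matter: a keyword such as "laptop" shares no token with either topic above, so no category can be assigned by overlap alone.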