To achieve fast filtering of Chinese sensitive content, you can utilize the Sensitive-lexicon project by following the steps below:
- Download the lexicon: Obtain the `sensitive-lexicon.txt` word list file by cloning the repository with Git or by downloading the ZIP archive directly.
- Choose a matching algorithm: For lightweight applications, you can join all sensitive words into a single regular expression (such as `(word1|word2)`); this is simple to implement, but matching is slow. For high-frequency scenarios, a DFA or Trie-based algorithm is recommended.
- Integrate the code: Load the lexicon file into memory (e.g. into a Python `set`) and combine it with the chosen algorithm to implement the text-matching logic. For pseudo-code, refer to the example in the article; calling a third-party Trie library is more efficient.
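
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's own code: it assumes the lexicon file holds one word per line in UTF-8, and the helper names (`load_words`, `build_regex`, `build_trie`, `find_with_trie`, `mask`) are hypothetical. The Trie here is a plain nested `dict` rather than a third-party library.

```python
import re
from pathlib import Path


def load_words(path):
    """Read the lexicon into a set (assumes one word per line, UTF-8)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def build_regex(words):
    """Lightweight option: join all words into one alternation pattern."""
    if not words:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest words first so a longer word wins over its own prefix;
    # re.escape protects any regex metacharacters inside a word.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))


def build_trie(words):
    """High-frequency option: nested-dict Trie; the key "" marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def find_with_trie(text, trie):
    """Return (start, word) pairs, taking the longest match at each position."""
    hits = []
    i = 0
    while i < len(text):
        node, end = trie, -1
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if "" in node:
                end = j + 1  # remember the longest word seen so far
        if end > 0:
            hits.append((i, text[i:end]))
            i = end  # continue scanning after the matched word
        else:
            i += 1
    return hits


def mask(text, trie, char="*"):
    """Replace every matched word with a run of `char` of the same length."""
    out = list(text)
    for start, word in find_with_trie(text, trie):
        out[start:start + len(word)] = char * len(word)
    return "".join(out)


if __name__ == "__main__":
    # File name taken from the text above; adjust the path to your checkout.
    words = load_words("sensitive-lexicon.txt")
    trie = build_trie(words)
    print(mask("text to check", trie))
```

The regex version rebuilds one large pattern from the whole lexicon, which is fine for small word lists; the Trie version walks the text once and only follows branches that actually exist, which is why it tends to suit larger lexicons and higher request volumes.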
Note: With this approach, you need to sync lexicon updates periodically and tune the false-positive rules to fit your business scenario.
This answer comes from the article "Sensitive-lexicon: a continuously updated thesaurus of Chinese sensitive words".