The key steps for integrating Sensitive-lexicon across languages are:
- Universal lexicon format: whether you use Java, PHP, Go, or another language, read the UTF-8 encoded sensitive-lexicon.txt text file directly and parse it line by line into an array.
- Choosing a matching algorithm per language: for Java, the org.ahocorasick.trie library is recommended for DFA-style matching; for PHP, the phptrie extension is available; in Go, a quick implementation can be built from the standard library's strings.Contains together with a map structure.
- Packaging as a general-purpose module: encapsulate the lexicon loading and matching logic in an independent service (e.g., a REST API) that different business systems call through its interface.
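As a rough illustration of the first two steps in Go, the sketch below loads the word list into a map and checks a text with strings.Contains. The names Lexicon, ParseLexicon, LoadLexicon, and Matches are hypothetical helpers introduced for this example, not part of Sensitive-lexicon, and the file is assumed to sit in the working directory with one word per line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// Lexicon is a set of sensitive words (hypothetical type for this sketch).
// Using a map de-duplicates entries that appear more than once in the file.
type Lexicon map[string]struct{}

// ParseLexicon reads one word per line from r and skips blank lines.
func ParseLexicon(r io.Reader) (Lexicon, error) {
	lex := make(Lexicon)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		// Strip a UTF-8 byte order mark if present, then surrounding whitespace.
		w := strings.TrimSpace(strings.TrimPrefix(sc.Text(), "\uFEFF"))
		if w != "" {
			lex[w] = struct{}{}
		}
	}
	return lex, sc.Err()
}

// LoadLexicon opens a UTF-8 text file and parses it line by line.
func LoadLexicon(path string) (Lexicon, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return ParseLexicon(f)
}

// Matches returns every lexicon word that occurs in text, sorted for
// stable output. The comparison is an exact, case-sensitive substring test.
func (l Lexicon) Matches(text string) []string {
	var hits []string
	for w := range l {
		if strings.Contains(text, w) {
			hits = append(hits, w)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	// Assumed location of the word list; adjust the path as needed.
	lex, err := LoadLexicon("sensitive-lexicon.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "load lexicon:", err)
		return
	}
	fmt.Println(lex.Matches("text to check"))
}
```

This version scans the text once per lexicon word, so its cost grows with the size of the word list. For a large lexicon or high request volume, a trie-based matcher such as Aho-Corasick is likely the better fit. The same two functions could also sit behind an HTTP handler to form the shared service described in the last step.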
With this approach, basic integration can be completed in under 1 hour, with a performance loss of less than 5%.
This answer comes from the article "Sensitive-lexicon: a continuously updated thesaurus of Chinese sensitive words".