There are two typical integration approaches for developers:
- Basic method: join the sensitive words into a regex alternation pattern (e.g., "word1|word2") and scan the text with regular expressions. This is easy to set up but slows down as the lexicon grows, so it suits scenarios with low performance requirements (see the regex sketch after this list).
- Efficient method: use a DFA or Trie-based algorithm. The lexicon is first loaded into the data structure (e.g., via a Python Trie library), and the text is then matched against it. Matching time depends only on the length of the input text rather than the size of the lexicon, which makes this approach suitable for high-concurrency scenarios. The project provides pseudo-code illustrating the whole flow of loading the lexicon, building the matcher, and checking text (see the Trie sketch after this list).
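A minimal sketch of the basic regex approach, not the project's own code; the word list here is a placeholder for the lexicon files the project ships:

```python
import re

# Hypothetical lexicon entries for illustration; in practice these would be
# read from the project's word-list files.
words = ["word1", "word2", "word3"]

# Join the words into one alternation pattern. re.escape guards against
# entries that contain regex metacharacters; sorting longest-first makes
# longer words win over their prefixes.
pattern = re.compile("|".join(re.escape(w) for w in sorted(words, key=len, reverse=True)))

def contains_sensitive(text: str) -> bool:
    """Return True if the text contains any word from the lexicon."""
    return pattern.search(text) is not None

print(contains_sensitive("this text mentions word2"))  # True
```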
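And a minimal Trie-based sketch of the load / build / check flow the project describes in pseudo-code. This is an assumption-level illustration using a nested-dict trie rather than any specific library; function names and the sample words are made up:

```python
END = object()  # marker for "a lexicon word ends at this node"

def build_trie(words):
    """Load the lexicon into a nested-dict trie (the matcher)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_sensitive(text, trie):
    """Check the text: from each position, walk the trie and collect hits."""
    hits = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            ch = text[i]
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                hits.append(text[start:i + 1])
                break  # report the shortest match at this position
    return hits

# Hypothetical lexicon entries for illustration.
trie = build_trie(["badword", "spam"])
print(find_sensitive("please no spam here", trie))  # ['spam']
```

Because each scan position only walks as deep as the longest lexicon word, the cost of checking a text is governed by the text length rather than the number of words in the lexicon.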
This answer comes from the article "Sensitive-lexicon: a continuously updated thesaurus of Chinese sensitive words".