Three points to focus on:
- legal compliance: Filtering must comply with regional content review regulations, such as the requirements of China's Cybersecurity Law for sensitive information.
- contextual misjudgment: Pure string matching can flag normal content, for example when a harmless word such as "computer" happens to contain a listed term as a substring. It is recommended to adjust the thesaurus to fit the business, or to introduce NLP techniques alongside it.
- performance optimization: High-concurrency scenarios call for an efficient matching algorithm such as DFA, so that content review does not become a system bottleneck.
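
The last two points can be illustrated with a minimal sketch of a trie-based matcher, which is the structure usually meant by "DFA" in sensitive-word filtering. Everything here is an assumption for illustration: the function names `build_trie`, `find_matches`, and `mask` are hypothetical, the terms are placeholders rather than entries from the thesaurus, and this is not the article's or the project's implementation. Each scan step follows one trie edge per character, so the cost depends on the text length and the longest term, not on how many terms the thesaurus holds.

```python
from typing import Dict, List, Tuple

# Sentinel key marking "a listed term ends at this node".
# None can never collide with a character taken from the text.
END = None


def build_trie(terms: List[str]) -> Dict:
    """Build a nested-dict trie; each path from the root spells one term."""
    root: Dict = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_matches(text: str, trie: Dict) -> List[Tuple[int, str]]:
    """Return (start_index, term) pairs, preferring the longest term at each position."""
    matches = []
    i = 0
    while i < len(text):
        node = trie
        longest = 0
        j = i
        # Walk the trie one character at a time while an edge exists.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                longest = j - i
        if longest:
            matches.append((i, text[i:i + longest]))
            i += longest  # skip past the matched term
        else:
            i += 1
    return matches


def mask(text: str, trie: Dict, fill: str = "*") -> str:
    """Replace every matched term with fill characters of the same length."""
    chars = list(text)
    for start, term in find_matches(text, trie):
        chars[start:start + len(term)] = fill * len(term)
    return "".join(chars)


if __name__ == "__main__":
    trie = build_trie(["bad", "badge"])  # placeholder terms
    print(find_matches("a bad badge", trie))
    print(mask("a bad badge", trie))
    # Substring matching has no notion of word boundaries or context:
    print(find_matches("Birmingham", build_trie(["ham"])))
```

The last call shows the misjudgment problem directly: the placeholder term "ham" is reported inside "Birmingham". A matcher like this only finds substrings, so reducing such hits still depends on curating the thesaurus or adding context-aware checks on top.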
In particular, the article emphasizes that the thesaurus should be further processed to fit the business scenario rather than copied directly.
This answer comes from the article "Sensitive-lexicon: a continuously updated thesaurus of Chinese sensitive words".