The following optimization schemes are recommended for sensitive word filtering performance problems in high concurrency scenarios:
- Choose an efficient data structure: Prefer a DFA or a Trie over regular expressions. Matching then runs in O(n) time in the length of the input text, independent of the size of the word list. Most programming languages have off-the-shelf implementations (e.g., Python's pyahocorasick library).
- Preload the word list: Build the sensitive words into an in-memory Trie/DFA once at service startup, rather than re-parsing the word file on every request.
- Distributed caching: For very large-scale systems, consider storing the constructed matcher in a cache such as Redis so it can be shared across multiple nodes.
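The first two points can be sketched as follows. This is a minimal, illustrative Trie-based filter built once up front and reused for every request; the word list and the `censor` helper are assumptions for demonstration, and a production system would more likely use a library such as pyahocorasick:

```python
class TrieFilter:
    """Minimal sensitive-word filter built on a Trie (prefix tree).

    Illustrates the "build once at startup, match in O(n)" idea;
    the word list below is a placeholder, not a real lexicon.
    """

    _END = "__end__"  # marker key for a complete word

    def __init__(self, words):
        # Build the Trie once, e.g. at service startup.
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self._END] = True

    def censor(self, text, mask="*"):
        """Replace every matched sensitive word with mask characters."""
        result = list(text)
        i = 0
        while i < len(text):
            node = self.root
            j = i
            last_match = -1
            # Walk the Trie as far as the text allows, remembering
            # the end of the longest complete word starting at i.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if self._END in node:
                    last_match = j
            if last_match > 0:
                for k in range(i, last_match):
                    result[k] = mask
                i = last_match  # skip past the masked word
            else:
                i += 1
        return "".join(result)


# Built once, reused for every request:
filt = TrieFilter(["badword", "spam"])
print(filt.censor("this badword is spam"))  # this ******* is ****
```

Each character of the input is visited a bounded number of times, so the cost scales with the text length rather than with the number of sensitive words in the lexicon.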
According to benchmark data, the DFA approach typically matches a 100,000-character text in under 100 ms, which is adequate for applications with millions of daily active users.
This answer is based on the article "Sensitive-lexicon: a continuously updated collection of Chinese sensitive words".