Sensitive-lexicon is an open source Chinese sensitive-word thesaurus project that provides tens of thousands of words as plain text. The project is designed to help developers and content managers quickly add a basic text review feature to their applications or websites. The thesaurus covers mainstream sensitive areas such as politics, pornography, and violence, and is kept up to date by community contributions so that it tracks the ever-changing online language environment. Because the format is plain text, the thesaurus can be read and used from a wide variety of programming languages and frameworks. Developers can pair it with algorithms such as DFA, Trie trees, or regular expressions to filter and review text content according to their business needs.
Function List
- Extensive vocabulary coverage: The thesaurus contains tens of thousands of sensitive words covering many common sensitive content areas including politics, violence, and pornography.
- Ongoing community updates: The project relies on community contributions (Issues and Pull Requests) to keep adding new words and fixing bugs, ensuring that the thesaurus keeps up with changes in Internet terminology and maintains its validity.
- Easy to integrate: The thesaurus is provided as a plain text (.txt) file, a format so versatile that developers can easily integrate it into any programming language or project framework without a complex parsing process.
- Multiple implementations: Developers can flexibly choose different string matching algorithms to use this thesaurus according to specific business scenarios and performance requirements, such as DFA (Deterministic Finite Automaton), Trie tree, or basic regular expression matching.
- Openness and transparency: As an open source project, the thesaurus content and update history are publicly available, and users can review them, contribute to them, and download them freely.
Usage Guide
The project itself is a collection of thesaurus files, not a piece of software that can be run directly, so there is no traditional "installation" process. At its core, it provides the data, and developers need to write their own code to read and use the data.
How to get and use the thesaurus
- Download Thesaurus File
The most straightforward way to get the files, and to pull later updates easily, is to clone the entire repository via Git. Open your terminal or command line tool and execute the following command:
    git clone https://github.com/konsheng/Sensitive-lexicon.git
If you're not familiar with Git, you can also download the entire project as a zip by clicking the "Code" button on the GitHub project homepage and selecting "Download ZIP".
- Selecting the right thesaurus file
In the downloaded project files, the core file is sensitive-lexicon.txt, which contains all types of sensitive words. The project may also provide separate thesaurus files categorized by sensitive area, so you can load all words or only specific categories according to your needs.
- Integration in code
The core step in integrating the thesaurus is to have your program read the sensitive-lexicon.txt file and load each line (i.e., one sensitive word) into a data structure (e.g., a list, a set, or the nodes of a Trie tree).
Basic implementation: using regular expressions
This is a simple but less efficient method, suited to scenarios where the amount of text is small or performance requirements are modest.
- Steps:
  - Read all sensitive words line by line from sensitive-lexicon.txt.
  - Join these words into one large regular expression pattern, separated by | (or). For example, if the thesaurus contains "词语一" (word one) and "词语二" (word two), the pattern is (词语一|词语二).
  - Use this regular expression pattern to match the text you need to review.
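The regular expression steps above can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the helper names load_words, build_pattern, and find_sensitive are hypothetical, and the two sample words stand in for the real file contents. Words are escaped with re.escape so that any regex metacharacters in the thesaurus are matched literally.

```python
import re
from pathlib import Path


def load_words(path):
    """Read one word per line, skipping blank lines and duplicates."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def build_pattern(words):
    """Join the words into one alternation, escaping regex metacharacters."""
    if not words:
        raise ValueError("the word set is empty")
    # Longest first, so a longer word wins over a shorter word it starts with.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))


def find_sensitive(pattern, text):
    """Return every non-overlapping match found in the text, in order."""
    return pattern.findall(text)


# Stand-in words for illustration; in practice something like
# words = load_words("sensitive-lexicon.txt")
words = {"词语一", "词语二"}
pattern = build_pattern(words)
hits = find_sensitive(pattern, "这里有词语一，也有词语二。")
print(hits)
```

With tens of thousands of alternatives the compiled pattern becomes large and matching slows down, which is why this approach is best kept to low-volume use.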
Efficient implementation: using DFA (Deterministic Finite Automata) or Trie trees
For production environments that need to process large amounts of text with high-frequency review, a DFA or Trie tree is the better choice. These structures match very efficiently: the matching time grows with the length of the text being checked and is largely independent of the number of words in the thesaurus.
- Steps:
  - Building data structures: At application startup, read the sensitive-lexicon.txt file and build all the sensitive words into a Trie tree or DFA. Many programming languages have ready-made libraries for both data structures, so you don't need to start from scratch.
  - Executing a match: Feed the text to be reviewed into the constructed Trie tree or DFA for matching.
  - Handling the outcome: The algorithm reports whether the text contains sensitive words and where they are located. Depending on your business needs, you can replace the sensitive words with asterisks (*) or simply refuse to publish content that contains them.
Example (pseudo-code)
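    # Pseudo-code showing the basic logic

    # 1. Load the thesaurus
    sensitive_words = set()
    with open("sensitive-lexicon.txt", "r", encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:  # skip blank lines
                sensitive_words.add(word)

    # 2. Build an efficient matcher (a third-party Trie library is assumed here;
    #    some_trie_library is a placeholder name, not a real package)
    from some_trie_library import Trie
    trie = Trie()
    for word in sensitive_words:
        trie.add_word(word)

    # 3. Check the text
    def check_text(text):
        found_words = trie.search_all(text)
        if found_words:
            print(f"Sensitive words found: {found_words}")
            return False  # contains sensitive words
        return True  # text is safe

    # 4. Apply the check
    user_input = "这是一段用户输入的测试文本。"  # sample user input
    is_safe = check_text(user_input)
    if is_safe:
        print("Content published successfully.")
    else:
        print("Content contains inappropriate information; publishing failed.")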
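For readers who want something runnable without a third-party library, a small dictionary-based Trie can stand in for the placeholder library in the pseudo-code. The sketch below is an assumption-laden illustration: the class Trie, its methods add_word and search_all, and the helper mask are written here for demonstration and do not come from the project. It reports the longest match at each starting position and shows the asterisk replacement mentioned above.

```python
class Trie:
    """Minimal character trie for multi-word matching."""

    def __init__(self):
        self.root = {}

    def add_word(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # the empty-string key marks the end of a word

    def search_all(self, text):
        """Return (start, end) spans, keeping the longest match per start."""
        spans = []
        for start in range(len(text)):
            node, end = self.root, None
            for i in range(start, len(text)):
                node = node.get(text[i])
                if node is None:
                    break
                if "" in node:
                    end = i + 1
            if end is not None:
                spans.append((start, end))
        return spans


def mask(text, spans, char="*"):
    """Replace every matched span with the mask character."""
    chars = list(text)
    for start, end in spans:
        chars[start:end] = char * (end - start)
    return "".join(chars)


# Stand-in words for illustration; in practice load them from the thesaurus file.
trie = Trie()
for word in ("坏词", "坏词语"):
    trie.add_word(word)

text = "这是坏词语测试"
spans = trie.search_all(text)
masked = mask(text, spans)
print(spans, masked)
```

This scan restarts at every character, so its cost is bounded by the text length times the longest word rather than being strictly linear. An Aho-Corasick automaton removes the restarts and is the usual choice when strict linear-time matching is needed.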
Caveats
- Compliance with regulations: When using this thesaurus for content filtering, be sure to comply with the relevant laws and regulations of your country or region.
- Context matters: Whether a word counts as sensitive depends heavily on culture, region, and the specific context. Developers need to evaluate and adjust the thesaurus against their own business scenarios in practice to avoid blocking legitimate content by mistake.
- Performance Considerations: For large, highly concurrent applications, it is important to choose a high-performance matching algorithm (e.g., DFA) to avoid content review becoming a system bottleneck.
Application Scenarios
- Community forums and social media
It is used to automatically review the posts, comments, and private messages posted by users and to filter out inappropriate remarks involving abusive language, pornography, advertisements, etc., in order to maintain a healthy community atmosphere and a clean network environment.
- Online games
It is used to filter in-game chat content, character names, guild names, etc. to prevent players from using offensive or illegal words, improving the gaming experience and meeting regulatory requirements.
- Content publishing platforms
Blogs, news sites, video platforms, etc. can use this thesaurus for initial screening of user-uploaded content (e.g., titles, descriptions, comments, bullet comments) to reduce the pressure of manual review.
- Educational and children's products
It can be used to block inappropriate content such as violence and pornography in apps or websites aimed at minors, protecting young people from harmful information.
- Internal company communication
It is used to monitor an enterprise's internal chat tools or email system to prevent the leakage of sensitive data or the spread of inappropriate remarks and to safeguard information security.
QA
- Is this project just a thesaurus file? Is there an API that can be called directly?
Yes, the core of the project is a thesaurus file in plain text format (sensitive-lexicon.txt). It does not provide a directly callable API or a ready-made software service. Users need to write their own code to read this file and implement the specific text filtering logic.
- How often is this thesaurus updated?
The project is a community-driven open source project with no fixed update cycle. It relies on contributions from community members: the thesaurus is updated when someone submits new vocabulary (via an Issue or Pull Request) and the project maintainer merges it.
- How do I contribute new sensitive words to this project?
You can create a new "Issue" on the project's GitHub page and list the words you suggest adding. A more standard approach is to "Fork" the project, edit the sensitive-lexicon.txt file in your own copy, then open a Pull Request against the original project and wait for the maintainer to review and merge it.
- Does using this thesaurus cause normal words to be filtered incorrectly?
Possibly. Because matching is done by direct string comparison, there may be "false positives": a harmless word can be filtered by an overly broad rule because it happens to contain a listed word as a substring. Therefore, in practice, it is recommended to adjust the thesaurus according to your business requirements and to combine it with smarter techniques (e.g., natural language processing) that take context into account in order to improve accuracy.
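One simple way to adjust the thesaurus to a given business scenario is to keep local exclusion and addition lists and apply them after loading. The sketch below only illustrates that idea: adjust_lexicon and find_hits are hypothetical helper names, the English words are neutral placeholders rather than real thesaurus entries, and the naive substring check is used purely to make the effect visible.

```python
def adjust_lexicon(base_words, exclude=(), extra=()):
    """Return a copy of the word set with exclusions removed and extras added."""
    cleaned = {w.strip() for w in base_words if w.strip()}
    return (cleaned - set(exclude)) | set(extra)


def find_hits(text, words):
    """Naive substring check, only to show the effect of the adjustment."""
    return sorted(w for w in words if w in text)


# Placeholder words; a real deployment would start from the loaded thesaurus.
base = {"bad", "spam"}
text = "The badge arrived today"

before = find_hits(text, base)  # "bad" matches inside "badge"

# Drop the overly broad entry and add a more specific phrase instead.
adjusted = adjust_lexicon(base, exclude={"bad"}, extra={"bad word"})
after = find_hits(text, adjusted)
print(before, after)
```

Keeping the adjustments in separate lists, rather than editing the downloaded file, makes it easier to pull later thesaurus updates without losing local changes.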