
DeepSieve: a RAG Intelligent Information Screening Tool for Processing Complex Query Sources

2025-08-07


DeepSieve is an open-source Retrieval-Augmented Generation (RAG) framework hosted on GitHub that focuses on complex queries over multi-source data. It filters information in four stages: decomposing a query, routing the sub-questions, reflecting on failed retrievals, and fusing the answers. Developed by MinghoKwok, DeepSieve handles structured data (e.g., SQL tables, JSON logs) and unstructured data (e.g., Wikipedia), which makes it suitable for scenarios that require multi-step reasoning. Its design is modular, so users can adjust individual functions to their needs, and it is aimed at researchers and developers working on complex data analysis tasks. The project was released as a preprint on arXiv on July 29, 2025, and the full corpus has been uploaded to Arkiv.

Function List

Query decomposition: Splits a complex query into several simple sub-questions so that each can be handled precisely.
Sub-question routing: Assigns each sub-question to an appropriate tool or data source (e.g., a local database or a global knowledge base).
Reflection mechanism: Automatically detects failed retrievals and retries them, with up to two reflections.
Answer fusion: Integrates the sub-question answers into a final, complete response.
Multiple data sources: Handles heterogeneous data such as SQL tables, JSON logs, and Wikipedia.
Two RAG modes: Provides two retrieval modes, Naive and Graph, to suit different needs.
Detailed logging: Saves the intermediate results, fusion prompt, and performance metrics of each query for debugging and optimization.
Modular design: Users can enable or disable function modules with command-line switches.

Usage Guide

Installation

DeepSieve is an open-source Python project. To install it, clone the repository from GitHub and configure the environment as follows:

Clone the repository
Run the following command in a terminal to clone the DeepSieve repository locally:
```
git clone https://github.com/MinghoKwok/DeepSieve.git
```
Go to the project directory:
```
cd DeepSieve
```
Install dependencies
The project requires Python 3.7+ and related machine learning and data processing libraries. Install the dependencies:
```
pip install -r requirements.txt
```
If requirements.txt is not provided, install the core libraries manually:
```
pip install numpy pandas scikit-learn openai
```
To avoid dependency conflicts, create and activate a virtual environment before installing the dependencies:
```
python -m venv venv
source venv/bin/activate  # Linux/macOS
venv\Scripts\activate     # Windows
```
Configure environment variables
DeepSieve uses a large language model (LLM) to process queries, so an API key must be configured. For example, to use the DeepSeek model:
```
export OPENAI_API_KEY=your_api_key
export OPENAI_MODEL=deepseek-chat
export OPENAI_API_BASE=https://api.deepseek.com/v1
```
Set the RAG type according to the mode you want to use (Naive or Graph):
```
export RAG_TYPE=naive  # or graph
```
Verify the environment
Ensure that all dependencies are installed correctly and that the API key is valid. If you use a custom data source, check that the data file path is configured correctly.
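Part of this check can be scripted. The following is a minimal sketch, assuming that the four variables exported above are the only ones required; the function name `check_environment` is hypothetical and not part of DeepSieve. It only confirms that the variables are present and that RAG_TYPE is one of the two documented values. It does not test whether the API key is actually accepted by the provider.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_MODEL", "OPENAI_API_BASE", "RAG_TYPE")
VALID_RAG_TYPES = ("naive", "graph")


def check_environment(env):
    """Return a list of problems; an empty list means the settings look complete."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    rag_type = env.get("RAG_TYPE")
    if rag_type and rag_type not in VALID_RAG_TYPES:
        problems.append(f"RAG_TYPE must be one of {VALID_RAG_TYPES}, got {rag_type!r}")
    return problems


if __name__ == "__main__":
    # Report every problem found in the current shell environment.
    for problem in check_environment(os.environ):
        print("Problem:", problem)
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to try out with a plain dictionary.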

Run DeepSieve

DeepSieve runs from the command line and offers flexible parameter configuration. The following is the basic usage:

Naive RAG mode

Naive mode is suitable for simple tasks. Run the following command:

```
export RAG_TYPE=naive
python runner/main_rag_only.py \
  --dataset hotpot_qa \
  --sample_size 100 \
  --decompose \
  --use_routing \
  --use_reflection \
  --max_reflexion_times 2
```

--dataset: Specifies the dataset (e.g., hotpot_qa).
--sample_size: Sets the number of samples to process.
--decompose: Enables query decomposition.
--use_routing: Enables sub-question routing.
--use_reflection: Enables the reflection mechanism.
--max_reflexion_times: Sets the maximum number of reflections.

Graph RAG mode

Graph mode is suitable for complex queries and requires graph structure support:

```
export RAG_TYPE=graph
python runner/main_rag_only.py \
  --dataset hotpot_qa \
  --sample_size 100 \
  --decompose \
  --use_routing \
  --use_reflection \
  --max_reflexion_times 2
```

Disable Modules

Users can disable a feature by removing the corresponding command-line parameter. For example:

To disable query decomposition, remove --decompose.
To disable routing, remove --use_routing.
To disable reflection, remove --use_reflection.
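When several ablation runs are needed, the switches above can be assembled programmatically. The sketch below is a hypothetical helper, not part of DeepSieve. It uses only the script path and flags shown in this guide, and it assumes that --max_reflexion_times can be omitted whenever --use_reflection is omitted.

```python
def build_command(dataset, sample_size, decompose=True, use_routing=True,
                  use_reflection=True, max_reflexion_times=2):
    """Assemble the argument list for runner/main_rag_only.py, omitting disabled modules."""
    cmd = ["python", "runner/main_rag_only.py",
           "--dataset", dataset,
           "--sample_size", str(sample_size)]
    if decompose:
        cmd.append("--decompose")
    if use_routing:
        cmd.append("--use_routing")
    if use_reflection:
        # Assumption of this sketch: the retry limit is only passed with reflection on.
        cmd += ["--use_reflection", "--max_reflexion_times", str(max_reflexion_times)]
    return cmd


if __name__ == "__main__":
    # Ablation example: routing disabled, everything else enabled.
    print(" ".join(build_command("hotpot_qa", 100, use_routing=False)))
```

The returned list can be passed directly to `subprocess.run`, with RAG_TYPE set in the environment beforehand.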

Output results

The following files are generated for each run:

Results for each query: outputs/{rag_type}_{dataset}/query_{i}_results.jsonl
Fusion prompt: outputs/{rag_type}_{dataset}/query_{i}_fusion_prompt.txt
Overall performance metrics: overall_results.txt and overall_results.json
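The per-query files can be inspected with a few lines of Python. This is a minimal sketch based only on the naming pattern listed above. The helper names are hypothetical, the record schema inside the .jsonl file is not documented here, and starting the query index at 0 is an assumption.

```python
import json
from pathlib import Path


def result_paths(rag_type, dataset, i, root="outputs"):
    """Build the per-query file paths from the naming pattern listed above."""
    run_dir = Path(root) / f"{rag_type}_{dataset}"
    return (run_dir / f"query_{i}_results.jsonl",
            run_dir / f"query_{i}_fusion_prompt.txt")


def load_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Index 0 is an assumption; adjust it to a query that exists in your run.
    results_file, prompt_file = result_paths("naive", "hotpot_qa", 0)
    if results_file.exists():
        for record in load_jsonl(results_file):
            # The schema is unknown here, so only list the top-level keys.
            print(sorted(record) if isinstance(record, dict) else record)
```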

Main Functions

Query decomposition

DeepSieve splits a complex query into sub-questions. For example, the query "Revenue and number of employees of a company in 2023" is split into:

Sub-question 1: Find the company's revenue in 2023.
Sub-question 2: Find the company's number of employees in 2023.
Operational steps:

Enter a query through a script or the command line.
Run the script. DeepSieve breaks the query down automatically and records the sub-questions in the log.
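To make the decomposition step concrete, the sketch below shows one way a numbered model reply could be turned into a list of sub-questions. Both the prompt wording and the function names are hypothetical; DeepSieve's actual prompt and parsing may differ. A canned reply stands in for the LLM call.

```python
import re


def build_decomposition_prompt(query):
    """Hypothetical prompt wording; DeepSieve's actual prompt may differ."""
    return (
        "Split the following question into simple sub-questions, "
        "one per line, numbered 1., 2., and so on.\n"
        f"Question: {query}"
    )


def parse_sub_questions(llm_reply):
    """Extract the text of each numbered line from a model reply."""
    sub_questions = []
    for line in llm_reply.splitlines():
        # Accept "1. text" or "1) text"; ignore lines without a leading number.
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            sub_questions.append(match.group(1).strip())
    return sub_questions


if __name__ == "__main__":
    # Canned reply standing in for a real LLM response.
    reply = ("1. Find the company's revenue in 2023.\n"
             "2. Find the company's number of employees in 2023.")
    print(parse_sub_questions(reply))
```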

Sub-question routing

Each sub-question is assigned to an appropriate tool or data source. For example:

Sub-questions about structured data (e.g., SQL tables) are routed to database query tools.
Sub-questions about unstructured data (e.g., Wikipedia) are routed to text retrieval tools.
Users do not need to specify a source manually; DeepSieve routes automatically. Routing details can be viewed in the log file query_{i}_results.jsonl.
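The idea of routing can be illustrated with a toy example. The keyword matcher below is a deliberately simple stand-in, not DeepSieve's router, and the source names and hint words are hypothetical. It only shows the shape of the decision: score each source for a sub-question and fall back to a default when nothing matches.

```python
import re

# Hypothetical source names and keyword hints, chosen only for illustration.
SOURCE_HINTS = {
    "sql_database": ("revenue", "sales", "employees", "table"),
    "wikipedia": ("who", "where", "history", "founded"),
}
DEFAULT_SOURCE = "wikipedia"


def route(sub_question):
    """Pick the source whose hint words appear most often in the sub-question."""
    words = re.findall(r"[a-z]+", sub_question.lower())
    scores = {
        source: sum(words.count(hint) for hint in hints)
        for source, hints in SOURCE_HINTS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default source when no hint word matched at all.
    return best if scores[best] > 0 else DEFAULT_SOURCE


if __name__ == "__main__":
    for question in ("Find the company's revenue in 2023.",
                     "Who founded the company?"):
        print(question, "->", route(question))
```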

Reflection mechanism

If retrieval for a sub-question fails, DeepSieve automatically retries up to two times. The reflection process is recorded in the log, where the user can view the reason for the failure and the result of each retry.
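The retry behavior can be sketched as a bounded loop. This is an illustration under stated assumptions rather than DeepSieve's code: `retrieve` and `rewrite` are caller-supplied stand-ins, a failed retrieval is represented by `None`, and rephrasing the question before each retry is an assumption about what a reflection does.

```python
def retrieve_with_reflection(sub_question, retrieve, rewrite, max_reflexion_times=2):
    """Try retrieval once, then retry up to max_reflexion_times with a rewritten question.

    `retrieve` returns an answer, or None on failure. `rewrite` produces a new
    phrasing of a failed question. Both are hypothetical stand-ins.
    """
    answer = None
    attempts = []  # Log of every attempt, similar in spirit to the reflection log.
    question = sub_question
    for attempt in range(max_reflexion_times + 1):
        answer = retrieve(question)
        attempts.append({"attempt": attempt, "question": question, "answer": answer})
        if answer is not None:
            break
        if attempt < max_reflexion_times:
            question = rewrite(question)
    return answer, attempts


if __name__ == "__main__":
    knowledge = {"company revenue 2023": "$X"}
    answer, log = retrieve_with_reflection(
        "Company revenue 2023?",
        retrieve=knowledge.get,
        rewrite=lambda q: q.lower().rstrip("?"),
    )
    print(answer, len(log))
```

With the default limit of two reflections, a sub-question is attempted at most three times: the initial retrieval plus two retries.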

Answer fusion

DeepSieve consolidates the sub-question answers to generate the final response. For example, the answers to the company query above would be merged into:

"In 2023, the company had revenue of $X and Y employees."
The fusion prompt is saved in query_{i}_fusion_prompt.txt so that the user can inspect it.
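A fusion prompt typically lists the original question together with each sub-question and its answer. The sketch below shows one plausible layout. The wording and the function name are hypothetical, and the saved query_{i}_fusion_prompt.txt file remains the authoritative reference for what DeepSieve actually sends to the model.

```python
def build_fusion_prompt(query, sub_results):
    """Format the original query and (sub-question, answer) pairs into one prompt.

    The wording is hypothetical and only illustrates the structure.
    """
    lines = [f"Original question: {query}", "Sub-question answers:"]
    for number, (sub_question, answer) in enumerate(sub_results, start=1):
        # Make failed retrievals explicit so the model does not invent an answer.
        shown = answer if answer is not None else "(no answer found)"
        lines.append(f"{number}. {sub_question} -> {shown}")
    lines.append("Combine the answers above into one complete response.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_fusion_prompt(
        "Revenue and number of employees of a company in 2023",
        [("Find the company's revenue in 2023.", "$X"),
         ("Find the company's number of employees in 2023.", "Y")],
    ))
```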

Usage Notes

Data preparation: Ensure that the input data is in the correct format (e.g., CSV, JSON) to avoid encoding errors.
API key: Confirm that the LLM API key is valid and that the network connection is stable.
Log checking: After a run, check the outputs/ directory to analyze the performance metrics and error logs.
Community support: If you encounter problems, visit the GitHub Issues page or consult the arXiv paper for more information.

Application Scenarios

Academic research
Researchers can use DeepSieve to answer complex questions over multiple sources (e.g., Wikipedia and experimental databases), for example when analyzing biological datasets alongside genetic associations reported in the literature.
Business data analytics
Enterprise analysts can use DeepSieve to process sales data and customer logs and answer multidimensional questions such as "Which products had the highest sales and high customer satisfaction in 2023?".
Privacy-sensitive scenarios
DeepSieve supports private data sources (e.g., internal databases) and processes queries without merging the data, which suits the financial and healthcare industries.
Open-source development
Developers can take advantage of DeepSieve's modular design to extend its functionality or integrate it into existing systems for customized data processing.

QA

What data sources does DeepSieve support?
DeepSieve supports SQL tables, JSON logs, Wikipedia, and similar sources. For details, refer to the project documentation or configuration files.
How do I debug a runtime error?
Check the log files in the outputs/ directory for the error details. Make sure that the dependency versions are correct and that the API key is valid.
What is the difference between Graph mode and Naive mode?
Naive mode is suitable for simple queries and is fast; Graph mode is suitable for complex multi-step reasoning and offers higher accuracy at a higher computational cost.
How do I contribute code?
Fork the repository, modify the code, and submit a Pull Request. Follow the coding conventions described in CONTRIBUTING.md.
