
Making Dify "See" Pictures: Integrating MinerU-API for Knowledge Base OCR Resolution

2025-05-22


Users who upload important material (such as plain image files or scanned PDF documents) to Dify, the LLM application development platform, often run into a tricky problem: the Dify knowledge base cannot read or parse these non-text formats directly. This is mainly because the knowledge base's native functionality focuses on processing and understanding plain text. To overcome this limitation, you can introduce MinerU-API, a tool that gives the Dify knowledge base Optical Character Recognition (OCR) capabilities. The rest of this article describes how to build a workflow that lets the Dify knowledge base parse the text in images and scanned documents. This tutorial is based on Dify version 1.3.1.

Preparation

Two preparations need to be completed before you start building the workflow: deploying the MinerU-API service and creating a Dify knowledge base.

Deploying MinerU-API

MinerU-API is a tool that supports parsing documents in multiple formats (including OCR). For a detailed introduction and the steps to obtain the code, see the two related articles "Extracting PDF with MinerU in Dify" and "MinerU-API | Supporting Multi-Format Parsing to Further Enhance Dify's Document Capabilities". This article assumes that you have already obtained the MinerU-API code, and only briefly describes its Docker deployment command.

docker run -d --gpus all --network docker_ssrf_proxy_network --name mineru-api -v minerupaddleocr:/root/.paddleocr mineru-api:v0.3

This command starts a Docker container named mineru-api in the background and exposes all host GPUs to it (the --gpus all flag requires GPU support for Docker, such as the NVIDIA Container Toolkit, on the host). It also connects the container to the specified network and mounts a data volume that persists the PaddleOCR data.

Creating a Dify Knowledge Base

First, create a new knowledge base on the Dify platform. The creation process involves configuring an Embedding model, which converts text into high-dimensional vectors for semantic understanding and similarity calculation, and a Rerank model, which reorders the initial retrieval results to improve the accuracy and relevance of the final answers.

Figure 1: Create Dify knowledge base interface

Once the knowledge base has been created, open it and copy the knowledge base ID from the browser's address bar. This ID is an important parameter for subsequent API calls.

Figure 2: Getting the knowledge base ID from the browser address bar
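
Copying the ID by hand is easy to get wrong, so it can also be pulled out of the URL programmatically. The following is a minimal sketch, assuming the console URL has the shape .../datasets/<uuid>/documents; that URL shape, the example address, and the helper name extract_dataset_id are assumptions for illustration and are not taken from the article.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape: <host>/datasets/<uuid>/... (not stated in the article).
_DATASET_PATH = re.compile(r"/datasets/([0-9a-fA-F-]{36})(?:/|$)")


def extract_dataset_id(url: str) -> str:
    """Return the knowledge base (dataset) ID found in a Dify console URL."""
    match = _DATASET_PATH.search(urlparse(url).path)
    if match is None:
        raise ValueError(f"no dataset ID found in URL: {url}")
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical address copied from the browser's address bar.
    example = "http://localhost/datasets/123e4567-e89b-12d3-a456-426614174000/documents"
    print(extract_dataset_id(example))
```

If the URL in your deployment looks different, adjust the regular expression rather than the rest of the helper.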

Next, navigate to the Knowledge Base -> API Settings screen and generate a new API key. This key is used to authorize the operations that the workflow performs on the knowledge base.

Figure 3: Generate Knowledge Base API Key Interface

Building MinerU Knowledge Base Workflows

Workflow Overview

The workflow consists of three code execution nodes that work together to parse an image or scanned document and store it in the knowledge base.

Figure 4: Overview of MinerU knowledge base workflow

The function of each of the three code nodes is as follows:

Process Parameters: This node prepares the parameters required to call Dify's create-document interface (/datasets/{dataset_id}/document/create-by-text).
MinerU Extraction: This node calls the MinerU-API service, which uses OCR to convert the incoming PDF or image file into plain text in Markdown format.
Knowledge Base - Document Creation: This node calls the Dify platform's /datasets/{dataset_id}/document/create-by-text API to create a new document in the knowledge base from the text that MinerU extracted in the previous step. The following is sample Python code for this node:

import requests


def main(api_key, file_name, content, api_params, dataset_id):
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json',
    }
    # Update the API parameters with the file name and the extracted text
    api_params.update({
        "name": file_name,
        "text": content,
    })
    # Build the request URL for the Dify API
    # Note: in a real deployment, 'http://api:5001' may need to be changed
    # to the actual address and port of the Dify service
    url = f'http://api:5001/v1/datasets/{dataset_id}/document/create-by-text'
    response = requests.post(
        url,
        headers=headers,
        json=api_params,
    )
    # Return the raw response body as the node's output
    return {"result": response.text}
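
The article does not show the code of the Process Parameters node. The following is a minimal sketch of what such a node might return, assuming the create-by-text interface accepts an indexing_technique field ("high_quality" or "economy") and a process_rule field with an automatic mode, as I understand the Dify knowledge base API. The default values chosen here are assumptions; the output name api_params matches the argument used by the document creation node above.

```python
def main(indexing_technique: str = "high_quality") -> dict:
    """Build the parameters shared by every create-by-text call.

    The field names follow the Dify knowledge base API as understood here;
    check them against the API documentation of your Dify version.
    """
    allowed = ("high_quality", "economy")
    if indexing_technique not in allowed:
        raise ValueError(f"indexing_technique must be one of {allowed}")

    api_params = {
        # How the document is indexed (assumed default: embedding-based).
        "indexing_technique": indexing_technique,
        # Let Dify apply its automatic cleaning and segmentation rules.
        "process_rule": {"mode": "automatic"},
    }
    # A Dify code node returns a dict whose keys become output variables.
    return {"api_params": api_params}
```

The document creation node then adds name and text to this dictionary before sending the request.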

Effectiveness Test

To verify the workflow, take a PDF document printed directly from a web page as an example and compare two results: uploading the PDF directly to the Dify knowledge base, and processing it through the newly created MinerU workflow.

Result of uploading directly to the knowledge base:

Figure 5: State of the PDF document after uploading it directly to the Dify knowledge base

As the figure above shows, even though the document was uploaded successfully, Dify's native knowledge base functionality could not parse any of the text in the scanned PDF, leaving the document virtually blank in the knowledge base.

Result of creating the document through the MinerU workflow:

Figure 6: Execution results of processing and creating documents through MinerU workflow

The figure above shows that the MinerU workflow executed successfully and the API call returned a successful result. At this point, you can go to the knowledge base to view the newly created document.

Figure 7: Viewing a document created by a MinerU workflow in the Dify Knowledge Base

After a document is created through the workflow and imported into the knowledge base, Dify automatically processes it for indexing. Once indexing is complete, a recall test can be performed to check whether the knowledge base can answer questions or retrieve information based on the text in the images.
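
Waiting for indexing can also be checked from code instead of the console. The sketch below assumes that the create-by-text response contains a batch field and that the status can be queried at /datasets/{dataset_id}/documents/{batch}/indexing-status, returning a data list whose entries carry an indexing_status value. These field names and the endpoint reflect my reading of the Dify knowledge base API and are not confirmed by the article; the helper names are hypothetical.

```python
import json


def batch_from_create_result(result_text: str) -> str:
    """Read the batch identifier from the create-by-text response body."""
    payload = json.loads(result_text)
    return payload["batch"]  # assumed field name


def indexing_status_url(base_url: str, dataset_id: str, batch: str) -> str:
    """Build the (assumed) indexing-status URL for a document batch."""
    return f"{base_url.rstrip('/')}/datasets/{dataset_id}/documents/{batch}/indexing-status"


def indexing_finished(status_payload: dict) -> bool:
    """Return True only if every document in the batch reports 'completed'."""
    entries = status_payload.get("data", [])
    return bool(entries) and all(
        entry.get("indexing_status") == "completed" for entry in entries
    )
```

A caller would fetch the URL with the same Bearer header used by the document creation node, parse the JSON body, and repeat with a short sleep until indexing_finished returns True.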

Figure 8: Recall test on a document processed and stored by the MinerU workflow

The test results show that the text in documents processed by the MinerU workflow was successfully extracted and indexed, so the Dify knowledge base can now understand and answer questions about information that originally existed only as images. This significantly enhances the knowledge base's ability to handle diverse document types.
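
The recall test in the console can be approximated with a script. The following is a minimal sketch, assuming the knowledge base API offers a /datasets/{dataset_id}/retrieve interface that accepts a query field and returns a records list whose entries contain a segment with content and a score. The endpoint and field names are assumptions based on my understanding of the Dify API, and both helper names are hypothetical.

```python
def build_retrieve_request(base_url: str, dataset_id: str, api_key: str, query: str) -> dict:
    """Assemble keyword arguments for requests.post without sending anything."""
    return {
        "url": f"{base_url.rstrip('/')}/datasets/{dataset_id}/retrieve",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {"query": query},
    }


def top_segments(response_payload: dict, limit: int = 3) -> list:
    """Return the text of the highest-scoring retrieved segments."""
    records = response_payload.get("records", [])
    # Records without a score are treated as 0.0 so sorting never fails.
    ranked = sorted(records, key=lambda r: r.get("score") or 0.0, reverse=True)
    return [record["segment"]["content"] for record in ranked[:limit]]
```

With the requests library installed, the call would look like requests.post(**build_retrieve_request(...)), and the parsed JSON body can be passed to top_segments to see whether text that came from the image is returned.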

May not be reproduced without permission: Chief AI Sharing Circle » Making Dify "See" Pictures: Integrating MinerU-API for Knowledge Base OCR Resolution
