OceanBase SeekDB is a vector retrieval engine in the core of the OceanBase database, built specifically for the AI era. Rather than being a standalone database product that must be operated and maintained separately, SeekDB embeds the capabilities of a vector database within the mature distributed relational database OceanBase. From a first-principles perspective, SeekDB gives a relational database the ability to "understand" unstructured data (e.g., vectors generated from text, images, and audio). It addresses the data fragmentation, synchronization delay, and operational complexity that arise in traditional AI application architectures, where a relational database stores business data and a dedicated vector database stores feature data. Based on OceanBase's distributed architecture, SeekDB inherits its high availability, strong consistency, and horizontal scalability, supports vector storage of up to 16,000 dimensions, provides high-performance indexing algorithms such as HNSW and IVF, and completes Top-K queries with very low latency on massive data, making it an ideal data foundation for building RAG (retrieval-augmented generation), recommender systems, and multimodal search applications.
Feature List
- Native vector data type support: Define VECTOR-type columns directly in the table schema. They support dense vector storage of up to 16,000 dimensions and coexist with traditional data types such as INT and VARCHAR.
- High-performance vector indexes: Built-in indexing algorithms such as HNSW (Hierarchical Navigable Small World) and IVF (Inverted File), together with support for several distance metrics such as L2 (Euclidean distance), inner product, and cosine similarity, help deliver high recall and low latency in search.
- SQL+AI hybrid search: A single SQL statement can contain both a WHERE clause for scalar filtering (e.g., price range, product category) and an ORDER BY on vector similarity. The database kernel automatically optimizes the execution plan, avoiding multiple round trips at the application layer.
- Distributed horizontal scaling: Relying on OceanBase's native distributed architecture, vector data can be sharded and automatically distributed across multiple servers. Adding nodes increases storage capacity and retrieval performance linearly, making it possible to handle tens of billions of vectors.
- Fully compatible with the MySQL protocol: Without having to learn a specialized vector database API, developers can manipulate vector data directly using any client that supports the MySQL protocol (e.g., JDBC, PyMySQL) or an ORM framework.
- Seamless integration with the AI ecosystem: A Python SDK lets developers quickly connect OceanBase to the data pipeline of large-model applications, and it is adapted to mainstream AI development frameworks such as LangChain, LlamaIndex, and DB-GPT.
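To make the three distance metrics named in the list concrete, here is a minimal plain-Python sketch of how each one is computed. The function names are illustrative only and are not the database's own SQL functions; the sketch assumes both vectors have the same length and are non-zero where cosine similarity is used.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner (dot) product: larger means more similar for normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [0.1, 0.2, 0.8]
b = [0.2, 0.2, 0.9]
print("L2 distance:      ", l2_distance(a, b))
print("Inner product:    ", inner_product(a, b))
print("Cosine similarity:", cosine_similarity(a, b))
```

L2 distance is sensitive to vector magnitude, while cosine similarity only compares direction, so the right choice usually depends on how the embedding model was trained.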
Usage Guide
The OceanBase SeekDB experience is designed to be very close to that of a traditional MySQL database, with the core differences being the definition and indexing of vector data. Below is a detailed description of the complete process from environment preparation to the completion of a hybrid search.
1. Environment preparation and installation
SeekDB's capabilities are included in OceanBase 4.3.3 and above. For developers, the fastest way to get started is to deploy a standalone version of OceanBase using Docker containers.
Step 1: Start the OceanBase Container
Make sure your machine has Docker installed and at least 8GB of memory allocated (vector computing has some memory requirements). Execute the following command in the terminal to pull and launch the latest version of the image:
docker run -p 2881:2881 --name oceanbase-ce -e MINI_MODE=1 -d oceanbase/oceanbase-ce:latest
Note: The MINI_MODE=1 parameter is used to boot in minimal mode on resource-constrained machines.
Step 2: Connect to the database
Once the container has started (usually takes 1-2 minutes to initialize), you can connect using any MySQL client. Here is an example of a command-line MySQL client:
# Enter the container
docker exec -it oceanbase-ce bash
# Connect to the database (no password by default, port 2881)
obclient -h127.0.0.1 -P2881 -uroot@test
Note: In a production environment, it is recommended to create a non-root user and set a strong password.
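Because initialization can take a minute or two, a script may want to wait until the mapped port accepts connections before trying to log in. The helper below is a rough sketch using only the Python standard library; `wait_for_port` is a hypothetical name, and an open port is only a hint that the server is up, not a guarantee that the tenant has finished bootstrapping.

```python
import socket
import time

def wait_for_port(host, port, timeout_s=180.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the container started above, a call such as `wait_for_port("127.0.0.1", 2881)` would block until the port opens or roughly three minutes pass.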
2. Operational flow of the vector table
We will simulate a "book recommendation system" scenario by creating a table containing basic information about the book (title, price) and a vector of features about the book.
Step 1: Create a table that supports vectors
When creating a table, use the VECTOR(<dimension>) syntax to define vector columns. Suppose we need to store 3-dimensional vectors (in real-world scenarios this is usually 768 or 1536 dimensions):
CREATE TABLE books (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10, 2),
    embedding VECTOR(3) -- define a 3-dimensional vector column
);
Step 2: Insert data
When inserting vector data, represent each vector as a string in the format [v1, v2, v3, ...].
INSERT INTO books (title, category, price, embedding) VALUES
    ('Deep Learning with Python', 'Tech', 59.90, '[0.1, 0.2, 0.8]'),
    ('Just an Ordinary Book', 'Fiction', 29.90, '[0.8, 0.1, 0.1]'),
    ('Database Internals', 'Tech', 89.00, '[0.2, 0.2, 0.9]'),
    ('The Complete Cookbook', 'Life', 45.00, '[0.1, 0.9, 0.1]');
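In application code the vector usually arrives as a list of floats from an embedding model, so it has to be serialized into the string format described above. The following is a small sketch of one way to do that; `to_vector_literal` is a hypothetical helper, and the dimension check simply mirrors the VECTOR(3) column defined earlier.

```python
import json
import math

def to_vector_literal(values, dim):
    """Serialize a sequence of numbers into the '[v1, v2, ...]' string form."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    floats = [float(v) for v in values]
    if not all(math.isfinite(v) for v in floats):
        raise ValueError("vector components must be finite numbers")
    return json.dumps(floats)

print(to_vector_literal([0.1, 0.2, 0.8], dim=3))  # prints [0.1, 0.2, 0.8]
```

Validating on the client side is optional, but it tends to give clearer error messages than a failed INSERT.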
3. Creating vector indexes
To speed up retrieval, create an index on the vector column. SeekDB supports the creation of IVFFLAT or HNSW indexes.
Parameter Description:
- distance: Distance metric. Options are l2 (Euclidean distance), inner_product (inner product), and cosine (cosine similarity).
- type: Index algorithm type. hnsw is recommended.
- lib: Vector library implementation. Defaults to vsag.
-- Create an index based on the HNSW algorithm, using L2 distance
CREATE VECTOR INDEX idx_book_embedding ON books(embedding)
WITH (distance=l2, type=hnsw, lib=vsag);
Note: The construction of vector indexes may consume some memory resources, and the construction speed depends on the amount of data.
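To get a feel for the memory involved, the sketch below estimates the size of the raw vector data alone. It assumes each component is stored as a 4-byte single-precision float and ignores index overhead (for example, HNSW graph links), so real usage will be higher. The function names are illustrative.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Bytes needed for the vector components only (no index overhead)."""
    return num_vectors * dim * bytes_per_component

def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / (1024 ** 3)

for n, d in [(1_000_000, 768), (10_000_000, 768), (10_000_000, 1536)]:
    size = to_gib(raw_vector_bytes(n, d))
    print(f"{n:>11,} vectors x {d:>4} dims ~ {size:.1f} GiB of raw vector data")
```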
4. Implementation of Hybrid Search
This is the most powerful feature of SeekDB. Suppose we are looking for "content related to computer technology (vectors close to [0.15, 0.2, 0.85]) and priced under $100".
Writing the SQL statement:
Use the l2_distance function to calculate the distance, and combine it with a WHERE clause to filter on price and category.
SELECT title, price, l2_distance(embedding, '[0.15, 0.2, 0.85]') AS distance
FROM books
WHERE price < 100 AND category = 'Tech' -- scalar filter conditions
ORDER BY distance ASC -- sort by similarity (smaller distance means more similar)
LIMIT 5;
Interpreting the execution:
The database engine can first use scalar indexes to quickly find the rows that satisfy price < 100 and category = 'Tech' and then rank that candidate set by vector distance, or it can first find approximate nearest neighbors using the vector index and then filter them (depending on the optimizer's cost estimate). There is no need to manually write complex "filter and sort" logic.
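The two strategies described above can be illustrated in plain Python over the four sample rows. This is only a conceptual sketch: it uses exact brute-force distances instead of an approximate index, identifies rows by their auto-increment id, and the function names `pre_filter` and `post_filter` are illustrative rather than anything exposed by the database.

```python
import math

# Sample rows mirroring the books table: (id, category, price, embedding)
ROWS = [
    (1, "Tech", 59.90, [0.1, 0.2, 0.8]),
    (2, "Fiction", 29.90, [0.8, 0.1, 0.1]),
    (3, "Tech", 89.00, [0.2, 0.2, 0.9]),
    (4, "Life", 45.00, [0.1, 0.9, 0.1]),
]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matches(row):
    """Scalar conditions from the WHERE clause."""
    _, category, price, _ = row
    return price < 100 and category == "Tech"

def pre_filter(rows, query, limit):
    """Apply the scalar filter first, then rank the survivors by distance."""
    survivors = [r for r in rows if matches(r)]
    return sorted(survivors, key=lambda r: l2(r[3], query))[:limit]

def post_filter(rows, query, limit, candidates):
    """Take the nearest `candidates` rows first, then apply the scalar filter."""
    nearest = sorted(rows, key=lambda r: l2(r[3], query))[:candidates]
    return [r for r in nearest if matches(r)][:limit]

query = [0.15, 0.2, 0.85]
print("pre-filter: ", [r[0] for r in pre_filter(ROWS, query, limit=5)])
print("post-filter:", [r[0] for r in post_filter(ROWS, query, limit=5, candidates=3)])
print("post-filter, 1 candidate:", [r[0] for r in post_filter(ROWS, query, limit=5, candidates=1)])
```

With a small candidate set, filtering after the vector search can return fewer rows than requested, which is one reason the choice between the two plans is left to a cost-based optimizer.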
5. Use in Python
Python is commonly used when developing AI applications in practice. The following is a complete example using PyMySQL:
import json
import pymysql
# 1. Establish the connection
conn = pymysql.connect(
    host='127.0.0.1',
    port=2881,
    user='root@test',
    password='',
    database='test',
    autocommit=True,
)
cursor = conn.cursor()
# 2. Prepare the query vector (simulating the output of an embedding model)
query_vector = [0.15, 0.2, 0.85]
query_vector_str = json.dumps(query_vector)
# 3. Execute the hybrid search SQL
sql = """
SELECT title, l2_distance(embedding, %s) AS dist
FROM books
WHERE category = 'Tech'
ORDER BY dist ASC
LIMIT 3
"""
cursor.execute(sql, (query_vector_str,))
results = cursor.fetchall()
# 4. Print the results
print("Recommended books:")
for row in results:
    print(f"Title: {row[0]}, distance: {row[1]}")
cursor.close()
conn.close()
With the above process, you have successfully implemented a vector-based semantic search function in OceanBase without involving any external vector database.
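In a real application the scalar filters often vary per request. One possible way to handle this, sketched below, is a small helper that assembles the SQL text and its parameter tuple together so that every value is still passed as a bound parameter. `build_hybrid_query` is a hypothetical helper that assumes the books table and the l2_distance function used above.

```python
import json

def build_hybrid_query(query_vector, category=None, max_price=None, limit=5):
    """Return (sql, params) for a filtered nearest-neighbor query on `books`."""
    conditions = []
    params = [json.dumps(query_vector)]  # first %s: the query vector
    if category is not None:
        conditions.append("category = %s")
        params.append(category)
    if max_price is not None:
        conditions.append("price < %s")
        params.append(max_price)
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    sql = (
        "SELECT title, price, l2_distance(embedding, %s) AS dist "
        f"FROM books {where}"
        "ORDER BY dist ASC "
        "LIMIT %s"
    )
    params.append(int(limit))  # last %s: the row limit
    return sql, tuple(params)

sql, params = build_hybrid_query([0.15, 0.2, 0.85], category="Tech", max_price=100, limit=3)
print(sql)
print(params)
```

The returned pair can be passed straight to `cursor.execute(sql, params)` with a PyMySQL cursor like the one in the example above.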
Application Scenarios
- Enterprise Private Knowledge Base Question Answering (RAG)
Organizations split their internal documents (PDFs, wikis, code repositories) into chunks and convert them into vectors stored in OceanBase. When an employee asks a question, the system vectorizes the question, retrieves the most relevant document fragments from the database through vector search, and generates an accurate answer with the help of an LLM. The advantage of OceanBase is that the data never leaves the database, and permission management directly reuses the database's ACL mechanism, which protects enterprise data.
- Multimodal Image Search
In an e-commerce scenario, the image features of each product are extracted as vectors and stored in the database. When a user uploads a picture, the system quickly retrieves products with similar visual features through SeekDB. Combined with SQL's scalar capabilities, it can also layer on business filters such as "in stock only" and "price between $50 and $100" to provide a precise shopping guide experience.
- Personalized Recommendation System
User browsing history and behavioral data are used to generate user profile vectors, which are matched against vectors from content libraries (articles, videos). SeekDB can pick out the entries closest to a user's interest vector from millions of pieces of content in milliseconds, enabling a real-time "Guess You Like" feature and improving user retention.
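For the RAG scenario above, the step of splitting documents into fragments before embedding can be sketched as a simple overlapping character window. This is only one of many possible chunking strategies; `chunk_text` and its default sizes are illustrative assumptions, not part of SeekDB.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "OceanBase stores business data and vectors in the same table. " * 5
for i, chunk in enumerate(chunk_text(sample, chunk_size=120, overlap=20)):
    print(i, len(chunk))
```

Each chunk would then be passed to an embedding model and inserted as one row with its vector.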
FAQ
- Why choose OceanBase SeekDB instead of a specialized vector database (e.g., Milvus, Pinecone)?
Specialized vector databases are usually deployed independently, which means you need to maintain two data systems (relational DB + vector DB). This not only increases operation and maintenance costs but also introduces consistency problems in data synchronization. SeekDB takes an "all-in-one" approach, handling transactions, analytics, and vector retrieval in the same database kernel, which greatly simplifies the technical architecture of AI applications and directly inherits OceanBase's financial-grade high availability.
- What index types does SeekDB support, and how do they perform?
Currently, it mainly supports the HNSW family (suitable for scenarios with high performance requirements and sufficient memory) and the IVF family (suitable for scenarios with very large data volumes and tight memory). In the standard ANN-Benchmarks tests, OceanBase's vector retrieval engine shows strong QPS and recall at a scale of 10 million vectors, which can meet the real-time requirements of most online inference services (latency is usually in the 10-50 ms range).
- Can vector columns be added to an existing table?
Yes. Just as with ordinary columns, you can use an ALTER TABLE statement to add a VECTOR-type column to an existing business table and create an index on that column. This makes AI retrofitting of legacy systems very smooth, without the need to migrate data.