Commit 7158ac6

docs: add proposal for vector index (#55707)
close #55706
1 parent b72b6c2 commit 7158ac6

2 files changed, +285 -0 lines changed

@@ -0,0 +1,285 @@
# Proposal: Vector Index

* Author: [zimulala](https://github.com/zimulala)
* Tracking issue: https://github.com/pingcap/tidb/issues/55693

## **Introduction**

A vector index is a time- and space-efficient data structure built over vector data, based on a mathematical quantization model, that efficiently retrieves the vectors most similar to a target vector. Currently, we plan to support the HNSW family of vector indexing methods.

## **Motivation or Background**

This document proposes handling vector indexes in the same way as ordinary indexes. The specific functions are as follows:

* Support adding and using vector indexes, including adding vector indexes during and after table creation.
* Support dropping and renaming vector indexes.

## **Detailed Design**

The process of adding a vector index is similar to that of adding an ordinary index. However, since the actual vector index data is stored in TiFlash, there is no step that populates index data in TiKV.
The following figure takes the add vector index operation as an example and briefly describes its execution process.

![Figure 1: flow chart](./imgs/vector-index.png)

### **TiDB**

#### **Parser**

Syntax required to create a vector index:

* Add a new `ConstraintElem` syntax so that a vector index can be created when creating a table.

```sql
CREATE TABLE foo (
    id       INT PRIMARY KEY,
    data     VECTOR(5),
    data64   VECTOR64(10),
    -- "WITH OPTION" is not supported in Phase 1
    VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data))) [WITH OPTION "m=16, ef_construction=64"]
);
```

* Add the `VECTOR` index type to `IndexKeyTypeOpt` and the `HNSW` option type to `IndexTypeOpt`, so that a vector index can be added after the table is created.

```sql
CREATE VECTOR INDEX idx_name USING HNSW ON foo ((VEC_COSINE_DISTANCE(data)))
-- Proposal 1, "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
-- Proposal 2, "VECTOR_INDEX_PARAM" is not supported in Phase 1
[VECTOR_INDEX_PARAM "m=16, ef_construction=64"];

ALTER TABLE foo ADD VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data)))
-- "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
```

* The syntax is similar to that of a normal index, and there are several ways to create an index:

```sql
CREATE VECTOR INDEX idx ON t ((VEC_COSINE_DISTANCE(a))) USING HNSW;
CREATE VECTOR INDEX IF NOT EXISTS idx ON t ((VEC_COSINE_DISTANCE(a))) TYPE HNSW;
CREATE VECTOR INDEX ident ON db.t (ident, ident ASC) TYPE HNSW;

ALTER TABLE t ADD VECTOR ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX IF NOT EXISTS ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
```

#### **Planner && Executor**

* Support choosing the vector index through the CBO and the `USE INDEX` hint.
* The vector index only returns the top n rows ranked by similarity. The following is the syntax for using a vector index when querying.

```sql
SELECT *
FROM foo
ORDER BY VEC_COSINE_DISTANCE(data, '[3,1,2]')
LIMIT 5;
```

Currently, using a vector index in queries is supported on the master branch, including `EXPLAIN`.

* Protobuf adds a new `indexId` field for the interaction between TiDB and TiFlash.

```go
type ANNQueryInfo struct {
	QueryType        ANNQueryType         `protobuf:"varint,1,opt,name=query_type,json=queryType,enum=tipb.ANNQueryType" json:"query_type"`
	DistanceMetric   VectorDistanceMetric `protobuf:"varint,2,opt,name=distance_metric,json=distanceMetric,enum=tipb.VectorDistanceMetric" json:"distance_metric"`
	TopK             uint32               `protobuf:"varint,3,opt,name=top_k,json=topK" json:"top_k"`
	ColumnName       string               `protobuf:"bytes,4,opt,name=column_name,json=columnName" json:"column_name"`
	ColumnId         int64                `protobuf:"varint,5,opt,name=column_id,json=columnId" json:"column_id"`
	RefVecF32        []byte               `protobuf:"bytes,6,opt,name=ref_vec_f32,json=refVecF32" json:"ref_vec_f32,omitempty"`
	MaxDistance      float64              `protobuf:"fixed64,10,opt,name=max_distance,json=maxDistance" json:"max_distance"`
	HnswEfSearch     uint32               `protobuf:"varint,20,opt,name=hnsw_ef_search,json=hnswEfSearch" json:"hnsw_ef_search"`
	XXX_unrecognized []byte               `json:"-"`

	// new fields
	// NOTE: the proposal text repeated ColumnId's tag for this field. The names
	// below follow the `indexId` field named above; the field number 5 is kept
	// from the proposal and collides with ColumnId, so the real definition needs
	// its own unused field number.
	IndexId int64 `protobuf:"varint,5,opt,name=index_id,json=indexId" json:"index_id"`
}
```

The `ANALYZE` statement is not supported at the moment.
102+
103+
#### **DDL**
104+
105+
##### **Limitations and Constraints**
106+
107+
* A replica of TiFlash is required
108+
* Cluster must have a TiFlash replica
109+
* About the TiFlash replica of the table for which the vector index needs to be created
110+
* When a table is created with an index, the TiFlash replica of the table is set to 1 by default in the first stage (serverless is greater than or equal to minReplica in the configuration item).
111+
* After creating a table, creating an index requires a TiFlash replica.
112+
* Cannot be created on TiFlash nodes with static encryption enabled
113+
* Cannot be a primary key index
114+
* Cannot be a composite index
115+
* Create several index constraints for the same column
116+
* There can be different vector indexes defined using different `distance functions`
117+
* Different vector indexes defined by the same `distance function` will not exist
118+
* Same name processing as the ordinary index
119+
* Vector index requires support for functions
120+
121+
| function | Phase 1 | Phase 2 | Phase 3 |
122+
| :---- | :---- | :---- | :---- |
123+
| VEC\_L1\_DISTANCE | | | TBD |
124+
| VEC\_L2\_DISTANCE | v | | |
125+
| VEC\_NEGATIVE\_INNER\_PRODUCT(名字待定) VEC\_NEGATIVE\_INNER\_PRODUCT (name to be determined) | | v | |
126+
| VEC\_COSINE\_DISTANCE | v | | |
127+
128+
* Incompatibility with other features(This operation is not supported in the first phase of this design, but it can be considered for support in the future).
129+
* Setting invisible properties is not supported
130+
* Dropping a column with a vector index is not supported, and creating multiple indexes together is not supported. Also, modifying column types with vector indexes is not supported (lossy changes).
131+
* Workaround, split it into multiple operations: for example, we can first drop all vector indexes on the column, and then drop the column.
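
The constraints above can be read as a sequence of checks performed before an add-index job is accepted. The Go sketch below is one hypothetical way to express them for the "add to an existing table" case; `VectorIndexSpec`, `TableState`, and `checkVectorIndex` are illustrative names that do not come from the TiDB code base, and the set of accepted functions simply mirrors the Phase 1 column of the table.

```go
package main

import (
	"errors"
	"fmt"
)

// VectorIndexSpec is a hypothetical, simplified description of a requested vector index.
type VectorIndexSpec struct {
	Columns      []string // columns referenced by the index definition
	DistanceFunc string   // e.g. "VEC_COSINE_DISTANCE"
	IsPrimaryKey bool
}

// TableState is a hypothetical summary of what the checks need to know about the table.
type TableState struct {
	TiFlashReplicas int
	// Existing maps a column name to the distance functions that already have a vector index on it.
	Existing map[string][]string
}

// phase1Funcs lists the distance functions marked for Phase 1 in the proposal's table.
var phase1Funcs = map[string]bool{
	"VEC_L2_DISTANCE":     true,
	"VEC_COSINE_DISTANCE": true,
}

// checkVectorIndex returns nil when the spec satisfies the listed constraints.
func checkVectorIndex(tbl TableState, spec VectorIndexSpec) error {
	switch {
	case tbl.TiFlashReplicas < 1:
		return errors.New("table has no TiFlash replica")
	case spec.IsPrimaryKey:
		return errors.New("vector index cannot be a primary key index")
	case len(spec.Columns) != 1:
		return errors.New("vector index cannot be a composite index")
	case !phase1Funcs[spec.DistanceFunc]:
		return fmt.Errorf("distance function %s is not supported in Phase 1", spec.DistanceFunc)
	}
	col := spec.Columns[0]
	for _, fn := range tbl.Existing[col] {
		if fn == spec.DistanceFunc {
			return fmt.Errorf("column %s already has a vector index using %s", col, fn)
		}
	}
	return nil
}

func main() {
	tbl := TableState{
		TiFlashReplicas: 1,
		Existing:        map[string][]string{"data": {"VEC_COSINE_DISTANCE"}},
	}
	l2 := VectorIndexSpec{Columns: []string{"data"}, DistanceFunc: "VEC_L2_DISTANCE"}
	cos := VectorIndexSpec{Columns: []string{"data"}, DistanceFunc: "VEC_COSINE_DISTANCE"}
	fmt.Println(checkVectorIndex(tbl, l2))  // a different distance function on the same column
	fmt.Println(checkVectorIndex(tbl, cos)) // the same distance function on the same column
}
```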

##### **Meta Information**

Add the `VectorIndexInfo` information field to describe and store information about vector indexes. In addition, add relevant information to the existing `IndexInfo` to record information about vector indexes.

```go
// VectorIndexInfo is the information on the vector index of a column.
type VectorIndexInfo struct {
	// Kind is the kind of vector index. Currently, only HNSW is supported.
	Kind VectorIndexKind `json:"kind"`
	// Dimension is the dimension of the vector.
	Dimension uint64 `json:"dimension"` // Set to 0 when initially parsed from comment. Will be assigned to flen later.
	// DistanceMetric is the distance metric used by the index.
	DistanceMetric DistanceMetric `json:"distance_metric"`
}

// IndexInfo provides meta data describing a DB index.
type IndexInfo struct {
	ID   int64 `json:"id"`
	Name CIStr `json:"idx_name"` // Index name.
	...
	// VectorInfo is the vector index information.
	VectorInfo *VectorIndexInfo `json:"is_vector"`
}
```

##### **Add Vector Index**

* Consider adding an `ActionAddVectorIndex` DDL job type, so that adding a vector index goes through the existing DDL framework.
    * The intermediate states are the same as for adding a normal index (None -> Delete-Only -> Write-Only -> Write-Reorg -> Public).
    * In the Write-Reorg state, do not use the `runReorgJob` mechanism of this state; that is, do not start backfilling or the background logic that checks execution progress.
        * Sync `VectorIndexInfo` to TiFlash.
        * Wait for TiFlash to fill in the HNSW index data.
            * The operation is complete when the `ROWS_STABLE_NOT_INDEXED` value for the table of the added vector index in the `system.dt_local_indexes` table on TiFlash is 0.
            * Obtain error info from indexStat to determine whether an error occurred during execution.
* Compatible with Cancel, Pause, and Resume operations
    * The Pause operation needs to consider pausing the operation during the reorg phase.
        * Not supported in the first phase
* Implement Rollback-related operations.
    * The `IsRollbackable` check is the same as for a general index.
    * As with general indexes, a rollback during the Write-Reorg phase also needs to delete the index data that has already been added.
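
The state progression and the "wait for TiFlash instead of backfilling" rule can be sketched as a small state machine. The Go code below is a simplified model under stated assumptions, not the DDL framework's actual code: `SchemaState`, `IndexStat`, and `step` are hypothetical, and the progress reports are simulated rather than read from TiFlash.

```go
package main

import (
	"errors"
	"fmt"
)

// SchemaState models the index states named in the proposal.
type SchemaState int

const (
	StateNone SchemaState = iota
	StateDeleteOnly
	StateWriteOnly
	StateWriteReorg
	StatePublic
)

// IndexStat is a hypothetical view of one progress row reported by TiFlash.
type IndexStat struct {
	RowsStableNotIndexed int64
	ErrorMsg             string
}

// step advances the job by at most one state. In Write-Reorg it performs no
// backfill; it only inspects the progress reported by TiFlash.
func step(state SchemaState, stats []IndexStat) (SchemaState, error) {
	switch state {
	case StateNone, StateDeleteOnly, StateWriteOnly:
		return state + 1, nil
	case StateWriteReorg:
		if len(stats) == 0 {
			return state, nil // no report yet; the caller retries later
		}
		for _, s := range stats {
			if s.ErrorMsg != "" {
				return state, errors.New(s.ErrorMsg) // the caller rolls back
			}
			if s.RowsStableNotIndexed > 0 {
				return state, nil // still building; the caller retries later
			}
		}
		return StatePublic, nil
	default:
		return state, nil
	}
}

func main() {
	// Simulated reports: the first poll still has unindexed rows, the second is done.
	polls := [][]IndexStat{{{RowsStableNotIndexed: 100}}, {{RowsStableNotIndexed: 0}}}
	state, poll := StateNone, 0
	for state != StatePublic {
		var stats []IndexStat
		if state == StateWriteReorg {
			stats = polls[poll]
			poll++
		}
		next, err := step(state, stats)
		if err != nil {
			fmt.Println("rollback:", err)
			return
		}
		fmt.Println(state, "->", next)
		state = next
	}
}
```

Keeping the Write-Reorg branch free of side effects makes it safe to call repeatedly, which fits a job that is re-run until TiFlash reports completion.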

##### **Drop Vector Index**

* No new DDL job type is needed; the operation is similar to dropping a general index.
    * The intermediate states are the same as for dropping a general index (Public -> Write-Only -> Delete-Only -> Delete-Reorg -> None).
    * If the index is a vector index:
        * Delete-Reorg state
            * No need to wait for the HNSW index data deletion on TiFlash to complete.
            * There is no need to add this metadata to `mysql.gc_delete_range`.
* Compatible with Cancel, Pause, and Resume operations
    * The Pause operation needs to consider pausing the operation during the reorg phase.
        * Not supported in the first phase
* Implement Rollback-related operations.
    * The `IsRollbackable` check is the same as for a general index.

##### **Rename Vector Index**

No new DDL job type is needed; the operation is the same as renaming a general index. There is no need to wait for TiFlash to synchronize; the statement returns directly after the TiKV operation is completed.

#### **Display**

* Compatible with `admin show ddl` to display the progress of DDL operation execution
    * The display is the same as for general DDL jobs. The `ROWS_STABLE_INDEXED` value of the corresponding table in the `information_schema.tiflash_indexes` table on TiFlash can be filled into the `ROW_COUNT` field of the DDL job.
* Add or update relevant system table information
    * information\_schema.tiflash\_tables (already supported)
    * information\_schema.tiflash\_segments (already supported)
    * information\_schema.tiflash\_indexes (already supported)
        * Add columns index\_name and index\_id.
        * Add columns error\_msg and error\_count.
    * information\_schema.tidb\_indexes
* Support `show create table`
* The `admin check table/index` and `admin repair table` operations do not currently handle vector indexes.
* Add relevant metrics
    * Because `ActionAddVectorIndex` is a new DDL job type, the monitoring information for adding a vector index needs to be handled in the same way as for adding a general index.
    * Because dropping or renaming a vector index does not add a DDL job type, monitoring these two operations does not require special handling.

### **TiFlash**

#### **Add Vector Index**

Adding an index mainly consists of two steps:

1. Synchronize the schema change that adds the index. Currently, TiFlash obtains the schema from TiKV in two main ways during operation: first, the Background Sync Thread regularly calls Schema Syncer to update the Local Schema Copy module; second, read and write operations in TiFlash also call Schema Syncer for updates when needed.
    * Since adding a vector index does not trigger the second way, we can currently only rely on the periodic schema sync to deliver the metadata to TiFlash.
        * Notify TiFlash to perform the load schema operation through HTTP.
    * `syncTableSchema` handles schema synchronization, and if new index information is found, it needs to be synchronized to the table.
2. Create the vector index data.
    * Precondition for creating the index data: the index state is Write-Reorg.
    * Timing for generating index data:
        * When synchronizing to add index information, obtain all segment information under the corresponding table to generate the HNSW index.
            * Use segments as task units and put them into `LocalIndexerScheduler` (already supported) for parallel processing.
        * Other operations that may trigger the generation of index data: merging delta and stable, merging/splitting/ingesting segments.
            * Because the timing of such triggers cannot be controlled, consider separating index generation from the existing stable layer. The plan is to wait until the index state is Public before actually triggering index generation for them.
    * Execute index data generation, with segments as the unit of index data generation.
        * Create a vector index for each DMFile that does not have a corresponding index. Currently, a vector index is created only for data in the stable layer.
            * The real file name of the created index is similar to {idx\_id}.idx.
                * This naming convention differs from the existing naming convention {col\_id}.idx. If compatibility is a concern, it is best to add a prefix.
        * Create a `VectorIndexHNSWBuilder` to read column data and populate the corresponding index data (already supported).
        * Update the relevant metadata in the segment, including the index information stored on the DMFile.
    * Wait for index data generation on all relevant segments to complete.
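
The "segments as task units, processed in parallel, then wait for all of them" flow in step 2 can be sketched with a small worker pool. TiFlash itself is not written in Go, so the code below is only a language-neutral illustration: `Segment`, `buildMissing`, and the `build` callback (standing in for the HNSW builder) are hypothetical names, and no scheduler behavior beyond what the text describes is implied.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Segment is a hypothetical stand-in for a segment and its stable-layer file.
type Segment struct {
	ID      int
	Indexed map[int64]bool // index IDs that already have a built index file
}

// buildMissing builds index idxID on every segment that lacks it, using at most
// `workers` goroutines, and returns the sorted IDs of the segments it built.
func buildMissing(segs []*Segment, idxID int64, workers int, build func(*Segment) error) ([]int, error) {
	if workers < 1 {
		workers = 1
	}
	tasks := make(chan *Segment)
	var (
		mu       sync.Mutex
		built    []int
		firstErr error
		wg       sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for seg := range tasks {
				err := build(seg)
				mu.Lock()
				if err != nil {
					if firstErr == nil {
						firstErr = err
					}
				} else {
					if seg.Indexed == nil {
						seg.Indexed = map[int64]bool{}
					}
					seg.Indexed[idxID] = true // update the segment's index metadata
					built = append(built, seg.ID)
				}
				mu.Unlock()
			}
		}()
	}
	for _, seg := range segs {
		if !seg.Indexed[idxID] { // skip files that already have this index
			tasks <- seg
		}
	}
	close(tasks)
	wg.Wait() // wait for index generation on all relevant segments
	sort.Ints(built)
	return built, firstErr
}

func main() {
	segs := []*Segment{
		{ID: 1, Indexed: map[int64]bool{}},
		{ID: 2, Indexed: map[int64]bool{7: true}},
		{ID: 3, Indexed: map[int64]bool{}},
	}
	built, err := buildMissing(segs, 7, 2, func(*Segment) error { return nil })
	fmt.Println(built, err)
}
```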

#### **Drop Vector Index**

* This operation mirrors the add vector index operation, except that it removes the vector index.
    * Most of the process is the same as creating a vector index.
* Drop the vector index data.
    * When `syncTableSchema` handles schema synchronization, it needs to remove the corresponding index.
    * Unlike adding a vector index, the precondition for triggering the operation is that the index state is Delete-Reorg.
    * In addition, when the actual deletion is triggered, if the metadata no longer contains the vector index but the corresponding vector index data file still exists, the index data must be deleted.
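
The last rule, deleting index data files whose index no longer appears in the metadata, amounts to a set difference between the files on disk and the live index IDs. The Go sketch below assumes the `{idx_id}.idx` naming mentioned in the add-index section; `orphanIndexFiles` is a hypothetical helper, and file names that do not match the pattern are deliberately left alone.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// orphanIndexFiles returns the index files whose index ID is absent from the
// table metadata. It assumes files are named "{idx_id}.idx"; anything else is skipped.
func orphanIndexFiles(files []string, liveIndexIDs map[int64]bool) []string {
	var orphans []string
	for _, name := range files {
		if !strings.HasSuffix(name, ".idx") {
			continue
		}
		id, err := strconv.ParseInt(strings.TrimSuffix(name, ".idx"), 10, 64)
		if err != nil {
			continue // not an "{idx_id}.idx" file
		}
		if !liveIndexIDs[id] {
			orphans = append(orphans, name)
		}
	}
	return orphans
}

func main() {
	files := []string{"3.idx", "4.idx", "data.bin", "x.idx"}
	live := map[int64]bool{3: true}
	fmt.Println(orphanIndexFiles(files, live))
}
```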

#### **Display**

Add or update relevant system table information.

* information\_schema.tiflash\_indexes needs some new columns:
    * Add columns index\_name and index\_id.
    * Add columns error\_msg and error\_count.

### **Compatibility**

#### **Tools**

* BR
* TiCDC
* DM
    * No need for support
* Lightning
* Dumpling

#### **MySQL 9.0**

MySQL 9.0 introduces its own vector support ([MySQL 9.0 vector](https://dev.mysql.com/doc/refman/9.0/en/vector.html)). Whether we need to be compatible with it can be considered later.

* For the time being, we don't need to be compatible with this syntax
268+
## **Test Design**
269+
270+
### **Functional Tests**
271+
272+
*It’s used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered.*
273+
274+
### **Scenario Tests**
275+
276+
* Create multiple vector indexes on various tables. It tests the speed and resource usage when numerous vector indexes are added in parallel.
277+
* Test whether index data can be deleted in sync schema (Drop vector index) when the index is deleted.
278+
279+
### **Compatibility Tests**
280+
281+
*A checklist to test compatibility:*
282+
283+
1. *Compatibility with other features, like partition table, drop column, etc.*
284+
2. *Compatibility with other components, like BR, TiCDC, Dumpling, BR, DM, Lighting, etc.*
285+
3. *Upgrade compatibility*

docs/design/imgs/vector-index.png

353 KB