So far, we've initialised the collection with a custom schema and built the capability to generate embeddings via Ollama. Now, we'll tie it all together.
A sample to process
We define a format for the document (based on the paperless-ngx
API response):
doc = {
    "id": 3031,
    "correspondent": 63,
    "document_type": 10,
    "storage_path": None,
    "title": "10_ws3-petrov",
    "content": "Laura Oana Petrov * and Nobukazu Nakagoshi\n\n\nThe Use of GIS ...",
    "tags": [
        68
    ],
    "original_file_name": "10_ws3-petrov.pdf",
    "archived_file_name": "2024-07-17 Laura 10_ws3-petrov.pdf",
    "owner": 3,
    "notes": [],
    "custom_fields": [],
    "source_url": "http://docs.home.laurivan.com/documents/3031/preview"
}
Note: All numeric fields are references to other things (like the list of tags), and I've ignored them for the time being.
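For context, a structure much like this can be retrieved from the paperless-ngx REST API. A hedged sketch, assuming the standard documents endpoint and token authentication (both the base URL and the token below are placeholders):

import requests

# base URL and token are placeholders, not the values used in this post
resp = requests.get(
    "http://paperless.example.com/api/documents/3031/",
    headers={"Authorization": "Token YOUR_API_TOKEN"},
)
resp.raise_for_status()
doc = resp.json()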
Generate chunks
First, we implement the chunk generator:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(text, chunk_size=250, overlap=0):
    '''
    Split a document into chunks
    '''
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    doc_splits = text_splitter.split_text(text)
    return doc_splits
This is probably the simplest chunk splitter! It splits the text into segments of approximately 250 tokens each (because of from_tiktoken_encoder, the chunk size is counted in tiktoken tokens, not characters).
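To get a feel for what the splitter produces, one can run it on the sample document's content and peek at the first chunk (just for inspection, not part of the final pipeline):

chunks = get_chunks(doc["content"])
print(len(chunks))        # how many chunks the sample content was split into
print(chunks[0][:100])    # the beginning of the first chunk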
Build the embeddings
I generate the embeddings starting from a JSON structure with the following minimum set of fields:
- content - the text to be indexed
- id - the document ID to be referred to
- source_url - the reference point
- title or file_name or archived_file_name or original_file_name
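These requirements are not enforced anywhere in the code below; a small guard (my own sketch, not part of the original pipeline) could make the contract explicit:

# mandatory keys
missing = [key for key in ("content", "id", "source_url") if key not in doc]
# at least one of the name-like keys must be present
if not any(key in doc for key in ("title", "file_name", "archived_file_name", "original_file_name")):
    missing.append("title/file_name/archived_file_name/original_file_name")
if missing:
    raise ValueError(f"document is missing required fields: {missing}")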
First, I make sure the information from the JSON object is stored in a variable named doc:
chunk_size = 250
overlap = 0
# Get the chunks based on "content"
chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
file_name = doc.get("file_name", doc.get("archived_file_name", doc.get("original_file_name", doc.get("title"))))
source_url = doc.get("source_url")
After that, I calculate the embeddings:
# Calculate the embeddings (one vector per chunk)
embeddings = emb_chunks(chunks)
num_embeddings = len(embeddings)
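The emb_chunks function comes from the previous part, so I don't reproduce it here. A minimal sketch of an equivalent, assuming the ollama Python client and a placeholder model name, might look like this (the real implementation may differ):

import ollama

def emb_chunks(chunks, model="nomic-embed-text"):  # model name is a placeholder
    # one embedding vector per chunk, computed by the local Ollama server
    return [ollama.embeddings(model=model, prompt=chunk)["embedding"] for chunk in chunks]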
Once I have the embeddings (this might be a bit slow because Ollama), I build the schema-compliant array to be inserted:
# Prepare the vectors to be inserted in Milvus
data = []
for i in range(num_embeddings):
    val = {
        "embeddings": embeddings[i],
        "text": chunks[i],
        "uri": source_url,
        "title": file_name
    }
    data.append(val)

print(file_name)
print(len(embeddings))
I know there's probably a better way to do things than what's written above, but that'll be for production code (plus, it's been a while since I wrote something meaningful in Python).
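For what it's worth, a more compact equivalent (same result, just pairing each chunk with its embedding via zip) would be something like this:

data = [
    {"embeddings": emb, "text": chunk, "uri": source_url, "title": file_name}
    for chunk, emb in zip(chunks, embeddings)
]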
My code looks like this:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(text, chunk_size=250, overlap=0):
    '''
    Split a document into chunks
    '''
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    doc_splits = text_splitter.split_text(text)
    return doc_splits

# import the emb_chunks function defined previously
from embeddings import emb_chunks

def generate_embeddings(doc, chunk_size=250, overlap=0):
    '''
    Generate the embeddings for a document and prepare the data structure to be inserted.

    The document is a JSON/dict and must have at least the following fields:
    - content - the text to be indexed
    - id - the document ID to be referred to
    - source_url - the reference point
    - title or file_name or archived_file_name or original_file_name
    '''
    # Get the chunks based on "content"
    chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
    file_name = doc.get("file_name",
                        doc.get("archived_file_name",
                                doc.get("original_file_name",
                                        doc.get("title"))))
    source_url = doc.get("source_url")

    # Calculate the embeddings (one vector per chunk)
    embeddings = emb_chunks(chunks)
    num_embeddings = len(embeddings)

    # Prepare the vectors to be inserted in Milvus
    data = []
    for i in range(num_embeddings):
        val = {
            "embeddings": embeddings[i],
            "text": chunks[i],
            "uri": source_url,
            "title": file_name
        }
        data.append(val)

    print(file_name)
    print(len(embeddings))
    return data

import pathlib
import json

if __name__ == "__main__":
    # doc is the sample document defined at the top of the post
    data = generate_embeddings(doc)
    pathlib.Path("embad.json").write_text(json.dumps(data))
It has the following characteristics:
- The generate_embeddings function requires a JSON object with a specific minimum set of fields. It parameterizes the chunk size and the overlap for the text splitter.
- The main block generates the embeddings for the sample document doc and writes the resulting array as JSON to a file. I did this to see what the generated embeddings actually look like.
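The actual insertion into Milvus is the next step. Assuming the collection from the first part was created via pymilvus' MilvusClient, it would look roughly like this (the URI and collection name are assumptions on my side):

from pymilvus import MilvusClient

# URI and collection name are placeholders; use whatever the collection was created with
client = MilvusClient(uri="http://localhost:19530")
client.insert(collection_name="documents", data=data)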