So far, we've initialised the collection with a custom schema and built the capability to generate embeddings via Ollama. Now, we'll tie it all together.

A sample to process

We define a format for the document (based on the paperless-ngx API response):

doc = {
  "id": 3031,
  "correspondent": 63,
  "document_type": 10,
  "storage_path": None,
  "title": "10_ws3-petrov",
  "content": "Laura Oana Petrov * and Nobukazu Nakagoshi\n\n\nThe Use of GIS ...",
  "tags": [
    68
  ],
  "original_file_name": "10_ws3-petrov.pdf",
  "archived_file_name": "2024-07-17 Laura 10_ws3-petrov.pdf",
  "owner": 3,
  "notes": [],
  "custom_fields": [],
  "source_url": "http://docs.home.laurivan.com/documents/3031/preview"
}

Note: All numeric fields are references to other objects (e.g. the entries in tags are tag IDs), and I've ignored them for the time being.
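
For context, an object like this comes from the paperless-ngx REST API. A minimal sketch of fetching it (the host is taken from the sample above; the token and the exact field set are assumptions on my side):

import requests

PAPERLESS_URL = "http://docs.home.laurivan.com"  # host taken from the sample above
TOKEN = "..."                                    # placeholder API token

# Fetch a single document's metadata and OCR'd content
response = requests.get(
  f"{PAPERLESS_URL}/api/documents/3031/",
  headers={"Authorization": f"Token {TOKEN}"},
)
doc = response.json()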

Generate chunks

First, we implement the chunk generator:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(text, chunk_size=250, overlap=0):
    '''
    Split a document into chunks
    '''
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )

    doc_splits = text_splitter.split_text(text)

    return doc_splits

This is probably the simplest chunk splitter! It splits the text into segments of approximately 250 tokens each (the size is measured with the tiktoken encoder, not in characters).
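
As a quick sanity check, running the splitter on the sample document's content gives something like this (the exact counts depend on the text):

chunks = get_chunks(doc["content"])
print(len(chunks))       # number of chunks produced
print(chunks[0][:80])    # beginning of the first chunk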

Build the embeddings

I generate the embeddings starting from a JSON structure with the following minimum set of fields:

  • content - the text to be indexed
  • id - the document ID to be referred
  • source_url - the reference point
  • title or file_name or archived_file_name or original_file_name
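
If I wanted to make that requirement explicit, a small guard (just a sketch, not part of the actual script) could look like this:

REQUIRED_FIELDS = {"content", "id", "source_url"}
NAME_FIELDS = ("file_name", "archived_file_name", "original_file_name", "title")

def check_doc(doc):
  ''' Raise if the document misses any of the fields listed above '''
  missing = REQUIRED_FIELDS - doc.keys()
  if missing:
    raise ValueError(f"document is missing fields: {missing}")
  if not any(doc.get(key) for key in NAME_FIELDS):
    raise ValueError("document has no usable title/file name field")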

First, I make sure the JSON object is stored in a variable named doc and extract the pieces I need:

chunk_size = 250
overlap = 0

# Get the chunks based on "content"
chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
file_name = doc.get("file_name", doc.get("archived_file_name", doc.get("original_file_name", doc.get("title"))))
source_url = doc.get("source_url")

After that, I calculate the embeddings:

# Calculate the embeddings (one vector per chunk)
embeddings = emb_chunks(chunks)
num_chunks = len(embeddings)
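
As a reminder, emb_chunks is the helper built in the previous part. A minimal sketch of it, assuming the ollama Python client and the nomic-embed-text model (the actual implementation may differ), would be:

import ollama

def emb_chunks(chunks, model="nomic-embed-text"):
  ''' Return one embedding vector per chunk '''
  return [
    ollama.embeddings(model=model, prompt=chunk)["embedding"]
    for chunk in chunks
  ]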

Once I have the embeddings (this might be a bit slow because of Ollama), I build the schema-compliant array to be inserted:

# Prepare the data to be inserted in Milvus
data = []
for i in range(num_chunks):
  val = {
    "embeddings": embeddings[i],
    "text": chunks[i],
    "uri": source_url,
    "title": file_name
  }
  data.append(val)

print(file_name)
print(len(embeddings))

I know there's probably a better way to do this than what's written above, but that will be for production code (plus, it's been a while since I wrote anything meaningful in Python).
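
For the record, a slightly more Pythonic version of the same preparation step could use next() for the name lookup and zip() to pair chunks with embeddings (same behaviour, just tidier):

# Pick the first name-like field that is present
file_name = next(
  (doc[key] for key in ("file_name", "archived_file_name", "original_file_name", "title") if key in doc),
  None
)

# Pair each chunk with its embedding
data = [
  {"embeddings": emb, "text": chunk, "uri": source_url, "title": file_name}
  for chunk, emb in zip(chunks, embeddings)
]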

My code looks like this:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# import the emb_chunks function defined previously
from embeddings import emb_chunks

import json
import pathlib


def get_chunks(text, chunk_size=250, overlap=0):
  '''
  Split a document into chunks
  '''
  text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size,
    chunk_overlap=overlap
  )
  doc_splits = text_splitter.split_text(text)
  return doc_splits


def generate_embeddings(doc, chunk_size=250, overlap=0):
  '''
  Generate the embeddings for a document and prepare the data structure
  expected by the collection schema.

  The document is a JSON/dict and must at least have the following fields:
  - content - the text to be indexed
  - id - the document ID to be referred
  - source_url - the reference point
  - title or file_name or archived_file_name or original_file_name
  '''

  # Get the chunks based on "content"
  chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
  file_name = doc.get("file_name",
    doc.get("archived_file_name",
      doc.get("original_file_name",
        doc.get("title")
      )
    )
  )
  source_url = doc.get("source_url")

  # Calculate the embeddings (one vector per chunk)
  embeddings = emb_chunks(chunks)
  num_chunks = len(embeddings)

  # Prepare the data to be inserted in Milvus
  data = []
  for i in range(num_chunks):
    val = {
      "embeddings": embeddings[i],
      "text": chunks[i],
      "uri": source_url,
      "title": file_name
    }
    data.append(val)

  print(file_name)
  print(len(embeddings))
  return data


if __name__ == "__main__":
  # doc is the sample document defined earlier
  data = generate_embeddings(doc)
  pathlib.Path("embad.json").write_text(json.dumps(data))

It has the following characteristics:

  • The generate_embeddings function requires a JSON object with a specific minimum set of fields. It parameterizes the chunk size and the overlap for the text splitter.
  • The main generates the embeddings for the sample document doc and writes the resulting array as JSON to a file. I did this to see what the generated embeddings actually look like.
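
From here, pushing data into Milvus is a single call. Assuming the MilvusClient connection and the collection set up in the first part (the names below are placeholders), it would look roughly like this:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # connection details are placeholders

# "documents" stands for the collection created with the custom schema earlier
result = client.insert(collection_name="documents", data=data)
print(result)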