Actualmente estoy trabajando con dsrag, un modelo ...

Question

Actualmente estoy trabajando con dsrag, un modelo que es capaz de rendir mucho mejor que otros metodos rag con unos resultados de hasta 96% de precision, pero quiero cambiar todo el proceso de rag con modelos locales a excepcion de la respuesta final que quiero usar open router, aqui le comparto la documentacion del proyecto en forma de notebook: Solving the out-of-context chunk problem for RAG

A large percentage of the problems developers face with RAG comes down
to this: Individual chunks oftentimes do not contain sufficient context
to be properly used by the retrieval system or the LLM. This leads to
the inability to answer questions and, more worryingly, hallucinations.

Examples of this problem

Chunks oftentimes refer to their subject via implicit references and
pronouns. This causes them to not be retrieved when they should be,
or to not be properly understood by the LLM.
Individual chunks oftentimes don’t contain the complete answer to a
question. The answer may be scattered across a few adjacent chunks.
Adjacent chunks presented to the LLM out of order cause confusion
and can lead to hallucinations.
Naive chunking can lead to text being split “mid-thought” leaving
neither chunk with useful context.
Individual chunks oftentimes only make sense in the context of the
entire section or document, and can be misleading when read on their
own.

What would a solution look like?

There are two solutions that, when used together, solve the bulk of
these problems.

Contextual chunk headers

The idea here is to add in higher-level context to the chunk by
prepending a chunk header. This chunk header could be as simple as just
the document title, or it could use a combination of document title, a
concise document summary, and the full hierarchy of section and
sub-section titles.

Chunks -> segments

Large chunks provide better context to the LLM than small chunks, but
they also make it harder to precisely retrieve specific pieces of
information. Some queries (like simple factoid questions) are best
handled by small chunks, while other queries (like higher-level
questions) require very large chunks. What we really need is a more
dynamic system that can retrieve short chunks when that's all that's
needed, but can also retrieve very large chunks when required. How do we
do that?

Given that relevant chunks tend to be clustered within their original
documents, what if we could identify those clusters and concatenate all
of the chunks back into their original form? This would provide much
more complete and understandable context to the LLM than just providing
the individual (and out-of-order) chunks.

Setup

First, we need to import some things and set API keys.

text
import sys
sys.path.append("../")
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
from dsrag.rse import get_best_segments
import cohere
import os
from scipy.stats import beta
import numpy as np
import matplotlib.pyplot as plt

# set API keys
import os
os.environ["OPENAI_API_KEY"] = ""
os.environ["CO_API_KEY"] = ""

Load in document and add to KB

We'll go ahead and add the document to a dsRAG knowledge base, which
will take care of semantic sectioning and chunking. Semantic sectioning
is the process of breaking a document into semantically cohesive
sections. These sections are then broken up into smaller chunks if
needed. Each section is given a descriptive title, which we'll use later
in the ranking process.

text
# load in some data
#file_path = "../tests/data/levels_of_agi.pdf"
file_path = "../tests/data/nike_2023_annual_report.txt"
doc_id = os.path.basename(file_path).split(".")[0] # grab the file name without the extension so we can use it as the doc_id

# kb_id = "levels_of_agi"
kb_id = "nike_10k"

if file_path.endswith(".pdf"):
    document_text, _ = extract_text_from_pdf(file_path)
else:
    with open(file_path, "r") as f:
        document_text = f.read()

print (document_text[:1000])

FORM 10-K FORM 10-KUNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K 
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023 
OR
☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE TRANSITION PERIOD FROM TO .
Commission File No. 1-10635 
NIKE, Inc. 
(Exact name of Registrant as specified in its charter)
Oregon 93-0584541
(State or other jurisdiction of incorporation) (IRS Employer Identification No.)
One Bowerman Drive, Beaverton, Oregon 97005-6453 
(Address of principal executive offices and zip code)
(503) 671-6453 
(Registrant's telephone number, including area code)
SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:
Class B Common Stock NKE New York Stock Exchange
(Title of each class) (Trading symbol) (Name of each exchange on which registered)
SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:
NONE
Indica

# NOTE: the two sample KBs used in this notebook already exist in the example_kb_data directory, so you don't need to run this cell unless you want to create a new KB

# add to knowledge base
#kb = KnowledgeBase(kb_id=kb_id, exists_ok=False, storage_directory='example_kb_data') # create a new knowledge base
#kb.add_document(doc_id=doc_id, text=document_text) # add the document to the knowledge base

View chunks and sections

text
# load in chunks
kb = KnowledgeBase(kb_id=kb_id, exists_ok=True, storage_directory='example_kb_data')

num_chunks = len(kb.chunk_db.data[doc_id])
print (num_chunks)

chunks = []
for i in range(num_chunks):
    chunk = {
        "section_title": kb.chunk_db.get_section_title(doc_id, i),
        "chunk_text": kb.chunk_db.get_chunk_text(doc_id, i),
    }

    chunks.append(chunk)

chunks[0]

493

{'section_title': 'Form 10-K Cover Page and Company Information',
 'chunk_text': "FORM 10-K FORM 10-KUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K \n(Mark One)\n☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE FISCAL YEAR ENDED MAY 31, 2023 \nOR\n☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE TRANSITION PERIOD FROM TO .\nCommission File No. 1-10635 \nNIKE, Inc. \n(Exact name of Registrant as specified in its charter)\nOregon 93-0584541\n(State or other jurisdiction of incorporation) (IRS Employer Identification No.)\nOne Bowerman Drive, Beaverton, Oregon 97005-6453 \n(Address of principal executive offices and zip code)\n(503) 671-6453 \n(Registrant's telephone number, including area code)\nSECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:"}

Now let's print out all of the sections that the semantic sectioning
process identified.

text
# print all section titles
unique_section_titles = []
for i in range(num_chunks):
    section_title = chunks[i]["section_title"]
    if section_title not in unique_section_titles:
        print (section_title)
        unique_section_titles.append(section_title)

Add contextual chunk headers and measure impact

We'll use the Cohere Rerank API for rank our chunks. Normally, we'd
start with a vector and/or keyword search to narrow down the list of
candidates, but since we're just dealing with a single document here we
can just send all chunks directly to the reranker, keeping things a bit
simpler.

text
def transform(x):
    """
    Transformation function to map the absolute relevance value to a value that is more uniformly distributed between 0 and 1
    - This is critical for the new version of RSE to work properly, because it utilizes the absolute relevance values to calculate the similarity scores
    - The relevance values given by the Cohere reranker tend to be very close to 0 or 1. This beta function used here helps to spread out the values more uniformly.
    """
    a, b = 0.4, 0.4  # These can be adjusted to change the distribution shape
    return beta.cdf(x, a, b)

def rerank_documents(query: str, documents: list) -> list:
    """
    Use Cohere Rerank API to rerank the search results
    """
    model = "rerank-v3.5"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])
    decay_rate = 30

    reranked_results = client.rerank(model=model, query=query, documents=documents)
    results = reranked_results.results
    reranked_indices = [result.index for result in results]
    reranked_similarity_scores = [result.relevance_score for result in results] # in order of reranked_indices

    # convert back to order of original documents and calculate the chunk values
    similarity_scores = [0] * len(documents)
    chunk_values = [0] * len(documents)
    for i, index in enumerate(reranked_indices):
        absolute_relevance_value = transform(reranked_similarity_scores[i])
        similarity_scores[index] = absolute_relevance_value
        v = np.exp(-i/decay_rate)*absolute_relevance_value # decay the relevance value based on the rank
        chunk_values[index] = v

    return similarity_scores, chunk_values

Contextual chunk headers

We'll add both a document context and a section context to each chunk.

text
if kb_id == "levels_of_agi":
    document_context = "Document: Levels of AGI"
elif kb_id == "nike_10k":
    document_context = "Document: Nike 10-K FY2023"
else:
    document_context = "Document: Unknown"

documents = []
documents_no_context = [] # baseline for comparison
for i in range(num_chunks):
    section_context = f"Section: {chunks[i]['section_title']}"
    chunk_text = chunks[i]["chunk_text"]
    document = f"{document_context}\n{section_context}\n\n{chunk_text}"
    documents.append(document)
    documents_no_context.append(chunk_text)

chunk_index_to_inspect = 226
query = "Nike operating segment results"
print (documents[chunk_index_to_inspect])

Document: Nike 10-K FY2023
Section: Operating Segments Overview

Asia Pacific & Latin America(2) 6,431  5,955  8 %  17 %  5,343  11 %  16 %
Global Brand Divisions(3) 58  102  -43 %  -43 %  25  308 %  302 %
TOTAL NIKE BRAND $ 48,763 $ 44,436  10 %  16 % $ 42,293  5 %  6 %
Converse  2,427  2,346  3 %  8 %  2,205  6 %  7 %
Corporate(4) 27  (72)  —  —  40  —  — 
TOTAL NIKE, INC. REVENUES $ 51,217 $ 46,710  10 %  16 % $ 44,538  5 %  6 %
(1) The percent change excluding currency changes represents a non-GAAP financial measure. For further information, see "Use of Non-GAAP Financial 
Measures".
(2) For additional information on the transition of our NIKE Brand businesses within our CASA territory to a third-party distributor, see Note 18 —

Prepending document and section title information adds important context
to the chunk. Let's see how much of a difference it makes in the
relevance ranking.

text
# run this chunk through the Cohere Rerank API with and without the context header
similarity_scores, chunk_values = rerank_documents(query, [documents[chunk_index_to_inspect], documents_no_context[chunk_index_to_inspect]])

print (f"Similarity with contextual chunk header: {similarity_scores[0]}")
print (f"Similarity without contextual chunk header: {similarity_scores[1]}")

Similarity with contextual chunk header: 0.7322965626380898
Similarity without contextual chunk header: 0.615422859563519

That's a pretty decent difference.

Run queries and visualize results

text
# Levels of AGI
#query = "Principles for defining levels of AGI"
#query = "Levels of AGI paper references"
#query = "CoALA paper references"
#query = "What are the levels of AGI?"
#query = "What is Autonomy Level 3"
#query = "Levels of AGI Introduction section"
#query = "Use of existing AI benchmarks like Big-bench and HELM"
#query = "How does autonomy factor into AGI?"
#query = "What are the levels of AGI?"

# Nike 10-K
#query = "Nike cash flow statement"
#query = "Nike revenue growth"
#query = "2023 Nike North America segment results"
#query = "Nike board of directors"
#query = "Nike supply chain challenges"
query = "Nike stock-based compensation expenses"
#query = "Nike consolidated balance sheet"
#query = "What is the mission of Nike?"
#query = "Nike CEO tenure"

similarity_scores, chunk_values = rerank_documents(query, documents)

plt.figure(figsize=(12, 5))
plt.title(f"Similarity of each chunk in the document to the search query")
plt.ylim(0, 1)
plt.xlabel("Chunk index")
plt.ylabel("Query-chunk similarity")
plt.scatter(range(len(chunk_values)), chunk_values)

How to interpret the chunk relevance plot above

In the plot above, the x-axis represents the chunk index. The first
chunk in the document has index 0, the next chunk has index 1, etc. The
y-axis represents the relevance of each chunk to the query. Viewing it
this way lets us see how relevant chunks tend to be clustered in one or
more sections of a document.

Note: the relevance values in this plot are actually a combination of
the raw relevance value and the relevance ranks. An exponential decay
function is applied to the ranks, and that is then multiplied by the raw
relevance value. Using this combination provides a more robust measure
of relevance than using just one or the other, at least in most cases.

What can we do with these clusters of relevant chunks?

The core idea is that clusters of relevant chunks, in their original
contiguous form, provide much better context to the LLM than individual
chunks can. Now for the hard part: how do we actually identify these
clusters?

If we can calculate chunk values in such a way that the value of a
segment is just the sum of the values of its constituent chunks, then
finding the optimal segment is a version of the maximum subarray
problem, for which a solution can be found relatively easily. How do we
define chunk values in such a way? We'll start with the idea that highly
relevant chunks are good, and irrelevant chunks are bad. We already have
a good measure of chunk relevance, on a scale of 0-1, so all we need to
do is subtract a constant threshold value from it. This will turn the
chunk value of irrelevant chunks to a negative number, while keeping the
values of relevant chunks positive. We call this the
irrelevant_chunk_penalty. A value around 0.2 seems to work well
empirically.

Relevant segment deep dive

text
# be sure you're using the Nike 10-K KB for these next few cells, as we'll be focusing on a single example query for that document

query = "Nike stock-based compensation expenses"
similarity_scores, chunk_values = rerank_documents(query, documents)

irrelevant_chunk_penalty = 0.2
all_relevance_values = [[v - irrelevant_chunk_penalty for v in chunk_values]]
document_splits = []
max_length = 30
overall_max_length = 50
minimum_value = 0.5

# get_best_segments solves a constrained version of the maximum sum subarray problem
best_segments, scores = get_best_segments(all_relevance_values, document_splits, max_length, overall_max_length, minimum_value)

print (best_segments)
print (scores)
print ()

# print the best segments
for segment_start, segment_end in best_segments:
    # concatenate the text of the chunks in the segment
    segment_text = f"[{document_context}]\n"
    for i in range(segment_start, segment_end):
        chunk_text = chunks[i]["chunk_text"]
        segment_text += chunk_text + "\n"

    print (segment_text)
    print ("\n---\n")

Let's zoom in to the best segment and see what the chunk relevance
values look like.

text
# plot the relevance values of the best segment
best_segment_chunk_indexes = list(range(best_segments[0][0], best_segments[0][1]))
best_segment_chunk_values = chunk_values[best_segments[0][0]:best_segments[0][1]]

plt.figure(figsize=(12, 5))
plt.title(f"Relevance values of the best segment")
plt.ylim(0, 1)
plt.xlabel("Chunk index")
plt.ylabel("Query-chunk similarity")
plt.scatter(best_segment_chunk_indexes, best_segment_chunk_values)

You'll notice that a few of the chunks (especially 402-404) aren't
particularly relevant, but they're surrounded by relevant chunks, so we
include them in the segment for continuity.

text
# print the individual chunks in the best segment - annotated with their chunk indexes and relevance values
for chunk_index in best_segment_chunk_indexes:
    chunk_text = chunks[chunk_index]["chunk_text"]
    chunk_value = chunk_values[chunk_index]
    print (f"Chunk index: {chunk_index} - Relevance value: {chunk_value}")
    print (chunk_text)
    print ("\n---\n")

Looking at the content of each of these chunks, it's clear that chunks
397-401 are highly relevant, as expected. But looking closely at chunks
402-404, we can see they're actually also relevant, despite being marked
as irrelevant by our ranking model. This is a common theme: chunks that
are marked as not relevant, but are sandwiched between highly relevant
chunks, are oftentimes quite relevant. In this case, the chunks were
about stock option valuation, so while they weren't explicitly
discussing stock-based compensation expenses (which is what we were
searching for), in the context of the surrounding chunks it's clear that
they are actually relevant.

aqui el codigo que he hecho:

import subprocess
import os
import time
import threading
import sys

print("Installing Ollama...")
try:
install_result = subprocess.run(
["curl", "-fsSL", "https://ollama.com/install.sh"],
stdout=subprocess.PIPE,
check=True
)
install_script = install_result.stdout.decode('utf-8')

text
# Ejecutar el script de instalación
install_process = subprocess.run(
    ["bash", "-c", install_script],
    check=True
)

print("✅ Ollama installed successfully")

except subprocess.CalledProcessError as e:
print(f"❌ Error installing Ollama: {e}")
sys.exit(1)

Confirmar que Ollama se instaló verificando la versión

try:
version_result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
print(f"Ollama version: {version_result.stdout.strip()}")
except Exception as e:
print(f"❌ Error checking Ollama version: {e}")
# Intentar encontrar el binario
find_result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
print(f"Ollama binary location: {find_result.stdout.strip()}")

Definir una función para iniciar Ollama más robusta

def run_ollama_server():
try:
print("Starting Ollama server...")
# Usar la ruta completa del binario si es necesario
ollama_path = subprocess.run(["which", "ollama"], capture_output=True, text=True).stdout.strip()

text
    if not ollama_path:
        print("❌ Ollama binary not found in PATH")
        return None

    print(f"Using Ollama binary at: {ollama_path}")

    # Iniciar el servidor con salida redirigida
    process = subprocess.Popen(
        [ollama_path, "serve"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )

    print(f"Ollama process started with PID: {process.pid}")
    return process
except Exception as e:
    print(f"❌ Error starting Ollama server: {e}")
    return None

Iniciar el servidor y esperar

ollama_process = run_ollama_server()

Esperar más tiempo para asegurar que el servidor esté listo

print("Waiting for Ollama server to start (30 seconds)...")
time.sleep(30)

Verificar que el servidor está respondiendo

import requests

def check_ollama_server():
max_attempts = 5
for attempt in range(max_attempts):
try:
response = requests.get("http://localhost:11434/api/tags")
if response.status_code == 200:
print(f"✅ Ollama server is running correctly (attempt {attempt+1})")
return True
else:
print(f"⚠️ Ollama server responded with status code: {response.status_code} (attempt {attempt+1})")
except requests.exceptions.ConnectionError:
print(f"⚠️ Ollama server connection refused (attempt {attempt+1})")
except Exception as e:
print(f"❌ Error checking Ollama server: {e} (attempt {attempt+1})")

text
    if attempt < max_attempts - 1:
        print("Waiting 10 more seconds...")
        time.sleep(10)

return False

if not check_ollama_server():
print("❌ Could not connect to Ollama server after multiple attempts")
# Intentar reiniciar el servidor
if ollama_process:
print("Attempting to restart Ollama server...")
ollama_process.terminate()
time.sleep(5)
ollama_process = run_ollama_server()
time.sleep(20)
if not check_ollama_server():
print("❌ Failed to restart Ollama server")
sys.exit(1)

Descargar modelos con verificación

def pull_model(model_name):
max_attempts = 3
for attempt in range(max_attempts):
try:
print(f"Pulling model {model_name} (attempt {attempt+1})...")
result = subprocess.run(
["ollama", "pull", model_name],
capture_output=True,
text=True,
check=True
)
print(f"✅ Successfully pulled model: {model_name}")
return True
except subprocess.CalledProcessError as e:
print(f"❌ Error pulling model {model_name}: {e.stderr}")
if attempt < max_attempts - 1:
print("Waiting before retry...")
time.sleep(5)

return False

Descargar modelos necesarios

models_to_pull = [
"MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
"jeffh/intfloat-multilingual-e5-large:f16"
]

for model in models_to_pull:
if not pull_model(model):
print(f"❌ Failed to pull {model} after multiple attempts")
print("Continuing anyway, but this may cause errors later")

Descargar modelos con verificación

def pull_model(model_name):
try:
result = subprocess.run(["ollama", "pull", model_name],
capture_output=True, text=True, check=True)
print(f"Successfully pulled model: {model_name}")
return True
except subprocess.CalledProcessError as e:
print(f"Error pulling model {model_name}: {e.stderr}")
return False

Descargar modelos necesarios

models_to_pull = [
"MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
"jeffh/intfloat-multilingual-e5-large:f16"
]

for model in models_to_pull:
pull_success = pull_model(model)
if not pull_success:
print(f"Warning: Failed to pull {model}, this may cause errors later")

from dsrag.embedding import Embedding, OllamaEmbedding
from dsrag.reranker import Reranker, NoReranker
from dsrag.llm import LLM, OllamaAPI
from dsrag.knowledge_base import KnowledgeBase
from typing import List, Dict, Optional
from sentence_transformers import CrossEncoder
from langchain_ollama import ChatOllama
from dsrag.database.vector.types import Vector, VectorSearchResult
import requests
import json

Exploremos la clase OllamaEmbedding para ver qué parámetros acepta

print("\nExploring OllamaEmbedding parameters:")
import inspect
try:
print(inspect.signature(OllamaEmbedding.init))
print("OllamaEmbedding docstring:")
print(OllamaEmbedding.init.doc)
except Exception as e:
print(f"Error exploring OllamaEmbedding: {e}")

Clase de embedding corregida - NO pasa model en kwargs

class LocalEmbedding(OllamaEmbedding):
def init(self, **kwargs):
# IMPORTANTE: Eliminar 'model' de kwargs para evitar duplicación
if 'model' in kwargs:
kwargs.pop('model')

text
    # Establecer dimensión manualmente
    kwargs['dimension'] = 1024

    # Llamar al constructor padre con model como argumento posicional
    super().__init__(model="jeffh/intfloat-multilingual-e5-large:f16", **kwargs)

Clase para reranker - se mantiene igual

class MultilingualReranker(Reranker):
def init(self, model_name="jinaai/jina-reranker-v2-base-multilingual"):
super().init()
self.model = CrossEncoder(model_name, trust_remote_code=True)

text
def rerank_search_results(self, query: str, search_results: list) -> list[VectorSearchResult]:
    sentences1 = [query] * len(search_results)
    sentences2 = [result.metadata['chunk_text'] for result in search_results]
    scores = self.model.predict(list(zip(sentences1, sentences2)))

    for i in range(len(search_results)):
        search_results[i].similarity = scores[i]
    return sorted(search_results, key=lambda x: x.similarity, reverse=True)

Clase para LLM de AutoContext corregida

class LocalAutoContextLLM(LLM):
def init(self, **kwargs):
super().init()
self.model_name = "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M"
self.base_url = "http://localhost:11434"

text
def make_llm_call(self, chat_messages: list[dict]) -> str:
    prompt_text = ""
    for msg in chat_messages:
        if msg["role"] == "user":
            prompt_text += "<start_of_turn>user " + msg["content"].strip() + "<end_of_turn>"
        else:
            prompt_text += "<start_of_turn>model " + msg["content"].strip() + "<end_of_turn>"
    prompt_text += "<start_of_turn>model"

    # Llamada a la API de Ollama
    try:
        payload = {
            "model": self.model_name,
            "messages": [{"role": "user", "content": prompt_text}]
        }

        response = requests.post(f"{self.base_url}/api/chat", json=payload)
        response.raise_for_status()

        response_json = response.json()

        # Manejar diferentes estructuras de respuesta posibles
        if "message" in response_json and "content" in response_json["message"]:
            return response_json["message"]["content"].strip()
        elif "choices" in response_json:
            return response_json["choices"][0]["message"]["content"].strip()
        else:
            print(f"Unexpected response structure: {response_json}")
            return str(response_json)

    except Exception as e:
        print(f"Error in LLM call: {e}")
        return f"Error generating response: {str(e)}"

Función para verificar que Ollama está funcionando

def check_ollama_server():
try:
response = requests.get("http://localhost:11434/api/tags")
if response.status_code == 200:
print("✅ Ollama server is running correctly")
return True
else:
print(f"⚠️ Ollama server responded with status code: {response.status_code}")
return False
except Exception as e:
print(f"❌ Error connecting to Ollama server: {e}")
return False

Función para ejecutar todo el flujo de trabajo

def run_dsrag_workflow():
try:
print("Initializing embedding model...")
local_embedding_model = LocalEmbedding()
print("Embedding model initialized successfully")

text
    print("Initializing reranker...")
    multilingual_reranker = MultilingualReranker()
    print("Reranker initialized successfully")

    print("Initializing AutoContext LLM...")
    local_autocontext_llm = LocalAutoContextLLM()
    print("AutoContext LLM initialized successfully")

    print("Initializing Knowledge Base...")
    kb = KnowledgeBase(
        kb_id="local_kb_vtt_ollama",
        embedding_model=local_embedding_model,
        reranker=multilingual_reranker,
        auto_context_model=local_autocontext_llm,
    )
    print("Knowledge Base initialized successfully")

    # Ejemplo de uso con VTT
    vtt_content = """WEBVTT

Kind: captions
Language: en

00:00:00.000 --> 00:00:05.000
This is the first subtitle, talking about dsRAG features.

00:00:05.000 --> 00:00:10.000
dsRAG excels at handling complex queries over dense text.

00:00:10.000 --> 00:00:15.000
It uses Semantic Sectioning, AutoContext, and RSE for improved RAG.

00:00:15.000 --> 00:00:20.000
dsRAG achieves high accuracy on benchmarks like FinanceBench.
"""

text
    # Procesar VTT
    vtt_chunks = [line for line in vtt_content.strip().split('\n\n') if line.strip()]

    # Añadir documento al KB
    doc_id_vtt = "vtt_document_1"

    print("Adding document to Knowledge Base...")
    kb.add_document(
        doc_id=doc_id_vtt,
        text="\n\n".join(vtt_chunks),
        document_title="VTT Document Example",
        auto_context_config={"get_document_summary": False},
        semantic_sectioning_config={"use_semantic_sectioning": False}
    )

    print("Document added successfully!")

    # Consultar el KB
    search_queries_vtt = ["What are the key features of dsRAG?"]
    print(f"Querying Knowledge Base with: {search_queries_vtt}")

    results_vtt = kb.query(search_queries_vtt)

    print("Query successful! Results:")
    for i, result in enumerate(results_vtt):
        print(f"Result {i+1}: {result.get('content', '')[:100]}...")

    return kb, results_vtt

except Exception as e:
    print(f"Error in workflow: {e}")
    import traceback
    traceback.print_exc()
    return None, None

Ejecutar el flujo de trabajo

kb, results = run_dsrag_workflow()

Ejemplo de consulta final utilizando langchain-ollama si todo funcionó

if kb is not None and results is not None:
try:
from langchain_ollama import ChatOllama

text
    # Preparar contexto
    context = "\n\n".join([segment.get('content', '') for segment in results])

    # Consulta del usuario
    user_query = "Summarize the key features and accuracy of dsRAG based on the provided context."

    # Generar respuesta final
    print("\nGenerating final response with langchain-ollama...")
    ollama_llm = ChatOllama(
        model="MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
        base_url="http://localhost:11434"
    )

    prompt = f"Based on the following information:\n{context}\n\nAnswer the query: {user_query}"
    response = ollama_llm.invoke(prompt)

    print("\nFinal answer:")
    print(response.content)

except Exception as e:
    print(f"Error generating final response: {e}")
    import traceback
    traceback.print_exc()

from langchain_ollama import ChatOllama

def get_ollama_final_response(context, user_query):
"""
Generates a response using local Ollama via langchain-ollama.
"""
ollama_llm = ChatOllama(
model="gemma-7b-it", # Correct Ollama model name. MAKE SURE THIS IS INSTALLED!
base_url="http://localhost:11434" # Ollama's default API endpoint
)
prompt = f"Based on the following information: {context}\n\nAnswer the query: {user_query}"
response = ollama_llm.invoke(prompt) # Use Langchain-Ollama's invoke method
return response.content # Access response content

search_queries_vtt = ["What are the key features of dsRAG?", "How accurate is dsRAG?"]
results_vtt = kb.query(search_queries_vtt)

context_vtt = "\n\n".join([segment['content'] for segment in results_vtt])
user_query_vtt = "Summarize the key features and accuracy of dsRAG based on the provided context."

Generate response using local Ollama (langchain-ollama)

answer_ollama = get_ollama_final_response(context_vtt, user_query_vtt)
print("\nAnswer from Local Ollama (langchain-ollama):\n---")
print(answer_ollama) --thinking_budget 8192

Answer 1

Tu implementación se ve muy completa y parece que has integrado varios componentes de dsRAG (como la inyección de contexto, la segmentación semántica, el reranqueo y el uso de embeddings locales) para construir un flujo de trabajo que usa modelos locales en la mayor parte del proceso. A continuación, te ofrezco un resumen y algunas sugerencias:

Resumen del Flujo de Trabajo

Instalación y verificación de Ollama
- Utilizas un script de instalación vía curl para instalar Ollama y luego verificas su versión y ubicación.
- Además, inicias el servidor de Ollama y realizas varios intentos para confirmar que está respondiendo correctamente en el puerto definido.
Descarga de modelos
- Descargas modelos necesarios (por ejemplo, un modelo de embeddings y otro para el LLM) con un mecanismo de reintentos en caso de error.
- Noto que defines la función pull_model dos veces; podrías consolidarla para evitar duplicidad.
Definición de clases personalizadas
- LocalEmbedding: Extiende la clase OllamaEmbedding eliminando el parámetro redundante “model” y estableciendo manualmente la dimensión del embedding.
- MultilingualReranker: Implementa un reranqueo basado en un modelo CrossEncoder para ordenar los chunks según su relevancia.
- LocalAutoContextLLM: Define cómo se arma el prompt (añadiendo tokens especiales) y se realiza la llamada al LLM a través del endpoint de Ollama.
Ejecución del flujo de dsRAG
- Se crea un Knowledge Base (KB) y se añade un documento (en este caso, un ejemplo en formato VTT).
- Se consulta el KB con algunas queries y se visualizan los resultados.
Generación de la respuesta final
- Para la respuesta final, utilizas la librería langchain-ollama (a través de la función get_ollama_final_response) que arma un prompt con el contexto obtenido del KB y la query del usuario.

Consideraciones y Sugerencias

Uso de modelos locales vs. respuesta final con Open Router:
Si tu objetivo es usar modelos locales para todo el proceso de RAG excepto la respuesta final, deberás asegurarte de que en la última parte (donde generas la respuesta final) se invoque el endpoint de Open Router en lugar del servidor local de Ollama.
- Por ejemplo, en la función get_ollama_final_response, podrías parametrizar el base_url y el modelo para que apunten al servicio de Open Router (asegurándote de que la autenticación y demás parámetros sean los requeridos por dicho servicio).
Consolidación de funciones duplicadas:
Noté que defines dos versiones de la función pull_model. Consolidarla en una única función te ayudará a mantener el código más limpio y evitar posibles inconsistencias.
Manejo de errores y reintentos:
El uso de bloques try/except y la lógica de reintentos son muy adecuados para asegurarse de que los servicios (como Ollama) estén disponibles. Asegúrate de capturar y loggear suficientes detalles en caso de error, de modo que sea más fácil diagnosticar problemas en entornos de producción.
Construcción del prompt y tokens especiales:
La forma en que concatenas los mensajes para crear el prompt en LocalAutoContextLLM es interesante. Considera revisar que la sintaxis y los tokens especiales sean consistentes con las expectativas del modelo que usas. Podrías modularizar esa lógica si en el futuro decides cambiar el formato.
Verificación del KB y consulta de resultados:
El flujo para añadir documentos al KB y consultarlos parece funcionar correctamente. Si en el futuro trabajas con documentos de mayor tamaño o con múltiples tipos de contenido, podrías considerar agregar más parámetros de configuración a la hora de segmentar o agregar contexto.
Documentación y comentarios:
La documentación incluida en el notebook es muy útil para entender el propósito de cada bloque. Mantener comentarios claros en el código ayuda a la mantenibilidad, especialmente en proyectos complejos como este.

Conclusión

Tu implementación integra varias técnicas avanzadas para mejorar la calidad de la recuperación y la generación en RAG. Si tu objetivo es mantener el procesamiento de RAG (incluyendo chunking, embeddings y reranqueo) de forma local, y únicamente externalizar la generación final (por ejemplo, a Open Router), asegúrate de:

Configurar correctamente la función final para que haga la llamada al endpoint de Open Router (cambiando el base_url y las credenciales si es necesario).
Probar la integración final para verificar que el prompt y la respuesta se manejen de forma coherente con el servicio externo.

Si tienes alguna duda específica sobre algún componente o necesitas ayuda con la integración con Open Router, ¡déjame saber y con gusto profundizo en ello!

Answer 2

A continuación te presento una versión completa y pulida del código, donde se usan modelos locales para el proceso RAG (embedding, reranqueo, segmentación, etc.) y se utiliza Open Router para generar la respuesta final. He consolidado funciones duplicadas, mejorado el manejo de errores y parametrizado la llamada final a Open Router. Reemplaza las constantes (como la API key de Open Router) según tu entorno:

python
import subprocess
import os
import time
import sys
import requests
import json
import traceback
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# =============================================================================
# Paso 1: Instalación y verificación de Ollama
# =============================================================================

def install_and_verify_ollama():
    print("Instalando Ollama...")
    try:
        install_result = subprocess.run(
            ["curl", "-fsSL", "https://ollama.com/install.sh"],
            stdout=subprocess.PIPE,
            check=True
        )
        install_script = install_result.stdout.decode('utf-8')
        # Ejecutar el script de instalación
        subprocess.run(["bash", "-c", install_script], check=True)
        print("✅ Ollama instalado correctamente")
    except subprocess.CalledProcessError as e:
        print(f"❌ Error instalando Ollama: {e}")
        sys.exit(1)

    try:
        version_result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
        print(f"Versión de Ollama: {version_result.stdout.strip()}")
    except Exception as e:
        print(f"❌ Error verificando la versión de Ollama: {e}")
        find_result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
        print(f"Ubicación del binario Ollama: {find_result.stdout.strip()}")

def run_ollama_server():
    try:
        print("Iniciando servidor Ollama...")
        ollama_path = subprocess.run(["which", "ollama"], capture_output=True, text=True).stdout.strip()
        if not ollama_path:
            print("❌ No se encontró el binario de Ollama en PATH")
            return None
        print(f"Usando binario Ollama en: {ollama_path}")
        process = subprocess.Popen(
            [ollama_path, "serve"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        print(f"Proceso de Ollama iniciado con PID: {process.pid}")
        return process
    except Exception as e:
        print(f"❌ Error iniciando el servidor Ollama: {e}")
        return None

def check_ollama_server(max_attempts=5):
    for attempt in range(max_attempts):
        try:
            response = requests.get("http://localhost:11434/api/tags")
            if response.status_code == 200:
                print(f"✅ Servidor Ollama funcionando correctamente (intento {attempt+1})")
                return True
            else:
                print(f"⚠️ Servidor Ollama respondió con código {response.status_code} (intento {attempt+1})")
        except requests.exceptions.ConnectionError:
            print(f"⚠️ Conexión rechazada por el servidor Ollama (intento {attempt+1})")
        except Exception as e:
            print(f"❌ Error verificando el servidor Ollama: {e} (intento {attempt+1})")
        if attempt < max_attempts - 1:
            print("Esperando 10 segundos más...")
            time.sleep(10)
    return False

def pull_model(model_name, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            print(f"Descargando modelo {model_name} (intento {attempt+1})...")
            subprocess.run(
                ["ollama", "pull", model_name],
                capture_output=True,
                text=True,
                check=True
            )
            print(f"✅ Modelo descargado: {model_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"❌ Error descargando modelo {model_name}: {e.stderr}")
            if attempt < max_attempts - 1:
                print("Esperando antes de reintentar...")
                time.sleep(5)
    return False

# =============================================================================
# Paso 2: Configuración de dsRAG y sus componentes locales
# =============================================================================

# Configurar variables de entorno (si fuese necesario)
os.environ["OPENAI_API_KEY"] = ""
os.environ["CO_API_KEY"] = ""

# Importamos módulos de dsRAG y otros
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
from dsrag.rse import get_best_segments
from dsrag.embedding import Embedding, OllamaEmbedding
from dsrag.reranker import Reranker
from dsrag.llm import LLM
from dsrag.database.vector.types import VectorSearchResult
from sentence_transformers import CrossEncoder
from langchain_ollama import ChatOllama

# Clase de embedding local: elimina el parámetro "model" de kwargs para evitar duplicidad
class LocalEmbedding(OllamaEmbedding):
    def __init__(self, **kwargs):
        if 'model' in kwargs:
            kwargs.pop('model')
        kwargs['dimension'] = 1024
        super().__init__(model="jeffh/intfloat-multilingual-e5-large:f16", **kwargs)

# Clase de reranker multilingüe basado en CrossEncoder
class MultilingualReranker(Reranker):
    def __init__(self, model_name="jinaai/jina-reranker-v2-base-multilingual"):
        super().__init__()
        self.model = CrossEncoder(model_name, trust_remote_code=True)

    def rerank_search_results(self, query: str, search_results: list) -> list:
        sentences1 = [query] * len(search_results)
        sentences2 = [result.metadata['chunk_text'] for result in search_results]
        scores = self.model.predict(list(zip(sentences1, sentences2)))
        for i in range(len(search_results)):
            search_results[i].similarity = scores[i]
        return sorted(search_results, key=lambda x: x.similarity, reverse=True)

# Clase de LLM para AutoContext utilizando Ollama local
class LocalAutoContextLLM(LLM):
    def __init__(self, **kwargs):
        super().__init__()
        self.model_name = "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M"
        self.base_url = "http://localhost:11434"

    def make_llm_call(self, chat_messages: list) -> str:
        prompt_text = ""
        for msg in chat_messages:
            if msg["role"] == "user":
                prompt_text += "<start_of_turn>user " + msg["content"].strip() + "<end_of_turn>"
            else:
                prompt_text += "<start_of_turn>model " + msg["content"].strip() + "<end_of_turn>"
        prompt_text += "<start_of_turn>model"
        try:
            payload = {
                "model": self.model_name,
                "messages": [{"role": "user", "content": prompt_text}]
            }
            response = requests.post(f"{self.base_url}/api/chat", json=payload)
            response.raise_for_status()
            response_json = response.json()
            if "message" in response_json and "content" in response_json["message"]:
                return response_json["message"]["content"].strip()
            elif "choices" in response_json:
                return response_json["choices"][0]["message"]["content"].strip()
            else:
                print(f"Estructura de respuesta inesperada: {response_json}")
                return str(response_json)
        except Exception as e:
            print(f"Error en la llamada LLM: {e}")
            return f"Error generando respuesta: {str(e)}"

# =============================================================================
# Paso 3: Flujo de trabajo dsRAG (carga de documento, consulta, etc.)
# =============================================================================

def run_dsrag_workflow():
    try:
        print("Inicializando modelo de embeddings...")
        local_embedding_model = LocalEmbedding()
        print("Modelo de embeddings inicializado")

        print("Inicializando reranker...")
        multilingual_reranker = MultilingualReranker()
        print("Reranker inicializado")

        print("Inicializando AutoContext LLM local...")
        local_autocontext_llm = LocalAutoContextLLM()
        print("AutoContext LLM inicializado")

        print("Inicializando Knowledge Base...")
        kb = KnowledgeBase(
            kb_id="local_kb_vtt_openrouter",
            embedding_model=local_embedding_model,
            reranker=multilingual_reranker,
            auto_context_model=local_autocontext_llm,
        )
        print("Knowledge Base inicializado")

        # Ejemplo de documento en formato VTT
        vtt_content = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:05.000
This is the first subtitle, talking about dsRAG features.

00:00:05.000 --> 00:00:10.000
dsRAG excels at handling complex queries over dense text.

00:00:10.000 --> 00:00:15.000
It uses Semantic Sectioning, AutoContext, and RSE for improved RAG.

00:00:15.000 --> 00:00:20.000
dsRAG achieves high accuracy on benchmarks like FinanceBench.
"""
        # Procesar VTT y añadir documento al KB
        vtt_chunks = [line for line in vtt_content.strip().split('\n\n') if line.strip()]
        doc_id_vtt = "vtt_document_1"
        print("Añadiendo documento al Knowledge Base...")
        kb.add_document(
            doc_id=doc_id_vtt,
            text="\n\n".join(vtt_chunks),
            document_title="VTT Document Example",
            auto_context_config={"get_document_summary": False},
            semantic_sectioning_config={"use_semantic_sectioning": False}
        )
        print("Documento añadido exitosamente!")
        
        # Realizar consulta al KB
        search_queries_vtt = ["What are the key features of dsRAG?", "How accurate is dsRAG?"]
        print(f"Consultando KB con: {search_queries_vtt}")
        results_vtt = kb.query(search_queries_vtt)
        print("Consulta realizada. Resultados:")
        for i, result in enumerate(results_vtt):
            print(f"Resultado {i+1}: {result.get('content', '')[:100]}...")
        return kb, results_vtt
    except Exception as e:
        print(f"Error en el workflow: {e}")
        traceback.print_exc()
        return None, None

# =============================================================================
# Paso 4: Generar respuesta final usando Open Router
# =============================================================================

def get_open_router_final_response(context, user_query, open_router_api_key, 
                                   open_router_base_url="https://api.openrouter.ai/chat/completions"):
    """
    Genera una respuesta final utilizando Open Router.
    """
    prompt = f"Based on the following information:\n{context}\n\nAnswer the query: {user_query}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {open_router_api_key}"
    }
    payload = {
        "model": "gpt-3.5-turbo",  # Ajusta el modelo si es necesario
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    try:
        response = requests.post(open_router_base_url, headers=headers, json=payload)
        response.raise_for_status()
        response_json = response.json()
        if "choices" in response_json:
            return response_json["choices"][0]["message"]["content"].strip()
        else:
            return str(response_json)
    except Exception as e:
        print(f"Error generando respuesta final con Open Router: {e}")
        return f"Error: {str(e)}"

# =============================================================================
# Main: Ejecución del flujo completo
# =============================================================================

def main():
    # Paso 1: Instalar y verificar Ollama, iniciar el servidor
    install_and_verify_ollama()
    ollama_process = run_ollama_server()
    print("Esperando 30 segundos para que el servidor Ollama se inicie...")
    time.sleep(30)
    if not check_ollama_server():
        print("❌ No se pudo conectar al servidor Ollama tras múltiples intentos.")
        if ollama_process:
            print("Intentando reiniciar el servidor Ollama...")
            ollama_process.terminate()
            time.sleep(5)
            ollama_process = run_ollama_server()
            time.sleep(20)
            if not check_ollama_server():
                print("❌ Falló el reinicio del servidor Ollama. Saliendo.")
                sys.exit(1)
    
    # Paso 2: Descargar modelos necesarios
    models_to_pull = [
        "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
        "jeffh/intfloat-multilingual-e5-large:f16"
    ]
    for model in models_to_pull:
        if not pull_model(model):
            print(f"❌ Falló la descarga del modelo {model}. Continuando, pero podrían ocurrir errores.")
    
    # Paso 3: Ejecutar el workflow de dsRAG local
    kb, results = run_dsrag_workflow()
    if kb is None or results is None:
        print("Error en la ejecución del workflow dsRAG. Saliendo.")
        sys.exit(1)
    
    # Preparar contexto para la respuesta final a partir de los resultados del KB
    context_vtt = "\n\n".join([segment.get('content', '') for segment in results])
    user_query_vtt = "Summarize the key features and accuracy of dsRAG based on the provided context."
    
    # Paso 4: Generar respuesta final usando Open Router
    # Reemplaza 'TU_OPEN_ROUTER_API_KEY' con tu API key real de Open Router.
    open_router_api_key = "TU_OPEN_ROUTER_API_KEY"
    open_router_base_url = "https://api.openrouter.ai/chat/completions"
    
    print("Generando respuesta final usando Open Router...")
    final_answer = get_open_router_final_response(context_vtt, user_query_vtt, open_router_api_key, open_router_base_url)
    print("\nRespuesta final:")
    print(final_answer)
    
    # Finalizar el proceso de Ollama si es necesario
    if ollama_process:
        print("Terminando el proceso de Ollama...")
        ollama_process.terminate()

if __name__ == "__main__":
    main()

Detalles importantes de esta implementación

Instalación y verificación de Ollama:
Se ejecuta el script de instalación y se comprueba la versión. Se inicia el servidor y se verifica su disponibilidad (reintentando si es necesario).
Componentes de dsRAG locales:
Se definen clases personalizadas para el embedding, reranqueo y LLM de AutoContext, de modo que se utilicen modelos locales (Ollama para LLM y un modelo de embeddings específico).
Flujo de trabajo (Knowledge Base):
Se agrega un ejemplo de documento en formato VTT al KB, se realizan consultas y se imprimen los primeros resultados.
Generación de respuesta final con Open Router:
La función get_open_router_final_response arma un prompt combinando el contexto extraído del KB y la query del usuario, y realiza una llamada a la API de Open Router. Asegúrate de configurar la API key y la URL base correctamente según la documentación de Open Router.
Limpieza:
Al finalizar se termina el proceso del servidor Ollama.

Este código es un punto de partida integral para que el sistema funcione en su totalidad, adaptando el flujo de RAG a modelos locales y usando Open Router para la respuesta final. Asegúrate de probar cada componente en tu entorno y ajustar parámetros según tus necesidades.

Actualmente estoy trabajando con dsrag, un modelo ...

Pregunta

Confirmar que Ollama se instaló verificando la versión

Definir una función para iniciar Ollama más robusta

Iniciar el servidor y esperar

Esperar más tiempo para asegurar que el servidor esté listo

Verificar que el servidor está respondiendo

Descargar modelos con verificación

Descargar modelos necesarios

Descargar modelos con verificación

Descargar modelos necesarios

Exploremos la clase OllamaEmbedding para ver qué parámetros acepta

Clase de embedding corregida - NO pasa model en kwargs

Clase para reranker - se mantiene igual

Clase para LLM de AutoContext corregida

Función para verificar que Ollama está funcionando

Función para ejecutar todo el flujo de trabajo

Ejecutar el flujo de trabajo

Ejemplo de consulta final utilizando langchain-ollama si todo funcionó

Generate response using local Ollama (langchain-ollama)

Pensamiento

Respuesta

Resumen del Flujo de Trabajo

Consideraciones y Sugerencias

Conclusión

Pregunta

Pensamiento

Respuesta

Detalles importantes de esta implementación

Compartir este Q&A