Created on: March 16, 2025
Answered using O3 Mini High by Chat01
I'm currently working with dsRAG, a RAG framework that performs much better than other RAG methods, reaching up to 96% accuracy, but I want to run the entire RAG pipeline with local models, except for the final answer, where I want to use OpenRouter. Here is the project documentation in notebook form:
Solving the out-of-context chunk problem for RAG
A large percentage of the problems developers face with RAG comes down
to this: Individual chunks oftentimes do not contain sufficient context
to be properly used by the retrieval system or the LLM. This leads to
the inability to answer questions and, more worryingly, hallucinations.
Examples of this problem
What would a solution look like?
There are two solutions that, when used together, solve the bulk of
these problems.
Contextual chunk headers
The idea here is to add in higher-level context to the chunk by
prepending a chunk header. This chunk header could be as simple as just
the document title, or it could use a combination of document title, a
concise document summary, and the full hierarchy of section and
sub-section titles.
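As a rough illustration (the field names and sample values below are invented for this example, not dsRAG's exact header format), a fuller chunk header combining title, summary, and section hierarchy might be assembled like this:

```python
# Illustrative sketch only: one way to assemble a contextual chunk header.
# The exact fields and wording depend on your pipeline; dsRAG builds its own variant.
document_title = "Nike 10-K FY2023"
document_summary = "Nike's annual report for the fiscal year ended May 31, 2023."
section_hierarchy = ["Financial Statements", "Stock-Based Compensation"]
chunk_text = "(original chunk text goes here)"

chunk_header = (
    f"Document: {document_title}\n"
    f"Summary: {document_summary}\n"
    f"Section: {' > '.join(section_hierarchy)}"
)
chunk_with_header = f"{chunk_header}\n\n{chunk_text}"
print(chunk_with_header)
```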
Chunks -> segments
Large chunks provide better context to the LLM than small chunks, but
they also make it harder to precisely retrieve specific pieces of
information. Some queries (like simple factoid questions) are best
handled by small chunks, while other queries (like higher-level
questions) require very large chunks. What we really need is a more
dynamic system that can retrieve short chunks when that's all that's
needed, but can also retrieve very large chunks when required. How do we
do that?
Given that relevant chunks tend to be clustered within their original
documents, what if we could identify those clusters and concatenate all
of the chunks back into their original form? This would provide much
more complete and understandable context to the LLM than just providing
the individual (and out-of-order) chunks.
Setup
First, we need to import some things and set API keys.
```python
import sys
sys.path.append("../")

from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
from dsrag.rse import get_best_segments

import cohere
import os
from scipy.stats import beta
import numpy as np
import matplotlib.pyplot as plt

# set API keys
os.environ["OPENAI_API_KEY"] = ""
os.environ["CO_API_KEY"] = ""
```
Load in document and add to KB
We'll go ahead and add the document to a dsRAG knowledge base, which
will take care of semantic sectioning and chunking. Semantic sectioning
is the process of breaking a document into semantically cohesive
sections. These sections are then broken up into smaller chunks if
needed. Each section is given a descriptive title, which we'll use later
in the ranking process.
```python
# load in some data
# file_path = "../tests/data/levels_of_agi.pdf"
file_path = "../tests/data/nike_2023_annual_report.txt"
doc_id = os.path.basename(file_path).split(".")[0]  # grab the file name without the extension so we can use it as the doc_id

# kb_id = "levels_of_agi"
kb_id = "nike_10k"

if file_path.endswith(".pdf"):
    document_text, _ = extract_text_from_pdf(file_path)
else:
    with open(file_path, "r") as f:
        document_text = f.read()

print(document_text[:1000])
```

```text
FORM 10-K FORM 10-KUNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE TRANSITION PERIOD FROM TO .
Commission File No. 1-10635
NIKE, Inc.
(Exact name of Registrant as specified in its charter)
Oregon 93-0584541
(State or other jurisdiction of incorporation) (IRS Employer Identification No.)
One Bowerman Drive, Beaverton, Oregon 97005-6453
(Address of principal executive offices and zip code)
(503) 671-6453
(Registrant's telephone number, including area code)
SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:
Class B Common Stock NKE New York Stock Exchange
(Title of each class) (Trading symbol) (Name of each exchange on which registered)
SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:
NONE
Indica
```

```python
# NOTE: the two sample KBs used in this notebook already exist in the example_kb_data directory,
# so you don't need to run this cell unless you want to create a new KB

# add to knowledge base
# kb = KnowledgeBase(kb_id=kb_id, exists_ok=False, storage_directory='example_kb_data')  # create a new knowledge base
# kb.add_document(doc_id=doc_id, text=document_text)  # add the document to the knowledge base
```
View chunks and sections
```python
# load in chunks
kb = KnowledgeBase(kb_id=kb_id, exists_ok=True, storage_directory='example_kb_data')
num_chunks = len(kb.chunk_db.data[doc_id])
print(num_chunks)

chunks = []
for i in range(num_chunks):
    chunk = {
        "section_title": kb.chunk_db.get_section_title(doc_id, i),
        "chunk_text": kb.chunk_db.get_chunk_text(doc_id, i),
    }
    chunks.append(chunk)

chunks[0]
```

```text
493
{'section_title': 'Form 10-K Cover Page and Company Information',
 'chunk_text': "FORM 10-K FORM 10-KUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K \n(Mark One)\n☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE FISCAL YEAR ENDED MAY 31, 2023 \nOR\n☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE TRANSITION PERIOD FROM TO .\nCommission File No. 1-10635 \nNIKE, Inc. \n(Exact name of Registrant as specified in its charter)\nOregon 93-0584541\n(State or other jurisdiction of incorporation) (IRS Employer Identification No.)\nOne Bowerman Drive, Beaverton, Oregon 97005-6453 \n(Address of principal executive offices and zip code)\n(503) 671-6453 \n(Registrant's telephone number, including area code)\nSECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:"}
```
Now let's print out all of the sections that the semantic sectioning
process identified.
```python
# print all section titles
unique_section_titles = []
for i in range(num_chunks):
    section_title = chunks[i]["section_title"]
    if section_title not in unique_section_titles:
        print(section_title)
        unique_section_titles.append(section_title)
```
Add contextual chunk headers and measure impact
We'll use the Cohere Rerank API to rank our chunks. Normally, we'd
start with a vector and/or keyword search to narrow down the list of
candidates, but since we're just dealing with a single document here we
can just send all chunks directly to the reranker, keeping things a bit
simpler.
```python
def transform(x):
    """
    Transformation function to map the absolute relevance value to a value that is more
    uniformly distributed between 0 and 1.
    - This is critical for the new version of RSE to work properly, because it utilizes
      the absolute relevance values to calculate the similarity scores.
    - The relevance values given by the Cohere reranker tend to be very close to 0 or 1.
      The beta function used here helps to spread out the values more uniformly.
    """
    a, b = 0.4, 0.4  # These can be adjusted to change the distribution shape
    return beta.cdf(x, a, b)

def rerank_documents(query: str, documents: list) -> list:
    """
    Use Cohere Rerank API to rerank the search results
    """
    model = "rerank-v3.5"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])
    decay_rate = 30

    reranked_results = client.rerank(model=model, query=query, documents=documents)
    results = reranked_results.results
    reranked_indices = [result.index for result in results]
    reranked_similarity_scores = [result.relevance_score for result in results]  # in order of reranked_indices

    # convert back to order of original documents and calculate the chunk values
    similarity_scores = [0] * len(documents)
    chunk_values = [0] * len(documents)
    for i, index in enumerate(reranked_indices):
        absolute_relevance_value = transform(reranked_similarity_scores[i])
        similarity_scores[index] = absolute_relevance_value
        v = np.exp(-i / decay_rate) * absolute_relevance_value  # decay the relevance value based on the rank
        chunk_values[index] = v

    return similarity_scores, chunk_values
```
Contextual chunk headers
We'll add both a document context and a section context to each chunk.
```python
if kb_id == "levels_of_agi":
    document_context = "Document: Levels of AGI"
elif kb_id == "nike_10k":
    document_context = "Document: Nike 10-K FY2023"
else:
    document_context = "Document: Unknown"

documents = []
documents_no_context = []  # baseline for comparison
for i in range(num_chunks):
    section_context = f"Section: {chunks[i]['section_title']}"
    chunk_text = chunks[i]["chunk_text"]
    document = f"{document_context}\n{section_context}\n\n{chunk_text}"
    documents.append(document)
    documents_no_context.append(chunk_text)

chunk_index_to_inspect = 226
query = "Nike operating segment results"
print(documents[chunk_index_to_inspect])
```

```text
Document: Nike 10-K FY2023
Section: Operating Segments Overview

Asia Pacific & Latin America(2) 6,431 5,955 8 % 17 % 5,343 11 % 16 %
Global Brand Divisions(3) 58 102 -43 % -43 % 25 308 % 302 %
TOTAL NIKE BRAND $ 48,763 $ 44,436 10 % 16 % $ 42,293 5 % 6 %
Converse 2,427 2,346 3 % 8 % 2,205 6 % 7 %
Corporate(4) 27 (72) — — 40 — —
TOTAL NIKE, INC. REVENUES $ 51,217 $ 46,710 10 % 16 % $ 44,538 5 % 6 %
(1) The percent change excluding currency changes represents a non-GAAP financial measure. For further information, see "Use of Non-GAAP Financial Measures".
(2) For additional information on the transition of our NIKE Brand businesses within our CASA territory to a third-party distributor, see Note 18 —
```
Prepending document and section title information adds important context
to the chunk. Let's see how much of a difference it makes in the
relevance ranking.
```python
# run this chunk through the Cohere Rerank API with and without the context header
similarity_scores, chunk_values = rerank_documents(
    query,
    [documents[chunk_index_to_inspect], documents_no_context[chunk_index_to_inspect]]
)
print(f"Similarity with contextual chunk header: {similarity_scores[0]}")
print(f"Similarity without contextual chunk header: {similarity_scores[1]}")
```

```text
Similarity with contextual chunk header: 0.7322965626380898
Similarity without contextual chunk header: 0.615422859563519
```
That's a pretty decent difference.
Run queries and visualize results
```python
# Levels of AGI
#query = "Principles for defining levels of AGI"
#query = "Levels of AGI paper references"
#query = "CoALA paper references"
#query = "What are the levels of AGI?"
#query = "What is Autonomy Level 3"
#query = "Levels of AGI Introduction section"
#query = "Use of existing AI benchmarks like Big-bench and HELM"
#query = "How does autonomy factor into AGI?"
#query = "What are the levels of AGI?"

# Nike 10-K
#query = "Nike cash flow statement"
#query = "Nike revenue growth"
#query = "2023 Nike North America segment results"
#query = "Nike board of directors"
#query = "Nike supply chain challenges"
query = "Nike stock-based compensation expenses"
#query = "Nike consolidated balance sheet"
#query = "What is the mission of Nike?"
#query = "Nike CEO tenure"

similarity_scores, chunk_values = rerank_documents(query, documents)

plt.figure(figsize=(12, 5))
plt.title("Similarity of each chunk in the document to the search query")
plt.ylim(0, 1)
plt.xlabel("Chunk index")
plt.ylabel("Query-chunk similarity")
plt.scatter(range(len(chunk_values)), chunk_values)
```
How to interpret the chunk relevance plot above
In the plot above, the x-axis represents the chunk index. The first
chunk in the document has index 0, the next chunk has index 1, etc. The
y-axis represents the relevance of each chunk to the query. Viewing it
this way lets us see how relevant chunks tend to be clustered in one or
more sections of a document.
Note: the relevance values in this plot are actually a combination of
the raw relevance value and the relevance ranks. An exponential decay
function is applied to the ranks, and that is then multiplied by the raw
relevance value. Using this combination provides a more robust measure
of relevance than using just one or the other, at least in most cases.
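As a quick numeric illustration of that combination (using the same decay_rate of 30 as rerank_documents above and an assumed raw relevance value of 0.8), the rank-based decay shrinks a chunk's value the further down the ranking it appears:

```python
# Illustration of the rank decay used in rerank_documents: the same raw relevance
# value contributes less to the chunk value at worse (higher) ranks.
import numpy as np

decay_rate = 30
raw_relevance = 0.8  # assumed transformed relevance value for this example
for rank in [0, 10, 30, 60]:
    print(f"rank {rank:>2}: chunk value = {np.exp(-rank / decay_rate) * raw_relevance:.3f}")
# rank  0: 0.800, rank 10: 0.573, rank 30: 0.294, rank 60: 0.108
```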
What can we do with these clusters of relevant chunks?
The core idea is that clusters of relevant chunks, in their original
contiguous form, provide much better context to the LLM than individual
chunks can. Now for the hard part: how do we actually identify these
clusters?
If we can calculate chunk values in such a way that the value of a
segment is just the sum of the values of its constituent chunks, then
finding the optimal segment is a version of the maximum subarray
problem, for which a solution can be found relatively easily. How do we
define chunk values in such a way? We'll start with the idea that highly
relevant chunks are good, and irrelevant chunks are bad. We already have
a good measure of chunk relevance, on a scale of 0-1, so all we need to
do is subtract a constant threshold value from it. This will turn the
chunk value of irrelevant chunks to a negative number, while keeping the
values of relevant chunks positive. We call this the
irrelevant_chunk_penalty. A value around 0.2 seems to work well
empirically.
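To make the idea concrete, here is a minimal sketch of that formulation: subtract the penalty, then find the contiguous run of chunks with the highest total value. This is a plain Kadane-style maximum subarray pass, not dsRAG's actual get_best_segments, which additionally enforces length and minimum-value constraints and respects document boundaries:

```python
# Minimal sketch: subtract a constant penalty from each chunk's relevance so irrelevant
# chunks go negative, then find the contiguous run of chunks with the highest total value.
def best_segment(chunk_values, irrelevant_chunk_penalty=0.2):
    values = [v - irrelevant_chunk_penalty for v in chunk_values]
    best_sum, best_range = float("-inf"), (0, 0)
    current_sum, current_start = 0.0, 0
    for i, v in enumerate(values):
        if current_sum <= 0:
            current_sum, current_start = v, i  # start a new candidate segment here
        else:
            current_sum += v
        if current_sum > best_sum:
            best_sum, best_range = current_sum, (current_start, i + 1)  # end index is exclusive
    return best_range, best_sum

# Chunks 2-4 form the best segment even though chunk 3 is only mildly relevant
print(best_segment([0.1, 0.05, 0.9, 0.3, 0.8, 0.1]))  # -> ((2, 5), ~1.4)
```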
Relevant segment deep dive
```python
# be sure you're using the Nike 10-K KB for these next few cells,
# as we'll be focusing on a single example query for that document
query = "Nike stock-based compensation expenses"

similarity_scores, chunk_values = rerank_documents(query, documents)

irrelevant_chunk_penalty = 0.2
all_relevance_values = [[v - irrelevant_chunk_penalty for v in chunk_values]]
document_splits = []
max_length = 30
overall_max_length = 50
minimum_value = 0.5

# get_best_segments solves a constrained version of the maximum sum subarray problem
best_segments, scores = get_best_segments(all_relevance_values, document_splits, max_length, overall_max_length, minimum_value)
print(best_segments)
print(scores)
print()

# print the best segments
for segment_start, segment_end in best_segments:
    # concatenate the text of the chunks in the segment
    segment_text = f"[{document_context}]\n"
    for i in range(segment_start, segment_end):
        chunk_text = chunks[i]["chunk_text"]
        segment_text += chunk_text + "\n"
    print(segment_text)
    print("\n---\n")
```
Let's zoom in to the best segment and see what the chunk relevance
values look like.
```python
# plot the relevance values of the best segment
best_segment_chunk_indexes = list(range(best_segments[0][0], best_segments[0][1]))
best_segment_chunk_values = chunk_values[best_segments[0][0]:best_segments[0][1]]

plt.figure(figsize=(12, 5))
plt.title("Relevance values of the best segment")
plt.ylim(0, 1)
plt.xlabel("Chunk index")
plt.ylabel("Query-chunk similarity")
plt.scatter(best_segment_chunk_indexes, best_segment_chunk_values)
```
You'll notice that a few of the chunks (especially 402-404) aren't
particularly relevant, but they're surrounded by relevant chunks, so we
include them in the segment for continuity.
```python
# print the individual chunks in the best segment - annotated with their chunk indexes and relevance values
for chunk_index in best_segment_chunk_indexes:
    chunk_text = chunks[chunk_index]["chunk_text"]
    chunk_value = chunk_values[chunk_index]
    print(f"Chunk index: {chunk_index} - Relevance value: {chunk_value}")
    print(chunk_text)
    print("\n---\n")
```
Looking at the content of each of these chunks, it's clear that chunks
397-401 are highly relevant, as expected. But looking closely at chunks
402-404, we can see they're actually also relevant, despite being marked
as irrelevant by our ranking model. This is a common theme: chunks that
are marked as not relevant, but are sandwiched between highly relevant
chunks, are oftentimes quite relevant. In this case, the chunks were
about stock option valuation, so while they weren't explicitly
discussing stock-based compensation expenses (which is what we were
searching for), in the context of the surrounding chunks it's clear that
they are actually relevant.
Here is the code I've written:
```python
import subprocess
import os
import time
import threading
import sys

print("Installing Ollama...")
try:
    install_result = subprocess.run(
        ["curl", "-fsSL", "https://ollama.com/install.sh"],
        stdout=subprocess.PIPE,
        check=True
    )
    install_script = install_result.stdout.decode('utf-8')

    # Run the installation script
    install_process = subprocess.run(
        ["bash", "-c", install_script],
        check=True
    )
    print("✅ Ollama installed successfully")
except subprocess.CalledProcessError as e:
    print(f"❌ Error installing Ollama: {e}")
    sys.exit(1)

try:
    version_result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
    print(f"Ollama version: {version_result.stdout.strip()}")
except Exception as e:
    print(f"❌ Error checking Ollama version: {e}")
    # Try to locate the binary
    find_result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
    print(f"Ollama binary location: {find_result.stdout.strip()}")

def run_ollama_server():
    try:
        print("Starting Ollama server...")
        # Use the full path to the binary if necessary
        ollama_path = subprocess.run(["which", "ollama"], capture_output=True, text=True).stdout.strip()
        if not ollama_path:
            print("❌ Ollama binary not found in PATH")
            return None
        print(f"Using Ollama binary at: {ollama_path}")

        # Start the server with its output redirected
        process = subprocess.Popen(
            [ollama_path, "serve"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        print(f"Ollama process started with PID: {process.pid}")
        return process
    except Exception as e:
        print(f"❌ Error starting Ollama server: {e}")
        return None

ollama_process = run_ollama_server()

print("Waiting for Ollama server to start (30 seconds)...")
time.sleep(30)

import requests

def check_ollama_server():
    max_attempts = 5
    for attempt in range(max_attempts):
        try:
            response = requests.get("http://localhost:11434/api/tags")
            if response.status_code == 200:
                print(f"✅ Ollama server is running correctly (attempt {attempt+1})")
                return True
            else:
                print(f"⚠️ Ollama server responded with status code: {response.status_code} (attempt {attempt+1})")
        except requests.exceptions.ConnectionError:
            print(f"⚠️ Ollama server connection refused (attempt {attempt+1})")
        except Exception as e:
            print(f"❌ Error checking Ollama server: {e} (attempt {attempt+1})")

        if attempt < max_attempts - 1:
            print("Waiting 10 more seconds...")
            time.sleep(10)
    return False

if not check_ollama_server():
    print("❌ Could not connect to Ollama server after multiple attempts")
    # Try restarting the server
    if ollama_process:
        print("Attempting to restart Ollama server...")
        ollama_process.terminate()
        time.sleep(5)
        ollama_process = run_ollama_server()
        time.sleep(20)
        if not check_ollama_server():
            print("❌ Failed to restart Ollama server")
            sys.exit(1)

def pull_model(model_name):
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            print(f"Pulling model {model_name} (attempt {attempt+1})...")
            result = subprocess.run(
                ["ollama", "pull", model_name],
                capture_output=True,
                text=True,
                check=True
            )
            print(f"✅ Successfully pulled model: {model_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"❌ Error pulling model {model_name}: {e.stderr}")
            if attempt < max_attempts - 1:
                print("Waiting before retry...")
                time.sleep(5)
    return False

models_to_pull = [
    "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
    "jeffh/intfloat-multilingual-e5-large:f16"
]

for model in models_to_pull:
    if not pull_model(model):
        print(f"❌ Failed to pull {model} after multiple attempts")
        print("Continuing anyway, but this may cause errors later")

def pull_model(model_name):
    try:
        result = subprocess.run(["ollama", "pull", model_name],
                                capture_output=True, text=True, check=True)
        print(f"Successfully pulled model: {model_name}")
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error pulling model {model_name}: {e.stderr}")
        return False

models_to_pull = [
    "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
    "jeffh/intfloat-multilingual-e5-large:f16"
]

for model in models_to_pull:
    pull_success = pull_model(model)
    if not pull_success:
        print(f"Warning: Failed to pull {model}, this may cause errors later")

from dsrag.embedding import Embedding, OllamaEmbedding
from dsrag.reranker import Reranker, NoReranker
from dsrag.llm import LLM, OllamaAPI
from dsrag.knowledge_base import KnowledgeBase
from typing import List, Dict, Optional
from sentence_transformers import CrossEncoder
from langchain_ollama import ChatOllama
from dsrag.database.vector.types import Vector, VectorSearchResult
import requests
import json

print("\nExploring OllamaEmbedding parameters:")
import inspect
try:
    print(inspect.signature(OllamaEmbedding.__init__))
    print("OllamaEmbedding docstring:")
    print(OllamaEmbedding.__init__.__doc__)
except Exception as e:
    print(f"Error exploring OllamaEmbedding: {e}")

class LocalEmbedding(OllamaEmbedding):
    def __init__(self, **kwargs):
        # IMPORTANT: remove 'model' from kwargs to avoid passing it twice
        if 'model' in kwargs:
            kwargs.pop('model')

        # Set the embedding dimension manually
        kwargs['dimension'] = 1024

        # Call the parent constructor with the model as a positional argument
        super().__init__(model="jeffh/intfloat-multilingual-e5-large:f16", **kwargs)

class MultilingualReranker(Reranker):
    def __init__(self, model_name="jinaai/jina-reranker-v2-base-multilingual"):
        super().__init__()
        self.model = CrossEncoder(model_name, trust_remote_code=True)

    def rerank_search_results(self, query: str, search_results: list) -> list[VectorSearchResult]:
        sentences1 = [query] * len(search_results)
        sentences2 = [result.metadata['chunk_text'] for result in search_results]
        scores = self.model.predict(list(zip(sentences1, sentences2)))
        for i in range(len(search_results)):
            search_results[i].similarity = scores[i]
        return sorted(search_results, key=lambda x: x.similarity, reverse=True)

class LocalAutoContextLLM(LLM):
    def __init__(self, **kwargs):
        super().__init__()
        self.model_name = "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M"
        self.base_url = "http://localhost:11434"

    def make_llm_call(self, chat_messages: list[dict]) -> str:
        # Build a Gemma-style prompt from the chat messages
        prompt_text = ""
        for msg in chat_messages:
            if msg["role"] == "user":
                prompt_text += "<start_of_turn>user " + msg["content"].strip() + "<end_of_turn>"
            else:
                prompt_text += "<start_of_turn>model " + msg["content"].strip() + "<end_of_turn>"
        prompt_text += "<start_of_turn>model"

        # Call the Ollama API
        try:
            payload = {
                "model": self.model_name,
                "messages": [{"role": "user", "content": prompt_text}],
                "stream": False  # request a single JSON response instead of a stream
            }
            response = requests.post(f"{self.base_url}/api/chat", json=payload)
            response.raise_for_status()
            response_json = response.json()

            # Handle the different response structures that may come back
            if "message" in response_json and "content" in response_json["message"]:
                return response_json["message"]["content"].strip()
            elif "choices" in response_json:
                return response_json["choices"][0]["message"]["content"].strip()
            else:
                print(f"Unexpected response structure: {response_json}")
                return str(response_json)
        except Exception as e:
            print(f"Error in LLM call: {e}")
            return f"Error generating response: {str(e)}"

def check_ollama_server():
    try:
        response = requests.get("http://localhost:11434/api/tags")
        if response.status_code == 200:
            print("✅ Ollama server is running correctly")
            return True
        else:
            print(f"⚠️ Ollama server responded with status code: {response.status_code}")
            return False
    except Exception as e:
        print(f"❌ Error connecting to Ollama server: {e}")
        return False

def run_dsrag_workflow():
    try:
        print("Initializing embedding model...")
        local_embedding_model = LocalEmbedding()
        print("Embedding model initialized successfully")

        print("Initializing reranker...")
        multilingual_reranker = MultilingualReranker()
        print("Reranker initialized successfully")

        print("Initializing AutoContext LLM...")
        local_autocontext_llm = LocalAutoContextLLM()
        print("AutoContext LLM initialized successfully")

        print("Initializing Knowledge Base...")
        kb = KnowledgeBase(
            kb_id="local_kb_vtt_ollama",
            embedding_model=local_embedding_model,
            reranker=multilingual_reranker,
            auto_context_model=local_autocontext_llm,
        )
        print("Knowledge Base initialized successfully")

        # Example usage with a VTT transcript
        vtt_content = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:05.000
This is the first subtitle, talking about dsRAG features.

00:00:05.000 --> 00:00:10.000
dsRAG excels at handling complex queries over dense text.

00:00:10.000 --> 00:00:15.000
It uses Semantic Sectioning, AutoContext, and RSE for improved RAG.

00:00:15.000 --> 00:00:20.000
dsRAG achieves high accuracy on benchmarks like FinanceBench.
"""

        # Process the VTT content
        vtt_chunks = [line for line in vtt_content.strip().split('\n\n') if line.strip()]

        # Add the document to the KB
        doc_id_vtt = "vtt_document_1"
        print("Adding document to Knowledge Base...")
        kb.add_document(
            doc_id=doc_id_vtt,
            text="\n\n".join(vtt_chunks),
            document_title="VTT Document Example",
            auto_context_config={"get_document_summary": False},
            semantic_sectioning_config={"use_semantic_sectioning": False}
        )
        print("Document added successfully!")

        # Query the KB
        search_queries_vtt = ["What are the key features of dsRAG?"]
        print(f"Querying Knowledge Base with: {search_queries_vtt}")
        results_vtt = kb.query(search_queries_vtt)
        print("Query successful! Results:")
        for i, result in enumerate(results_vtt):
            print(f"Result {i+1}: {result.get('content', '')[:100]}...")

        return kb, results_vtt
    except Exception as e:
        print(f"Error in workflow: {e}")
        import traceback
        traceback.print_exc()
        return None, None

kb, results = run_dsrag_workflow()

if kb is not None and results is not None:
    try:
        from langchain_ollama import ChatOllama

        # Build the context from the retrieved segments
        context = "\n\n".join([segment.get('content', '') for segment in results])

        # User query
        user_query = "Summarize the key features and accuracy of dsRAG based on the provided context."

        # Generate the final answer
        print("\nGenerating final response with langchain-ollama...")
        ollama_llm = ChatOllama(
            model="MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
            base_url="http://localhost:11434"
        )
        prompt = f"Based on the following information:\n{context}\n\nAnswer the query: {user_query}"
        response = ollama_llm.invoke(prompt)
        print("\nFinal answer:")
        print(response.content)
    except Exception as e:
        print(f"Error generating final response: {e}")
        import traceback
        traceback.print_exc()

from langchain_ollama import ChatOllama

def get_ollama_final_response(context, user_query):
    """
    Generates a response using local Ollama via langchain-ollama.
    """
    ollama_llm = ChatOllama(
        model="gemma-7b-it",  # Correct Ollama model name. MAKE SURE THIS IS INSTALLED!
        base_url="http://localhost:11434"  # Ollama's default API endpoint
    )
    prompt = f"Based on the following information: {context}\n\nAnswer the query: {user_query}"
    response = ollama_llm.invoke(prompt)  # Use langchain-ollama's invoke method
    return response.content  # Access the response content

search_queries_vtt = ["What are the key features of dsRAG?", "How accurate is dsRAG?"]
results_vtt = kb.query(search_queries_vtt)
context_vtt = "\n\n".join([segment['content'] for segment in results_vtt])
user_query_vtt = "Summarize the key features and accuracy of dsRAG based on the provided context."

answer_ollama = get_ollama_final_response(context_vtt, user_query_vtt)
print("\nAnswer from Local Ollama (langchain-ollama):\n---")
print(answer_ollama)
```
Your implementation looks quite complete, and it seems you've integrated several dsRAG components (context injection, semantic sectioning, reranking, and local embeddings) to build a workflow that uses local models for most of the process. Here is a summary and some suggestions:

Installation and verification of Ollama
The script installs Ollama, checks the version, starts the server, and verifies that it is reachable.

Model downloads
pull_model is defined twice; you could consolidate it to avoid the duplication.

Custom class definitions
You subclass OllamaEmbedding, removing the redundant "model" parameter and setting the embedding dimension manually, plus a CrossEncoder-based reranker and a local AutoContext LLM.

Running the dsRAG workflow
The Knowledge Base is built with the local embedding model, reranker, and AutoContext LLM; a VTT document is added and queried.

Final response generation
The final answer is generated with langchain-ollama (through the get_ollama_final_response function), which builds a prompt from the KB context and the user's query.

Some specific suggestions:

Local models vs. final answer via OpenRouter:
If your goal is to use local models for the entire RAG process except the final answer, you need to make sure the last step (where the final response is generated) calls the OpenRouter endpoint instead of the local Ollama server. In get_ollama_final_response, you could parameterize the base_url and the model so they point to OpenRouter (making sure the authentication and other parameters are what that service requires).

Consolidation of duplicated functions:
I noticed you define two versions of pull_model. Consolidating them into a single function will keep the code cleaner and avoid inconsistencies.

Error handling and retries:
The try/except blocks and the retry logic are a good way to make sure services like Ollama are actually available. Be sure to capture and log enough detail on failure so problems are easier to diagnose in production environments.

Prompt construction and special tokens:
The way you concatenate messages to build the prompt in LocalAutoContextLLM is interesting. Double-check that the syntax and special tokens match what your model expects, and consider factoring that logic out into its own helper in case you change formats later (see the sketch after this list).

KB verification and query results:
The flow for adding documents to the KB and querying them appears to work correctly. If you later work with larger documents or multiple content types, you could consider exposing more configuration options for sectioning and context generation.

Documentation and comments:
The documentation included in the notebook is very useful for understanding the purpose of each block. Keeping clear comments in the code helps maintainability, especially in complex projects like this one.
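For example, the turn-formatting logic in LocalAutoContextLLM could be pulled out into a small helper along these lines (a sketch only; it is a hypothetical helper reusing the Gemma-style <start_of_turn>/<end_of_turn> tokens your code already uses, so adapt it to the exact chat template your model expects):

```python
# Sketch of a standalone prompt formatter (hypothetical helper, not part of dsRAG).
# It assumes the Gemma-style turn tokens already used in LocalAutoContextLLM.
def format_gemma_prompt(chat_messages: list[dict]) -> str:
    parts = []
    for msg in chat_messages:
        role = "user" if msg["role"] == "user" else "model"
        parts.append(f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>")
    parts.append("<start_of_turn>model")  # cue the model to generate its turn
    return "\n".join(parts)
```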
Your implementation brings together several advanced techniques to improve retrieval and generation quality in RAG. If your goal is to keep the RAG processing (including chunking, embeddings, and reranking) local and only externalize the final generation (for example, to OpenRouter), make sure to point the final call at the OpenRouter endpoint (adjusting the base_url and credentials as needed). If you have any specific questions about a component, or need help with the OpenRouter integration, let me know and I'll be glad to dig into it!
Okay, based on what you just answered, could you properly implement the changes so this system finally works? We've had a lot of problems adapting this project as a whole. Take all the time you need to think it through, and also to improve and polish the code so it works completely.
Below is a complete, polished version of the code, where local models are used for the RAG process (embedding, reranking, sectioning, etc.) and OpenRouter is used to generate the final response. I've consolidated duplicated functions, improved error handling, and parameterized the final OpenRouter call. Replace the constants (such as the OpenRouter API key) to match your environment:
```python
import subprocess
import os
import time
import sys
import requests
import json
import traceback
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# =============================================================================
# Step 1: Install and verify Ollama
# =============================================================================
def install_and_verify_ollama():
    print("Installing Ollama...")
    try:
        install_result = subprocess.run(
            ["curl", "-fsSL", "https://ollama.com/install.sh"],
            stdout=subprocess.PIPE,
            check=True
        )
        install_script = install_result.stdout.decode('utf-8')
        # Run the installation script
        subprocess.run(["bash", "-c", install_script], check=True)
        print("✅ Ollama installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"❌ Error installing Ollama: {e}")
        sys.exit(1)

    try:
        version_result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
        print(f"Ollama version: {version_result.stdout.strip()}")
    except Exception as e:
        print(f"❌ Error checking Ollama version: {e}")

    find_result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
    print(f"Ollama binary location: {find_result.stdout.strip()}")

def run_ollama_server():
    try:
        print("Starting Ollama server...")
        ollama_path = subprocess.run(["which", "ollama"], capture_output=True, text=True).stdout.strip()
        if not ollama_path:
            print("❌ Ollama binary not found in PATH")
            return None
        print(f"Using Ollama binary at: {ollama_path}")
        process = subprocess.Popen(
            [ollama_path, "serve"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        print(f"Ollama process started with PID: {process.pid}")
        return process
    except Exception as e:
        print(f"❌ Error starting Ollama server: {e}")
        return None

def check_ollama_server(max_attempts=5):
    for attempt in range(max_attempts):
        try:
            response = requests.get("http://localhost:11434/api/tags")
            if response.status_code == 200:
                print(f"✅ Ollama server is running correctly (attempt {attempt+1})")
                return True
            else:
                print(f"⚠️ Ollama server responded with status code {response.status_code} (attempt {attempt+1})")
        except requests.exceptions.ConnectionError:
            print(f"⚠️ Ollama server connection refused (attempt {attempt+1})")
        except Exception as e:
            print(f"❌ Error checking Ollama server: {e} (attempt {attempt+1})")
        if attempt < max_attempts - 1:
            print("Waiting 10 more seconds...")
            time.sleep(10)
    return False

def pull_model(model_name, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            print(f"Pulling model {model_name} (attempt {attempt+1})...")
            subprocess.run(
                ["ollama", "pull", model_name],
                capture_output=True,
                text=True,
                check=True
            )
            print(f"✅ Successfully pulled model: {model_name}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"❌ Error pulling model {model_name}: {e.stderr}")
            if attempt < max_attempts - 1:
                print("Waiting before retrying...")
                time.sleep(5)
    return False

# =============================================================================
# Step 2: Configure dsRAG and its local components
# =============================================================================
# Set environment variables (if needed)
os.environ["OPENAI_API_KEY"] = ""
os.environ["CO_API_KEY"] = ""

# Import dsRAG modules and other dependencies
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
from dsrag.rse import get_best_segments
from dsrag.embedding import Embedding, OllamaEmbedding
from dsrag.reranker import Reranker
from dsrag.llm import LLM
from dsrag.database.vector.types import VectorSearchResult
from sentence_transformers import CrossEncoder
from langchain_ollama import ChatOllama

# Local embedding class: removes the "model" parameter from kwargs to avoid passing it twice
class LocalEmbedding(OllamaEmbedding):
    def __init__(self, **kwargs):
        if 'model' in kwargs:
            kwargs.pop('model')
        kwargs['dimension'] = 1024
        super().__init__(model="jeffh/intfloat-multilingual-e5-large:f16", **kwargs)

# Multilingual reranker based on a CrossEncoder
class MultilingualReranker(Reranker):
    def __init__(self, model_name="jinaai/jina-reranker-v2-base-multilingual"):
        super().__init__()
        self.model = CrossEncoder(model_name, trust_remote_code=True)

    def rerank_search_results(self, query: str, search_results: list) -> list:
        sentences1 = [query] * len(search_results)
        sentences2 = [result.metadata['chunk_text'] for result in search_results]
        scores = self.model.predict(list(zip(sentences1, sentences2)))
        for i in range(len(search_results)):
            search_results[i].similarity = scores[i]
        return sorted(search_results, key=lambda x: x.similarity, reverse=True)

# AutoContext LLM class using local Ollama
class LocalAutoContextLLM(LLM):
    def __init__(self, **kwargs):
        super().__init__()
        self.model_name = "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M"
        self.base_url = "http://localhost:11434"

    def make_llm_call(self, chat_messages: list) -> str:
        prompt_text = ""
        for msg in chat_messages:
            if msg["role"] == "user":
                prompt_text += "<start_of_turn>user " + msg["content"].strip() + "<end_of_turn>"
            else:
                prompt_text += "<start_of_turn>model " + msg["content"].strip() + "<end_of_turn>"
        prompt_text += "<start_of_turn>model"
        try:
            payload = {
                "model": self.model_name,
                "messages": [{"role": "user", "content": prompt_text}],
                "stream": False  # request a single JSON response instead of a stream
            }
            response = requests.post(f"{self.base_url}/api/chat", json=payload)
            response.raise_for_status()
            response_json = response.json()
            if "message" in response_json and "content" in response_json["message"]:
                return response_json["message"]["content"].strip()
            elif "choices" in response_json:
                return response_json["choices"][0]["message"]["content"].strip()
            else:
                print(f"Unexpected response structure: {response_json}")
                return str(response_json)
        except Exception as e:
            print(f"Error in LLM call: {e}")
            return f"Error generating response: {str(e)}"

# =============================================================================
# Step 3: dsRAG workflow (load document, query, etc.)
# =============================================================================
def run_dsrag_workflow():
    try:
        print("Initializing embedding model...")
        local_embedding_model = LocalEmbedding()
        print("Embedding model initialized")

        print("Initializing reranker...")
        multilingual_reranker = MultilingualReranker()
        print("Reranker initialized")

        print("Initializing local AutoContext LLM...")
        local_autocontext_llm = LocalAutoContextLLM()
        print("AutoContext LLM initialized")

        print("Initializing Knowledge Base...")
        kb = KnowledgeBase(
            kb_id="local_kb_vtt_openrouter",
            embedding_model=local_embedding_model,
            reranker=multilingual_reranker,
            auto_context_model=local_autocontext_llm,
        )
        print("Knowledge Base initialized")

        # Example document in VTT format
        vtt_content = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:05.000
This is the first subtitle, talking about dsRAG features.

00:00:05.000 --> 00:00:10.000
dsRAG excels at handling complex queries over dense text.

00:00:10.000 --> 00:00:15.000
It uses Semantic Sectioning, AutoContext, and RSE for improved RAG.

00:00:15.000 --> 00:00:20.000
dsRAG achieves high accuracy on benchmarks like FinanceBench.
"""

        # Process the VTT content and add the document to the KB
        vtt_chunks = [line for line in vtt_content.strip().split('\n\n') if line.strip()]
        doc_id_vtt = "vtt_document_1"
        print("Adding document to the Knowledge Base...")
        kb.add_document(
            doc_id=doc_id_vtt,
            text="\n\n".join(vtt_chunks),
            document_title="VTT Document Example",
            auto_context_config={"get_document_summary": False},
            semantic_sectioning_config={"use_semantic_sectioning": False}
        )
        print("Document added successfully!")

        # Query the KB
        search_queries_vtt = ["What are the key features of dsRAG?", "How accurate is dsRAG?"]
        print(f"Querying the KB with: {search_queries_vtt}")
        results_vtt = kb.query(search_queries_vtt)
        print("Query completed. Results:")
        for i, result in enumerate(results_vtt):
            print(f"Result {i+1}: {result.get('content', '')[:100]}...")

        return kb, results_vtt
    except Exception as e:
        print(f"Error in workflow: {e}")
        traceback.print_exc()
        return None, None

# =============================================================================
# Step 4: Generate the final answer using OpenRouter
# =============================================================================
def get_open_router_final_response(context, user_query, open_router_api_key,
                                   open_router_base_url="https://openrouter.ai/api/v1/chat/completions"):
    """
    Generates the final answer using OpenRouter.
    """
    prompt = f"Based on the following information:\n{context}\n\nAnswer the query: {user_query}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {open_router_api_key}"
    }
    payload = {
        "model": "openai/gpt-3.5-turbo",  # adjust the model if needed (OpenRouter uses provider-prefixed model IDs)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    try:
        response = requests.post(open_router_base_url, headers=headers, json=payload)
        response.raise_for_status()
        response_json = response.json()
        if "choices" in response_json:
            return response_json["choices"][0]["message"]["content"].strip()
        else:
            return str(response_json)
    except Exception as e:
        print(f"Error generating the final response with OpenRouter: {e}")
        return f"Error: {str(e)}"

# =============================================================================
# Main: run the full pipeline
# =============================================================================
def main():
    # Step 1: Install and verify Ollama, start the server
    install_and_verify_ollama()
    ollama_process = run_ollama_server()
    print("Waiting 30 seconds for the Ollama server to start...")
    time.sleep(30)

    if not check_ollama_server():
        print("❌ Could not connect to the Ollama server after multiple attempts.")
        if ollama_process:
            print("Attempting to restart the Ollama server...")
            ollama_process.terminate()
            time.sleep(5)
            ollama_process = run_ollama_server()
            time.sleep(20)
            if not check_ollama_server():
                print("❌ Failed to restart the Ollama server. Exiting.")
                sys.exit(1)

    # Step 2: Pull the required models
    models_to_pull = [
        "MHKetbi/Unsloth_gemma3-1b-it:q5_K_M",
        "jeffh/intfloat-multilingual-e5-large:f16"
    ]
    for model in models_to_pull:
        if not pull_model(model):
            print(f"❌ Failed to pull model {model}. Continuing, but errors may occur later.")

    # Step 3: Run the local dsRAG workflow
    kb, results = run_dsrag_workflow()
    if kb is None or results is None:
        print("Error running the dsRAG workflow. Exiting.")
        sys.exit(1)

    # Build the context for the final answer from the KB results
    context_vtt = "\n\n".join([segment.get('content', '') for segment in results])
    user_query_vtt = "Summarize the key features and accuracy of dsRAG based on the provided context."

    # Step 4: Generate the final answer using OpenRouter
    # Replace 'YOUR_OPEN_ROUTER_API_KEY' with your real OpenRouter API key.
    open_router_api_key = "YOUR_OPEN_ROUTER_API_KEY"
    open_router_base_url = "https://openrouter.ai/api/v1/chat/completions"
    print("Generating the final answer using OpenRouter...")
    final_answer = get_open_router_final_response(context_vtt, user_query_vtt,
                                                  open_router_api_key, open_router_base_url)
    print("\nFinal answer:")
    print(final_answer)

    # Shut down the Ollama process if needed
    if ollama_process:
        print("Terminating the Ollama process...")
        ollama_process.terminate()

if __name__ == "__main__":
    main()
```
Installation and verification of Ollama:
The installation script is executed and the version is checked. The server is started and its availability verified (retrying if necessary).

Local dsRAG components:
Custom classes are defined for the embedding model, the reranker, and the AutoContext LLM, so that local models are used (Ollama for the LLM and a specific embedding model).

Workflow (Knowledge Base):
An example VTT document is added to the KB, queries are run, and the first results are printed.

Final response generation with OpenRouter:
The get_open_router_final_response function builds a prompt combining the context extracted from the KB with the user query and calls the OpenRouter API. Make sure the API key and base URL are configured correctly according to the OpenRouter documentation.

Cleanup:
The Ollama server process is terminated at the end.

This code is an end-to-end starting point for getting the system fully working, adapting the RAG flow to local models and using OpenRouter for the final answer. Be sure to test each component in your environment and adjust the parameters to your needs.