tavily_conn.py
Overview
This file defines the Tavily class, a wrapper around the TavilyClient API designed to facilitate advanced search queries against the Tavily knowledge base. It handles querying, processing search results, and formatting them into structured chunks and aggregates for downstream use, such as retrieval-augmented generation (RAG) or knowledge retrieval tasks.
This integration abstracts direct interaction with the Tavily API, providing an easy-to-use interface that includes:
Executing a search query with specific parameters.
Transforming raw search results into tokenized, metadata-rich chunks.
Logging relevant search activities and results for debugging and monitoring.
Classes and Methods
Class: Tavily
A client interface to interact with Tavily's search API and process returned documents into structured chunks.
__init__(self, api_key: str)
Constructor.
Parameters:
api_key (str): The API key string used to authenticate requests against the Tavily API.
Behavior:
Initializes an instance of TavilyClient using the provided API key.
Stores the client instance in self.tavily_client for use in other methods.
Usage Example:
tavily = Tavily(api_key="your_api_key_here")
search(self, query) -> list[dict]
Executes a search query using the Tavily API.
Parameters:
query (str): The search string or question to query against the Tavily knowledge base.
Returns:
A list of dictionaries, each representing a search result with keys:
url (str): URL of the source document.
title (str): Title or name of the document.
content (str): Extracted content snippet from the document.
score (float): Relevance score assigned by the search engine.
Behavior:
Calls self.tavily_client.search with:
search_depth set to "advanced" for a richer search.
max_results capped at 6.
On success, processes the raw results into the summarized list format.
On failure (any exception), logs the exception with traceback and returns an empty list.
Usage Example:
results = tavily.search("What is quantum computing?")
for res in results:
    print(res["title"], res["url"])
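The behavior described for search can be sketched as follows. This is a minimal illustration, not the file's actual code: StubClient is a hypothetical stand-in for TavilyClient so the example runs offline, and the shape of its response (a dictionary with a "results" list) is an assumption about what the real client returns. TavilySketch is likewise a hypothetical name.

```python
import logging
from typing import Any


class StubClient:
    """Hypothetical stand-in for TavilyClient so the sketch runs offline."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    def search(self, query: str, search_depth: str = "basic", max_results: int = 5) -> dict[str, Any]:
        if self.fail:
            raise RuntimeError("simulated network error")
        # Assumed response shape: a dict holding a "results" list of hits.
        hits = [
            {"url": "https://example.com/a", "title": "A", "content": "alpha", "score": 0.9, "extra": None},
            {"url": "https://example.com/b", "title": "B", "content": "beta", "score": 0.7, "extra": None},
        ]
        return {"query": query, "results": hits[:max_results]}


class TavilySketch:
    """Illustrative wrapper; accepts any client object exposing search()."""

    def __init__(self, client):
        self.tavily_client = client

    def search(self, query: str) -> list[dict]:
        try:
            response = self.tavily_client.search(
                query=query, search_depth="advanced", max_results=6
            )
            # Keep only the four documented keys from each raw result.
            return [
                {"url": r["url"], "title": r["title"], "content": r["content"], "score": r["score"]}
                for r in response["results"]
            ]
        except Exception as exc:
            # Log with traceback and fall back to an empty list.
            logging.exception(exc)
            return []


ok = TavilySketch(StubClient()).search("What is quantum computing?")
failed = TavilySketch(StubClient(fail=True)).search("What is quantum computing?")
```

Because failures are reported as an empty list, a caller cannot tell "no results" from "request failed" without checking the logs.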
retrieve_chunks(self, question) -> dict
Processes a search question into tokenized content chunks and document aggregates.
Parameters:
question (str): The query string to retrieve relevant knowledge chunks.
Returns:
A dictionary with two keys:
"chunks": A list of dictionaries, each representing a content chunk with metadata:
chunk_id (str): Unique UUID for the chunk.
content_ltks (list): List of tokens from the chunk content, generated by rag_tokenizer.tokenize.
content_with_weight (str): Raw content text.
doc_id (str): Document ID (same as chunk_id here).
docnm_kwd (str): Document title or keyword.
kb_id (list): Empty list placeholder for knowledge base IDs.
important_kwd (list): Empty list placeholder for important keywords.
image_id (str): Empty string placeholder for image ID.
similarity (float): Similarity score from the search result.
vector_similarity (float): Fixed value 1.0, likely a placeholder or default.
term_similarity (float): Fixed value 0, a placeholder.
vector (list): Empty vector placeholder.
positions (list): Empty list for token positions.
url (str): URL of the document.
"doc_aggs": A list of dictionaries with document aggregation info:
doc_name (str): Document title.
doc_id (str): Document ID.
count (int): Count of chunks (always 1 here).
url (str): Document URL.
Behavior:
Logs the incoming question.
Calls search(question) and iterates over each result.
For each result:
Generates a new UUID.
Tokenizes the content via rag_tokenizer.tokenize.
Constructs a chunk dictionary and an aggregation dictionary.
Logs the snippet of content (first 128 characters).
Returns the structured data for use in downstream pipelines (e.g., RAG retrieval).
Usage Example:
data = tavily.retrieve_chunks("Explain AI ethics.")
print(data["chunks"][0]["content_with_weight"])
print(data["doc_aggs"][0]["doc_name"])
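The mapping from search results to the "chunks" and "doc_aggs" structures can be sketched as below. This is an illustration under stated assumptions rather than the file's implementation: get_uuid_stub stands in for get_uuid, tokenize_stub is a naive whitespace tokenizer standing in for rag_tokenizer.tokenize, and build_chunks is a hypothetical helper name.

```python
import uuid


def get_uuid_stub() -> str:
    # Stand-in for api.utils.get_uuid (assumption: any unique string works here).
    return uuid.uuid4().hex


def tokenize_stub(text: str) -> list[str]:
    # Naive stand-in for rag_tokenizer.tokenize; the real tokenizer is more involved.
    return text.lower().split()


def build_chunks(results: list[dict]) -> dict:
    """Turn search results into the documented chunk and aggregate structures."""
    chunks, doc_aggs = [], []
    for r in results:
        _id = get_uuid_stub()  # one ID per result, reused as chunk_id and doc_id
        chunks.append({
            "chunk_id": _id,
            "content_ltks": tokenize_stub(r["content"]),
            "content_with_weight": r["content"],
            "doc_id": _id,
            "docnm_kwd": r["title"],
            "kb_id": [],
            "important_kwd": [],
            "image_id": "",
            "similarity": r["score"],
            "vector_similarity": 1.0,  # fixed placeholder
            "term_similarity": 0,      # fixed placeholder
            "vector": [],
            "positions": [],
            "url": r["url"],
        })
        doc_aggs.append({"doc_name": r["title"], "doc_id": _id, "count": 1, "url": r["url"]})
    return {"chunks": chunks, "doc_aggs": doc_aggs}


sample = [{"url": "https://example.com/a", "title": "A", "content": "Alpha beta", "score": 0.9}]
data = build_chunks(sample)
```

Each search result yields exactly one chunk and one aggregate entry, which is why count is always 1.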
Implementation Details
Error Handling:
The search method wraps the API call in a try-except block and logs exceptions with logging.exception, which includes stack trace information. A failed API call therefore produces an empty search result instead of crashing the application.
Tokenization:
The file depends on an external tokenizer, rag_tokenizer.tokenize (imported from rag.nlp). This tokenizer breaks content down into tokens suitable for embedding or similarity comparison.
UUID Generation:
Each search result receives one identifier from get_uuid() (imported from api.utils), which serves as both the chunk_id and the doc_id.
Fixed Similarity Metrics:
The vector_similarity and term_similarity fields are currently hardcoded as 1.0 and 0 respectively, suggesting placeholders or default values to be refined by other components.
Logging:
The class logs key events at INFO level, including the input query and a snippet of each retrieved document's content, aiding debugging and traceability.
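The logging behavior described above can be observed with a small standalone sketch. The logger name "tavily_sketch", the message format, and the simulated exception are all assumptions made for illustration; the actual file may log through the root logger with a different configuration.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("tavily_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)                # INFO must be enabled to see query logs
logger.addHandler(handler)
logger.propagate = False

# INFO-level event, comparable to logging the incoming question.
logger.info("question: %s", "Explain AI ethics.")

# ERROR-level event with traceback, as logging.exception produces on failure.
try:
    raise TimeoutError("simulated API timeout")
except Exception as exc:
    logger.exception(exc)

captured = buffer.getvalue()
```

If the application's log level is left at the default WARNING, the INFO messages are dropped and only the exception records appear.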
Interaction with Other Components
TavilyClient (from the tavily module):
Core client to communicate with the Tavily API. This file acts as a higher-level wrapper adding result processing and error handling.
get_uuid (from api.utils):
Utility function to generate unique identifiers for documents and chunks.
rag_tokenizer (from rag.nlp):
Tokenizer used to convert the retrieved content into tokens for further NLP or embedding processing.
Logging:
Uses Python's standard logging module to report errors and info messages.
Visual Diagram
classDiagram
class Tavily {
-tavily_client: TavilyClient
+__init__(api_key: str)
+search(query: str) list~dict~
+retrieve_chunks(question: str) dict
}
Tavily ..> TavilyClient : uses
Tavily ..> get_uuid : calls
Tavily ..> rag_tokenizer : calls
Summary
tavily_conn.py encapsulates Tavily API interactions, transforming search queries into structured, tokenized chunks suitable for knowledge retrieval and downstream NLP workflows. It handles API failures by logging them and returning an empty result, and it logs key operations for observability. Components that need to augment their output with external data can use it as their entry point to Tavily's advanced search.