tavily_conn.py

Overview

This file defines the Tavily class, a wrapper around TavilyClient that runs advanced-depth searches through the Tavily search API. It executes the query, trims each result to a small set of fields, and formats the results into structured chunks and document aggregates for downstream use, such as retrieval-augmented generation (RAG) or other knowledge retrieval tasks.

This integration abstracts direct interaction with the Tavily API, providing an easy-to-use interface that includes:

- Executing a search query with specific parameters.
- Transforming raw search results into tokenized, metadata-rich chunks.
- Logging relevant search activities and results for debugging and monitoring.

Classes and Methods

Class: `Tavily`

A client interface to interact with Tavily's search API and process returned documents into structured chunks.

`__init__(self, api_key: str)`

Constructor.

Parameters:
- api_key (str): The API key string used to authenticate requests against the Tavily API.
Behavior:
- Initializes an instance of TavilyClient using the provided API key.
- Stores the client instance in self.tavily_client for use in other methods.
Usage Example:

tavily = Tavily(api_key="your_api_key_here")

`search(self, query) -> list[dict]`

Executes a search query using the Tavily API.

Parameters:
- query (str): The search string or question to send to the Tavily search API.
Returns:
- A list of dictionaries, each representing a search result with keys:
  - url (str): URL of the source document.
  - title (str): Title or name of the document.
  - content (str): Extracted content snippet from the document.
  - score (float): Relevance score assigned by the search engine.
Behavior:
- Calls self.tavily_client.search with:
  - search_depth set to "advanced" for a richer search.
  - max_results capped at 6.
- On success, processes the raw results into the summarized list format.
- On failure (any exception), logs the exception with traceback and returns an empty list.
Usage Example:

results = tavily.search("What is quantum computing?")
for res in results:
    print(res["title"], res["url"])
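
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: FakeTavilyClient is a hypothetical stand-in so the example runs without network access or an API key, the raw response is assumed to be a dict with a "results" list, the log message text is invented, and the method is written as a free function that takes the client as an argument.

```python
import logging


class FakeTavilyClient:
    """Hypothetical stand-in for TavilyClient so the sketch runs offline."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    def search(self, query, search_depth="basic", max_results=5):
        if self.fail:
            raise RuntimeError("simulated API failure")
        # Assumed raw response shape: a dict holding a "results" list.
        results = [
            {
                "url": "https://example.com/qc",
                "title": "Quantum computing",
                "content": "Qubits can be in superposition.",
                "score": 0.91,
                "raw_content": None,  # extra field that the wrapper drops
            }
        ]
        return {"results": results[:max_results]}


def search(client, query: str) -> list[dict]:
    """Return trimmed results, or an empty list if the client raises."""
    try:
        response = client.search(
            query=query, search_depth="advanced", max_results=6
        )
        return [
            {
                "url": res["url"],
                "title": res["title"],
                "content": res["content"],
                "score": res["score"],
            }
            for res in response["results"]
        ]
    except Exception:
        # logging.exception logs at ERROR level and appends the traceback.
        logging.exception("Tavily search failed")
        return []


ok_results = search(FakeTavilyClient(), "What is quantum computing?")
failed_results = search(FakeTavilyClient(fail=True), "What is quantum computing?")
```

In this sketch the failing client produces an empty list rather than an exception, which mirrors the documented contract: callers always receive a list and never need their own try-except around search.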

`retrieve_chunks(self, question) -> dict`

Processes a search question into tokenized content chunks and document aggregates.

Parameters:
- question (str): The query string to retrieve relevant knowledge chunks.
Returns:
- A dictionary with two keys:
  - "chunks": A list of dictionaries, each representing a content chunk with metadata:
    - chunk_id (str): Unique UUID for the chunk.
    - content_ltks (list): List of tokens from the chunk content, generated by rag_tokenizer.tokenize.
    - content_with_weight (str): Raw content text.
    - doc_id (str): Document ID (same as chunk_id here).
    - docnm_kwd (str): Document title or keyword.
    - kb_id (list): Empty list placeholder for knowledge base IDs.
    - important_kwd (list): Empty list placeholder for important keywords.
    - image_id (str): Empty string placeholder for image ID.
    - similarity (float): Similarity score from the search result.
    - vector_similarity (float): Fixed value 1.0 — likely a placeholder or default.
    - term_similarity (float): Fixed value 0 — placeholder.
    - vector (list): Empty vector placeholder.
    - positions (list): Empty list for token positions.
    - url (str): URL of the document.
  - "doc_aggs": A list of dictionaries with document aggregation info:
    - doc_name (str): Document title.
    - doc_id (str): Document ID.
    - count (int): Count of chunks (always 1 here).
    - url (str): Document URL.
Behavior:
- Logs the incoming question.
- Calls search(question) and iterates over each result.
- For each result:
  - Generates a new UUID.
  - Tokenizes the content via rag_tokenizer.tokenize.
  - Constructs a chunk dictionary and an aggregation dictionary.
  - Logs the snippet of content (first 128 characters).
- Returns the structured data for use in downstream pipelines (e.g., RAG retrieval).
Usage Example:

data = tavily.retrieve_chunks("Explain AI ethics.")
print(data["chunks"][0]["content_with_weight"])
print(data["doc_aggs"][0]["doc_name"])
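
The mapping from search results to chunks and aggregates might look something like the sketch below. It makes several assumptions: uuid.uuid4().hex stands in for get_uuid(), a whitespace split stands in for rag_tokenizer.tokenize (the real tokenizer's output may differ), and build_chunks is a hypothetical helper name that takes already-trimmed results instead of calling search itself.

```python
import uuid


def build_chunks(results: list[dict]) -> dict:
    """Turn trimmed search results into chunk and aggregate structures."""
    chunks, doc_aggs = [], []
    for res in results:
        # Stand-in for get_uuid(); one ID is shared by the chunk and its document.
        doc_id = uuid.uuid4().hex
        chunks.append({
            "chunk_id": doc_id,
            # Stand-in for rag_tokenizer.tokenize; the real output may differ.
            "content_ltks": res["content"].split(),
            "content_with_weight": res["content"],
            "doc_id": doc_id,
            "docnm_kwd": res["title"],
            "kb_id": [],
            "important_kwd": [],
            "image_id": "",
            "similarity": res["score"],
            "vector_similarity": 1.0,  # hardcoded placeholder
            "term_similarity": 0,      # hardcoded placeholder
            "vector": [],
            "positions": [],
            "url": res["url"],
        })
        doc_aggs.append({
            "doc_name": res["title"],
            "doc_id": doc_id,
            "count": 1,  # each search result yields exactly one chunk
            "url": res["url"],
        })
    return {"chunks": chunks, "doc_aggs": doc_aggs}


sample_results = [
    {"url": "https://example.com/a", "title": "Doc A",
     "content": "AI ethics basics", "score": 0.8},
    {"url": "https://example.com/b", "title": "Doc B",
     "content": "Fairness in models", "score": 0.6},
]
data = build_chunks(sample_results)
```

Because each result becomes one chunk and one aggregate entry, the two lists have the same length and share IDs pairwise.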

Implementation Details

Error Handling:
The search method wraps the API call in a try-except block and logs any exception with logging.exception, which includes the stack trace. A failed API call therefore does not crash the application; it produces an empty result list. Because a failure returns the same empty list that a query with no hits would, the logs are the only way to tell the two cases apart.
Tokenization:
The file depends on an external tokenizer rag_tokenizer.tokenize (presumably from a RAG NLP module). This tokenizer breaks down content into suitable tokens for embedding or similarity comparison.
UUID Generation:
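Each search result receives one identifier from get_uuid(), imported from api.utils, which serves as both its chunk_id and its doc_id. A new identifier is generated on every call, so the same URL receives a different ID each time it is retrieved.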
Fixed Similarity Metrics:
The vector_similarity and term_similarity fields are currently hardcoded to 1.0 and 0 respectively, suggesting placeholders or default values to be refined by other components.
Logging:
The class logs key events at INFO level, including the input query and snippet of each retrieved document content, aiding debugging and traceability.
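
INFO-level messages are dropped under Python's default logging configuration, so the calling application has to lower the threshold to see them. The sketch below assumes the module logs through the root logger (as its use of logging.exception suggests); the "[Tavily]" message prefixes are invented for illustration, and the in-memory stream is only there so the output can be inspected.

```python
import io
import logging

# Capture log output in memory; a real application would log to stderr or a file.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)  # the default WARNING threshold would hide INFO records

content = "x" * 300
logging.info("[Tavily]Q: %s", "Explain AI ethics.")  # the incoming question
logging.info("[Tavily]R: %s", content[:128])         # first 128 characters only

captured_lines = log_stream.getvalue().splitlines()
```

In an application, a single call to logging.basicConfig(level=logging.INFO) at startup is usually enough to achieve the same effect.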

Interaction with Other Components

TavilyClient (from tavily module):
Core client to communicate with the Tavily API. This file acts as a higher-level wrapper adding result processing and error handling.
get_uuid (from api.utils):
Utility function to generate unique identifiers for documents and chunks.
rag_tokenizer (from rag.nlp):
Tokenizer used to convert the retrieved content into tokens for further NLP or embedding processing.
Logging:
Uses Python’s standard logging module to report errors and info messages.

Visual Diagram

classDiagram
    class Tavily {
        -tavily_client: TavilyClient
        +__init__(api_key: str)
        +search(query: str) list~dict~
        +retrieve_chunks(question: str) dict
    }
    Tavily ..> TavilyClient : uses
    Tavily ..> get_uuid : calls
    Tavily ..> rag_tokenizer : calls

Summary

tavily_conn.py encapsulates Tavily API interactions, transforming search queries into structured, tokenized chunks suitable for knowledge retrieval and downstream NLP workflows. It stays robust by catching API failures and returning an empty result list, and it logs each question and a snippet of each result for observability. Components that need to augment their answers with external data can use this module to reach Tavily's advanced search.