tag.py

Overview

The tag.py file is a utility module for parsing question-tag data from Excel, CSV, and TXT files. It extracts chunks of content paired with tags, tokenizes the content for downstream Natural Language Processing (NLP) tasks, and queries knowledge bases to label questions with relevant tags. The file is part of the InfiniFlow project and serves as a preprocessing and annotation tool within a larger NLP or retrieval-augmented generation (RAG) system.

Detailed Description

Imports and Dependencies

Standard Libraries: json, re, csv, deepcopy from copy.
Project-specific modules:
- deepdoc.parser.utils.get_text - utility to read text content from files.
- rag.app.qa.Excel - Excel file parser.
- rag.nlp.rag_tokenizer - tokenizer used for segmenting text into tokens.
Dynamically imported in label_question:
- KnowledgebaseService - for knowledge base access.
- Caching utilities get_tags_from_cache and set_tags_to_cache.
- settings - presumably holding retrieval system configuration and methods.

Functions

`beAdoc(d, q, a, eng, row_num=-1)`

Constructs and enriches a document dictionary with tokenized content and tags.

Parameters:
- d (dict): Base document dictionary; it is updated and returned.
- q (str): Question or content string.
- a (str): Tags string, comma-separated.
- eng (bool): Flag indicating if the language is English.
- row_num (int, optional): Row or line index; defaults to -1.
Returns:
- dict: Updated dictionary with keys:
  - "content_with_weight": raw question/content string.
  - "content_ltks": list of tokens from rag_tokenizer.tokenize.
  - "content_sm_ltks": fine-grained tokens via rag_tokenizer.fine_grained_tokenize.
  - "tag_kwd": list of cleaned tags with dots replaced by underscores.
  - "top_int": single-element list holding row_num; set only when row_num is valid (presumably non-negative, given the -1 default).

Usage Example:

base_doc = {"docnm_kwd": "example.xlsx", "title_tks": ["example"]}
enriched_doc = beAdoc(base_doc, "What is AI?", "AI, Machine Learning", True, 10)

Notes:
- Strips surrounding whitespace from each tag and replaces dots with underscores to keep the tag format consistent.
- Uses the tokenizer from rag.nlp to support downstream NLP tasks.
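
The tag clean-up described in the notes can be sketched as follows. This is a minimal illustration rather than the module's actual code: `normalize_tags` is a hypothetical helper name, and dropping empty entries is an assumption.

```python
def normalize_tags(raw: str) -> list[str]:
    """Split a comma-separated tag string and normalize each tag.

    Hypothetical helper: strips whitespace and replaces dots with
    underscores, as documented. Skipping empty entries is an assumption.
    """
    return [t.strip().replace(".", "_") for t in raw.split(",") if t.strip()]


print(normalize_tags("AI, Machine Learning, v1.0,"))
# ['AI', 'Machine Learning', 'v1_0']
```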

`chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs)`

Parses a file containing question-tag pairs into structured "chunks".

Parameters:
- filename (str): Path to the input file (supports .xls/.xlsx, .csv, and .txt).
- binary (optional): Raw file content, handed to the file reader or parser together with filename.
- lang (str): Language of the content ("Chinese" by default).
- callback (callable): Progress callback accepting (progress: float, message: str).
- **kwargs: Additional keyword arguments (e.g., from_page, to_page); ignored by this function.
Returns:
- list of dictionaries: Each dictionary is a chunk with tokenized content and tags (output of beAdoc).
File Format Handling:
- Excel (.xls, .xlsx): Uses rag.app.qa.Excel parser. Assumes two columns without headers: content and tags.
- TXT: UTF-8 encoded text in which a comma or a tab separates content from tags. Malformed lines, i.e. lines that do not fit the content-delimiter-tags pattern, are merged into the previously accumulated content.
- CSV: UTF-8 encoded CSV, two columns per row expected.
Progress Reporting:
- Calls callback with progress updates (e.g., start of parse, number of tags extracted).
- Reports failures in parsing lines where expected format is not met.
Errors:
- Raises NotImplementedError if file extension is unsupported.
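
The extension check behind this error can be sketched as below. `detect_format` is a hypothetical helper, and the regular expressions are assumptions chosen to match the extensions listed above; the module's own patterns may differ.

```python
import re


def detect_format(filename: str) -> str:
    """Return "excel", "txt", or "csv" based on the file extension.

    Hypothetical helper; extensions are compared case-insensitively.
    """
    if re.search(r"\.xlsx?$", filename, re.IGNORECASE):
        return "excel"
    if re.search(r"\.txt$", filename, re.IGNORECASE):
        return "txt"
    if re.search(r"\.csv$", filename, re.IGNORECASE):
        return "csv"
    # Anything else is rejected, mirroring the documented behaviour.
    raise NotImplementedError(f"Unsupported file type: {filename}")


print(detect_format("data/questions.XLSX"))  # excel
```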

Usage Example:

def progress_cb(progress, message):
    print(f"{progress*100:.1f}% - {message}")

chunks = chunk("data/questions.xlsx", lang="English", callback=progress_cb)

Implementation Details:
- Tokenizes the filename (minus its extension) and stores the result as title_tks on every chunk.
- Guesses the delimiter in TXT files by counting the lines that fit a comma-separated pattern versus a tab-separated one.
- Appends lines that do not fit the delimiter pattern to the previously accumulated content (to handle multi-line content).
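
The delimiter heuristic and the multi-line merge can be sketched together. Both helper names are hypothetical. The sketch assumes that a line "fits" a delimiter when it splits into exactly two fields, that ties go to the tab, and that unmatched lines are held in a buffer which the next well-formed line completes; the real implementation may differ on each point.

```python
def guess_delimiter(lines: list[str]) -> str:
    """Pick tab or comma, whichever splits more lines into two fields."""
    comma = sum(1 for line in lines if len(line.split(",")) == 2)
    tab = sum(1 for line in lines if len(line.split("\t")) == 2)
    return "\t" if tab >= comma else ","  # assumption: ties go to tab


def pair_lines(lines: list[str]) -> list[tuple[str, str]]:
    """Group lines into (content, tags) pairs, merging multi-line content."""
    delimiter = guess_delimiter(lines)
    pairs, buffer = [], []
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == 2:
            # Well-formed line: close the buffered content and emit a pair.
            buffer.append(fields[0])
            pairs.append(("\n".join(buffer), fields[1]))
            buffer = []
        else:
            # Malformed line: keep it as part of the running content.
            buffer.append(line)
    return pairs


sample = ["What is AI?\tAI,ML", "Explain", "deep learning\tDL"]
print(pair_lines(sample))
# [('What is AI?', 'AI,ML'), ('Explain\ndeep learning', 'DL')]
```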

`label_question(question, kbs)`

Labels an input question with relevant tags derived from provided knowledge bases.

Parameters:
- question (str): The question text to be tagged.
- kbs (list): List of knowledge base objects, each expected to have parser_config and tenant_id.
Returns:
- list or None: Tags associated with the question, as determined by the knowledge base tagging logic; None if none of the knowledge bases configures tag_kb_ids.
Workflow:
1. Extracts tag_kb_ids from the knowledge bases' parser configurations.
2. Attempts to retrieve tags from cache.
3. If cache miss, fetches tags from the retrieval system and caches them.
4. Queries the retrieval system to tag the question with top N tags.
5. Returns the tag list.
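
The workflow above follows a cache-aside pattern, which the sketch below illustrates with in-memory stand-ins. Everything here is hypothetical: the `KB` class, `fetch_all_tags`, `rank_tags`, the cache key format, and the toy ranking are assumptions made for illustration and do not reproduce the real KnowledgebaseService, cache utilities, or retrieval API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class KB:
    """Hypothetical stand-in for a knowledge base object."""
    tenant_id: str
    parser_config: dict = field(default_factory=dict)


_cache = {}             # stand-in for the real tag cache
stats = {"fetches": 0}  # counts calls to the slow back end


def fetch_all_tags(kb_ids):
    """Stand-in for the retrieval system: returns tag -> frequency."""
    stats["fetches"] += 1
    return {"deep learning": 5, "ai": 9, "databases": 2}


def rank_tags(question, all_tags, topn):
    """Toy ranking: tags mentioned in the question, most frequent first."""
    q = question.lower()
    hits = [t for t in all_tags if t in q]
    return sorted(hits, key=lambda t: all_tags[t], reverse=True)[:topn]


def label_question_sketch(question, kbs, topn=3):
    # Step 1: collect tag_kb_ids from each KB's parser configuration.
    tag_kb_ids = []
    for kb in kbs:
        tag_kb_ids.extend(kb.parser_config.get("tag_kb_ids", []))
    if not tag_kb_ids:
        return None

    # Steps 2-3: try the cache; on a miss, fetch and store the tags.
    key = ",".join(sorted(tag_kb_ids))
    cached = _cache.get(key)
    if cached is None:
        all_tags = fetch_all_tags(tag_kb_ids)
        _cache[key] = json.dumps(all_tags)
    else:
        all_tags = json.loads(cached)

    # Steps 4-5: rank the tags against the question and return the top N.
    return rank_tags(question, all_tags, topn)


kbs = [KB("tenant-1", {"tag_kb_ids": ["kb-tags"]})]
print(label_question_sketch("What is deep learning in AI?", kbs))
print(label_question_sketch("What is deep learning in AI?", kbs))
print(stats["fetches"])  # the second call is served from the cache
```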

Usage Example:

tags = label_question("What is deep learning?", [kb1, kb2])
print(tags)

Notes:
- Depends on external services (KnowledgebaseService, retrieval system in settings).
- Uses caching for performance optimization.

Implementation Notes

- The file is structured so that further file formats could be added, but it currently supports only Excel, CSV, and TXT.
- Uses regular expressions to identify the file type and a counting heuristic to pick the TXT delimiter.
- Deep-copies the base document dictionary for each chunk so that chunks do not share mutable state.
- The callback mechanism gives UI or logging systems a hook for monitoring progress.
- The tokenizer is applied to both content and filename to prepare text for downstream NLP or RAG tasks.
- The label_question function is tightly coupled to the external knowledge base and caching infrastructure, which indicates that this module is part of a larger ecosystem.
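
The reason for deep-copying the base dictionary can be shown with a small sketch. The dictionaries and tag values below are made-up examples, not the module's data.

```python
from copy import deepcopy

template = {"docnm_kwd": "questions.xlsx", "title_tks": ["questions"]}

# Without copying, every "chunk" is the same object, so later writes
# overwrite earlier ones.
shared = dict(template)
aliased = []
for tags in (["AI"], ["ML"]):
    d = shared
    d["tag_kwd"] = tags
    aliased.append(d)

# With deepcopy, each chunk gets its own dictionary and nested lists.
copied = []
for tags in (["AI"], ["ML"]):
    d = deepcopy(template)
    d["tag_kwd"] = tags
    copied.append(d)

print([c["tag_kwd"] for c in aliased])  # [['ML'], ['ML']]
print([c["tag_kwd"] for c in copied])   # [['AI'], ['ML']]
```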

Interaction with Other System Components

- deepdoc.parser.utils.get_text: Reads raw text content for file parsing.
- rag.app.qa.Excel: Parses Excel files into two-column pairs, used here as content and tags.
- rag.nlp.rag_tokenizer: Tokenizes textual content at multiple granularity levels.
- Knowledgebase and retrieval system (api.db.services.knowledgebase_service, settings.retrievaler): Used in label_question for tagging and cache management.
- Callback functions: Allow external systems to track parsing progress and results.

Main Execution Block

When run as a script, the file calls chunk on the first command-line argument (the filename) with a dummy callback and a limited page range; chunk ignores the page-range arguments.
Example:
```
python tag.py sample.xlsx
```

Mermaid Class Diagram

classDiagram
    class tag.py {
        +dict beAdoc(dict d, str q, str a, bool eng, int row_num=-1)
        +list chunk(str filename, binary=None, str lang="Chinese", callable callback=None, **kwargs)
        +list label_question(str question, list kbs)
    }

Summary

The tag.py module turns raw question-tag data from Excel, CSV, and TXT files into tokenized, structured chunks for downstream NLP and retrieval tasks in the InfiniFlow project. It covers file parsing, tokenization, tag normalization, and progress reporting, and its label_question function connects incoming questions to the tags held in the knowledge base and retrieval services.