tag.py
Overview
The tag.py file is a utility module designed for parsing and processing question-answer tagging data from various file formats, primarily Excel, CSV, and TXT files. It extracts structured chunks of content paired with tags, tokenizes text content for Natural Language Processing (NLP) uses, and interfaces with knowledge bases to label questions with relevant tags. The file is part of the InfiniFlow project and serves as a preprocessing and annotation tool in a larger NLP or retrieval-augmented generation (RAG) system.
Detailed Description
Imports and Dependencies
Standard Libraries:
json, re, csv, and deepcopy from copy.
Project-specific modules:
deepdoc.parser.utils.get_text - utility to read text content from files.
rag.app.qa.Excel - Excel file parser.
rag.nlp.rag_tokenizer - tokenizer used for segmenting text into tokens.
Dynamically imported in label_question:
KnowledgebaseService - for knowledge base access.
Caching utilities get_tags_from_cache and set_tags_to_cache.
settings - presumably holding retrieval system configuration and methods.
Functions
beAdoc(d, q, a, eng, row_num=-1)
Constructs and enriches a document dictionary with tokenized content and tags.
Parameters:
d (dict): Base document dictionary to update.
q (str): Question or content string.
a (str): Tags string, comma-separated.
eng (bool): Flag indicating whether the language is English.
row_num (int, optional): Row number or index; defaults to -1.
Returns:
dict: Updated dictionary with keys:
"content_with_weight": raw question/content string.
"content_ltks": list of tokens from rag_tokenizer.tokenize.
"content_sm_ltks": fine-grained tokens via rag_tokenizer.fine_grained_tokenize.
"tag_kwd": list of cleaned tags with dots replaced by underscores.
"top_int": list containing row number if valid.
Usage Example:
base_doc = {"docnm_kwd": "example.xlsx", "title_tks": ["example"]}
enriched_doc = beAdoc(base_doc, "What is AI?", "AI, Machine Learning", True, 10)
Notes:
Strips and replaces dots in tags to maintain a consistent tag format.
Uses the tokenizer from rag.nlp to support downstream NLP tasks.
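The tag cleaning and key layout described for beAdoc can be illustrated with a minimal sketch. This is not the module's code: simple_tokenize is a hypothetical stand-in for rag_tokenizer.tokenize, build_doc and clean_tags are illustrative names, fine-grained tokenization is omitted, and treating a "valid" row number as non-negative is an assumption.

```python
def simple_tokenize(text):
    # Hypothetical stand-in for rag_tokenizer.tokenize: lowercase whitespace split.
    return text.lower().split()


def clean_tags(tag_string):
    # Split on commas, trim whitespace, drop empty entries,
    # and replace dots with underscores.
    return [t.strip().replace(".", "_") for t in tag_string.split(",") if t.strip()]


def build_doc(d, q, a, row_num=-1):
    # Fills the documented keys on the dictionary passed in and returns it.
    d["content_with_weight"] = q
    d["content_ltks"] = simple_tokenize(q)
    d["tag_kwd"] = clean_tags(a)
    if row_num >= 0:  # assumption: "valid" means non-negative
        d["top_int"] = [row_num]
    return d


doc = build_doc({"docnm_kwd": "example.xlsx"}, "What is AI?", "AI, Machine.Learning, ", 10)
print(doc["tag_kwd"])
```

Note that the sketch updates the dictionary it receives, which is why callers that build many chunks from one base document would want to pass a copy.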
chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs)
Parses a file containing question-tag pairs into structured "chunks".
Parameters:
filename (str): Path to the input file (supports .xlsx, .csv, .txt).
binary (optional): Binary content or flag for file reading.
lang (str): Language of the content ("Chinese" by default).
callback (callable): Progress callback function accepting (progress: float, message: str).
**kwargs: Additional keyword arguments (e.g., from_page, to_page); these are ignored in this function.
Returns:
list of dictionaries: Each dictionary is a chunk with tokenized content and tags (output of beAdoc).
File Format Handling:
Excel (.xls, .xlsx): Uses the rag.app.qa.Excel parser. Assumes two columns without headers: content and tags.
TXT: UTF-8 encoded text with either a comma or a tab delimiter separating content and tags. Malformed lines are merged into the previous content.
CSV: UTF-8 encoded CSV, two columns per row expected.
Progress Reporting:
Calls callback with progress updates (e.g., start of parse, number of tags extracted).
Reports failures in parsing lines where the expected format is not met.
Errors:
Raises NotImplementedError if the file extension is unsupported.
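One way to picture the extension check is a small regex-based dispatcher. This is a sketch under assumptions: detect_format is a hypothetical helper, and the exact patterns and error message used by chunk are not given in this document.

```python
import re


def detect_format(filename):
    # Match on the file extension, ignoring case.
    if re.search(r"\.xlsx?$", filename, re.IGNORECASE):
        return "excel"
    if re.search(r"\.txt$", filename, re.IGNORECASE):
        return "txt"
    if re.search(r"\.csv$", filename, re.IGNORECASE):
        return "csv"
    # Hypothetical message; only the exception type is documented.
    raise NotImplementedError(f"Unsupported file type: {filename}")


print(detect_format("data/questions.XLSX"))
```

Anchoring each pattern with `$` keeps a name such as notes.txt.bak from being treated as a text file.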
Usage Example:
def progress_cb(progress, message):
    print(f"{progress*100:.1f}% - {message}")
chunks = chunk("data/questions.xlsx", lang="English", callback=progress_cb)
Implementation Details:
Tokenizes the filename (without extension) as a title_tks property for all chunks.
Uses a heuristic to guess the delimiter in TXT files by counting lines with commas vs tabs.
Appends lines that do not fit the delimiter pattern to the previous content (to handle multi-line content).
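The delimiter heuristic and the handling of multi-line content can be sketched as follows. The details are assumptions rather than the module's code: a line counts toward a delimiter when it splits into exactly two fields, ties go to tab, and lines that do not fit are buffered until a well-formed line completes the record. guess_delimiter and parse_pairs are hypothetical names.

```python
def guess_delimiter(lines):
    # Count lines that split into exactly two fields for each candidate.
    comma = sum(1 for line in lines if len(line.split(",")) == 2)
    tab = sum(1 for line in lines if len(line.split("\t")) == 2)
    # Assumption: ties are resolved in favor of tab.
    return "\t" if tab >= comma else ","


def parse_pairs(text):
    lines = text.split("\n")
    delimiter = guess_delimiter(lines)
    pairs, pending = [], []
    for line in lines:
        fields = line.split(delimiter)
        if len(fields) == 2:
            # A well-formed line closes the record: buffered lines plus
            # this line's content share its tag field.
            pending.append(fields[0])
            pairs.append(("\n".join(pending), fields[1]))
            pending = []
        else:
            # Line does not fit the pattern; keep it as part of the content.
            pending.append(line)
    # Buffered lines with no closing well-formed line are dropped in this sketch.
    return pairs


sample = "What is RAG?\tnlp,retrieval\nA question that spans\ntwo lines\tmisc"
for content, tags in parse_pairs(sample):
    print(repr(content), "->", tags)
```

Counting only lines with exactly two fields means a tab-delimited file whose tag column contains commas is still likely to be recognized as tab-delimited.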
label_question(question, kbs)
Labels an input question with relevant tags derived from provided knowledge bases.
Parameters:
question (str): The question text to be tagged.
kbs (list): List of knowledge base objects, each expected to have parser_config and tenant_id.
Returns:
list or None: Tags associated with the question based on knowledge base tagging logic; returns None if no tagging KBs are found.
Workflow:
Extracts tag_kb_ids from the knowledge bases' parser configurations.
Attempts to retrieve tags from cache.
On a cache miss, fetches tags from the retrieval system and caches them.
Queries the retrieval system to tag the question with top N tags.
Returns the tag list.
Usage Example:
tags = label_question("What is deep learning?", [kb1, kb2])
print(tags)
Notes:
Depends on external services (KnowledgebaseService, retrieval system in settings).
Uses caching for performance optimization.
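The cache-then-fetch part of the workflow follows a common cache-aside pattern, sketched below. Everything here is a stand-in: _TAG_CACHE replaces the real caching utilities, fetch_all_tags is a hypothetical substitute for the retrieval system, knowledge bases are modeled as plain dictionaries rather than objects, and the final step of tagging the question is left out.

```python
_TAG_CACHE = {}  # hypothetical in-memory stand-in for the real cache


def fetch_all_tags(tag_kb_ids):
    # Hypothetical stand-in for the retrieval system's tag lookup.
    fetch_all_tags.calls += 1
    return {"nlp": 0.4, "vision": 0.1}


fetch_all_tags.calls = 0  # call counter, used only to show the cache working


def tags_for(kbs):
    # Collect tag_kb_ids from each knowledge base's parser configuration.
    tag_kb_ids = []
    for kb in kbs:
        tag_kb_ids.extend(kb.get("parser_config", {}).get("tag_kb_ids", []))
    if not tag_kb_ids:
        return None  # no tagging knowledge bases configured
    key = tuple(sorted(tag_kb_ids))
    all_tags = _TAG_CACHE.get(key)
    if all_tags is None:
        # Cache miss: fetch once, then store for later calls.
        all_tags = fetch_all_tags(tag_kb_ids)
        _TAG_CACHE[key] = all_tags
    return all_tags


kbs = [{"parser_config": {"tag_kb_ids": ["kb-tags-1"]}, "tenant_id": "t1"}]
tags_for(kbs)
tags_for(kbs)
print(fetch_all_tags.calls)
```

The second call is served from the cache, so the stand-in lookup runs only once.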
Implementation Notes
The file is designed to be extensible for multiple file formats but currently only supports Excel, CSV, and TXT.
Uses a combination of regex and heuristics to identify file type and delimiters.
Employs deep copies of document dictionaries to avoid mutation across chunks.
Callback mechanism provides integration hooks for UI or logging systems to monitor progress.
The tokenizer is used extensively to prepare text for downstream NLP or RAG tasks.
The label_question function is tightly coupled with the external knowledge base and caching infrastructure, implying this module is part of a larger ecosystem.
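The note about deep copies can be made concrete with a short comparison. The dictionaries below are illustrative only; the point is that nested values in a base document are shared by shallow copies but not by copies made with deepcopy.

```python
from copy import deepcopy

base = {"docnm_kwd": "example.xlsx", "tag_kwd": []}

# Shallow copies share the nested list, so one chunk's tags leak into the other.
shallow_a, shallow_b = dict(base), dict(base)
shallow_a["tag_kwd"].append("AI")
leaked = shallow_b["tag_kwd"]  # the same list object as shallow_a["tag_kwd"]

# Deep copies get independent nested objects.
base = {"docnm_kwd": "example.xlsx", "tag_kwd": []}
deep_a, deep_b = deepcopy(base), deepcopy(base)
deep_a["tag_kwd"].append("AI")
isolated = deep_b["tag_kwd"]  # unaffected by the change to deep_a

print(leaked, isolated)
```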
Interaction with Other System Components
deepdoc.parser.utils.get_text: Reads raw text content for file parsing.
rag.app.qa.Excel: Parses Excel files into question-answer pairs.
rag.nlp.rag_tokenizer: Tokenizes textual content at multiple granularity levels.
Knowledgebase and retrieval system (api.db.services.knowledgebase_service, settings.retrievaler): Used in label_question for tagging and cache management.
Callback functions: Allow external systems to track parsing progress and results.
Main Execution Block
When run as a script, the file calls chunk on the first command-line argument (filename), with a dummy callback and a limited page range (though page range arguments are ignored in chunk).
Example:
python tag.py sample.xlsx
Mermaid Class Diagram
classDiagram
class tag.py {
+dict beAdoc(dict d, str q, str a, bool eng, int row_num=-1)
+list chunk(str filename, binary=None, str lang="Chinese", callable callback=None, **kwargs)
+list label_question(str question, list kbs)
}
Summary
The tag.py module is a core utility in the InfiniFlow project for transforming raw question-tag data from various file formats into richly tokenized and structured chunks suitable for downstream NLP and retrieval tasks. It supports file parsing, tokenization, tagging, and knowledge base integration with progress reporting, forming a bridge between raw data and intelligent tagging services.