qa.py

Overview

The qa.py module extracts and processes Question & Answer (Q&A) pairs from Excel, PDF, DOCX, Markdown, CSV, and TXT files. It provides a specialized parser for each document type, combining external libraries with custom logic to identify Q&A structures and to extract text, images, and tables. The extracted text is tokenized for downstream NLP tasks such as information retrieval or knowledge graph construction.

Key functionalities include:

Parsing structured Q&A pairs from Excel and CSV/TXT files.
OCR and layout analysis for PDF documents to extract Q&A blocks, images, and tables.
DOCX document parsing with support for nested question levels and embedded images.
Markdown parsing to identify hierarchical questions via heading levels.
Utilities to clean and tokenize extracted Q&A content.
A unified chunk function that auto-detects file type and applies the appropriate parser.

This module is part of a larger system (likely InfiniFlow) focused on document understanding, question answering, or knowledge extraction.

Classes and Functions

Class: `Excel(ExcelParser)`

Purpose:
Parses Excel files (.xls, .xlsx), extracting Q&A pairs from sheets in which each row is assumed to hold the question in its first populated cell and the answer in its second.

Methods:

__call__(fnm, binary=None, callback=None) -> List[Tuple[str, str]]
Parameters:
- fnm (str): Path to the Excel file.
- binary (bytes, optional): Binary content of the Excel file instead of a file path.
- callback (callable, optional): A function to report progress.
Returns:
List of tuples (question, answer) extracted from the Excel sheets.
Description:
Loads the workbook, iterates over sheets and rows, extracting the first two non-empty cells as question and answer. Calls callback periodically to report progress. Also detects if questions are mostly English.
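
The row rule described above can be sketched without any spreadsheet library. This is a minimal illustration, not the module's code: `rows_to_qa` is a hypothetical name, rows are modelled as plain tuples of cell values, and skipping rows with fewer than two populated cells is an assumption.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def rows_to_qa(rows: Iterable[Sequence[Optional[object]]]) -> List[Tuple[str, str]]:
    """Turn spreadsheet-like rows into (question, answer) pairs.

    The first two non-empty cells of a row become the question and the
    answer. Rows with fewer than two populated cells are skipped
    (an assumption made for this sketch).
    """
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        # Keep only cells that hold something other than whitespace.
        cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if len(cells) >= 2:
            pairs.append((cells[0], cells[1]))
    return pairs


rows = [
    (None, "What is OCR?", "Optical character recognition.", "ignored"),
    ("Only a question", None),
    ("", "Q2", None, "A2"),
]
print(rows_to_qa(rows))
```

Leading empty cells do not matter here: only the order of the populated cells within a row decides which one is the question.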

Class: `Pdf(PdfParser)`

Purpose:
Handles PDF files, performing OCR, layout analysis, table extraction, and identification of Q&A bullet points to segment text into Q&A pairs with associated images and positional metadata.

Methods:

__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) -> Tuple[List[Tuple[str, str, Image, List]], List]
Parameters:
- filename (str): Path to the PDF file.
- binary (bytes, optional): PDF file content as bytes.
- from_page (int): Start page for processing (0-based).
- to_page (int): End page for processing.
- zoomin (int): Zoom factor for OCR and layout analysis.
- callback (callable, optional): Function for progress reporting.
Returns:
- qai_list: List of tuples (question, answer, image, positions) representing extracted Q&A blocks.
- tbls: List of extracted tables and figures.
Description:
Runs OCR, layout analysis, and table analysis, then uses regular expressions and bullet detection to split the text into questions and answers. Images and positional data for each pair are aggregated via crop. Raises an error if no Q&A structure is detected.
get_tbls_info(tbls, tbl_index) -> Tuple
Parameters:
- tbls (List): List of table data extracted.
- tbl_index (int): Index of the table to retrieve info for.
Returns:
Tuple containing page number, bounding box coordinates, a formatted tag string, and concatenated table text.
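
The bullet-based segmentation that `__call__` is described as performing can be sketched on plain text lines. Everything here is an assumption for illustration: the `QUESTION_BULLET` pattern, the `split_qa` name, and the choice of `ValueError`. The real parser works on OCR boxes and also carries images and positions, which this sketch leaves out.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical bullet pattern: line openers such as "1.", "2)" or "Q3:".
QUESTION_BULLET = re.compile(r"^\s*(?:Q\s*\d*|\d+)\s*[.:)、：]\s*", re.IGNORECASE)


def split_qa(lines: List[str]) -> List[Tuple[str, str]]:
    """Group lines into (question, answer) pairs.

    A line that starts with a bullet opens a new question; the lines that
    follow, up to the next bullet, form its answer. Text before the first
    bullet is ignored.
    """
    pairs: List[Tuple[str, str]] = []
    question: Optional[str] = None
    answer: List[str] = []
    for line in lines:
        if QUESTION_BULLET.match(line):
            if question is not None:
                pairs.append((question, "\n".join(answer)))
            # Drop the bullet itself and start collecting a new answer.
            question = QUESTION_BULLET.sub("", line, count=1).strip()
            answer = []
        elif question is not None:
            answer.append(line.strip())
    if question is not None:
        pairs.append((question, "\n".join(answer)))
    if not pairs:
        # Mirrors the documented behaviour of failing when nothing is found.
        raise ValueError("no Q&A structure detected")
    return pairs


lines = [
    "Intro text",
    "1. What is a chunk?",
    "A piece of a document.",
    "It carries metadata.",
    "Q2: Why tokenize?",
    "To index the text.",
]
for q, a in split_qa(lines):
    print(repr(q), "->", repr(a))
```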

Class: `Docx(DocxParser)`

Purpose:
Parses DOCX files extracting hierarchical questions, answers, embedded images, and tables.

Methods:

__init__()
Default constructor.
get_picture(document, paragraph) -> PIL.Image or None
Extracts and returns the first image found in a paragraph.
__call__(filename, binary=None, from_page=0, to_page=100000, callback=None) -> Tuple[List[Tuple[str, str, Image]], List]
Parses the DOCX document, extracting question-answer-image tuples based on heading levels. Also extracts tables as HTML strings. Supports page breaks indicated by runs in paragraphs.
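
One way to picture the heading-level handling is a stack of open headings. The sketch below is a hedged illustration rather than the parser's code: `nest_questions` is a hypothetical name, paragraphs are modelled as `(level, text)` tuples with level 0 meaning body text, and joining nested questions with a newline is an assumption. Images and tables are omitted.

```python
from typing import List, Tuple


def nest_questions(paragraphs: List[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Build (question, answer) pairs from (level, text) paragraphs.

    Level 0 is body text; level 1 and above is heading depth. The question
    of a pair is the chain of headings that are open when the answer ends.
    """
    pairs: List[Tuple[str, str]] = []
    stack: List[Tuple[int, str]] = []  # open headings, outermost first
    answer: List[str] = []

    def flush() -> None:
        # Emit a pair only if there is both an open heading and body text.
        if stack and answer:
            question = "\n".join(text for _, text in stack)
            pairs.append((question, "\n".join(answer)))
        answer.clear()

    for level, text in paragraphs:
        if level == 0:
            answer.append(text)
            continue
        flush()
        # Close headings at the same or a deeper level before opening this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
    flush()
    return pairs


doc = [
    (1, "Billing"),
    (2, "How do I pay?"),
    (0, "Use a card."),
    (2, "Can I get a refund?"),
    (0, "Yes, within 30 days."),
    (1, "Support"),
    (0, "Email us."),
]
for q, a in nest_questions(doc):
    print(repr(q), "->", repr(a))
```

Keeping the parent heading in the question text gives a short question such as "How do I pay?" its context ("Billing") when the pair is later retrieved on its own.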

Utility Functions

rmPrefix(txt: str) -> str
Removes common Q&A prefixes such as 'Question:', 'Answer:', 'Q', 'A', and their Chinese equivalents from the start of a string.
beAdocPdf(d: dict, q: str, a: str, eng: bool, image, poss: list) -> dict
Formats a Q&A pair extracted from a PDF into a dictionary with tokenized content and positional data.
beAdocDocx(d: dict, q: str, a: str, eng: bool, image, row_num: int = -1) -> dict
Similar to beAdocPdf but adapted for DOCX data, optionally includes row number.
beAdoc(d: dict, q: str, a: str, eng: bool, row_num: int = -1) -> dict
Generic formatter for Q&A pairs without images.
mdQuestionLevel(s: str) -> Tuple[int, str]
Parses Markdown heading to determine question level (count of leading #) and stripped question text.
chunk(filename: str, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) -> List[dict]
Parameters:
- filename (str): File path or name.
- binary (bytes, optional): File content in bytes.
- from_page (int): Start page for paginated formats.
- to_page (int): End page.
- lang (str): Language, "English" or "Chinese".
- callback (callable, optional): For progress updates.
- **kwargs: Additional optional arguments.
Returns:
List of dictionaries, each representing a chunked Q&A pair with tokenized content and metadata.
Description:
Main entry point to process files of different types, auto-detecting format by extension and dispatching to appropriate parser. Supports Excel, CSV, TXT, PDF, Markdown, DOCX. Raises NotImplementedError for unsupported formats.
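
The two text helpers in this list are small enough to sketch. The versions below are hypothetical stand-ins for `rmPrefix` and `mdQuestionLevel`: the names `strip_qa_prefix` and `heading_level`, the exact prefix list, and the accepted separators are assumptions, and the module's own patterns may differ.

```python
import re
from typing import Tuple

# Assumed prefix list and separators; the real module may recognise others.
_PREFIX = re.compile(r"^(question|answer|q|a|问题|答案|问|答)[\t:： ]+", re.IGNORECASE)


def strip_qa_prefix(text: str) -> str:
    """Remove one leading Q&A label such as 'Question:' or 'A:'."""
    return _PREFIX.sub("", text.strip())


def heading_level(line: str) -> Tuple[int, str]:
    """Return (number of leading '#', heading text without the markers)."""
    stripped = line.lstrip("#")
    return len(line) - len(stripped), stripped.strip()


print(strip_qa_prefix("Question: What is RAG?"))
print(heading_level("## How do I install?"))
```

A side effect of treating a bare 'A' followed by a space as a prefix is that a sentence beginning with the article "A" loses its first word in this sketch, so a stricter separator (a colon only) may be preferable in practice.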

Important Implementation Details

Q&A Extraction Logic:
Each parser applies heuristics or uses structural markers (like bullet points, heading levels, or column positions) to identify questions and answers.
Image Handling:
PDF and DOCX parsers extract images associated with Q&A pairs and encode positional information for downstream use.
Tokenization and Cleaning:
After extraction, question texts are cleaned (prefix removal) and tokenized using rag_tokenizer to prepare for NLP tasks.
Callback Usage:
Parsers optionally report progress via a callback function, useful for UI updates or logging.
Multi-level Questions in DOCX and Markdown:
The DOCX and Markdown parsers use stacks to track nested question levels, derived from paragraph styles or heading markers respectively.
Table Extraction:
Extracted tables are converted to HTML strings for Markdown/DOCX, and positional metadata is maintained for PDFs.
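
To make the cleaning and tokenization step concrete, here is a rough sketch of how a Q&A pair might be turned into an output record. Only the `content_with_weight` key appears in this document; the `tokens` and `row_num` keys, the `to_record` and `simple_tokenize` names, and the "Question: ... Answer: ..." layout are assumptions. The whitespace-and-punctuation tokenizer merely stands in for `rag_tokenizer`, which is not reproduced here.

```python
import re
from typing import Dict, List


def simple_tokenize(text: str) -> List[str]:
    # Stand-in for the module's rag_tokenizer (assumption: word-level tokens).
    return re.findall(r"\w+", text.lower())


def to_record(base: Dict, question: str, answer: str, row_num: int = -1) -> Dict:
    """Return a copy of `base` extended with Q&A content and tokens."""
    record = dict(base)  # copy so shared document metadata is not mutated
    record["content_with_weight"] = f"Question: {question}\tAnswer: {answer}"
    record["tokens"] = simple_tokenize(question)  # hypothetical key
    if row_num >= 0:
        record["row_num"] = row_num  # hypothetical key
    return record


rec = to_record({"docnm": "faq.xlsx"}, "What is OCR?", "Text recognition.", row_num=3)
print(rec["content_with_weight"])
print(rec["tokens"])
```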

Interaction with Other System Components

Depends on external modules:
- deepdoc.parser for base parsers (PdfParser, ExcelParser, DocxParser).
- rag.nlp for NLP utilities like tokenization, question bullet detection.
- deepdoc.parser.utils.get_text for text extraction from files.
- rag.utils.get_float for positional parsing.
The output data structures (dictionaries with tokenized content, images, positions) are designed for downstream consumption by indexing, search, or QA systems.
The chunk function serves as a unified interface used by other system components to parse various document inputs.
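
The unified-interface idea amounts to a dispatch on the file extension. The sketch below is illustrative only: the `PARSERS` table, its labels, the `.markdown` entry, and the `pick_parser` name are assumptions, and the real `chunk` function calls the parsers rather than returning a label. The `NotImplementedError` for unknown types follows the documented behaviour.

```python
import os

# Hypothetical mapping from file extension to a parser label.
PARSERS = {
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "delimited",
    ".txt": "delimited",
    ".pdf": "pdf",
    ".md": "markdown",
    ".markdown": "markdown",
    ".docx": "docx",
}


def pick_parser(filename: str) -> str:
    """Choose a parser label from the file extension, case-insensitively."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise NotImplementedError(f"unsupported file type: {ext or filename}") from None


print(pick_parser("FAQ.PDF"))
print(pick_parser("notes.md"))
```

A table-driven dispatch keeps the supported formats in one place, so adding a format means adding one entry rather than another branch.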

Usage Examples

from qa import chunk

# Parse an Excel file containing Q&A pairs
qa_chunks = chunk("questions_answers.xlsx", lang="English")

# Parse a PDF document with Q&A content
qa_chunks_pdf = chunk("sample.pdf", from_page=0, to_page=10, callback=print)

# Parse a Markdown file with hierarchical questions
qa_chunks_md = chunk("faq.md")

for qa in qa_chunks:
    print(qa["content_with_weight"])

Mermaid Class Diagram

classDiagram
    class Excel {
        +__call__(fnm, binary=None, callback=None) List~Tuple~str,str~~
        - is_english: bool
    }
    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) Tuple~List~Tuple~str,str,Image,List~~, List~
        +get_tbls_info(tbls, tbl_index) Tuple
    }
    class Docx {
        +__init__()
        +get_picture(document, paragraph) Image or None
        +__call__(filename, binary=None, from_page=0, to_page=100000, callback=None) Tuple~List~Tuple~str,str,Image~~, List~
    }
    class Functions {
        +rmPrefix(txt) str
        +beAdocPdf(d, q, a, eng, image, poss) dict
        +beAdocDocx(d, q, a, eng, image, row_num=-1) dict
        +beAdoc(d, q, a, eng, row_num=-1) dict
        +mdQuestionLevel(s) Tuple~int,str~
        +chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) List~dict~
    }

Summary

The qa.py file is a core utility module in the InfiniFlow project focused on robust, multi-format extraction of question-answer pairs from documents. It supports a wide variety of file formats and integrates OCR, NLP, and document parsing techniques to convert raw file contents into structured, tokenized Q&A data ready for further processing in knowledge systems or search engines.
