qa.py
Overview
The qa.py module is designed for extracting and processing Question & Answer (Q&A) pairs from various document formats including Excel, PDF, DOCX, Markdown, CSV, and TXT files. It provides specialized parsers for each document type leveraging external libraries and custom logic to identify Q&A structures, extract content, images, tables, and tokenize text for downstream NLP tasks such as information retrieval or knowledge graph construction.
Key functionalities include:
Parsing structured Q&A pairs from Excel and CSV/TXT files.
OCR and layout analysis for PDF documents to extract Q&A blocks, images, and tables.
DOCX document parsing with support for nested question levels and embedded images.
Markdown parsing to identify hierarchical questions via heading levels.
Utilities to clean and tokenize extracted Q&A content.
A unified chunk function that auto-detects file type and applies the appropriate parser.
This module is part of a larger system (likely InfiniFlow) focused on document understanding, question answering, or knowledge extraction.
Classes and Functions
Class: Excel(ExcelParser)
Purpose:
Parses Excel files (.xls, .xlsx) extracting Q&A pairs from sheets assumed to have questions in the first populated cell of a row and answers in the second.
Methods:
__call__(fnm, binary=None, callback=None) -> List[Tuple[str, str]]
Parameters:
fnm (str): Path to the Excel file.
binary (bytes, optional): Binary content of the Excel file instead of a file path.
callback (callable, optional): A function to report progress.
Returns:
List of tuples (question, answer) extracted from the Excel sheets.
Description:
Loads the workbook, iterates over sheets and rows, extracting the first two non-empty cells as question and answer. Calls callback periodically to report progress. Also detects if questions are mostly English.
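The row rule described above (first two non-empty cells become question and answer) can be sketched on plain Python rows, without opening a workbook. This is a minimal illustration, not the module's code: extract_qa_rows is a hypothetical helper, and skipping rows with fewer than two populated cells is an assumption.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def extract_qa_rows(rows: Iterable[Sequence[Optional[object]]]) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs from rows of cell values (hypothetical helper)."""
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        # Keep only cells that hold non-blank text, preserving column order.
        cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if len(cells) >= 2:
            # First populated cell is the question, second is the answer.
            pairs.append((cells[0], cells[1]))
        # Assumption: rows with fewer than two populated cells are skipped.
    return pairs


rows = [
    (None, "How do I pay?", "By card.", None),
    ("", None, ""),
    ("Only a question",),
]
print(extract_qa_rows(rows))
```

Working on cell values rather than a workbook object keeps the sketch independent of any particular Excel library.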
Class: Pdf(PdfParser)
Purpose:
Handles PDF files, performing OCR, layout analysis, table extraction, and identification of Q&A bullet points to segment text into Q&A pairs with associated images and positional metadata.
Methods:
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) -> Tuple[List[Tuple[str, str, Image, List]], List]
Parameters:
filename (str): Path to the PDF file.
binary (bytes, optional): PDF file content as bytes.
from_page (int): Start page for processing (0-based).
to_page (int): End page for processing.
zoomin (int): Zoom factor for OCR and layout analysis.
callback (callable, optional): Function for progress reporting.
Returns:
qai_list: List of tuples (question, answer, image, positions) representing extracted Q&A blocks.
tbls: List of extracted tables and figures.
Description:
Runs OCR, layout, and table analysis. Uses regex and bullet detection to parse questions and answers. Aggregates images and positional data using crop. Raises an error if no Q&A structure is detected.
get_tbls_info(tbls, tbl_index) -> Tuple
Parameters:
tbls (List): List of table data extracted.
tbl_index (int): Index of the table to retrieve info for.
Returns:
Tuple containing page number, bounding box coordinates, a formatted tag string, and concatenated table text.
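To make the bullet-based segmentation in Pdf.__call__ more concrete, here is a rough sketch that works on already-extracted text lines. The bullet pattern, the split_qa name, and the choice of ValueError are all assumptions made for illustration; the module's own bullet detection lives in rag.nlp and may behave differently.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical bullet pattern: "1.", "2)", "Q3:" and similar at the start of a line.
QUESTION_BULLET = re.compile(r"^\s*(?:Q\s*\d*\s*[:.]|\d+\s*[.)])\s*")


def split_qa(lines: List[str]) -> List[Tuple[str, str]]:
    """Group lines into (question, answer) pairs using the bullet pattern."""
    pairs: List[Tuple[str, str]] = []
    question: Optional[str] = None
    answer: List[str] = []
    for line in lines:
        if QUESTION_BULLET.match(line):
            # A new bullet closes the previous question, if there was one.
            if question is not None:
                pairs.append((question, "\n".join(answer)))
            question = QUESTION_BULLET.sub("", line, count=1).strip()
            answer = []
        elif question is not None and line.strip():
            # Non-bullet lines after a question belong to its answer.
            answer.append(line.strip())
    if question is not None:
        pairs.append((question, "\n".join(answer)))
    if not pairs:
        # Mirrors the documented behaviour of failing when no Q&A structure is found.
        raise ValueError("no Q&A structure detected")
    return pairs
```

Text that appears before the first bullet is dropped in this sketch, and an answer line that happens to start with a number and a period would be misread as a question; a real parser would need layout information to avoid that.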
Class: Docx(DocxParser)
Purpose:
Parses DOCX files extracting hierarchical questions, answers, embedded images, and tables.
Methods:
__init__()
Default constructor.
get_picture(document, paragraph) -> PIL.Image or None
Extracts and returns the first image found in a paragraph.
__call__(filename, binary=None, from_page=0, to_page=100000, callback=None) -> Tuple[List[Tuple[str, str, Image]], List]
Parses the DOCX document, extracting question-answer-image tuples based on heading levels. Also extracts tables as HTML strings. Supports page breaks indicated by runs in paragraphs.
Utility Functions
rmPrefix(txt: str) -> str
Removes common Q&A prefixes such as 'Question:', 'Answer:', 'Q', 'A', and their Chinese equivalents from the start of a string.
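A prefix stripper of this kind is usually a single anchored regular expression. The sketch below is an assumption about how such a function could look, not the module's actual pattern: strip_qa_prefix is a hypothetical name, the prefix list is illustrative, and requiring a colon after the prefix is a choice made here so that ordinary words starting with Q or A are left alone.

```python
import re

# Assumed prefix list and separator; the module's real pattern may differ.
_PREFIX = re.compile(
    r"^(?:Question|Answer|Q|A|问题|答案|问|答)\s*[:：]\s*",
    re.IGNORECASE,
)


def strip_qa_prefix(txt: str) -> str:
    """Remove one leading Q&A label (English or Chinese) from txt."""
    return _PREFIX.sub("", txt.strip(), count=1)


print(strip_qa_prefix("Question: What is the refund policy?"))
print(strip_qa_prefix("Apples are red"))
```

Both ASCII and full-width colons are accepted because Chinese text commonly uses the full-width form.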
beAdocPdf(d: dict, q: str, a: str, eng: bool, image, poss: list) -> dict
Formats a Q&A pair extracted from a PDF into a dictionary with tokenized content and positional data.
beAdocDocx(d: dict, q: str, a: str, eng: bool, image, row_num: int = -1) -> dict
Similar to beAdocPdf but adapted for DOCX data, optionally includes row number.
beAdoc(d: dict, q: str, a: str, eng: bool, row_num: int = -1) -> dict
Generic formatter for Q&A pairs without images.
mdQuestionLevel(s: str) -> Tuple[int, str]
Parses Markdown heading to determine question level (count of leading #) and stripped question text.
chunk(filename: str, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) -> List[dict]
Parameters:
filename (str): File path or name.
binary (bytes, optional): File content in bytes.
from_page (int): Start page for paginated formats.
to_page (int): End page.
lang (str): Language, "English" or "Chinese".
callback (callable, optional): For progress updates.
**kwargs: Additional optional arguments.
Returns:
List of dictionaries, each representing a chunked Q&A pair with tokenized content and metadata.
Description:
Main entry point to process files of different types, auto-detecting format by extension and dispatching to appropriate parser. Supports Excel, CSV, TXT, PDF, Markdown, DOCX. Raises NotImplementedError for unsupported formats.
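The extension-based dispatch that chunk performs can be pictured as a small lookup followed by a fallback error. The sketch below only selects a parser label; pick_parser, the PARSERS table, the label strings, and the exact extension patterns are hypothetical and chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical mapping from filename pattern to a parser label.
PARSERS: List[Tuple[str, str]] = [
    (r"\.xlsx?$", "excel"),
    (r"\.(csv|txt)$", "csv_or_txt"),
    (r"\.pdf$", "pdf"),
    (r"\.md$", "markdown"),
    (r"\.docx$", "docx"),
]


def pick_parser(filename: str) -> str:
    """Return the parser label for filename, based only on its extension."""
    for pattern, name in PARSERS:
        if re.search(pattern, filename, re.IGNORECASE):
            return name
    # Same failure mode as documented for chunk.
    raise NotImplementedError(f"unsupported file type: {filename}")


print(pick_parser("questions_answers.xlsx"))
print(pick_parser("faq.md"))
```

Matching is case-insensitive here so that names like FAQ.PDF are accepted; whether the real function does the same is not stated in this document.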
Important Implementation Details
Q&A Extraction Logic:
Each parser applies heuristics or uses structural markers (like bullet points, heading levels, or column positions) to identify questions and answers.
Image Handling:
PDF and DOCX parsers extract images associated with Q&A pairs and encode positional information for downstream use.
Tokenization and Cleaning:
After extraction, question texts are cleaned (prefix removal) and tokenized using rag_tokenizer to prepare for NLP tasks.
Callback Usage:
Parsers optionally report progress via a callback function, useful for UI updates or logging.
Multi-level Questions in DOCX and Markdown:
Uses stacks to maintain nested question levels based on heading or paragraph styles.
Table Extraction:
Extracted tables are converted to HTML strings for Markdown/DOCX, and positional metadata is maintained for PDFs.
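The stack-based handling of nested question levels can be illustrated with Markdown headings. This is a sketch under stated assumptions: heading_level stands in for the behaviour described for mdQuestionLevel, nested_questions is a hypothetical helper, and joining ancestor headings with " > " is an arbitrary choice for the example.

```python
from typing import List, Tuple


def heading_level(line: str) -> Tuple[int, str]:
    """Return (count of leading '#', heading text); level 0 means not a heading."""
    stripped = line.lstrip()
    level = len(stripped) - len(stripped.lstrip("#"))
    return level, stripped[level:].strip()


def nested_questions(lines: List[str]) -> List[Tuple[str, str]]:
    """Pair each block of body text with the chain of headings above it."""
    stack: List[Tuple[int, str]] = []  # (level, heading text), outermost first
    pairs: List[Tuple[str, str]] = []
    answer: List[str] = []

    def flush() -> None:
        # Emit the pending answer under the current heading chain, if any.
        if stack and answer:
            question = " > ".join(text for _, text in stack)
            pairs.append((question, "\n".join(answer)))
        answer.clear()

    for line in lines:
        level, text = heading_level(line)
        if level > 0:
            flush()
            # Pop headings at the same or a deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        elif line.strip():
            answer.append(line.strip())
    flush()
    return pairs
```

The stack always holds one heading per active level, so a level-2 heading replaces any previous level-2 or deeper entry while keeping its level-1 parent.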
Interaction with Other System Components
Depends on external modules:
deepdoc.parser for base parsers (PdfParser, ExcelParser, DocxParser).
rag.nlp for NLP utilities like tokenization, question bullet detection.
deepdoc.parser.utils.get_text for text extraction from files.
rag.utils.get_float for positional parsing.
The output data structures (dictionaries with tokenized content, images, positions) are designed for downstream consumption by indexing, search, or QA systems.
The chunk function serves as a unified interface used by other system components to parse various document inputs.
Usage Examples
from qa import chunk
# Parse an Excel file containing Q&A pairs
qa_chunks = chunk("questions_answers.xlsx", lang="English")
# Parse a PDF document with Q&A content
qa_chunks_pdf = chunk("sample.pdf", from_page=0, to_page=10, callback=print)
# Parse a Markdown file with hierarchical questions
qa_chunks_md = chunk("faq.md")
# Each chunk is a dict; print its content_with_weight field
for qa in qa_chunks:
    print(qa["content_with_weight"])
Mermaid Class Diagram
classDiagram
    class Excel {
        +__call__(fnm, binary=None, callback=None) List~Tuple~str,str~~
        - is_english: bool
    }
    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) Tuple~List~Tuple~str,str,Image,List~~, List~
        +get_tbls_info(tbls, tbl_index) Tuple
    }
    class Docx {
        +__init__()
        +get_picture(document, paragraph) Image or None
        +__call__(filename, binary=None, from_page=0, to_page=100000, callback=None) Tuple~List~Tuple~str,str,Image~~, List~
    }
    class Functions {
        +rmPrefix(txt) str
        +beAdocPdf(d, q, a, eng, image, poss) dict
        +beAdocDocx(d, q, a, eng, image, row_num=-1) dict
        +beAdoc(d, q, a, eng, row_num=-1) dict
        +mdQuestionLevel(s) Tuple~int,str~
        +chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) List~dict~
    }
Summary
The qa.py file is a core utility module in the InfiniFlow project focused on robust, multi-format extraction of question-answer pairs from documents. It supports a wide variety of file formats and integrates OCR, NLP, and document parsing techniques to convert raw file contents into structured, tokenized Q&A data ready for further processing in knowledge systems or search engines.