qa.py


Overview

The qa.py module extracts and processes question-and-answer (Q&A) pairs from Excel, PDF, DOCX, Markdown, CSV, and TXT files. It provides a specialized parser for each document type, using external libraries and custom logic to identify Q&A structures; extract text, images, and tables; and tokenize text for downstream NLP tasks such as information retrieval or knowledge graph construction.

Key functionalities include:

This module is part of a larger system (likely InfiniFlow) focused on document understanding, question answering, or knowledge extraction.
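Because the module handles several formats through format-specific parsers, some form of dispatch on the file type is implied. The following is a minimal sketch of one way to do that, assuming dispatch by file extension; `PARSER_BY_SUFFIX` and `pick_parser` are hypothetical names for illustration and are not taken from qa.py.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a parser label. The labels
# mirror the formats listed in the overview, not the module's real internals.
PARSER_BY_SUFFIX = {
    ".xlsx": "excel",
    ".xls": "excel",
    ".pdf": "pdf",
    ".docx": "docx",
    ".md": "markdown",
    ".markdown": "markdown",
    ".csv": "csv",
    ".txt": "txt",
}


def pick_parser(filename: str) -> str:
    """Return the parser label for a filename, ignoring extension case."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSER_BY_SUFFIX[suffix]
    except KeyError:
        # Unknown or missing extension: fail loudly rather than guess.
        raise NotImplementedError(f"Unsupported file type: {suffix or filename}")
```

A table-driven lookup like this keeps the supported formats in one place, so adding a format means adding one entry rather than another branch.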


Classes and Functions

Class: Excel(ExcelParser)

Purpose:
Parses Excel files (.xls, .xlsx) and extracts Q&A pairs, assuming each row holds the question in its first populated cell and the answer in its second.
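The row convention described here can be illustrated without any spreadsheet library. This is a minimal sketch, assuming rows arrive as sequences of cell values and that rows with fewer than two populated cells are skipped; `qa_from_rows` is a hypothetical helper, not the Excel class's actual code.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def qa_from_rows(rows: Iterable[Sequence[Optional[object]]]) -> List[Tuple[str, str]]:
    """Pair the first two populated cells of each row as (question, answer)."""
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        # Keep only populated cells, converted to stripped strings.
        cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if len(cells) < 2:
            # Assumption: a row without both a question and an answer is skipped.
            continue
        pairs.append((cells[0], cells[1]))
    return pairs


rows = [
    (None, "What is X?", "X is Y.", "ignored extra cell"),
    ("Only a question", None),
    ("Q2", "", "A2"),
]
print(qa_from_rows(rows))
```

Cells beyond the second populated one are ignored in this sketch; whether the real parser does the same is not stated in the source.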

Methods:
__call__(fnm, binary=None, callback=None): returns a list of (question, answer) string tuples.


Class: Pdf(PdfParser)

Purpose:
Handles PDF files: performs OCR, layout analysis, and table extraction, then identifies Q&A bullet points to segment the text into Q&A pairs with associated images and positional metadata.
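The segmentation step can be sketched on plain text lines, leaving OCR, layout analysis, images, and positions aside. This is an illustration only: the bullet styles matched by `QUESTION_BULLET` ("1.", "2)", "Q3:", "Question 4:") and the function `segment_qa` are assumptions, not the patterns the Pdf class actually uses.

```python
import re
from typing import List, Optional, Tuple

# Assumed question-bullet styles: "1.", "2)", "Q3:", "Question 4:".
# The bullet must be followed by whitespace or the end of the line.
QUESTION_BULLET = re.compile(
    r"^(?:Q(?:uestion)?\s*\d*\s*[:.]|\d+\s*[.)])(?:\s+|$)",
    re.IGNORECASE,
)


def segment_qa(lines: List[str]) -> List[Tuple[str, str]]:
    """Start a new question at each bullet line; following lines form the answer."""
    pairs: List[Tuple[str, str]] = []
    question: Optional[str] = None
    answer: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        match = QUESTION_BULLET.match(text)
        if match:
            if question is not None:
                pairs.append((question, "\n".join(answer)))
            # Drop the bullet itself and keep the rest as the question text.
            question, answer = text[match.end():], []
        elif question is not None:
            answer.append(text)
        # Lines before the first bullet are ignored in this sketch.
    if question is not None:
        pairs.append((question, "\n".join(answer)))
    return pairs
```

A regex this simple will misfire on answer lines that themselves start with a number and a period, which is one reason a real parser would lean on layout information as well.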

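Methods:
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
get_tbls_info(tbls, tbl_index)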


Class: Docx(DocxParser)

Purpose:
Parses DOCX files, extracting hierarchical questions, their answers, embedded images, and tables.
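One way to read "hierarchical questions" is that a nested heading inherits its parent headings as context. The sketch below assumes paragraphs have already been reduced to (level, text) pairs, where level 0 is body text and higher levels are heading depths; `build_hierarchical_qa` and the " > " separator are hypothetical choices for illustration, and images and tables are left out.

```python
from typing import List, Tuple


def build_hierarchical_qa(paragraphs: List[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Turn (level, text) paragraphs into (question path, answer) pairs."""
    stack: List[Tuple[int, str]] = []  # currently open headings, outermost first
    answer: List[str] = []             # body lines under the innermost heading
    pairs: List[Tuple[str, str]] = []

    def flush() -> None:
        # Emit a pair only when there is both a heading and some body text.
        if stack and answer:
            question = " > ".join(text for _, text in stack)
            pairs.append((question, "\n".join(answer)))
        answer.clear()

    for level, text in paragraphs:
        if level > 0:
            flush()
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            answer.append(text)
    flush()
    return pairs
```

Keeping the open headings on a stack means a question such as "How do I pay?" carries its parent section ("Billing") along with it, which helps when the same sub-question appears under several sections.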

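Methods:
__init__()
get_picture(document, paragraph): returns an Image or None.
__call__(filename, binary=None, from_page=0, to_page=100000, callback=None)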


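Utility Functions

rmPrefix(txt): returns a str.
beAdocPdf(d, q, a, eng, image, poss): returns a dict.
beAdocDocx(d, q, a, eng, image, row_num=-1): returns a dict.
beAdoc(d, q, a, eng, row_num=-1): returns a dict.
mdQuestionLevel(s): returns an (int, str) tuple.
chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs): returns a list of dicts.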


Important Implementation Details


Interaction with Other System Components


Usage Examples

from qa import chunk

# Parse an Excel file containing Q&A pairs
qa_chunks = chunk("questions_answers.xlsx", lang="English")

# Parse a PDF document with Q&A content
qa_chunks_pdf = chunk("sample.pdf", from_page=0, to_page=10, callback=print)

# Parse a Markdown file with hierarchical questions
qa_chunks_md = chunk("faq.md")

# Print the content_with_weight field of each chunk from the Excel file
for qa in qa_chunks:
    print(qa["content_with_weight"])
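The source does not say how chunk invokes its callback. As a hedged sketch, assuming the callback receives a progress fraction and a message, either of which may be omitted or passed by keyword, a small reporter that tolerates both call styles could look like this. The names `progress_callback`, `progress_log`, `prog`, and `msg` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# Record of every (progress, message) update received, useful for debugging.
progress_log: List[Tuple[Optional[float], str]] = []


def progress_callback(prog: Optional[float] = None, msg: str = "") -> None:
    """Log a progress update; accepts positional or keyword arguments."""
    progress_log.append((prog, msg))
    shown = "--" if prog is None else f"{prog:.0%}"
    print(f"[{shown}] {msg}")


# Both assumed call styles work:
progress_callback(0.1, "Start to parse.")
progress_callback(msg="OCR finished")
```

If the callback is in fact called with keyword arguments, a bare `print` would reject them, so a wrapper of this kind is the safer thing to pass.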

Mermaid Class Diagram

classDiagram
    class Excel {
        +__call__(fnm, binary=None, callback=None) List~Tuple~str,str~
        - is_english: bool
    }
    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) Tuple~List~Tuple~str,str,Image,List~, List~
        +get_tbls_info(tbls, tbl_index) Tuple
    }
    class Docx {
        +__init__()
        +get_picture(document, paragraph) Image or None
        +__call__(filename, binary=None, from_page=0, to_page=100000, callback=None) Tuple~List~Tuple~str,str,Image~, List~
    }
    class Functions {
        +rmPrefix(txt) str
        +beAdocPdf(d, q, a, eng, image, poss) dict
        +beAdocDocx(d, q, a, eng, image, row_num=-1) dict
        +beAdoc(d, q, a, eng, row_num=-1) dict
        +mdQuestionLevel(s) Tuple~int,str~
        +chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) List~dict~
    }
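The diagram lists mdQuestionLevel(s) with a (int, str) return type but does not describe its behavior. A plausible reading, given the Markdown support for hierarchical questions, is that it returns the heading depth and the heading text. The sketch below implements that reading under a different name, `md_question_level`, to make clear it is an assumption and not the module's code.

```python
import re
from typing import Tuple


def md_question_level(s: str) -> Tuple[int, str]:
    """Return (number of leading '#', remaining text) for a Markdown line."""
    match = re.match(r"#+", s)
    if not match:
        # Not a heading: level 0, text unchanged.
        return 0, s
    return len(match.group(0)), s[match.end():].strip()


print(md_question_level("## How do I install it?"))
print(md_question_level("Just a normal line."))
```

Under this reading, a level of 0 marks answer text and any higher level marks a question at that depth.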

Summary

The qa.py file is a core utility module in the InfiniFlow project for extracting question-answer pairs from Excel, PDF, DOCX, Markdown, CSV, and TXT documents. It combines OCR, NLP, and document parsing techniques to convert raw file contents into structured, tokenized Q&A data ready for further processing in knowledge systems or search engines.