laws.py

Overview

laws.py is a core utility module within the InfiniFlow system that parses and chunks legal documents in DOCX, PDF, TXT, HTML, and legacy DOC formats. Its specialized parsers extract structured text from document files, identify hierarchical question/answer levels in DOCX files, and perform OCR and layout analysis for PDFs. The extracted text is then tokenized into chunks for downstream processing such as NLP or search indexing.

The file integrates multiple parsing strategies and NLP preprocessing techniques to handle complex legal document structures effectively. It supports callback mechanisms to report progress during long-running operations like OCR, making it suitable for interactive or batch processing environments.
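
As an illustration of the callback mechanism, here is a minimal sketch of a long-running loop that reports progress. The function `process_pages` and its stand-in "OCR" step are hypothetical and are not taken from laws.py; the callback signature follows the `progress_callback(progress=None, msg="")` shape used in the Usage Example in this document.

```python
def process_pages(pages, callback=None):
    """Hypothetical page loop that reports fractional progress via a callback."""
    results = []
    total = len(pages)
    for i, page in enumerate(pages, start=1):
        # Stand-in for an expensive step such as OCR (assumption, not the real work).
        results.append(page.upper())
        if callback is not None:
            # Report the fraction of pages completed so far.
            callback(progress=i / total, msg=f"page {i}/{total} done")
    if callback is not None:
        # A final message without a progress value.
        callback(msg="finished")
    return results
```

Keeping the callback optional lets the same function run silently in batch jobs and report progress in interactive ones.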


Classes and Functions

Class: Docx

Inheritance: DocxParser (from deepdoc.parser)

A parser specialized for DOCX files that extracts text paragraphs and identifies hierarchical question levels within paragraphs, enabling the reconstruction of nested question-answer sections commonly found in legal documents.
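
The idea of reconstructing nested sections from paragraph levels can be sketched as follows. This document does not show how Docx detects levels, so the regular expressions, `heading_level`, and `attach_headings` below are hypothetical stand-ins: each paragraph gets a level, a stack tracks the currently open headings, and each body paragraph is emitted together with the headings it is nested under.

```python
import re

# Hypothetical heading patterns, ordered from outermost to innermost level.
LEVEL_PATTERNS = [
    re.compile(r"^Chapter\s+\w+", re.IGNORECASE),
    re.compile(r"^Section\s+\w+", re.IGNORECASE),
    re.compile(r"^Article\s+\w+", re.IGNORECASE),
]

def heading_level(text):
    """Return 1..N for a heading line, or 0 for body text."""
    for level, pattern in enumerate(LEVEL_PATTERNS, start=1):
        if pattern.match(text.strip()):
            return level
    return 0

def attach_headings(paragraphs):
    """Prefix each body paragraph with the headings it is nested under."""
    stack = []  # open headings as (level, text), outermost first
    out = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        level = heading_level(para)
        if level:
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, para))
        else:
            path = [heading for _, heading in stack]
            out.append(" > ".join(path + [para]))
    return out
```

With this approach each emitted chunk carries its own context (for example, chapter and article), which is useful when chunks are later indexed independently.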

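Methods:

__clean(line: str) -> str (private)
old_call(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
__call__(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
__str__() -> str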

Class: Pdf

Inheritance: PdfParser (from deepdoc.parser)

A parser specialized for PDF documents. It runs OCR on page images and applies layout analysis to extract textual content together with positional metadata.
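
To make "text with positional metadata" concrete, here is a minimal sketch of one way such output could be represented. The `TextBox` dataclass, the `boxes_to_lines` helper, the position-tag format, and the half-open page range are all assumptions for illustration; they are not the data structures used by Pdf or PdfParser.

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    """Hypothetical OCR result: a piece of text and where it sits on a page."""
    page: int    # zero-based page index
    x0: float    # left edge of the box
    top: float   # distance from the top of the page
    text: str

def boxes_to_lines(boxes, from_page=0, to_page=100000):
    """Return (text, position_tag) pairs in reading order for the page range.

    Assumption: the range is half-open, i.e. from_page <= page < to_page.
    """
    selected = [b for b in boxes if from_page <= b.page < to_page]
    # Reading order: by page, then top to bottom, then left to right.
    selected.sort(key=lambda b: (b.page, b.top, b.x0))
    return [
        (b.text, f"page={b.page} x0={b.x0:.1f} top={b.top:.1f}")
        for b in selected
    ]
```

Pairing each text fragment with a position tag allows later stages to trace a chunk back to its location in the source PDF.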

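Methods:

__init__()
__call__(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000, zoomin: int = 3, callback: callable = None)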

Function: chunk

A high-level utility function that detects the file type of the input document and delegates parsing to the appropriate parser (Docx, Pdf, PlainParser, HtmlParser, or uses Tika for .doc files). It then applies NLP preprocessing to tokenize and chunk the extracted text for further processing.
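
The file-type detection step can be sketched as an extension check. The rule table and the `detect_format` helper below are hypothetical and only return a format label; how chunk actually matches file names and which parser it assigns to each format are not shown in this document.

```python
import re

# Hypothetical extension rules; each maps a file-name pattern to a format label.
FORMAT_RULES = [
    (re.compile(r"\.docx$", re.IGNORECASE), "docx"),
    (re.compile(r"\.pdf$", re.IGNORECASE), "pdf"),
    (re.compile(r"\.txt$", re.IGNORECASE), "txt"),
    (re.compile(r"\.html?$", re.IGNORECASE), "html"),
    (re.compile(r"\.doc$", re.IGNORECASE), "doc"),
]

def detect_format(filename):
    """Return the format label for a file name, based on its extension."""
    for pattern, fmt in FORMAT_RULES:
        if pattern.search(filename):
            return fmt
    # Fail loudly for anything outside the supported formats.
    raise NotImplementedError(f"unsupported file type: {filename}")
```

A caller would then look up the parser for the returned label, which keeps the format detection separate from the parsing logic.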

Signature

chunk(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000,
      lang: str = "Chinese", callback: callable = None, **kwargs) -> list

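Parameters

filename (str): name of the input document; its file type determines which parser is used.
binary (bytes, default None): optional document content as bytes.
from_page (int, default 0): first page of the range to process.
to_page (int, default 100000): end of the page range to process.
lang (str, default "Chinese"): language of the document.
callback (callable, default None): optional function used to report progress during long-running operations such as OCR.
**kwargs: additional keyword arguments.

Returns

list: the chunks produced from the extracted text.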

Usage Example

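# Assumption: laws.py is importable as a module named "laws"; adjust the
# import to match where the file lives in your installation.
from laws import chunk

def progress_callback(progress=None, msg=""):
    # Print fractional progress as a percentage when it is provided.
    if progress is not None:
        print(f"Progress: {progress*100:.1f}% - {msg}")
    else:
        print(msg)

chunks = chunk("legal_document.docx", lang="English", callback=progress_callback)
for chunk_text in chunks:
    print(chunk_text)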

Implementation Details


Important Implementation Details and Algorithms


System Interaction


Diagram: Class Structure for Docx and Pdf

classDiagram
    class Docx {
        -doc: Document
        -__clean(line: str): str
        +old_call(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
        +__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
        +__str__(): str
    }
    Docx --|> DocxParser

    class Pdf {
        -model_speciess: ParserType
        +__init__()
        +__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000, zoomin: int=3, callback: callable=None): tuple~list~tuple~str, Any~~, None~
    }
    Pdf --|> PdfParser

Summary

The laws.py file is a document parsing module for legal documents. It supports DOCX, PDF, TXT, HTML, and DOC input, with hierarchical parsing for DOCX files and OCR with layout analysis for PDFs. Within the InfiniFlow system's document ingestion pipeline, it prepares legal texts for semantic processing by turning complex document structures into clean, tokenized text chunks.

