laws.py

Overview

laws.py is a core utility module in the InfiniFlow system that parses and chunks legal documents in DOCX, PDF, TXT, HTML, and legacy DOC formats. Its specialized parsers extract structured text from document files, identify hierarchical question/answer levels in DOCX files, and perform OCR and layout analysis for PDFs. The extracted text is then tokenized into manageable chunks for downstream processing such as NLP or search indexing.

The file integrates multiple parsing strategies and NLP preprocessing techniques to handle complex legal document structures effectively. It supports callback mechanisms to report progress during long-running operations like OCR, making it suitable for interactive or batch processing environments.

Classes and Functions

Class: `Docx`

Inheritance: DocxParser (from deepdoc.parser)

A parser specialized for DOCX files that extracts text paragraphs and identifies hierarchical question levels within paragraphs, enabling the reconstruction of nested question-answer sections commonly found in legal documents.

Methods:

__init__(self)
Initializes the Docx parser instance. Takes no parameters.
__clean(self, line: str) -> str
Cleans a single line of text by replacing full-width spaces (\u3000) with normal spaces and stripping leading/trailing whitespace.
- Parameters:
  - line: A string representing a single line of text.
- Returns: A cleaned string.
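A minimal sketch of the cleaning step described above, written as a standalone function. The name clean_line is hypothetical; the real method is private to Docx and may differ in detail.

```python
def clean_line(line: str) -> str:
    """Replace full-width spaces (U+3000) with ASCII spaces, then trim."""
    # clean_line is an illustrative stand-in, not the module's private method.
    return line.replace("\u3000", " ").strip()


print(clean_line("\u3000第一条\u3000总则 "))
```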
old_call(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
Legacy method to parse DOCX files, returning a simple list of cleaned paragraph texts between specified page ranges.
- Parameters:
  - filename: Path to the DOCX file.
  - binary: Optional binary content of the DOCX file.
  - from_page: Start page number (inclusive).
  - to_page: End page number (exclusive).
- Returns: List of paragraph strings.
- Note: Uses page breaks detected via XML runs to count pages.
__call__(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
Main parsing method. Extracts paragraphs with their question levels, groups related paragraphs into sections based on hierarchical question levels, and returns the merged sections as a list of strings.
- Parameters:
  - filename: Path to the DOCX file.
  - binary: Optional binary content of the DOCX file.
  - from_page: Starting page index.
  - to_page: Ending page index.
- Returns: List of section strings.
- Implementation details:
  - Uses docx_question_level to determine the hierarchy level of each paragraph.
  - Merges paragraphs into sections by grouping paragraphs of increasing question levels.
  - Skips empty paragraphs.
  - Detects page breaks from the underlying XML structure.
  - Uses bullets_category to classify bullet styles to assist in structure detection.
__str__(self) -> str
Returns a string representation of the object showing question, answer, level, and child elements. These attributes are not defined in the visible code, so the method appears to be a placeholder or to rely on attributes defined elsewhere.

Class: `Pdf`

Inheritance: PdfParser (from deepdoc.parser)

A parser specialized for PDF documents, implementing OCR-based image extraction and layout analysis to extract textual content with positional metadata.

Methods:

__init__(self)
Initializes the Pdf parser instance and sets the model species to ParserType.LAWS, indicating this parser is tuned for legal document structures.
__call__(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000, zoomin: int = 3, callback: callable = None) -> tuple[list[tuple[str, Any]], None]
Parses PDF files by performing OCR on specified pages, analyzing the layout, merging text blocks, and extracting text lines along with tagging information.
- Parameters:
  - filename: Path to PDF file.
  - binary: Optional binary content of the PDF file.
  - from_page, to_page: Page range to process.
  - zoomin: Zoom factor for image processing before OCR (default 3).
  - callback: Optional function to report progress, called with either a float or a message string.
- Returns: Tuple containing a list of tuples (text, line_tag) and None.
- Implementation details:
  - Uses the inherited helpers __images__ for OCR image extraction, _layouts_rec for layout recognition, and _naive_vertical_merge for merging vertically aligned text blocks.
  - Progress is reported via the callback at multiple stages (OCR start/end, layout analysis, text extraction).
  - Timing is measured for performance logging.
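The staged progress reporting described above might look roughly like the sketch below. The function run_with_progress, the stage messages, and the progress values are all illustrative assumptions; the lambdas stand in for the real OCR, layout, and extraction steps.

```python
from timeit import default_timer as timer


def run_with_progress(stages, callback=None):
    """Run (progress, message, func) stages in order, reporting after each one."""
    start = timer()
    for progress, message, func in stages:
        func()
        if callback is not None:
            # Report the stage's progress value plus the elapsed time so far.
            callback(progress, f"{message} ({timer() - start:.2f}s)")


events = []
run_with_progress(
    [
        (0.6, "OCR finished", lambda: None),      # placeholder for the OCR step
        (0.7, "Layout analysis", lambda: None),   # placeholder for layout recognition
        (0.8, "Text extraction", lambda: None),   # placeholder for merging and extraction
    ],
    callback=lambda prog, msg: events.append((prog, msg)),
)
```

Passing callback=None keeps the pipeline silent, which suits batch runs where no UI is listening.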

Function: `chunk`

A high-level utility function that detects the file type of the input document and delegates parsing to the appropriate parser (Docx, Pdf, PlainParser, HtmlParser, or uses Tika for .doc files). It then applies NLP preprocessing to tokenize and chunk the extracted text for further processing.

Signature

chunk(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000,
      lang: str = "Chinese", callback: callable = None, **kwargs) -> list

Parameters

filename: File path or file name string to parse.
binary: Optional binary content of the file.
from_page: Page number from which to start parsing (default 0).
to_page: Page number to stop parsing (default 100000).
lang: Language of the document, affects tokenization (default "Chinese").
callback: Optional callback for progress reporting.
**kwargs: Additional parameters; notably:
- parser_config: Dict for parser options such as chunk token number, delimiter, and layout recognition method.

Returns

A list of tokenized text chunks suitable for NLP pipelines.

Usage Example

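# Assumption: the module is importable as rag.app.laws; adjust the path to your checkout.
from rag.app.laws import chunk


def progress_callback(progress=None, msg=""):
    # Print a percentage when a progress value is supplied, otherwise just the message.
    if progress is not None:
        print(f"Progress: {progress*100:.1f}% - {msg}")
    else:
        print(msg)


chunks = chunk("legal_document.docx", lang="English", callback=progress_callback)
for chunk_text in chunks:
    print(chunk_text)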

Implementation Details

Detects file type by file extension using regex.
For .docx files: uses Docx parser.
For .pdf files: uses the Pdf parser, or PlainParser when the layout recognition option in parser_config selects it.
For .txt files: reads text directly.
For .html/.htm files: uses HtmlParser.
For legacy .doc files: uses Apache Tika for extraction.
After parsing, it removes "Contents" sections, adjusts titles, categorizes bullets, and merges the text hierarchically.
Finally, it tokenizes the merged chunks using rag_tokenizer.
Raises NotImplementedError if the file type is unsupported.
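The extension-based dispatch above can be sketched as follows. This is a hedged illustration only: detect_file_type and the returned labels are hypothetical, and the real function calls the parsers directly instead of returning a label.

```python
import re

# Illustrative mapping; the labels are hypothetical, not names used by the module.
EXTENSION_PATTERNS = [
    (r"\.docx$", "docx"),
    (r"\.pdf$", "pdf"),
    (r"\.txt$", "txt"),
    (r"\.(htm|html)$", "html"),
    (r"\.doc$", "doc"),
]


def detect_file_type(filename: str) -> str:
    """Return a parser label based on the file extension (case-insensitive)."""
    for pattern, label in EXTENSION_PATTERNS:
        if re.search(pattern, filename, re.IGNORECASE):
            return label
    raise NotImplementedError(f"Unsupported file type: {filename}")
```

Anchoring each pattern with `$` keeps `.doc` from matching `.docx` files, so the order of the table does not affect the result.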

Important Implementation Details and Algorithms

Hierarchical Question Level Parsing (Docx):
The Docx class uses docx_question_level to assign numeric levels indicating the hierarchy of questions or sections. It then groups paragraphs into sections by finding contiguous paragraphs with incrementing levels, enabling the extraction of nested legal provisions or FAQs.
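The grouping described above can be sketched roughly as follows. This is a simplified illustration, not the module's code: the function name group_by_level and the (level, text) input shape are assumptions, and a smaller level number is taken to mean a higher position in the hierarchy.

```python
def group_by_level(lines):
    """Group (level, text) pairs into sections of a parent plus its nearest children."""
    sections = []
    claimed = [False] * len(lines)
    for start, (level, text) in enumerate(lines):
        # The span ends at the first later line at the same or a shallower level.
        end = start + 1
        while end < len(lines) and lines[end][0] > level:
            end += 1
        if end - start == 1 and claimed[start]:
            continue  # a leaf that was already merged into its parent
        children = []
        deeper_levels = sorted({lv for lv, _ in lines[start + 1:end]})
        if deeper_levels:
            child_level = deeper_levels[0]  # nearest deeper level inside the span
            for i in range(start + 1, end):
                if lines[i][0] == child_level:
                    children.append(lines[i][1])
                    claimed[i] = True
        sections.append("\n".join([text] + children))
    return sections


paragraphs = [
    (1, "Chapter 1"), (2, "Article 1"), (2, "Article 2"),
    (1, "Chapter 2"), (2, "Article 3"),
]
sections = group_by_level(paragraphs)
# sections == ["Chapter 1\nArticle 1\nArticle 2", "Chapter 2\nArticle 3"]
```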
Page Break Detection (Docx):
Page breaks are inferred by inspecting the XML of runs within paragraphs, looking for specific tags like lastRenderedPageBreak or w:br with type="page", to increment the page number counter.
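As a rough sketch of that check, the function below counts page breaks from the XML of each run, passed in as plain strings so the example runs without a DOCX library. The name count_page_breaks and the sample XML fragments are assumptions made for illustration.

```python
def count_page_breaks(run_xmls):
    """Count runs whose XML signals a rendered or explicit page break."""
    breaks = 0
    for xml in run_xmls:
        if "lastRenderedPageBreak" in xml:
            breaks += 1
            continue  # avoid counting the same run twice
        if "w:br" in xml and 'type="page"' in xml:
            breaks += 1
    return breaks


sample_runs = [
    "<w:r><w:t>Article 1</w:t></w:r>",
    "<w:r><w:lastRenderedPageBreak/><w:t>Article 2</w:t></w:r>",
    '<w:r><w:br w:type="page"/></w:r>',
]
print(count_page_breaks(sample_runs))  # prints 2
```

Substring matching is cheap but approximate; a stricter variant would parse the XML and inspect element names and attributes.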
PDF OCR and Layout Analysis:
The Pdf class first extracts images of pages for OCR, then performs layout recognition to identify text blocks, followed by merging operations to combine vertically aligned text segments, improving the structural integrity of extracted text.
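To make the merging idea concrete, here is a deliberately simplified sketch. It is not the library's _naive_vertical_merge: the function merge_vertical, the box dictionaries with top, bottom, and text keys, and the max_gap threshold are all assumptions chosen for illustration.

```python
def merge_vertical(boxes, max_gap=5.0):
    """Merge text boxes whose vertical gap to the previous box is at most max_gap."""
    merged = []
    for box in sorted(boxes, key=lambda b: b["top"]):
        if merged and box["top"] - merged[-1]["bottom"] <= max_gap:
            prev = merged[-1]
            prev["text"] += " " + box["text"]
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append(dict(box))  # copy so the caller's boxes stay unchanged
    return merged


boxes = [
    {"top": 0.0, "bottom": 10.0, "text": "Article 1"},
    {"top": 12.0, "bottom": 22.0, "text": "continues here"},
    {"top": 60.0, "bottom": 70.0, "text": "Article 2"},
]
merged_boxes = merge_vertical(boxes)
# merged_boxes holds two boxes; the first reads "Article 1 continues here"
```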
Tokenization and Chunking:
The chunk function applies several NLP preprocessing steps including removal of contents tables, title normalization using colon detection, bullet categorization, and hierarchical merging of text chunks before tokenization. This multi-step process improves the semantic coherence of text chunks for downstream tasks.
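Bullet categorization, one of the steps named above, can be illustrated by picking the pattern family that matches the most lines. The pattern groups and the function pick_bullet_group below are hypothetical stand-ins, not the patterns or API used by the module's bullets_category.

```python
import re

# Hypothetical pattern families; the real module's patterns may differ.
PATTERN_GROUPS = [
    [r"Chapter\s+\d+", r"Article\s+\d+"],  # statute-style headings
    [r"\d+\.\d+", r"\d+\."],               # numeric outline headings
]


def pick_bullet_group(lines):
    """Return the index of the pattern group matching the most lines, or -1 if none match."""
    hits = [
        sum(any(re.match(p, ln) for p in group) for ln in lines)
        for group in PATTERN_GROUPS
    ]
    best = max(range(len(hits)), key=hits.__getitem__)
    return best if hits[best] > 0 else -1
```

Once a family is chosen, each line's position within that family can serve as its hierarchy level for the later merge step.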

System Interaction

Imports from Other Modules:
- deepdoc.parser: Provides base parser classes (PdfParser, DocxParser, PlainParser, HtmlParser).
- rag.nlp: Supplies NLP utilities for bullet detection, content removal, merging, tokenization.
- api.db: Supplies enumeration ParserType to specify parsing model types.
- External libraries: tika.parser for extracting text from .doc files and docx for reading DOCX files.
Callbacks:
The parsers and chunk function accept optional callbacks to report progress and status messages, enabling integration with UI or logging systems.
Return Values:
The main entry function chunk returns processed, tokenized chunks, ready for use in NLP pipelines, retrieval, or indexing components within the InfiniFlow platform.

Diagram: Class Structure for `Docx` and `Pdf`

classDiagram
    class Docx {
        -doc: Document
        -__clean(line: str): str
        +old_call(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
        +__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
        +__str__(): str
    }
    Docx --|> DocxParser

    class Pdf {
        -model_speciess: ParserType
        +__init__()
        +__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000, zoomin: int=3, callback: callable=None): tuple~list~tuple~str, Any~~, None~
    }
    Pdf --|> PdfParser

Summary

The laws.py file is a specialized document parsing module optimized for legal documents, supporting multiple input formats with advanced hierarchical parsing and OCR capabilities. It serves as a foundational component in the InfiniFlow system's document ingestion pipeline, preparing legal texts for semantic processing by transforming complex document structures into clean, tokenized text chunks.

