laws.py
Overview
laws.py is a core utility module within the InfiniFlow system designed to parse and chunk legal or law-related documents in various formats including DOCX, PDF, TXT, HTML, and legacy DOC. It provides specialized parsers that extract structured text from document files, identify hierarchical question/answer levels in DOCX files, perform OCR and layout analysis for PDFs, and tokenize the extracted text into manageable chunks for downstream processing such as NLP or search indexing.
The file integrates multiple parsing strategies and NLP preprocessing techniques to handle complex legal document structures effectively. It supports callback mechanisms to report progress during long-running operations like OCR, making it suitable for interactive or batch processing environments.
Classes and Functions
Class: Docx
Inheritance: DocxParser (from deepdoc.parser)
A parser specialized for DOCX files that extracts text paragraphs and identifies hierarchical question levels within paragraphs, enabling the reconstruction of nested question-answer sections commonly found in legal documents.
Methods:
__init__(self)
Initializes the Docx parser instance. No parameters required.
__clean(self, line: str) -> str
Cleans a single line of text by replacing full-width spaces (\u3000) with normal spaces and stripping leading/trailing whitespace.
Parameters:
line: A string representing a single line of text.
Returns: A cleaned string.
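The cleaning step described above is small enough to sketch in full. This is a minimal standalone version, assuming plain string replacement is enough. The function name clean_line is illustrative and not the private method itself.

```python
def clean_line(line: str) -> str:
    """Replace full-width (ideographic) spaces with ASCII spaces, then trim."""
    # U+3000 is the full-width space common in Chinese-language documents.
    return line.replace("\u3000", " ").strip()


cleaned = clean_line("\u3000第一条\u3000总则 ")
print(repr(cleaned))
```

Interior full-width spaces become single ASCII spaces, while leading and trailing whitespace is removed entirely.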
old_call(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
Legacy method to parse DOCX files, returning a simple list of cleaned paragraph texts that fall within the specified page range.
Parameters:
filename: Path to the DOCX file.
binary: Optional binary content of the DOCX file.
from_page: Start page number (inclusive).
to_page: End page number (exclusive).
Returns: List of paragraph strings.
Note: Uses page breaks detected via XML runs to count pages.
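The page-counting idea can be illustrated without opening a real DOCX file. The sketch below assumes each paragraph is available as its text plus the raw XML of its runs. The helper names and the sample XML strings are hypothetical, and a paragraph is attributed to the current page before its own runs are inspected, which is a simplification.

```python
def run_starts_new_page(run_xml: str) -> bool:
    """True if a run's XML carries a rendered or explicit page break."""
    if "lastRenderedPageBreak" in run_xml:
        return True
    return "w:br" in run_xml and 'type="page"' in run_xml


def paragraphs_in_range(paragraphs, from_page=0, to_page=100000):
    """Keep non-empty paragraph texts on pages in [from_page, to_page).

    paragraphs: list of (text, [run_xml, ...]) pairs (an assumed input shape).
    """
    page = 0
    kept = []
    for text, runs in paragraphs:
        if page >= to_page:
            break
        if page >= from_page and text.strip():
            kept.append(text.strip())
        # Advance the page counter once per run that signals a break.
        for xml in runs:
            if run_starts_new_page(xml):
                page += 1
    return kept


sample = [
    ("Article 1", ["<w:r><w:t>Article 1</w:t></w:r>"]),
    ("Article 2", ['<w:r><w:br w:type="page"/></w:r>']),
    ("Article 3", ["<w:r><w:lastRenderedPageBreak/><w:t>Article 3</w:t></w:r>"]),
]
print(paragraphs_in_range(sample, 0, 1))
```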
__call__(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000) -> list[str]
Main parsing method. Extracts paragraphs with their question levels, groups related paragraphs into sections based on hierarchical question levels, and returns the merged sections as a list of strings.
Parameters:
filename: Path to the DOCX file.
binary: Optional binary content of the DOCX file.
from_page: Starting page index.
to_page: Ending page index.
Returns: List of section strings.
Implementation details:
Uses docx_question_level to determine the hierarchy level of each paragraph.
Merges paragraphs into sections by grouping paragraphs of increasing question levels.
Skips empty paragraphs.
Detects page breaks from the underlying XML structure.
Uses bullets_category to classify bullet styles to assist in structure detection.
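One plausible reading of the level-based grouping is sketched below. It is a simplified illustration, not the module's actual merging rules. Levels are supplied by hand instead of coming from docx_question_level, smaller numbers are assumed to mean higher headings, and the function name group_by_level is hypothetical.

```python
def group_by_level(lines):
    """Group (level, text) pairs into sections.

    A paragraph opens a section, which then absorbs every following paragraph
    with a strictly deeper level. A paragraph at the same or a shallower level
    closes the section and opens the next one. Empty paragraphs are skipped.
    """
    sections = []
    current = []
    current_level = None
    for level, text in lines:
        if not text.strip():
            continue  # skip empty paragraphs
        if current and level > current_level:
            current.append(text)
        else:
            if current:
                sections.append("\n".join(current))
            current = [text]
            current_level = level
    if current:
        sections.append("\n".join(current))
    return sections


doc_lines = [
    (1, "Chapter I General Provisions"),
    (2, "Article 1 Purpose"),
    (3, "(a) written contracts"),
    (3, "   "),
    (2, "Article 2 Scope"),
    (1, "Chapter II Formation"),
    (2, "Article 3 Offer"),
]
for section in group_by_level(doc_lines):
    print(section)
    print("---")
```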
__str__(self) -> str
Returns a string representation of the object showing question, answer, level, and child elements. These attributes are not defined in the visible code, which suggests the method is a placeholder or relies on inherited attributes.
Class: Pdf
Inheritance: PdfParser (from deepdoc.parser)
A parser specialized for PDF documents, implementing OCR-based image extraction and layout analysis to extract textual content with positional metadata.
Methods:
__init__(self)
Initializes the Pdf parser instance and sets the model species to ParserType.LAWS, indicating this parser is tuned for legal document structures.
__call__(self, filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000, zoomin: int = 3, callback: callable = None) -> tuple[list[tuple[str, Any]], None]
Parses PDF files by performing OCR on the specified pages, analyzing the layout, merging text blocks, and extracting text lines along with tagging information.
Parameters:
filename: Path to the PDF file.
binary: Optional binary content of the PDF file.
from_page, to_page: Page range to process.
zoomin: Zoom factor for image processing before OCR (default 3).
callback: Optional function to report progress, called with either a float or a message string.
Returns: Tuple containing a list of (text, line_tag) tuples and None.
Implementation details:
Uses the internal methods __images__ for OCR image extraction, _layouts_rec for layout recognition, and _naive_vertical_merge for merging vertically aligned text blocks.
Progress is reported via the callback at multiple stages (OCR start/end, layout analysis, text extraction).
Timing is measured for performance logging.
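The staged progress reporting and timing described above can be sketched with stub stages. Everything below is illustrative. The fake_* functions stand in for OCR, layout recognition, and vertical merging. The progress fractions, messages, and the "tag-placeholder" string are assumptions, not values taken from the module.

```python
from timeit import default_timer as timer


def run_stages(stages, callback=None):
    """Run (progress, message, fn) stages in order, piping each result onward.

    After each stage the callback, if given, receives the progress fraction
    and a message that includes the elapsed time for that stage.
    """
    data = None
    for progress, message, fn in stages:
        start = timer()
        data = fn(data)
        elapsed = timer() - start
        if callback is not None:
            callback(progress, f"{message} ({elapsed:.2f}s)")
    return data, None  # same (lines, None) shape as the documented return value


# Hypothetical stand-ins for the real OCR, layout, and merge steps.
def fake_ocr(_):
    return [{"text": "Article 1", "top": 10},
            {"text": "applies to all contracts.", "top": 22}]


def fake_layout(boxes):
    return [dict(box, layout="text") for box in boxes]


def fake_merge(boxes):
    merged = " ".join(box["text"] for box in boxes)
    return [(merged, "tag-placeholder")]


events = []
lines, tables = run_stages(
    [(0.1, "OCR finished", fake_ocr),
     (0.6, "Layout analysis finished", fake_layout),
     (0.9, "Text extraction finished", fake_merge)],
    callback=lambda prog, msg: events.append((prog, msg)),
)
print(lines)
for prog, msg in events:
    print(prog, msg)
```

Keeping the callback optional lets the same pipeline run silently in batch jobs and report progress in interactive ones.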
Function: chunk
A high-level utility function that detects the file type of the input document and delegates parsing to the appropriate parser (Docx, Pdf, PlainParser, HtmlParser, or uses Tika for .doc files). It then applies NLP preprocessing to tokenize and chunk the extracted text for further processing.
Signature
chunk(filename: str, binary: bytes = None, from_page: int = 0, to_page: int = 100000,
lang: str = "Chinese", callback: callable = None, **kwargs) -> list
Parameters
filename: File path or file name string to parse.
binary: Optional binary content of the file.
from_page: Page number from which to start parsing (default 0).
to_page: Page number at which to stop parsing (default 100000).
lang: Language of the document, affects tokenization (default "Chinese").
callback: Optional callback for progress reporting.
**kwargs: Additional parameters, notably:
parser_config: Dict for parser options such as chunk token number, delimiter, and layout recognition method.
Returns
A list of tokenized text chunks suitable for NLP pipelines.
Usage Example
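# Assumes chunk has already been imported from this module.
def progress_callback(progress=None, msg=""):
    # Report a percentage when a progress value is given, else just the message.
    if progress is not None:
        print(f"Progress: {progress*100:.1f}% - {msg}")
    else:
        print(msg)

chunks = chunk("legal_document.docx", lang="English", callback=progress_callback)
for chunk_text in chunks:
    print(chunk_text)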
Implementation Details
Detects file type by file extension using regex.
For .docx files: uses the Docx parser.
For .pdf files: uses the Pdf parser, or PlainParser if specified.
For .txt files: reads text directly.
For .html/.htm files: uses HtmlParser.
For legacy .doc files: uses Apache Tika for extraction.
After parsing, it removes "Contents" sections, adjusts titles, categorizes bullets, and merges hierarchically.
Finally, tokenizes the merged chunks using rag_tokenizer.
Raises NotImplementedError if the file type is unsupported.
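The extension-based dispatch listed above might look roughly like the following. This sketch only returns the name of the parser a file would be routed to. The function pick_parser and its plain_pdf flag are hypothetical simplifications of whatever parser_config option selects plain-text PDF parsing.

```python
import re


def pick_parser(filename: str, plain_pdf: bool = False) -> str:
    """Return the name of the parser a file would be routed to."""
    def has_ext(pattern: str) -> bool:
        # Case-insensitive match anchored at the end of the file name.
        return re.search(pattern, filename, re.IGNORECASE) is not None

    if has_ext(r"\.docx$"):
        return "Docx"
    if has_ext(r"\.pdf$"):
        return "PlainParser" if plain_pdf else "Pdf"
    if has_ext(r"\.txt$"):
        return "plain text"
    if has_ext(r"\.(htm|html)$"):
        return "HtmlParser"
    if has_ext(r"\.doc$"):
        return "Tika"
    raise NotImplementedError(f"file type not supported: {filename}")


for name in ["statute.DOCX", "ruling.pdf", "notes.txt", "page.htm", "old.doc"]:
    print(name, "->", pick_parser(name))
```

Because each pattern is anchored with $, a .docx file never falls through to the .doc branch.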
Important Implementation Details and Algorithms
Hierarchical Question Level Parsing (Docx):
The Docx class uses docx_question_level to assign numeric levels indicating the hierarchy of questions or sections. It then groups paragraphs into sections by finding contiguous paragraphs with incrementing levels, enabling the extraction of nested legal provisions or FAQs.
Page Break Detection (Docx):
Page breaks are inferred by inspecting the XML of runs within paragraphs, looking for specific tags like lastRenderedPageBreak or w:br with type="page", to increment the page number counter.
PDF OCR and Layout Analysis:
The Pdf class first extracts images of pages for OCR, then performs layout recognition to identify text blocks, followed by merging operations to combine vertically aligned text segments, improving the structural integrity of extracted text.
Tokenization and Chunking:
The chunk function applies several NLP preprocessing steps including removal of contents tables, title normalization using colon detection, bullet categorization, and hierarchical merging of text chunks before tokenization. This multi-step process improves the semantic coherence of text chunks for downstream tasks.
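Bullet categorization, one of the preprocessing steps named above, can be pictured as choosing the numbering scheme that best fits the document and then reading a level off each line. The sketch below is an assumption-laden illustration. The pattern families are invented for this example and are not the patterns used by rag.nlp, and pick_family and level_of are hypothetical names.

```python
import re

# Hypothetical bullet families, each ordered from outermost to innermost level.
BULLET_FAMILIES = [
    [r"Chapter\s+[IVXLC]+\b", r"Article\s+\d+\b", r"\(\w\)"],
    [r"\d+\.\s", r"\d+\.\d+\s", r"\d+\.\d+\.\d+\s"],
]


def pick_family(lines):
    """Return the index of the family matching the most lines, or -1 if none."""
    best, best_hits = -1, 0
    for idx, family in enumerate(BULLET_FAMILIES):
        hits = sum(
            1 for ln in lines
            if any(re.match(p, ln.strip()) for p in family)
        )
        if hits > best_hits:
            best, best_hits = idx, hits
    return best


def level_of(line, family_idx):
    """Level is the position of the first matching pattern in the family.

    Lines matching no pattern are treated as body text and get the deepest
    level, len(family).
    """
    family = BULLET_FAMILIES[family_idx]
    for level, pattern in enumerate(family):
        if re.match(pattern, line.strip()):
            return level
    return len(family)


text_lines = [
    "Chapter I General Provisions",
    "Article 1 This law applies to contracts.",
    "(a) written contracts",
    "Plain body text.",
]
family = pick_family(text_lines)
print(family, [level_of(ln, family) for ln in text_lines])
```

Levels produced this way could then feed a hierarchical merge that attaches body text to its nearest enclosing heading.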
System Interaction
Imports from Other Modules:
deepdoc.parser: Provides base parser classes (PdfParser, DocxParser, PlainParser, HtmlParser).
rag.nlp: Supplies NLP utilities for bullet detection, content removal, merging, and tokenization.
api.db: Supplies the enumeration ParserType to specify parsing model types.
External libraries: tika.parser for .doc files and docx for DOCX reading.
Callbacks:
The parsers and the chunk function accept optional callbacks to report progress and status messages, enabling integration with UI or logging systems.
Return Values:
The main entry function chunk returns processed, tokenized chunks, ready for use in NLP pipelines, retrieval, or indexing components within the InfiniFlow platform.
Diagram: Class Structure for Docx and Pdf
classDiagram
class Docx {
-doc: Document
-__clean(line: str): str
+old_call(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
+__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000): list~str~
+__str__(): str
}
Docx --|> DocxParser
class Pdf {
-model_speciess: ParserType
+__init__()
+__call__(filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000, zoomin: int=3, callback: callable=None): tuple~list~tuple~str, Any~~, None~
}
Pdf --|> PdfParser
Summary
The laws.py file is a specialized document parsing module optimized for legal documents, supporting multiple input formats with advanced hierarchical parsing and OCR capabilities. It serves as a foundational component in the InfiniFlow system's document ingestion pipeline, preparing legal texts for semantic processing by transforming complex document structures into clean, tokenized text chunks.