one.py
Overview
The one.py file is a document parsing utility designed to handle multiple file formats including PDF, DOCX, DOC, Excel, TXT, and HTML. It provides functionality to extract and organize text and table data while maintaining the original structure and order of the content. The primary interface exposed is the chunk function, which acts as a unified entry point for parsing different document types into structured "chunks" suitable for downstream processing, such as natural language processing (NLP) or information retrieval.
A specialized Pdf class extends a base PDF parser (PdfParser) to perform detailed OCR, layout analysis, table extraction, and text merging specifically tailored for PDF documents. The file also integrates with external libraries and internal modules for tokenization, text extraction, and document parsing.
Classes and Functions
Class: Pdf
Inheritance
Inherits from PdfParser (imported from deepdoc.parser).
Purpose
Pdf enhances the base PDF parsing capabilities by adding a detailed OCR step, layout and table analysis, and text merging to produce structured text and table data from PDF files.
Method: __call__
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Parameters
filename (str): Path to the PDF file. Used as the source only when binary is not provided.
binary (bytes or None): Optional raw binary content of the PDF. If provided, parsing uses this instead of loading from filename.
from_page (int): Starting page number to parse (default 0).
to_page (int): Ending page number to parse (default 100000).
zoomin (int): Zoom factor for OCR and layout processing to improve accuracy (default 3).
callback (callable or None): Optional function to receive progress updates. It should accept either a float progress value and a message, or just a message via the keyword argument msg.
Returns
A tuple of two elements:
A list of tuples (text, "") where text is extracted text or table content, ordered by position in the document.
None (reserved for future metadata or error information).
Description
This method performs the following workflow:
Starts an OCR process on the PDF pages, reporting progress via callback.
Performs layout analysis to identify text blocks and their positions.
Runs a table analysis job to detect and extract tables.
Merges fragmented text blocks into coherent text segments.
Extracts tables and figures, appending them to the text sections.
Sorts all extracted sections by their position within the document.
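The final sorting step can be illustrated with a small sketch. The position layout (page, left, right, top, bottom) and the sample sections are assumptions made for illustration; the position format actually used by PdfParser may differ.

```python
# Hypothetical sections: (text, position), where position is
# (page, left, right, top, bottom). This layout is an assumption.
sections = [
    ("Second paragraph", (1, 50.0, 500.0, 300.0, 340.0)),
    ("Table on page 2", (2, 50.0, 500.0, 100.0, 200.0)),
    ("First paragraph", (1, 50.0, 500.0, 100.0, 140.0)),
]


def sort_by_position(items):
    """Order sections by page, then top edge, then left edge."""
    return sorted(items, key=lambda s: (s[1][0], s[1][3], s[1][1]))


# Drop the positions and keep (text, "") pairs, mirroring the documented return shape.
ordered = [(text, "") for text, _ in sort_by_position(sections)]
for text, _ in ordered:
    print(text)
```

Sorting on a tuple key keeps the comparison lexicographic, so page order always wins over vertical position, and vertical position wins over horizontal position.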
Usage Example
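def report(progress=None, msg=""):
    # Accepts both callback(progress, msg) and callback(msg=...).
    print(progress, msg)

# Assumes Pdf is already imported from one.py.
pdf_parser = Pdf()
sections, _ = pdf_parser("sample.pdf", from_page=0, to_page=5, callback=report)
for text, _ in sections:
    print(text)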
Function: chunk
chunk(filename, binary=None, from_page=0, to_page=100000,
lang="Chinese", callback=None, **kwargs)
Purpose
The chunk function is a unified document parser that supports multiple file formats: DOCX, PDF, Excel, TXT, HTML, and DOC. It extracts text content in a structured chunked format while maintaining the original order and basic layout features. It uses specialized parsers depending on the file type and provides progress updates via a callback.
Parameters
filename (str): Path to the document file.
binary (bytes or None): Optional binary content of the file.
from_page (int): Starting page number for parsing (applicable mainly to PDFs).
to_page (int): Ending page number for parsing (applicable mainly to PDFs).
lang (str): Language of the document, defaults to "Chinese". Used to determine tokenization behavior.
callback (callable or None): Function to receive progress updates. Signature is typically (float progress, str message).
**kwargs: Additional keyword arguments. Recognized key:
"parser_config" (dict): Configuration options for parsing, e.g., chunk sizes, delimiters, and layout recognition method.
Returns
A list containing a single dictionary that represents the parsed document, with these keys:
"docnm_kwd": Original filename.
"title_tks": Tokenized document title (filename without extension).
"title_sm_tks": Fine-grained tokenization of the title.
Additional tokenization results from the document text.
Detailed Behavior
Detects file type using regex on the filename extension.
For .docx: Uses the rag.app.naive.Docx parser to extract sections and tables.
For .pdf: Uses the Pdf class or a plain-text PDF parser, depending on parser_config.
For .xlsx? (a pattern that matches both .xls and .xlsx): Uses ExcelParser to extract HTML sections.
For text files (.txt, .md, .markdown): Reads plain text and splits it into lines.
For HTML files (.htm, .html): Uses HtmlParser.
For .doc: Uses the Apache Tika parser to extract text content.
Raises NotImplementedError on unsupported file types.
After extraction, it tokenizes the filename and document text using rag_tokenizer and tokenize.
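The extension-based routing described above might look roughly like the following sketch. The detect_kind helper, the ROUTES table, and the returned labels are hypothetical names used only for illustration, and case-insensitive matching is an assumption; the real function calls the parsers directly rather than returning a label.

```python
import re

# Extension patterns checked in order; the labels are illustrative only.
ROUTES = [
    (r"\.docx$", "docx"),
    (r"\.pdf$", "pdf"),
    (r"\.xlsx?$", "excel"),             # matches both .xls and .xlsx
    (r"\.(txt|md|markdown)$", "text"),
    (r"\.(htm|html)$", "html"),
    (r"\.doc$", "doc"),
]


def detect_kind(filename):
    """Return a parser label for filename, or raise NotImplementedError."""
    for pattern, kind in ROUTES:
        # Case-insensitive matching is an assumption of this sketch.
        if re.search(pattern, filename, re.IGNORECASE):
            return kind
    raise NotImplementedError(f"Unsupported file type: {filename}")


print(detect_kind("report.xlsx"))
print(detect_kind("notes.md"))
```

Anchoring each pattern with $ keeps .doc from matching .docx files, so the order of the two entries does not matter here.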
Usage Example
def progress_callback(progress=None, msg=""):
    if progress is not None:
        print(f"{progress*100:.1f}% - {msg}")
    else:
        print(msg)

docs = chunk("example.pdf", from_page=0, to_page=10, callback=progress_callback)
print(docs[0]["docnm_kwd"])  # Prints filename
print(docs[0]["title_tks"])  # Prints tokenized title
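To make the shape of the returned dictionary more concrete, here is a minimal sketch of how the title-related keys could be filled in. The coarse_tokenize and fine_tokenize functions are simple placeholders, not the rag_tokenizer API, and representing tokens as a space-joined string is an assumption.

```python
import os


def coarse_tokenize(text):
    # Placeholder for the real tokenizer: lowercases and splits on whitespace.
    return " ".join(text.lower().split())


def fine_tokenize(tokens):
    # Placeholder for fine-grained tokenization: also splits on "_" and "-".
    return " ".join(tokens.replace("_", " ").replace("-", " ").split())


def build_doc(filename):
    """Build the title-related part of the document dictionary."""
    # The title is the filename without its extension.
    title, _ext = os.path.splitext(os.path.basename(filename))
    doc = {"docnm_kwd": filename, "title_tks": coarse_tokenize(title)}
    doc["title_sm_tks"] = fine_tokenize(doc["title_tks"])
    return doc


result = [build_doc("Annual_Report-2023.pdf")]
print(result[0])
```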
Important Implementation Details and Algorithms
PDF Parsing Workflow: The Pdf class method __call__ orchestrates several steps:
OCR with zoom factor to improve text recognition quality.
Layout analysis to detect text blocks and their spatial positions.
Table recognition and extraction.
Text merging to unify fragmented text areas.
Sorting extracted content based on positional metrics to preserve document reading order.
Chunking Strategy: The chunk function ensures that extracted text maintains its original document order and grouping, suitable for downstream NLP tasks that require coherent context windows.
Tokenization: Uses rag_tokenizer and tokenize from the rag.nlp module to generate both coarse and fine-grained token lists, optimizing for languages like Chinese and English.
File Type Handling: The function matches regular expressions against the file extension to route files to the appropriate parser, ensuring extensibility for future formats.
Progress Reporting: Both the Pdf parser and the chunk function accept a callback parameter to report progress updates, which is useful for long-running parsing jobs in UI or asynchronous environments.
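For UI or asynchronous use, one option is a callback that only records updates so that another thread can display them later. The ProgressRecorder class below is hypothetical and not part of one.py; it accepts both calling conventions described for callback, and the sample calls are made up for illustration.

```python
import queue


class ProgressRecorder:
    """Hypothetical callback object that queues progress updates."""

    def __init__(self):
        self.updates = queue.Queue()
        self.last_progress = 0.0

    def __call__(self, progress=None, msg=""):
        # Message-only calls keep the last known progress value.
        if progress is not None:
            self.last_progress = progress
        self.updates.put((self.last_progress, msg))


recorder = ProgressRecorder()
recorder(msg="started")      # message-only call
recorder(0.5, "halfway")     # progress plus message

# A consumer (for example a UI thread) drains the queue at its own pace.
seen = []
while not recorder.updates.empty():
    progress, msg = recorder.updates.get()
    seen.append((progress, msg))
    print(f"{progress * 100:.0f}% {msg}")
```

queue.Queue is thread-safe, so the parser can push updates from a worker thread while the consumer reads them elsewhere.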
Interaction with Other Modules
deepdoc.parser: Provides base parsers such as PdfParser, ExcelParser, PlainParser, and HtmlParser used for specific file-type parsing.
deepdoc.parser.utils.get_text: Utility to extract raw text from files.
rag.app.naive: Contains the Docx parser used for DOCX documents.
rag.nlp: Supplies tokenizers and tokenization functions (rag_tokenizer, tokenize) used to process text chunks.
tika.parser: Apache Tika integration used for DOC file parsing.
io.BytesIO: Used for in-memory binary stream handling, especially for DOC files.
This file primarily acts as a bridge that integrates these components into a cohesive document chunking pipeline.
Visual Diagram
classDiagram
class PdfParser {
<<abstract>>
+__call__(filename, binary, from_page, to_page, zoomin, callback)
+_layouts_rec()
+_table_transformer_job()
+_text_merge()
+_extract_table_figure()
+_concat_downward()
}
class Pdf {
+__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
-__images__()
}
class Docx {
+__call__(filename, binary)
}
class ExcelParser {
+html(binary, max_size)
}
class PlainParser {
+__call__(filename, binary, from_page, to_page, zoomin, callback)
}
class HtmlParser {
+__call__(filename, binary)
}
Pdf --|> PdfParser
Pdf ..> "callback: function" : uses
Pdf ..> "OCR, Layout, Table, Text merging" : workflow
chunk ..> Pdf : uses for PDF files
chunk ..> Docx : uses for DOCX files
chunk ..> ExcelParser : uses for Excel files
chunk ..> PlainParser : optional PDF plain text parser
chunk ..> HtmlParser : uses for HTML files
chunk ..> tika.parser : uses for DOC files
chunk ..> rag_tokenizer : tokenization
chunk ..> tokenize : tokenization
Summary
one.py provides a flexible, multi-format document parsing utility.
The Pdf class extends PDF parsing with OCR and layout-aware processing.
The chunk function is the main entry point, supporting several common document formats.
It integrates multiple external and internal parsers and NLP tokenizers.
Progress callbacks support interactive or asynchronous processing.
Suitable for preprocessing documents for NLP, search indexing, or document understanding systems.