document.ts

Overview

document.ts defines the TypeScript interfaces and types that model documents in the system, covering their metadata, processing status, parser configuration, and filter criteria. The file contains no runtime logic of its own; it provides a typed contract for how document-related data is represented and passed around the application.

Key purposes include:

- Defining the shape of a document's metadata and processing information (IDocumentInfo).
- Specifying how parser configurations for documents are structured (IParserConfig and related types).
- Defining filters to categorize or query documents based on their run status or file suffixes (IDocumentInfoFilter).

These interfaces enable type safety, autocompletion, and easier maintenance when working with document-related data in the system.

Interfaces and Types

`IDocumentInfo`

Represents detailed metadata and operational data for a document in the system. This includes identification, processing state, parser settings, and various descriptive fields.

Property	Type	Description
`chunk_num`	`number`	Number of chunks the document is divided into.
`create_date`	`string`	Date when the document was created (likely in `YYYY-MM-DD` format).
`create_time`	`number`	Timestamp or time value representing creation time (exact format depends on implementation).
`created_by`	`string`	Identifier or name of the user/system that created the document.
`id`	`string`	Unique identifier for the document.
`kb_id`	`string`	Identifier for the knowledge base to which this document belongs.
`location`	`string`	File path or storage location of the document.
`name`	`string`	Name or title of the document.
`parser_config`	`IParserConfig`	Configuration object detailing how the document should be parsed.
`parser_id`	`string`	Identifier of the parser used on the document.
`process_begin_at`	`string` (optional)	Timestamp when document processing began.
`process_duration`	`number`	Duration of processing (likely in milliseconds or seconds).
`progress`	`number`	Numeric representation of processing progress (e.g., percentage).
`progress_msg`	`string`	Human-readable message about current processing progress or status.
`run`	`RunningStatus`	Enum value indicating the current running status of document processing (imported).
`size`	`number`	Size of the document (likely in bytes).
`source_type`	`string`	Type/category of the source document (e.g., "pdf", "docx").
`status`	`string`	Current status of the document (e.g., "completed", "failed").
`thumbnail`	`string`	URL or base64 string for a thumbnail preview image of the document.
`token_num`	`number`	Number of tokens extracted or associated with the document (potentially for NLP use).
`type`	`string`	Type of document (might overlap with `source_type` or be a higher-level category).
`update_date`	`string`	Date when the document metadata was last updated (likely `YYYY-MM-DD` format).
`update_time`	`number`	Timestamp or time value for last update time.
`meta_fields`	`Record<string, any>` (optional)	Additional metadata fields as key-value pairs, flexible for extensibility.

Usage Example

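// RunningStatus comes from the constants module referenced by document.ts.
// IDocumentInfo must also be imported from document.ts (path depends on the project layout).
import { RunningStatus } from "@/constants/knowledge";

const docInfo: IDocumentInfo = {
  chunk_num: 10,
  create_date: "2024-06-01",
  // Unix seconds for 2024-06-01 00:00:00 UTC; the real unit may differ (e.g., milliseconds).
  create_time: 1717200000,
  created_by: "user123",
  id: "doc_001",
  kb_id: "kb_abc",
  location: "/files/doc_001.pdf",
  name: "Project Plan",
  parser_config: {
    pages: [1, 2, 3],
    delimiter: "\n",
    html4excel: false,
    layout_recognize: true,
    raptor: { use_raptor: true },
    graphrag: { use_graphrag: false }
  },
  parser_id: "parser_v1",
  process_duration: 1200,
  progress: 100,
  progress_msg: "Completed",
  // The member name "Completed" is illustrative; check the enum for the actual names.
  run: RunningStatus.Completed,
  size: 204800,
  source_type: "pdf",
  status: "completed",
  thumbnail: "https://example.com/thumbs/doc_001.png",
  token_num: 4500,
  type: "report",
  update_date: "2024-06-10",
  // Unix seconds for 2024-06-10 00:00:00 UTC, matching update_date.
  update_time: 1717977600,
  meta_fields: { department: "finance" }
};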

`IParserConfig`

Defines configuration options for how a document should be parsed. This interface is used within IDocumentInfo to customize processing behavior.

Property	Type	Description
`delimiter`	`string` (optional)	String used to split or parse the document's content (e.g., newline, comma).
`html4excel`	`boolean` (optional)	Flag that, judging by its name, controls HTML handling for Excel content during parsing; the exact semantics depend on the parser implementation.
`layout_recognize`	`boolean` (optional)	Enables layout recognition to better understand document structure during parsing.
`pages`	`any[]`	Array representing page numbers or page objects to parse from the document.
`raptor`	`Raptor` (optional)	Configuration specific to the "Raptor" parsing module (see below).
`graphrag`	`GraphRag` (optional)	Configuration specific to the "GraphRag" parsing module (see below).

`Raptor`

Configuration options for the Raptor parsing engine/module.

Property	Type	Description
`use_raptor`	`boolean`	Indicates whether to use the Raptor parser.

`GraphRag`

Configuration options for the GraphRag parsing engine/module, which appears to add graph-based semantic parsing features.

Property	Type	Description
`community`	`boolean` (optional)	Enables community detection features during parsing.
`entity_types`	`string[]` (optional)	Specifies which entity types to recognize in the document.
`method`	`string` (optional)	Parsing method or algorithm name to use within GraphRag.
`resolution`	`boolean` (optional)	Enables resolution features (e.g., entity resolution).
`use_graphrag`	`boolean` (optional)	Flag to enable use of the GraphRag parser.
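
The following is a minimal sketch of how the three parser-related shapes fit together, assuming the declarations match the tables above. The interfaces are redeclared locally so the snippet is self-contained; the `enabledModules` helper and the sample values (`"light"`, the entity type names, the `pages` range) are hypothetical and not taken from document.ts.

```typescript
// Local copies of the shapes described in the tables above; the real
// declarations live in document.ts.
interface Raptor {
  use_raptor: boolean;
}

interface GraphRag {
  community?: boolean;
  entity_types?: string[];
  method?: string;
  resolution?: boolean;
  use_graphrag?: boolean;
}

interface IParserConfig {
  delimiter?: string;
  html4excel?: boolean;
  layout_recognize?: boolean;
  pages: any[];
  raptor?: Raptor;
  graphrag?: GraphRag;
}

// Hypothetical helper: lists which optional modules a config switches on.
// Optional chaining covers the case where a sub-config is absent.
function enabledModules(config: IParserConfig): string[] {
  const modules: string[] = [];
  if (config.raptor?.use_raptor) modules.push("raptor");
  if (config.graphrag?.use_graphrag) modules.push("graphrag");
  return modules;
}

// Illustrative values only.
const sampleConfig: IParserConfig = {
  pages: [[1, 10]],
  delimiter: "\n",
  graphrag: {
    use_graphrag: true,
    entity_types: ["person", "organization"],
    method: "light",
  },
};

console.log(enabledModules(sampleConfig)); // ["graphrag"]
```

Because `raptor` and `graphrag` are both optional, a config that omits them is still valid, and the helper simply returns an empty list.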

`IDocumentInfoFilter`

Defines filters used to query or categorize documents based on their processing run status and file suffix.

Property	Type	Description
`run_status`	`Record<number, number>`	A map from run status codes to their counts or weights.
`suffix`	`Record<string, number>`	A map from file suffixes/extensions (e.g., "pdf", "docx") to counts or weights.
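
As an illustration, the sketch below builds an `IDocumentInfoFilter` from a list of documents, assuming the numeric values are per-category counts (the source leaves open whether they are counts or weights). The `DocSummary` type, the `buildFilter` helper, the numeric run codes, and the idea of deriving the suffix from the file name are all assumptions made for this example.

```typescript
interface IDocumentInfoFilter {
  run_status: Record<number, number>;
  suffix: Record<string, number>;
}

// Minimal stand-in for the IDocumentInfo fields used here. Representing
// `run` as a numeric code is an assumption made for this sketch.
interface DocSummary {
  name: string;
  run: number;
}

// Hypothetical helper: tallies documents per run status code and per suffix.
function buildFilter(docs: DocSummary[]): IDocumentInfoFilter {
  const result: IDocumentInfoFilter = { run_status: {}, suffix: {} };
  for (const doc of docs) {
    result.run_status[doc.run] = (result.run_status[doc.run] ?? 0) + 1;

    // Take the text after the last dot as the suffix; skip names without one.
    const dot = doc.name.lastIndexOf(".");
    if (dot > 0 && dot < doc.name.length - 1) {
      const suffix = doc.name.slice(dot + 1).toLowerCase();
      result.suffix[suffix] = (result.suffix[suffix] ?? 0) + 1;
    }
  }
  return result;
}

const counts = buildFilter([
  { name: "plan.pdf", run: 3 },
  { name: "notes.PDF", run: 1 },
  { name: "budget.docx", run: 3 },
]);

console.log(counts);
// run_status: { 1: 1, 3: 2 }, suffix: { pdf: 2, docx: 1 }
```

A structure like this suits a filter panel that shows how many documents fall under each status or file type.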

Implementation Details and Algorithms

The file primarily defines data shapes rather than implementing algorithms.
The IParserConfig interface (the type of the parser_config property) carries optional sub-configurations for separate modules (Raptor and GraphRag), which suggests that parsing strategies are meant to be extensible.
Usage of optional properties (e.g., process_begin_at, meta_fields) allows flexible data representation depending on document state or available metadata.
The run property uses an imported enum RunningStatus, which likely provides a standardized set of statuses like Running, Completed, Failed, etc., to track document processing lifecycle.
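
The optional properties mentioned above have to be checked before use when the compiler runs with strict null checks. The sketch below shows one way to do that; the `DocTiming` type and both helper functions are hypothetical and cover only the fields this example touches.

```typescript
// Only the fields this sketch needs; see IDocumentInfo for the full shape.
interface DocTiming {
  process_begin_at?: string;
  process_duration: number;
  meta_fields?: Record<string, any>;
}

// Hypothetical helper: produces a short description of processing state.
function describeProcessing(doc: DocTiming): string {
  // process_begin_at may be absent, e.g. before processing has started.
  if (doc.process_begin_at === undefined) {
    return "not started";
  }
  return `started at ${doc.process_begin_at}, ran for ${doc.process_duration}`;
}

// Hypothetical helper: reads one metadata value, tolerating a missing map.
function metaValue(doc: DocTiming, key: string): unknown {
  return doc.meta_fields?.[key];
}

const pending: DocTiming = { process_duration: 0 };
const finished: DocTiming = {
  process_begin_at: "2024-06-01 09:00:00",
  process_duration: 1200,
  meta_fields: { department: "finance" },
};

console.log(describeProcessing(pending)); // "not started"
console.log(metaValue(finished, "department")); // "finance"
```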

Interaction with Other Parts of the System

The RunningStatus enum imported from @/constants/knowledge shows this file depends on a centralized constants module, ensuring consistent usage of document run states.
Interfaces here are likely consumed by:
- Document management services that handle storing, updating, and querying document metadata.
- Processing pipelines that parse documents according to the IParserConfig and update progress/status fields in IDocumentInfo.
- UI components that display document lists, statuses, and metadata, using these interfaces for type safety.
- Filtering modules or search functionality that utilize IDocumentInfoFilter to enable users to filter documents by status or file type.
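
To make the processing-pipeline case concrete, here is a hedged sketch of a step that records progress on a document. The `ProgressFields` type and the `withProgress` helper are hypothetical; the actual pipeline may update these fields differently, for example on the server side.

```typescript
// The progress-related fields of IDocumentInfo that this sketch updates.
interface ProgressFields {
  progress: number;
  progress_msg: string;
}

// Hypothetical helper: returns a copy of the document with new progress
// values, leaving the original object untouched.
function withProgress<T extends ProgressFields>(
  doc: T,
  progress: number,
  message: string
): T {
  return { ...doc, progress, progress_msg: message };
}

const before = { id: "doc_001", progress: 0, progress_msg: "" };
const after = withProgress(before, 50, "Parsing pages");

console.log(after.progress, after.progress_msg); // 50 "Parsing pages"
console.log(before.progress); // 0
```

Returning a copy rather than mutating in place tends to fit UI state containers, which usually detect changes by object identity.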

Visual Diagram

classDiagram
    class IDocumentInfo {
        +chunk_num: number
        +create_date: string
        +create_time: number
        +created_by: string
        +id: string
        +kb_id: string
        +location: string
        +name: string
        +parser_config: IParserConfig
        +parser_id: string
        +process_begin_at?: string
        +process_duration: number
        +progress: number
        +progress_msg: string
        +run: RunningStatus
        +size: number
        +source_type: string
        +status: string
        +thumbnail: string
        +token_num: number
        +type: string
        +update_date: string
        +update_time: number
        +meta_fields?: Record<string, any>
    }

    class IParserConfig {
        +delimiter?: string
        +html4excel?: boolean
        +layout_recognize?: boolean
        +pages: any[]
        +raptor?: Raptor
        +graphrag?: GraphRag
    }

    class Raptor {
        +use_raptor: boolean
    }

    class GraphRag {
        +community?: boolean
        +entity_types?: string[]
        +method?: string
        +resolution?: boolean
        +use_graphrag?: boolean
    }

    class IDocumentInfoFilter {
        +run_status: Record<number, number>
        +suffix: Record<string, number>
    }

    IDocumentInfo --> IParserConfig : has
    IParserConfig --> Raptor : optional
    IParserConfig --> GraphRag : optional

Summary

The document.ts file holds the system's core type definitions for documents. It models document metadata, parser configuration, and filter criteria, so that document data is used consistently across application layers, from back-end processing to front-end display.

Because module-specific settings (Raptor, GraphRag) and several metadata fields are optional, new parsing options or document attributes can be added without breaking existing consumers of these interfaces.