document.ts


Overview

document.ts defines TypeScript interfaces and types that model the structure and metadata of documents within the system, particularly focusing on document processing status, parsing configurations, and filtering criteria. This file does not contain executable code but provides a strongly-typed contract for how document-related data should be represented and handled throughout the application.

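Key purposes include describing a document's identity, storage, and processing state (IDocumentInfo), describing how a document should be parsed (IParserConfig, with the optional Raptor and GraphRag settings), and describing filter data keyed by run status and file suffix (IDocumentInfoFilter).

These interfaces enable type safety, autocompletion, and easier maintenance when working with document-related data in the system.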


Interfaces and Types

IDocumentInfo

Represents detailed metadata and operational data for a document in the system. This includes identification, processing state, parser settings, and various descriptive fields.

| Property | Type | Description |
| --- | --- | --- |
| chunk_num | number | Number of chunks the document is divided into. |
| create_date | string | Date when the document was created (likely in YYYY-MM-DD format). |
| create_time | number | Timestamp or time value representing creation time (exact format depends on implementation). |
| created_by | string | Identifier or name of the user/system that created the document. |
| id | string | Unique identifier for the document. |
| kb_id | string | Identifier for the knowledge base to which this document belongs. |
| location | string | File path or storage location of the document. |
| name | string | Name or title of the document. |
| parser_config | IParserConfig | Configuration object detailing how the document should be parsed. |
| parser_id | string | Identifier of the parser used on the document. |
| process_begin_at | string (optional) | Timestamp when document processing began. |
| process_duration | number | Duration of processing (likely in milliseconds or seconds). |
| progress | number | Numeric representation of processing progress (e.g., percentage). |
| progress_msg | string | Human-readable message about current processing progress or status. |
| run | RunningStatus | Enum value indicating the current running status of document processing (imported). |
| size | number | Size of the document (likely in bytes). |
| source_type | string | Type/category of the source document (e.g., "pdf", "docx"). |
| status | string | Current status of the document (e.g., "completed", "failed"). |
| thumbnail | string | URL or base64 string for a thumbnail preview image of the document. |
| token_num | number | Number of tokens extracted or associated with the document (potentially for NLP use). |
| type | string | Type of document (might overlap with source_type or be a higher-level category). |
| update_date | string | Date when the document metadata was last updated (likely YYYY-MM-DD format). |
| update_time | number | Timestamp or time value for last update time. |
| meta_fields | Record<string, any> (optional) | Additional metadata fields as key-value pairs, flexible for extensibility. |

Usage Example

// Assumes IDocumentInfo and RunningStatus are imported into scope.
// RunningStatus.Completed is an illustrative member name; check the enum for the real members.
const docInfo: IDocumentInfo = {
  chunk_num: 10,
  create_date: "2022-06-01",
  create_time: 1654095600, // Unix seconds in this example: 2022-06-01 15:00:00 UTC
  created_by: "user123",
  id: "doc_001",
  kb_id: "kb_abc",
  location: "/files/doc_001.pdf",
  name: "Project Plan",
  parser_config: {
    pages: [1, 2, 3],
    delimiter: "\n",
    html4excel: false,
    layout_recognize: true,
    raptor: { use_raptor: true },
    graphrag: { use_graphrag: false }
  },
  parser_id: "parser_v1",
  process_duration: 1200,
  progress: 100,
  progress_msg: "Completed",
  run: RunningStatus.Completed,
  size: 204800,
  source_type: "pdf",
  status: "completed",
  thumbnail: "https://example.com/thumbs/doc_001.png",
  token_num: 4500,
  type: "report",
  update_date: "2022-06-10",
  update_time: 1654879200, // Unix seconds in this example: 2022-06-10 16:40:00 UTC
  meta_fields: { department: "finance" }
};
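
Because process_begin_at is optional, code that reads it has to handle the missing case. The following is a minimal sketch of one way to do that. The DocumentProgressFields type and the progressLabel helper are hypothetical names introduced for illustration and are not part of document.ts; the sketch only relies on the field names and types listed in the table above.

```typescript
// Local stand-in for the handful of IDocumentInfo fields used here.
// (Assumption: the real interface carries many more fields.)
interface DocumentProgressFields {
  name: string;
  progress: number;
  progress_msg: string;
  process_begin_at?: string;
}

// Hypothetical helper: builds a one-line status label for a document.
// Falls back to a placeholder when processing has no recorded start.
function progressLabel(doc: DocumentProgressFields): string {
  const started = doc.process_begin_at ?? "not started";
  return `${doc.name}: ${doc.progress_msg} (progress=${doc.progress}, began=${started})`;
}

const label = progressLabel({
  name: "Project Plan",
  progress: 100,
  progress_msg: "Completed",
});
```

Accepting a narrow structural type instead of the full IDocumentInfo keeps the helper usable with partial objects, and any full IDocumentInfo value still satisfies it.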

IParserConfig

Defines configuration options for how a document should be parsed. This interface is used within IDocumentInfo to customize processing behavior.

| Property | Type | Description |
| --- | --- | --- |
| delimiter | string (optional) | String used to split or parse the document's content (e.g., newline, comma). |
| html4excel | boolean (optional) | Flag indicating if the document is an HTML file formatted for Excel export. |
| layout_recognize | boolean (optional) | Enables layout recognition to better understand document structure during parsing. |
| pages | any[] | Array representing page numbers or page objects to parse from the document. |
| raptor | Raptor (optional) | Configuration specific to the "Raptor" parsing module (see below). |
| graphrag | GraphRag (optional) | Configuration specific to the "GraphRag" parsing module (see below). |
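
Since most IParserConfig fields are optional, consumers usually need to decide what a missing value means. The sketch below shows one possible way to fill in defaults. The default values, the ResolvedParserConfig type, and the withDefaults helper are assumptions made for illustration; document.ts itself does not define any defaults.

```typescript
// Shapes mirrored from this document so the sketch compiles on its own.
interface Raptor {
  use_raptor: boolean;
}
interface GraphRag {
  community?: boolean;
  entity_types?: string[];
  method?: string;
  resolution?: boolean;
  use_graphrag?: boolean;
}
interface IParserConfig {
  delimiter?: string;
  html4excel?: boolean;
  layout_recognize?: boolean;
  pages: any[];
  raptor?: Raptor;
  graphrag?: GraphRag;
}

// Hypothetical type: same as IParserConfig, but the three scalar options
// are guaranteed to be present after defaults are applied.
type ResolvedParserConfig = IParserConfig &
  Required<Pick<IParserConfig, "delimiter" | "html4excel" | "layout_recognize">>;

// Hypothetical helper; the fallback values are illustrative only.
function withDefaults(config: IParserConfig): ResolvedParserConfig {
  return {
    ...config,
    delimiter: config.delimiter ?? "\n",
    html4excel: config.html4excel ?? false,
    layout_recognize: config.layout_recognize ?? false,
  };
}

const resolved = withDefaults({ pages: [1, 2, 3], delimiter: "," });
```

Using ?? rather than || keeps explicit values such as false or an empty string intact and only replaces undefined or null.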


Raptor

Configuration options for the Raptor parsing engine/module.

| Property | Type | Description |
| --- | --- | --- |
| use_raptor | boolean | Indicates whether to use the Raptor parser. |


GraphRag

Configuration options for the GraphRag parsing engine/module, which appears to add graph-based semantic parsing features.

| Property | Type | Description |
| --- | --- | --- |
| community | boolean (optional) | Enables community detection features during parsing. |
| entity_types | string[] (optional) | Specifies which entity types to recognize in the document. |
| method | string (optional) | Parsing method or algorithm name to use within GraphRag. |
| resolution | boolean (optional) | Enables resolution features (e.g., entity resolution). |
| use_graphrag | boolean (optional) | Flag to enable use of the GraphRag parser. |
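
Both raptor and graphrag are optional on IParserConfig, and use_graphrag is itself optional, so a check for "is this module on" has to tolerate two levels of missing data. The following sketch illustrates that with optional chaining. The enabledModules helper and the entity type strings are hypothetical and exist only for this example.

```typescript
// Shapes mirrored from this document; IParserConfig is trimmed to the
// fields this sketch uses.
interface Raptor {
  use_raptor: boolean;
}
interface GraphRag {
  community?: boolean;
  entity_types?: string[];
  method?: string;
  resolution?: boolean;
  use_graphrag?: boolean;
}
interface IParserConfig {
  pages: any[];
  raptor?: Raptor;
  graphrag?: GraphRag;
}

// Hypothetical helper: lists which optional modules a config turns on.
// A missing object or a missing flag both count as "off".
function enabledModules(config: IParserConfig): string[] {
  const modules: string[] = [];
  if (config.raptor?.use_raptor) modules.push("raptor");
  if (config.graphrag?.use_graphrag) modules.push("graphrag");
  return modules;
}

const both = enabledModules({
  pages: [],
  raptor: { use_raptor: true },
  graphrag: { use_graphrag: true, entity_types: ["person", "organization"] },
});
const none = enabledModules({ pages: [] });
```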


IDocumentInfoFilter

Defines filters used to query or categorize documents based on their processing run status and file suffix.

| Property | Type | Description |
| --- | --- | --- |
| run_status | Record<number, number> | A map from run status codes to their counts or weights. |
| suffix | Record<string, number> | A map from file suffixes/extensions (e.g., "pdf", "docx") to counts or weights. |
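
To make the shape concrete, here is a minimal sketch that builds an IDocumentInfoFilter by counting documents per run status and per file suffix. It assumes the values are counts, that run status codes are plain numbers, and that the suffix is taken from the document name; none of this is confirmed by document.ts, and in practice the filter may simply be returned by an API. DocLike and buildFilter are hypothetical names.

```typescript
interface IDocumentInfoFilter {
  run_status: Record<number, number>;
  suffix: Record<string, number>;
}

// Hypothetical minimal document shape; `run` is treated as a numeric
// status code for the purpose of this sketch.
interface DocLike {
  name: string;
  run: number;
}

// Hypothetical helper: tallies documents by run status and by suffix.
function buildFilter(docs: DocLike[]): IDocumentInfoFilter {
  const filter: IDocumentInfoFilter = { run_status: {}, suffix: {} };
  for (const doc of docs) {
    filter.run_status[doc.run] = (filter.run_status[doc.run] ?? 0) + 1;

    // Only count a suffix when the name has a non-empty extension.
    const dot = doc.name.lastIndexOf(".");
    if (dot > 0 && dot < doc.name.length - 1) {
      const ext = doc.name.slice(dot + 1).toLowerCase();
      filter.suffix[ext] = (filter.suffix[ext] ?? 0) + 1;
    }
  }
  return filter;
}

const filter = buildFilter([
  { name: "plan.pdf", run: 1 },
  { name: "notes.PDF", run: 1 },
  { name: "spec.docx", run: 3 },
  { name: "README", run: 0 },
]);
```

Lowercasing the extension is a design choice of this sketch so that "pdf" and "PDF" land in the same bucket; names without an extension contribute to run_status only.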


Implementation Details and Algorithms


Interaction with Other Parts of the System


Visual Diagram

classDiagram
    class IDocumentInfo {
        +chunk_num: number
        +create_date: string
        +create_time: number
        +created_by: string
        +id: string
        +kb_id: string
        +location: string
        +name: string
        +parser_config: IParserConfig
        +parser_id: string
        +process_begin_at?: string
        +process_duration: number
        +progress: number
        +progress_msg: string
        +run: RunningStatus
        +size: number
        +source_type: string
        +status: string
        +thumbnail: string
        +token_num: number
        +type: string
        +update_date: string
        +update_time: number
        +meta_fields?: Record<string, any>
    }

    class IParserConfig {
        +delimiter?: string
        +html4excel?: boolean
        +layout_recognize?: boolean
        +pages: any[]
        +raptor?: Raptor
        +graphrag?: GraphRag
    }

    class Raptor {
        +use_raptor: boolean
    }

    class GraphRag {
        +community?: boolean
        +entity_types?: string[]
        +method?: string
        +resolution?: boolean
        +use_graphrag?: boolean
    }

    class IDocumentInfoFilter {
        +run_status: Record<number, number>
        +suffix: Record<string, number>
    }

    IDocumentInfo --> IParserConfig : has
    IParserConfig --> Raptor : optional
    IParserConfig --> GraphRag : optional

Summary

The document.ts file holds the system's core type definitions for documents. It models document metadata, parsing configuration, and filtering criteria so that document data is represented consistently wherever it is used in the application.

Because most parser settings are optional and meta_fields accepts arbitrary keys, these interfaces leave room for additional parsing strategies and document types without changing existing consumers.