infinity_mapping.json

Overview

The infinity_mapping.json file defines the field mapping for a data storage or indexing system, most likely a search engine or document database with text analysis features. The file name and the SQL-style varchar type point to the Infinity database rather than Elasticsearch or Apache Solr, although the file itself does not say so. The mapping specifies each field's data type and default value and, for some text fields, an analyzer. It appears to be tailored for managing document metadata, keywords, content tokens, and entity relationships within a knowledge or document management system.

The file provides a single blueprint for how document records or knowledge base entries are stored, indexed, and queried: which fields are tokenized, which carry ranking features, and which describe entity graph relationships.

Detailed Explanation of Fields

This file does not contain classes or functions; it is a JSON mapping. Each key is a field name, and the associated object gives the field's data type, its default value, and, where applicable, an analyzer or feature.
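As a minimal sketch of that structure, the Python snippet below parses a small fragment and groups field names by analyzer. The key names "type", "default", and "analyzer" are assumptions about the file's layout inferred from the description above, and the fragment is illustrative rather than a copy of the real file.

```python
import json
from collections import defaultdict

# Illustrative fragment; the key names "type", "default", and "analyzer"
# are assumed from the description above, not copied from the real file.
MAPPING_FRAGMENT = """
{
  "id": {"type": "varchar", "default": ""},
  "title_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
  "tag_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
  "weight_flt": {"type": "float", "default": 0.0}
}
"""


def fields_by_analyzer(mapping):
    """Group field names by analyzer; fields without one go under None."""
    groups = defaultdict(list)
    for name, spec in mapping.items():
        groups[spec.get("analyzer")].append(name)
    return dict(groups)


mapping = json.loads(MAPPING_FRAGMENT)
groups = fields_by_analyzer(mapping)
for analyzer, names in groups.items():
    print(analyzer, names)
```

Grouping by analyzer gives a quick view of which fields are searchable as tokenized text and which are stored as plain values.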

Field Breakdown

Field Name	Type	Default	Analyzer / Feature	Description / Usage
`id`	varchar	`""`		Unique identifier for the record/document.
`doc_id`	varchar	`""`		Document identifier, possibly linking to a source document.
`kb_id`	varchar	`""`		Knowledge base ID, indicating the knowledge base source or category.
`create_time`	varchar	`""`		Creation timestamp as a string.
`create_timestamp_flt`	float	`0.0`		Creation timestamp as a floating point number (e.g., Unix timestamp).
`img_id`	varchar	`""`		Identifier for an associated image, if any.
`docnm_kwd`	varchar	`""`		Document name keywords, likely for keyword-based search.
`title_tks`	varchar	`""`	whitespace	Tokenized title field analyzed by whitespace tokenizer (splits on spaces).
`title_sm_tks`	varchar	`""`	whitespace	Smaller or simplified tokenized title field, also whitespace analyzed.
`name_kwd`	varchar	`""`	whitespace-#	Name keywords analyzed with a custom analyzer (likely splitting on whitespace and `#` symbol).
`important_kwd`	varchar	`""`	whitespace-#	Important keywords extracted from the document, analyzed similarly as above.
`tag_kwd`	varchar	`""`	whitespace-#	Tags associated with the document for categorization, analyzed with custom analyzer.
`important_tks`	varchar	`""`	whitespace	Tokenized important words, whitespace tokenized.
`question_kwd`	varchar	`""`	whitespace-#	Keywords extracted from questions or queries related to the document.
`question_tks`	varchar	`""`	whitespace	Tokenized question terms, whitespace analyzer.
`content_with_weight`	varchar	`""`		Content of the document coupled with weight annotations (likely weighted terms).
`content_ltks`	varchar	`""`	whitespace	Tokenized content with light tokenization (whitespace).
`content_sm_ltks`	varchar	`""`	whitespace	Smaller or simplified tokenized content.
`authors_tks`	varchar	`""`	whitespace	Tokenized authors of the document.
`authors_sm_tks`	varchar	`""`	whitespace	Smaller or simplified tokenized authors.
`page_num_int`	varchar	`""`		Page number of the document or content segment.
`top_int`	varchar	`""`		Possibly a ranking or position indicator within a page or list.
`position_int`	varchar	`""`		Position index within a larger document or dataset.
`weight_int`	integer	`0`		Integer weight for relevance, importance, or ranking.
`weight_flt`	float	`0.0`		Floating point weight for finer-grained relevance scoring.
`rank_int`	integer	`0`		Integer rank for sorting or priority purposes.
`rank_flt`	float	`0`		Floating rank score.
`available_int`	integer	`1`		Availability flag (e.g., 1 = available, 0 = unavailable).
`knowledge_graph_kwd`	varchar	`""`		Keywords representing the knowledge graph entities or concepts linked to the document.
`entities_kwd`	varchar	`""`	whitespace-#	Entity keywords extracted from the document, analyzed with the custom analyzer.
`pagerank_fea`	integer	`0`		Feature capturing PageRank or similar graph-based ranking metric.
`tag_feas`	varchar	`""`	rankfeatures	Rank features derived from tags, analyzed with a rank features analyzer (likely used to improve search relevance).

Entity and Graph Relationship Fields

Field Name	Type	Default	Analyzer / Feature	Description
`from_entity_kwd`	varchar	`""`	whitespace-#	Starting entity keyword in a graph edge or relationship.
`to_entity_kwd`	varchar	`""`	whitespace-#	Ending entity keyword in a graph edge or relationship.
`entity_kwd`	varchar	`""`	whitespace-#	Entity keywords in general.
`entity_type_kwd`	varchar	`""`	whitespace-#	Entity type keywords, e.g., person, organization, location, etc.
`source_id`	varchar	`""`	whitespace-#	Source identifier for the entity or relationship.
`n_hop_with_weight`	varchar	`""`		N-hop neighbors with weight information, used for graph traversal or influence propagation.
`removed_kwd`	varchar	`""`	whitespace-#	Keywords marked as removed or deprecated.
`doc_type_kwd`	varchar	`""`	whitespace-#	Document type keywords, e.g., article, report, FAQ, etc.
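To illustrate how the edge fields could be consumed, here is a hedged Python sketch that builds an adjacency list from records. It assumes each relationship record holds a single entity name in from_entity_kwd and to_entity_kwd, and it uses weight_flt as the edge weight; both are assumptions for illustration, not something the mapping states.

```python
from collections import defaultdict


def build_adjacency(records):
    """Collect directed edges from records that carry both endpoint fields."""
    adjacency = defaultdict(list)
    for rec in records:
        src = rec.get("from_entity_kwd", "")
        dst = rec.get("to_entity_kwd", "")
        if src and dst:  # the empty string is the mapping default, so skip it
            # Assumption: weight_flt is reused as the edge weight.
            adjacency[src].append((dst, rec.get("weight_flt", 0.0)))
    return dict(adjacency)


# Hypothetical records; the entity names and weights are made up.
records = [
    {"from_entity_kwd": "entity1", "to_entity_kwd": "entity2", "weight_flt": 0.5},
    {"from_entity_kwd": "entity1", "to_entity_kwd": "entity3", "weight_flt": 0.3},
    {"entity_kwd": "entity4", "entity_type_kwd": "person"},  # a node, not an edge
]
graph = build_adjacency(records)
print(graph)
```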

Important Implementation Details

Data Types: Most fields are of type varchar (string) with an empty-string default. The exceptions are create_timestamp_flt, weight_flt, and rank_flt (float) and weight_int, rank_int, available_int, and pagerank_fea (integer). Note that page_num_int, top_int, and position_int are declared as varchar despite their _int suffix.
Analyzers: Several fields specify analyzers such as:
- whitespace: Splits text into tokens based on spaces.
- whitespace-#: A custom analyzer that presumably splits tokens on whitespace and the # character, which suits compound keywords or tags.
- rankfeatures: Used for fields that represent numeric rank features to boost search relevance.
Weight and Rank Fields: The presence of both integer and float versions of weight and rank implies support for different scoring granularities, or possibly for multiple ranking metrics.
Graph and Entity Modeling: The inclusion of fields like from_entity_kwd, to_entity_kwd, and n_hop_with_weight suggests integration with a knowledge graph or entity relationship graph, enabling multi-hop queries and graph-based scoring.
Default Values: Every field includes a default value to ensure consistent indexing even if data is missing.
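The exact behavior of the analyzers is defined by the engine, not by this file. The following Python sketch only approximates the splitting rules described above, assuming that whitespace splits on runs of whitespace and that whitespace-# additionally treats # as a delimiter. The function name analyze is hypothetical.

```python
import re


def analyze(text, analyzer):
    """Approximate the two text analyzers named in the mapping."""
    if analyzer == "whitespace":
        return text.split()  # split on runs of whitespace, drop empties
    if analyzer == "whitespace-#":
        # Assumed rule: whitespace and '#' both act as delimiters.
        return [tok for tok in re.split(r"[\s#]+", text) if tok]
    raise ValueError(f"unsupported analyzer: {analyzer}")


title_tokens = analyze("infinity mapping schema", "whitespace")
entity_tokens = analyze("entity1#entity2#entity3", "whitespace-#")
print(title_tokens)
print(entity_tokens)
```

The rankfeatures analyzer is left out on purpose, since the text does not describe its input format.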

Usage Examples

Since this is a mapping file for indexing or storage, usage examples are given in the context of indexing documents or querying the system.

Example: Indexing a Document

{
  "id": "doc123",
  "doc_id": "D-4567",
  "kb_id": "kb789",
  "create_time": "2024-06-05T12:00:00Z",
  "create_timestamp_flt": 1717588800.0,
  "title_tks": "infinity mapping schema",
  "weight_int": 10,
  "rank_flt": 0.95,
  "entities_kwd": "entity1#entity2#entity3",
  "from_entity_kwd": "entity1",
  "to_entity_kwd": "entity2",
  "n_hop_with_weight": "entity3:0.5,entity4:0.3",
  "available_int": 1
}

This document would be indexed according to the mapping, supporting complex queries on tokenized titles, entity graph traversals, and relevance ranking.
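Before a record like this reaches the engine, an ingestion step could fill in the mapping defaults and reject values of the wrong type. The Python sketch below shows one way to do that for a subset of fields. The helper prepare_record and the type table PY_TYPES are hypothetical names, and whether the real engine applies defaults itself is not stated in the file.

```python
# Subset of the mapping written as Python data; defaults follow the tables above.
MAPPING = {
    "id": {"type": "varchar", "default": ""},
    "title_tks": {"type": "varchar", "default": ""},
    "weight_int": {"type": "integer", "default": 0},
    "weight_flt": {"type": "float", "default": 0.0},
    "available_int": {"type": "integer", "default": 1},
}

# Assumed correspondence between mapping types and Python types.
PY_TYPES = {"varchar": str, "integer": int, "float": (int, float)}


def prepare_record(doc, mapping=MAPPING):
    """Return a full record: reject unknown fields, check types, fill defaults."""
    unknown = set(doc) - set(mapping)
    if unknown:
        raise KeyError(f"fields not in mapping: {sorted(unknown)}")
    record = {}
    for name, spec in mapping.items():
        value = doc.get(name, spec["default"])
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
            raise TypeError(f"{name} expects {spec['type']}, got {type(value).__name__}")
        record[name] = value
    return record


record = prepare_record(
    {"id": "doc123", "title_tks": "infinity mapping schema", "weight_int": 10}
)
print(record)
```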

Interaction with Other System Components

Search Engine / Indexing Engine: This mapping is consumed by the indexing/search backend to define how documents are parsed, tokenized, and scored. It ensures that fields are queried correctly and efficiently.
Data Ingestion Pipeline: Upstream components prepare and normalize data to conform to this schema before indexing.
Query Layer: Search or query APIs leverage these field definitions to build search queries, filters, and ranking algorithms.
Knowledge Graph Module: The entity and graph-related fields enable integration with a knowledge graph service, facilitating semantic search and graph traversals.
Ranking and Scoring Algorithms: Weight and rank fields are used by ranking algorithms to compute document relevance.
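As a rough illustration of how a query layer might combine these fields, the Python sketch below filters in-memory records by kb_id and available_int and orders them by a score. The scoring rule (weight_flt plus rank_flt) and the function name rank_available are hypothetical; the mapping does not define how the engine actually scores documents.

```python
def rank_available(records, kb_id):
    """Keep available records of one knowledge base, highest score first."""

    def score(rec):
        # Hypothetical scoring rule, chosen only for illustration.
        return rec.get("weight_flt", 0.0) + rec.get("rank_flt", 0.0)

    hits = [
        r for r in records
        if r.get("kb_id") == kb_id and r.get("available_int", 1) == 1
    ]
    return sorted(hits, key=score, reverse=True)


# Made-up records for demonstration.
records = [
    {"id": "a", "kb_id": "kb789", "available_int": 1, "weight_flt": 0.2, "rank_flt": 0.95},
    {"id": "b", "kb_id": "kb789", "available_int": 0, "weight_flt": 9.0},
    {"id": "c", "kb_id": "kb789", "weight_flt": 1.5},
    {"id": "d", "kb_id": "other", "weight_flt": 5.0},
]
ranked_ids = [r["id"] for r in rank_available(records, "kb789")]
print(ranked_ids)
```

Missing fields fall back to the mapping defaults (0.0 for the float scores, 1 for available_int), which mirrors the role of default values described earlier.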

Visual Diagram: Flowchart of Main Field Categories and Relationships

flowchart TD
    A[infinity_mapping.json Schema] --> B[Document Metadata Fields]
    A --> C[Textual Content Fields]
    A --> D[Keyword & Token Fields]
    A --> E[Ranking & Weight Fields]
    A --> F[Entity & Graph Relationship Fields]

    B --> B1[id, doc_id, kb_id, create_time, create_timestamp_flt, img_id, page_num_int]
    C --> C1[title_tks, title_sm_tks, content_with_weight, content_ltks, content_sm_ltks]
    D --> D1[docnm_kwd, name_kwd, important_kwd, tag_kwd, important_tks, question_kwd, question_tks, authors_tks, authors_sm_tks]
    E --> E1[weight_int, weight_flt, rank_int, rank_flt, available_int, pagerank_fea, tag_feas, top_int, position_int]
    F --> F1[knowledge_graph_kwd, entities_kwd, from_entity_kwd, to_entity_kwd, entity_kwd, entity_type_kwd, source_id, n_hop_with_weight, removed_kwd, doc_type_kwd]

    style B fill:#f9f,stroke:#333,stroke-width:1px
    style C fill:#bbf,stroke:#333,stroke-width:1px
    style D fill:#bfb,stroke:#333,stroke-width:1px
    style E fill:#fbb,stroke:#333,stroke-width:1px
    style F fill:#ffb,stroke:#333,stroke-width:1px

Summary

Purpose: Defines a comprehensive mapping schema for indexing and querying documents with rich metadata, tokenized text fields, ranking features, and entity graph relationships.
Functionality: Supports multi-field indexing with customized analyzers, numeric ranking fields, and graph-based entity modeling.
Integration: Serves as a core component for search/indexing backend, enabling advanced semantic search and relevance ranking in a knowledge/document management system.
Extensibility: Easily extended by adding new fields or analyzers as business or domain requirements evolve.

This JSON mapping file is essential for any system component that indexes and queries knowledge documents, ensuring data consistency, search efficiency, and semantic richness.