os_mapping.json

Overview

The os_mapping.json file is an Elasticsearch index configuration. It defines the settings and mappings for an index, focusing on optimized data storage, custom similarity scoring, and flexible dynamic field mappings based on naming conventions.

This configuration is designed to:

Optimize shard and replica settings for performance.
Enable k-nearest neighbors (k-NN) search support with vector fields.
Customize text similarity scoring using a scripted similarity.
Dynamically map fields into appropriate Elasticsearch data types based on field name patterns.
Support geographic data, date formats, nested objects, and rank features.

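Such a file is typically supplied when an index is created, to enforce consistent indexing behavior and enable advanced search features. Static settings such as number_of_shards cannot be changed on an existing index, so later updates are limited to dynamic settings and newly added mappings.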

Detailed Explanation

The file is structured as a JSON object with two main keys:

settings
mappings

`settings`

Defines index-level settings that control shards, replicas, refresh intervals, and similarity algorithms.

Properties:

number_of_shards (integer):
Number of primary shards for the index.
Value: 2
number_of_replicas (integer):
Number of replica copies of each primary shard (for redundancy).
Value: 0 (no replicas, likely for performance/testing)
refresh_interval (string):
How often the index is refreshed so that newly indexed documents become searchable.
Value: "1000ms" (1 second)
knn (boolean):
Enables k-NN indexing and search capabilities.
Value: true
similarity (object):
Defines a custom similarity algorithm for text scoring called scripted_sim.
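
As a rough illustration, the settings listed above could be assembled as follows. This is a sketch only: the flat layout under "settings" is an assumption (the real file may nest some keys, for example under an "index" key), and the script source is a placeholder rather than the file's actual script.

```python
import json

# Values are the ones listed above. The flat layout is an assumption;
# the real file may nest some of these keys differently.
settings = {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "refresh_interval": "1000ms",
    "knn": True,
    "similarity": {
        "scripted_sim": {
            "type": "scripted",
            # Placeholder: the actual script source is not reproduced here.
            "script": {"source": "<scoring script>"},
        }
    },
}

# Serialize the way the settings would appear inside the JSON file.
body = json.dumps({"settings": settings}, indent=2)
print(body)
```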

`scripted_sim` similarity:

type: "scripted" — uses a custom script to calculate similarity.
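script > source: A Java-like script that calculates an IDF-weighted score from index statistics (document count and the term's document frequency), scales it by query.boost, and caps the contribution of the term's frequency within a document at 1 using Math.min(doc.freq, 1). Repeated occurrences of a term in the same document therefore do not raise the score, which yields a TF-IDF variant with a binary term-frequency component.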
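
The scoring behavior described above can be mimicked in a few lines of Python. This is a minimal sketch, not the file's script: the BM25-style IDF formula and the function and parameter names are assumptions chosen for illustration. Only the use of a logarithmic IDF, the multiplication by the query boost, and the min(freq, 1) cap are taken from the description.

```python
import math


def scripted_sim_score(doc_count, doc_freq, term_freq, boost=1.0):
    """Illustrative IDF-weighted score with a capped term frequency.

    doc_count: number of documents that have the field
    doc_freq:  number of documents containing the term
    term_freq: occurrences of the term in the scored document
    boost:     query-time boost
    """
    # Assumed BM25-style IDF; the file's exact formula may differ.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Cap the term-frequency contribution at 1, as Math.min(doc.freq, 1) does.
    return boost * idf * min(term_freq, 1)


rare = scripted_sim_score(doc_count=1000, doc_freq=10, term_freq=3)
common = scripted_sim_score(doc_count=1000, doc_freq=500, term_freq=3)
print(rare > common)  # rarer terms score higher
```

Because of the cap, a document that mentions a term five times scores the same as one that mentions it once; only the rarity of the term and the boost change the result.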

`mappings`

Defines how fields in documents are interpreted and indexed.

Properties:

lat_lon (geo_point):
A statically mapped geographic coordinate, indexed as a geo_point and stored for retrieval.
date_detection: "true"
Enables automatic detection of date values in newly seen string fields.
dynamic_templates:
An array of dynamic mapping rules that assign types to fields based on wildcard or regex matches against field names.

dynamic_templates

This is the core feature enabling flexible field mapping without explicitly defining every field.

Each template includes:

match (optionally with match_pattern): A wildcard pattern to match field names, or a regular expression when match_pattern is set to regex.
mapping: The Elasticsearch type and additional settings for matched fields.
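
For illustration, two templates of this shape might look like the following, written as Python dictionaries that serialize to the JSON form. This is a sketch under assumptions: the `int` entry follows the summary table (wildcard match, stored), while the `dt` entry shows only the regex form and omits any date formats, since those are not listed in this document.

```python
import json

dynamic_templates = [
    # Wildcard form: "match" alone is treated as a simple glob pattern.
    {"int": {"match": "*_int", "mapping": {"type": "integer", "store": True}}},
    # Regex form: "match_pattern": "regex" makes "match" a regular expression.
    # Date formats are omitted because they are not shown in this document.
    {
        "dt": {
            "match_pattern": "regex",
            "match": "^.*(_dt|_time|_at)$",
            "mapping": {"type": "date"},
        }
    },
]

print(json.dumps({"dynamic_templates": dynamic_templates}, indent=2))
```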

Templates Summary:

Template Name	Match Pattern	Type	Additional Settings
int	`*_int`	integer	stored
ulong	`*_ulong`	unsigned_long	stored
long	`*_long`	long	stored
short	`*_short`	short	stored
numeric	`*_flt`	float	stored
tks	`*_tks`	text	analyzer: whitespace, similarity: scripted_sim, stored
ltks	`*_ltks`	text	analyzer: whitespace, stored
kwd	regex: `^(.*_(kwd|id|ids…`	keyword	
dt	regex: `^.*(_dt|_time|_at)$`	date	
nested	`*_nst`	nested	
object	`*_obj`	object	dynamic: true
string	regex: `^.*_(with_weight|list)$`	text	
rank_feature	`*_fea`	rank_feature	
rank_features	`*_feas`	rank_features	
knn_vector	`*_512_vec`, `*_768_vec`, `*_1024_vec`, `*_1536_vec`	knn_vector	index: true, space_type: cosinesimil, dimension varies (512, 768, 1024, 1536)
binary	`*_bin`	binary	
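
To make the table concrete, the sketch below resolves a field name to a type by checking the patterns in order and returning the first match. It is an approximation written for illustration, not the engine's implementation: `TEMPLATES` and `resolve_type` are hypothetical names, the `kwd` template is left out because its regex is incomplete in the table, and wildcard matching is emulated with Python's `fnmatch`.

```python
import fnmatch
import re

# (template name, pattern kind, pattern, mapped type), in table order.
# The kwd template is omitted because its regex is incomplete above.
TEMPLATES = [
    ("int", "glob", "*_int", "integer"),
    ("ulong", "glob", "*_ulong", "unsigned_long"),
    ("long", "glob", "*_long", "long"),
    ("short", "glob", "*_short", "short"),
    ("numeric", "glob", "*_flt", "float"),
    ("tks", "glob", "*_tks", "text"),
    ("ltks", "glob", "*_ltks", "text"),
    ("dt", "regex", r"^.*(_dt|_time|_at)$", "date"),
    ("nested", "glob", "*_nst", "nested"),
    ("object", "glob", "*_obj", "object"),
    ("string", "regex", r"^.*_(with_weight|list)$", "text"),
    ("rank_feature", "glob", "*_fea", "rank_feature"),
    ("rank_features", "glob", "*_feas", "rank_features"),
    ("knn_vector", "glob", "*_512_vec", "knn_vector"),
    ("knn_vector", "glob", "*_768_vec", "knn_vector"),
    ("knn_vector", "glob", "*_1024_vec", "knn_vector"),
    ("knn_vector", "glob", "*_1536_vec", "knn_vector"),
    ("binary", "glob", "*_bin", "binary"),
]


def resolve_type(field_name):
    """Return the type of the first template matching field_name, or None."""
    for _name, kind, pattern, mapped_type in TEMPLATES:
        if kind == "glob":
            matched = fnmatch.fnmatchcase(field_name, pattern)
        else:
            matched = re.fullmatch(pattern, field_name) is not None
        if matched:
            return mapped_type
    return None  # no template applies; default dynamic mapping would be used


for field in ("user_int", "price_flt", "created_at", "embedding_768_vec", "title"):
    print(field, "->", resolve_type(field))
```

A field such as `title`, which carries no recognized suffix, falls through every template and returns `None` here.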

Important Implementation Details

Custom Similarity Script:
The scripted similarity uses a logarithmic function to compute an inverse document frequency (IDF) component, normalizing based on the document frequency and document count. This allows a custom scoring model adapted to the specific corpus.
Dynamic Field Mapping:
Instead of statically defining every field, the mappings rely heavily on dynamic_templates that detect field types by suffixes or regex. This approach allows enforcing consistent data types and indexing behavior dynamically, facilitating large and evolving schemas.
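k-NN Vector Fields:
Vector fields of four dimensions (512, 768, 1024, and 1536) are supported for approximate nearest neighbor search using the cosine similarity space (cosinesimil). The knn index setting and the knn_vector field type are provided by the OpenSearch k-NN plugin rather than by core Elasticsearch. These fields support vector search use cases such as semantic search or similarity-based recommendations.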
Date Formats:
Multiple date formats are detected automatically, supporting flexible input formats.
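
Since the k-NN vector templates differ only in their dimension, they can be generated from a single helper. The sketch below assumes the mapping options shown in the summary table (index, space_type, dimension); the helper name and the per-dimension template names are hypothetical and chosen only for illustration.

```python
VECTOR_DIMENSIONS = (512, 768, 1024, 1536)


def knn_vector_template(dimension):
    """Build one dynamic template for fields ending in _<dimension>_vec."""
    return {
        # Hypothetical template name; the file may name these differently.
        f"knn_vector_{dimension}": {
            "match": f"*_{dimension}_vec",
            "mapping": {
                "type": "knn_vector",
                "index": True,
                "space_type": "cosinesimil",
                "dimension": dimension,
            },
        }
    }


templates = [knn_vector_template(d) for d in VECTOR_DIMENSIONS]
for template in templates:
    for name, body in template.items():
        print(name, body["match"], body["mapping"]["dimension"])
```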

Usage Example

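Suppose you index a document with fields like:

{
  "user_int": 42,
  "price_flt": 12.99,
  "title_tks": "Elasticsearch vector search",
  "created_dt": "2024-06-01 12:00:00",
  "embedding_512_vec": [0.01, 0.02, ..., 0.512],
  "lat_lon": "40.7128,-74.0060"
}

Here the ... stands for the remaining vector components; a real embedding_512_vec value must contain exactly 512 numbers.

These fields are mapped as follows:

user_int → integer
price_flt → float
title_tks → text with whitespace analyzer and custom similarity
created_dt → date with specified formats
embedding_512_vec → knn_vector of dimension 512
lat_lon → geo_point (through the static lat_lon property, not a dynamic template)

Note that a suffixed name such as location_lat_lon matches none of the templates listed above, so it would not be mapped as a geo_point.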

Interaction with Other System Components

Index Creation API:
This JSON is typically submitted via the Elasticsearch REST API as the request body when creating an index, so that the index starts out with these settings and mappings.
Search and Query Layers:
The custom similarity and knn vector fields impact how queries are scored and executed in the search engine.
Data Ingestion Pipelines:
Upstream processes generating documents must follow naming conventions (e.g., field suffixes) to ensure proper field mapping.
Analytics and Retrieval:
Stored fields (store: true) enable efficient retrieval of original field values without fetching the entire _source.
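
As a sketch of the index creation step, the snippet below builds (but does not send) a `PUT /<index>` request carrying the configuration as its JSON body, using only the Python standard library. The host, index name, helper name, and the abbreviated `config` dictionary are placeholders assumed for illustration; in practice the body would be the parsed contents of os_mapping.json.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, config):
    """Build a PUT request that would create index_name with config."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Abbreviated stand-in for the parsed contents of os_mapping.json.
config = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 0},
    "mappings": {"date_detection": True},
}

# Placeholder host and index name.
request = build_create_index_request("http://localhost:9200", "example_index", config)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```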

Visual Diagram

The following Mermaid flowchart illustrates the workflow of dynamic field mapping and settings application in this os_mapping.json file:

flowchart TD
    A[Start: Index Creation] --> B[Apply Index Settings]
    B --> C{Enable k-NN Search?}
    C -- Yes --> D[Set knn: true]
    C -- No --> E[Skip k-NN]

    D --> F[Configure Shards, Replicas, Refresh]
    E --> F
    F --> G[Define Custom Similarity Script]

    G --> H[Apply Mappings]

    H --> I["Static Mapping: lat_lon (geo_point)"]
    H --> J[Enable date_detection]

    H --> K[Dynamic Templates Start]

    K --> L1["*_int → integer"]
    K --> L2["*_flt → float"]
    K --> L3["*_tks → text + scripted_sim"]
    K --> L4["*_512_vec → knn_vector (dim 512)"]
    K --> L5["Regex Matches (keyword, date, string types)"]
    K --> L6["*_nst → nested"]
    K --> L7["*_obj → object"]
    K --> L8["*_bin → binary"]

    L1 & L2 & L3 & L4 & L5 & L6 & L7 & L8 --> M[End: Mapping Complete]

Summary

The os_mapping.json file is a powerful Elasticsearch index configuration enabling:

Efficient shard and refresh control.
Custom similarity scoring with a scripted formula.
Rich dynamic field mapping based on field name patterns.
Support for advanced data types including geospatial points, nested objects, rank features, and vector similarity search.

It is ideal for systems requiring flexible, high-performance search indexes with complex and evolving schemas.
