os_mapping.json
Overview
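The os_mapping.json file is an index configuration written in the Elasticsearch settings-and-mappings format. It defines the settings and mappings for an index: shard, replica, and refresh settings, a custom scripted similarity, and dynamic field mappings driven by field naming conventions. Note that the knn index setting, the knn_vector field type, and the cosinesimil space type are provided by the OpenSearch (formerly Open Distro for Elasticsearch) k-NN plugin rather than by core Elasticsearch, which the os_ prefix in the filename appears to reflect.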
This configuration is designed to:
Optimize shard and replica settings for performance.
Enable k-nearest neighbors (k-NN) search support with vector fields.
Customize text similarity scoring using a scripted similarity.
Dynamically map fields into appropriate Elasticsearch data types based on field name patterns.
Support geographic data, date formats, nested objects, and rank features.
Such a file is typically used during index creation or update to enforce consistent indexing behavior and enable advanced search features.
Detailed Explanation
The file is structured as a JSON object with two main keys:
settings
mappings
settings
Defines index-level settings that control shards, replicas, refresh intervals, and similarity algorithms.
Properties:
number_of_shards (integer): Number of primary shards for the index. Value: 2
number_of_replicas (integer): Number of replica copies of each primary shard, kept for redundancy. Value: 0 (no replicas, likely for performance or testing)
refresh_interval (string): How often the index is refreshed so that newly indexed documents become searchable. Value: "1000ms" (1 second)
knn (boolean): Enables k-NN indexing and search capabilities. Value: true
similarity (object): Defines a custom similarity algorithm for text scoring called scripted_sim.
scripted_sim similarity:
type: "scripted". Uses a custom script to calculate similarity.
script > source: A Java-like (Painless) script that computes an IDF-weighted score from the term's document frequency and the field's document count, scales it by query.boost, and caps the within-document term frequency at 1 with Math.min(doc.freq, 1). Because doc.freq is the number of times the term occurs in the document, the cap means repeated occurrences do not raise the score: a matching term contributes its IDF weight once. The result is a TF-IDF variant with a binary term-frequency component.
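The scoring behaviour described above can be sketched in Python. This is a minimal illustration, not the script from os_mapping.json: the exact IDF expression is not reproduced in this document, so the BM25-style formula below is an assumption, and scripted_sim_score with its parameter names is hypothetical.

```python
import math


def scripted_sim_score(doc_count: int, doc_freq: int, term_freq: int,
                       boost: float = 1.0) -> float:
    """Hypothetical stand-in for the scripted_sim script (assumed IDF formula).

    doc_count: number of documents that have the field
    doc_freq:  number of documents that contain the term
    term_freq: occurrences of the term in the scored document (doc.freq)
    boost:     query-time boost (query.boost)
    """
    # Assumed BM25-style IDF: rarer terms (small doc_freq) get larger weights.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Cap the in-document term frequency at 1: presence counts, repetition does not.
    return boost * idf * min(term_freq, 1)


once = scripted_sim_score(doc_count=1000, doc_freq=10, term_freq=1)
many = scripted_sim_score(doc_count=1000, doc_freq=10, term_freq=7)
common = scripted_sim_score(doc_count=1000, doc_freq=500, term_freq=1)
```

Here once and many are equal because of the cap, while common is lower than once because the term appears in half of the documents.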
mappings
Defines how fields in documents are interpreted and indexed.
Properties:
lat_lon (geo_point): Geographic coordinate indexed as a point and stored for retrieval.
date_detection: "true". Enables automatic detection of date fields.
dynamic_templates: An array of dynamic mapping rules that map fields to types based on wildcard or regex matches on field names.
dynamic_templates
This is the core feature enabling flexible field mapping without explicitly defining every field.
Each template includes:
match, or match_pattern + match: A wildcard pattern, or a regular expression when match_pattern is set to regex, that is tested against field names.
mapping: The Elasticsearch type and additional settings for matched fields.
Templates Summary:
| Template Name | Match Pattern | Type | Additional Settings |
|---|---|---|---|
| int | *_int | integer | stored |
| ulong | | unsigned_long | stored |
| long | | long | stored |
| short | | short | stored |
| numeric | *_flt | float | stored |
| tks | *_tks | text | analyzer: whitespace, similarity: scripted_sim, stored |
| ltks | | text | analyzer: whitespace, stored |
| kwd | regex: ^(.*_(kwd\|id\|ids\|… | keyword | |
| dt | regex: ^.*(_dt\|_time\|_at)$ | date | |
| nested | *_nst | nested | |
| object | *_obj | object | dynamic: true |
| string | regex: ^.*_(with_weight\|list)$ | text | |
| rank_feature | | rank_feature | |
| rank_features | | rank_features | |
| knn_vector | *_512_vec (dimension 512) | knn_vector | index: true, space_type: cosinesimil, dimension varies (512, 768, 1024, 1536) |
| binary | *_bin | binary | |
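To make the matching rule concrete, the following Python sketch resolves a field name against the templates whose patterns are visible in this document. It is an illustration under stated assumptions, not the engine's implementation: TEMPLATES and resolve_type are hypothetical names, templates whose patterns are missing or truncated here are left out, and the first-match-wins order follows how dynamic templates are evaluated in order.

```python
import re
from fnmatch import fnmatchcase

# (template name, pattern kind, pattern, mapped type).
# Only patterns that appear in this document are listed.
TEMPLATES = [
    ("int", "wildcard", "*_int", "integer"),
    ("numeric", "wildcard", "*_flt", "float"),
    ("tks", "wildcard", "*_tks", "text"),
    ("dt", "regex", r"^.*(_dt|_time|_at)$", "date"),
    ("nested", "wildcard", "*_nst", "nested"),
    ("object", "wildcard", "*_obj", "object"),
    ("string", "regex", r"^.*_(with_weight|list)$", "text"),
    ("knn_vector", "wildcard", "*_512_vec", "knn_vector"),
    ("binary", "wildcard", "*_bin", "binary"),
]


def resolve_type(field_name: str):
    """Return (template name, type) for the first matching template, else None."""
    for name, kind, pattern, mapped_type in TEMPLATES:
        if kind == "regex":
            matched = re.fullmatch(pattern, field_name) is not None
        else:
            matched = fnmatchcase(field_name, pattern)
        if matched:
            return name, mapped_type
    return None


for field in ("user_int", "created_dt", "tags_with_weight", "notes"):
    print(field, "->", resolve_type(field))
```

A field such as notes matches no template in this sketch and returns None; in a real index it would fall back to the engine's default dynamic mapping.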
Important Implementation Details
Custom Similarity Script:
The scripted similarity uses a logarithmic function to compute an inverse document frequency (IDF) component, normalized by the document frequency and the document count. This gives a scoring model adapted to the specific corpus.
Dynamic Field Mapping:
Instead of statically defining every field, the mappings rely heavily on dynamic_templates that detect field types by suffix or regex. This enforces consistent data types and indexing behavior for large and evolving schemas.
k-NN Vector Fields:
Multiple vector dimensions are supported for approximate nearest neighbor search in the cosine similarity space (cosinesimil). This supports vector search use cases such as semantic search or similarity-based recommendations.
Date Formats:
Multiple date formats are detected automatically, supporting flexible input formats.
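As a rough illustration of what the cosinesimil space measures for the k-NN vector fields, the sketch below computes cosine similarity and ranks a few toy vectors by it. It is an exact brute-force search, whereas the index performs approximate search, and it does not model how the engine turns similarity into a final score. The function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the ids of the k vectors most similar to the query (brute force)."""
    ranked = sorted(vectors, key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
                    reverse=True)
    return ranked[:k]


# Toy 2-dimensional vectors; real fields would have 512 to 1536 dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], docs, k=2)
```

Because cosine similarity depends only on direction, scaling a vector by a positive constant does not change its ranking.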
Usage Example
Suppose you index a document with fields like:
{
"user_int": 42,
"price_flt": 12.99,
"title_tks": "Elasticsearch vector search",
"created_dt": "2024-06-01 12:00:00",
"embedding_512_vec": [0.01, 0.02, ..., 0.512],
"lat_lon": "40.7128,-74.0060"
}
The dynamic templates, together with the static lat_lon property, map these fields as:
user_int → integer
price_flt → float
title_tks → text with whitespace analyzer and custom similarity
created_dt → date with specified formats
embedding_512_vec → knn_vector of dimension 512
lat_lon → geo_point (through the static lat_lon property; the templates listed above contain no geo_point rule, so the field must carry exactly this name)
Interaction with Other System Components
Index Creation API:
This JSON is typically submitted via the REST API to create or update an index with these settings and mappings.
Search and Query Layers:
The custom similarity and knn vector fields affect how queries are scored and executed in the search engine.
Data Ingestion Pipelines:
Upstream processes generating documents must follow the naming conventions (e.g., field suffixes) to ensure proper field mapping.
Analytics and Retrieval:
Stored fields (store: true) enable efficient retrieval of original field values without fetching the entire _source.
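For the index creation step, a request body along these lines could be assembled and serialized before being sent as the body of a create-index request. This is a sketch under assumptions: build_index_body is a hypothetical helper, the nesting of settings under an index key is assumed rather than taken from the file, the script source is a placeholder, and only two of the dynamic templates are included. No HTTP call is made.

```python
import json


def build_index_body(similarity_source: str) -> dict:
    """Assemble a partial create-index body from the values described above."""
    return {
        "settings": {
            "index": {  # assumed nesting; the file's exact layout is not shown here
                "number_of_shards": 2,
                "number_of_replicas": 0,
                "refresh_interval": "1000ms",
                "knn": True,
                "similarity": {
                    "scripted_sim": {
                        "type": "scripted",
                        "script": {"source": similarity_source},
                    }
                },
            }
        },
        "mappings": {
            "properties": {"lat_lon": {"type": "geo_point", "store": True}},
            "date_detection": "true",
            "dynamic_templates": [
                {"int": {"match": "*_int",
                         "mapping": {"type": "integer", "store": True}}},
                {"tks": {"match": "*_tks",
                         "mapping": {"type": "text", "analyzer": "whitespace",
                                     "similarity": "scripted_sim", "store": True}}},
            ],
        },
    }


# Placeholder: the real script source lives in os_mapping.json.
body = build_index_body("<script source from os_mapping.json>")
payload = json.dumps(body, indent=2)
```

In practice the whole file would be loaded with json.load and passed through unchanged; building the dictionary by hand here only makes the structure visible.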
Visual Diagram
The following Mermaid flowchart illustrates the workflow of dynamic field mapping and settings application in this os_mapping.json file:
flowchart TD
A[Start: Index Creation] --> B[Apply Index Settings]
B --> C{Enable k-NN Search?}
C -- Yes --> D[Set knn: true]
C -- No --> E[Skip k-NN]
D --> F[Configure Shards, Replicas, Refresh]
E --> F
F --> G[Define Custom Similarity Script]
G --> H[Apply Mappings]
H --> I["Static Mapping: lat_lon (geo_point)"]
H --> J[Enable date_detection]
H --> K[Dynamic Templates Start]
K --> L1["*_int → integer"]
K --> L2["*_flt → float"]
K --> L3["*_tks → text + scripted_sim"]
K --> L4["*_512_vec → knn_vector (dim 512)"]
K --> L5["Regex Matches (keyword, date, string types)"]
K --> L6["*_nst → nested"]
K --> L7["*_obj → object"]
K --> L8["*_bin → binary"]
L1 & L2 & L3 & L4 & L5 & L6 & L7 & L8 --> M[End: Mapping Complete]
Summary
The os_mapping.json file is a powerful Elasticsearch index configuration enabling:
Efficient shard and refresh control.
Custom similarity scoring with a scripted formula.
Rich dynamic field mapping based on field name patterns.
Support for advanced data types including geospatial points, nested objects, rank features, and vector similarity search.
It is ideal for systems requiring flexible, high-performance search indexes with complex and evolving schemas.