mapping.json
Overview
The mapping.json file defines the configuration and schema mappings for an Elasticsearch index. It specifies how documents are indexed, stored, and searched within the Elasticsearch cluster. This configuration includes index settings like shard count and refresh interval, a custom similarity scoring script for text fields, and detailed dynamic mapping templates that automatically map fields based on their names and patterns to appropriate Elasticsearch data types.
This file is crucial for shaping the behavior of the Elasticsearch index to optimize search relevance, storage efficiency, and query performance. It is typically used during index creation or update via the Elasticsearch REST API.
Detailed Explanation
Root Structure
settings: Contains index-level settings such as shard count, replica count, refresh interval, and custom similarity algorithms.
mappings: Defines the schema of the documents, including field types and dynamic templates for automatic type inference.
Settings Section
1. index
number_of_shards: integer (value: 2)
Defines how many primary shards the index will have. Sharding enables horizontal scaling.
number_of_replicas: integer (value: 0)
Number of replica shards per primary shard. With zero replicas, the index has no redundancy.
refresh_interval: string (value: "1000ms")
How often the index is refreshed so that newly indexed documents become searchable. 1000ms is 1 second.
2. similarity
Defines custom similarity scoring methods used during full-text search.
scripted_sim
type: "scripted"
script.source: a custom Java-like script that calculates the score from three factors:
Inverse document frequency (IDF), calculated as:
idf = log(1 + (field.docCount - term.docFreq + 0.5) / (term.docFreq + 0.5)) / log(1 + ((field.docCount - 0.5) / 1.5))
Frequency factor: min(doc.freq, 1)
Query boost factor: query.boost
Purpose: This custom similarity script modifies the scoring to prioritize terms that are rarer across the dataset, with a cap on frequency contribution to avoid over-weighting repeated terms. This improves relevance by emphasizing unique terms.
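To make the formula concrete, here is a minimal Python sketch of the score for a single term in a single document. It assumes the three factors listed above are multiplied together, which this document does not state explicitly; the function and parameter names are hypothetical, and the real script runs inside Elasticsearch rather than in Python.

```python
import math


def scripted_sim_score(doc_count, doc_freq, term_freq, boost=1.0):
    """Score one term for one document using the factors described above.

    doc_count: number of documents that have the field (field.docCount)
    doc_freq:  number of documents containing the term (term.docFreq)
    term_freq: occurrences of the term in this document (doc.freq)
    boost:     query-time boost (query.boost)
    """
    # Raw IDF: larger when the term appears in fewer documents.
    idf_raw = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Normalizer: the raw IDF of a term that occurs in exactly one document.
    idf_norm = math.log(1 + (doc_count - 0.5) / 1.5)
    idf = idf_raw / idf_norm
    # Frequency factor capped at 1, so repeated occurrences add nothing.
    freq_factor = min(term_freq, 1)
    # Assumption: the three factors are combined by multiplication.
    return boost * idf * freq_factor
```

Under this reading, a term that occurs in only one document gets an IDF of 1, and more common terms get a smaller value. The sketch does not guard against an empty field (doc_count of 0).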
Mappings Section
1. properties
Defines explicit field mappings:
lat_lon
type: "geo_point"
store: "true"
Stores geographical coordinates with indexing optimized for geo-spatial queries.
2. date_detection
"true"
Enables auto-detection and parsing of date fields in documents.
3. dynamic_templates
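Dynamic templates map new fields automatically based on field-name patterns, written either as simple wildcards or as regular expressions. Each template has a name, a match condition, and the mapping to apply to matching fields.
Below are key templates: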
| Template Name | Match Pattern | Mapping Type | Notable Parameters | Description |
|---|---|---|---|---|
| int | *_int | integer | store: true | Integer fields (names ending with _int) |
| ulong | *_ulong | unsigned_long | store: true | Unsigned long integers |
| long | *_long | long | store: true | Signed long integers |
| short | *_short | short | store: true | Short integers |
| numeric | *_flt | float | store: true | Floating-point numbers |
| tks | *_tks | text | analyzer: whitespace, similarity: scripted_sim, store: true | Tokenized text fields with custom similarity |
| ltks | *_ltks | text | analyzer: whitespace, store: true | Tokenized text fields without custom similarity |
| kwd | regex ^(.*_(kwd\|id\|ids\|uid\|… | keyword | similarity: boolean | Keyword fields |
| dt | regex ^.*(_dt\|_time\|_at)$ | date | multiple date formats | Date fields |
| nested | *_nst | nested | | Nested objects for complex hierarchies |
| object | *_obj | object | dynamic: true | JSON-like objects with dynamic fields |
| string | regex ^.*_(with_weight\|list)$ | text | index: false, store: true | Text that is stored but not indexed |
| rank_feature | *_fea | rank_feature | | Fields to be used for ranking features |
| rank_features | *_feas | rank_features | | Sets of ranking features |
| dense_vector | *_512_vec | dense_vector | dims: 512, similarity: cosine, index: true | 512-dimensional dense vectors for similarity search |
| dense_vector | *_768_vec | dense_vector | dims: 768, similarity: cosine, index: true | 768-dimensional dense vectors |
| dense_vector | *_1024_vec | dense_vector | dims: 1024, similarity: cosine, index: true | 1024-dimensional dense vectors |
| dense_vector | *_1536_vec | dense_vector | dims: 1536, similarity: cosine, index: true | 1536-dimensional dense vectors |
| binary | *_bin | binary | | Binary data fields |
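The matching behavior of these templates can be approximated outside Elasticsearch. The following Python sketch checks a field name against the patterns in listed order and returns the first match, mirroring how Elasticsearch applies the first matching dynamic template. The wildcard patterns are taken from the diagram later in this document; the kwd template is omitted because its regex is truncated here. The helper name and table layout are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# (template name, pattern kind, pattern, mapped type), in listed order.
# The kwd template is left out: its regex is truncated in this document.
TEMPLATES = [
    ("int", "glob", "*_int", "integer"),
    ("ulong", "glob", "*_ulong", "unsigned_long"),
    ("long", "glob", "*_long", "long"),
    ("short", "glob", "*_short", "short"),
    ("numeric", "glob", "*_flt", "float"),
    ("tks", "glob", "*_tks", "text"),
    ("ltks", "glob", "*_ltks", "text"),
    ("dt", "regex", r"^.*(_dt|_time|_at)$", "date"),
    ("nested", "glob", "*_nst", "nested"),
    ("object", "glob", "*_obj", "object"),
    ("string", "regex", r"^.*_(with_weight|list)$", "text"),
    ("rank_feature", "glob", "*_fea", "rank_feature"),
    ("rank_features", "glob", "*_feas", "rank_features"),
    ("dense_vector", "glob", "*_512_vec", "dense_vector"),
    ("dense_vector", "glob", "*_768_vec", "dense_vector"),
    ("dense_vector", "glob", "*_1024_vec", "dense_vector"),
    ("dense_vector", "glob", "*_1536_vec", "dense_vector"),
    ("binary", "glob", "*_bin", "binary"),
]


def resolve_template(field_name):
    """Return (template name, mapped type) for the first match, else None."""
    for name, kind, pattern, es_type in TEMPLATES:
        if kind == "glob":
            matched = fnmatchcase(field_name, pattern)
        else:
            matched = re.fullmatch(pattern, field_name) is not None
        if matched:
            return name, es_type
    return None


for field in ("price_int", "created_at", "embedding_512_vec", "title"):
    print(field, "->", resolve_template(field))
```

A check like this can be useful in an ingestion pipeline to flag field names that would not match any template before documents are sent to the index.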
Important Implementation Details
Custom Similarity Script:
The scripted similarity modifies the traditional TF-IDF approach by applying a custom IDF formula and capping the frequency contribution at 1, which mitigates term-frequency bias.
Dynamic Templates:
This approach allows flexible and automatic field typing based on naming conventions or regex patterns, avoiding the need for an explicit mapping for every new field.
Dense Vector Fields:
Support for multiple dense vector sizes and cosine similarity enables vector-based similarity searches (e.g., for machine learning embeddings).
Date Formats:
The date template accepts multiple formats, increasing robustness when ingesting data from different sources.
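For reference, the cosine similarity configured on the dense vector fields measures the angle between two vectors, independent of their lengths. The plain-Python sketch below is for illustration only: Elasticsearch computes this internally, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Cosine similarity is undefined for a zero-length vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.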
Interaction With Other System Components
Index Creation API:
This JSON is typically submitted to Elasticsearch when creating an index via the REST API (PUT /index_name with a body containing this JSON).
Search Queries:
The similarity script affects scoring during full-text search queries on fields mapped with scripted_sim.
Ingestion Pipelines:
Data ingestion processes must ensure fields follow the naming conventions so that the dynamic templates apply correctly.
Vector Search Modules:
Dense vector fields interact with ML models or vector search components that rely on cosine similarity for nearest neighbor queries.
Query DSL:
Users or applications querying this index will leverage the field types (keyword, date, nested, etc.) defined here.
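As an illustration of the vector search interaction, the sketch below builds a request body for a nearest neighbor query against one of the dense vector fields. It assumes an Elasticsearch 8.x cluster and its knn search option, which this document does not confirm; the helper name is hypothetical, and the field name embedding_512_vec is borrowed from the usage example in this document.

```python
import json


def build_knn_query(field, query_vector, expected_dims, k=10, num_candidates=100):
    """Build a search body for an approximate nearest neighbor query."""
    if len(query_vector) != expected_dims:
        # The query vector must have as many dimensions as the field mapping.
        raise ValueError(
            f"{field} expects {expected_dims} dims, got {len(query_vector)}"
        )
    if k > num_candidates:
        raise ValueError("k must not exceed num_candidates")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


# Placeholder query vector; a real one would come from an embedding model.
body = build_knn_query("embedding_512_vec", [0.0] * 512, expected_dims=512)
print(json.dumps(body)[:80])
```

Checking the vector length up front avoids a rejected request when the embedding model and the field's dims setting disagree.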
Usage Examples
Example 1: Creating an index with this mapping
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d @mapping.json
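The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: the helper name is hypothetical, the base URL and index name mirror the curl command above, and the mapping shown is a trimmed stand-in for the real file (only the index settings described earlier).

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, mapping):
    """Prepare (but do not send) a PUT request that creates an index."""
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Trimmed stand-in for mapping.json; the real file also has similarity and mappings.
mapping = {
    "settings": {
        "index": {
            "number_of_shards": 2,
            "number_of_replicas": 0,
            "refresh_interval": "1000ms",
        }
    }
}

request = build_create_index_request("http://localhost:9200", "my_index", mapping)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it; that step is left out here because it requires a reachable cluster. In practice the mapping would be loaded from mapping.json with json.load.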
Example 2: Indexing a document
{
  "user_id": "12345_uid",
  "created_at": "2024-06-01 15:30:00",
  "lat_lon": "40.7128,-74.0060",
  "description_tks": "quick brown fox jumps",
  "embedding_512_vec": [0.12, 0.34, ..., 0.56] // 512 floats
}
Most fields are mapped dynamically according to the templates, e.g., user_id as keyword and created_at as date, while lat_lon matches the explicit geo_point property.
Visual Diagram
flowchart TD
A[Settings] --> B[Index Settings]
A --> C[Similarity]
B --> B1[number_of_shards: 2]
B --> B2[number_of_replicas: 0]
B --> B3[refresh_interval: 1000ms]
C --> C1[scripted_sim]
C1 --> C1a[Custom IDF script]
D[Mappings] --> E[Properties]
E --> E1[lat_lon: geo_point]
D --> F[Dynamic Templates]
F --> F1[int: *_int -> integer]
F --> F2[ulong: *_ulong -> unsigned_long]
F --> F3[long: *_long -> long]
F --> F4[short: *_short -> short]
F --> F5[numeric: *_flt -> float]
F --> F6[tks: *_tks -> text + scripted_sim]
F --> F7[ltks: *_ltks -> text]
F --> F8[kwd: regex -> keyword + boolean similarity]
F --> F9[dt: regex -> date with multiple formats]
F --> F10[nested: *_nst -> nested]
F --> F11[object: *_obj -> object]
F --> F12["string: regex -> text (not indexed)"]
F --> F13[rank_feature: *_fea -> rank_feature]
F --> F14[rank_features: *_feas -> rank_features]
F --> F15[dense_vector: *_512_vec -> 512 dims, cosine]
F --> F16[dense_vector: *_768_vec -> 768 dims, cosine]
F --> F17[dense_vector: *_1024_vec -> 1024 dims, cosine]
F --> F18[dense_vector: *_1536_vec -> 1536 dims, cosine]
F --> F19[binary: *_bin -> binary]
Summary
mapping.json is a comprehensive Elasticsearch index configuration file that governs index behavior, scoring, and field mapping. It leverages dynamic templates for flexible field typing, includes a custom similarity script to tailor search relevance, and supports advanced data types like geo points and dense vectors for modern search scenarios.
Correct use and maintenance of this file ensure efficient indexing, accurate search results, and seamless integration with other system components like query engines and ingestion pipelines.