Introduction

Elasticsearch is a search server built on Apache Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful API. Elasticsearch is written in Java and released as open source under the terms of the Apache License. It is a popular enterprise search engine that lets you explore your data at unprecedented speed and scale. It is used for full-text search, structured search, analytics, and combinations of all three:

  • Wikipedia uses Elasticsearch to provide full-text search with highlighted snippets, as well as search-as-you-type and did-you-mean suggestions.
  • The Guardian uses Elasticsearch to combine social network data with visitor logs to provide real-time feedback to its editors on public response to new articles.
  • Stack Overflow incorporates geolocation queries into full-text search and uses the more-like-this interface to find related questions and answers.
  • GitHub uses Elasticsearch to query over 130 billion lines of code.

Elasticsearch doesn’t just serve giant companies. It has also helped startups like Datadog and Klout prototype ideas and turn them into scalable solutions. Elasticsearch can run on your laptop or scale out to hundreds of servers handling petabytes of data.

Origins

Many years ago, a newly married, unemployed developer named Shay Banon followed his wife to London, where she was studying to be a chef. While looking for a job, he started using an early version of Lucene to build a recipe search engine for her.

Using Lucene directly was difficult, so Shay began creating an abstraction layer that would make it simple for Java developers to add search functionality to their programs. He released his first open-source project, Compass.

Later, Shay took a job in a high-performance, distributed environment with in-memory data grids. The need for a high-performance, real-time, distributed search engine became obvious, so he decided to rewrite Compass as a standalone service and name it Elasticsearch.

The first public version was released in February 2010. Since then, Elasticsearch has become one of the most active projects on GitHub, with over 300 contributors (736 at the time of writing). A company has been formed to provide commercial services around Elasticsearch and develop new features, but Elasticsearch will always remain open source and available to everyone.

It’s said that Shay’s wife is still waiting for her recipe search engine…

Terminology

cluster

Represents a group of nodes, one of which is the master node. The master is elected, and the distinction between the master and the other nodes exists only inside the cluster. Elasticsearch is decentralized in the sense that, from an external perspective, there is no central node: the cluster is logically a single whole, and communicating with any node is equivalent to communicating with the entire cluster.

shards

Represents index shards. Elasticsearch can split a complete index into multiple shards and distribute them across different nodes, which is what makes distributed search possible. The number of primary shards is set when the index is created and cannot be changed in place afterward; changing it means building a new index, for example by reindexing.
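The fixed shard count follows from how documents are routed: Elasticsearch picks a shard by hashing a routing value (the document ID by default) and reducing it by the number of primary shards. The sketch below only illustrates that idea in its simplest form. It assumes zlib.crc32 as a stand-in hash, which is not the hash Elasticsearch uses, and pick_shard and moved_documents are hypothetical helper names.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number (illustrative only)."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # crc32 is a stand-in hash chosen for this sketch, not Elasticsearch's own.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def moved_documents(doc_ids, old_count: int, new_count: int):
    """Return the IDs whose shard would change if the shard count changed."""
    return [d for d in doc_ids if pick_shard(d, old_count) != pick_shard(d, new_count)]


doc_ids = [f"doc-{i}" for i in range(1000)]
moved = moved_documents(doc_ids, 5, 6)
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the shard count, changing the count would send lookups for existing documents to the wrong shard, which is why the data has to be rewritten into a new index instead.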

replicas

Represents index replicas. Elasticsearch can keep multiple replicas of an index. Replicas serve two purposes: first, they improve fault tolerance, so that when a shard on a node is damaged or lost it can be recovered from a replica; second, they improve query throughput, because Elasticsearch automatically load-balances search requests across the copies.
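As a rough illustration of how shards and replicas are configured together, the sketch below builds the JSON body you would send when creating an index. number_of_shards and number_of_replicas are the actual index setting names; the helper functions and the example values are assumptions made for this post.

```python
import json


def index_settings(primary_shards: int, replicas: int) -> dict:
    """Build a create-index request body with shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary has `replicas` extra copies, so count primaries plus replicas."""
    return primary_shards * (1 + replicas)


body = index_settings(primary_shards=3, replicas=1)
print(json.dumps(body, indent=2))
print(total_shard_copies(3, 1))  # 3 primaries + 3 replica shards = 6
```

Unlike the number of primary shards, the replica count is a setting that can be adjusted on an existing index.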

recovery

Represents data recovery or redistribution. When nodes join or leave, Elasticsearch redistributes index shards based on machine load. Data recovery also occurs when a failed node restarts.

river

Represents an Elasticsearch data source: a way to synchronize data from other storage systems (such as databases) into Elasticsearch. A river runs as a plugin service inside Elasticsearch, reading data from the source and indexing it. Official rivers included CouchDB, RabbitMQ, Twitter, and Wikipedia. Note that rivers were deprecated in Elasticsearch 1.5 and removed in 2.0, so they do not apply to the 7.x version installed later in this post.

gateway

Represents how Elasticsearch index snapshots are stored. By default, Elasticsearch first keeps index data in memory and persists it to local disk when memory is full. The gateway stores index snapshots, and when the cluster is shut down and restarted, it reads the index backup data from the gateway. Elasticsearch supports several gateway types, including the local file system (default), distributed file systems, Hadoop’s HDFS, and Amazon’s S3 cloud storage service.

discovery.zen

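Represents Elasticsearch’s automatic node discovery mechanism. Elasticsearch is a P2P-based system that first broadcasts to find existing nodes, then communicates between nodes via multicast, and also supports point-to-point (unicast) interaction. This describes older versions; in the 7.x release used in this post, nodes find each other through the seed hosts listed in discovery.seed_hosts.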

Transport

Represents how Elasticsearch nodes interact with each other and with clients. By default, TCP is used internally, but HTTP (JSON format), Thrift, Servlet, Memcached, ZeroMQ, and other transport protocols are also supported, integrated through plugins.

Installation

I’ll use Docker for installation. Visit https://hub.docker.com and search for Elasticsearch to find the official image.

$ docker pull elasticsearch:7.4.2

Note: The official Elasticsearch image repository does not provide a latest tag, so running docker pull elasticsearch on its own will fail. You must specify a version tag when pulling.

Single Node Mode

# Create a Docker network named betterde
$ docker network create betterde

$ docker run -d \
  --name elasticsearch \
  --net betterde \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  elasticsearch:7.4.2

Cluster Mode

Here we use Docker Compose to deploy an Elasticsearch cluster. Below is the docker-compose.yml:

version: '2.2'
services:
  es01:
    image: elasticsearch:7.4.2
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - betterde
  es02:
    image: elasticsearch:7.4.2
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    networks:
      - betterde
  es03:
    image: elasticsearch:7.4.2
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - betterde

volumes:
  data01:
    driver: local
  data02:
    driver: local
  data03:
    driver: local

networks:
  betterde:
    name: betterde
    driver: bridge

$ docker-compose up
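The three services in the Compose file follow one pattern: each node lists the other two as seed hosts and all three as initial master nodes. The sketch below regenerates those environment entries to make the pattern explicit. node_environment is a hypothetical helper written for this post, not part of Docker or Elasticsearch.

```python
def node_environment(node: str, all_nodes: list, cluster_name: str) -> list:
    """Return the discovery-related environment entries for one node."""
    # A node seeds discovery from every other node, not from itself.
    others = [n for n in all_nodes if n != node]
    return [
        f"node.name={node}",
        f"cluster.name={cluster_name}",
        f"discovery.seed_hosts={','.join(others)}",
        f"cluster.initial_master_nodes={','.join(all_nodes)}",
    ]


nodes = ["es01", "es02", "es03"]
for name in nodes:
    print(name, node_environment(name, nodes, "es-docker-cluster"))
```

Adding a fourth node would mean extending the nodes list here, and adding a matching service and volume to the Compose file.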

Note: In a production environment, set the kernel parameter vm.max_map_count on the Docker host to at least 262144, for example with sysctl -w vm.max_map_count=262144.
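If you want to verify the current value before starting the cluster, a small check like the following may help. It is a sketch that assumes a Linux host, where the setting is exposed at /proc/sys/vm/max_map_count; meets_minimum and check_host are hypothetical helper names.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum from the note above
PROC_PATH = Path("/proc/sys/vm/max_map_count")  # Linux only


def meets_minimum(raw_value: str, minimum: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the raw file content is at least the required minimum."""
    return int(raw_value.strip()) >= minimum


def check_host() -> str:
    """Describe whether this host satisfies the vm.max_map_count requirement."""
    if not PROC_PATH.exists():
        return "vm.max_map_count not readable here (not a Linux host?)"
    raw = PROC_PATH.read_text()
    if meets_minimum(raw):
        return f"ok: vm.max_map_count={raw.strip()}"
    return f"too low: vm.max_map_count={raw.strip()}, need >= {REQUIRED_MAX_MAP_COUNT}"


print(check_host())
```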

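Now use curl to access http://localhost:9200. If all goes well, you should get a response like the one below. (This sample appears to come from the single-node container; with the Compose cluster, name and cluster_name would show es01 and es-docker-cluster.)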

{
  "name" : "9affc1c058b7",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "vqP1MWdlQxWT8xwJ9A-FyA",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

This indicates that Elasticsearch is now running.
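The same check can be scripted. The sketch below parses the root response and reports the cluster name, node name, and version. It assumes the node is reachable at http://localhost:9200 without authentication, as in the setup above; summarize_root_response and fetch_summary are hypothetical helper names, and the sample string is a trimmed copy of the response shown earlier.

```python
import json
import urllib.request


def summarize_root_response(body: str) -> str:
    """Extract cluster name, node name, and version from the root response."""
    info = json.loads(body)
    return (
        f'{info["cluster_name"]}/{info["name"]} '
        f'running Elasticsearch {info["version"]["number"]}'
    )


def fetch_summary(url: str = "http://localhost:9200", timeout: float = 5.0) -> str:
    """Fetch the root endpoint of a running node and summarize it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_root_response(response.read().decode("utf-8"))


# Trimmed copy of the sample response, so the parser can be tried offline.
sample = (
    '{"name": "9affc1c058b7", "cluster_name": "docker-cluster", '
    '"version": {"number": "7.4.2"}}'
)
print(summarize_root_response(sample))
# prints "docker-cluster/9affc1c058b7 running Elasticsearch 7.4.2"
```

With a node running, calling fetch_summary() performs the actual HTTP request; it raises an exception from urllib if nothing is listening on that port.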

Conclusion

As you can see, Elasticsearch is powerful, yet simple for beginners to deploy. The same holds when deploying other Elastic Stack products. I’ll publish a separate post introducing Kibana, a companion component, so stay tuned.

I hope this is helpful. Happy hacking…