Customers Contact TR

Real-Time Big Data Reports with ClickHouse

 
 

In today’s technology, there are thousands and millions of data streams per second. Everything is now done digitally, including smartphones, smartwatches, wristbands, beacon technologies, money transfers, and millions of data resulting from this digitalization.

 

In solutions with legacy databases, growing data increased the servers’ processing power, and increased computing power resulted in more servers, more space, and more cost. As this problem required a solution, methods of approach to the database have also changed.

 

At this stage, there has been a change in the way we access stored data (access through more than one layer – cache mechanisms). It created the need to develop immediate solutions for data storage and improvements to ensure continuous access. Eventually, some new systems have been designed to store the data with different methods, enabling quicker access through various layers.

 

When we started experiencing problems with understanding the data, storing it according to its content, accessing it, and doing all these instantaneously for millions of data, ClickHouse emerged as the solution to many of our problems.

 

What is ClickHouse?

 

ClickHouse developed for Yandex.Metrica is a column-based, distributed, parallel-processing, horizontally expandable OLAP (Online Analytical Processing) Database Management System (DBMS). It stores data on disk or memory and supports SQL (has some differences with standard SQL, though) but has no actual update/delete and transaction support.

 

➡ It is an open-source and free project.

➡ According to Yandex’s statement, it was used in the second largest OLAP project in the world. There is no transaction support because it is OLAP.

➡ There are different mechanisms for load balancing and data consistency in the case of query problems than other solutions.

➡ With DDL support, queries practically work in all connected servers.

➡ It provides table structure level support for processes such as data consistency, speed, selection and merging, and distributed work.

➡ You notice the benefits and high speed from when you start to use it, especially when the data grows.

➡ In terms of technical infrastructure;

 
  • It is developed with C ++.
  • JDBC driver is presented as natural.
  • It can work over TCP / IP or HTTP to process data.
  • It works on the Linux server, but they recommended to use Ubuntu 12.04 and above.
  • If a different distro is to be used, the x86_64 architecture should support SSE 4.2.
 

Basic Differences with SQL

 

ClickHouse inherently supports SQL. However, it is different from the SQL statements we know.

 

For example; To get a sample from the same last name surname in SQL distinct first_name should be written, while ClickHouse expects it as uniq(first_name). (However, there is no obligation to write that way because ClickHouse expects it. You can do the “distinct” process, as you know.) The basis of this difference is the concern for speed.

 

As you start to use it, you soon notice C ++, the heart of the project, in the queries. The most significant difference in the queries is that the words used for aggregation are written in function logic rather than in statement. For ALTER operations, it uses a different syntax.

 

Data Types

 

In contrast to the data types we know, ClickHouse works on the data types of the written language. For example, let’s take the ‘integer’ transaction used in MySQL.

 
Source: https://dev.mysql.com/doc/refman/8.0/en/integer-types.html    
Source: https://clickhouse-docs.readthedocs.io/en/latest/data_types/int_uint.html

You can see the syntax differences above.

 

When you create your tables from scratch or migrate your existing tables to ClickHouse, you can choose the appropriate one for you in the documentation and use it in your project.

 

Table Structures

 

If you have previously worked with MySQL, Cassandra, or a similar DBMS, you know that they have a limited number of table structures, whereas you’ll see more table structures in ClickHouse. ClickHouse has several different table structure engine families such as Distributed, Merge, MergeTree, *MergeTree, Log, TinyLog, Memory, Buffer, Null, File. The most used are Distributed, Memory, MergeTree, and their sub-engines.

 

Distributed actually works as a view rather than a complete table structure. In definition, it is a structure, but it does not contain any data.

 

“MergeTree” engine family provides solutions to our significant problems. MergeTree has created sub-engines such as ReplicatedMergeTree, AggregatingMergeTree, SummingMergeTree for more comprehensive solutions.

 

Example Scenario: Collection of user-based content click data

 
 

Desired Outcome: To calculate how many times users clicked on a content (A user may click multiple times to the same point)

 

Problem: This calculation creates extra data in the horizontal tables, thus increasing the duration of the process.

 

Solution: In such cases, ClickHouse sums the same type of records together using the “SummingMergeTree” table structure. This way, it eliminates duplicate records, thereby performing a faster calculation.

 

Now let’s see how ClickHouse provides a solution for the big data, our main problem.

 

Using of Cluster

 

We have mentioned that data proliferation also increases costs. Data is generally stored in rows. The drawback here is that as the data grows, you need to increase the vertical capacity because all lines in a row must be examined in the calculation process. However, in column-based systems where data is kept in columns, you can scale horizontally by adding similar servers when your data grows.

 

Rather than the usual way, we group the data into a “cluster” and expand it when the data grows.

 

ClickHouse offers us an option for this:

 

ReplicatedMergeTree” table structure makes it possible to copy data by creating nodes in the same cluster. This way, our result’s accuracy doesn’t change even when an element in a shard is working.

 

“Distributed” table structure acts as the aggregation of these shards. This way, we can get results without considering how data is distributed across the shards (they can be distributed entirely randomly). Besides, when you create distributed shards, each shard can query itself and provide faster results. It is recommended to have a maximum of 300 nodes, as indicated in the documentation.

 

DDL Queries

 

When you work in a cluster structure, you can not know if your server is always running. Maintenance and management are challenging. Think about this: Your server makes software and hardware upgrades, and it is not accessible for a while. However, there is no interruption in the data. In this case, because you do not know which server is running, there are changes in your queries.

 

For example, in the past, you were making requests to a node connected to a single IP or domain; however, now you should make requests to a working set of nodes connected to several IPs or domains. Usually, controlling which is working requires extra cost and administration.

 

ClickHouse approaches our situation as below:

 
“You give me a list of nodes in the cluster, and I’ll write the data to the running one for you. I’ll even distribute it to the others. But on one condition: you must specify this in your queries.”
 

That’s why ClickHouse starts the job at the server level and ends at the query level.

 

So how do we specify this in the server configuration?

 

Defining of Cluster

 

Things to pay attention to:

 
  1. While shards can include replicas, replicas do not include shards.
  2. The number of replicas can be independent of the shard it is located. It can specify how much data is processed and sent to the next shard. Using ON CLUSTER {custom cluster} SQL statement in your queries makes it run on that cluster. With this method, you can write all of your data to any running node in the cluster without knowing which cluster has how many shards or replicas.
  3. Do not think that ClickHouse can solely solve everything. ClickHouse needs the Zookeeper service when doing this, and the description is as follows;
 

<yandex>
. . .
<remote_servers>
    <custom_cluster>
        <shard>
            <weight>1</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-01-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-01-2</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <weight>2</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-02-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-2</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-3</host>
                <port>9000</port>
            </replica>
        </shard>
    </custom_cluster>
</remote_servers>
. . .
<zookeeper>
    <node index="1">
        <host>example1</host>
        <port>2181</port>
    </node>
    <node index="2">
        <host>example2</host>
        <port>2181</port>
    </node>
    <node index="3">
        <host>example3</host>
        <port>2181</port>
    </node>
</zookeeper>
. . .
</yandex>

Important Notes

 

➡ If you have used a method that requires Zookeeper in your solution, your Zookeeper server must always be running to run your ClickHouse server. Otherwise, your server pretends to be running, but in fact, it is not. When you check the logs, you see that it cannot connect to Zookeeper.

 

➡ Because DDL queries run on Zookeeper, if you add a new node to your server and define it in the cluster, ClickHouse automatically synchronizes the replica data. However, it is not recommended to add new replicas to shards that contain a high number of records. So, it would help if you had a plan before preparing the cluster.

 

Summary

Not every project can meet every need. Therefore, it is crucial to gain experience in various projects. You can find the most appropriate method and solution for your project by analyzing your needs right from the beginning. ClickHouse solves your problems if your requirements match the criteria specified in its documentation.


Author: Fatih Çetin

Date Published: Sep 21, 2018