Products Solutions Contact TR

Real-Time Big Data Reports with ClickHouse

 
 

In today’s technology, there are thousands and millions of data streams per second. Everything is now done digitally, including smartphones, smartwatches, wristbands, beacon technologies, money transfers, and millions of data as a result of this digitalization.

 

In solutions with legacy databases, growing data increased the processing power of the servers, and increased computing power resulted in more servers, more space, and more cost. As this problem required a solution, methods of approach to the database have also changed.

 

At this stage, there has been a change in the way we access stored data (access through more than one layer – cache mechanisms). It created the need for the development of immediate solutions for data storage and improvements to ensure continuous access. Eventually, some new systems have been developed to store the data with different methods, enabling quicker access through various layers.

 

When we started experiencing problems with understanding the data, storing it according to its content, accessing it, and doing all these instantaneously for millions of data, ClickHouse emerged as the solution of many of our problems.

 

What is ClickHouse?

 

ClickHouse developed for Yandex.Metrica, is a column-based, distributed, parallel-processing, horizontally expandable OLAP (Online Analytical Processing) Database Management System (DBMS) that stores data on disk or memory and supports SQL (has some differences with standard SQL though), but has no actual update/delete and transaction support.

 

➡ It is an open source and free project.

➡ According to Yandex’ statement, it was used in the second largest OLAP project in the world. There is no transaction support because it is OLAP.

➡ In the case of query problems, there are different mechanisms for load balancing and data consistency than other solutions.

➡ With DDL support, queries practically work in all connected servers.

➡ It provides table structure level support for processes such as data consistency, speed, selection and merging, and distributed work.

➡ You notice the benefits and high speed from the moment you start to use it, especially when the data grows.

➡ In terms of technical infrastructure;

 
  • Developed with C ++.
  • JDBC driver is presented as natural.
  • It can work over TCP / IP or HTTP to process data.
  • It works on Linux server but they recommended to use Ubuntu 12.04 and above.
  • If a different distro is to be used, the x86_64 architecture should support SSE 4.2.
 

Basic Differences with SQL

 

ClickHouse inherently supports SQL. However, it is different from the SQL statements we know.

 

For example; To get a sample from the same last name surname in SQL distinct first_name should be written, while ClickHouse expects it as uniq(first_name). (However, there is no obligation to write that way because ClickHouse expects it. You can do the “distinct” process as you know.) The basis of this difference is the concern for speed.

 

As you start to use it, you will soon notice C ++, the heart of the project, in the queries. The biggest difference in the queries is that the words that will be used for aggregation are written in the function logic rather than in the statement, and for ALTER operations, it uses a different syntax.

 

Data Types

 

In contrast to the data types we know, ClickHouse works on the data types of the language it is written. For example, let’s take the ‘integer’ transaction used in MySQL.

 
Source: https://dev.mysql.com/doc/refman/8.0/en/integer-types.html    
Source: https://clickhouse-docs.readthedocs.io/en/latest/data_types/int_uint.html

You can see the syntax differences above.

 

When you create your tables from scratch or migrate your existing tables to ClickHouse, you can choose the appropriate one for you in the documentation and use in your project.

 

Table Structures

 

If you have previously worked with MySQL, Cassandra or a similar DBMS, you know that they have a limited number of table structures, whereas you’ll see more table structures in ClickHouse. ClickHouse has several different table structure engine families such as Distributed, Merge, MergeTree, *MergeTree, Log, TinyLog, Memory, Buffer, Null, File. The most used are Distributed, Memory, MergeTree, and their sub-engines.

 

Distributed actually works as a view, rather than a complete table structure. In definition, it is a structure, but it does not contain any data.

 

“MergeTree” engine family provide solutions to our major problems. MergeTree has created sub-engines such as ReplicatedMergeTree, AggregatingMergeTree, SummingMergeTree for more comprehensive solutions.

 

Example Scenario: Collection of user-based content click data

 
 

Desired Outcome: To calculate how many times users clicked on a content (A user may click multiple times to the same point)

 

Problem: This calculation will create extra data in the horizontal tables, thus increasing the duration of the process.

 

Solution: In such cases, ClickHouse sums the same type of records together using the “SummingMergeTree” table structure. This way, it eliminates duplicate records thereby performing a faster calculation.

 

Now let’s see how ClickHouse provides a solution for the big data, our main problem.

 

Using of Cluster

 

We have mentioned that data proliferation also increases costs. Data is generally stored in rows. The drawback here is that as the data grows, you need to increase the vertical capacity because all lines in a row must be examined in the calculation process. However, in column-based systems where data is kept in columns, you can scale horizontally by adding similar servers when your data grows.

 

Rather than the usual way, we do this by grouping the data into a “cluster” and expand the cluster when the data grows.

 

ClickHouse offers us an option for this:

 

ReplicatedMergeTree” table structure makes it possible to copy data by creating nodes in the same cluster. This way, the accuracy of our result doesn’t change even when an element in a shard is working.

 

“Distributed” table structure acts as the aggregation of these shards. This way, we can get results without considering how data is distributed across the shards (they can be distributed completely randomly). In addition, when you create distributed shards, each shard can query itself and provide faster results. It is recommended to have a maximum of 300 nodes as indicated in the documentation.

 

DDL Queries

 

When you work in a cluster structure, you can not know if your server is always running. Maintenance and management are difficult. Think about this: Your server will make software and hardware upgrades, and it will not be accessible for a while. However, there will be no interruption in the data. In this case, because you do not know which server is running, there will be changes in your queries.

 

For example; In the past, you were making requests to a single IP or domain, now you should make requests to the solid IP / domain. Checking which one is running creates extra cost and requires administration.

 

ClickHouse approaches our situation as below:

 
“You give me a list of nodes in the cluster, and I’ll write the data to the running one, for you. I’ll even distribute it to the others. But on one condition: you must specify this in your queries.”

That’s why ClickHouse starts the job at the server level and end at the query level.

 

So how do we specify this in the server configuration?

 

Defining of Cluster

 

Things to pay attention:

 
  1. While shards can include replicas, replicas do not include shards.
  2. The number of replicas can be independent for the shard it is located. It can specify how much data will be processed and sent to the next shard. Using ON CLUSTER {custom cluster} SQL statement in your queries makes it run on that cluster. With this method, you can write all of your data to any running node in the cluster, without knowing which cluster has how many shards or replicas.
  3. Don’t think that ClickHouse can solely solve everything. ClickHouse needs the Zookeeper service when doing this, and the description is as follows;
 
 
<yandex>
. . .
<remote_servers>
    <custom_cluster>
        <shard>
            <weight>1</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-01-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-01-2</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <weight>2</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-02-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-2</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-3</host>
                <port>9000</port>
            </replica>
        </shard>
    </custom_cluster>
</remote_servers>
. . .
<zookeeper>
    <node index="1">
        <host>example1</host>
        <port>2181</port>
    </node>
    <node index="2">
        <host>example2</host>
        <port>2181</port>
    </node>
    <node index="3">
        <host>example3</host>
        <port>2181</port>
    </node>
</zookeeper>
. . .
</yandex>

 

Important Notes

 

➡ If you have used a method that requires Zookeeper in your solution, your Zookeeper server must always be running to be able to run your ClickHouse server. Otherwise, your server will seem to be running, but in fact, it will not be. When you check the logs, you will see that it cannot connect to Zookeeper.

 

➡ Because DDL queries run on Zookeeper, if you add a new node to your server and define it in the cluster, ClickHouse will automatically synchronize the data for the replica. However, it is not recommended to add new replicas to shards that contain a high number of records. So, you should have a plan before preparing the cluster.

 

Summary

Not every project can meet every need. Therefore, it is important to gain experience in various projects. You can find out the most appropriate method and solution for your project by analyzing your needs right from the beginning. ClickHouse will solve your problems if your requirements match the criteria specified in its documentation.

Author: Fatih Çetin
%d bloggers like this: