Working with Materialized View tables in ClickHouse

There must be something about January which makes John prod me into a blog post about something I’ve just teased out. So here we are: it’s 2020, it’s January, and here’s what is fast (OK, not so fast) becoming an annual tradition.

Today’s post is a selection of snippets on Materialized Views. Materialized Views, if you haven’t met them, are tables automatically populated when data is inserted into some other table.

Materialized View inner tables

There are two ways to create a table populated from a Materialized View. The first, which we’ve used up to now, is like this:

CREATE MATERIALIZED VIEW QueriesPerSecond
(
    Date Date,
    DateTime DateTime,
    NodeID UInt16,
    QueryCount UInt32,
    QueryDOCount UInt32
)
ENGINE = SummingMergeTree()
PARTITION BY toYearWeek(Date)
ORDER BY (Date, DateTime, NodeID)
AS SELECT
    Date,
    DateTime,
    NodeID,
    CAST(1 as UInt32) AS QueryCount,
    CAST(QueryDO AS UInt32) AS QueryDOCount
FROM QueryResponse
WHERE QueryResponseHasQuery;

When records are added to table QueryResponse, the SELECT runs and the results are added to QueriesPerSecond.

Well, almost. If you SHOW TABLES, you’ll see there’s now QueriesPerSecond and another table named .inner.QueriesPerSecond. The latter table is where the records actually end up.
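
To see the flow end to end, here’s a quick sketch. I’m assuming QueryResponse has at least the columns the SELECT above touches; adjust to taste:

INSERT INTO QueryResponse (Date, DateTime, NodeID, QueryDO, QueryResponseHasQuery)
VALUES (today(), now(), 1, 1, 1);

SELECT * FROM QueriesPerSecond ORDER BY DateTime DESC LIMIT 5;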

How to refer to that inner table, a digression…

You need a little trick to refer to the inner table from SQL: enclose the table name in back-ticks. This also means that you can’t refer to inner tables in a different database to the current one.

So to verify that the inner table is indeed the table containing the data for the view, SELECT * FROM `.inner.QueriesPerSecond` LIMIT 10.

… back to Materialized Views

All this is fine and dandy. But what if you want to, say, rename a column in QueryResponse? How do you change the SELECT associated with the Materialized View? Well, you can drop and re-create the Materialized View, but if you drop the view then the inner table also gets dropped and you lose all the existing data. What you need to do instead is to find out where the definition is saved on the server:

:) select metadata_path from system.tables where name='QueriesPerSecond'
+-metadata_path--------------------------------------------+
| /var/lib/clickhouse/metadata/wombat/QueriesPerSecond.sql |
+----------------------------------------------------------+
:) detach table QueriesPerSecond

Head to the server, edit that .sql file to your taste, and reattach the table:

:) attach table QueriesPerSecond
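
In case you’re wondering what you’ll find in that file: it’s an ATTACH statement mirroring the original CREATE, so the SELECT you want to change is right there. Abbreviated, it looks something like this:

ATTACH MATERIALIZED VIEW QueriesPerSecond
(
    Date Date,
    DateTime DateTime,
    NodeID UInt16,
    QueryCount UInt32,
    QueryDOCount UInt32
)
ENGINE = SummingMergeTree()
PARTITION BY toYearWeek(Date)
ORDER BY (Date, DateTime, NodeID)
AS SELECT
    ...
FROM QueryResponse
WHERE QueryResponseHasQuery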

So, you can do it, but it’s not great. You can’t do it from SQL, and you need to log into the server and directly edit files in the ClickHouse working space.

Materialized View without inner tables

Here’s what we’ll be doing from now on. It turns out you can create a normal table to hold the results of the materialized view updates:

CREATE TABLE QueriesPerSecond
(
    Date Date,
    DateTime DateTime,
    NodeID UInt16,
    QueryCount UInt32,
    QueryDOCount UInt32
)
ENGINE = SummingMergeTree()
PARTITION BY toYearWeek(Date)
ORDER BY (Date, DateTime, NodeID);

CREATE MATERIALIZED VIEW QueriesPerSecondMV
TO QueriesPerSecond
AS SELECT
    Date,
    DateTime,
    NodeID,
    CAST(1 as UInt32) AS QueryCount,
    CAST(QueryDO AS UInt32) AS QueryDOCount
FROM QueryResponse
WHERE QueryResponseHasQuery;

To change the view SELECT, drop QueriesPerSecondMV and re-create it. Unlike the materialized view with the inner table we saw earlier, this won’t delete the underlying table. So now we can modify the materialized view query from SQL, rather than having to monkey with files on the server.
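
For example, suppose the QueryDO column in QueryResponse were renamed to QueryDNSSECOK (a made-up name, purely for illustration). Updating the view is now just:

DROP TABLE QueriesPerSecondMV;

CREATE MATERIALIZED VIEW QueriesPerSecondMV
TO QueriesPerSecond
AS SELECT
    Date,
    DateTime,
    NodeID,
    CAST(1 as UInt32) AS QueryCount,
    CAST(QueryDNSSECOK AS UInt32) AS QueryDOCount
FROM QueryResponse
WHERE QueryResponseHasQuery;

QueriesPerSecond and its data are untouched throughout.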

Convert from inner table Materialized View to a separate table Materialized View

So, can you convert from the inner table to a separate table Materialized View? Yes.

:) DETACH TABLE QueriesPerSecond
:) RENAME TABLE `.inner.QueriesPerSecond` TO QueriesPerSecond
:) CREATE MATERIALIZED VIEW QueriesPerSecondMV
TO QueriesPerSecond
AS SELECT
    Date,
    DateTime,
    NodeID,
    CAST(1 as UInt32) AS QueryCount,
    CAST(QueryDO AS UInt32) AS QueryDOCount
FROM QueryResponse
WHERE QueryResponseHasQuery;

Renaming the old inner table to the previous name of the view overwrites any old definition of that view, and we have a pure data table. Create a new materialized view to populate it, and we’re done.
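
A quick sanity check that the data survived the shuffle, no back-ticks required any more:

SELECT count() FROM QueriesPerSecond;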

Using executable external dictionaries in ClickHouse

External dictionaries, dictionaries populated from an external source, are a rather useful way to make data held outside ClickHouse accessible when working in ClickHouse. One option for the source of the external data is an executable. I found, though, that the documentation doesn’t clearly tell you how to use this, so here I’m trying to rectify that.

There are two basic types of executable external dictionary, which I’ll call whole file and lookup dictionaries. Let’s look at each in turn. Oh, and I’m using ClickHouse version 18.5.1 for these examples.

Whole file dictionary

This is the most straightforward type. It is similar to an external dictionary populated from a local file. A command is run, and the results are read from the command’s standard output in one of the formats supported by ClickHouse.

Keeping things as simple as possible, let’s just use the command to cat a file to standard output. Here’s a simple input file with two fields separated by a tab:

$ cat /home/jim/test
1	One
2	Two
3	Three

And here’s an extract of the configuration for a simple flat dictionary.

<source>
    <executable>
        <command>cat /home/jim/test</command>
        <format>TabSeparated</format>
    </executable>
</source>
<lifetime>3600</lifetime>
<layout>
    <flat/>
</layout>
<structure>
    <id>
        <name>item</name>
    </id>
    <attribute>
        <name>value</name>
        <type>String</type>
        <null_value></null_value>
    </attribute>
</structure>
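
For completeness: that extract lives inside a <dictionary> element whose <name> is the name you pass to the dictGet* functions, in a file picked up by the dictionaries_config setting in the server configuration. A skeleton:

<dictionaries>
    <dictionary>
        <name>test</name>
        <!-- <source>, <lifetime>, <layout> and <structure> as above -->
    </dictionary>
</dictionaries>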

As you might guess, the command is run every hour and the results stored in the dictionary. Here’s a lookup on that dictionary.

SELECT dictGetString('test', 'value', toUInt64(1))

+-dictGetString('test', 'value', toUInt64(1))-+
| One                                         |
+---------------------------------------------+

1 rows in set. Elapsed: 0.002 sec.

If your key isn’t a simple integer, you can switch to a complex_key_hashed dictionary.

<source>
    <executable>
        <command>cat /home/jim/test</command>
        <format>TabSeparated</format>
    </executable>
</source>
<lifetime>3600</lifetime>
<layout>
    <complex_key_hashed/>
</layout>
<structure>
    <key>
        <attribute>
            <name>item</name>
            <type>String</type>
        </attribute>
    </key>
    <attribute>
        <name>value</name>
        <type>String</type>
        <null_value></null_value>
    </attribute>
</structure>

And here’s a lookup on that dictionary.

SELECT dictGetString('test2', 'value', tuple('3'))

+-dictGetString('test2', 'value', tuple('3'))-+
| Three                                       |
+---------------------------------------------+

1 rows in set. Elapsed: 0.043 sec.

Lookup dictionary

So far, so straightforward. But ClickHouse has another trick up its sleeve. If you specify a dictionary layout of complex_key_cache and a cache size, ClickHouse requests key items from the command by passing the key value into the command on its standard input. The command reads the key, and prints the dictionary record to its standard output.

Here’s an example looking at disc usage. I’m using a simple Python3 application sys-info to read the disc usage; here’s the core part of that application.

import logging
import os
import sys

statfs = os.statvfs('/home/jim')

# ClickHouse writes the requested keys to standard input, one per line.
# For each key, print a TabSeparated row: the key followed by its value.
for l in sys.stdin.readlines():
    l = l.strip()
    if l == 'disc-block-size':
        val = statfs.f_bsize
    elif l == 'disc-size':
        val = statfs.f_blocks * statfs.f_frsize // 1024
    elif l == 'disc-available':
        val = statfs.f_bavail * statfs.f_frsize // 1024
    elif l == 'disc-percent-free':
        val = 100 * statfs.f_bavail // statfs.f_blocks
    elif l == 'disc-percent-used':
        val = 100 - (100 * statfs.f_bavail // statfs.f_blocks)
    else:
        print('Unknown key: {key}'.format(key=l))
        logging.error('Unknown key: {key}'.format(key=l))
        sys.exit(1)

    print('{key}\t{val}'.format(key=l, val=val))
sys.exit(0)

Running it from the command line looks like this:

$ echo "disc-percent-used" | sys-info
disc-percent-used	6
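
ClickHouse may ask for several keys in a single invocation, which is why the script loops over its standard input. You can simulate that from the shell too (the numbers are, of course, whatever your system reports):

$ printf 'disc-size\ndisc-available\n' | sys-info
disc-size	19553304
disc-available	18300572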

Note that the first field returned has to be the index value. Let’s match this with a dictionary definition.

    <source>
<executable>
<command>sys-info</command>
<format>TabSeparated</format>
</executable>
</source>
<lifetime>3600</lifetime>
<layout>
<complex_key_cache>
<size_in_cells>20</size_in_cells>
</complex_key_cache>
</layout>
<structure>
<key>
<attribute>
<name>item</name>
<type>String</type>
</attribute>
</key>
<attribute>
<name>value</name>
<type>String</type>
<null_value></null_value>
</attribute>
</structure>

And now we’re set up for a query.

SELECT dictGetString('sysinfo', 'value', tuple('disc-percent-used'))
+-dictGetString('sysinfo', 'value', tuple('disc-percent-used'))-+
| 6                                                             |
+---------------------------------------------------------------+
1 rows in set. Elapsed: 0.087 sec.

Selecting an interval from an integer column in PostgreSQL

I needed to create a select that combined a timestamp in one column with an interval specified in minutes from another column. After much manual reading and searching I found the answer on Stack Overflow. My SQL looked something like:

SELECT d.time AT TIME ZONE 'UTC' + interval '1 minute' * d.minute AS time FROM my_data d ...
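
To see the idea in isolation, here’s a self-contained version with the values inlined; the 90 stands in for the integer minutes column:

SELECT timestamp '2020-01-17 12:00:00' AT TIME ZONE 'UTC'
       + interval '1 minute' * 90 AS time;

which, with the session time zone set to UTC, gives 2020-01-17 13:30:00+00.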

Much faster than using Perl’s DateTime::Format::Pg and DateTime::Duration modules to do the addition.