Using executable external dictionaries in ClickHouse

External dictionaries, a dictionary populated by an external source, are a rather useful way to make data external to ClickHouse accessible when working in ClickHouse. One option for the source of the external data is an executable. I found, though, that the documentation doesn’t clearly tell you how to use this, so here I’m trying to rectify this.

There are two basic types of executable external dictionary, which I’ll call whole file and lookup dictionaries. Let’s look at each in turn. Oh, and I’m using ClickHouse version 18.5.1 for these examples.

Whole file dictionary

This is the most straightforward type. It is similar to an external dictionary populated from a local file. A command is run and the results read from the command standard output in one of the formats supported by ClickHouse.

Keeping things as simple as possible, let’s just use the command to cat a file to standard output. Here’s a simple input file with two fields separated by a tab:

$ cat /home/jim/test
1	One
2	Two
3	Three

And here’s an extract of the configuration for a simple flat dictionary.

    <source>
<executable>
<command>cat /home/jim/test</command>
<format>TabSeparated</format>
</executable>
</source>
<lifetime>3600</lifetime>
<layout>
<flat/>
</layout>
<structure>
<id>
<name>item</name>
</id>
<attribute>
<name>value</name>
<type>String</type>
<null_value></null_value>
</attribute>
</structure>

As you might guess, the command is run every hour and the results stored in the dictionary. Here’s a lookup on that dictionary.

SELECT dictGetString('test', 'value', toUInt64(1))

+-dictGetString('test', 'value', toUInt64(1))-+
| One                                         |
+---------------------------------------------+

1 rows in set. Elapsed: 0.002 sec.

If your key isnt’ a simple integer, you can switch to a complex_key_hashed dictionary.

    <source>
<executable>
<command>cat /home/jim/test</command>
<format>TabSeparated</format>
</executable>
</source>
<lifetime>3600</lifetime>
<layout>
<complex_key_hashed/>
</layout>
<structure>
<key>
<attribute>
<name>item</name>
<type>String</type>
</attribute&gt;
</key>
<attribute>
<name>value</name>
<type>String</type>
<null_value></null_value>
</attribute>
</structure>

And here’s a lookup on that dictionary.

SELECT dictGetString('test2', 'value', tuple('3'))

+-dictGetString('test2', 'value', tuple('3'))-+
| Three                                       |
+---------------------------------------------+

1 rows in set. Elapsed: 0.043 sec.

lookup dictionary

So far, so straightforward. But ClickHouse has another trick up its sleeve. If you specify a dictionary layout of complex_key_cache and a cache size, ClickHouse requests key items from the command by passing the key value into the command on its standard input. The command reads the key, and prints the dictionary record to its standard output.

Here’s an example looking at disc usage. I’m using a simple Python3 application sys-info to read the disc usage; here’s the core part of that application.

        statfs = os.statvfs('/home/jim')

        for l in sys.stdin.readlines():
            l = l.strip()
            if l == 'disc-block-size':
                val = statfs.f_bsize
            elif l == 'disc-size':
                val = statfs.f_blocks * statfs.f_frsize // 1024
            elif l == 'disc-available':
                val = statfs.f_bavail * statfs.f_frsize // 1024
            elif l == 'disc-percent-free':
                val = 100 *statfs.f_bavail // statfs.f_blocks
            elif l == 'disc-percent-used':
                val = 100 - (100 * statfs.f_bavail // statfs.f_blocks)
            else:
                print('Unknown key: {key}'.format(key=l))
                logging.error('Unknown key: {key}'.format(key=l))
                sys.exit(1)

            print('{key}\t{val}'.format(key=l, val=val))
        sys.exit(0)

Running it from the command line looks like this:

$ echo "disc-percent-used" | sys-info
disc-percent-used	6

Note that the first field returned has to be the index value. Let’s match this with a dictionary definition.

    <source>
<executable>
<command>sys-info</command>
<format>TabSeparated</format>
</executable>
</source>
<lifetime>3600</lifetime>
<layout>
<complex_key_cache>
<size_in_cells>20</size_in_cells>
</complex_key_cache>
</layout>
<structure>
<key>
<attribute>
<name>item</name>
<type>String</type>
</attribute>
</key>
<attribute>
<name>value</name>
<type>String</type>
<null_value></null_value>
</attribute>
</structure>

And now we’re set up for a query.

SELECT dictGetString('sysinfo', 'value', tuple('disc-percent-used'))
+-dictGetString('sysinfo', 'value', tuple('disc-percent-used'))-+
| 6                                                             |
+---------------------------------------------------------------+
1 rows in set. Elapsed: 0.087 sec.

Spread the word. Share this post!