Exploring OpenTSDB: Selected Design and Implementation Details

行无际 2020-04-07

- [OpenTSDB overview](#opentsdb-%E6%A6%82%E8%A7%88overview)
- [OpenTSDB storage details (Writing)](#opentsdb-%E5%AD%98%E5%82%A8%E7%BB%86%E8%8A%82writing)
- [rowkey design](#rowkey%E7%9A%84%E8%AE%BE%E8%AE%A1)
- [rowkey implementation](#rowkey%E7%9A%84%E5%85%B7%E4%BD%93%E5%AE%9E%E7%8E%B0)
- [Compaction](#%E5%8E%8B%E7%BC%A9compaction)
- [Append mode (appends)](#%E8%BF%BD%E5%8A%A0%E6%A8%A1%E5%BC%8Fappends)
- [OpenTSDB UID assignment](#opentsdb-uid%E7%9A%84%E5%88%86%E9%85%8Duid-assignment)
- [OpenTSDB query details (Reading)](#opentsdb-%E6%9F%A5%E8%AF%A2%E7%BB%86%E8%8A%82reading)
- [Adding a salt to the rowkey (Salting)](#rowkey%E4%B8%AD%E5%8A%A0salt%E7%9A%84%E6%83%85%E5%86%B5salting)
- [Other configuration (Configuration)](#%E5%85%B6%E4%BB%96%E9%85%8D%E7%BD%AEconfiguration)
- [HTTP API](#http%E6%8E%A5%E5%8F%A3http-api)
- [Connecting OpenTSDB to a Kerberos-authenticated HBase (not a focus, just noted here in passing)](#opentsdb%E8%BF%9E%E6%8E%A5kerberos%E8%AE%A4%E8%AF%81%E7%9A%84hbase%E9%9D%9E%E9%87%8D%E7%82%B9%E4%BB%85%E9%A1%BA%E6%89%8B%E8%AE%B0%E5%BD%95%E4%BA%8E%E6%AD%A4)
- [Step by step](#%E5%85%B7%E4%BD%93%E6%93%8D%E4%BD%9C)
- [Closing notes](#%E5%86%99%E5%9C%A8%E5%90%8E%E9%9D%A2)

Based on opentsdb-2.4.0, this post starts an exploration of OpenTSDB, mainly covering its `read/write characteristics` and some `other details`. The next post will turn to optimization, `performance tuning ideas and suggestions`: a summary of the current pain points, optimization ideas, and solutions. Better ideas and approaches from readers are welcome.

***Note: this post assumes a basic understanding of `HBase`: concepts such as `rowkey`, `region`, `store`, `ColumnFamily`, and `ColumnQualifier`, plus a rough picture of HBase's logical structure and physical storage layout.***

### OpenTSDB overview

![OpenTSDB overall architecture](https://img2020.cnblogs.com/blog/1546632/202004/1546632-20200405151735981-1960286423.png)

The figure above comes from the official site, `http://opentsdb.net/overview.html`. The `TSD` in it (the actual process name is `TSDMain`) is the OpenTSDB component. Every `TSD` instance is independent: there is no `master` and no shared state (`shared state`), so a production deployment may run several `TSD` instances behind `nginx` + `Consul` for `load balancing`.

> Each TSD uses the open source database HBase or hosted Google Bigtable service to store and retrieve time-series data

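
Most of us probably still use `HBase` as the data store. The installation and deployment post covered [creating the table schema in HBase](https://www.cnblogs.com/itwild/p/12528757.html#%E5%9C%A8hbase%E4%B8%AD%E5%88%9B%E5%BB%BA%E8%A1%A8%E7%BB%93%E6%9E%84). Here is a brief introduction to the 4 tables (`table`). The `tsdb` and `tsdb-uid` tables will become clearer as we dig deeper; `tsdb-meta` and `tsdb-tree` are not the focus here, so a quick look is enough. Related documentation: `http://opentsdb.net/docs/build/html/user_guide/backends/index.html`

- tsdb: All of OpenTSDB's time-series data lives in this table, which has a single column family (`ColumnFamily`) named "t". The table is therefore very large. In most cases read/write bottlenecks are closely tied to it, and so are the optimizations.

**The rowkey is designed as `an optional salt, the metric UID, a base timestamp and the UID for tagk/v pairs`, that is, [optional salt + metric UID + hour-level timestamp + UID key/value pairs of tagk and tagv in sorted order], as follows:**

```sh
[salt]<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
```

Set the `salt` aside for now; a later section looks at its design and implementation separately. Here is how the `rowkey` of a time series with no `salt` and two `tag`s is composed:

```sh
00000150E22700000001000001000002000004
'----''------''----''----''----''----'
metric  time   tagk  tagv  tagk  tagv
```

Why the `rowkey` is designed this way, and how it is implemented, is covered in detail later. For now a basic picture is enough.

- tsdb-uid: To shorten the `rowkey`, OpenTSDB maps every `metric`, `tagk`, and `tagv` to a `UID`. The mapping is bidirectional: you can find the `UID` for a given `tagk`, and you can find the `tagk` directly from its `UID`. These `mappings` are recorded in the `tsdb-uid` table. The table has two `ColumnFamily`s, `name` and `id`, and each of them has three columns: `metric`, `tagk`, and `tagv`. The table below illustrates this:

|RowKey|id:metric|id:tagk|id:tagv|name:metric|name:tagk|name:tagv|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|metric01|0x01||||||
|metric02|0x02||||||
|tagk01||0x01|||||
|tagv01|||0x01||||
|tagv02|||0x02||||
|0x01||||metric01|||
|0x01|||||tagk01||
|0x01||||||tagv01|
|0x02||||metric02|||
|0x02||||||tagv02|

As the table shows, the `UID` mappings of the three types (`metric`, `tagk`, `tagv`) do not interfere with one another, so the `UID` `0x01` means something different in each type. A later section walks through UID assignment from the source code.

- tsdb-meta: After a time-series write completes, the configuration of the current OpenTSDB instance decides whether metadata is recorded for the series. See the `tsd.core.meta.enable_tsuid_tracking` option in the `opentsdb.conf` configuration file.

`tsd.core.meta.enable_tsuid_tracking` (default `false`): when this option is on, every written `DataPoint` (time-series data point) also writes a record to the `tsdb-meta` table whose `rowkey` is the series' `tsuid` (covered below; it is the full `rowkey` minus the `salt` and `timestamp`) and whose `value` is 1. Each point then costs two `HBase` writes, which adds some load to the HBase cluster. The relevant code is in `TSDB#storeIntoDB()#WriteCB#call()`

```java
// if the meta cache plugin is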
// instantiated then tracking goes through it
if (meta_cache != null) {
  meta_cache.increment(tsuid);
} else {
  // tsd.core.meta.enable_tsuid_tracking
  if (config.enable_tsuid_tracking()) {
    // tsd.core.meta.enable_realtime_ts
    if (config.enable_realtime_ts()) {
      // tsd.core.meta.enable_tsuid_incrementing
      if (config.enable_tsuid_incrementing()) {
        TSMeta.incrementAndGetCounter(TSDB.this, tsuid);
      } else {
        TSMeta.storeIfNecessary(TSDB.this, tsuid);
      }
    } else {
      // Write a record whose rowkey is the tsuid and whose value is 1
      final PutRequest tracking = new PutRequest(meta_table, tsuid,
          TSMeta.FAMILY(), TSMeta.COUNTER_QUALIFIER(), Bytes.fromLong(1));
      client.put(tracking);
    }
  }
}
```

- tsdb-tree: Organizes time series in a tree-like hierarchy, so you can browse them the way you browse a file system. See `http://opentsdb.net/docs/build/html/user_guide/trees.html`. I will not go into detail here. If you are interested, the `Examples` on the official page linked above make its purpose clear at a glance.

### OpenTSDB storage details (Writing)

Related documentation: `http://opentsdb.net/docs/build/html/user_guide/writing/index.html`

#### rowkey design

- There is only one column family, named "t"
- The `metric`, `tagk`, and `tagv` strings of a time series are all converted to `UID`s. However long the strings are, the `rowkey` holds only their `UID`s, which shortens the `rowkey` considerably
- The `timestamp` in the `rowkey` is not the data point's actual time. It is rounded down to the `hour` (the so-called `base_time`), so the `base_time` in the `rowkey` says which hour the data point falls in. When a data point is written, its offset (`offset`) from `base_time` is used as the `ColumnQualifier`. The table below and the code after it make this concrete.

|rowkey|t: +1|t: +2|t: +3|t: ...|t: +3599|
|--|--|--|--|--|--|
|salt+metric_uid+base_time+tagk1+tagv1+...+tagkN+tagvN|10|9|12|...|8|

#### rowkey implementation

With `salt` disabled, the code I put together to generate a `rowkey` looks like this (note: this exact code does not appear in the OpenTSDB source):

```java
public byte[] generateRowKey(String metricName, long timestamp, Map<String, String> tags) {
  // Look up the metric UID
  byte[] metricUid = tsdb.getUID(UniqueId.UniqueIdType.METRIC, metricName);

  // Convert a millisecond timestamp to seconds
  if ((timestamp & Const.SECOND_MASK) != 0L) {
    timestamp /= 1000L;
  }

  final long timestamp_offset = timestamp % Const.MAX_TIMESPAN; // 3600
  // Extract the top of the hour that contains the timestamp
  final long basetime = timestamp - timestamp_offset;

  // Stored in a TreeMap; ordering uses memcmp(), described below
  Map<byte[], byte[]> tagsUidMap = new org.hbase.async.Bytes.ByteMap<>();
  tags.forEach((k, v) -> tagsUidMap.put(
      tsdb.getUID(UniqueId.UniqueIdType.TAGK, k),
      tsdb.getUID(UniqueId.UniqueIdType.TAGV, v)));

  // Unsalted rowkey: metricUid + hour timestamp + all tagK/tagV UIDs
  byte[] rowkey = new byte[metricUid.length + Const.TIMESTAMP_BYTES
      + tags.size() * (TSDB.tagk_width() + TSDB.tagv_width())];

  // Copy each part into its position in the rowkey byte array
  System.arraycopy(metricUid, 0, rowkey, 0, metricUid.length);
  Bytes.setInt(rowkey, (int) basetime, metricUid.length);

  int startOffset = metricUid.length + Const.TIMESTAMP_BYTES;
  for (Map.Entry<byte[], byte[]> entry : tagsUidMap.entrySet()) {
    System.arraycopy(entry.getKey(), 0, rowkey, startOffset, TSDB.tagk_width());
    startOffset += TSDB.tagk_width();

    System.arraycopy(entry.getValue(), 0, rowkey, startOffset, TSDB.tagv_width());
    startOffset += TSDB.tagv_width();
  }
  return rowkey;
}
```

The `ByteMap` used here is simply a `TreeMap`; see `org.hbase.async.Bytes.ByteMap`

```java
/** A convenient map keyed with a byte array. */
public static final class ByteMap<V> extends TreeMap<byte[], V>
    implements Iterable<Map.Entry<byte[], V>> {

  public ByteMap() {
    super(MEMCMP);
  }
}
```

Multiple `tag`s are sorted by the `bytes` of their `tag_id`, using the `org.hbase.async.Bytes#memcmp(final byte[] a, final byte[] b)` method, shown below

```java
/**
 * {@code memcmp} in Java, hooray.
 * @param a First non-{@code null} byte array to compare.
 * @param b Second non-{@code null} byte array to compare.
 * @return 0 if the two arrays are identical, otherwise the difference
 * between the first two different bytes, otherwise the different between
 * their lengths.
 */
public static int memcmp(final byte[] a, final byte[] b) {
  final int length = Math.min(a.length, b.length);
  if (a == b) {  // Do this after accessing a.length and b.length
    return 0;    // in order to NPE if either a or b is null.
  }
  for (int i = 0; i < length; i++) {
    if (a[i] != b[i]) {
      return (a[i] & 0xFF) - (b[i] & 0xFF);  // "promote" to unsigned.
    }
  }
  return a.length - b.length;
}
```

#### Compaction

Related documentation: `http://opentsdb.net/docs/build/html/user_guide/definitions.html#compaction`

> An OpenTSDB compaction takes multiple columns in an HBase row and merges them into a single column to reduce disk space. This is not to be confused with HBase compactions where multiple edits to a region are merged into one. OpenTSDB compactions can occur periodically for a TSD after data has been written, or during a query.

`tsd.storage.enable_compaction`: whether compaction is enabled (default true, i.e. compaction is on)

To save storage space (which should also benefit queries), OpenTSDB puts the `rowkey` into a `ConcurrentSkipListMap` as it writes time-series data. A `daemon` thread keeps checking whether data older than `System.currentTimeMillis()/1000-3600-1` can be compacted. When the conditions are met, it reads the data points of that hour (they share the same `rowkey`), compacts (`compact`) them in memory into a single column, writes (`write`) that column back to `HBase`, and then `delete`s the original data. A query (`query`) may also trigger a `compaction`. The code is in `CompactionQueue`

```java
final class CompactionQueue extends ConcurrentSkipListMap<byte[], Boolean> {

  public CompactionQueue(final TSDB tsdb) {
    super(new Cmp(tsdb));
    // tsd.storage.enable_compaction
    if (tsdb.config.enable_compactions()) {
      // With compaction enabled, start a daemon thread
      startCompactionThread();
    }
  }

  /**
   * Helper to sort the byte arrays in the compaction queue.
   * <p>
   * This comparator sorts things by timestamp first, this way we can find
   * all rows of the same age at once.
   */
  private static final class Cmp implements Comparator<byte[]> {

    /** The position with which the timestamp of metric starts.  */
    private final short timestamp_pos;

    public Cmp(final TSDB tsdb) {
      timestamp_pos = (short) (Const.SALT_WIDTH() + tsdb.metrics.width());
    }

    @Override
    public int compare(final byte[] a, final byte[] b) {
      // Compare the base_time portion of the rowkey first
      final int c = Bytes.memcmp(a, b, timestamp_pos, Const.TIMESTAMP_BYTES);
      // If the timestamps are equal, sort according to the entire row key.
      return c != 0 ? c : Bytes.memcmp(a, b);
    }
  }
}
```

Next, what the `daemon` thread started above does, in `CompactionQueue#Thrd`

```java
/**
 * Background thread to trigger periodic compactions.
 */
final class Thrd extends Thread {
  public Thrd() {
    super("CompactionThread");
  }

  @Override
  public void run() {
    while (true) {
      final int size = size();
      // Once the minimum threshold is reached, trigger flush()
      if (size > min_flush_threshold) {
        final int maxflushes = Math.max(min_flush_threshold,
            size * flush_interval * flush_speed / Const.MAX_TIMESPAN);
        final long now = System.currentTimeMillis();
        // Check whether the previous hour's data can be compacted
        flush(now / 1000 - Const.MAX_TIMESPAN - 1, maxflushes);
      }
    }
  }
}
```

Then `CompactionQueue#flush(final long cut_off, int maxflushes)`

```java
private Deferred<ArrayList<Object>> flush(final long cut_off, int maxflushes) {
  final ArrayList<Deferred<Object>> ds =
      new ArrayList<Deferred<Object>>(Math.min(maxflushes, max_concurrent_flushes));
  int nflushes = 0;
  int seed = (int) (System.nanoTime() % 3);
  for (final byte[] row : this.keySet()) {
    final long base_time = Bytes.getUnsignedInt(row,
        Const.SALT_WIDTH() + metric_width);
    if (base_time > cut_off) {
      // base_time is too close to the current time, so stop here
      break;
    } else if (nflushes == max_concurrent_flushes) {
      break;
    }
    // Send a get request to HBase for the data; compaction runs in the callback
    ds.add(tsdb.get(row).addCallbacks(compactcb, handle_read_error));
  }
  // Excerpt: the declaration of group is not shown here
  return group;
}
```

Finally, a look at what `compaction` actually does, in `CompactionQueue#Compaction#compact()`

```java
public Deferred
```
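
The ordering and cut-off logic of the compaction queue described above can be tried out in isolation. The following is a minimal sketch, not OpenTSDB code: the class `CompactionQueueSketch` and its helpers `rowKey`, `baseTimeOf`, and `flushable` are hypothetical names, the row keys carry no salt and no tags, and the 3-byte metric UID width is an assumption taken from the example key earlier in the post. It only illustrates why sorting by `base_time` first lets the flush loop stop at the first row that is newer than `now - 3600 - 1`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Standalone sketch; every name here is hypothetical, not OpenTSDB's API.
class CompactionQueueSketch {
  static final int METRIC_WIDTH = 3;       // assumed UID width, as in the example key
  static final int TIMESTAMP_BYTES = 4;
  static final long MAX_TIMESPAN = 3600L;  // one row covers one hour

  /** Builds an unsalted, tag-free row key: metric UID followed by base_time. */
  static byte[] rowKey(int metricUid, long timestampSeconds) {
    long baseTime = timestampSeconds - (timestampSeconds % MAX_TIMESPAN);
    return ByteBuffer.allocate(METRIC_WIDTH + TIMESTAMP_BYTES)
        .put((byte) (metricUid >>> 16))
        .put((byte) (metricUid >>> 8))
        .put((byte) metricUid)
        .putInt((int) baseTime)
        .array();
  }

  /** Reads the 4-byte base_time as an unsigned value. */
  static long baseTimeOf(byte[] row) {
    return ByteBuffer.wrap(row, METRIC_WIDTH, TIMESTAMP_BYTES).getInt() & 0xFFFFFFFFL;
  }

  /** Unsigned lexicographic comparison of two byte arrays. */
  static int memcmp(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      if (a[i] != b[i]) {
        return (a[i] & 0xFF) - (b[i] & 0xFF);
      }
    }
    return a.length - b.length;
  }

  /** Sorts by base_time first, then by the whole key. */
  static final Comparator<byte[]> BY_BASE_TIME = (a, b) -> {
    int c = Long.compare(baseTimeOf(a), baseTimeOf(b));
    return c != 0 ? c : memcmp(a, b);
  };

  /** Returns, oldest first, the keys whose base_time is at or before the cut-off. */
  static List<byte[]> flushable(ConcurrentSkipListMap<byte[], Boolean> queue, long nowSeconds) {
    long cutOff = nowSeconds - MAX_TIMESPAN - 1;
    List<byte[]> rows = new ArrayList<>();
    for (byte[] row : queue.keySet()) {
      if (baseTimeOf(row) > cutOff) {
        break;  // keys are time-ordered, so everything after this is newer
      }
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) {
    ConcurrentSkipListMap<byte[], Boolean> queue = new ConcurrentSkipListMap<>(BY_BASE_TIME);
    long t0 = 1356998400L;  // 0x50E22700, the base_time of the example key
    queue.put(rowKey(2, t0 + 7230), Boolean.TRUE);  // falls in hour t0 + 7200
    queue.put(rowKey(2, t0 + 59), Boolean.TRUE);    // falls in hour t0
    queue.put(rowKey(1, t0 + 15), Boolean.TRUE);    // falls in hour t0
    for (byte[] row : flushable(queue, t0 + 7300)) {
      System.out.println("base_time=" + baseTimeOf(row) + " metric=" + row[2]);
    }
  }
}
```

With "now" set to `t0 + 7300`, the cut-off is `t0 + 3699`, so the two rows of hour `t0` are returned (metric 1 before metric 2) and the row of hour `t0 + 7200` is left in the queue. The real `flush()` additionally caps the number of rows per pass and issues asynchronous HBase reads, which this sketch leaves out.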
