cyrus

cyrus

梦想是做一名可以“养活”自己的自由职业者 --自由不是你想做什么就做什么 而是不想做什么就不做什么.
github
twitter
bilibili

What? Real-time data warehouse is actually so simple?

As the first practitioner of the company's real-time data warehouse, I record my own definition of some specifications and understanding of the real-time data warehouse.

Overview of Streaming Data Warehouse#

Purpose of Construction#

Clear data structure: Each data layer has its own role and responsibility, making it more convenient and understandable to use and maintain.
Simplify complex problems: Break down a complex task into multiple steps to be completed step by step, with each layer solving specific problems.
Unified data standard: By data layering, provide a unified data output and output standard.
Reduce duplicate development: Standardize data layering, enable data reuse, develop common intermediate layers, and greatly reduce duplicate calculation work.

Design Concept#

Real-time streaming processing chain (Flink + Kafka) divides data into ODS, DWD, DWS layers and writes them into the online service layer in real-time to provide online services (real-time OLAP).

  • Compared to offline data warehouses, the data source storage of real-time data warehouses is different. Based on Kafka, detailed data or summary data will be stored in Kafka, and dimension information will be considered from HIVE T+1.
  • Compared to offline data warehouses, there is no complex business logic, and more emphasis is placed on the efficiency of data flow. The data warehouse has fewer layers.
  • Data flow and calculation between layers are completed through Flink.

Suitable Scenarios#

Operational level: For example, real-time business changes, real-time statistical analysis by topic, and analysis of daily business trends by hour.
Production level: For example, whether the real-time system is reliable, whether the system is stable, and the health status of real-time monitoring systems.
C-end users: For example, search recommendation sorting requires the production of real-time behavior, characteristics, and other feature variables to recommend more reasonable content to users.
Risk control side: Real-time risk identification, anti-fraud, abnormal transactions, etc., are all scenarios that require the application of real-time data.

Architecture Design for OLAP#

image

Data Warehouse Design Example#

Layering#

LevelNameStorageDescription
ODSSource LayerWritten to Kafka/ flink-cdc-table after CDCCollect unstructured data
DWDDetail LayerKafkaFilter, join, and widen (filter, union, join)
DWSSummary LayerKafkaBusiness aggregation (agg win, join dimension)
ADSApplication Layer
DIMDimension Table LayerUpsert-kafkaHive dimension table, detail association

ods:
Business library binlog: Generally from various relational database binlogs.
System logs: Logs are generally saved in the form of files, and Flink and Kafka are used to access them in real-time.
Message queue: Data from Kafka, etc.

dw:
Flink processing model, divided by topic

ads:
Generally directly connected to OLAP analysis, or business layer data calling interface.
This is the top level, generally the result type data, which can be directly used or displayed, and it is also the highest level of data abstraction and analysis.

dim source:
Dimension of detailed data, sourced from Hive.

Case Study#

Food safety business as an example

image

Design#

ods: realtime_ods_source.database_name.table_name
dwd: realtime_dwd.table_name
dim: realtime_dim.dimension_table_name
dws: realtime_dws.business_name
ads: realtime_ads.indicator_name

Topic example:

image

image

Note: Topic names are case-sensitive.

Issues#

  • Lambda architecture leads to inconsistent standards (consider using Flink 14+ batch-stream unified API)
  • Incomplete persistence of ADS, add Druid, ES, etc. for OLAP
  • Reusability and the demand for each layer of the data warehouse link to be queryable (hive catalog permanent table storage)
Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.