As the first person to build a real-time data warehouse at my company, I am recording the conventions I defined and my own understanding of real-time data warehouses.
Overview of Streaming Data Warehouse#
Purpose of Construction#
Clear data structure: Each data layer has its own role and responsibility, which makes the data easier to understand, use, and maintain.
Simplify complex problems: Break a complex task into multiple steps, with each layer solving one specific problem.
Unified data standard: Layering provides a single, consistent standard for how data is exposed and output.
Reduce duplicate development: Standardized layering enables data reuse; shared intermediate layers greatly reduce repeated computation.
Design Concept#
The real-time stream processing chain (Flink + Kafka) divides data into ODS, DWD, and DWS layers and writes the results into the online service layer in real time to serve online queries (real-time OLAP).
- Compared to offline data warehouses, the storage of source data is different. Both detail data and summary data are stored in Kafka, while dimension information is taken from Hive with T+1 latency.
- Compared to offline data warehouses, there is less complex business logic and more emphasis on the efficiency of data flow, so the warehouse has fewer layers.
- Data flow and calculation between layers are completed through Flink.
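To make the division of work between layers concrete, here is a minimal in-memory sketch in plain Python. It is not Flink code: lists stand in for Kafka topics, and the record fields (`shop_id`, `amount`, `ts`), the `DIM_SHOP` lookup, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dimension data; in the design above it would come from Hive (T+1).
DIM_SHOP = {1: "north", 2: "south"}

def ods_to_dwd(ods_records):
    """DWD step: drop malformed rows and widen each row with a dimension attribute."""
    for rec in ods_records:
        if rec.get("amount") is None or rec["amount"] < 0:
            continue  # filter out invalid detail records
        yield {**rec, "region": DIM_SHOP.get(rec["shop_id"], "unknown")}

def dwd_to_dws(dwd_records, window_seconds=3600):
    """DWS step: sum amounts per (window start, region) using tumbling windows."""
    totals = defaultdict(float)
    for rec in dwd_records:
        window_start = rec["ts"] - rec["ts"] % window_seconds
        totals[(window_start, rec["region"])] += rec["amount"]
    return dict(totals)

ods = [
    {"shop_id": 1, "amount": 10.0, "ts": 100},
    {"shop_id": 2, "amount": 5.0, "ts": 200},
    {"shop_id": 1, "amount": None, "ts": 300},  # malformed, removed in DWD
    {"shop_id": 1, "amount": 2.5, "ts": 3700},  # falls into the next hourly window
]
dws = dwd_to_dws(ods_to_dwd(ods))
```

In a real deployment each function would roughly correspond to one Flink job reading from one Kafka topic and writing to the next.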
Suitable Scenarios#
Operational level: For example, tracking business changes in real time, real-time statistical analysis by topic, and hour-by-hour analysis of the day's business trends.
Production level: For example, real-time monitoring of whether systems are reliable, stable, and healthy.
C-end users: For example, search, recommendation, and ranking need real-time behavioral features and other feature variables in order to recommend more relevant content to users.
Risk control side: Real-time risk identification, anti-fraud, abnormal transaction detection, and similar scenarios all depend on real-time data.
Architecture Design for OLAP#
Data Warehouse Design Example#
Layering#
Level | Name | Storage | Description |
---|---|---|---|
ODS | Source Layer | Kafka / flink-cdc table (written after CDC) | Collects unstructured data |
DWD | Detail Layer | Kafka | Filtering, merging, and widening (filter, union, join) |
DWS | Summary Layer | Kafka | Business aggregation (window aggregation, dimension join) |
ADS | Application Layer | | |
DIM | Dimension Table Layer | upsert-kafka | Hive dimension tables, joined with detail data |
ods:
Business database binlog: Typically binlogs from various relational databases.
System logs: Logs are usually stored as files and are ingested in real time with Flink and Kafka.
Message queue: Data that already arrives through Kafka or a similar queue.
dw:
Processed with Flink and divided by topic.
ads:
Usually connected directly to OLAP analysis, or exposed through data interfaces called by the business layer.
This is the top layer. It generally holds result data that can be used or displayed directly, and it represents the highest level of data abstraction and analysis.
dim source:
Dimensions for the detail data, sourced from Hive.
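The storage choices in the layering table can be sketched as Flink SQL table definitions. The helper below is a hypothetical Python function that only renders DDL text; the table names, columns, topics, and broker address are illustrative assumptions, and JSON is assumed as the message format. The connector names and option keys (`kafka`, `upsert-kafka`, `properties.bootstrap.servers`, `scan.startup.mode`, `key.format`, `value.format`) are standard Flink SQL connector options.

```python
def kafka_table_ddl(table, columns, topic, bootstrap_servers, primary_key=None):
    """Render a Flink SQL CREATE TABLE statement backed by Kafka.

    With primary_key set, the upsert-kafka connector is used (DIM layer);
    otherwise the append-only kafka connector is used (ODS/DWD/DWS).
    """
    col_lines = [f"  `{name}` {sql_type}" for name, sql_type in columns]
    if primary_key:
        # upsert-kafka requires a primary key declared as NOT ENFORCED
        col_lines.append(f"  PRIMARY KEY (`{primary_key}`) NOT ENFORCED")
        options = {
            "connector": "upsert-kafka",
            "topic": topic,
            "properties.bootstrap.servers": bootstrap_servers,
            "key.format": "json",
            "value.format": "json",
        }
    else:
        options = {
            "connector": "kafka",
            "topic": topic,
            "properties.bootstrap.servers": bootstrap_servers,
            "scan.startup.mode": "latest-offset",
            "format": "json",
        }
    opt_lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return (
        f"CREATE TABLE {table} (\n" + ",\n".join(col_lines) + "\n) WITH (\n"
        + ",\n".join(opt_lines) + "\n)"
    )

# Hypothetical tables: one append-only detail table, one upsert dimension table.
dwd_ddl = kafka_table_ddl(
    "dwd_order_detail",
    [("order_id", "BIGINT"), ("amount", "DOUBLE")],
    "realtime_dwd.order_detail",
    "localhost:9092",
)
dim_ddl = kafka_table_ddl(
    "dim_shop",
    [("shop_id", "BIGINT"), ("region", "STRING")],
    "realtime_dim.shop",
    "localhost:9092",
    primary_key="shop_id",
)
```

Printing `dwd_ddl` or `dim_ddl` gives a statement that could be submitted through the Flink SQL client; the upsert variant keeps only the latest value per key, which is why it suits dimension data.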
Case Study#
The food safety business is used as the example.
Design#
ods: realtime_ods_source.database_name.table_name
dwd: realtime_dwd.table_name
dim: realtime_dim.dimension_table_name
dws: realtime_dws.business_name
ads: realtime_ads.indicator_name
Topic example:
Note: Topic names are case-sensitive.
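The naming scheme above can be enforced with a small helper so that names are built in one place instead of being typed by hand. This is a minimal sketch: the function names and the example database and table names (`food_safety`, `inspection`) are hypothetical, and only the layer prefixes come from the list above.

```python
LAYER_PREFIX = {
    "ods": "realtime_ods_source",
    "dwd": "realtime_dwd",
    "dim": "realtime_dim",
    "dws": "realtime_dws",
    "ads": "realtime_ads",
}

def topic_name(layer, *parts):
    """Build a name such as realtime_dwd.table_name from a layer and its name parts."""
    if layer not in LAYER_PREFIX:
        raise ValueError(f"unknown layer: {layer!r}")
    expected = 2 if layer == "ods" else 1  # ods carries database and table
    if len(parts) != expected:
        raise ValueError(f"{layer} expects {expected} name part(s), got {len(parts)}")
    return ".".join((LAYER_PREFIX[layer], *parts))

def same_topic(a, b):
    """Topic names are case-sensitive, so compare them exactly, without case folding."""
    return a == b

ods_topic = topic_name("ods", "food_safety", "inspection")
dwd_topic = topic_name("dwd", "inspection_detail")
```

Because the comparison is exact, `realtime_dwd.Orders` and `realtime_dwd.orders` are treated as two different topics, which matches the note above.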
Issues#
- The Lambda architecture leads to inconsistent standards between the batch and streaming paths (consider the unified batch/stream API in Flink 1.14+).
- ADS persistence is incomplete; add Druid, Elasticsearch, etc. for OLAP.
- Each layer of the warehouse pipeline needs to be reusable and queryable (store the tables as permanent tables in a Hive catalog).
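For the last point, one possible approach is to register a Hive catalog so that table definitions for each layer are kept in the Hive Metastore rather than being recreated per session. The sketch below only builds the SQL statements and passes them to a caller-supplied function; the catalog name, config directory, and database name are assumptions. With PyFlink, the callable could be `TableEnvironment.execute_sql`.

```python
def hive_catalog_statements(catalog, hive_conf_dir, database):
    """Return Flink SQL statements that register and select a Hive catalog."""
    return [
        f"CREATE CATALOG {catalog} WITH ("
        f"'type' = 'hive', 'hive-conf-dir' = '{hive_conf_dir}')",
        f"USE CATALOG {catalog}",
        f"CREATE DATABASE IF NOT EXISTS {database}",
        f"USE {database}",
    ]

def register(execute_sql, statements):
    """Run each statement in order through the supplied executor."""
    for statement in statements:
        execute_sql(statement)

# Dry run: collect the statements instead of sending them to a cluster.
executed = []
register(executed.append, hive_catalog_statements("hive_catalog", "/opt/hive-conf", "realtime_dwd"))
```

Tables created while this catalog is active are stored as permanent metadata, so later jobs and ad hoc queries can reference the same definitions.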