2/28/2024

Incremental Load ETL Processes

More than ten years ago, incremental processing was not a method that was even worth mentioning, because the storage and computing performance of relational databases was very limited. At that time, incremental processing was the most common method; even MPP database platforms did not run everything over full data. A database system supports transactions, and its atomicity, consistency, isolation, and durability (ACID) properties allow updating, deleting, and inserting on a data table at the same time. Its storage is organized in 4 KB or 8 KB data blocks, and detailed statistics and index structures help us do incremental data processing efficiently.

These operations are not easy in current distributed file systems such as MaxCompute. The data block is 64 MB rather than 4 KB or 8 KB, and we do not have an index structure that accelerates finding 50 rows among 10 million. So how do we do incremental processing in MaxCompute? Honestly, it is not easy. Since there is no index structure, every run has to scan the full data; if commits were as frequent as in a relational database, incremental processing would gain no performance or resource advantage. If we still need features supported by relational databases, such as delete and update, we can refer to the Transactional Table feature recently launched on the MaxCompute public cloud. To sum up, we can do incremental processing where it genuinely fits, but we should not force it or apply it on a large scale.

Incremental processing is not easy, after all, so should we do it? Its premise is that we have already obtained the incremental data. Compared with full data, incremental data is a much smaller set, and we expect to complete the processing with this small set instead of the full data, so the whole job finishes more quickly and economically. Incremental processing in MaxCompute comes down to two scenarios:

- Scenario 1: The resources required for full processing cannot meet the timeliness requirement, and performance needs to be optimized.
- Scenario 2: The incremental processing logic is simple and has a performance advantage over full processing.

We also need to establish some principles for using incremental processing; breaking through or not abiding by these principles is unreasonable or incorrect:

- An incremental table carries an incremental state and a data update time.
- Two incremental tables cannot be associated directly; at least one of the tables must be a full data table.
- The result table generated by incremental processing must also record the incremental state and the data update time.
- As long as the associated row of a table is incremental, that table's incremental identifier is used; if multiple tables are associated, the incremental identifiers of all of them need to be taken.
- Only the INSERT and DELETE states of the primary table or of an INNER JOIN table can be passed to the next layer; the incremental states of the other tables become UPDATE.

After incremental data is integrated into the MaxCompute platform, you need to perform a MERGE operation to generate the full data at the ODS layer. MERGE logic is therefore the simplest and most classic incremental processing logic.

The Processing of More Than Two Incremental Tables

In general, the tables of incremental processing are business tables rather than small tables (such as code tables or parameter tables). The reason for performing incremental processing here is that the tables are really large. The logic takes advantage of the small size of the incremental tables and uses MAPJOIN to quickly produce the two associated sets from the union of the increments and then associates them. This avoids redistributing the large tables, which improves operational efficiency and reduces resource consumption. The simplest MERGE logic is listed below:

```sql
INSERT OVERWRITE TABLE table_old PARTITION (ds='$')
FROM ta_add2 t1 JOIN tb_add2 t2 ON t1.pk = t2.pk
```

If the full tables are only million-level tables, it is recommended to test the performance first; associating them directly may be simpler and more efficient. Therefore, in many MaxCompute scenarios it is unnecessary to perform incremental processing and computing at all.
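The ODS-layer MERGE described above, which folds an increment into the previous full data set with the increment winning on primary key, can be sketched in plain Python. The row shape (dicts with `pk`, `_state`, `_update_time`-style fields) and the function name are illustrative assumptions, not MaxCompute APIs:

```python
# Sketch of the classic ODS-layer MERGE: combine yesterday's full data
# with today's increment; for a given primary key the increment wins.
# Row shape and names are illustrative assumptions, not MaxCompute APIs.

def merge_full(full_rows, add_rows):
    """Return the new full data set after applying the increment."""
    merged = {row["pk"]: row for row in full_rows}   # start from old full data
    for row in add_rows:                             # increment overrides by pk
        if row.get("_state") == "DELETE":
            merged.pop(row["pk"], None)              # drop deleted keys
        else:                                        # INSERT or UPDATE
            merged[row["pk"]] = row
    return sorted(merged.values(), key=lambda r: r["pk"])

full = [{"pk": 1, "v": "a"}, {"pk": 2, "v": "b"}]
add = [{"pk": 2, "v": "b2", "_state": "UPDATE"},
       {"pk": 3, "v": "c", "_state": "INSERT"},
       {"pk": 1, "_state": "DELETE"}]
print(merge_full(full, add))
# → pk 1 deleted, pk 2 updated, pk 3 inserted
```

Writing the result with `INSERT OVERWRITE` into a new partition, as the SQL above does, plays the role of the final assignment here: the new full data replaces the old one atomically.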
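The union-then-join trick for two large tables with small increments can also be sketched: collect the primary keys touched by either increment, restrict each full table to those keys (this is what sets like `ta_add2` and `tb_add2` stand for, and what MAPJOIN makes cheap), and join only the slim slices. This is a minimal sketch under assumed row shapes, not the article's actual job:

```python
# Sketch of incremental association of two large tables ta and tb:
# only rows whose pk appears in either increment need to be re-joined.
# ta_full / tb_full are the already-merged full data sets; names and
# row shapes are illustrative assumptions.

def changed_keys(ta_add, tb_add):
    # Union of primary keys touched by either increment (the small set
    # a MAPJOIN would broadcast in MaxCompute).
    return {r["pk"] for r in ta_add} | {r["pk"] for r in tb_add}

def incremental_join(ta_full, tb_full, ta_add, tb_add):
    keys = changed_keys(ta_add, tb_add)
    ta_add2 = {r["pk"]: r for r in ta_full if r["pk"] in keys}  # slim slice of ta
    tb_add2 = {r["pk"]: r for r in tb_full if r["pk"] in keys}  # slim slice of tb
    # Join the two slim slices on pk; only this small part is recomputed,
    # so the big tables are never redistributed.
    return [{**ta_add2[k], **tb_add2[k]} for k in sorted(ta_add2) if k in tb_add2]

ta_full = [{"pk": 1, "a": 10}, {"pk": 2, "a": 21}]
tb_full = [{"pk": 1, "b": "x"}, {"pk": 2, "b": "y"}]
ta_add = [{"pk": 2, "a": 21}]   # only pk=2 changed today
print(incremental_join(ta_full, tb_full, ta_add, []))
# → only the joined row for pk=2 is recomputed
```

The saving comes from the join input shrinking from the full tables to the changed keys; the unchanged joined rows are carried over from the previous day's result.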
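One possible reading of the state-propagation principle above (the post gives no code for it, so this encoding is an assumption): when rows from several tables are joined, only the primary table's (or an INNER JOIN table's) INSERT and DELETE survive into the next layer, and any other table's incremental row degrades the result to UPDATE.

```python
# Illustrative combination of incremental states when joining rows from
# several tables. is_primary marks the primary / INNER JOIN table whose
# INSERT and DELETE may pass through; other tables' changes become UPDATE.
# This encoding is an assumption, not a MaxCompute feature.

def combine_states(states):
    """states: list of (is_primary, state-or-None), one entry per joined row."""
    result = None
    for is_primary, state in states:
        if state is None:            # that table's row was not incremental
            continue
        if is_primary and state in ("INSERT", "DELETE"):
            return state             # primary table's INSERT/DELETE wins
        result = "UPDATE"            # any other incremental row → UPDATE
    return result                    # None means the joined row is unchanged

print(combine_states([(True, "INSERT"), (False, "UPDATE")]))  # INSERT
print(combine_states([(True, None), (False, "INSERT")]))      # UPDATE
print(combine_states([(True, None), (False, None)]))          # None
```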