栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 前沿技术 > 大数据 > 大数据系统

Datax使用

Datax使用

Datax/Addax

Datax/Addax

1. 介绍2. 框架设计3. 运行要求4. 文档5. 安装步骤

1. 安装jdk2. 安装python2.73. 将addax-4.0.8.tar.gz解压4. 执行测试脚本5. 示例 6.不同数据源之间的迁移7.多表同步和多表定时增量同步8.待解决问题

1. windows上使用失败2. addax使用失败 1. 介绍

Addax(此前叫 DataX)是一款支持主流数据库 (Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto(Trino), PostgreSQL, SQL Server) 的通用数据采集工具。

2. 框架设计

Addax是一个离线数据同步框架,采用framework+plugin架构构建。
将数据源读取和写入抽象成为Reader/Writer插件。

Reader:Reader为数据采集模块,负责采集数据源的数据,将数据发送给framework。

Writer: Writer为数据写入模块,负责不断向framework取数据,并将数据写入到目的端。

framework:framework用于连接reader和writer,作为两者的数据传输通道,并处理缓冲,流控,并发,数据转换等核心技术问题。
Addax framework提供了简单的接口与插件交互,提供简单的插件接入机制,只需要任意加上一种插件,就能无缝对接其他数据源。

3. 运行要求

JDK1.8+Python2.7 +/ python3.7+ 4. 文档

点击跳转官网地址点击跳转文档地址支持数据库:

database/filesystemreadwriteplugin(reader/writer)memo
Cassandracassandrareader/cassandrawriter
ClickHouse-----clickhousereader/clickhousewriter----
DB2rbdmsreader/rdbmswriternot fully tested
DBFdbfreader/dbfwriter----
ElasticSearchelasticsearchreader/elasticsearchwriteroriginally from @Kestrong
Excelexcelreader/excelwriter----
FTPftpreader/ftpwriter----
Hbase 1.x(API)hbase11xreader/hbase11xwriteruse Hbase API
Hbase 1.x(SQL)hbase11xsqlreader/hbase11xsqlwriteruse PhoenixPhoenix
Hbase 2.x(API)xhbase20xreaderuse Hbase API
Hbase 2.x(SQL)hbase20xsqlreader/hbase20xsqlwritervia Phoenix
HDFShdfsreader/hdfswritersupport HDFS 2.0 or later
Hivexhivereader----
HTTPxhttpreadersupport RestFul API
Greenplumpostgresqlreader/greenplumwriter----
InfluxDBinfluxdbreader/influxdbwriteronLY support InfluxDB 1.x
InfluxDB2influxdb2reader/influxdb2writeronLY InfluxDB 2.0 or later
jsonxjsonfilereader----
kudukudureader/kuduwriter----
MongoDBmongodbreader/mongodbwriter----
MySQL/MariaDBmysqlreader/mysqlwriter----
Oracleoraclereader/oraclewriter----
PostgreSQLpostgresqlreader/postgresqlwriter----
Trinordbmsreader/rdbmswritertrino
Redisredisreader/rediswriter----
SQLitesqlitereader/sqlitewriter----
SQL Serversqlserverreader/sqlserverwriter----
TDenginetdenginereader/tdenginewriterTDengine
TDH Inceptor2rdbmsreader/rdbmswriterTranswarp TDH 5.1 or later
TEXTtextfilereader/textfilewriter----
5. 安装步骤

点击跳转参考地址

1. 安装jdk 2. 安装python2.7
sudo apt install python2.7

sudo apt install python-pip

sudo apt install python3-pip

3. 将addax-4.0.8.tar.gz解压
sudo tar -zxvf addax-4.0.8.tar.gz -C /opt/

4. 执行测试脚本
 cd /opt/addax-4.0.8.tar.gz/
 sudo ./bin/addax.sh ./job/job.json

如果没有报错,则会有类似以下输出

  ___      _     _
 / _     | |   | |
/ /_  __| | __| | __ ___  __
|  _  |/ _` |/ _` |/ _`  / /
| | | | (_| | (_| | (_| |>  <
_| |_/__,_|__,_|__,_/_/_

:: Addax version ::    (v4.0.3-SNAPSHOT)

2021-08-23 13:45:17.199 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2021-08-23 13:45:17.223 [        main] INFO  Engine               -
{
    "content":
        {
            "reader":{
                "parameter":{
                    "column":[
                        {
                            "type":"string",
                            "value":"addax"
                        },
                        {
                            "type":"long",
                            "value":19890604
                        },
                        {
                            "type":"date",
                            "value":"1989-06-04 00:00:00"
                        },
                        {
                            "type":"bool",
                            "value":true
                        }
                    ],
                    "sliceRecordCount":10
                },
                "name":"streamreader"
            },
            "writer":{
                "parameter":{
                    "print":true
                },
                "name":"streamwriter"
            }
        }
}

2021-08-23 13:45:17.238 [        main] INFO  PerfTrace            - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-08-23 13:45:17.239 [        main] INFO  JobContainer         - Addax jobContainer starts job.
2021-08-23 13:45:17.240 [        main] INFO  JobContainer         - Set jobId = 0
2021-08-23 13:45:17.250 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] do prepare work .
2021-08-23 13:45:17.250 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do prepare work .
2021-08-23 13:45:17.251 [       job-0] INFO  JobContainer         - Job set Channel-Number to 1 channels.
2021-08-23 13:45:17.251 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] splits to [1] tasks.
2021-08-23 13:45:17.252 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] splits to [1] tasks.
2021-08-23 13:45:17.276 [       job-0] INFO  JobContainer         - Scheduler starts [1] taskGroups.
2021-08-23 13:45:17.282 [ taskGroup-0] INFO  TaskGroupContainer   - taskGroupId=[0] start [1] channels for [1] tasks.
2021-08-23 13:45:17.287 [ taskGroup-0] INFO  Channel              - Channel set byte_speed_limit to -1, No bps activated.
2021-08-23 13:45:17.288 [ taskGroup-0] INFO  Channel              - Channel set record_speed_limit to -1, No tps activated.
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
2021-08-23 13:45:20.295 [       job-0] INFO  AbstractScheduler    - Scheduler accomplished all tasks.
2021-08-23 13:45:20.296 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do post work.
2021-08-23 13:45:20.297 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] do post work.
2021-08-23 13:45:20.302 [       job-0] INFO  JobContainer         - PerfTrace not enable!
2021-08-23 13:45:20.305 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 220 bytes | Speed 73B/s, 3 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.011s | Percentage 100.00%
2021-08-23 13:45:20.307 [       job-0] INFO  JobContainer         -
任务启动时刻                    : 2021-08-23 13:45:17
任务结束时刻                    : 2021-08-23 13:45:20
任务总计耗时                    :                  3s
任务平均流量                    :               73B/s
记录写入速度                    :              3rec/s
读出记录总数                    :                  10
读写失败总数                    :                   0
5. 示例

创建两个数据库各创建一张user表表1中增加几条数据编写json配置

    {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "user"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "user"
                                ],
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8"
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1,
                "byte": 104857600
            },
            "errorLimit": {
                "record": 10,
                "percentage": 0.05
            }
        }
    }
}

settings: 用来定义本次任务的一些控制参数,比如指定多少线程,最大错误率,最大错误记录条数等,这是可选配置。可配置内容如下:

    {
  "setting": {
    "speed": {
      "byte": -1,
      "record": 100,
      "channel": 1
    },
    "errorLimit": {
      "record": 0,
      "percentage": 0.02
    }
  }
}

reader: 用来配置数据读取所需要的相关信息,这是必填内容writer: 用来配置写入数据所需要的相关信息,这是必填内容transformer: 数据转换规则,如果需要对读取的数据在写入之前做一些变换,可以配置该项,否则可以不配置

name:数据库类型username:数据库用户名password:数据库密码column:列名connection:table:表名
jdbcUrl:sql连接地址

执行结果

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2022-02-23 14:14:05.602 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2022-02-23 14:14:05.610 [main] INFO  Engine - the machine info  =>

        osInfo: Private Build 1.8 25.312-b07
        jvmInfo:        Linux amd64 5.13.0-30-generic
        cpu num:        1

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [Copy, MarkSweepCompact]

        MEMORY_NAME                    | allocation_size                | init_size
        Eden Space                     | 273.06MB                       | 273.06MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Survivor Space                 | 34.13MB                        | 34.13MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        metaspace                      | -0.00MB                        | 0.00MB
        Tenured Gen                    | 682.69MB                       | 682.69MB


2022-02-23 14:14:05.629 [main] INFO  Engine -
{
        "content":[
                {
                        "reader":{
                                "name":"mysqlreader",
                                "parameter":{
                                        "column":[
                                                "id",
                                                "name"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":[
                                                                "jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
                                                        ],
                                                        "table":[
                                                                "user"
                                                        ]
                                                }
                                        ],
                                        "password":"******",
                                        "username":"root"
                                }
                        },
                        "writer":{
                                "name":"mysqlwriter",
                                "parameter":{
                                        "column":[
                                                "id",
                                                "name"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":"jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8",
                                                        "table":[
                                                                "user"
                                                        ]
                                                }
                                        ],
                                        "password":"******",
                                        "username":"root"
                                }
                        }
                }
        ],
        "setting":{
                "errorLimit":{
                        "percentage":0.05,
                        "record":10
                },
                "speed":{
                        "byte":104857600,
                        "channel":1
                }
        }
}

2022-02-23 14:14:05.653 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2022-02-23 14:14:05.654 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2022-02-23 14:14:05.655 [main] INFO  JobContainer - DataX jobContainer starts job.
2022-02-23 14:14:05.661 [main] INFO  JobContainer - Set jobId = 0
2022-02-23 14:14:06.006 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2022-02-23 14:14:06.027 [job-0] INFO  OriginalConfPretreatmentUtil - table:[user] has columns:[id,name].
2022-02-23 14:14:06.241 [job-0] INFO  OriginalConfPretreatmentUtil - table:[user] all columns:[
id,name
].
2022-02-23 14:14:06.251 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (id,name) VALUES(?,?)
], which jdbcUrl like:[jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2022-02-23 14:14:06.251 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2022-02-23 14:14:06.255 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2022-02-23 14:14:06.258 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 104857600 bytes.
2022-02-23 14:14:06.265 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2022-02-23 14:14:06.267 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2022-02-23 14:14:06.288 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2022-02-23 14:14:06.298 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2022-02-23 14:14:06.299 [job-0] INFO  JobContainer - Running by standalone Mode.
2022-02-23 14:14:06.318 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2022-02-23 14:14:06.324 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2022-02-23 14:14:06.326 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2022-02-23 14:14:06.346 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2022-02-23 14:14:06.353 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.380 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.747 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[409]ms
2022-02-23 14:14:06.748 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2022-02-23 14:14:16.333 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.334 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2022-02-23 14:14:16.335 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2022-02-23 14:14:16.335 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2022-02-23 14:14:16.336 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2022-02-23 14:14:16.337 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2022-02-23 14:14:16.342 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s
                 MarkSweepCompact     | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s

2022-02-23 14:14:16.343 [job-0] INFO  JobContainer - PerfTrace not enable!
2022-02-23 14:14:16.344 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.350 [job-0] INFO  JobContainer -
任务启动时刻                    : 2022-02-23 14:14:05
任务结束时刻                    : 2022-02-23 14:14:16
任务总计耗时                    :                 10s
任务平均流量                    :                0B/s
记录写入速度                    :              0rec/s
读出记录总数                    :                   3
读写失败总数                    :                   0

6.不同数据源之间的迁移

配置文件修改:

namejdbcUrl 7.多表同步和多表定时增量同步

点击跳转参考地址

资料中使用shell脚本 + 模板文件实现批量同步实际使用时需要将模板文件中的column属性也通过变量的方式传参 8.待解决问题 1. windows上使用失败 2. addax使用失败

使用addax时连接不上数据库,相同脚本在datax中正常运行

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/747910.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号