最具影响力的数字化技术在线社区

168主编 发表于 2017-5-24 19:06:52

GreenPlum 可读写外部表 实战

作者:黄杉

Greenplum的gpload工具通过可读外部表和并行化文件服务器gpfdist(或gpfdists)来加载数据。gpload处理并行化的基于文件的外部表设置,以及允许我们使用单个YAML文件来配置数据格式,外部表定义,以及gpfdist或gpfdists。


要使用gpload工具有几个前提条件必须满足:
1.    使用gpload的服务器必须安装Python 2.6.2或者以上版本,pygresql工具和pyyaml工具(数据库服务端已经安装了python和需要的python库文件)
2.    必须装gpfdist程序,并把它设置到环境变量PATH中(可以从数据库服务器端的安装目录的子目录bin中拷贝该工具)
3.    确保使用gpload工具的ETL服务器与Greenplum所有服务器的联通性,以及机器名解析正确


gpload通过它的控制文件(YAML格式控制文件)指定装载的细节信息。所以编写控制文件是用好gpload的关键所在。gpload的控制文件采用YAML1.0.0.1文档格式编写,因此它必须得是有效的YAML格式。 查看gpload参数:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
$ gpload -?
gpload -f configuration file

Options:
    -h hostname: host to connect to
    -p port: port to connect to
    -U username: user to connect as
    -d database: database to connect to
    -W: force password authentication
    -q: quiet mode
    -D: do not actually load data
    -v: verbose
    -V: very verbose
    -l logfile: log output to logfile
    --no_auto_trans: do not wrap gpload in transaction
    --gpfdist_timeout timeout: gpfdist timeout value
    --version: print version number and exit
    -?: help

$      



创建序列create SEQUENCE gpload_audit_seq INCREMENT BY 1 MINVALUE 1 NO MAXVALUE START WITH 1;
select * from gpload_audit_seq ;

创建审计表,记录每一次gpload执行的开始和结束事件信息

create table gpload_audit(                                                               
    id bigint,
    state text,
    mode text,
    tablename text,
    updatetime timestamp
) distributed by (id);

创建gpload加载的表CREATE TABLE expenses
(
    name text,
    amount numeric,
    category text,
    des text,
    update_date date
);
建立外部表:create writableexternal table t02 (
id integer,
name varchar(128)
)
location (
'gpfdist://101.254.31.72:8090/gpextdata/t21.txt',
'gpfdist://101.254.31.72:8090/gpextdata/t22.txt'
)
Format 'TEXT' (delimiter as E'|' null as '' escape 'OFF')
创建gpload.yml配置文件$ more gpload.yml
---
VERSION: 1.0.0.1
DATABASE: yueworld_db
USER: mch
HOST: 10.254.2.111
PORT: 5432
GPLOAD:
INPUT:
    - SOURCE:
      LOCAL_HOSTNAME:
          - 10.254.3.72
      PORT: 8090
      FILE:
          - /home/gpadmin/gpdextdata/t01.txt
          - /home/gpadmin/gpdextdata/t02.txt
            
    - COLUMNS:
      - id: int
      - name: text
    - FORMAT: text
    - DELIMITER: ','
    - ERROR_LIMIT: 25
OUTPUT:
    - TABLE: t01
    - MODE: INSERT
SQL:
    - BEFORE: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'start', 'insert', 'zhangyun_schema.expenses', current_timestamp)"
    - AFTER: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'end', 'insert', 'zhangyun_schema.expenses', current_timestamp)"
$
在greenplum db库上赋予账号权限:
yueworld_db=# grant select,insert on gpload_audit to mch;
GRANT
yueworld_db=# grant select,insert,update,delete on t01 to mch;
GRANT
yueworld_db=#
yueworld_db=# grant select,usage,update on gpload_audit_seq to mch;
GRANT
yueworld_db=# grant all on table t02 to mch;
GRANT
yueworld_db=#
去gpload服务器上,执行gpload:$ gpload -f g2.yml
2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30
2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'
2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30
2017-05-15 17:52:30|INFO|running time: 0.29 seconds
2017-05-15 17:52:30|INFO|rows Inserted          = 4
2017-05-15 17:52:30|INFO|rows Updated         = 0
2017-05-15 17:52:30|INFO|data formatting errors = 0
2017-05-15 17:52:30|INFO|gpload succeeded
$
在greenplum服务器上查询yueworld_db=# select * from t02;
ERROR:it is not possible to read from a WRITABLE external table.
HINT:Create the table as READABLE instead
yueworld_db=#
奇怪的是,当你再执行一次gpload -f y2.yml的时候,会持续往外部表里面写数据,而且是将原有的数据重新写入到t02里面去,而且执行第三次的时候,这个rows Inserted会变成16,如下看执行过程:$ gpload -f g2.yml
2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30
2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'
2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30
2017-05-15 17:52:30|INFO|running time: 0.29 seconds
2017-05-15 17:52:30|INFO|rows Inserted          = 8
2017-05-15 17:52:30|INFO|rows Updated         = 0
2017-05-15 17:52:30|INFO|data formatting errors = 0
2017-05-15 17:52:30|INFO|gpload succeeded
$




页: [1]
查看完整版本: GreenPlum 可读写外部表 实战