GreenPlum 可读写外部表 实战
作者:黄杉Greenplum的gpload工具通过可读外部表和并行化文件服务器gpfdist(或gpfdists)来加载数据。gpload处理并行化的基于文件的外部表设置,以及允许我们使用单个YAML文件来配置数据格式,外部表定义,以及gpfdist或gpfdists。
要使用gpload工具有几个前提条件必须满足:
1. 使用gpload的服务器必须安装Python 2.6.2或者以上版本,pygresql工具和pyyaml工具(数据库服务端已经安装了python和需要的python库文件)
2. 必须装gpfdist程序,并把它设置到环境变量PATH中(可以从数据库服务器端的安装目录的子目录bin中拷贝该工具)
3. 确保使用gpload工具的ETL服务器与Greenplum所有服务器的联通性,以及机器名解析正确
gpload通过它的控制文件(YAML格式控制文件)指定装载的细节信息。所以编写控制文件是用好gpload的关键所在。gpload的控制文件采用YAML1.0.0.1文档格式编写,因此它必须得是有效的YAML格式。 查看gpload参数:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
$ gpload -?
gpload -f configuration file
Options:
-h hostname: host to connect to
-p port: port to connect to
-U username: user to connect as
-d database: database to connect to
-W: force password authentication
-q: quiet mode
-D: do not actually load data
-v: verbose
-V: very verbose
-l logfile: log output to logfile
--no_auto_trans: do not wrap gpload in transaction
--gpfdist_timeout timeout: gpfdist timeout value
--version: print version number and exit
-?: help
$
创建序列create SEQUENCE gpload_audit_seq INCREMENT BY 1 MINVALUE 1 NO MAXVALUE START WITH 1;
select * from gpload_audit_seq ;
创建审计表,记录每一次gpload执行的开始和结束事件信息
create table gpload_audit(
id bigint,
state text,
mode text,
tablename text,
updatetime timestamp
) distributed by (id);
创建gpload加载的表CREATE TABLE expenses
(
name text,
amount numeric,
category text,
des text,
update_date date
);
建立外部表:create writableexternal table t02 (
id integer,
name varchar(128)
)
location (
'gpfdist://101.254.31.72:8090/gpextdata/t21.txt',
'gpfdist://101.254.31.72:8090/gpextdata/t22.txt'
)
Format 'TEXT' (delimiter as E'|' null as '' escape 'OFF')
创建gpload.yml配置文件$ more gpload.yml
---
VERSION: 1.0.0.1
DATABASE: yueworld_db
USER: mch
HOST: 10.254.2.111
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- 10.254.3.72
PORT: 8090
FILE:
- /home/gpadmin/gpdextdata/t01.txt
- /home/gpadmin/gpdextdata/t02.txt
- COLUMNS:
- id: int
- name: text
- FORMAT: text
- DELIMITER: ','
- ERROR_LIMIT: 25
OUTPUT:
- TABLE: t01
- MODE: INSERT
SQL:
- BEFORE: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'start', 'insert', 'zhangyun_schema.expenses', current_timestamp)"
- AFTER: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'end', 'insert', 'zhangyun_schema.expenses', current_timestamp)"
$
在greenplum db库上赋予账号权限:
yueworld_db=# grant select,insert on gpload_audit to mch;
GRANT
yueworld_db=# grant select,insert,update,delete on t01 to mch;
GRANT
yueworld_db=#
yueworld_db=# grant select,usage,update on gpload_audit_seq to mch;
GRANT
yueworld_db=# grant all on table t02 to mch;
GRANT
yueworld_db=#
去gpload服务器上,执行gpload:$ gpload -f g2.yml
2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30
2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'
2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30
2017-05-15 17:52:30|INFO|running time: 0.29 seconds
2017-05-15 17:52:30|INFO|rows Inserted = 4
2017-05-15 17:52:30|INFO|rows Updated = 0
2017-05-15 17:52:30|INFO|data formatting errors = 0
2017-05-15 17:52:30|INFO|gpload succeeded
$
在greenplum服务器上查询yueworld_db=# select * from t02;
ERROR:it is not possible to read from a WRITABLE external table.
HINT:Create the table as READABLE instead
yueworld_db=#
奇怪的是,当你再执行一次gpload -f y2.yml的时候,会持续往外部表里面写数据,而且是将原有的数据重新写入到t02里面去,而且执行第三次的时候,这个rows Inserted会变成16,如下看执行过程:$ gpload -f g2.yml
2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30
2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'
2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30
2017-05-15 17:52:30|INFO|running time: 0.29 seconds
2017-05-15 17:52:30|INFO|rows Inserted = 8
2017-05-15 17:52:30|INFO|rows Updated = 0
2017-05-15 17:52:30|INFO|data formatting errors = 0
2017-05-15 17:52:30|INFO|gpload succeeded
$
页:
[1]