阿里云大数据项目实战开发

发表于 2019-3-21 17:32:28

马上注册，结交更多数据大咖，获取更多知识干货，轻松玩转大数据

您需要登录才可以下载或查看，没有帐号？立即注册

x

本帖最后由 168主编于 2019-3-21 17:34 编辑

任务目标：从阿里云数据库中读取表1：hdp6_result 和表2：hdp6_locationresult
对取出的数据进行简单处理后，在数据库中新建表3：hdp5_resultsave
并将数据以分区表的形式存入表3当中。
1.数据库读取流程步骤1：链接Spark master

[AppleScript] 纯文本查看 复制代码

SparkConf conf = new SparkConf()
                    .setAppName("ResultData");
                    .setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);

步骤2：初始化OdpsContext，并进行数据库的select

[AppleScript] 纯文本查看 复制代码

OdpsContext sqlContext = new OdpsContext(ctx);
String sql = "select * from hdp6_result limit 100";
DataFrame df = sqlContext.sql(sql);
df.show(100);
df.printSchema();

2.多表查询，返回一个dataframe方法一：SQL语句多表查询，返回一个dataframe

[AppleScript] 纯文本查看 复制代码

String sql = "select result_value,lower_tolerance,upper_tolerance,"
                    + "time_stamp,type_number from hdp6_result "
                    + "INNER join hdp6_locationresult ON "
                    + "hdp6_result.location_result_uid = hdp6_locationresult.location_result_uid";
            DataFrame df = sqlContext.sql(sql);
            df.show();

重要事项！！！：经过验证，SQL语句多表查询的方式，并不适用于阿里云ODPS，可能是由于多表查询不利于官方进行计费。
方法二：每张表单独返回一个dataframe，将dataframe进行组合，返回一个新的dataframe

[AppleScript] 纯文本查看 复制代码

String sql1 = "select * from hdp6_result limit 100";
DataFrame df1 = sqlContext.sql(sql1);
df1.show(100);
df1.printSchema();

String sql2 = "select * from hdp6_locationresult limit 100";
DataFrame df2 = sqlContext.sql(sql2);
df2.show(100);
df2.printSchema();

DataFrame df0 = df2.join(df1, df1.col("location_result_uid").equalTo(df2.col("location_result_uid")));;
System.out.println("----3----");
df0.show(200);
df0.printSchema();

3.对数据进行简单处理并生成新的dataframe

[AppleScript] 纯文本查看 复制代码

      //---------------------------------------表格location_result列数据以getString形式取出
        List<String> resultlist = df1.select("location_result").distinct().javaRDD().map(new Function<Row, String>() {
            public String call(Row row) {
                return String.valueOf(row.getString(0));
            }
        }).distinct().collect();
        //---------------------------------------开始循环处理数据
        for (String locationresult : resultlist) {
            df2 = df1.filter("location_result = '" + locationresult + "' ");//特定location的内容
            df2.cache();//df2持久化
            DataFrame valueByType0 = df2.groupBy("tolerance", "type_number")
                    .agg(avg("result_value"),count("result_value"),stddev("result_value"))
                    .filter(("count(result_value) >= " + PART_COUNT_LOW_LIMIT))
                    .orderBy("tolerance", "type_number");//拆出公差，型号，平均和数量,方差
            List<Row> valueByType = valueByType0.javaRDD().map(new Function<Row, Row>() {
                public Row call(Row row) {
                    row.getString(0);//tolerance
                    row.getString(1);//type
                    row.getDouble(2);//avg
                    row.getLong(3);//count
                    row.getDouble(4);//std
                    return row;
                }
            }).collect();
            //-----------------------------------数组，String tolerance 0，String type 1，Double avg 2，Long count 3，Double std 4
            for (Row row : valueByType) {
                if (processed.contains(row.getString(0))) {
                    msg = "processed tolerance" + row.getString(0);
                    System.out.println(msg);
                } else if ((valueByType0.filter("tolerance = " + row.getString(0))).count() == 1) {
                    msg = "only one line" + row.getString(0);
                    System.out.println(msg);
                } else {
                    msg = "---------DATA Hangding--------";
                    System.out.println(msg);
                    processed.add(row.getString(0));//将该公差加入已处理列表
                    List<Row> valueByTypeOneTol = valueByType0.filter("tolerance = '" + row.getString(0)+"'").collectAsList();
                    double sum = 0;
                    long count = 0;
                    double stdsum = 0;
                    for (Row row1 : valueByTypeOneTol) //计算总平均
                    {
                        sum += row1.getDouble(2) * row1.getLong(3);
                        count += row1.getLong(3);
                        stdsum += Math.pow(row1.getDouble(4), 2);
                    }
                    msg = "---------Print result of avg&std--------";
                    System.out.println(msg);
                    double avg = sum / count;
                    double std = Math.pow((stdsum / valueByTypeOneTol.size()), 0.5);
                    System.out.println("avg = " + avg + ", std = " + std);

                    msg = "---------Result Judgement--------";
                    System.out.println(msg);
                    for (Row row2 : valueByTypeOneTol) {
                        System.out.println("delta avg  = " + (avg - row2.getDouble(2)));
                        String tol[] = row2.getString(0).split("/");
                        double tolSpan = Math.abs(Double.valueOf(tol[1]) - Double.valueOf(tol[0]));
                        System.out.println("tolSpan = " + tolSpan);
                        if (tolSpan <= 0) //判断公差是否为0
                        {
                            msg = "no tolerance or tolerance equals to 0";
                            System.out.println(msg);
                        } else {
                            if ((avg - row2.getDouble(2)) >= tolSpan * AVG_TOL)//均值判断 平均值 - 单个值 大为坏
                            {
                                msg = "typeno = " + row2.getString(1) + "avg need to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + "avg is less than the mean, please check if there is any problems in this station\n";
                            } else {
                                msg = "typeno = " + row.getString(1) + "do not need to be handled";
                                System.out.println(msg);
                            }
                            System.out.println("delta std  = " + (row.getDouble(4) - std));
                            if ((row.getDouble(4) - std) >= tolSpan * STD_TOL)//方差判断 单个值 - 平均值  大为坏
                            {
                                msg = "typeno = " + row.getString(1) + "std need to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + "std is greater than the mean, please check if there is any problems in this station\n";
                            } else {
                                msg = "typeno = " + row.getString(1) + "do not need to be handled";
                                System.out.println(msg);
                            }
                        }
                    }
                }
            }

4.新建表格用于存储处理后的数据

[AppleScript] 纯文本查看 复制代码

//----------------------------------------------新建一个表格
sqlContext.sql("Create table if not exists scxtest001 (uid varchar(500), time_stamp String) partitioned by (pt String)");

5.存储处理后的dataframe方法一：以临时表的形式存入新建表格注意，目前阿里云只有spark 2.3.0版本才支持建立临时视图
方法二：INSERTINTO和SAVEASTABLE的形式存入新建表格关于上述两个方法，请关注我个人的git lab 项目：
https://gitlab.com/uaes-tef3/hdev6online

作者：旺达哒哒哒
链接：https://www.jianshu.com/p/f69cca15b521
来源：简书

帐号		自动登录	找回密码
密码			立即注册

阿里云大数据项目实战开发

马上注册，结交更多数据大咖，获取更多知识干货，轻松玩转大数据

站长推荐 /1