168大数据

[Theoretical Framework] Data Governance, Master Data, Metadata: What We Got Wrong About Data Governance

Posted 2020-12-3 19:52:42


Data governance, master data, metadata

Data governance is top of mind for many of my customers, particularly in light of GDPR, CCPA, COVID-19, and any number of other acronyms that speak to the increasing importance of data management when it comes to protecting user data.

Over the past several years, data catalogs have emerged as a powerful tool for data governance, and I couldn’t be happier. As companies digitize and their data operations democratize, it’s important for all elements of the data stack, from warehouses to business intelligence platforms, and now, catalogs, to participate in compliance best practices.

But are data catalogs all we need to build a robust data governance program?

Data catalogs for data governance?

Analogous to a physical library catalog, data catalogs serve as an inventory of metadata and give users the information necessary to evaluate data accessibility, health, and location. Companies like Alation, Collibra, and Informatica tout solutions that not only keep tabs on your data, but also integrate with machine learning and automation to make data more discoverable, more collaborative, and, now, compliant with organizational, industry-wide, or even government regulations.

Since data catalogs provide a single source of truth about a company’s data sources, they are a natural place from which to manage the data in your pipelines. Data catalogs can store metadata that gives stakeholders a better understanding of a specific source’s lineage, thereby instilling greater trust in the data itself. Additionally, data catalogs make it easy to keep track of where personally identifiable information (PII) is housed and where it sprawls downstream, as well as who in the organization has permission to access it across the pipeline.
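To make the idea concrete, here is a minimal sketch of what a catalog entry covering lineage, PII, and access permissions could look like. The `CatalogEntry` schema, the `pii_access_report` helper, and the table and role names are all hypothetical, invented for illustration; they do not correspond to any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One table's metadata as a catalog might record it (hypothetical schema)."""
    name: str
    upstream: list[str] = field(default_factory=list)      # lineage: direct parent tables
    pii_columns: list[str] = field(default_factory=list)   # columns flagged as PII
    allowed_roles: set[str] = field(default_factory=set)   # roles permitted to read the table


def pii_access_report(entries: list[CatalogEntry]) -> dict[str, list[str]]:
    """Map each table that holds PII to the sorted list of roles that can read it."""
    return {e.name: sorted(e.allowed_roles) for e in entries if e.pii_columns}


# Illustrative entries only; none of these tables come from the article.
catalog = [
    CatalogEntry("raw.users", pii_columns=["email", "phone"], allowed_roles={"data_eng"}),
    CatalogEntry("analytics.orders", upstream=["raw.users"],
                 allowed_roles={"analyst", "data_eng"}),
    CatalogEntry("analytics.user_profile", upstream=["raw.users"],
                 pii_columns=["email"], allowed_roles={"analyst", "data_eng"}),
]

report = pii_access_report(catalog)
print(report)
```

A report like this answers the "who can access PII, and where" question in one pass, but only as long as the `pii_columns` and `allowed_roles` fields are kept up to date.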

What’s right for my organization?

So, what type of data catalog makes the most sense for your organization? To make your life a little easier, I spoke with data teams in the field to learn about their data catalog solutions, breaking them down into three distinct categories: in-house, third-party, and open source.

In-house

Some B2C companies (I’m talking about the Airbnbs, Netflixes, and Ubers of the world) build their own data catalogs to ensure data compliance with regulations at the state, country, and even economic union level (I’m looking at you, GDPR). The biggest perk of in-house solutions is the ability to quickly spin up customizable dashboards that pull out the fields your team needs most.






Uber’s Databook lets data scientists easily search for tables. Image courtesy of Uber Engineering.

While in-house tools make for quick customization, over time such hacks can lead to a lack of visibility and collaboration, particularly when it comes to understanding data lineage. In fact, one data leader I spoke with at a food delivery startup noted that what was clearly missing from her in-house data catalog was a “single pane of glass.” If she had a single source of truth that could provide insight into how her team’s tables were being leveraged by other parts of the business, ensuring compliance would be far easier.

On top of these tactical considerations, spending engineering time and resources on building a multimillion-dollar data catalog just doesn’t make sense for the vast majority of companies.

Third-party

Since its founding in 2012, Alation has largely paved the way for the rise of the automated data catalog. Now, there is a whole host of ML-powered data catalogs on the market, including Collibra, Informatica, and others, many with pay-for-play workflow and repository-oriented compliance management integrations. Some cloud providers, like Google, AWS, and Azure, also offer data governance tooling integration at an additional cost.

In my conversations with data leaders, one downside of these solutions came up time and again: usability. While nearly all of these tools have strong collaboration features, one Data Engineering VP I spoke with specifically called out his third-party catalog’s unintuitive UI.

If data tools aren’t easy to use, how can we expect users to understand or even care whether they’re compliant?

Open source

In 2019, Lyft became an industry leader by open sourcing its data discovery and metadata engine, Amundsen, named after the famed Antarctic explorer. Other open source tools, such as Apache Atlas, Magda, and CKAN, provide similar functionality, and all three make it easy for development-savvy teams to fork an instance of the software and get started.






Amundsen, an open source data catalog, gives users insight into schema usage. Image courtesy of Mikhail Ivanov.

While some of these tools allow teams to tag metadata to control user access, this is an intensive and often manual process that most teams just don’t have the time to tackle. In fact, a product manager at a leading transportation company shared that his team specifically chose not to use an open source data catalog because these tools lacked off-the-shelf support for all the data sources and data management tooling in their stack, making data governance extra challenging. In short, open source solutions just weren’t comprehensive enough.

Still, there’s something critical to compliance that even the most advanced catalog can’t account for: data downtime.

The missing link: data downtime

Recently, I developed a simple metric for a customer that helps measure data downtime: periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. When applied to data governance, data downtime gives you a holistic picture of your organization’s data reliability. Without data reliability to power full discoverability, it’s impossible to know whether your data is fully compliant and usable.
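The post does not spell out the metric itself, so the following is only one plausible way to put a number on data downtime, not the author’s formula. It assumes each incident is recorded as a (start, end) pair of timestamps and sums the time during which at least one incident was open. The function name `data_downtime` and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def data_downtime(incidents, window_start, window_end):
    """Total time inside the window during which at least one incident was open.

    `incidents` is a list of (start, end) datetime pairs. Overlapping incidents
    are merged so the same period is never counted twice.
    """
    # Clip each incident to the reporting window and drop those outside it.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in incidents
        if s < window_end and e > window_start
    )
    total = timedelta(0)
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Made-up incidents for a single day.
day_start = datetime(2020, 12, 1)
day_end = day_start + timedelta(days=1)
incidents = [
    (datetime(2020, 12, 1, 1, 0), datetime(2020, 12, 1, 3, 0)),
    (datetime(2020, 12, 1, 2, 0), datetime(2020, 12, 1, 4, 0)),    # overlaps the first
    (datetime(2020, 12, 1, 10, 0), datetime(2020, 12, 1, 10, 30)),
]

downtime = data_downtime(incidents, day_start, day_end)
downtime_fraction = downtime / (day_end - day_start)
print(downtime, f"{downtime_fraction:.1%}")
```

Merging overlaps is a design choice: two simultaneous incidents on the same day count as one stretch of unreliable data rather than double the downtime.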

Data catalogs solve some, but not all, of your data governance problems. To start, mitigating governance gaps is a monumental undertaking, and it’s impossible to prioritize those gaps without a full understanding of which data assets are actually being accessed by your company. Data reliability fills this gap and allows you to unlock your data ecosystem’s full potential.
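As a rough illustration of prioritizing by actual usage, the sketch below ranks assets that lack governance metadata by how often they show up in an access log. The `prioritize_gaps` function, the log format (a flat list of table names), and the asset names are assumptions made for this example.

```python
from collections import Counter


def prioritize_gaps(access_log, ungoverned_assets):
    """Rank ungoverned assets by how often they appear in the access log."""
    counts = Counter(asset for asset in access_log if asset in ungoverned_assets)
    # Keep never-accessed assets with a count of zero so nothing is silently dropped.
    return sorted(
        ((asset, counts.get(asset, 0)) for asset in ungoverned_assets),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical access log and set of assets with no owner or classification yet.
access_log = ["raw.users", "analytics.orders", "raw.users",
              "tmp.export", "raw.users", "tmp.export"]
ungoverned = {"raw.users", "tmp.export", "legacy.archive"}

ranking = prioritize_gaps(access_log, ungoverned)
print(ranking)
```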

Additionally, without real-time lineage, it’s impossible to know where PII or other regulated data sprawls. Think about it for a second: even if you’re using the fanciest data catalog on the market, your governance is only as good as your knowledge of where that data goes. If your pipelines aren’t reliable, neither is your data catalog.
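One way to picture PII sprawl is as a graph traversal over lineage. The sketch below assumes lineage is available as a simple mapping from each table to its direct downstream tables; the `downstream_of` helper and the table names are hypothetical, and a real catalog or lineage tool would expose this information through its own interface.

```python
from collections import deque


def downstream_of(lineage, source):
    """Return every table reachable from `source` by following lineage edges."""
    seen = set()
    queue = deque([source])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: each key feeds the tables in its list.
lineage = {
    "raw.users": ["staging.users"],
    "staging.users": ["analytics.user_profile", "ml.features"],
    "ml.features": ["ml.predictions"],
    "raw.orders": ["analytics.orders"],
}

sprawl = downstream_of(lineage, "raw.users")
print(sorted(sprawl))
```

The result is only as trustworthy as the lineage mapping itself: if a pipeline silently changes and the mapping is stale, the traversal will miss tables that now carry PII.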

Owing to their complementary features, data catalogs and data reliability solutions work hand in hand to provide an engineering approach to data governance, no matter which acronyms you need to comply with.

Personally, I’m excited for what the next wave of data catalogs has in store. And trust me: it’s more than just data.

If you want to learn more, reach out to Barr Moses.




Translated from: https://towardsdatascience.com/what-we-got-wrong-about-data-governance-365555993048

