





How can big data technology perform at its best? (When big data is truly better)

Posted on 2014-9-13 22:19:40





















When big data is truly better

Take advantage of scale when past experience indicates greater analytic value will result. But big data is not a hammer, nor is every problem a nail.

Many people assume that big data means bigger is always better. People tend to approach the “bigger is better” question from various philosophical perspectives, which I characterize thusly:

Faith: This is the notion that, somehow, greater volumes, velocities, and/or varieties of data will always deliver fresher insights, which amounts to the core value of big data analytics. If we’re unable to find those insights, according to this perspective, it’s only because we’re not trying hard enough, we’re not smart enough, or we’re not using the right tools and approaches.

Fetish: This is the notion that the sheer bigness of data is a value in its own right, regardless of whether we’re deriving any specific insights from it. If we’re evaluating the utility of big data solely on the specific business applications it supports, according to this outlook, we’re not in tune with the modern need of data scientists to store data indiscriminately in data lakes to support future explorations.

Burden: This is the notion that the bigness of data is not necessarily better or worse, but it is simply a fact of life that has the unfortunate consequence of straining the storage and processing capacity of existing databases, thereby necessitating new platforms (such as Hadoop). If we’re not able to keep up with all this burdensome new data, or so this perspective leads us to believe, the core business imperative is to change over to a new type of database.

Opportunity: This is, in my opinion, the right approach to big data. It’s focused on extracting unprecedented insights more effectively and efficiently as the data scales to new heights, streams in faster, and originates in an ever-growing range of sources and formats. It doesn’t treat big data as a faith or fetish, because it acknowledges that many differentiated insights can continue to be discovered at lower scales. It doesn’t treat data’s scale as a burden, either, but as simply a challenge to be addressed effectively through new database platforms, tooling, and practices.

Last year, I blogged on the hardcore use cases for big data in a discussion that was exclusively on the “opportunity” side of the equation. Later in the year, I observed that big data’s core “bigness” value derives from the ability of incremental content to reveal incremental context. More context is better than less when what you’re doing is analyzing data in order to ascertain its full significance. Likewise, more content is better than less when you’re trying to identify all of the variables, relationships, and patterns in your problem domain to a finer degree of granularity. The bottom line: More context plus more content usually equals more data.

Big data’s value is also in its ability to correct errors that are more likely to crop up at smaller scale. In that same post, I cited a third party who observed that, for a data scientist, having less data in their training set means they’re susceptible to several modeling risks. For starters, at smaller scales you’re more likely to overlook key predictive variables. You are also more likely to skew the model to nonrepresentative samples. In addition, you’re more likely to find spurious correlations that would disappear if you had a more complete data set revealing the underlying relationships at work.
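The spurious-correlation risk can be seen in a toy simulation. The sketch below is not from the cited post; it assumes a purely random target and 50 unrelated random features (all names and sizes are illustrative), and reports the strongest correlation that turns up by chance at different sample sizes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_rows, n_features, seed=0):
    """Largest |correlation| between a random target and unrelated random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is drawn independently of the target, so any
        # correlation found here is spurious by construction.
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    for n_rows in (20, 200, 2000):
        print(n_rows, round(strongest_spurious_correlation(n_rows, n_features=50), 3))
```

In runs of this kind, the strongest chance correlation shrinks as the number of rows grows, which is the sense in which a more complete data set makes spurious relationships disappear.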

Scale can be beautiful

Everybody recognizes that some types of data and some use cases are more conducive than others to realizing fresh insights at scale.

In that vein, I recently came across a great article that spells out one specific category of data (sparse, fine-grained behavioral data) on which predictive performance often improves with scale. The authors, Junqué de Fortuny, Martens, and Provost, state that “a key aspect of such datasets is that they are sparse: For any given instance, the vast majority of the features have a value of zero or ‘not present.’”
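To make “sparse” concrete, here is a minimal sketch of how such data is often stored. The three users, the feature indices, and the catalog size of 10,000 are made-up assumptions for illustration, not figures from the article.

```python
# Hypothetical data: each user is stored as {feature_index: value},
# and every feature that is zero or "not present" is simply omitted.
users = [
    {3: 1, 907: 1},
    {12: 1},
    {3: 1, 44: 1, 5120: 1},
]
N_FEATURES = 10_000  # assumed size of the full behavior catalog


def density(instances, n_features):
    """Fraction of (instance, feature) cells that hold a non-zero value."""
    nonzero = sum(len(features) for features in instances)
    return nonzero / (len(instances) * n_features)


def to_dense(features, n_features):
    """Expand one sparse instance into a full row, filling absent features with 0."""
    row = [0] * n_features
    for index, value in features.items():
        row[index] = value
    return row


if __name__ == "__main__":
    print(density(users, N_FEATURES))
    print(sum(to_dense(users[0], N_FEATURES)))
```

Storing only the non-zero entries keeps memory proportional to the behaviors actually observed rather than to the size of the catalog.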

What’s most noteworthy about this (and the authors support their discussion by citing ample research) is that this type of data is at the heart of many big data applications with a customer-analytics focus. Social media behavioral data fits this description, as do Web browsing behavioral data, mobile behavioral data, advertising response behavioral data, natural language behavioral data, and so on.

“Indeed,” the authors state, “for many of the most common business applications of predictive analytics, such as targeted marketing in banking and telecommunications, credit scoring, and attrition management, the data used for predictive analytics are very similar … [T]he features tend to be demographic, geographic, and psychographic characteristics of individuals, as well as statistics summarizing particular behaviors, such as their prior purchase behavior with the firm.”

The core reason why bigger behavioral data sets are usually better is simple, the authors state: “Certain telling behaviors may not be observed in sufficient numbers without massive data.” That’s because, in a sparse data set, no individual person whose behavior is being recorded is likely to exhibit more than a limited range of behaviors. But when you look across an entire population, you’re likely to observe every specific type of behavior being expressed at least once and perhaps numerous times within specific niches. At smaller data scales, looking at fewer subjects and observing fewer behavioral features, you’re likely to overlook much of this richness.
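A small simulation can illustrate the point. This is a sketch under stated assumptions rather than the authors’ analysis: it assumes a catalog of 1,000 behaviors with Zipf-like popularity, five draws per person (repeats collapse into one), and a threshold of 10 observations before a behavior counts as seen “in sufficient numbers.”

```python
import itertools
import random


def behaviors_seen(n_people, n_behaviors=1000, per_person=5, min_count=10, seed=0):
    """Count behaviors observed at least `min_count` times in a simulated population."""
    rng = random.Random(seed)
    # Assumed long-tailed popularity: behavior k is drawn with weight 1 / (k + 1).
    weights = [1.0 / (k + 1) for k in range(n_behaviors)]
    cum_weights = list(itertools.accumulate(weights))
    counts = [0] * n_behaviors
    for _ in range(n_people):
        # Each person exhibits only a handful of behaviors (sparse rows).
        drawn = rng.choices(range(n_behaviors), cum_weights=cum_weights, k=per_person)
        for behavior in set(drawn):
            counts[behavior] += 1
    return sum(1 for c in counts if c >= min_count)


if __name__ == "__main__":
    for n_people in (100, 1_000, 10_000):
        print(n_people, behaviors_seen(n_people))
```

Under these assumptions, only the most popular behaviors clear the threshold in a small population, while a much larger population also captures behaviors from the long tail.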

Predictive models thrive on the richness of the source behavioral data sets, in order to drive more accurate predictions across a wider range of future scenarios. Hence, bigger usually is better.

When bigger equals fuzzier

Nonetheless, the authors also note scenarios where this assumption falls apart, and it all has to do with the predictive value of specific behavioral features. Essentially, a trade-off underlies predictive behavioral modeling.

Each incremental new behavioral feature added to a predictive model should be relevant enough to the prediction that the gain in learning yield and predictive power outweighs the ever-wider variance, and hence the over-fitting and predictive error, that tends to come with ever-larger feature sets. As the authors state: “The large number of irrelevant features simply increases variance and the opportunity to over-fit, without the balancing opportunity of learning better models (presuming that one can actually select the right subset).”
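The trade-off can be shown with a toy experiment. The sketch below is an illustration, not the authors’ method: it assumes a single informative feature, a 1-nearest-neighbor classifier chosen for simplicity, and synthetic Gaussian data; all function names and parameters are hypothetical.

```python
import random


def make_dataset(n_rows, n_noise, rng, separation=3.0):
    """Rows of (features, label): one informative feature plus `n_noise` irrelevant ones."""
    rows = []
    for i in range(n_rows):
        label = i % 2
        # The informative feature is centered at 0 for label 0 and at `separation` for label 1.
        informative = rng.gauss(separation * label, 1.0)
        # The irrelevant features are pure noise, identical in distribution for both labels.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append(([informative] + noise, label))
    return rows


def predict_1nn(train, features):
    """Label of the training row closest in squared Euclidean distance."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(features, other))
    return min(train, key=lambda row: sq_dist(row[0]))[1]


def accuracy(n_noise, n_train=100, n_test=100, seed=0):
    """Held-out accuracy of 1-NN when `n_noise` irrelevant features are added."""
    rng = random.Random(seed)
    train = make_dataset(n_train, n_noise, rng)
    test = make_dataset(n_test, n_noise, rng)
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / n_test


if __name__ == "__main__":
    for n_noise in (0, 30, 300):
        print(n_noise, accuracy(n_noise))
```

With zero noise features, the model is effectively given “the right subset”; as irrelevant features pile up, they dominate the distance calculation and held-out accuracy tends to fall, even though no information has been removed.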

Clearly, bigger isn’t better when bigness gets in the way of deriving predictive insights. You don’t want your big data analytics effort to be a victim of its own scale. Your data scientists have to be smart enough to know when to scale back their models to the hard core of features best suited to the analytic task at hand.

Original article: http://www.infoworld.com/d/big-data/when-big-data-truly-better-249737 (translation via 51CTO)
