Report on PAKDD 2014 (the 2014 Pacific-Asia Conference on Knowledge Discovery and Data Mining)

I attended the whole of PAKDD 2014, held in Tainan. In this article I report some interesting information about the conference and the talks that I attended. (PAKDD is one of the major international academic conferences in the field of data mining research.)

Importance of Succinct Data Structures for Data Mining

I attended a very nice tutorial by R. Raman about succinct data structures for data mining. I will try to report some main points of this tutorial here as faithfully as possible (but it is my interpretation).

A simple definition of a succinct data structure is as follows: it is a data structure that uses less memory than a naive data structure for storing the same information.

Why should we care?

One reason is that to perform fast data mining, it is usually important to have all the data in memory. But sometimes the data cannot fit. In the age of big data, if we use some very compact data structures, then we can fit more data into memory and perhaps we will not need to use distributed algorithms to handle big data. An example provided in the tutorial is a paper by Cammy & Zhao that used a single computer with a compact structure to beat a distributed MapReduce implementation on the same task. If we can fit all the data into the memory of a single computer, the performance may well be better, because data access is faster on a single computer than when the computation is distributed across several machines.

A second reason is that if a data structure is more compact, then in some cases more of it fits into the computer's cache, and access to the data may therefore be even faster. Hence, compressing data with succinct data structures does not always have a negative effect on execution time.
What characteristics should a compressed data structure provide?

• One important characteristic is that it should compress the information, and an algorithm using the data structure should ideally be able to work directly on it without decompressing the data.
• Another desirable characteristic is that it should provide the same interface as an uncompressed data structure. In other words, for an algorithm, we should be able to replace the data structure by a compressed data structure without having to modify the algorithm.
• A compressed data structure is usually composed of the data and an index for quick access to the data. The index should be smaller than the data.
• Sometimes there is a trade-off between redundancy in the data structure and query time: reducing redundancy may increase query time.

There exist various measures to assess how many bits are necessary to encode some information: naive, information-theoretic, entropy… If we design a succinct data structure and we use more memory than is necessary according to these measures, then we are doing something wrong.
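
As a small illustration of these measures (my own sketch, not an example from the tutorial), the following Python snippet compares the naive cost of storing a bit vector of length n with m one-bits against the information-theoretic lower bound and a zero-order entropy bound; the values of n and m below are arbitrary.

```python
import math

def space_bounds(n, m):
    """Bits needed to store a bit vector of length n that contains m one-bits."""
    naive = n  # one bit per position
    # Information-theoretic lower bound: log2 of the number of distinct such vectors
    info_theoretic = math.ceil(math.log2(math.comb(n, m)))
    # Zero-order empirical entropy bound: n * H0(m/n)
    p = m / n
    h0 = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    entropy = math.ceil(n * h0)
    return naive, info_theoretic, entropy

# Example: a sparse vector of one million bits with 1% of them set.
# A succinct representation should stay close to the two smaller bounds,
# paying only a small additional cost for its index.
print(space_bounds(1_000_000, 10_000))
```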

In the tutorial, it was also mentioned that there exist several libraries providing succinct data structure implementations, such as Sux4J, SDSL-lite, SDSL…

Also, many examples of succinct data structures were provided, such as binary trees implemented as bit vectors, multibit trees, wavelet trees, etc.
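
To make the bit-vector example a little more concrete, here is a minimal, illustrative sketch (my own simplification, not code from the tutorial) of rank support over a bit vector: the data stays in its raw form and a small index of per-block counts answers rank queries, which is the basic primitive behind wavelet trees and bit-vector encodings of trees.

```python
class RankBitVector:
    """Bit vector with a small auxiliary index for rank queries.

    rank(i) = number of 1-bits in positions [0, i). The index stores one
    cumulative count per block, so it is much smaller than the data itself
    (a real succinct structure uses o(n) extra bits via superblocks).
    """

    def __init__(self, bits, block_size=64):
        self.bits = bits
        self.block_size = block_size
        self.block_ranks = []          # cumulative 1-count before each block
        count = 0
        for start in range(0, len(bits), block_size):
            self.block_ranks.append(count)
            count += sum(bits[start:start + block_size])

    def rank(self, i):
        """Number of 1-bits among bits[0:i]."""
        block = i // self.block_size
        start = block * self.block_size
        return self.block_ranks[block] + sum(self.bits[start:i])


# Usage: rank queries without scanning the whole vector from the beginning.
bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0] * 100)
print(bv.rank(5))    # -> 3 ones in the first 5 positions
print(bv.rank(800))  # -> 400 ones in total
```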

On applications of association rule mining

Another very interesting talk was given by G. Webb. The talk first compared association rule mining with methods from the field of statistics to study associations in data. It was explained that:

• Statistics often tries to find a single model that fits the data, whereas association rule mining discovers multiple local models (associations) and lets the user choose the best ones (the rules that better explain the data).
• Association rule mining is scalable to high-dimensional data, whereas classical techniques from statistics cannot be applied to a large number of variables.

So why is association rule mining not used much in real applications? It was argued that the reason is that researchers in this field focus too much on performance (speed, memory) rather than on developing algorithms that can find unusual and important patterns. By focusing only on finding frequent rules, too much “junk” is presented to the user (frequent rules that are obvious). It was shown that in some applications it is actually not the frequent rules that matter, but the rare ones that have a high statistical significance or are important to the user.

So what is important to the user? It is a little bit subjective. However, there are at least four principles that can help to know what is NOT important to the user.

1) If the frequency can be predicted by assuming independence, then the association is not important. For example, finding that all persons having prostate cancer are males is an uninteresting association, because it is obvious that only males can get prostate cancer.
2) Redundant associations should not be presented to the user. If an item X is a necessary consequence of a set of items Y, then {X} ∪ Y is associated with everything that Y is. We do not need all of these rules. In general, we should not keep both the simple and the more complex versions of a rule (we should remove the redundant rules).
3) Statistical tests should be applied to filter out non-significant associations (a sketch illustrating principles 1 and 3 follows this list).
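
As a rough illustration of principles 1 and 3 (my own sketch, not code from the talk; it assumes the third-party SciPy library), the snippet below measures how far an association X → Y deviates from what independence would predict (its lift) and applies a Fisher exact test before a rule would be shown to the user.

```python
from scipy.stats import fisher_exact

def interesting(n, n_x, n_y, n_xy, alpha=0.05):
    """Score the association X -> Y against what independence would predict.

    n    : total number of transactions
    n_x  : transactions containing X
    n_y  : transactions containing Y
    n_xy : transactions containing both X and Y
    """
    expected = n_x * n_y / n                  # co-occurrences predicted by independence
    lift = n_xy / expected if expected else float("inf")
    # 2x2 contingency table: (X, not X) x (Y, not Y)
    table = [[n_xy, n_x - n_xy],
             [n_y - n_xy, n - n_x - n_y + n_xy]]
    _, p_value = fisher_exact(table, alternative="greater")
    keep = lift > 1.0 and p_value < alpha     # principle 1 and principle 3
    return lift, p_value, keep

# Example: X appears in 300 of 1000 transactions, Y in 400, and they co-occur
# 200 times; independence would predict only 120 co-occurrences.
print(interesting(n=1000, n_x=300, n_y=400, n_xy=200))
```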

Also, it is desirable to mine associations efficiently and to be able to explain to the user why some rules are eliminated, if necessary.

Also, if possible, we may use top-k algorithms, where the user chooses the number of patterns to be found, rather than using a minsup threshold. The reason is that sometimes the best associations are rare associations.
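
A minimal sketch of that idea (my own illustration, with made-up scores): instead of a minimum support threshold, keep only the k associations with the highest interestingness score, for example using a bounded heap.

```python
import heapq

def top_k_rules(scored_rules, k):
    """Keep the k rules with the highest score, without any minsup threshold.

    scored_rules: iterable of (score, rule) pairs, e.g. (leverage, "A -> B").
    """
    heap = []                          # min-heap of the k best rules seen so far
    for score, rule in scored_rules:
        if len(heap) < k:
            heapq.heappush(heap, (score, rule))
        elif score > heap[0][0]:       # better than the current k-th best rule
            heapq.heapreplace(heap, (score, rule))
    return sorted(heap, reverse=True)  # best rules first

# Usage: rare but highly significant rules can surface even though their
# support would fall below any reasonable minsup threshold.
rules = [(0.02, "A -> B"), (0.45, "C -> D"), (0.31, "E -> F"), (0.07, "G -> H")]
print(top_k_rules(rules, k=2))   # [(0.45, 'C -> D'), (0.31, 'E -> F')]
```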

These were the main ideas that I noticed in this presentation.

About big data

Another interesting talk at this conference was given by Jian Pei. The topic was Big Data.

Some key ideas in this talk were that to make a technology useful, you have to make it small and invisible. A system relying on data mining may have to detect when a user needs a data mining service and provide the service as early as possible.

Another desirable characteristic of a data mining system is that a user should be able to set preferences. Moreover, if a user interactively changes his or her preferences, the results should be updated quickly. A data mining system should also be context-aware.

It was also mentioned that big data is always relative. Some papers in the 1970s were already talking about large data, and recently some conferences have even adopted the theme “extremely large databases”. But even if “big” is relative, since 2003 the world has recorded more data every few days than everything that had been recorded before.

Social activities and organization

In general, PAKDD was very well organized. The organizers did a huge job. It is personally one of the best conferences that I have attended in terms of organization. I was also able to meet many interesting people from the field of data mining whom I had not met before.

The social activities and banquet were also nice.

Location of PAKDD 2015

The location of PAKDD 2015 was announced. It will be in Ho Chi Minh City, Vietnam, from 19-22 May 2015. The website is http://pakdd2015.pakdd.org

The deadline for paper submission is 3 October 2014 and the notification date is 26 December 2014.

The panel about big data

On Friday, there was a great panel about big data with seven top researchers from the field of data mining. I will try to faithfully report some interesting opinions and ideas heard during the panel. Of course, the text below is my interpretation.

Learning from large data

Geoff Webb discussed the challenges of learning from large quantities of data. He mentioned that the majority of research focuses on how we can scale up existing algorithms rather than on designing new algorithms. He mentioned that different algorithms have different learning curves and that some models may work very well with small data while other models may work better with big data. Actually, some models that can fit complex and large amounts of data may tend to overfit on small data.

In his opinion, we should not just try to scale up state-of-the-art algorithms but rather design new algorithms that can cope with huge quantities of data, high dimensionality and fine-grained data. We need low-bias, very efficient, and probably out-of-core algorithms.

Another interesting point is that there is a popular myth that any algorithm will work well if we can train it with big data. That is not true. Different algorithms have different learning curves (they produce different error rates depending on the size of the training data).
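
As an illustration of this point (my own sketch, not from the talk; it assumes scikit-learn and its small digits data set), one can compare the learning curves of a high-bias model and a more flexible one as the training set grows.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Compare how the cross-validated error of two very different learners
# evolves with the amount of training data.
for name, model in [("Naive Bayes", GaussianNB()),
                    ("Random forest", RandomForestClassifier(n_estimators=100))]:
    sizes, _, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    errors = 1.0 - test_scores.mean(axis=1)
    print(name, [f"{int(s)} examples: {e:.3f}" for s, e in zip(sizes, errors)])
# In many settings the curves cross: the simpler model can do better with
# little data, while the more flexible one keeps improving as data grows.
```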

Big data and the small footprint

Another interesting opinion was given by Edward Chang. It was mentioned that simple methods can often outperform complex classifiers when the number of training examples is large. He mentioned that complex algorithms are hard to parallelize and that the solution may thus be to use simple algorithms for big data. As an example, he mentioned that he tried to parallelize “deep learning” algorithms for two years and failed because they are too complex.

Another key idea is that data mining on big data should have a small footprint in terms of memory and power consumption. The latter point is especially important for wearable computers. But, of course, some of the processing could be done in the cloud.

Should we focus on the small data problems?

Another very interesting point of view was presented by George Karypis. We are told that big data is everywhere and that there is more and more data. We have responded by proposing technologies such as MapReduce, linear models, deep learning, sampling, sub-linear algorithms, etc. However, we should stop spending time on big data problems that are relevant to only a few companies (e.g., Google, Microsoft).

We should rather focus on “deep data”. This means data that may be small but is highly complex, computationally expensive to analyze, and requires a “deep” understanding, yet can easily fit on today's workstations and small-scale clusters.

We should focus on applications that are useful rather than concentrating too much work on big data.

On the need to cross disciplines

Another refreshing point of view was the one of Shonali Krishnaswamy.

She also mentioned that data mining on mobile platforms may be hard due to complex computations, limited resources and users who have a short attention span.

Moreover, to be able to perform data mining on big data, we will need to cross disciplines by drawing inspiration from work in the fields of (1) parallel/distributed algorithms, (2) mobile/pervasive computing, (3) interfaces/visualization, (4) decision sciences, and (5) perhaps semantic agents.

Issues in healthcare

There was also some discussion about issues in health care by Jiming Liu. I will not go into too much detail about this one, since the topic is not closely related to my work. But some of the challenges that were mentioned are how to deal with diversity, complexity, timeliness, diverse data sources, and spatio-temporal scales with respect to the problem, complex interactions, structural biases, how to perform data-driven modelling, how to test results and services, and how to access and share data.

Coupling

There was also another discussion by Longbing Cao about the need for coupling. I did not take many notes about this one, so I will not discuss it here.

This article is based on Philippe Fournier-Viger's report on PAKDD 2014 (the 2014 Pacific-Asia Conference on Knowledge Discovery and Data Mining), originally published on his data-mining blog.


