A Pipeline Model to Discover Frequent Itemsets in Hierarchical Systems

  • Khedija Arour

Abstract

Like other fields of data processing, modern information systems have absorbed the results of the advanced technologies of recent decades. These systems contain implicit data that must be extracted and exploited using data mining techniques. One such technique is association rule mining, which aims to find interesting association or correlation relationships among large amounts of data. It is a two-step process: the first step finds all frequent itemsets, and the second step constructs association rules from these frequent sets. The overall performance of association rule mining is determined by the first step, which is therefore the focus of this work; it is expensive, with high demands on computation and data access. Parallel computing has a natural role to play here, since parallel computers provide scalability. In this paper, we examine the problem of mining association rules among items in large transaction databases and propose a new parallel version of the Apriori algorithm of Agrawal, the core algorithm of this mining technique. Our objective is an efficient parallel execution time, which requires a delicate balance between program granularity and the communication latency (synchronization overhead) between the different granules. Unlike previous work on parallelizing specific data mining algorithms, our approach is to explore the different granularity levels of parallelism and their impact on performance. We focus on combining task and data parallelism (a hybrid approach) under distributed memory. If communication latency is minimal, fine-grain partitioning yields the best performance; this is the case when data parallelism is used. If communication latency is large (as in a loosely coupled system), coarse-grain partitioning is more appropriate. For the target architecture used in this work (distributed-shared memory), load balancing among the nodes becomes a more critical issue for achieving high performance. We have carried out a detailed evaluation of the parallelization techniques and of the impact of combining different types of parallelism (task, data, and pipeline) on the effectiveness of the system.
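The abstract describes the Apriori procedure only at a high level. The following minimal Python sketch illustrates the serial frequent-itemset step (the first, costly phase that the paper parallelizes); it is not the authors' parallel implementation, and the function name, parameters, and toy database are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-itemsets seed the (k+1)-candidates."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # Pass 1: count single items and keep those meeting min_support.
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in item_counts.items()
                if c / n >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune
        # candidates that have an infrequent (k-1)-subset (the Apriori property).
        prev = list(frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(frozenset(s) in frequent
                                       for s in combinations(union, k - 1)):
                candidates.add(union)

        # Counting pass over the whole database: this is the expensive step
        # that data-parallel (partitioned counting) or pipelined schemes target.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        all_frequent.update(frequent)
        k += 1

    return all_frequent

if __name__ == "__main__":
    # Toy transaction database (hypothetical data for illustration only).
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(apriori_frequent_itemsets(db, min_support=0.6))
```

In a data-parallel variant of this sketch, each node would run the counting loop over its own partition of `transactions` and the local counts would be summed before the support test, which is where the granularity versus communication-latency trade-off discussed above comes into play.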

Published
2019-07-31
Section
Articles