Frequent itemset mining on big datasets is one of the most significant problems in data mining. Apriori is the best-known basic algorithm for extracting frequent itemsets, which serve as a basis for knowledge discovery, such as detecting previously unknown relationships and producing results that support decision making. When the data is very large, both memory usage and computational cost become very high, and the limited memory and CPU resources of a single processor make the algorithm inefficient. Parallelizing and distributing the algorithm therefore improves its performance.
In this research, a novel approach named “FTWeightedHashT” is presented for frequent itemset mining on big datasets. The proposed algorithm uses Hadoop-MapReduce to improve scalability and execution time. The execution times obtained are 8040, 4280, 2170, 1030, 850, and 610 milliseconds on a standalone machine and on 2, 4, 8, 12, and 16 nodes, respectively. ANOVA was used to compare these results with previous results. Experiments on the Retail and Mushroom datasets showed an improvement of about 60% in execution time. The proposed algorithm can process big datasets efficiently on the Hadoop-MapReduce model with 16 nodes, significantly reducing execution time and enhancing the scalability of the Apriori algorithm.
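The scalability implied by the reported timings can be made concrete with a short calculation. The sketch below is illustrative only: it assumes the standalone-machine time can serve as a one-node baseline, which the text does not state, and the names `reported_ms` and `speedup_table` are hypothetical helpers rather than part of FTWeightedHashT.

```python
# Execution times in milliseconds, as reported above.
# Assumption: key 1 stands for the standalone machine and is used as the
# baseline; the other keys are Hadoop node counts.
reported_ms = {1: 8040, 2: 4280, 4: 2170, 8: 1030, 12: 850, 16: 610}


def speedup_table(times_ms, baseline=1):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / time on `nodes` nodes
    efficiency = speedup / nodes (1.0 would be ideal linear scaling)
    """
    base = times_ms[baseline]
    table = {}
    for nodes, ms in sorted(times_ms.items()):
        speedup = base / ms
        table[nodes] = (speedup, speedup / nodes)
    return table


if __name__ == "__main__":
    for nodes, (speedup, efficiency) in speedup_table(reported_ms).items():
        print(f"{nodes:>2} node(s): speedup {speedup:5.2f}x, "
              f"efficiency {efficiency:4.2f}")
```

Under that baseline assumption, the 16-node run corresponds to a speedup of roughly 13x over the standalone machine.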
Abstract