IEEE 2017-2018 Data Mining Projects in DotNet

Abstract:

The efficient processing of document streams plays an important role in many information filtering systems. Emerging applications, such as news update filtering and social network notifications, demand presenting end-users with the content most relevant to their preferences. In this work, user preferences are indicated by a set of keywords. A central server monitors the document stream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our objective is to support large numbers of users and high stream rates, while refreshing the top-k results almost instantaneously. Our solution abandons the traditional frequency-ordered indexing approach. Instead, it follows an identifier-ordering paradigm that better suits the nature of the problem. When complemented with a novel, locally adaptive technique, our method offers (i) proven optimality w.r.t. the number of considered queries per stream event, and (ii) an order of magnitude shorter response time (i.e., time to refresh the query results) than the current state-of-the-art.
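
As a rough illustration of the setting (not the paper's identifier-ordered index or its adaptive refinement), the sketch below keeps a keyword-to-query index so that each arriving document only touches the user queries sharing a term with it, and maintains a small top-k heap per user; the relevance score is a hypothetical keyword-overlap count.

```python
import heapq
from collections import defaultdict

class TopKMonitor:
    def __init__(self, k=3):
        self.k = k
        self.user_keywords = {}           # user_id -> set of query keywords
        self.index = defaultdict(set)     # keyword -> user_ids whose query contains it
        self.results = defaultdict(list)  # user_id -> min-heap of (score, doc_id)

    def register(self, user_id, keywords):
        self.user_keywords[user_id] = set(keywords)
        for kw in keywords:
            self.index[kw].add(user_id)

    def on_document(self, doc_id, terms):
        term_set = set(terms)
        candidates = set()                # only queries sharing a term are considered
        for t in term_set:
            candidates |= self.index.get(t, set())
        for user in candidates:
            score = len(self.user_keywords[user] & term_set)  # toy relevance score
            heap = self.results[user]
            if len(heap) < self.k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))

monitor = TopKMonitor(k=2)
monitor.register("u1", ["storm", "flood"])
monitor.on_document("d1", ["storm", "warning", "coast"])
monitor.on_document("d2", ["flood", "storm", "rescue"])
print(sorted(monitor.results["u1"], reverse=True))  # u1's current top-2 documents
```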

Abstract:

With the popularity of social media (e.g., Facebook and Flickr), users can easily share their check-in records and photos during their trips. In view of the huge number of users' historical mobility records in social media, we aim to discover travel experiences to facilitate trip planning. When planning a trip, users always have specific preferences regarding their trips. Instead of restricting users to limited query options such as locations, activities, or time periods, we consider arbitrary text descriptions as keywords about personalized requirements. Moreover, a diverse and representative set of recommended travel routes is needed. Prior works have elaborated on mining and ranking existing routes from check-in data. To meet the need for automatic trip organization, we claim that more features of Places of Interest (POIs) should be extracted. Therefore, in this paper, we propose an efficient Keyword-aware Representative Travel Route framework that uses knowledge extraction from users' historical mobility records and social interactions. Explicitly, we have designed a keyword extraction module to classify the POI-related tags, for effective matching with query keywords. We have further designed a route reconstruction algorithm to construct route candidates that fulfill the requirements. To provide befitting query results, we explore Representative Skyline concepts, that is, the Skyline routes which best describe the trade-offs among different POI features. To evaluate the effectiveness and efficiency of the proposed algorithms, we have conducted extensive experiments on real location-based social network datasets, and the experimental results show that our methods do indeed demonstrate good performance compared to state-of-the-art works.
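
A minimal sketch of the Skyline step, assuming each candidate route already carries a vector of POI feature scores (the keyword extraction and route reconstruction modules are not shown); the feature names and values are hypothetical.

```python
def dominates(a, b):
    """Route a dominates route b if it is at least as good in every feature and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(routes):
    return [name for name, scores in routes.items()
            if not any(dominates(other, scores)
                       for o, other in routes.items() if o != name)]

candidates = {                       # (keyword match, popularity, diversity) - toy feature scores
    "route_A": (0.9, 0.4, 0.7),
    "route_B": (0.6, 0.8, 0.5),
    "route_C": (0.5, 0.3, 0.4),      # dominated by route_A, so not representative
}
print(skyline(candidates))           # ['route_A', 'route_B']
```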

Abstract:

With the increasing complexity of both data structures and computer architectures, applications need fine tuning in order to achieve their expected execution time. Performance tuning is traditionally based on the analysis of performance data. The analysis results may not be accurate, depending on the quality of the data and the applied analysis approaches. Therefore, application developers may ask: can we trust the analysis results? This paper introduces our research work in performance optimization of the memory system, with a focus on the cache locality of a shared memory and the memory locality of a distributed shared memory. The quality of the data analysis is guaranteed by using real performance data acquired at runtime while the application is running, together with well-established data analysis algorithms from the fields of bioinformatics and data mining. We verified the quality of the proposed approaches by optimizing a set of benchmark applications. The experimental results show a significant performance gain.

Abstract:

Medical information retrieval plays an increasingly important role in helping physicians and domain experts to better access medical knowledge and information, and in supporting decision making. Integrating medical knowledge bases has the potential to improve information retrieval performance by incorporating medical domain knowledge for relevance assessment. However, this is not a trivial task due to the challenge of effectively utilizing the domain knowledge in the medical knowledge bases. In this paper, we propose a novel medical information retrieval system with a two-stage query expansion strategy, which is able to effectively model and incorporate latent semantic associations to improve performance. This system consists of two parts. First, we apply a heuristic approach to enhance the widely used pseudo relevance feedback method for more effective query expansion, by iteratively expanding the queries to boost the similarity score between queries and documents. Second, to improve retrieval performance with structured knowledge bases, we present a latent semantic relevance model based on tensor factorization to identify semantic association patterns under sparse settings. These identified patterns are then used as inference paths to trigger knowledge-based query expansion in medical information retrieval. Experiments with the TREC CDS 2014 data set 1) showed that the performance of the proposed system is significantly better than the baseline system and the systems reported at the TREC CDS 2014 conference, and is comparable with state-of-the-art systems, and 2) demonstrated the capability of tensor-based semantic enrichment methods for medical information retrieval tasks.
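
The following is a bare-bones sketch of pseudo relevance feedback, the method the first stage enhances: the most frequent non-query terms from the top-ranked documents are appended to the query. The iterative similarity-boosting heuristic and the tensor-based second stage are not reproduced; the toy corpus is hypothetical.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_docs=3, n_terms=2):
    """Append the most frequent non-query terms of the top-ranked documents to the query."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]

ranked_docs = [
    ["chest", "pain", "angina", "ecg"],
    ["chest", "pain", "angina", "troponin"],
    ["headache", "migraine"],
]
print(expand_query(["chest", "pain"], ranked_docs))  # ['chest', 'pain', 'angina', ...]
```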

Abstract:

Undoubtedly, cloud computing is an ongoing trend that has garnered widespread attention. It has succeeded in relieving users of their computational and storage burdens by offering these capabilities as services under the pay-as-you-go principle. A substantial feature of the cloud is its ability to scale up as users' demands rise. However, cloud providers face a dilemma between scaling up to fulfill user requirements and the dramatic increase in energy consumption this entails. To establish a trade-off, server consolidation tackles this challenge by maximizing resource utilization per server in order to minimize the number of active servers. With the ever-growing demand for cloud services, providers must be able to manage highly fluctuating workloads while avoiding SLA violations. In this paper we are concerned with two main levels: the front end and the back end. The former comprises the set of users that consume cloud services, and the latter is materialized by the data centers where the actual load is carried out. Notably, front-end users are primarily responsible for the workload shape within data centers. We point out this fact in order to further investigate server workload from a real-time perspective. The main goal of this paper is to formalize the server load according to users' behavior, in terms of submitted tasks and submission rate, and to apply stream mining techniques as an introductory step toward building a real-time prediction system.

Abstract:

Finding knowledge in large data sets for use in intelligent systems has become more and more important in the Internet era. Pattern mining, classification, text mining, and opinion mining are topical issues; among them, pattern mining is particularly important. The problem of mining erasable patterns (EPs) has been proposed as a variant of frequent pattern mining for optimizing the production plans of factories. Several algorithms have been proposed for effectively mining EPs. However, for large threshold values, many EPs are obtained, leading to large memory usage. Therefore, it is necessary to mine a condensed representation of EPs. This paper first defines erasable closed patterns (ECPs), which can represent the set of EPs without information loss. Then, a theorem for quickly determining ECPs based on the dPidset structure is proposed and proven. Next, two efficient algorithms for mining ECPs based on this theorem are proposed: erasable closed pattern mining (ECPat) and a dNC_Set-based algorithm for erasable closed pattern mining (dNC-ECPM). Experimental results show that ECPat is the best method for sparse data sets, while the dNC-ECPM algorithm outperforms ECPat and a modified mining-erasable-itemsets algorithm in terms of mining time and memory usage for all remaining data sets.
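
For intuition, the sketch below checks whether an itemset is erasable under a toy product/profit table: a pattern's gain is the total profit of products whose recipes use any of its items, and the pattern is erasable when that gain stays within a threshold fraction of the overall profit. The dPidset structure, the closedness check, and the ECPat/dNC-ECPM algorithms are not shown.

```python
products = [            # (profit, items required to manufacture the product)
    (50, {"a", "b"}),
    (30, {"b", "c"}),
    (20, {"d"}),
]
total_profit = sum(p for p, _ in products)

def gain(itemset):
    """Profit lost if every item of the itemset were removed from the production plan."""
    return sum(p for p, items in products if items & itemset)

def is_erasable(itemset, threshold=0.3):
    return gain(itemset) <= threshold * total_profit

print(is_erasable({"d"}))   # True: erasing item d affects only 20 of 100 profit
print(is_erasable({"a"}))   # False: item a is needed for a 50-profit product
```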

Abstract:

The advent of social media and the rapid development of mobile communication technologies have dramatically changed the way people express their feelings, attitudes, moods, and passions. People often express their reactions, fancies, and predilections through social media by means of short texts of an epigrammatic nature rather than long pieces of writing. Many microblogging services such as Twitter enable people to share and discuss their thoughts and views in the form of short texts without being constrained by space and time. Millions of tweets are generated each day on multifarious issues. Sentiments or opinions about diverse issues have been observed to be an important dimension characterizing human behaviour, and the public frequently articulates its opinions on various issues. In an effort to gain insights from people's points of view, this paper applies text mining to tweets about two prominent Indian politicians: Arvind Kejriwal and Narendra Modi. The results reveal the value of this comparative study, show how these politicians could handle their political affairs more effectively, and identify areas where they need to improve. This study could help these politicians refine their political strategies.
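
The abstract does not spell out the exact text mining pipeline, so the sketch below shows only a generic lexicon-based sentiment step of the kind such studies typically apply to tweets; the lexicons and example tweets are hypothetical.

```python
positive = {"good", "great", "support", "win"}
negative = {"bad", "corrupt", "fail", "against"}

def sentiment(tweet):
    words = tweet.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = ["Great initiative, full support", "Bad policy, it will fail"]
for t in tweets:
    print(t, "->", sentiment(t))
```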

Abstract:

Extracting opinion words and product features is an important task in many sentiment analysis applications. The opinion lexicon also plays a very important role because it is useful for a wide range of tasks. Although several opinion lexicons are available, it is hard to maintain a universal opinion lexicon covering all domains, so it is necessary to expand a known opinion lexicon for the domains of interest. The aim of this system is to automatically expand an opinion lexicon and to extract product features based on dependency relations. The Stanford CoreNLP dependency parser is used to identify the dependency relations between features and opinions, and extraction rules are predefined according to these relations. This work proposes an algorithm based on Double Propagation to extract features and opinions. The polarities of extracted opinions are annotated using the VADER lexicon. Unlike existing approaches, this system handles indirect relations, verb opinions, and verb product features. In order to increase precision and recall, the system also proposes indirect relations and additional patterns besides the eight rules of Double Propagation. Furthermore, general words that are not features and adjectives that are not opinions are filtered out. According to experimental studies, our approach outperforms existing state-of-the-art approaches.
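
A minimal sketch of rule-based extraction over dependency triples: the triples are hand-coded stand-ins for Stanford CoreNLP parser output, and the two rules are illustrative rather than the system's actual rule set (Double Propagation iterates such rules to grow both the opinion lexicon and the feature set).

```python
# Each triple: (relation, head, dependent), e.g. from a dependency parse of reviews.
dependencies = [
    ("amod", "battery", "excellent"),       # "excellent battery"
    ("nsubj", "disappointing", "screen"),   # "the screen is disappointing"
]
seed_opinions = {"excellent"}

features, new_opinions = set(), set()
for rel, head, dep in dependencies:
    if rel == "amod" and dep in seed_opinions:
        features.add(head)                  # rule: a known opinion modifying a noun marks a feature
    if rel == "nsubj" and head not in seed_opinions:
        new_opinions.add(head)              # rule (illustrative): adjectival predicate -> candidate opinion
        features.add(dep)                   # ...and its subject -> candidate feature

print("features:", features)
print("newly learned opinions:", new_opinions)
```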

Abstract:

This paper presents a method to obtain delay rational macromodels of electrically long interconnects from tabulated frequency data. The proposed algorithm first extracts multiple propagation delays and splits the data into single-delay regions using a time-frequency decomposition transform. Then, the attenuation losses of each region are approximated using the Loewner matrix approach. The resulting macromodel is a combination of delay rational approximations. Numerical examples are presented to illustrate the efficiency of the proposed method compared to the traditional Loewner approach, where delays are not extracted beforehand.

Abstract:

A significant portion of the world population does not have access to proper healthcare. The key factor in healthcare's success is the physician's expertise. In this paper, we examine whether that expertise can be modeled as an information corpus, a flavor of Big Data, and extracted using text mining techniques, particularly the Vector Space Model, to perform diagnosis. Using cloud and mobile technologies, medical diagnosis can then be made available everywhere there is Internet connectivity, reducing costs, increasing coverage, and improving quality of life. The key to the possibility of performing medical diagnosis using an information retrieval approach is the data. This paper therefore focuses on the suitability of the dataset for automating diagnosis using text mining. We use various text mining tools relevant to the Vector Space Model to perform operations on the data and see whether meaningful conclusions can be drawn from it. We present some of our observations from the experiments conducted and conclude with future directions.
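
A small sketch of the Vector Space Model idea under an assumed setup: each disease is described by a document of typical symptoms, the patient's observed symptoms form the query, and diseases are ranked by TF-IDF cosine similarity (scikit-learn is used here for brevity; the disease corpus is hypothetical).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

disease_docs = {                      # toy "expertise corpus": disease -> typical symptoms
    "influenza": "fever cough sore throat fatigue headache",
    "migraine": "headache nausea light sensitivity",
    "angina": "chest pain shortness of breath fatigue",
}
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(disease_docs.values())

query = vectorizer.transform(["fever headache fatigue"])   # patient's observed symptoms
scores = cosine_similarity(query, doc_matrix).ravel()
ranking = sorted(zip(disease_docs, scores), key=lambda x: -x[1])
print(ranking)                        # influenza should rank first for this query
```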

Abstract:

Location information has become a key component of many applications in mobile and pervasive computing, and the ability to accurately predict the mobility of clients allows these applications to provide better service. However, existing location predictors rely heavily on a significant amount of empirical knowledge to function well. In this paper, we develop a novel framework to predict unknown locations when only a little location information is available. Specifically, we first extract WiFi locations from WiFi scan results, then build a mobility model based on the resulting WiFi location graph with connectivity information, and finally make location predictions from little known location data using Gibbs sampling over the mobility model. Using a data set containing 31 fairly complete WiFi traces collected over three months as ground truth, we compare our proposed approach with existing state-of-the-art location predictors. The experimental results show that our framework can achieve 83% location prediction accuracy with only three location samples each day, 15% better than the Markov and Bayesian predictors, which rely heavily on empirical knowledge.

Abstract:

Automating medical diagnosis is an important data mining problem: inferring the likely disease(s) for a set of observed symptoms. Algorithms for this problem are very beneficial as a supplement to a real diagnosis. Existing diagnosis methods typically perform the inference on a sparse bipartite graph with two sets of nodes representing diseases and symptoms, respectively. By using this graph, existing methods essentially assume that no direct dependency exists between diseases (or between symptoms), which may not be true in reality. To address this limitation, in this paper we introduce two domain networks encoding similarities between diseases and between symptoms, to avoid information loss and to alleviate the sparsity problem of the bipartite graph. Based on the domain networks and the bipartite graph bridging them, we develop a novel algorithm, CCCR, to perform diagnosis by ranking symptom-disease clusters. Compared with existing approaches, CCCR is more accurate and more interpretable, since its results deliver rich information about how the inferred diseases are categorized. Experimental results on real-life datasets demonstrate the effectiveness of the proposed method.

Abstract:

Traditional data stream classification techniques assume that the stream of data is generated by a single non-stationary process. By contrast, a recently introduced problem setting, referred to as multistream classification, involves two independent non-stationary data-generating processes. One of them is the source stream, which continuously generates labeled data instances. The other is the target stream, which generates unlabeled test data instances from the same domain. The distribution represented by the source stream data is biased compared to that of the target stream. Moreover, these streams may have asynchronous concept drifts between them. The multistream classification problem is to predict the class labels of target stream instances while utilizing the labeled data available from the source stream. In this paper, we propose an efficient solution for multistream classification by fusing drift detection into online data shift adaptation. Experimental results on benchmark data sets indicate significantly improved performance over the only existing approach for multistream classification.
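
As a toy illustration of fusing a drift signal into shift adaptation (not the paper's algorithm), the sketch below compares a recent target-stream window against the source window with a crude mean-shift test and flags when the model should be re-adapted using freshly labeled source data.

```python
import numpy as np

def drift_detected(reference, recent, threshold=1.0):
    """Crude drift signal: compare window means (a stand-in for a real drift detector)."""
    return abs(np.mean(reference) - np.mean(recent)) > threshold

rng = np.random.default_rng(0)
source_window = rng.normal(0.0, 1.0, 200)   # labeled source stream (one feature)
target_window = rng.normal(1.8, 1.0, 200)   # unlabeled target stream after an asynchronous drift

if drift_detected(source_window, target_window):
    print("drift detected: re-estimate the data shift and adapt the classifier")
else:
    print("no drift: keep predicting target labels with the current model")
```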

Abstract:

In this research, we propose a version of the K Nearest Neighbor (KNN) algorithm that considers similarity among attributes when computing the similarity between feature vectors. The text segmentation task is cast as binary classification, where each pair of sentences or paragraphs is classified according to whether a boundary should be placed between them; the proposed version produced successful results in previous work on text categorization and clustering. In this research, we define a similarity measure based on both attributes and values, modify KNN to use it, and apply the modified version to the text segmentation task. We expect a more compact representation of data items and improved performance in text segmentation as well as in other text mining tasks. The goal of this research is therefore to implement a text segmentation system that provides these benefits.
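
A minimal sketch of the idea, assuming a hand-coded attribute-similarity table: the similarity between two word sets credits matches between related attributes rather than only identical ones, and a KNN vote uses that measure. The data and weights are illustrative; in the segmentation setting, each item would be a sentence or paragraph pair labeled boundary versus no boundary.

```python
attribute_similarity = {("car", "automobile"): 0.9, ("city", "town"): 0.8}

def word_sim(a, b):
    if a == b:
        return 1.0
    return attribute_similarity.get((a, b), attribute_similarity.get((b, a), 0.0))

def vector_sim(doc1, doc2):
    # average of the best attribute match in doc2 for each word of doc1
    return sum(max(word_sim(w1, w2) for w2 in doc2) for w1 in doc1) / len(doc1)

def knn_predict(query, labeled_docs, k=3):
    ranked = sorted(labeled_docs, key=lambda d: vector_sim(query, d[0]), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

data = [({"car", "road", "city"}, "traffic"),
        ({"automobile", "engine"}, "traffic"),
        ({"stock", "market"}, "finance")]
print(knn_predict({"car", "town"}, data, k=1))   # 'traffic', via related attributes
```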

Abstract:

In this paper, we propose a novel method for sentiment trend analysis using the Ant Colony Optimization (ACO) algorithm and SentiWordNet. We first collect social data in the form of Resource Description Framework (RDF) triples, and then use the ACO algorithm to digitize the amassed RDF triples. Using the ACO algorithm with modified equations, we then compute pheromone values to extract trends in the user's sentiments. Next, we compute the user's sentiment scores for the computed pheromone values with respect to the sentiment words in SentiWordNet. Finally, we analyze the sentiment trend of the online user over time. To verify the proposed method, we conduct experiments and compare the analyzed sentiment trends with the users' real daily lives. The results show that the proposed method performs satisfactory sentiment trend analysis on real data.
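
The paper's modified pheromone equations are not reproduced; the sketch below only shows a standard ACO-style update (evaporation plus deposit) applied to sentiment terms day by day, with hypothetical mention counts, to convey how pheromone values can track a strengthening or fading sentiment trend.

```python
rho = 0.3                                    # evaporation rate (illustrative)
pheromone = {"happy": 1.0, "angry": 1.0}     # one pheromone value per sentiment term

daily_mentions = [{"happy": 4, "angry": 1},
                  {"happy": 1, "angry": 5}]  # toy counts extracted from RDF triples per day

for day in daily_mentions:
    for term in pheromone:
        deposit = day.get(term, 0)
        pheromone[term] = (1 - rho) * pheromone[term] + deposit   # evaporate, then deposit
    print(pheromone)   # a rising value signals a strengthening trend for that sentiment
```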

Abstract:

A query facet is a significant list of information nuggets that explains an underlying aspect of a query. Existing algorithms mine facets of a query by extracting frequent lists contained in the top search results. The coverage of facets and facet items mined by these kinds of methods may be limited, because only a small number of search results are used. To solve this problem, we propose mining query facets by using knowledge bases, which contain high-quality structured data. Specifically, we first generate facets based on the properties of the entities that are contained in Freebase and correspond to the query. Second, we mine initial query facets from search results and then expand them by finding similar entities in Freebase. Experimental results show that our proposed method can significantly improve the coverage of facet items over state-of-the-art algorithms.
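
A toy sketch of the expansion step: starting from facet items mined from search results, similar entities are pulled in from a knowledge base. The tiny dictionary below stands in for Freebase and its type/property data.

```python
knowledge_base = {                 # entity -> type; a tiny stand-in for Freebase
    "iPhone 7": "consumer_electronics/mobile_phone",
    "Galaxy S8": "consumer_electronics/mobile_phone",
    "Pixel": "consumer_electronics/mobile_phone",
    "MacBook": "consumer_electronics/laptop",
}

def expand_facet(initial_items):
    """Add knowledge-base entities sharing a type with the initial facet items."""
    types = {knowledge_base[i] for i in initial_items if i in knowledge_base}
    return sorted(e for e, t in knowledge_base.items() if t in types)

# facet items mined from top search results for the query "smartphone"
print(expand_facet(["iPhone 7", "Galaxy S8"]))   # 'Pixel' is pulled in, improving coverage
```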

Abstract:

Natural language processing and machine learning can be applied to student feedback to help university administrators and teachers address problematic areas in teaching and learning. The proposed system analyzes student comments from both course surveys and online sources to identify sentiment polarity, the emotions expressed, and satisfaction versus dissatisfaction. A comparison with direct-assessment results demonstrates the system's reliability.

Abstract:

Location Based Services (LBS) have become extremely popular over the past decade, being used on a daily basis by millions of users. Instances of real-world LBS range from mapping services (e.g., Google Maps) to lifestyle recommendations (e.g., Yelp) to real-estate search (e.g., Redfin). In general, an LBS provides a public (often web-based) search interface over its backend database (of tuples with 2D geolocations), taking as input a 2D query point and returning k tuples in the database that are closest to the query point, where k is usually a small constant such as 20 or 50. Such a public interface is often called a k-Nearest-Neighbor, i.e., kNN, interface. In this paper, we consider a novel problem of enabling density based clustering over the backend database of an LBS using nothing but limited access to the kNN interface provided by the LBS. Specifically, a key limit enforced by most real-world LBS is a maximum number of kNN queries allowed from a user over a given time period. Since such a limit is often orders of magnitude smaller than the number of tuples in the LBS database, our goal here is to mine from the LBS a cluster assignment function f(·), such that for any tuple t in the database (which may or may not have been accessed), f(·) can produce the cluster assignment of t with high accuracy. We conduct a comprehensive set of experiments over benchmark datasets and popular real-world LBS such as Yahoo! Flickr, Zillow, Redfin and Google Maps and demonstrate the effectiveness of our proposed techniques.

Abstract:

With the advent of multi-view data, multi-view learning has become an important research direction in both machine learning and data mining. Considering the difficulty of obtaining labeled data in many real applications, we focus on the multi-view unsupervised feature selection problem. Traditional approaches characterize the similarity by a fixed, pre-defined graph Laplacian in each view separately and ignore the underlying common structures across different views. In this paper, we propose an algorithm named Multi-view Unsupervised Feature Selection with Adaptive Similarity and View Weight (ASVW) to overcome the above-mentioned problems. Specifically, by leveraging the learning mechanism to characterize the common structures adaptively, we formulate the objective function with a common graph Laplacian across different views, together with a sparse ℓ2,p-norm constraint designed for feature selection. We develop an efficient algorithm to address the non-smooth minimization problem and prove that the algorithm converges. To validate the effectiveness of ASVW, comparisons are made with benchmark methods on real-world datasets. We also evaluate our method on a real sports action recognition task. The experimental results demonstrate the effectiveness of our proposed algorithm.

Abstract:

Educational data contains valuable information that can be harvested through learning analytics to provide new insights for a better education system. However, sharing or analyzing this data introduces privacy risks for the data subjects, mostly students. Existing work in the learning analytics literature identifies the need for privacy and poses interesting research directions, but fails to apply state-of-the-art privacy protection methods with quantifiable and mathematically rigorous privacy guarantees. This work aims to employ and evaluate such methods for learning analytics by approaching the problem from two perspectives: (1) the data is anonymized and then shared with a learning analytics expert, and (2) the learning analytics expert is given a privacy-preserving interface that governs her access to the data. We develop proof-of-concept implementations of privacy-preserving learning analytics tasks using both perspectives and run them on real and synthetic datasets. We also present an experimental study on the trade-off between individuals' privacy and the accuracy of the learning analytics tasks.
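
A minimal sketch of the second perspective (a privacy-preserving interface), assuming a standard Laplace mechanism: an aggregate query over toy student records is answered with calibrated noise. The dataset, query, and epsilon are illustrative, not those used in the paper.

```python
import random

grades = [72, 85, 64, 90, 78, 55, 81]        # toy student records

def dp_count_above(threshold, epsilon=1.0):
    true_count = sum(g > threshold for g in grades)
    sensitivity = 1                          # one student changes the count by at most 1
    # Laplace(sensitivity/epsilon) noise, sampled as a difference of two exponentials
    noise = (random.expovariate(epsilon / sensitivity)
             - random.expovariate(epsilon / sensitivity))
    return true_count + noise

print(dp_count_above(70))                    # the noisy answer the analytics expert sees
```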

Abstract:

Mining high-utility patterns is a way of discovering sets of items that yield high profit in a customer transaction database. Discovering high-utility itemsets provides useful information that can support decision making by clearly identifying the sets of lucrative items that customers buy in a retail store. However, traditional high-utility methods are inappropriate for finding periodic customer behaviors and for determining how the purchased items relate to each other. In this paper, we address these limitations by providing a new method for discovering productive high-utility periodic patterns from customer data. Informally, we find sets of high-profit, correlated groups of items. We define a new pattern-growth algorithm with a new tree structure. Experimental evaluations show that our algorithm can reveal useful information.
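
For intuition, the sketch below computes the utility of an itemset in a toy transaction database (quantity times unit profit, summed over transactions containing the whole itemset); the periodicity and correlation measures of the proposed method, and its pattern-growth tree structure, are not shown.

```python
unit_profit = {"bread": 1, "milk": 2, "cheese": 5}
transactions = [                 # item -> purchased quantity
    {"bread": 2, "milk": 1},
    {"bread": 1, "milk": 2, "cheese": 1},
    {"cheese": 2},
]

def utility(itemset):
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total

print(utility({"bread", "milk"}))   # (2*1 + 1*2) + (1*1 + 2*2) = 9
print(utility({"cheese"}))          # 5 + 10 = 15
```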

Abstract:

With the emergence and rapid development of data publishing and data mining applications, protecting private data and preventing the leakage of sensitive information has become a great challenge. As a new privacy protection framework, differential privacy can provide privacy protection for published data. However, the uniform grid method based on differential privacy does not consider the density and sparsity of the data distribution, so its query deviation is too large. Therefore, this paper proposes a differential privacy data publishing method based on cell merging. To handle sparse regions of the data and better balance the noise deviation against the uniform-assumption deviation, the paper gives the corresponding data partition and data merging algorithms. The accuracy and efficiency of the algorithm are compared with the uniform grid method and adaptive grid approaches, and the results show that it preserves data validity and reduces query deviation while achieving higher accuracy and efficiency.
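
A rough sketch of the uniform-grid baseline this method improves on: points are binned into a grid and each cell count is published with Laplace noise. The proposed cell-merging step would combine sparse neighbouring cells before noise is added; the grid size, epsilon, and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0, 1, size=(500, 2))     # toy 2D records to publish
grid_size, epsilon = 4, 0.5

counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                              bins=grid_size, range=[[0, 1], [0, 1]])
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
print(np.round(noisy, 1))                     # published noisy cell counts
```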

Abstract:

In this paper, we propose a new version of LBRW (Learning-Based Random Walk), LBRW-Co, for predicting users' co-occurrences based on mobility homophily and social links. More precisely, we jointly analyze and mine spatio-temporal and social features with the aim of predicting and ranking users' co-occurrences. Experiments are performed on the Foursquare LBSN with accurate and refined measurements. The experimental results demonstrate that our LBRW-Co model has substantial advantages over baseline approaches in predicting and ranking co-occurrence interactions.

Abstract:

Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships, and semantic matching approaches are among those favored by previous research. However, user profiles which only contain a user’s past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile, respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods.
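
A bare-bones sketch of the embedding side of the approach: candidate terms from a (toy) user profile are ranked by cosine similarity to the query term's vector. The topic-model weighting and the external-corpus enrichment of profiles are not reproduced; the word vectors are hand-coded stand-ins for learned embeddings.

```python
import numpy as np

embeddings = {                     # toy word vectors; real ones would be learned
    "jaguar":  np.array([0.9, 0.1, 0.3]),
    "car":     np.array([0.8, 0.2, 0.4]),
    "engine":  np.array([0.7, 0.1, 0.5]),
    "feline":  np.array([0.1, 0.9, 0.2]),
}
user_profile_terms = ["car", "engine", "feline"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand(query_term, n=2):
    scores = [(t, cosine(embeddings[query_term], embeddings[t])) for t in user_profile_terms]
    return sorted(scores, key=lambda x: -x[1])[:n]

print(expand("jaguar"))   # the user's automotive profile pulls 'car'/'engine' to the top
```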

Abstract:

The multiobjective realisation of the data clustering problem has shown great promise in recent years, yielding clear conceptual advantages over the more conventional, single-objective approach. Evolutionary algorithms have largely contributed to the development of this increasingly active research area on multiobjective clustering. Nevertheless, the unprecedented volumes of data seen widely today pose significant challenges and highlight the need for more effective and scalable tools for exploratory data analysis. This paper proposes an improved version of the multiobjective clustering with automatic k-determination algorithm. Our new algorithm improves its predecessor in several respects, but the key changes are related to the use of an efficient, specialised initialisation routine and two alternative reduced-length representations. These design components exploit information from the minimum spanning tree and redefine the problem in terms of the most relevant subset of its edges. Our study reveals that both the new initialisation routine and the new solution representations not only contribute to decrease the computational overhead, but also entail a significant reduction of the search space, enhancing therefore the convergence capabilities and overall effectiveness of the method. These results suggest that the new algorithm proposed here will offer significant advantages in the realm of ‘big data’ analytics and applications.
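
For intuition about how minimum spanning tree information can drive clustering (the multiobjective search and the reduced-length encodings themselves are not shown), the sketch below builds an MST over toy points with Kruskal's algorithm and removes its k-1 heaviest edges, leaving k clusters.

```python
import itertools, math

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
k = 2

edges = sorted((math.dist(points[i], points[j]), i, j)
               for i, j in itertools.combinations(range(len(points)), 2))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

parent = list(range(len(points)))
mst = []
for w, i, j in edges:                         # Kruskal's algorithm
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        mst.append((w, i, j))

parent = list(range(len(points)))             # reset, then keep all but the k-1 heaviest MST edges
mst.sort()
for w, i, j in mst[:len(mst) - (k - 1)]:
    parent[find(i)] = find(j)

clusters = {}
for idx in range(len(points)):
    clusters.setdefault(find(idx), []).append(points[idx])
print(list(clusters.values()))                # two spatially separated groups
```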

Abstract:

Getting back to previously viewed web pages is a common yet difficult task for users due to the large volume of personally accessed information on the web. This paper leverages humans' natural recall process of using episodic and semantic memory cues to facilitate recall, and presents a personal web revisitation technique called WebPagePrev, based on context and content keywords. The underlying techniques for acquiring, storing, and decaying context and content memories, and for utilizing them in page re-finding, are discussed. A relevance feedback mechanism is also included to adapt to an individual's memory strength and revisitation habits. Our six-month user study shows that: (1) compared with the existing web revisitation tool Memento, the History List Searching method, and the Search Engine method, the proposed WebPagePrev delivers the best re-finding quality in finding rate (92.10 percent), average F1-measure (0.4318), and average rank error (0.3145); (2) our dynamic management of context and content memories, including the decay and reinforcement strategy, can mimic users' retrieval and recall mechanisms. With relevance feedback, the finding rate of WebPagePrev increases by 9.82 percent, average F1-measure increases by 47.09 percent, and average rank error decreases by 19.44 percent, compared to a stable memory management strategy. Among the time, location, and activity context factors in WebPagePrev, activity is the best recall cue, and context+content based re-finding delivers the best performance compared to context-based and content-based re-finding.
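
A minimal sketch of the decay-and-reinforcement idea for context/content memories, with hypothetical parameters: each keyword's weight decays exponentially over time and is boosted when it helps re-find a page; the half-life and boost values are not those used by WebPagePrev.

```python
import math

class MemoryStore:
    def __init__(self, half_life_days=14.0, boost=0.5):
        self.decay = math.log(2) / half_life_days
        self.boost = boost
        self.weights = {}          # keyword -> (weight, day of last update)

    def observe(self, keyword, day, weight=1.0):
        self.weights[keyword] = (weight, day)

    def strength(self, keyword, day):
        w, t0 = self.weights.get(keyword, (0.0, day))
        return w * math.exp(-self.decay * (day - t0))

    def reinforce(self, keyword, day):
        # a successful re-finding via this cue strengthens the memory
        self.weights[keyword] = (self.strength(keyword, day) + self.boost, day)

mem = MemoryStore()
mem.observe("conference deadline", day=0)
print(round(mem.strength("conference deadline", day=14), 2))  # ~0.5 after one half-life
mem.reinforce("conference deadline", day=14)
print(round(mem.strength("conference deadline", day=14), 2))  # ~1.0 after reinforcement
```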

Abstract:

Wind power forecasting (WPF) is important for effectively guiding grid dispatching and wind farm production planning. The intermittency and volatility of wind lead to great diversity in the training samples, which has a major impact on forecasting accuracy. In this paper, to deal with the dynamics of the training samples and improve forecasting accuracy, a data mining approach consisting of K-means clustering and a bagging neural network (NN) is proposed for short-term WPF. Based on the similarity among historical days, K-means clustering is used to classify the samples into several categories, which contain information on meteorological conditions and historical power data. In order to overcome the overfitting and instability problems of conventional networks, a bagging-based ensemble approach is integrated into the back-propagation NN. To confirm its effectiveness, the proposed data mining approach is examined on real wind generation data traces. The simulation results show that it can obtain better forecasting accuracy than other baseline and existing short-term WPF approaches.
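
A compact sketch of the pipeline's two ingredients under assumed toy data: K-means groups similar historical days by meteorological features, and a small bagged ensemble of neural networks is trained per group (manual bootstrap resampling with scikit-learn's MLPRegressor stands in for the paper's bagging back-propagation NN).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))          # toy meteorological features per day
y = 5 * X[:, 0] + rng.normal(0, 0.2, 200)     # toy wind power output

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

models = {}
for c in np.unique(kmeans.labels_):
    Xc, yc = X[kmeans.labels_ == c], y[kmeans.labels_ == c]
    ensemble = []
    for seed in range(5):                      # bagging: bootstrap resample + train a small NN
        idx = rng.integers(0, len(Xc), len(Xc))
        net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=seed)
        ensemble.append(net.fit(Xc[idx], yc[idx]))
    models[c] = ensemble

x_new = np.array([[0.8, 0.5, 0.3]])            # forecast for a new day
c = kmeans.predict(x_new)[0]
forecast = np.mean([net.predict(x_new)[0] for net in models[c]])
print(round(float(forecast), 2))               # should be close to 5 * 0.8 = 4.0
```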