IEEE 2017-2018 Data Mining Projects in Java

Abstract:

Social voting is an emerging feature in online social networks. It poses unique challenges and opportunities for recommendation. In this paper, we develop a set of matrix-factorization (MF) and nearest-neighbor (NN) based recommender systems (RSs) that exploit user social network and group affiliation information for social voting recommendation. Through experiments with real social voting traces, we demonstrate that social network and group affiliation information can significantly improve the accuracy of popularity-based voting recommendation, and that social network information dominates group affiliation information in NN-based approaches. We also observe that social and group information is much more valuable to cold users than to heavy users. In our experiments, simple metapath-based NN models outperform computation-intensive MF models in hot-voting recommendation, while users' interests in nonhot votings are better mined by MF models. We further propose a hybrid RS that bags the different single approaches to achieve the best top-k hit rate.
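
A minimal Java sketch of the MF building block may help make this concrete: plain stochastic gradient descent over observed (user, voting) participation labels. All names (MfSketch, epoch) and constants are ours, and the paper's actual MF models additionally fuse social-network and group-affiliation information, which this sketch omits.

```java
import java.util.Random;

/** Minimal matrix-factorization sketch: learns user/item latent factors from
 *  (user, voting, participated) triples by SGD. Illustrative only; the
 *  paper's models also fuse social-network and group-affiliation data. */
public class MfSketch {
    final double[][] u, v;               // user and item latent factors
    final double lr = 0.01, reg = 0.05;  // assumed learning rate and regularizer

    MfSketch(int users, int items, int k, long seed) {
        Random rnd = new Random(seed);
        u = new double[users][k];
        v = new double[items][k];
        for (double[] row : u) for (int f = 0; f < k; f++) row[f] = 0.1 * rnd.nextGaussian();
        for (double[] row : v) for (int f = 0; f < k; f++) row[f] = 0.1 * rnd.nextGaussian();
    }

    double predict(int user, int item) {
        double s = 0;
        for (int f = 0; f < u[user].length; f++) s += u[user][f] * v[item][f];
        return s;
    }

    /** One SGD pass over training triples (userId, votingId, label in {0,1}). */
    void epoch(int[][] triples) {
        for (int[] t : triples) {
            double err = t[2] - predict(t[0], t[1]);
            for (int f = 0; f < u[t[0]].length; f++) {
                double uf = u[t[0]][f], vf = v[t[1]][f];
                u[t[0]][f] += lr * (err * vf - reg * uf);
                v[t[1]][f] += lr * (err * uf - reg * vf);
            }
        }
    }
}
```

A top-k recommendation list for a user is then obtained by scoring candidate votings with predict(user, voting) and keeping the k highest.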

Abstract:

Cloud computing has generated much interest in the research community in recent years for its many advantages, but it has also raised security and privacy concerns. The storage and access of confidential documents have been identified as one of the central problems in the area. In particular, many researchers have investigated solutions to search over encrypted documents stored on remote cloud servers. While many schemes have been proposed to perform conjunctive keyword search, less attention has been paid to more specialized searching techniques. In this paper, we present a phrase search technique based on Bloom filters that is significantly faster than existing solutions, with similar or better storage and communication costs. Our technique uses a series of n-gram filters to support the functionality. The scheme exhibits a trade-off between storage and false positive rate, and is adaptable to defend against inclusion-relation attacks. A design approach based on an application’s target false positive rate is also described.
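
The n-gram filter idea can be illustrated on plaintext with a toy word-bigram Bloom filter; the parameters m and k and all names are illustrative, and the encryption layer the paper adds on top is omitted here.

```java
import java.util.BitSet;

/** Toy n-gram Bloom filter: indexes the word bigrams of a document so that a
 *  server can test whether all bigrams of a phrase are present. */
public class NGramBloom {
    private final BitSet bits;
    private final int m, k;  // filter size and number of hash functions

    NGramBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    private int hash(String s, int i) {  // simple salted hash, for illustration
        return Math.floorMod(s.hashCode() * (31 * i + 17) + i, m);
    }

    void addDocument(String[] words) {
        for (int i = 0; i + 1 < words.length; i++) {
            String bigram = words[i] + " " + words[i + 1];
            for (int j = 0; j < k; j++) bits.set(hash(bigram, j));
        }
    }

    /** True if every bigram of the phrase may occur in the document:
     *  false positives are possible, false negatives are not. */
    boolean mayContainPhrase(String[] phrase) {
        for (int i = 0; i + 1 < phrase.length; i++) {
            String bigram = phrase[i] + " " + phrase[i + 1];
            for (int j = 0; j < k; j++)
                if (!bits.get(hash(bigram, j))) return false;
        }
        return true;
    }
}
```

The storage versus false-positive trade-off mentioned in the abstract corresponds to the choice of m and k here: a larger filter lowers the false positive rate at the cost of storage.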

Abstract:

Spatial data have wide applications, e.g., location-based services, and geometric range queries (i.e., finding points inside geometric areas, e.g., circles or polygons) are one of the fundamental search functions over spatial data. The rising demand for data outsourcing is moving large-scale datasets, including large-scale spatial datasets, to public clouds. Meanwhile, due to concerns about insider attackers and hackers on public clouds, the privacy of spatial datasets should be cautiously preserved while querying them at the server side, especially for location-based and medical usage. In this paper, we formalize the concept of Geometrically Searchable Encryption, and propose an efficient scheme, named FastGeo, to protect the privacy of clients’ spatial datasets stored and queried at a public server. With FastGeo, a novel two-level search over encrypted spatial data, an honest-but-curious server can efficiently perform geometric range queries, and correctly return data points that are inside a geometric range to a client without learning sensitive data points or the private query. FastGeo supports arbitrary geometric areas, achieves sublinear search time, and enables dynamic updates over encrypted spatial datasets. Our scheme is provably secure, and our experimental results on real-world spatial datasets on a cloud platform demonstrate that FastGeo can speed up search time by over 100 times.

Abstract:

We develop a novel framework, named l-injection, to address the sparsity problem of recommender systems. By carefully injecting low values into a selected set of unrated user-item pairs in a user-item matrix, we demonstrate that the top-N recommendation accuracies of various collaborative filtering (CF) techniques can be significantly and consistently improved. We first adopt the notion of pre-use preferences of users toward the vast number of unrated items. Using this notion, we identify uninteresting items that have not been rated yet but are likely to receive low ratings from users, and selectively impute low values for them. As our proposed approach is method-agnostic, it can be easily applied to a variety of CF algorithms. Through comprehensive experiments with three real-life datasets (Movielens, Ciao, and Watcha), we demonstrate that our solution consistently and universally enhances the accuracies of existing CF algorithms (e.g., item-based CF, SVD-based CF, and SVD++) by 2.5 to 5 times on average. Furthermore, our solution improves the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy.
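
The imputation step can be sketched as follows; preUse() is a simple popularity stand-in for the paper's pre-use preference model, and the threshold and low value are assumed parameters.

```java
/** Sketch of l-injection's imputation: unrated user-item cells whose
 *  estimated pre-use preference falls below a threshold are filled with a
 *  low rating before any CF algorithm is run. */
public class LowValueInjection {
    static final double UNRATED = 0.0;

    static double[][] inject(double[][] ratings, double threshold, double lowValue) {
        double[][] out = new double[ratings.length][];
        for (int u = 0; u < ratings.length; u++) {
            out[u] = ratings[u].clone();
            for (int i = 0; i < out[u].length; i++)
                if (out[u][i] == UNRATED && preUse(ratings, u, i) < threshold)
                    out[u][i] = lowValue;  // treat as "uninteresting"
        }
        return out;
    }

    /** Placeholder pre-use preference: the item's popularity among all users. */
    static double preUse(double[][] ratings, int user, int item) {
        int n = 0;
        for (double[] row : ratings) if (row[item] != UNRATED) n++;
        return (double) n / ratings.length;
    }
}
```

Because the injected matrix is an ordinary user-item matrix, any CF algorithm can be run on it unchanged, which is what makes the approach method-agnostic.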

Abstract:

With the advances in geo-positioning technologies and location-based services, it is nowadays quite common for road networks to have textual contents on their vertices. The problem of identifying an optimal route that covers a sequence of query keywords has been studied in recent years. However, in many practical scenarios, an optimal route might not always be desirable. For example, a personalized route query may be issued by providing clues that describe the spatial context between PoIs along the route, where the result can be far from the optimal one. Therefore, in this paper, we investigate the problem of clue-based route search (CRS), which allows a user to provide clues on keywords and spatial relationships. First, we propose a greedy algorithm and a dynamic programming algorithm as baselines. To improve efficiency, we develop a branch-and-bound algorithm that prunes unnecessary vertices during query processing. To quickly locate candidates, we propose an AB-tree that stores both distance and keyword information in a tree structure. To further reduce the index size, we construct a PB-tree by utilizing a 2-hop label index to pinpoint candidates. Extensive experiments verify the superiority of our algorithms and index structures.

Abstract:

The amount of data generated by individuals and enterprises is rapidly increasing. With the emerging cloud computing paradigm, data and the corresponding complex management tasks can be outsourced to the cloud for management flexibility and cost savings. Unfortunately, as the data could be sensitive, directly outsourcing them risks privacy leakage. Encryption can be applied before outsourcing, provided the required operations can still be accomplished by the cloud. We consider multikeyword similarity search over outsourced cloud data. In particular, considering text data only, multiple keywords are specified by the user. The cloud returns the files containing more than a threshold number of the input keywords or similar keywords, where similarity is defined according to the edit distance metric. We propose three solutions, in which blind signatures provide user access privacy, and a novel use of the Bloom filter's bit pattern speeds up the search task at the cloud side. Our final design is secure against insider threats and efficient in terms of search time at the cloud side. Performance evaluation and analysis demonstrate the practicality of our proposed solutions.
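
The similarity test at the heart of the query semantics is plain edit distance. Below is a standard Levenshtein implementation together with the threshold-matching rule described above; matches() and its parameters are our illustrative framing of that rule, shown on plaintext keywords rather than on the encrypted index.

```java
/** Levenshtein edit distance plus the threshold matching rule: a file
 *  matches if at least 'threshold' query keywords are within distance
 *  maxDist of one of its keywords. */
public class EditDistanceSearch {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    static boolean matches(String[] fileKeywords, String[] query, int threshold, int maxDist) {
        int hits = 0;
        for (String q : query)
            for (String w : fileKeywords)
                if (levenshtein(q, w) <= maxDist) { hits++; break; }
        return hits >= threshold;
    }
}
```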

Abstract:

Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and only capture the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data; hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

Abstract:

Searchable encryption allows a cloud server to conduct keyword search over encrypted data on behalf of the data users without learning the underlying plaintexts. However, most existing searchable encryption schemes only support single or conjunctive keyword search, while the few other schemes that can perform expressive keyword search are computationally inefficient, since they are built from bilinear pairings over composite-order groups. In this paper, we propose an expressive public-key searchable encryption scheme in prime-order groups, which allows keyword search policies (i.e., predicates, access structures) to be expressed in conjunctive, disjunctive, or any monotonic Boolean formulas, and achieves significant performance improvements over existing schemes. We formally define its security, and prove that it is selectively secure in the standard model. Also, we implement the proposed scheme using a rapid prototyping tool called Charm [37], and conduct several experiments to evaluate its performance. The results demonstrate that our scheme is much more efficient than the ones built over composite-order groups.
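
The policy class the scheme supports, monotonic Boolean formulas over keywords, can be conveyed with a small plaintext evaluator. This sketch (Java 16+ records; all names are ours) illustrates only the expressiveness of the predicates, not the pairing-based construction itself.

```java
import java.util.Set;

/** Monotonic Boolean keyword policies: keywords combined with AND/OR. */
interface Policy { boolean matches(Set<String> keywords); }

record Kw(String word) implements Policy {
    public boolean matches(Set<String> kws) { return kws.contains(word); }
}
record And(Policy l, Policy r) implements Policy {
    public boolean matches(Set<String> kws) { return l.matches(kws) && r.matches(kws); }
}
record Or(Policy l, Policy r) implements Policy {
    public boolean matches(Set<String> kws) { return l.matches(kws) || r.matches(kws); }
}
```

For example, new And(new Kw("urgent"), new Or(new Kw("cardiac"), new Kw("renal"))) matches any document whose keyword set satisfies the formula; the scheme evaluates such predicates over encrypted keywords instead.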

Abstract:

User preferences play a significant role in market analysis. In the database literature, there has been extensive work on query primitives, such as the well-known top-k query, which can be used for ranking products based on the preferences customers have expressed. Still, the fundamental operation of evaluating the similarity between products is typically performed ignoring these preferences. Instead, products are represented in a feature space based on their attributes, and similarity is computed via traditional distance metrics on that space. In this work, we utilize the rankings of products based on the opinions of their customers in order to map the products into a user-centric space where similarity calculations are performed. We identify important properties of this mapping that result in upper and lower similarity bounds, which in turn permit us to utilize conventional multidimensional indexes on the original product space to perform these user-centric similarity computations. We show how interesting similarity calculations, motivated by the commonly used range and nearest-neighbor queries, can be performed efficiently, while pruning significant parts of the data set based on the bounds we derive on the user-centric similarity of products.
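
A minimal sketch of the user-centric view, under our own simplifying assumption that each product is represented by the vector of ranks it receives from each customer, with similarity computed as a distance between rank vectors (the paper's precise mapping and bounds are more involved):

```java
/** Products as rank vectors: ranks[u] is the rank customer u assigns to the
 *  product (1 = most preferred). User-centric distance compares rank vectors. */
public class UserCentricSimilarity {
    static double distance(int[] ranksA, int[] ranksB) {
        double sum = 0;
        for (int u = 0; u < ranksA.length; u++) {
            double d = ranksA[u] - ranksB[u];
            sum += d * d;
        }
        return Math.sqrt(sum);  // Euclidean distance in the user-centric space
    }
}
```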

Abstract:

Getting back to previously viewed web pages is a common yet difficult task for users due to the large volume of personally accessed information on the web. This paper leverages humans' natural recall process of using episodic and semantic memory cues to facilitate recall, and presents WebPagePrev, a personal web revisitation technique based on context and content keywords. Underlying techniques for the acquisition, storage, decay, and utilization of context and content memories for page re-finding are discussed. A relevance feedback mechanism is also incorporated to adapt to an individual's memory strength and revisitation habits. Our 6-month user study shows that: (1) Compared with the existing web revisitation tool Memento, the History List Searching method, and the Search Engine method, the proposed WebPagePrev delivers the best re-finding quality in finding rate (92.10 percent), average F1-measure (0.4318), and average rank error (0.3145). (2) Our dynamic management of context and content memories, including the decay and reinforcement strategy, can mimic users' retrieval and recall mechanisms. With relevance feedback, the finding rate of WebPagePrev increases by 9.82 percent, average F1-measure increases by 47.09 percent, and average rank error decreases by 19.44 percent compared to a static memory management strategy. Among the time, location, and activity context factors in WebPagePrev, activity is the best recall cue, and context+content based re-finding delivers the best performance compared to context-based and content-based re-finding.
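
The decay-and-reinforcement management of memories can be pictured with a small weight store; DECAY and BOOST below are assumed constants, not the paper's tuned values, and the cue strings stand in for the time/location/activity context factors and content keywords.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of decaying memory weights with reinforcement on successful
 *  re-finding, loosely following WebPagePrev's description. */
public class MemoryStore {
    private static final double DECAY = 0.9, BOOST = 1.5;  // assumed constants
    private final Map<String, Double> weight = new HashMap<>();

    /** A cue (context factor or content keyword) observed during browsing. */
    void observe(String cue) { weight.merge(cue, 1.0, Double::sum); }

    /** Called periodically: every cue's weight decays toward zero. */
    void decayAll() { weight.replaceAll((cue, w) -> w * DECAY); }

    /** Relevance feedback: reinforce a cue that led to a successful re-find. */
    void reinforce(String cue) { weight.computeIfPresent(cue, (c, w) -> w * BOOST); }

    double score(String cue) { return weight.getOrDefault(cue, 0.0); }
}
```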

Abstract:

Web search engines are composed of thousands of query processing nodes, i.e., servers dedicated to processing user queries. These servers consume a significant amount of energy, mostly attributable to their CPUs, yet they are necessary to ensure low latencies, since users expect sub-second response times (e.g., 500 ms). However, users can hardly notice response times that are faster than their expectations. Hence, we propose the Predictive Energy Saving Online Scheduling algorithm (PESOS) to select the most appropriate CPU frequency to process a query on a per-core basis. PESOS aims to process queries by their deadlines, and leverages high-level scheduling information to reduce the CPU energy consumption of a query processing node. PESOS bases its decisions on query efficiency predictors, which estimate the processing volume and processing time of a query. We experimentally evaluate PESOS on the TREC ClueWeb09B collection and the MSN2006 query log. Results show that PESOS can reduce the CPU energy consumption of a query processing node by up to ∼48 percent compared with a system running at maximum CPU core frequency. PESOS also outperforms the best state-of-the-art competitor with a ∼20 percent energy saving, while the competitor requires fine parameter tuning and may incur uncontrollable latency violations.
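
The core scheduling decision reduces to: given a predicted processing time at the highest frequency, pick the lowest core frequency that still meets the query's deadline. The sketch below assumes, for simplicity, that processing time scales inversely with frequency; the names and the scaling model are ours, not PESOS's exact predictors.

```java
/** Pick the lowest available CPU core frequency whose estimated query
 *  processing time still meets the latency deadline. */
public class FrequencySelector {
    /** frequenciesGhz must be sorted ascending; returns the chosen frequency. */
    static double select(double[] frequenciesGhz, double predictedMsAtMax, double deadlineMs) {
        double fMax = frequenciesGhz[frequenciesGhz.length - 1];
        for (double f : frequenciesGhz) {
            double estimatedMs = predictedMsAtMax * (fMax / f);  // assumed inverse scaling
            if (estimatedMs <= deadlineMs) return f;             // lowest feasible frequency
        }
        return fMax;  // deadline unreachable: run as fast as possible
    }

    public static void main(String[] args) {
        double[] freqs = {1.2, 1.8, 2.4, 3.0};
        // 200 ms predicted at 3.0 GHz, 500 ms deadline: 1.2 GHz suffices.
        System.out.println(select(freqs, 200, 500));
    }
}
```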

Abstract:

With the popularity of social media (e.g., Facebook and Flickr), users can easily share their check-in records and photos during their trips. In view of the huge number of user historical mobility records in social media, we aim to discover travel experiences to facilitate trip planning. When planning a trip, users always have specific preferences regarding their trips. Instead of restricting users to limited query options such as locations, activities, or time periods, we consider arbitrary text descriptions as keywords about personalized requirements. Moreover, a diverse and representative set of recommended travel routes is needed. Prior works have elaborated on mining and ranking existing routes from check-in data. To meet the need for automatic trip organization, we claim that more features of Places of Interest (POIs) should be extracted. Therefore, in this paper, we propose an efficient Keyword-aware Representative Travel Route framework that uses knowledge extraction from users' historical mobility records and social interactions. Explicitly, we have designed a keyword extraction module to classify the POI-related tags, for effective matching with query keywords. We have further designed a route reconstruction algorithm to construct route candidates that fulfill the requirements. To provide befitting query results, we explore Representative Skyline concepts, that is, the Skyline routes which best describe the trade-offs among different POI features. To evaluate the effectiveness and efficiency of the proposed algorithms, we have conducted extensive experiments on real location-based social network datasets, and the experimental results show that our methods indeed demonstrate good performance compared to state-of-the-art works.
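
The Skyline notion used for representative routes is standard multi-criteria dominance. A minimal sketch, assuming each candidate route is scored on several POI features where higher is better:

```java
import java.util.ArrayList;
import java.util.List;

/** Skyline filter: keep the routes not dominated by any other route, i.e.,
 *  the set describing the trade-offs among the POI feature scores. */
public class RouteSkyline {
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) return false;       // a is worse in one dimension
            if (a[i] > b[i]) strictlyBetter = true;
        }
        return strictlyBetter;                    // no worse anywhere, better somewhere
    }

    static List<double[]> skyline(List<double[]> routes) {
        List<double[]> result = new ArrayList<>();
        for (double[] candidate : routes) {
            boolean dominated = false;
            for (double[] other : routes)
                if (other != candidate && dominates(other, candidate)) { dominated = true; break; }
            if (!dominated) result.add(candidate);
        }
        return result;
    }
}
```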

Abstract:

Continuous top-k query over streaming data is a fundamental problem in databases. In this paper, we focus on the sliding window scenario, where a continuous top-k query returns the top-k objects within each query window on the data stream. Existing algorithms support this type of query by incrementally maintaining a subset of objects in the window and trying to retrieve the answer from this subset whenever the window slides. However, since all existing algorithms are sensitive to query parameters and data distribution, they suffer from expensive incremental maintenance costs. In this paper, we propose a self-adaptive partition framework to support continuous top-k queries. It partitions the window into sub-windows and only maintains a small number of candidates with the highest scores in each sub-window. Based on this framework, we have developed several partition algorithms to cater to different object distributions and query parameters. To the best of our knowledge, ours is the first algorithm that achieves logarithmic complexity w.r.t. k for incrementally maintaining the candidate set, even in worst-case scenarios.
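
A much-simplified view of the partition framework: the window is split into sub-windows, each retaining only its k highest-scoring objects, and the query answer is the top-k of the union of those candidates. The adaptive repartitioning and incremental eviction that give the paper its complexity bound are omitted here; all names are ours.

```java
import java.util.*;

/** Toy partitioned top-k: merge per-sub-window candidates through a
 *  size-k min-heap and return the k highest scores in descending order. */
public class PartitionedTopK {
    static List<Double> topK(List<List<Double>> subWindows, int k) {
        PriorityQueue<Double> candidates = new PriorityQueue<>();  // min-heap, size <= k
        for (List<Double> sub : subWindows) {
            List<Double> sorted = new ArrayList<>(sub);
            sorted.sort(Collections.reverseOrder());
            for (Double score : sorted.subList(0, Math.min(k, sorted.size()))) {
                candidates.add(score);
                if (candidates.size() > k) candidates.poll();      // drop the smallest
            }
        }
        List<Double> result = new ArrayList<>(candidates);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```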

Abstract:

Location prediction is widely used to forecast users' next place to visit based on their mobility logs. It is an essential problem in location data processing, invaluable for surveillance, business, and personal applications, and it is very challenging due to the sparsity of check-in data. An often ignored problem in recent studies is the variety across different check-in scenarios, which is becoming more pressing due to the increasing availability of location check-in applications. In this paper, we propose a new feature-fusion-based prediction approach, GALLOP, i.e., GlobAL feature fused LOcation Prediction for different check-in scenarios. Based on carefully designed feature extraction methods, we utilize a novel combined prediction framework. Specifically, we set out to utilize a density estimation model to profile geographical features, i.e., context information; a factorization method to extract collaborative information; and a graph structure to extract location transition patterns from users' temporal check-in sequences, i.e., content information. An empirical study on three different check-in datasets demonstrates the impressive robustness and improvement of the proposed approach.

Abstract:

Searchable encryption is an important technique for public cloud storage services to provide user data confidentiality protection while allowing users to perform keyword search over their encrypted data. Previous schemes only deal with exact or fuzzy keyword search to correct some spelling errors. In this paper, we propose a new wildcard searchable encryption system to support wildcard keyword queries, which has several highly desirable features. First, our system allows multiple-keyword search in which any queried keyword may contain zero, one, or two wildcards, and a wildcard may appear at any position in a keyword and represent any number of symbols. Second, it supports simultaneous search on multiple data owners' data using only one trapdoor. Third, it provides flexible user authorization and revocation to effectively manage search and decryption privileges. Fourth, it is constructed based on homomorphic encryption rather than Bloom filters, and hence completely eliminates the false positives caused by Bloom filters. Finally, it achieves a high level of privacy protection, since matching results remain unknown to the cloud server in the test phase. The proposed system is thoroughly analyzed and proved secure. Extensive experimental results indicate that our system is efficient compared with other existing wildcard searchable encryption schemes in the public key setting.
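
The plaintext semantics of the supported queries are easy to state: a keyword pattern may hold wildcards, each matching any number of symbols at any position. The sketch below shows this predicate in the clear via a regular-expression translation; the scheme's contribution is evaluating it under homomorphic encryption, which this does not attempt.

```java
import java.util.regex.Pattern;

/** Wildcard keyword matching in the clear: '*' matches any number of symbols. */
public class WildcardMatch {
    static boolean matches(String pattern, String keyword) {
        StringBuilder rx = new StringBuilder();
        for (char c : pattern.toCharArray())
            rx.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        return keyword.matches(rx.toString());
    }

    public static void main(String[] args) {
        System.out.println(matches("se*ure", "secure"));       // true
        System.out.println(matches("*crypt*", "encryption"));  // true
    }
}
```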

Abstract:

Although cloud computing offers elastic computation and storage resources, it poses challenges on the verifiability of computations and on data privacy. In this work we investigate verifiability for privacy-preserving multi-keyword search over outsourced documents. As the cloud server may return incorrect results due to system faults or an incentive to reduce computation cost, it is critical to offer verifiability of search results and privacy protection for outsourced data at the same time. To fulfill these requirements, we design a Verifiable Privacy-preserving keyword Search scheme, called VPSearch, by integrating an adapted homomorphic MAC technique with a privacy-preserving multi-keyword search scheme. The proposed scheme enables the client to verify search results efficiently without storing a local copy of the outsourced data. We also propose a random challenge technique with ordering for verifying top-k search results, which can detect incorrect top-k results with probability close to 1. We provide detailed analysis of the security, verifiability, privacy, and efficiency of the proposed scheme. Finally, we implement VPSearch using MATLAB and evaluate its performance over three UCI bag-of-words data sets. Experiment results show that authentication tag generation incurs only about 3% overhead, and a search query over 300,000 documents takes about 0.98 seconds on a laptop. To verify 300,000 similarity scores for one query, VPSearch costs only 0.29 seconds.

Abstract:

Driven by the growing security demands of data outsourcing applications in sustainable smart cities, encrypting clients’ data has been widely accepted by academia and industry. Data encryption should be done at the client side before outsourcing, because clouds and edges are not trusted. Therefore, how to properly encrypt data in a way that the encrypted and remotely stored data can still be queried has become a challenging issue. Though keyword searches over encrypted textual data have been extensively studied, approaches for encrypting graph-structured data with support for answering graph queries are still lacking in the literature. In this paper, we specifically investigate a graph encryption method for an important graph query type, called top-k Nearest Keyword (kNK) searches. We design several indexes to store the information necessary for answering queries, and guarantee that private information about the graph, such as vertex identifiers, keywords, and edges, is encrypted or excluded. The security and efficiency of our graph encryption scheme are demonstrated by theoretical proofs and experiments on real-world datasets, respectively.
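
As a plaintext baseline, a kNK query can be answered with Dijkstra's algorithm from the query vertex, collecting the k closest vertices that carry the keyword. The sketch below (names are ours) shows the functionality the encrypted indexes must reproduce without revealing vertices, keywords, or edges.

```java
import java.util.*;

/** Plaintext top-k nearest keyword search on a weighted graph. */
public class KnkBaseline {
    static List<Integer> knk(Map<Integer, Map<Integer, Double>> adj,  // vertex -> (neighbor -> weight)
                             Map<Integer, Set<String>> labels,        // vertex -> keywords
                             int source, String keyword, int k) {
        Map<Integer, Double> dist = new HashMap<>();
        PriorityQueue<Map.Entry<Integer, Double>> pq =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        pq.add(Map.entry(source, 0.0));
        List<Integer> hits = new ArrayList<>();
        while (!pq.isEmpty() && hits.size() < k) {
            Map.Entry<Integer, Double> e = pq.poll();
            int v = e.getKey();
            if (dist.containsKey(v)) continue;                // already settled
            dist.put(v, e.getValue());
            if (labels.getOrDefault(v, Set.of()).contains(keyword)) hits.add(v);
            for (Map.Entry<Integer, Double> nb : adj.getOrDefault(v, Map.of()).entrySet())
                if (!dist.containsKey(nb.getKey()))
                    pq.add(Map.entry(nb.getKey(), e.getValue() + nb.getValue()));
        }
        return hits;  // ids of the k nearest keyword-bearing vertices
    }
}
```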

Abstract:

Probabilistic top-k ranking is an important and well-studied query operator in uncertain databases. However, the quality of top-k results might be heavily affected by the ambiguity and uncertainty of the underlying data. Uncertainty reduction techniques have been proposed to improve the quality of top-k results by cleaning the original data. Unfortunately, most data cleaning models aim to probe the exact values of the objects individually and therefore do not work well for subjective data types, such as user ratings, which are inherently probabilistic. In this paper, we propose a novel pairwise crowdsourcing model to reduce the uncertainty of top-k ranking using a crowd of domain experts. Given a crowdsourcing task of limited budget, we propose efficient algorithms to select the best object pairs for crowdsourcing that will bring in the highest quality improvement. Extensive experiments show that our proposed solutions outperform a random selection method by up to 30 times in terms of quality improvement of probabilistic top-k ranking queries. In terms of efficiency, our proposed solutions can reduce the elapsed time of a brute-force algorithm from several days to one minute.

Abstract:

In this paper we propose a query expansion and user profile enrichment approach to improve the performance of recommender systems operating on a folksonomy, which stores and classifies the tags used to label a set of available resources. Our approach builds and maintains a profile for each user. When a user submits a query (consisting of a set of tags) on this folksonomy to retrieve a set of resources of interest, the approach automatically finds further “authoritative” tags to enrich the query and proposes them to the user. All “authoritative” tags considered interesting by the user are exploited to refine the query and, along with the tags directly specified by the user, are stored in the profile to enrich it. The expansion of user queries and the enrichment of user profiles allow any content-based recommender system operating on the folksonomy to retrieve and suggest a larger number of resources matching user needs and desires. Moreover, enriched user profiles can guide any collaborative filtering recommender system to proactively discover and suggest to a user many relevant resources, even ones the user has not explicitly searched for.

Abstract:

Using online consumer reviews as electronic word of mouth to assist purchase decision making has become increasingly popular. The Web provides an extensive source of consumer reviews, but one can hardly read all reviews to obtain a fair evaluation of a product or service. A text processing framework that can summarize reviews would therefore be desirable. A subtask to be performed by such a framework is finding the general aspect categories addressed in review sentences, for which this paper presents two methods. In contrast to most existing approaches, the first method is an unsupervised method that applies association rule mining on co-occurrence frequency data obtained from a corpus to find these aspect categories. While not on par with state-of-the-art supervised methods, the proposed unsupervised method performs better than several simple baselines, a similar but supervised method, and a supervised baseline, with an F₁-score of 67%. The second method is a supervised variant that outperforms existing methods with an F₁-score of 84%.
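
The co-occurrence intuition can be conveyed with a toy counter: tally how often each word co-occurs with each aspect category in a corpus, then let a new sentence's words vote. Note this sketch learns from labeled sentences, whereas the paper's first method mines association rules without supervision, so treat it only as an illustration of co-occurrence-based category assignment; all names are ours.

```java
import java.util.*;

/** Toy co-occurrence model: word-category counts drive category assignment. */
public class AspectByCooccurrence {
    final Map<String, Map<String, Integer>> count = new HashMap<>();  // word -> category -> n

    void observe(String[] words, String category) {
        for (String w : words)
            count.computeIfAbsent(w, x -> new HashMap<>()).merge(category, 1, Integer::sum);
    }

    String classify(String[] words) {
        Map<String, Integer> votes = new HashMap<>();
        for (String w : words)
            count.getOrDefault(w, Map.of())
                 .forEach((cat, n) -> votes.merge(cat, n, Integer::sum));
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey).orElse("unknown");
    }
}
```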

Abstract:

Nowadays, there is an ever-increasing migration of people to urban areas. Health care service is one of the most challenging aspects that is greatly affected by the vast influx of people to city centers. Consequently, cities around the world are investing heavily in digital transformation in an effort to provide healthier ecosystems for people. In such a transformation, millions of homes are being equipped with smart devices (e.g., smart meters and sensors), which generate massive volumes of fine-grained and indexical data that can be analyzed to support smart city services. In this paper, we propose a model that utilizes smart home big data as a means of learning and discovering human activity patterns for health care applications. We propose the use of frequent pattern mining, cluster analysis, and prediction to measure and analyze energy usage changes sparked by occupants' behavior. Since people's habits are mostly identified by everyday routines, discovering these routines allows us to recognize anomalous activities that may indicate people's difficulties in caring for themselves, such as not preparing food or not using a shower/bath. This paper addresses the need to analyze temporal energy consumption patterns at the appliance level, which is directly related to human activities. For the evaluation of the proposed mechanism, this paper uses the U.K. Domestic Appliance Level Electricity data set: time-series data of power consumption collected from 2012 to 2015 at a time resolution of 6 s for five houses with 109 appliances from Southern England. The data from smart meters are recursively mined in a quantum/data slice of 24 h, and the results are maintained across successive mining exercises. The results of identifying human activity patterns from appliance usage are presented in detail in this paper, along with the accuracy of short- and long-term predictions.
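
One day's appliance activity can be reduced to a set of active appliances, and routines to sets that recur often enough. The sketch below mines whole daily sets rather than all frequent itemsets, and flags a day matching no routine as anomalous; minSupport and the set-level simplification are our assumptions, a stand-in for the paper's full pipeline.

```java
import java.util.*;

/** Toy routine miner over daily sets of active appliances. */
public class RoutineMiner {
    /** Keep daily activity sets that recur on at least minSupport days. */
    static Map<Set<String>, Integer> routines(List<Set<String>> dailyActiveSets, int minSupport) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Set<String> day : dailyActiveSets)
            support.merge(day, 1, Integer::sum);
        support.values().removeIf(n -> n < minSupport);
        return support;
    }

    /** A day is anomalous if it covers none of the mined routines. */
    static boolean anomalous(Set<String> today, Map<Set<String>, Integer> routines) {
        return routines.keySet().stream().noneMatch(today::containsAll);
    }
}
```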

Abstract:

Community Question Answering (CQA) sites, such as Stack Overflow and Yahoo! Answers, have become very popular in recent years. These sites contain rich crowdsourced knowledge contributed by the site users in the form of questions and answers, and these questions and answers can satisfy the information needs of many more users. In this article, we aim at predicting the voting scores of questions/answers shortly after they are posted on CQA sites. To accomplish this task, we identify three key aspects that matter for the voting score of a post: the non-linear relationships between features and output, the coupling between questions and answers, and the dynamic nature of data arrivals. A family of algorithms is proposed to model these three key aspects. Some approximations and extensions are also proposed to scale up the computation. We analyze the proposed algorithms in terms of optimality, correctness, and complexity. Extensive experimental evaluations conducted on two real data sets demonstrate the effectiveness and efficiency of our algorithms.