Data Science Journal
Online ISSN: 1683-1470
Current issue
Showing 1-33 of the 33 articles from the selected issue
Papers
  • Prabha Dhandayudam, Ilango Krishnamurthi
    2014 Volume 13 Pages 1-11
    Published: 2014
    Released: 2014/04/09
    [Advance publication] Released: 2014/04/03
    JOURNAL FREE ACCESS
    Customer segmentation is a process that divides a business's total customers into groups according to the diversity of their purchasing behavior and characteristics. The data mining clustering technique can be used to accomplish this customer segmentation. This technique clusters the customers in such a way that the customers within one group behave similarly to each other and differently from the customers in other groups. Customer-related data are categorical in nature. However, clustering algorithms for categorical data are few and are unable to handle uncertainty. Rough set theory (RST) is a mathematical approach that handles uncertainty and is capable of discovering knowledge from a database. This paper proposes a new clustering technique called MADO (Minimum Average Dissimilarity between Objects) for categorical data based on elements of RST. The proposed algorithm is compared with other RST-based clustering algorithms, such as MMR (Min-Min Roughness), MMeR (Min Mean Roughness), SDR (Standard Deviation Roughness), SSDR (Standard deviation of Standard Deviation Roughness), and MADE (Maximal Attributes DEpendency). The results show that for the real customer data considered, the MADO algorithm achieves clusters with higher cohesion, lower coupling, and lower computational complexity than the above-mentioned algorithms. The proposed algorithm has also been tested on a synthetic data set to show that it is also suitable for high-dimensional data.
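    A minimal sketch of the dissimilarity computation at the heart of such categorical clustering: average pairwise simple-matching dissimilarity, with a greedy binary split that minimizes it. The function names, toy customer records, and splitting heuristic are illustrative assumptions, not the published MADO procedure.
```python
from itertools import combinations

def dissimilarity(x, y):
    """Simple matching distance: fraction of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def avg_dissimilarity(cluster):
    """Average pairwise dissimilarity among the objects of one cluster."""
    if len(cluster) < 2:
        return 0.0
    pairs = list(combinations(cluster, 2))
    return sum(dissimilarity(x, y) for x, y in pairs) / len(pairs)

def best_binary_split(cluster):
    """Greedy step: split on the attribute value whose two resulting clusters
    have the lowest mean of average within-cluster dissimilarities."""
    best = None
    for attr in range(len(cluster[0])):
        for value in {obj[attr] for obj in cluster}:
            left = [o for o in cluster if o[attr] == value]
            right = [o for o in cluster if o[attr] != value]
            if not left or not right:
                continue
            score = (avg_dissimilarity(left) + avg_dissimilarity(right)) / 2
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

# Toy customer records: (gender, preferred category, payment method).
customers = [("F", "grocery", "card"), ("F", "grocery", "cash"),
             ("M", "electronics", "card"), ("M", "electronics", "card")]
score, left, right = best_binary_split(customers)
print(f"split score {score:.2f}:", left, "|", right)
```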
  • D Sasikala, K Premalatha
    2014 Volume 13 Pages 12-25
    Published: 2014
    Released: 2014/04/21
    [Advance publication] Released: 2014/04/03
    JOURNAL FREE ACCESS
    In recent times, the mining of association rules from XML databases has received attention because of its wide applicability and flexibility. Many mining methods have been proposed. Because of the inherent flexibility of the structures and the semantics of the documents, however, these methods are challenging to use. In order to accomplish the mining, an XML document must first be converted into a relational dataset, and an index table with node encoding is created to extract transactions and interesting items. In this paper, we propose a new method to mine association rules from XML documents using a new type of node encoding scheme that employs a Unique Identifier (UID) to extract the important items. The node scheme modified with UID encoding speeds up the mining process. A significance measure is used to identify the important rules found in the XML database. Finally, the mining procedure calculates the confidence that the identified rules are indeed meaningful. Experiments are conducted using XML databases available in the XML data repository. The results illustrate that the proposed method is efficient in terms of computation time and memory usage.
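    As an illustration of the general pipeline the paper describes, the sketch below flattens an XML document into transactions with Python's standard library and mines simple two-item rules by support and confidence. The toy document and thresholds are invented; the paper's UID node encoding and index table are not reproduced here.
```python
import xml.etree.ElementTree as ET
from itertools import combinations

# A toy XML "database" of purchase transactions (invented for illustration).
xml_doc = """<orders>
  <order><item>milk</item><item>bread</item></order>
  <order><item>milk</item><item>eggs</item></order>
  <order><item>milk</item><item>bread</item><item>eggs</item></order>
</orders>"""

# Step 1: flatten each <order> element into one transaction of item names.
root = ET.fromstring(xml_doc)
transactions = [{item.text for item in order} for order in root]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 2: report two-item rules that clear minimum support and confidence.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    s = support({a, b})
    for ante, cons in ((a, b), (b, a)):
        conf = s / support({ante})
        if s >= 0.5 and conf >= 0.6:
            print(f"{ante} -> {cons}  support={s:.2f}  confidence={conf:.2f}")
```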
  • Xin Cheng, Changjun Hu, Yang Li
    2014 Volume 13 Pages 26-44
    Published: 2014
    Released: 2014/04/27
    [Advance publication] Released: 2014/04/24
    JOURNAL FREE ACCESS
    A Materials Engineering Application (MEA) has been presented as a solution for the problems of materials design, solutions simulation, production and processing, and service evaluation. Large amounts of data are generated in the MEA distributed and heterogeneous environment. As the demand for intelligent engineering information applications increases, the challenge is to organize these complex data effectively and provide timely and accurate on-demand services. In this paper, based on the supporting environment of Open Cloud Services Architecture (OCSA) and Virtual DataSpace (VDS), a new semantic-driven knowledge representation model for MEA information is proposed. Faced with MEA's constantly changing user requirements, this model elaborates the semantic representation of data, services, and their relationships to support the construction of a domain knowledge ontology. Then, based on the ontology modeling in VDS, the semantic representations of association mapping, rule-based reasoning, and evolution tracking are analyzed to support MEA knowledge acquisition. Finally, an application example of knowledge representation in the field of materials engineering is given to illustrate the proposed model, and some experimental comparisons are discussed for evaluating and verifying the effectiveness of this method.
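    A minimal sketch of the general idea of representing domain data, services, and their relationships as ontology triples, using the rdflib library. The mea namespace, class and property names, and SPARQL query are invented for illustration and are not the paper's VDS ontology.
```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical namespace; the paper's actual VDS ontology is not reproduced here.
MEA = Namespace("http://example.org/mea#")
g = Graph()
g.bind("mea", MEA)

# Describe a material, a simulation service, and the relationship between them.
g.add((MEA.Steel42, RDF.type, MEA.Material))
g.add((MEA.Steel42, MEA.tensileStrengthMPa, Literal(510)))
g.add((MEA.HeatSim, RDF.type, MEA.SimulationService))
g.add((MEA.HeatSim, MEA.acceptsInput, MEA.Steel42))

# A SPARQL query then supports simple rule-like knowledge acquisition:
# find every service that accepts some material as input.
q = "SELECT ?svc WHERE { ?svc mea:acceptsInput ?m . ?m a mea:Material . }"
for row in g.query(q, initNs={"mea": MEA}):
    print(row.svc)
```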
  • Taro Ubukawa, Alex de Sherbinin, Harlan Onsrud, Andy Nelson, Karen Pay ...
    2014 Volume 13 Pages 45-66
    Published: 2014
    Released: 2014/05/29
    [Advance publication] Released: 2014/05/15
    JOURNAL FREE ACCESS
    There is a clear need for a public domain data set of road networks with high spatial accuracy and global coverage for a range of applications. The Global Roads Open Access Data Set (gROADS), version 1, is a first step in that direction. gROADS relies on data from a wide range of sources and was developed using a range of methods. Traditionally, map development was highly centralized and controlled by government agencies because of the high cost of the required expertise and technology. In the past decade, however, high resolution satellite imagery and global positioning system (GPS) technologies have come into wide use, and there has been significant innovation in web services, such that a number of new methods to develop geospatial information have emerged, including automated and semi-automated road extraction from satellite/aerial imagery and crowdsourcing. In this paper we review the data sources, methods, and pros and cons of a range of road data development methods: heads-up digitizing, automated/semi-automated extraction from remote sensing imagery, GPS technology, crowdsourcing, and compiling existing data sets. We also consider the implications of each method for the production of open data.
  • A Düsterhus, A Hense
    2014 Volume 13 Pages 67-78
    Published: 2014
    Released: 2014/06/09
    [Advance publication] Released: 2014/06/05
    JOURNAL FREE ACCESS
    A peer review scheme comparable to that used in traditional scientific journals is a major element missing in bringing publications of raw data up to standards equivalent to those of traditional publications. This paper introduces a quality evaluation process designed to analyse the technical quality as well as the content of a dataset. The process is based on quality tests, the results of which are evaluated with the help of expert knowledge. As a result, the quality is summarized by a single value. Further, the paper includes an application and a critical discussion of the potential for success, the possible introduction of the process into data centres, and the practical implications of the scheme.
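    A minimal sketch of the kind of aggregation such a process might perform, assuming each quality test yields a score in [0, 1] and an expert assigns weights. The tests, weights, and scale below are invented, not the authors' scheme.
```python
# Each test returns a pass fraction in [0, 1]; expert-assigned weights express
# how much each aspect should count toward the final single-value estimate.
test_results = {"file_integrity": 1.00, "metadata_completeness": 0.80,
                "physical_plausibility": 0.95, "documentation": 0.60}
expert_weights = {"file_integrity": 3, "metadata_completeness": 2,
                  "physical_plausibility": 4, "documentation": 1}

total_weight = sum(expert_weights.values())
quality = sum(test_results[t] * w for t, w in expert_weights.items()) / total_weight
print(f"Dataset quality estimate: {quality:.2f}")  # single value in [0, 1]
```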
  • Ge Peng, Jean-Raymond Bidlot, H Paul Freitag, Carl J Schreck III
    2014 Volume 13 Pages 79-87
    Published: 2014
    Released: 2014/08/12
    [Advance publication] Released: 2014/07/29
    JOURNAL FREE ACCESS
    This article documents a systematic bias in surface wind directions between the TAO buoy measurements at 0°, 170°W and the ECMWF analysis and forecasts. This bias was of the order of 10° and persisted from November 2008 to January 2010, which was consistent with a post-recovery calibration drift in the anemometer vane. Unfortunately, the calibration drift was too time-variant to be used to correct the data, so the quality flag for this deployment was adjusted to reflect low data quality. The primary purpose of this paper is to inform users in the modelling and remote-sensing community about this systematic, persistent wind directional bias, which will allow users to make an educated decision on using the data and be aware of its potential impact on their downstream product quality. The uncovering of this bias and its source demonstrates the importance of continuous scientific oversight and effective communication between users and data providers in stewarding scientific data. It also suggests that the buoy data quality control procedures of the TAO and ECMWF systems need to be improved so that systematic wind direction biases such as the one described here can be detected in the future.
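    Detecting a directional bias of this kind is essentially a circular-statistics exercise on paired buoy-minus-model direction differences. The sketch below, with invented sample values, shows one common way to compute such a mean bias; it is not the procedure used by the authors.
```python
import numpy as np

# Hypothetical paired wind directions (degrees): buoy observations vs. model analysis.
buoy_dir = np.array([182.0, 190.0, 175.0, 200.0, 188.0])
model_dir = np.array([172.0, 179.0, 166.0, 189.0, 178.0])

# Wrap differences into (-180, 180] before averaging, then take the circular mean
# via the mean sine and cosine so that angles near the wrap point are handled.
diff = (buoy_dir - model_dir + 180.0) % 360.0 - 180.0
rad = np.deg2rad(diff)
mean_bias = np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean()))
print(f"Mean directional bias: {mean_bias:.1f} deg")  # ~ +10 deg for this sample
```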
  • Costantino Thanos
    2014 Volume 13 Pages 88-105
    Published: 2014
    Released: 2014/09/14
    [Advance publication] Released: 2014/09/11
    JOURNAL FREE ACCESS
    Modern science is increasingly data-intensive, multidisciplinary, and network-centric. There is an emerging consensus among the members of the academic research community that the practices of this new science paradigm should be congruent with “open science”. This entails that the bonanza of research data and the widely available algorithms, data tools, and data services produced by the members of the research community must be discoverable, understandable, and usable, which requires overcoming all kinds of heterogeneity and logical inconsistency. The main concept for coping with the many dimensions of heterogeneity and logical inconsistency is mediation. Mediation is achieved by mediators or brokers. These are software modules that exploit encoded knowledge about certain datasets, data services, and user needs in order to implement an intermediary service. A mediating environment is an environment that provides a core set of intermediary services. Mediation should be a distinct functionality of future research data infrastructures. This paper surveys the different levels of interoperability, i.e., exchangeability, compatibility, and usability, their properties and relationships, mediation concepts, functions, and intermediary services. The current interoperability landscape is also illustrated. Finally, the paper advocates the need for mediating environments to be supported by future research data infrastructures and envisions that one of the most important features of future research data infrastructures will be mediation software.
  • Singh Bharat, O P Vyas
    2014 Volume 13 Pages 106-118
    Published: 2014
    Released: 2014/11/14
    [Advance publication] Released: 2014/11/06
    JOURNAL FREE ACCESS
    Selecting an optimal feature subset from high-dimensional data with a very large number of features is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique that utilizes regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as the criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN) and Bayesian techniques. Various high-dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.
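    A minimal sketch of a binary PSO wrapper for feature selection with a KNN fitness function, using scikit-learn. The swarm parameters, dataset, and sigmoid transfer function are simplifying assumptions; the paper's combination with regression techniques is not reproduced.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_features, n_iters = 10, X.shape[1], 15

def fitness(mask):
    """Cross-validated KNN accuracy on the selected feature subset."""
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(knn, X[:, mask], y, cv=3).mean()

pos = rng.random((n_particles, n_features)) < 0.5      # binary positions (bool)
vel = rng.normal(0.0, 1.0, (n_particles, n_features))  # real-valued velocities
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = (0.7 * vel
           + 1.5 * r1 * (pbest.astype(float) - pos)
           + 1.5 * r2 * (gbest.astype(float) - pos))
    pos = rng.random(vel.shape) < 1.0 / (1.0 + np.exp(-vel))  # sigmoid transfer
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved] = pos[improved]
    pbest_fit[improved] = fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print(f"selected {int(gbest.sum())} of {n_features} features, "
      f"CV accuracy {pbest_fit.max():.3f}")
```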
  • Nidhi Kushwaha, O P Vyas
    2014 Volume 13 Pages 119-126
    Published: 2014
    Released: 2014/11/14
    [Advance publication] Released: 2014/11/06
    JOURNAL FREE ACCESS
    The Semantic Web (Web 3.0) has been proposed as an efficient way to access the increasingly large amounts of data on the internet. The Linked Open Data Cloud project is at present the major effort to implement the concepts of the Semantic Web, addressing the problems of inhomogeneity and large data volumes. RKBExplorer is one of many repositories implementing Open Data and contains considerable bibliographic information. This paper discusses bibliographic data, an important part of cloud data. Effective searching of bibliographic datasets can be a challenge because many of the papers residing in these databases do not have sufficient or comprehensive keyword information. In these cases, a search engine based on RKBExplorer can retrieve papers only by author names and paper titles, not by keywords. In this paper we attempt to address this problem by using the data mining algorithm Association Rule Mining (ARM) to develop keywords based on features retrieved from Resource Description Framework (RDF) data within a bibliographic citation. We demonstrate the applicability of this method for predicting missing keywords for bibliographic entries in several typical databases. A sketch of the keyword-prediction idea follows the footnote below.
    −−−−−
    ¹ Paper presented at 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014) March 27-28, 2014. Organized by VIT University, Chennai, India. Sponsored by BRNS.
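    The keyword-prediction idea can be sketched as mining co-occurrence rules over bibliographic features and suggesting high-confidence consequents as missing keywords. The records and thresholds below are invented, not drawn from RKBExplorer.
```python
from collections import Counter
from itertools import combinations

# Hypothetical bibliographic records: title words already extracted as features.
records = [{"clustering", "categorical", "rough-sets"},
           {"clustering", "categorical", "k-modes"},
           {"clustering", "rough-sets", "uncertainty"},
           {"xml", "association-rules", "mining"}]

min_support, min_confidence = 0.25, 0.6
item_counts = Counter(i for r in records for i in r)
pair_counts = Counter(p for r in records for p in combinations(sorted(r), 2))

# Rules "feature -> keyword": if a new entry mentions the antecedent feature,
# the consequent is suggested as a missing keyword.
for (a, b), n in pair_counts.items():
    support = n / len(records)
    for ante, cons in ((a, b), (b, a)):
        confidence = n / item_counts[ante]
        if support >= min_support and confidence >= min_confidence:
            print(f"{ante} -> {cons} (support={support:.2f}, conf={confidence:.2f})")
```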
  • S P Syed Ibrahim, K R Chandran, C J Kabila Kanthasamy
    2014 Volume 13 Pages 127-137
    Published: 2014
    Released: 2014/11/27
    [Advance publication] Released: 2014/11/06
    JOURNAL FREE ACCESS
    The associative classification method integrates association rule mining and classification. Constructing an efficient classifier with a small set of high-quality rules is a highly important but challenging task. The lazy learning associative classification method removes the need to construct a classifier in advance but suffers from high computation costs. This paper proposes a Compact Highest Subset Confidence-Based Associative Classification scheme that generates compact subsets based on information gain and classifies new samples without constructing classifiers. Experimental results show that the proposed system outperforms both the traditional and the existing lazy learning associative classification methods. A sketch of the lazy approach follows the footnote below.
    −−−−−
    ¹ Paper presented at 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014) March 27-28, 2014. Organized by VIT University, Chennai, India. Sponsored by BRNS.
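    A minimal sketch of the lazy (classifier-free) flavour of associative classification: rules are formed at prediction time from training records matching the test sample's attribute values, and the highest-confidence rule decides the class. The data are toy values, and the information-gain-based subset generation of the proposed scheme is not reproduced.
```python
from collections import Counter

# Hypothetical categorical training data: (attributes, class label).
train = [({"outlook": "sunny", "wind": "weak"}, "play"),
         ({"outlook": "sunny", "wind": "strong"}, "stay"),
         ({"outlook": "rain",  "wind": "weak"}, "play"),
         ({"outlook": "rain",  "wind": "strong"}, "stay")]

def predict(sample):
    """Lazy associative classification: score each class by the confidence of
    rules built only from attribute values the test sample actually has."""
    best_label, best_conf = None, -1.0
    for attr, value in sample.items():
        matching = [label for attrs, label in train if attrs.get(attr) == value]
        if not matching:
            continue
        label, count = Counter(matching).most_common(1)[0]
        conf = count / len(matching)
        if conf > best_conf:
            best_label, best_conf = label, conf
    return best_label

print(predict({"outlook": "sunny", "wind": "weak"}))  # 'play' (wind=weak rule, conf 1.0)
```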
  • Sugam Sharma, Udoyara S Tim, Johnny Wong, Shashi Gadia, Subhash Sharma
    2014 Volume 13 Pages 138-157
    Published: 2014
    Released: 2014/12/04
    [Advance publication] Released: 2014/11/24
    JOURNAL FREE ACCESS
    Today, science is passing through an era of transformation in which the inundation of data, dubbed the data deluge, is influencing the decision-making process. Science is increasingly driven by data and is being termed data science. In this internet age, the volume of data has grown to petabytes, and this large, complex, structured or unstructured, and heterogeneous data in the form of “Big Data” has gained significant attention. The rapid pace of data growth through various disparate sources, especially social media such as Facebook, has seriously challenged the data analytic capabilities of traditional relational databases. The velocity with which the amount of data is expanding gives rise to a complete paradigm shift in how new-age data is processed. Confidence in the data engineering of the existing data processing systems is gradually fading, whereas the capabilities of the new techniques for capturing, storing, visualizing, and analyzing data are evolving. In this review paper, we discuss some of the modern Big Data models that are leading contributors in the NoSQL era and claim to address Big Data challenges in reliable and efficient ways. We also take the potential of Big Data into consideration and try to reshape the original operational-oriented definition of “Big Science” (Furner, 2003) into a new data-driven definition, rephrasing it as “The science that deals with Big Data is Big Science.”
  • Hua Qin, Lynne Davis, Matthew Mayernik, Patricia Romero Lankao, John D ...
    2014 Volume 13 Pages 158-171
    Published: 2014
    Released: 2014/12/04
    [Advance publication] Released: 2014/11/26
    JOURNAL FREE ACCESS
    Meta-analyses are studies that bring together data or results from multiple independent studies to produce new and over-arching findings. Current data curation systems only partially support meta-analytic research. Some important meta-analytic tasks, such as the selection of relevant studies for review and the integration of research datasets or findings, are not well supported in current data curation systems. To design tools and services that more fully support meta-analyses, we need a better understanding of meta-analytic research. This includes an understanding of both the practices of researchers who perform the analyses and the characteristics of the individual studies that are brought together. In this study, we make an initial contribution to filling this gap by developing a conceptual framework linking meta-analyses with data paths represented in published articles selected for the analysis. The framework focuses on key variables that represent primary/secondary datasets or derived socio-ecological data, contexts of use, and the data transformations that are applied. We introduce the notion of using variables and their relevant information (e.g., metadata and variable relationships) as a type of currency to facilitate synthesis of findings across individual studies and leverage larger bodies of relevant source data produced in small science research. Handling variables in this manner provides an equalizing factor between data from otherwise disparate data-producing communities. We conclude with implications for exploring data integration and synthesis issues as well as system development.
  • A Aparicio-González, J L López-Jurado, R Balbín, J C Alonso, B Amengua ...
    2015 Volume 13 Pages 172-191
    Published: 2015
    Released: 2015/01/27
    [Advance publication] Released: 2015/01/10
    JOURNAL FREE ACCESS
    IBAMar is a regional database that brings together all the physical and biochemical data provided by multiparametric probes and water sample analyses taken during the cruises managed by the Balearic Oceanographic Center of the Instituto Español de Oceanografía (COB-IEO) during the last four decades. Initially, it integrated data from hydrographic profiles obtained from CTDs (conductivity, temperature, depth) equipped with several sensors, but it has recently been extended to incorporate data obtained with hydrocasts using oceanographic Niskin or Nansen bottles. The result is an extensive regional resource database that includes physical hydrographic data such as temperature (T), salinity (S), dissolved oxygen (DO), fluorescence, and turbidity, as well as biochemical data, specifically dissolved inorganic nutrients (phosphate, nitrate, nitrite, and silicate) and chlorophyll-a. Different technologies and methodologies were used by independent teams during the four decades of data sampling. However, in the IBAMar database, data have been reprocessed using the same protocols, and a standard quality control (QC) methodology has been applied to each variable. The result is a homogeneous and quality-controlled dataset. The IBAMar database at standard levels is freely available for exploration and download from http://www.ba.ieo.es/ibamar/.
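    Per-variable QC of the kind mentioned here typically includes range checks along the following lines; the thresholds and flag values in this sketch are illustrative assumptions, not IBAMar's actual limits.
```python
# Illustrative per-variable plausible ranges (not IBAMar's actual QC limits).
QC_RANGES = {"temperature_C": (-2.0, 35.0), "salinity_PSU": (2.0, 41.0),
             "oxygen_ml_l": (0.0, 9.0)}

def qc_flag(variable, value):
    """Return a simple quality flag: 1 = good, 4 = bad, 9 = missing."""
    if value is None:
        return 9
    low, high = QC_RANGES[variable]
    return 1 if low <= value <= high else 4

profile = [("temperature_C", 13.2), ("salinity_PSU", 38.5),
           ("temperature_C", 57.0), ("oxygen_ml_l", None)]
for var, val in profile:
    print(var, val, "-> flag", qc_flag(var, val))
```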
  • N Moles
    2015 Volume 13 Pages 192-202
    Published: 2015
    Released: 2015/01/27
    [Advance publication] Released: 2015/01/19
    JOURNAL FREE ACCESS
    With the growing importance of data to the scholarly record and the critical role journals play in facilitating data sharing, the complex landscape of scholarly journal data publication policies has become an obstacle for research. This paper outlines Data-PE, a framework for evaluating these policies. It takes the form of a conceptual foundation, comprising twelve criteria for evaluation, operationalized through an evaluation tool. Its objective is to function as a flexible means for a variety of stakeholders to appraise individual policies. Examples of the use of the framework are provided and means for the validation of the tool are discussed.
  • James Campbell
    2015 Volume 13 Pages 203-230
    Published: 2015
    Released: 2015/01/27
    [Advance publication] Released: 2015/01/19
    JOURNAL FREE ACCESS
    Making scientific data openly accessible and available for re-use is desirable to encourage validation of research results and/or economic development. Understanding what users may, or may not, do with data in online data repositories is key to maximizing the benefits of scientific data re-use. Many online repositories that allow access to scientific data indicate that data is “open,” yet specific usage conditions reviewed on 40 “open” sites suggest that there is no agreed upon understanding of what “open” means with respect to data. This inconsistency can be an impediment to data re-use by researchers and the public.
  • Ge Peng, Jeffrey L Privette, Edward J Kearns, Nancy A Ritchey, Steve A ...
    2015 Volume 13 Pages 231-253
    Published: 2015
    Released: 2015/02/02
    [Advance publication] Released: 2015/01/27
    JOURNAL FREE ACCESS
    This paper presents a stewardship maturity assessment model in the form of a matrix for digital environmental datasets. Nine key components are identified based on requirements imposed on digital environmental data and information that are cared for and disseminated by U.S. Federal agencies, drawing on U.S. law (i.e., the Information Quality Act of 2001), agencies’ guidance, expert bodies’ recommendations, and user needs. These components are: preservability, accessibility, usability, production sustainability, data quality assurance, data quality control/monitoring, data quality assessment, transparency/traceability, and data integrity. A five-level progressive maturity scale is then defined for each component, associated with measurable practices applied to individual datasets and representing the Ad Hoc, Minimal, Intermediate, Advanced, and Optimal stages. The rationale for each key component and its maturity levels is described. This maturity model, which leverages community best practices and standards, provides a unified framework for assessing scientific data stewardship. It can be used to create a stewardship maturity scoreboard for datasets and a roadmap for scientific data stewardship improvement, or to provide data quality and usability information to users, stakeholders, and decision makers.
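    The matrix lends itself to a simple scoreboard. In this sketch the nine component names come from the abstract, while the 1-5 level ratings are invented example scores, not a real assessment.
```python
# The nine key components from the paper; the ratings below are invented
# example scores (1 = Ad Hoc ... 5 = Optimal), not a real assessment.
LEVELS = {1: "Ad Hoc", 2: "Minimal", 3: "Intermediate", 4: "Advanced", 5: "Optimal"}
scores = {"preservability": 4, "accessibility": 5, "usability": 3,
          "production sustainability": 3, "data quality assurance": 4,
          "data quality control/monitoring": 2, "data quality assessment": 3,
          "transparency/traceability": 4, "data integrity": 5}

for component, level in scores.items():
    print(f"{component:32s} level {level} ({LEVELS[level]})")
print(f"{'overall (mean of components)':32s} {sum(scores.values()) / len(scores):.1f}")
```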
  • Wang Xuezhi, Zhao Jianghua, Zhou Yuanchun, Li Jianhui
    2014 Volume 13 Pages 254-264
    Published: 2014
    Released: 2015/03/23
    [Advance publication] Released: 2014/11/06
    JOURNAL FREE ACCESS
    The rapid growth in the volume of remote sensing data and its increasing computational requirements pose major challenges for researchers, as traditional systems cannot adequately satisfy the demand for service. Cloud computing has the advantages of high scalability and reliability, which can provide firm technical support. This paper proposes a highly scalable geospatial cloud platform named the Geospatial Data Cloud, which is built on cloud computing. The architecture of the platform is first introduced, and then two subsystems, the cloud-based data management platform and the cloud-based data processing platform, are described.
    −−−−−
    This paper was presented at the First Scientific Data Conference on Scientific Research, Big Data, and Data Science, organized by CODATA-China and held in Beijing on 24-25 February, 2014.
Special Issue
Highlights of the 2013 International Forum on 'Polar Data Activities in Global Data Systems'.