When we analyze spatial data, we generally conduct a global spatial analysis. However, when spatial correlations are higher in local areas than globally, we expect local prediction to outperform global prediction. The present paper examines whether local prediction outperforms global prediction when correlation is high in a local area, based on the AMSE (average mean squared error) statistic. To show the usefulness of the proposed method, we perform a small simulation study and present an empirical example with real transaction data for apartments in Korea.
This paper focuses on the role of the "Like" button on "Facebook Pages" and proposes an analysis approach for increasing the number of "Likes" on a "Facebook Page." The paper uses latent class models to analyze the relationship between the number of "Likes" and the contents of "Posts". After the latent class analysis, the average number of "Likes" is compared across the class groupings. The data taken from Facebook focus on the case of one company, Satisfaction Guaranteed. The results show how the proposed approach, which measures patterns in the independent variables through latent class analysis, accounts for increases in the number of "Likes" on the "Facebook Page" of Satisfaction Guaranteed.
In predicting prices at auctions, we often use linear regression methods in which the objective variable is the price. To estimate the price, we apply regularization methods such as the ridge, the lasso, and their relatives. In used car auctions, these methods provide very similar accuracy in terms of the RMSE, the root mean squared error. However, we have found that accuracy improves when we apply k-nearest-neighbor (k-NN) regression, with variables selected via the linear regression methods, to this kind of auction.
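As a minimal sketch of the k-NN regression step described above: a plain NumPy implementation, not the authors' code, where the toy data and the assumption that lasso-based variable selection has already reduced the features are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest rows
    return y_train[nearest].mean()

# Toy data: price depends on a single (hypothetically pre-selected) feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(knn_predict(X, y, np.array([3.1]), k=3))  # averages y at x = 2, 3, 4 -> 30.0
```

In practice one would tune k and compare the RMSE of this predictor against the regularized regressions, as the abstract describes.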
Social network analysis is a statistical method that analyzes social structure according to the flow of information between observations. In this study, we used pass data between players in a soccer game. The analysis addresses the following questions: (1) who is the team leader, and how large a role do they play, and (2) which players play an important role in the game, identified by making many passes or passing among many players. The purpose of this study is to generate baseline data for the team's future play strategy by evaluating the role of each player within the team. We conducted the social network analysis both without separating positions and separately by position (defenders and non-defenders). The results are as follows. First, according to the available data, the players who performed the role of leader were Jungwoo Kim, Sungyueng Ki, and Chungyong Lee, and the sub-leader was Jeongsu Lee. By position, among defenders the leader was Jeongsu Lee, while among non-defenders every player performed so well that each could be considered a leader.
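A pass network of the kind described above can be summarized by simple degree centrality. This is an illustrative sketch with a hypothetical 3-player pass matrix, not the study's actual data or method.

```python
import numpy as np

# Hypothetical pass counts: entry (i, j) is the number of passes
# from player i to player j.
passes = np.array([[0, 5, 2],
                   [3, 0, 4],
                   [1, 2, 0]])

out_degree = passes.sum(axis=1)  # passes made by each player
in_degree = passes.sum(axis=0)   # passes received by each player
total = out_degree + in_degree   # overall involvement in the passing network
print(total)                     # the largest value marks the most central player
```

More refined leader measures (e.g. betweenness or eigenvector centrality) follow the same idea of ranking players by their position in the pass network.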
A candlestick chart is generally used by investors when analyzing portfolios. It consists of daily or weekly opening, high, low, and closing prices, and investors use candlestick charts on the basis of industrial categories. However, since such industrial categories are not based on share prices, the price movement of one candlestick chart can at times differ from that of another even when both charts belong to the same industrial category. Therefore, categories should also be created according to share prices. One study that classifies brands by share price proposes to use the closing price (Wittman, 2002). However, a method that uses only the closing price lacks other trade information. Thus, in this study, we propose a brand classification method that uses all four pieces of trading information: the opening, high, low, and closing prices. As an example, we evaluate similarity using artificial data.
This study introduces a new type of symbolic data, namely candle chart valued time series, and presents new approaches to forecasting the direction of a stock index (i.e., up or down) based on the forecast candle chart form. Building on approaches for interval valued time series, we propose forecasting methods for candle chart valued time series based on a combination of two mid-points and two half-ranges: between the highest and lowest index, and between the opening and closing index. We also propose a new sum of squares for candle chart valued time series. To evaluate the proposed methods, we report forecasting results for a real data set consisting of the stock market indexes of five major Asian countries. The forecasting results show that the new approaches and the sum of squares based on the interval valued time series approach outperform the others in forecasting candle charts.
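The mid-point/half-range representation of a candle described above can be sketched as follows; this is an assumed reading of the two intervals (high-low and open-close), not the authors' exact definition.

```python
def candle_features(open_, high, low, close):
    """Represent one candle by two mid-points and two half-ranges:
    one pair for the high-low interval, one for the open-close interval."""
    mid_hl = (high + low) / 2.0
    half_hl = (high - low) / 2.0
    mid_oc = (open_ + close) / 2.0
    half_oc = abs(close - open_) / 2.0
    return mid_hl, half_hl, mid_oc, half_oc

# One hypothetical candle: open 100, high 110, low 95, close 105.
print(candle_features(100.0, 110.0, 95.0, 105.0))  # (102.5, 7.5, 102.5, 2.5)
```

Each candle in the series then becomes a four-dimensional point, to which interval-valued forecasting methods can be applied component-wise.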
A method for simultaneously performing exploratory factor analysis and k-means clustering is proposed. This is achieved as an extension of the factor analysis model in which both common and specific factors are fixed. Our strategy avoids the tandem analysis problem, which makes it impossible to interpret the effect of the cluster structure. An efficient alternating least squares algorithm is developed. To illustrate its usefulness, some numerical analyses are conducted.
The purpose of classification for metabolomics data is to find a subset of metabolites, called marker candidates, that separates the groups efficiently as well as discriminating between them. We evaluate and compare five classification methods on 26 real datasets and provide guidelines for finding marker candidates with an appropriate classification method. Although this study shows that the predictive accuracies of the five methods are sufficiently high (more than 90%) in 19 of the 26 datasets, PLSDA and SDA outperform the other methods in terms of classification accuracy and metabolite selection.
A search system for VOD lectures is more useful if it goes beyond searching text alone. To facilitate better searching for movie segments of VOD lectures with Japanese subtitles, we propose a method that uses the subtitles and solves a maximum likelihood detection problem for a mixture of normal distributions. The detection is performed statistically using the EM algorithm, which allows us to estimate the parameters of each normal distribution and the number of components. In addition, to provide movie segment rankings, we rank each normal distribution in a mixture of normals approximating the frequency distribution of a search word. Rankings are computed from the distance between the full mixture and the mixture obtained by removing one normal distribution from it.
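The EM fitting step mentioned above can be sketched for the one-dimensional, two-component case; this is a generic textbook EM implementation under hypothetical data, not the paper's system.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by the EM algorithm."""
    mu = np.array([x.min(), x.max()])   # spread the initial means apart
    var = np.full(2, x.var())
    w = np.full(2, 0.5)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, w

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
mu, var, w = em_gmm_1d(x)
print(np.sort(mu).round(2))  # means recover the two cluster centers
```

For word-frequency data along the lecture timeline, each fitted component would correspond to a candidate movie segment.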
I will present three topics from my research interests: randomization, quantification, and visualization. First, I report the lack of randomness in the shuffling of HwaTu or Hanafuda cards (Huh and Lee, 2010). Second, I describe a multidimensional scaling procedure for asymmetric distance matrices (Huh and Lee, 2011). Lastly, nonparametric classifiers produced by the support vector machine are visualized in reduced dimensions (Huh and Park, 2010).
In this study we use spatio-temporal small area data on suicide in Japan. In particular, we focus on the municipality, a political unit such as a city, ward, town, or village incorporated for local self-government. We used line charts of time series of suicide rates and choropleth maps of suicide rates to detect temporal trends and spatial transitions. Furthermore, in order to reduce the difficulties of parameter selection and of detecting connections between the two graphs, we developed a system to visualize the spatio-temporal small area suicide data for Japan.
It has been found that male mice emit ultrasonic vocalizations (USVs) towards females during male-female interaction. The purpose of this paper is to classify the waveforms of mouse USV data. The data are transformed by the FFT (Fast Fourier Transform). Because the USV data are very noisy, it is impossible to analyze them with existing software. We first smooth the USV waveforms by a moving average method and then fit them with polynomial regression. After that, we classify the obtained USV curves by a functional clustering method. This analysis can also help us to find rules (or a grammar) of USV communication between mice.
The Center for Statistics and Information at Rikkyo University has developed an e-learning course on multivariate analysis. The course is designed for students in arts departments and has two features. First, its contents are based on examples of real data analysis rather than on mathematical aspects. Second, the course provides interactive materials for learning multivariate analysis. These two features enable students to learn multivariate analysis without struggling with the mathematics.
The study of human gait is important in biometrics and in sports/health management for planning optimal training. Gait analysis is mainly based on motion capture systems and video data. However, from the standpoint of gait recognition, motion capture is impractical for biometrics, and a video camera based approach is more realistic. Yet a video camera is highly visible in a monitoring environment, and if subjects notice the camera system they may change their behavior. In this study, we therefore focus on Doppler sensor based gait recognition. The purpose of this study is human gait modeling and parameter estimation based on a Doppler sensor system.
In recent years, interest in monitoring systems for the elderly has been increasing because of the aging society. Non-contact sensors have attracted attention because such systems must not interfere with the user's daily life. Many sensors (e.g., infrared, sound, and Doppler sensors) have been used in such systems; in particular, a microwave Doppler sensor is more robust to noise, light, and temperature than other sensors. In this paper, with a view to monitoring the elderly, we focus on the detection of heartbeat and respiration, since these ultimately allow a judgment of whether a person is alive. As an initial stage of the system, this paper proposes a method for detecting the respiration and heartbeat components in a low-disturbance environment using a microwave Doppler sensor.
Recently, human body modeling and human pose modeling have become hot topics in many studies. Several statistical methods have been proposed for biometric analysis and computer graphics, but few for apparel. In this study, we propose a statistical method that reconstructs human body shapes from various semantic values, e.g., height, waist girth, and chest girth, and takes the dispersion of human body shapes into account using principal component analysis and a regression model.
Selecting a landfill site is an important component of the waste management process. Inappropriate selection of a site can engender environmental damage, economic inefficiency, and social and political conflict. These concerns indicate that environmental, economic, and social factors should be considered simultaneously when selecting landfill sites. Landfill site selection is a complex, multicriteria decision making process that requires the evaluation of several factors and takes many different attributes into account. The purpose of this study is to examine a decision making process for site selection. First, we identified potential sites through preliminary screening based on exclusionary criteria. Second, data layers were created by collecting data and estimating the spatial distribution of environmental, economic, and social factors. Finally, after evaluating them against siting criteria, the data layers were combined by a fuzzy gamma operator to select candidate sites. The fuzzy analytic hierarchy process (FAHP) was also used to make pairwise comparisons and assign weights to the decision criteria.
When a ship moves across the sea from point A to point B, its direction of motion is affected by the tidal current vector. Dynamic programming is usually used to find the optimal headings for the ship, but it requires many tidal vector data points between A and B and solves the problem sequentially. Here we use an affine transformation to transform the tidal vectors at points A and B into position coordinates, and use a differential equation to sequentially solve for the optimal headings of the ship.
We consider the over-constrained airport gate assignment problem, where the number of flights exceeds the number of available gates and the objectives are to minimize the number of ungated flights and the total walking distance or connection times. We use a greedy algorithm to solve the problem and compare it with other scheduling methods. Actual and forecasted data are simulated in the experiment. The greedy algorithm minimizes ungated flights while providing initial feasible solutions that allow flexibility in seeking good solutions.
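A minimal greedy assignment of the kind described can be sketched as follows; the first-free-gate rule and the toy flight times are illustrative assumptions, not the paper's exact algorithm.

```python
def greedy_gate_assignment(flights, n_gates):
    """Assign flights (arrival, departure) to gates greedily:
    process flights by arrival time and give each the first gate
    that is free at its arrival; otherwise the flight stays ungated."""
    gate_free = [0.0] * n_gates          # time at which each gate becomes free
    assignment, ungated = {}, []
    for i, (arr, dep) in sorted(enumerate(flights), key=lambda t: t[1][0]):
        for g in range(n_gates):
            if gate_free[g] <= arr:
                gate_free[g] = dep       # gate g is occupied until departure
                assignment[i] = g
                break
        else:
            ungated.append(i)            # no gate was free: flight is ungated
    return assignment, ungated

# Four hypothetical flights as (arrival, departure), two gates.
flights = [(0, 2), (1, 3), (2, 4), (2.5, 5)]
assignment, ungated = greedy_gate_assignment(flights, n_gates=2)
print(assignment, ungated)  # flight 3 finds no free gate
```

A walking-distance objective would refine the inner loop to pick the best free gate rather than the first one.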
This study was undertaken to verify the effects of GGH(1) on obesity using high fat diet induced male mice. Eight-week-old C57BL/6N mice were used for all experiments. Standard chow diet fed mice were used as the lean control, and high fat diet induced obese mice were randomly divided into 4 groups: obese control, GGH(1)-125 mg/kg, GGH(1)-250 mg/kg, and GGH(1)-500 mg/kg. After the mice were treated by oral administration for 8 weeks, body weight, feeding efficiency ratio, plasma triglyceride level, and visceral adipose tissue weights were measured. Compared with obese controls, mice treated with GGH(1) at 125, 250, or 500 mg/kg had significantly lower body weight gain and feeding efficiency ratio. Consistent with the effects on body weight gain, all three doses decreased the weights of visceral adipose tissues and significantly decreased plasma triglyceride levels. Consistent with the effects on feeding efficiency ratio, all three doses decreased plasma leptin concentrations. Plasma AST and ALT were within the physiological range, and organs did not differ following GGH(1) treatment compared with obese controls, indicating that GGH(1) has no toxic effects on the liver. These results suggest that GGH(1) reduces obesity by regulating appetite and visceral lipid metabolism in C57BL/6N mice. Of the three GGH(1) doses, 500 mg/kg appears most effective in improving obesity and visceral lipid disorders.
In this paper, we analyze the report "Living activities area in Okayama Prefecture" at the town level. In the analysis, we evaluate the importance of distance to consumers using the distance decay parameter of the Huff model. We not only confirm the well known trend that consumers weigh distance heavily for convenience goods such as groceries and lightly for leisure, but also find a recent tendency to weigh distance heavily for all products. Furthermore, interesting differences were observed between urban and rural areas, and between households with and without cars.
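The Huff model referenced above gives patronage probabilities proportional to attractiveness divided by distance raised to the decay parameter. A minimal sketch with hypothetical store data:

```python
import numpy as np

def huff_probabilities(attractiveness, distances, decay):
    """Huff model: probability a consumer patronizes each store,
    proportional to attractiveness / distance**decay."""
    utility = attractiveness / distances ** decay
    return utility / utility.sum()

A = np.array([100.0, 100.0])   # two equally attractive stores
d = np.array([1.0, 2.0])       # the second store is twice as far
print(huff_probabilities(A, d, decay=2.0))  # probabilities 0.8 and 0.2
```

A larger decay parameter shifts probability toward nearby stores, which is how "importance of distance" is read off the fitted model.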
The drug-approval process in Japan lags far behind, and the number of approved drugs is far smaller than in other countries. This lag has made it difficult for Japanese doctors to keep up with global standards for the treatment of various diseases, a problem that should be solved immediately. To overcome the drug lag and have new drugs approved without delay in Japan, it is essential to join multi-national clinical trials and apply for approval using their results. Therefore, several university hospitals with excellent performance in clinical trials have established the UHCT Alliance to improve the trial environment, especially for multi-national trials, and to implement them more efficiently and safely.
In Nippon Professional Baseball, it is important for players to receive useful information. It has been reported that pitchers' records were better this year than last year; on the other hand, batters' records have worsened overall over the same period. Therefore, we created a pitching model using a multinomial logit model to provide useful information to batters. First, we justify the use of the multinomial logit model and explain the relevant terms used in it. Second, we define a pitching prediction model employing the multinomial logit model and the variables used in this paper, after which we describe the applied data. Finally, we present our conclusions.
In recent years, studies using football data have proliferated. Most existing studies focus on match results; in contrast, few have considered the events recorded successively during matches. Therefore, in this paper we characterize and compare football clubs using such information. We trace the ball's movement among players or field areas up to important events such as shots, and define these sequences collectively as attack patterns. We analyze the attack patterns using social network analysis, building digraphs whose nodes are players or areas, in order to characterize and compare clubs.
In various sports such as baseball, football, and volleyball, league systems are organized with the same teams, and the teams compete against each other every season. In this article, we consider modeling the winning percentage using a state space model. We apply the models to data from the Central League of Nippon Professional Baseball (NPB) for the period 1950-2004.
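One common state space specification for a slowly drifting winning percentage is the local level model, filtered with the Kalman recursions. This is an illustrative sketch with hypothetical noise variances and data, not the article's specific model.

```python
import numpy as np

def local_level_filter(y, sigma_obs=0.05, sigma_state=0.02):
    """Kalman filter for a local level model:
    y_t = mu_t + obs noise,  mu_t = mu_{t-1} + state noise."""
    mu, P = y[0], 1.0                 # initial state and its variance
    filtered = []
    for obs in y:
        P = P + sigma_state ** 2      # predict: state variance grows
        K = P / (P + sigma_obs ** 2)  # Kalman gain
        mu = mu + K * (obs - mu)      # update with the new observation
        P = (1 - K) * P
        filtered.append(mu)
    return np.array(filtered)

# Hypothetical season-by-season winning percentages.
y = np.array([0.55, 0.60, 0.52, 0.48, 0.58])
print(local_level_filter(y).round(3))
```

The filtered series smooths season-to-season noise, separating a team's underlying strength from year-by-year fluctuation.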
In sparse regression modeling via regularization such as the lasso, elastic net and bridge regression, it is important to select appropriate values of tuning parameters including regularization parameters. The choice of tuning parameters can be viewed as a model selection and evaluation problem. Mallows' C_p type criterion may be used to choose the tuning parameters, for which the concept of degrees of freedom plays a key role. In the present paper, we propose an efficient algorithm which computes the degrees of freedom sequentially by extending the generalized path seeking algorithm. Monte Carlo simulations demonstrate that our methodology performs well in various situations.
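The C_p-type criterion above trades off residual error against degrees of freedom; for the lasso, a standard choice of df is the number of nonzero coefficients. A minimal sketch with hypothetical fitted values:

```python
import numpy as np

def cp_criterion(y, y_hat, df, sigma2):
    """Mallows' C_p-type criterion: RSS / sigma^2 - n + 2 * df."""
    rss = np.sum((y - y_hat) ** 2)
    return rss / sigma2 - len(y) + 2 * df

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])     # hypothetical lasso fit
beta = np.array([0.7, 0.0, -1.2])          # lasso estimate with one zero
df = np.count_nonzero(beta)                # degrees of freedom = 2
print(cp_criterion(y, y_hat, df, sigma2=0.04))
```

Minimizing this criterion over a path of regularization parameters selects the tuning parameter, which is why computing df efficiently along the path matters.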
Sensitivity analysis based on influence functions has been widely studied in the field of statistics. In particular, this evaluation approach has been applied to statistical methods such as principal component analysis, correspondence analysis, and linear discriminant analysis. However, its study for discriminant methods in pattern recognition is less advanced. Against this background, we previously focused on the subspace method, a discriminant method in pattern recognition, and proposed a method for evaluating the influence of training samples on the result of analysis using influence functions. However, the performance and effectiveness of our method were not well illustrated. In this study, we apply our single-case diagnostics to a representative subspace method and show good results. Specifically, in situations with mislabeled samples in the training data, we were able to detect such samples using our approach and then delete them from the training data to enhance the performance of the target classifier.
This paper discusses a symbolic clustering method for distribution valued dissimilarities. Symbolic Data Analysis (SDA) is an approach to data analysis proposed by Diday in the 1980s; a clustering method for symbolic data is called "symbolic clustering". There are many studies, including hierarchical clustering by Bock (2001) and Chavent & Lechevallier (2002), but few deal with distribution valued dissimilarities. This paper proposes a new method for symbolic clustering using distribution valued dissimilarities.
Recently, studies that recognize human activity from acceleration and angular velocity sensors have been actively pursued. Applications of these studies extend to medical, sports, security, and various other fields. For recognizing human activity, the support vector machine (SVM) is considered one of the best learning machines currently known, because it offers high recognition accuracy and low computation time. However, SVM has difficulty handling outliers and missing values. Therefore, we focus on the Conditional Random Field (CRF), which recognizes activity while maximizing the likelihood over an interval. CRF has often been used in fields such as natural language processing, where the data are limited to one-dimensional, categorical data. In this paper, we propose a method that transforms multidimensional time series data into data that can be analyzed by a CRF, and we evaluate feature selection.
In this study we discuss Tamhane and Logan's (2002) multivariate one-sided test for comparing two normal mean vectors under the assumption that the common covariance matrix is unknown. Although they specified a statistic for the test, it is difficult to derive its distribution, so they derived its asymptotic distribution under the null hypothesis using a moment matching method. Although the critical value satisfies a specified significance level approximately, the closeness of the approximation does not seem to have been investigated in detail. In this study we give numerical examples of the actual Type I error in various cases for this test in order to investigate the closeness of the approximation.
The problem of classifying a new observation vector into one of two known groups, distributed as multivariate normal with a common covariance matrix, is considered. In this paper, we handle the situation in which the dimension, p, of the observation vectors is less than the total number, N, of observation vectors from the two groups, but both p and N tend to infinity at the same rate. Since the inverse of the sample covariance matrix is close to ill-conditioned in this situation, it may be better to replace it with the inverse of a ridge-type estimator of the covariance matrix in linear discriminant analysis (LDA). The resulting rule is called ridge-type linear discriminant analysis (RLDA). The second-order expansion of the expected probability of misclassification (EPMC) for RLDA was derived by Kubokawa, Hyodo and Srivastava (2011), who also gave a second-order unbiased estimator of the EPMC. In this study, the estimation accuracy of the second-order unbiased estimator of the EPMC is investigated by Monte Carlo simulation.
We consider multiple comparisons among mean vectors for high-dimensional data under the multivariate normality. The statistic based on Dempster trace criterion is given, and also its approximate upper percentile is derived by using Bonferroni's inequality. Finally, the accuracy of its approximate value is evaluated by Monte Carlo simulation.
We consider a two-sample test for the mean vectors of high-dimensional data when the dimension is large compared to the sample size. In this talk, we discuss the multivariate Behrens-Fisher problem, that is, we assume that the variance-covariance matrices are not homogeneous across groups. For this situation, we propose a Dempster-type test statistic. We also derive the asymptotic null distribution and an asymptotic expansion for the upper percentiles of this statistic when both the sample size and the dimension tend to infinity. Finally, we evaluate the accuracy of the approximation by Monte Carlo simulation.
The lasso is a simultaneous variable selection and parameter estimation procedure for linear regression models. The estimates can be interpreted as a Bayesian posterior mode when independent Laplace prior distributions are placed on the regression coefficients. Park and Casella (2008) extended the Bayesian lasso linear regression model by placing prior distributions on the hyperparameters of the independent Laplace distributions. It should be noted, however, that the point estimate of the Bayesian lasso is not sparse. In the present paper, we propose an efficient algorithm that modifies the Bayesian lasso estimates so as to be sparse. Monte Carlo simulations are conducted to investigate the efficiency of the proposed algorithm.
Data with a hierarchical structure are observed in many fields, such as sociology, psychology, and clinical trials. Hierarchical Generalized Linear Models (HGLMs) are applied to such data to carry out analyses that take the data structure into account. Likelihood (and approximate likelihood) approaches based on asymptotic theory are the most widely used in current hierarchical analyses; a Bayesian approach is one alternative. As is well known, the Bayesian approach is quite robust even when the data size is small. The purpose of this research is to compare Bayesian and likelihood-based approaches for fitting hierarchical generalized linear models.
From the precision medicine point of view, it is an interesting problem to search for subsets with a large treatment difference between test drugs and placebo based on patient background information. Many methods, such as classification and regression trees (CART) and the active region finder method (ARF), can be used to find subsets that influence the response variable. However, these methods evaluate only the influence on the response variable and do not consider the treatment difference. Therefore, it is necessary to develop methods that find subsets based on treatment difference information. In addition, there is the common difficulty of the curse of dimensionality when a subset is identified in a high dimensional explanatory variable space. In this paper, we propose two methods. One is a revised ARF that searches for subsets by measuring the treatment difference directly. The other combines ARF with relative projection pursuit (RPP) to find the subset with the largest treatment difference in a one-dimensional space reduced from the raw high dimensional space. Analyses of simulated data show that our methods can detect the subset with the largest treatment difference as designed.
ANP (Analytic Network Process) was developed from AHP (Analytic Hierarchy Process) to handle network structure. ANP has been used in the domain of decision making and is useful for solving problems with network structure or dependency between elements. An eigenvector of a pairwise comparison matrix is often employed as an element of a super matrix in ANP. We propose a sensitivity analysis for a pairwise comparison matrix because data often lose reliability; in other words, a comparison matrix does not always have sufficient consistency. In this case, a fuzzy representation of the weights is useful. We propose a fuzzy representation of the components of a super matrix, using the results of the sensitivity analysis. This enables us to obtain the composite weights of ANP as fuzzy numbers when the comparison matrix does not have good consistency.
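The priority weights taken from a pairwise comparison matrix are its principal eigenvector, which power iteration recovers. A minimal sketch with a hypothetical, fully consistent 3x3 matrix:

```python
import numpy as np

def priority_vector(A, n_iter=100):
    """Principal eigenvector of a pairwise comparison matrix,
    normalized to sum to one (the AHP/ANP priority weights)."""
    w = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(n_iter):
        w = A @ w
        w = w / w.sum()
    return w

# Consistent comparison matrix: item 1 is twice item 2 and four times item 3.
A = np.array([[1.0, 2.0, 4.0],
              [0.5, 1.0, 2.0],
              [0.25, 0.5, 1.0]])
print(priority_vector(A).round(4))  # weights proportional to 4 : 2 : 1
```

In ANP these eigenvectors fill the columns of the super matrix; when the comparison matrix is inconsistent, the fuzzy representation proposed in the abstract expresses the uncertainty in these weights.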
We consider a parallel profile model for several groups when the data have two-step monotone missing observations. For two-step monotone missing data, Anderson and Olkin (1985) obtained the MLEs of the mean vector and covariance matrix for the one sample problem. In the same way, the MLEs for the two sample problem have been obtained (see, e.g., Shutoh, Hyodo and Seo (2011)). Profile analysis of several groups was discussed by Srivastava (1987). In this paper, we construct a test statistic for the parallelism hypothesis based on the likelihood ratio with two-step monotone missing data. Finally, in order to investigate the accuracy of the null distribution of the proposed statistic, we perform Monte Carlo simulations for selected parameter values.
The EM algorithm is a parameter estimation method for missing data. Srivastava (1985) derived likelihood equations and a likelihood ratio test without conditions on the missing patterns. Srivastava and Carter (1986) proposed a numerical solution of the likelihood equations by the Newton-Raphson method. We propose an estimator of the kurtosis parameter for missing data, without conditions on the missing patterns, in an elliptical population. In order to evaluate the accuracy of the kurtosis parameter estimator, numerical results from Monte Carlo simulations for selected parameter values are presented. We confirm that it is better to utilize samples that include missing data than to discard them.
In clinical studies, correlated binary response data are frequently collected. Although various methods for the analysis of correlated data have been proposed, their evaluation is insufficient for binary responses under various missing mechanisms. We therefore investigated the performance of six statistical methods (last observation carried forward (LOCF), complete case analysis (CC), conventional generalized estimating equations (GEE), weighted GEE (WGEE), multiple imputation (MI), and generalized linear mixed-effects models (GLMM)) for correlated binary responses with missing data. The continuous variables underlying the binary responses were used to impute missing values in MI and to calculate the weights for WGEE. The evaluation used actual data from a clinical study that compared two antidepressants.
EM algorithms for maximum likelihood factor analysis were proposed by Rubin and Thayer (1982). In this paper, it is proved that their algorithms always produce proper solutions, with positive unique variances and factor correlations whose absolute values do not exceed one.