Estimating Individual Preferences in the Music CD Market

This research proposes a model to analyze individual customer preferences using purchase records such as Point-of-Sales (POS) data. To some extent, we can identify the interests of customers from their demographics. Consumers are, however, essentially heterogeneous. It is difficult to determine individual customer behavior in detail through aggregate-level estimation. In this paper, we use a Markov chain Monte Carlo (MCMC) method to construct a hierarchical model for tackling this problem. The model encompasses both “commonality” and “heterogeneity.” We apply this MCMC method to the music CD market, where customers have some commonalities although they are heterogeneous. This empirical analysis shows that a hierarchical Bayes (HB) model has a high predictive performance as compared to the naïve forecasting and aggregate-level models.


Introduction
In classical economics, the individual heterogeneity of preferences is treated as an "error term" or a "disturbance." In marketing, on the other hand, it is assumed that consumers are heterogeneous. The recent developments in information technology have facilitated the collection of Point-of-Sales (POS) data along with customer profiles, and firms can now record the individual time-series purchase data of their customers. At the same time, this has led to a need for new methods to analyze the vast amounts of POS data. Although firms collect massive amounts of data, ironically, a problem they often face is that the purchase records of any single individual are too few for analysis. For example, an ordinary household does not purchase detergents numerous times in a year. When we construct a model using classical regression, we need to aggregate the data because of the scarcity of individual samples. However, when using this method, the customer identification information of the data gets ignored.
In cases where it is not possible to estimate the individual parameters of the model, Bayesian estimation using a Markov chain Monte Carlo (MCMC) method can be utilized to obtain parameters. In an MCMC method, we suppose that all parameters follow probability distributions and derive estimates from the random numbers generated from these distributions. Using an MCMC method, we can easily expand models and construct complex models that are difficult to estimate through the maximum likelihood method. There are many existing applications in the field of marketing (e.g., Abe, in press; Rossi and Allenby, 2003).
In this paper, we analyze a vast amount of data using an MCMC method and investigate the process for the application of hierarchical Bayes (HB) models for database analysis.
Data and modeling procedure
In previous research involving hierarchical modeling, models using discrete choice methods have been used. These models cannot be applied to this case without modification. Therefore, we adopt a dimension reduction method that is used for perceptual mapping (e.g., Churchill, 1999) and joint space maps (e.g., DeSarbo and Wu, 2001; DeSarbo, Ramaswamy, Wedel, and Bijmolt, 1996). Using this method, we assign each artist multidimensional continuous attribute values in order to deal with the third problem.

Model construction
Purchase records are collected over a period of two years. We use the first half of the data (the first year) as a learning period to estimate the parameters of the models, and the latter half (the second year) as a validation period to assess the predictive performance of the models. In other words, we predict the purchases of the coming year using the history of the preceding one.

Objective customers and artists
The total number of purchases recorded in the first year is 605,593 and the number of artists whose music is purchased in this period is 8,545. However, the top 500 artists account for 81% of the total sales.
If the sales figures of an artist are too low, we cannot ensure good accuracy in estimating the artist's scores. Therefore, we use the sales of the top 500 artists for estimation.
In the first year, 161,805 customers purchased the music CDs of these top 500 artists. However, most of the customers make purchases only once or twice (74,276 customers buy once and 32,410 customers buy twice). In this paper, we only consider the 55,119 customers who purchased CDs three or more times in our analysis. In addition, later in this paper, we will discuss the follow-up estimation for the excluded samples.
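The selection rule above can be sketched in a few lines. This is only an illustration, assuming that purchase records are available as (customer, artist) pairs; the function name `select_customers`, the parameter `min_purchases`, and the toy records are hypothetical and not part of the original analysis. The last line checks that the reported counts are mutually consistent.

```python
from collections import Counter

def select_customers(records, min_purchases=3):
    """Return ids of customers with at least `min_purchases` purchase records."""
    counts = Counter(customer_id for customer_id, _artist_id in records)
    return {cid for cid, n in counts.items() if n >= min_purchases}

# Toy purchase log (hypothetical data): only customer "c1" buys three CDs.
records = [("c1", "a1"), ("c1", "a2"), ("c1", "a1"),
           ("c2", "a1"), ("c3", "a2"), ("c3", "a3")]
selected = select_customers(records)

# Consistency check on the counts reported in the text:
# all buyers minus one-time buyers minus two-time buyers.
remaining = 161_805 - 74_276 - 32_410
```

With the figures in the text, `remaining` equals the 55,119 customers retained for the analysis.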

Data division
We divide the above-mentioned 55,119 customers into two datasets, namely, A and B. We derive artist attributes using dataset A and construct models to estimate individual preferences using dataset B. This procedure aims to avoid the circularity that would arise if the artist attributes used for the dependent and explanatory variables were derived from the same data source. We separate datasets A and B completely and treat artist attributes as an exogenous variable.

Derivation of artist attributes
In this section, we describe the process of the derivation of artist attributes. From dataset A, attributes characterizing artists are extracted and each artist is rated according to these attributes.
At first, we prepare a matrix of size 27,559 × 500. The (j, k) element of the matrix contains the number of artist k's CDs purchased by customer j. The rows of this matrix signify customers' co-purchases. By assuming that the artists whose music CDs are purchased by a particular customer have similar attributes, we can obtain artist attributes reflecting customer preferences from this co-purchase matrix.
Artist attributes are obtained from this co-purchase matrix by reducing dimensions. Although reducing data gives rise to some errors, it is difficult to use a co-purchase matrix to obtain attributes without modification. We use factor analysis (with maximum likelihood estimation and varimax rotation; see Harman, 1976) and obtain 11 factors whose eigenvalues exceed 2.
In general, dimension reduction methods such as factor analysis are used for mapping in order to compare the relative position of each brand or product. Although we do not use mapping in this paper, we adopt a similar technique, namely factor analysis, to treat reduced variables as brand and product value attributes.
We denote the factor loadings of artist $k$ as $f_k$.
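The construction of the co-purchase matrix and the eigenvalue criterion can be illustrated as follows. This is a rough sketch under stated assumptions, not the estimation used in the paper: it counts eigenvalues of the artist correlation matrix with NumPy instead of running maximum likelihood factor analysis with varimax rotation, and the function names and toy records are hypothetical.

```python
import numpy as np

def copurchase_matrix(records, customers, artists):
    """Count matrix with customers in rows and artists in columns."""
    row = {c: i for i, c in enumerate(customers)}
    col = {a: k for k, a in enumerate(artists)}
    m = np.zeros((len(customers), len(artists)))
    for c, a in records:
        m[row[c], col[a]] += 1          # one more CD of artist a bought by c
    return m

def count_large_eigenvalues(matrix, threshold):
    """Number of eigenvalues of the artist correlation matrix above threshold.

    Assumption: every artist column has non-zero variance; otherwise the
    correlation matrix is undefined.
    """
    corr = np.corrcoef(matrix, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)   # symmetric matrix, real eigenvalues
    return int((eigvals > threshold).sum())

# Toy data (hypothetical): a1 and a2 are bought together, a3 by other people.
customers = ["c1", "c2", "c3", "c4"]
artists = ["a1", "a2", "a3"]
records = [("c1", "a1"), ("c1", "a1"), ("c1", "a2"),
           ("c2", "a1"), ("c2", "a2"),
           ("c3", "a3"), ("c4", "a3"), ("c4", "a3")]
m = copurchase_matrix(records, customers, artists)
n_factors = count_large_eigenvalues(m, threshold=2.0)
```

In the toy data a single dominant eigenvalue separates the a1/a2 buyers from the a3 buyers; on the real 27,559 × 500 matrix the paper reports 11 factors under its own estimation method.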
Figure 1 shows attributes of representative artists.

Construction of the hierarchical Bayes (HB) Model
When an individual $n$ chooses a CD by artist $k$ as his/her $t$-th purchase, let $x_{nt} = f_k$, where $x_{nt}$ is a $J$-dimensional continuous variable. Assuming that $x_{nt}$ is a revealed variable of his/her latent preference $y_n$ with error term $\varepsilon_{nt}$, we get

$$x_{nt} = y_n + \varepsilon_{nt}, \qquad \varepsilon_{nt} \sim N(0, \Sigma_n),$$

where $\Sigma_n$ denotes the covariance of the error term of individual $n$. We assume that there is individual heterogeneity of variance. It is appropriate to make this assumption since some people may exhibit variety-seeking behavior, while others may show loyalty to a particular artist or artists.
In addition, since individual purchase samples are scarce, we supplement our information using the demographic variables $r_n$ and their parameters $Q$. We can describe this relation as follows:

$$y_n = Q r_n + \eta_n, \qquad \eta_n \sim N(0, \Gamma),$$

where $\eta_n$ denotes the error term of the second layer. $r_n$ contains (1) the number of purchases in the learning period ($T_n$), (2) age, and (3) gender (male = 0, female = 1). We exclude individuals whose gender information is missing. The percentage of exclusion is less than 0.1%.
We can obtain a model by assembling the above two equations as follows:

$$\begin{aligned} x_{nt} &= y_n + \varepsilon_{nt}, & \varepsilon_{nt} &\sim N(0, \Sigma_n), \\ y_n &= Q r_n + \eta_n, & \eta_n &\sim N(0, \Gamma). \end{aligned}$$

Customer preference $y_n$ is a vector variable that supplements individual purchase behavior with the aggregate tendency captured by demographic variables.
We estimate the parameters of this hierarchical model using an MCMC method. For more information on prior distributions and posterior distributions, see the Appendix.

Sampling
We choose 5,000 customers from dataset B, which consists of 27,559 customers (N = 5,000), in order to estimate parameters. However, we can subsequently estimate the preferences of the excluded customers by using these parameters. In the simulation, we discard the first 1,000 iterations as burn-in and then save 10,000 samples.
Figure 2 shows the modeling procedures at this point.
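To make the estimation step more concrete, the following is a deliberately simplified Gibbs sampler for a two-layer model of this kind. It is a sketch under strong assumptions and not the sampler used in the paper: the first-layer variance `sigma2` and the second-layer variance `tau2` are treated as known scalars common to all customers (the paper allows individual heterogeneity of variance and estimates the second-layer covariance), and a flat prior is placed on Q. The function name `gibbs_hb` and all argument names are hypothetical.

```python
import numpy as np

def gibbs_hb(x_sum, t, r, sigma2, tau2, n_burn, n_keep, seed=0):
    """Simplified Gibbs sampler for x_nt = y_n + eps, y_n = Q r_n + eta.

    Assumptions (not from the paper): eps ~ N(0, sigma2 * I) and
    eta ~ N(0, tau2 * I) with known variances, flat prior on Q.

    x_sum : (N, J) sum of the attributes of the artists bought by each customer
    t     : (N,)   number of purchases T_n
    r     : (N, D) demographic variables
    Returns the posterior means of y (N, J) and Q (J, D).
    """
    rng = np.random.default_rng(seed)
    n, j = x_sum.shape
    d = r.shape[1]
    rtr_inv = np.linalg.inv(r.T @ r)
    chol = np.linalg.cholesky(rtr_inv)
    q = np.zeros((j, d))
    y_mean = np.zeros((n, j))
    q_mean = np.zeros((j, d))
    for it in range(n_burn + n_keep):
        # y_n | Q, x: normal, combining the purchases with the second layer.
        prec = t / sigma2 + 1.0 / tau2
        mean = (x_sum / sigma2 + (r @ q.T) / tau2) / prec[:, None]
        y = mean + rng.standard_normal((n, j)) / np.sqrt(prec)[:, None]
        # Q | y: with a flat prior, normal around the least-squares estimate.
        b_hat = rtr_inv @ r.T @ y
        b = b_hat + np.sqrt(tau2) * chol @ rng.standard_normal((d, j))
        q = b.T
        # Accumulate posterior means only after the burn-in iterations.
        if it >= n_burn:
            y_mean += y / n_keep
            q_mean += q / n_keep
    return y_mean, q_mean
```

Calling `gibbs_hb(x_sum, t, r, sigma2, tau2, n_burn=1000, n_keep=10000)` would mirror the burn-in and sample counts reported above. Each customer's draw of $y_n$ is pulled toward $Q r_n$ more strongly when $T_n$ is small, which is how the second layer compensates for scarce individual data.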

Comparison models
We also estimate the following two models for comparison: a naïve forecasting model and an aggregate model.
Figure 3 is the cumulative gain chart (Berry and Linoff, 1997, 2000), which arranges customers in ascending order of their squared differences on the horizontal axis and plots the cumulative sales rate on the vertical axis. In naïve forecasting, customers who purchase CDs the same number of times are arranged in a random order. Since the curves of all three models lie above the center line, these models possess predictive power to a certain degree. The naïve forecasting model shows good performance on the left side. This means that the customers who purchase the music CDs of "ASIAN KUNG-FU GENERATION" in the first year also purchase its CDs in the next year. However, the naïve forecasting model cannot predict customers who do not purchase the band's CDs in the first year.
Unfortunately, this category of people forms the majority. We obtain the predictive performance for the other artists in the same way and calculate the area between the cumulative gain chart and the center line. Let this area be $S$. This indicator may take a negative value if the prediction is worse than random guessing.
We obtain $S$ for all artists. Since this indicator shows the relative predictive performance of a model, we compare these models by rank. We compare the $S$ of the three models and rank these models for each artist. When $S$ takes a negative value, we classify the artist as "N/A (not applicable)." Furthermore, we exclude artists whose CDs are not purchased in the validation period. Table 2 shows the comparison of the HB model and the naïve forecasting model. This table also indicates that the performance of the HB model is better than that of the naïve forecasting model.
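The indicator $S$ can be computed along the following lines. This is a minimal sketch, assuming that the cumulative gain curve is drawn with customers on the horizontal axis in equal steps and integrated with the trapezoid rule; the function name `gain_area` and the toy numbers are hypothetical.

```python
def gain_area(squared_diffs, purchases):
    """Area S between the cumulative gain curve and the center (diagonal) line.

    squared_diffs : predicted squared difference per customer (smaller = more likely)
    purchases     : purchases of the artist per customer in the validation period
    Assumes at least one purchase in total, as artists without purchases
    in the validation period are excluded.
    """
    order = sorted(range(len(squared_diffs)), key=lambda i: squared_diffs[i])
    total = sum(purchases)
    n = len(order)
    area = 0.0
    cum_prev = 0.0
    cum = 0.0
    for i in order:
        cum += purchases[i] / total            # cumulative sales rate
        area += 0.5 * (cum_prev + cum) / n     # trapezoid of width 1/n
        cum_prev = cum
    return area - 0.5                          # subtract area under the diagonal

# Toy example (hypothetical): the two closest customers make all purchases.
s_good = gain_area([0.1, 0.2, 0.3, 0.4], [2, 1, 0, 0])
# Reversed ranking: the prediction is worse than random, so S is negative.
s_bad = gain_area([0.4, 0.3, 0.2, 0.1], [2, 1, 0, 0])
```

A ranking that places the actual buyers first gives a positive $S$, and the reversed ranking gives a negative $S$ of the same size, which corresponds to the "N/A" classification in the text.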
Table 3 shows the comparison of the predictive performances for artists that were not purchased during the first year (learning period). With the naïve forecasting model, the rule becomes equivalent to random guessing. Note that the performance of the naïve forecasting model deteriorates severely, whereas that of the HB model stays the same.
As described above, the HB model can maintain a better quality of forecasting among prospects.

Application of second layer
Parameter $Q$ of the second layer of the HB model is a matrix whose size is the dimension of factors (dependent variables) × the number of demographic variables. The $(j, d)$ element of the parameter shows the impact of the $d$-th explanatory variable on the $j$-th factor.
Table 4 shows the sample mean of parameter $Q$. It allows us to observe which variables affect which particular factors. In this table, "**" denotes significance at the 99% level. We calculate these indicators from the MCMC samples. Significance at the 99% level implies that the 99% highest posterior density interval (HPD/HPDI) does not include 0. "*" indicates the 95% level and "・" the 90% level. From these results, we can view the aggregate-level tendency of customer preferences. Further, using this information, we can plan actions aimed at prospects.
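The significance markers can be derived from the MCMC samples roughly as follows. This is a sketch of one common way to compute a highest posterior density interval, namely the shortest interval containing the required share of the sorted samples; it assumes a unimodal posterior, and the function names are hypothetical.

```python
import math

def hpd_interval(samples, mass=0.99):
    """Shortest interval containing a share `mass` of the MCMC samples."""
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))            # number of samples inside
    widths = [s[i + k - 1] - s[i] for i in range(n - k + 1)]
    i = widths.index(min(widths))              # first shortest window
    return s[i], s[i + k - 1]

def excludes_zero(samples, mass=0.99):
    """True if the HPD interval lies entirely on one side of 0."""
    lo, hi = hpd_interval(samples, mass)
    return lo > 0 or hi < 0
```

For one element of $Q$, `excludes_zero(samples, 0.99)` would correspond to "**", and the same check with 0.95 and 0.90 to the weaker markers.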
Q and Γ can also be applied to customers

Promotion of new artists and targeting customers
In cases of target segments that are already defined, we can use the information on artist attributes and customer preferences to identify target customers.
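For example, we can define target customers who like "Ayumi Hamasaki" and "Mika Nakashima" in the following manner.
Let the new artist's attributes, $f_{\text{New Artist}}$, be the average of the attributes of these artists:

$$f_{\text{New Artist}} = \frac{1}{2}\left(f_{\text{Ayumi Hamasaki}} + f_{\text{Mika Nakashima}}\right).$$

Thus, we can obtain a customer set whose preferences approximate the new artist's attributes and promote the new artist to these customers.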
However, it should be noted that this procedure may be futile when averaging too many artists.
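The targeting step can be sketched as follows, assuming that estimated preferences are stored per customer and that closeness is measured by the squared difference used elsewhere in the paper. All names and the two-dimensional attribute values are hypothetical and chosen only for illustration.

```python
def average_attributes(vectors):
    """Element-wise mean of the attribute vectors of the reference artists."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def target_customers(preferences, new_attr, top_n):
    """Customers whose estimated preferences are closest to the new artist."""
    ranked = sorted(preferences,
                    key=lambda cid: squared_distance(preferences[cid], new_attr))
    return ranked[:top_n]

# Hypothetical attribute vectors of two reference artists.
new_attr = average_attributes([[1.0, 0.0], [0.0, 1.0]])
# Hypothetical estimated preferences of three customers.
preferences = {"c1": [0.5, 0.4], "c2": [-1.0, -1.0], "c3": [0.6, 0.6]}
targets = target_customers(preferences, new_attr, top_n=2)
```

The caveat in the text is visible here: averaging many dissimilar artists pulls `new_attr` toward the center of the attribute space, where it no longer describes any of them well.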

Applications for recommendation systems
In the music CD market, the top 500 artists account for about 80% of the total sales. However, it is unlikely that ordinary customers know all these 500 artists. It is possible that there are some unknown artists that approximate their preference. Recommendations are especially effective in stimulating these latent needs.
When firms estimate the preferences of their customers, they can rank artists for each customer and develop preference information into recommendation systems. This system is not of the type that recommends a product whose characteristics correspond to those of another, for example, recommending product B to customers who purchased product A. Instead, this model also considers customers' purchase records in the past year and complements this information using

Conclusion
It is difficult to develop a model that considers both the aggregate tendency of demographics and the heterogeneity of customers using the classical framework of econometric methods. Furthermore, while POS data records a vast amount of data, many researchers face the problem of scarcity of data when they try to analyze the data on an individual basis.
The HB model is a breakthrough model as it deals with both the "heterogeneity" and "commonality" of customers.
The useful feature of the model is that it has predictive power for purchases by prospective customers (customers who do not purchase in the first year), which cannot be predicted by the naïve forecasting model. The naïve forecasting model alone offers no effective technique for taking a large number of prospects into account. However, the HB model can estimate the latent needs of both prospective and new customers.
When we use condensed attributes for dependent variables, as the result shows, the model has a high predictive performance. Given the fact that it is difficult to use non-condensed data for modeling, this method is more practicable. However, we need to compare this method with other rotation and estimation methods of factor analysis, and with other dimension reduction methods such as principal component analysis, fuzzy clustering, and MDS. Moreover, we have to determine a method of modeling using non-condensed data.
In addition, we need to expand the model to contain a time series variation of artist attributes and customer preferences. Although the proposed model assumes that these variables are stable over a span of two years, it is possible for them to change over time.
Hence, it is desirable to relax the assumption of stability when analyzing longer-term data.

Appendix. Posterior distributions and generation of random numbers
A.1. Prior distributions, likelihood functions, and posterior distributions
Detailed derivations of posterior distributions are found in Rossi and Allenby (2003) and Koop (2003).
Hereafter, $N$ is the number of customers; $T_n$, the number of purchases of the $n$-th customer; $D$, the dimension of the explanatory variable $r$; $J$, the dimension of $y_n$. Further, $I_K$ is an identity matrix of size $K \times K$, and $O_{M \times N}$ is a zero matrix of size $M \times N$.
We stack $y_n$ and $r_n$ to form the matrices $Y$ and $R$.

Figure 1. Attributes of representative artists

The detailed explanations of posterior distributions can be found in the Appendix.
Naïve forecasting model: This model predicts that the artist whom the customer purchased most frequently during the first year will be purchased again by him/her in the second year. However, this forecasting model cannot rank people who purchase artist $k$'s CDs the same number of times or those who do not purchase them at all.
Aggregate model: An aggregate model estimates $y$ directly from the second layer of the HB model using ordinary least squares (OLS). This model cannot estimate parameters for individuals; it distinguishes customers only on the basis of their demographic variables (number of purchases, age, and gender). Since the same parameters $Q$ and $\Gamma$ are common across customers, customers with the same demographic profile have the same preferences.
In this section, we compare the predictive performances of the HB and comparison models. Initially, we determine the customers who have a high likelihood of purchasing the music CDs of the rock band "ASIAN KUNG-FU GENERATION" in the validation period (the latter year of data collection). We use the mean of the samples of $y_n$ as the predictive values of the HB model:

$$\hat{y}_n = \frac{1}{M} \sum_{m=1}^{M} y_n^{(m)},$$

where $M$ is the number of samples (in this paper, $M$ = 10,000). The value $m$ in the bracket on the top-right of $y_n^{(m)}$ denotes the value obtained at the $m$-th sampling. $\hat{y}_n$ is a vector variable that is of the same size as the artist attributes. The closer it is to an artist's attributes, the higher is the likelihood of purchasing the artist's CD. Next, we obtain the squared differences between the attributes of ASIAN KUNG-FU GENERATION and individual preferences. We can derive $\hat{e}_{nk}$, the squared difference between customer $n$ and artist $k$, as

$$\hat{e}_{nk} = (\hat{y}_n - f_k)'(\hat{y}_n - f_k).$$

For the aggregate model, we substitute the OLS estimator into the equation and calculate the squared differences.
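The prediction step described here, averaging the saved samples and measuring the squared difference to each artist, can be sketched with NumPy as follows. The array layout (samples × customers × dimensions) and the function name `squared_differences` are assumptions made for illustration.

```python
import numpy as np

def squared_differences(y_samples, f):
    """Squared differences between posterior-mean preferences and artists.

    y_samples : (M, N, J) saved MCMC draws of the preferences y_n
    f         : (K, J)    artist attributes
    Returns an (N, K) array whose (n, k) element is the squared difference
    between customer n and artist k.
    """
    y_hat = y_samples.mean(axis=0)              # posterior mean, shape (N, J)
    diff = y_hat[:, None, :] - f[None, :, :]    # shape (N, K, J)
    return (diff ** 2).sum(axis=2)

# Toy input (hypothetical): 2 draws, 2 customers, 2 attribute dimensions.
y_samples = np.array([[[0.0, 0.0], [1.0, 1.0]],
                      [[2.0, 0.0], [1.0, 3.0]]])
f = np.array([[1.0, 0.0], [0.0, 0.0]])
e = squared_differences(y_samples, f)
# Customers ordered from most to least likely buyer of artist 0.
ranking = np.argsort(e[:, 0])
```

Sorting a column of `e` in ascending order gives the customer ordering used on the horizontal axis of the cumulative gain chart.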

Figure 3. Cumulative gain chart

Table 1 shows the summations that indicate the ranks of the predictive performances for all the artists. The HB model shows the best performance for 230 artists, while the naïve forecasting model shows the best performance for 159 artists. The HB model shows "N/A" for 83 artists and this figure is the lowest among the three models. On the whole, it can be concluded that the HB model shows the best performance, followed by the naïve forecasting model. Since the aggregate model discriminates individual heterogeneity only on the basis of demographic variables, it cannot describe the diversity of the music CD market.

Table 1. Rank of predictive performance

Table 2. Comparison of HB Model and naïve forecasting model

Table 3. Comparison of HB Model and naïve forecasting model

Table 4. Parameters Q