Metodološki zvezki, Vol. 3, No. 1, 2006, 147-162

Latent Class Analysis Identification of Syndromes in Alzheimer's Disease: A Bayesian Approach

Cathal D. Walsh¹

¹ Visiting Research Fellow, Mathematics and Statistics, QUT, Gardens Point, Brisbane, QLD 4001, Australia; Cathal.Walsh@tcd.ie

Abstract

Latent variable models have been used extensively in the social sciences. In this work a latent class analysis is used to identify syndromes within Alzheimer's disease. The model is fitted in a Bayesian framework, and this is examined in detail here. In particular, the label switching problem is identified and solutions are presented. Graphical summaries of the posterior distribution are included.

1 Introduction

Latent Class Analysis (LCA) is the use of a discrete latent variable to model a situation where there are a number of categorical response variables of interest. These models have been used extensively in the social sciences to model heterogeneity of manifest responses in a multivariate sense. Many examples, together with guidance on practical fitting strategies, may be found in Hagenaars and McCutcheon (2002). Examples of the use of LCA in the context of medical diagnosis date back to Young (1983). The methods used in this paper draw on Bayesian strategies for fitting these models, an overview of which can be found in Garrett and Zeger (2000).

2 Model

The model for LCA can be described in terms of manifest variables x and a latent categorical variable z. In this case, interest centres on manifest variables consisting of a number of binary indicators for each individual, namely the presence or absence of a particular disease symptom.

Let x_j be the response vector of individual j, taken from a sample of J individuals. Then x_{ij} is the presence or absence of symptom i ∈ 1,...,I. Let π_{ik} be the probability of a positive response on variable i for a person in class z_j = k, k ∈ 1,...,K, and let η_k be the probability that a randomly chosen individual is in class k.

The conditional distribution of each x_{ij} is Bernoulli:

    f(x_{ij} | z_j = k, π_{ik}) = π_{ik}^{x_{ij}} (1 − π_{ik})^{1 − x_{ij}}.

Given the class z_j = k of the jth individual, conditional independence of the items yields

    f(x_j | z_j = k) = ∏_{i=1}^{I} π_{ik}^{x_{ij}} (1 − π_{ik})^{1 − x_{ij}}.

With K classes, the mixture becomes

    f(x_j) = ∑_{k=1}^{K} η_k ∏_{i=1}^{I} π_{ik}^{x_{ij}} (1 − π_{ik})^{1 − x_{ij}}.

The posterior probability that an individual with response x_j belongs to class k is

    h(z_j = k | x_j) = η_k ∏_{i=1}^{I} π_{ik}^{x_{ij}} (1 − π_{ik})^{1 − x_{ij}} / f(x_j).

Thus, conditioning on the unobservable class yields a straightforward finite mixture of Bernoulli random variables. This class variable is unknown, and some effort is spent on identifying it for each individual.
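To make this concrete, the membership probabilities above can be evaluated directly once η and π are given. The following R sketch (R is the package used for additional processing in Section 5; the function and argument names here are illustrative, not part of the original analysis) computes h(z_j = k | x_j) for a matrix of binary responses, working on the log scale for numerical stability.

    # Posterior class membership probabilities h(z_j = k | x_j) for a
    # latent class model (finite mixture of Bernoullis).
    #   x   : J x I binary matrix of responses (rows are individuals)
    #   eta : length-K vector of class prevalences, summing to one
    #   pi  : I x K matrix with pi[i, k] = P(x_ij = 1 | z_j = k)
    #         (the argument pi shadows R's built-in constant, harmlessly)
    # Returns a J x K matrix whose (j, k) entry is h(z_j = k | x_j).
    posterior.class <- function(x, eta, pi) {
      J <- nrow(x)
      K <- length(eta)
      h <- matrix(0, J, K)
      for (k in 1:K) {
        # log eta_k + log f(x_j | z_j = k), summing over the I items
        h[, k] <- log(eta[k]) + x %*% log(pi[, k]) + (1 - x) %*% log(1 - pi[, k])
      }
      h <- exp(h - apply(h, 1, max))  # subtract row maxima before exponentiating
      h / rowSums(h)                  # normalise each row to sum to one
    }

For the two-class simulation parameters used later in Section 5.1, for example, posterior.class(x, c(0.2, 0.8), pi) returns, for each individual, the probability of membership of each class.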
3 Fitting

Fitting the latent class model involves standard techniques for dealing with missing labels. In the likelihood framework, for example, the EM algorithm (Dempster et al., 1977) is used. In a Bayesian context, the missing labels are treated as parameters to be estimated jointly, and samples from the corresponding posterior can be obtained using MCMC.

3.1 EM algorithm

In the maximum likelihood setting the label information is treated as unknown, and it is completed before the parameters are estimated. Since this completion step is carried out with uncertain estimates of the parameters, it is repeated with the new estimates in an iterative fashion.

This is the most common method of fitting these models, and care must be taken to ensure that the algorithm does not stop at a local maximum. This is done by using multiple restarts from different initial conditions.

The algorithm used to obtain point estimates of the parameters proceeds as follows (an R sketch of these steps is given at the end of this section):

1. Choose an initial set of posterior probabilities h(z_j = k | x_j).
2. Obtain a first approximation to η̂_k and π̂_{ik}.
3. Substitute these estimates into the expression for h(z_j = k | x_j) to get improved estimates.
4. Return to step 2 to get new approximations for η̂_k and π̂_{ik}.

This algorithm proceeds quickly, and there is very little computational overhead involved. Great care must, however, be taken when examining standard errors and goodness of fit statistics in the context of sparse data. Since the number of possible response patterns, 2^I, is large, sparseness is a concern even where data on many hundreds of individuals have been obtained. Solutions to these problems include using bootstrap samples or lower order marginals for goodness of fit.
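As a concrete illustration of the four steps above, a minimal EM implementation in R follows. It reuses the posterior.class() helper sketched in Section 2; the function name em.lca and the stopping rule are illustrative assumptions, and this is not the LATCLASS routine used in Section 5. The multiple restarts recommended above amount to calling this function repeatedly from different random initialisations and keeping the fit with the highest likelihood.

    # A minimal EM fit of the latent class model, following steps 1-4 above.
    em.lca <- function(x, K, max.iter = 500, tol = 1e-8) {
      J <- nrow(x)
      # Step 1: a random initial set of posterior membership probabilities
      h <- matrix(runif(J * K), J, K)
      h <- h / rowSums(h)
      for (it in 1:max.iter) {
        # Step 2: first approximations to eta and pi from the memberships
        eta <- colMeans(h)
        pi <- t(x) %*% h / matrix(colSums(h), ncol(x), K, byrow = TRUE)
        pi <- pmin(pmax(pi, 1e-6), 1 - 1e-6)  # keep estimates off the boundary
        # Step 3: substitute the estimates to get improved memberships
        h.new <- posterior.class(x, eta, pi)
        # Step 4: return to step 2 until the memberships stabilise
        if (max(abs(h.new - h)) < tol) {
          h <- h.new
          break
        }
        h <- h.new
      }
      list(eta = eta, pi = pi, posterior = h)
    }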
3.2 Bayesian

An alternative method of fitting these models is to use a fully Bayesian specification. This requires the model for the data, together with priors for the relevant parameters. The model is as specified in Section 2.

The priors can be elicited from specialists within the area of application. Alternatively, sensibly vague priors can be placed on the parameters. For example, a Dirichlet prior with equal weights on η would be considered 'flat' in the sense one would expect.

When fitting the model using MCMC, a sequence of realisations of the parameters is available at each step, and derived summaries may usefully be presented when examining model fit and interpretability. An outline of some diagnostics which are of practical use is given in Garrett and Zeger (2000).

One of the referees emphasised that analogous methods can be used to explore the likelihood without using a fully Bayesian model. In an MLE framework, similar summaries may be constructed. The emphasis here, however, is on how the Bayesian fitting proceeds.
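In the analysis of Section 5 the Bayesian model is fitted with WinBUGS. As a hedged illustration of the mechanics, the R sketch below implements one possible data-augmentation Gibbs scheme for this model, assuming a flat Dirichlet(1,...,1) prior on η and independent Beta(1,1) priors on each π_{ik}; these prior choices, and all names, are assumptions made for the example rather than the specification used in the original analysis.

    # A data-augmentation Gibbs sampler for the latent class model.
    # Assumed priors: eta ~ Dirichlet(1, ..., 1); pi[i, k] ~ Beta(1, 1).
    gibbs.lca <- function(x, K, n.iter = 5000) {
      J <- nrow(x)
      I <- ncol(x)
      rdirichlet1 <- function(a) {  # one Dirichlet draw via Gamma variates
        g <- rgamma(length(a), shape = a)
        g / sum(g)
      }
      eta <- rep(1 / K, K)
      pi <- matrix(runif(I * K), I, K)
      eta.out <- matrix(NA, n.iter, K)
      pi.out <- array(NA, c(n.iter, I, K))
      for (it in 1:n.iter) {
        # Draw the latent labels z_j from their full conditional h(z_j = k | x_j)
        h <- posterior.class(x, eta, pi)
        z <- apply(h, 1, function(p) sample(K, 1, prob = p))
        # Draw eta from its Dirichlet full conditional
        n.k <- tabulate(z, nbins = K)
        eta <- rdirichlet1(1 + n.k)
        # Draw each pi[i, k] from its Beta full conditional
        for (k in 1:K) {
          s.k <- colSums(x[z == k, , drop = FALSE])  # positives in class k
          pi[, k] <- rbeta(I, 1 + s.k, 1 + n.k[k] - s.k)
        }
        eta.out[it, ] <- eta
        pi.out[it, , ] <- pi
      }
      list(eta = eta.out, pi = pi.out)
    }

Because the labels z are redrawn at every iteration, the output of such a sampler is subject to the label switching examined in Section 5.1.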
4 Application to Alzheimer's disease

Alzheimer's disease is a degenerative condition which is affecting increasing numbers in a greying society. Largely due to improvements in health care and population based interventions, we have seen improving outcomes for those suffering from cardiac conditions and cancers. In contrast, however, no definitive cure exists for Alzheimer's disease. Research into this disease is highly multi-disciplinary, involving psychiatrists, neuropsychologists, data managers, clinical psychologists and statisticians. The data that arise are complex and, as identified by Kryscio and Schmitt (2000), more statisticians are needed in this area.

Of particular interest for this work, the clinical side of which is discussed in more detail in Moran et al. (2004), is the relationship between Behavioural and Psychiatric Symptoms of Dementia (BPSD) and the disease itself. The working research hypothesis is that subclasses, or syndromes, of the disease may exist. Further, it is supposed that these are the clinical phenotypes of the disease, which may be related to genetic factors specific to the individual. By identifying clinical phenotypes, individuals may be sampled for genetic testing in an efficient fashion, ensuring that the different syndromes are represented by the sample of individuals chosen. To this end, the probability of class membership will be a useful inferential summary.

4.1 Data description

The data in this case come from a memory clinic in St James's Hospital, Dublin. The Mercer's Institute houses the national memory clinic for Ireland and is the primary centre involved in the differential diagnosis of Alzheimer's disease.

The sample of individuals to be examined was restricted to first visit patients with mild disease. This restriction was to ensure clinical homogeneity of the sample. An examination of cases dealt with by the clinic revealed 240 first visits for individuals who were diagnosed as having probable disease according to the NINCDS-ADRDA criteria (McKhann et al., 1984), and a Clinical Dementia Rating (CDR; Berg, 1988) of 0.5 or 1.0. This restriction to mild disease was made in order to ensure that the symptoms were related to syndrome rather than severity. This strategy ensured that a known source of heterogeneity was eliminated before the analysis began.

The Behave-AD instrument (Reisberg et al., 1996) had been administered to the primary caregiver, and this produces information on the prevalence of each symptom. The Behave-AD produces scores on an ordinal scale for each of the symptoms. However, since this sample consisted of individuals in the mild stages of disease, the symptoms were each recorded as a binary variable. The symptoms of interest in this analysis were Hallucinations, Activity Disturbance, Aggression, Agitation, Diurnal Rhythm Disturbance and Affective Disorder.

4.2 Data

Since the pattern of symptoms can be described by binary variables, it is convenient to write the combinations in the form {0,1}^6, so, for example, an individual exhibiting all symptoms would be denoted 111111, whereas an individual exhibiting none would be denoted 000000. Table 1 summarises the data on all cases included.

Table 1: Data on prevalence of each symptom pattern.

    Pattern    n    Pattern    n    Pattern    n
    111111     3    011101    14    001101     2
    111011     1    011011     2    001011     1
    111001     1    011010     3    001010     1
    110101     1    011001     9    001001     4
    110011     2    011000     1    001000     2
    110001     5    010111    11    000111     3
    110000     2    010101    24    000110     1
    101001     1    010100     3    000101     9
    100101     1    010011    11    000100     3
    100001     1    010010     2    000011     6
    100000     1    010001    35    000010     1
    011111     6    010000    20    000001    25
    011110     1    001111     3    000000    18

5 Analysis

The model was fit in the maximum likelihood framework using LATCLASS (Bartholomew and Knott, 1999) and in the Bayesian framework using WinBUGS 1.4. Additional processing was carried out using R 1.8 (R Development Core Team, 2005). All these packages ran on a 3.2GHz Pentium 4 PC under Windows XP.

5.1 Label switching

A key issue which arises when sampling from the joint posterior distribution is that the label sampled for each individual is assigned anew at each step of the sampler. Since the label is a latent marker, the assignment of a particular label is unique only up to the permutation group. A simulation study makes this clear.

For this simulation, two latent groups were defined. The prevalence vector was set at η = [0.2, 0.8], and the symptoms within group 1 were given prevalence π_{11} = 0.1, π_{21} = 0.6, and in group 2 prevalence π_{12} = 0.9, π_{22} = 0.3. The total number of individuals was set at 1000.

As is good practice, 5 chains were set running from different starting values. The output of the chains for η is shown in Figure 1. The between chain variability is dominated by the distance of two of the chains from the lower three.

Figure 1: Chains for simulated data showing chains exploring different parts of the label space.

This is a cause for concern when considering whether the chains have converged in distribution. Of course, the issue here is that two of the chains have labelled individuals in one fashion whereas the other three have labelled them in the other. By rearranging the columns of the sampled matrix, this is made clear. In particular, η_1 and η_2 have been switched for the top two chains, and the resulting output is shown in Figure 2.

Figure 2: Transformed chains showing the impact of the label change.

The arbitrary nature of the labels for latent mixtures, and the difficulties this causes for Bayesian inference, is well known and is discussed, for example, in Richardson and Green (1997). There are, however, a number of strategies for dealing with this problem, some of which are discussed here.

The solution proposed in Richardson and Green (1997) is to place a constraint on the parameter space. This could be in η-space, in π-space, or some combination of the two. For illustration, the behaviour of the sampler in π-space is shown in Figure 3. It is clear from this that the chains are exploring parts of the joint space that are divided by the line of symmetry. An arbitrary ordering of the parameters, imposed either through the prior or post hoc, will break this symmetry and force identifiability.

It is noted here that the fact that the chains do not explore the whole space (in the case of the simulated data) means that they cannot, technically, have converged. Indeed, this fact is described for mixture models in Celeux et al. (2000), where the authors suggest that "almost the entirety of samplers used for mixture models has not converged."

Of course, in order for the results to be usable, what is required is samples from the posterior, modulo the permutation group. One strategy is truncation along the symmetries, as suggested by placing constraints on the parameters. An alternative is to 'gather' posterior samples together, as described by Stephens (2000). There the author suggests that a loss function can be used to 'suck' the samples in the joint posterior together. Using these ideas, an appropriate loss function for this model is represented by a product of Dirichlet and Beta densities.

Figure 3: Transformed chains for π showing the impact of the label change.

Thus the algorithm for this situation is as follows (a sketch implementation is given below):

1. The loss is calculated as −log of Dir(η | α) ∏_{i=1}^{I} ∏_{k=1}^{K} Be(π_{ik} | θ_{ik}).
2. Estimate α and θ from the initial set of samples.
3. Run through the sampled values, permuting labels to minimise the loss.
4. If any changes of labels have occurred, return to the start.

Run offline, this algorithm has taken minutes with up to 4 classes, but of the order of an hour for 5 classes and many hours for 6 classes, for 10,000 samples. The expense comes from the fact that the permutation group grows very quickly.
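One possible implementation of this relabelling sweep is sketched in R below. It is a simplification under stated assumptions: the Dirichlet factor of the loss is approximated by its Beta marginals, α and θ are estimated by matching moments, and the names perms and relabel.lca are illustrative rather than taken from the original analysis. The factorial growth of the permutation matrix returned by perms() is precisely the source of the running times quoted above.

    # All K! permutations of 1:K, as the rows of a matrix.
    perms <- function(K) {
      if (K == 1) return(matrix(1, 1, 1))
      p <- perms(K - 1)
      do.call(rbind, lapply(1:K, function(k) cbind(k, ifelse(p >= k, p + 1, p))))
    }

    # Loss-based relabelling in the spirit of Stephens (2000).
    #   eta.s : T x K matrix of sampled prevalences
    #   pi.s  : T x I x K array of sampled item probabilities
    # The loss for a draw is -log of Dir(eta | alpha) prod Be(pi_ik | theta_ik);
    # here the Dirichlet is approximated by its Beta marginals for simplicity.
    relabel.lca <- function(eta.s, pi.s, max.sweeps = 20) {
      K <- ncol(eta.s)
      P <- perms(K)  # the permutation group, whose size K! drives the cost
      beta.mom <- function(m, v) {  # moment estimates of Beta parameters
        m <- pmin(pmax(m, 1e-3), 1 - 1e-3)
        s <- pmax(m * (1 - m) / pmax(v, 1e-8) - 1, 1e-3)
        list(a = m * s, b = (1 - m) * s)
      }
      for (sweep in 1:max.sweeps) {
        # Step 2: estimate alpha and theta from the current samples
        d <- beta.mom(colMeans(eta.s), apply(eta.s, 2, var))
        th <- beta.mom(apply(pi.s, c(2, 3), mean), apply(pi.s, c(2, 3), var))
        changed <- FALSE
        # Step 3: permute the labels of each draw to minimise the loss
        for (it in 1:nrow(eta.s)) {
          loss <- apply(P, 1, function(p)
            -sum(dbeta(eta.s[it, p], d$a, d$b, log = TRUE)) -
              sum(dbeta(pi.s[it, , p], th$a, th$b, log = TRUE)))
          best <- P[which.min(loss), ]
          if (any(best != 1:K)) {
            eta.s[it, ] <- eta.s[it, best]
            pi.s[it, , ] <- pi.s[it, , best]
            changed <- TRUE
          }
        }
        # Step 4: if any labels changed, re-estimate and sweep again
        if (!changed) break
      }
      list(eta = eta.s, pi = pi.s)
    }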
In practice, the algorithm will not change the results compared with the simple constraints for the situation given in our simulated example. This is largely due to the fact that there is a large amount of information in the data in this case. The joint posterior is thus well defined, and label switching within chains is unlikely. However, for the Alzheimer's data as recorded, a difference is observable. With a smaller amount of data, the information in the likelihood is less, and thus the joint posterior is flatter. This permits the sampler to move across permutations of the labels within a single chain. Indeed, a similar problem would occur with more information if the modes were closer together, a feature of models with larger numbers of classes. Figure 4 shows the output of the sampler for η in the 3-class model.

Figure 4: Chain for η showing switching.

5.2 Parameter constraints solution

Initially, constraints are placed on the η; η_1