6.1 Preparing Data for Multivariate Data Analysis

Where multivariate methods (cluster analysis, principal components analysis, correlation matrices) are required, it is usually necessary to transform the data and treat them in a special way to avoid violating the assumptions of a method or drawing erroneous conclusions. This is because of the constant sum constraint that defines compositional data: put simply, if one element increases in proportion, an equal decrease must occur across the remaining elements that make up the composition. This is further complicated by unobserved elements, dimensionless (i.e. uncalibrated) data, and observations below the limit of detection. The package compositions deals with many of these issues, including any necessary transformation. By setting a class attribute and providing modified methods (e.g. compositions::princomp.acomp()), it ensures that the data are treated appropriately by functions like princomp() and dist().
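
The constant sum constraint can be illustrated with a minimal base R sketch. The abundances and the closure() helper below are hypothetical and are not taken from CD166_19_xrf or from the compositions package; they only show that raising one part lowers the share of every other part, even though those parts have not changed in absolute terms.

```r
# Toy raw abundances for three parts (hypothetical values)
raw <- c(Ca = 60, Fe = 30, Ti = 10)

# Closure: rescale the parts so that they sum to 1
closure <- function(x) x / sum(x)

before <- closure(raw)

# Double Ca only; Fe and Ti are unchanged in absolute terms
raw2 <- raw
raw2["Ca"] <- raw2["Ca"] * 2
after <- closure(raw2)

# Compare the shares: Ca rises, so Fe and Ti must fall
rbind(before, after)
```

Any correlation computed directly on such closed data partly reflects this forced trade-off, which is one reason the data are transformed before analysis.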

It’s worth spending some time getting familiar with the documentation for compositions, and possibly the wider literature on compositional data analysis. Several different classes are provided depending on the nature of the compositional data; in these examples we use acomp(), but others are available. The package also handles the different types of zero and missing values. For these count data, the detection limit is 1, so we code values below the limit of detection as -1 (the negative of the detection limit). This tells compositions what the detection limit is and allows zeroreplace() to deal with these values correctly.
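
As a rough sketch of that coding convention, the toy example below recodes zeros as -1 and, if compositions is installed, passes the result through acomp() and zeroreplace(). The counts data frame is invented for illustration, and the example assumes that zeroreplace() is called with its default arguments so that the detection limit is read from the negative values; check the package documentation before relying on this for real data.

```r
# Hypothetical count data: a zero means "below the detection limit of 1 count"
counts <- data.frame(Ca = c(5200, 4800, 5100),
                     P  = c(0, 3, 0),
                     Fe = c(1500, 1700, 0))

# Code each below-detection value as the negative of its detection limit
coded <- as.data.frame(lapply(counts, function(x) ifelse(x == 0, -1, x)))
coded

# Only run the package calls if compositions is available
if (requireNamespace("compositions", quietly = TRUE)) {
  comp <- compositions::acomp(coded)
  # With no detection limit supplied, zeroreplace() takes it from the
  # negative values and substitutes a fraction of it
  comp_replaced <- compositions::zeroreplace(comp)
}
```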

To trace observations back to their original data source, it is necessary to use row names that uniquely identify each observation. For single scans this is not usually an issue, because the depth or position variable can be used. However, where a dataset is a composite of multiple cores, you may find that there are multiple observations for a particular depth where cores overlap. We don’t usually use row names when working in the tidyverse style, but because we’re going to be using base functions like princomp() and dist(), they become necessary. Here we create a variable in CD166_19_xrf called uid that we will use to uniquely identify observations throughout the rest of our analysis.

CD166_19_xrf <- CD166_19_xrf %>%
  mutate(uid = paste0(label, "_", depth))

# Check for duplicated identifiers; TRUE would mean at least one uid repeats
any(duplicated(CD166_19_xrf$uid))

## [1] FALSE
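
To see why depth alone is not a safe identifier for a composite record, consider the following toy sketch in base R. The scans data frame, its labels, and its depths are invented for illustration and are not part of CD166_19_xrf.

```r
# Hypothetical composite of two cores that overlap at a depth of 11
scans <- data.frame(label = c("S1", "S1", "S2", "S2"),
                    depth = c(10, 11, 11, 12))

# Depth alone repeats where the cores overlap
depth_dup <- anyDuplicated(scans$depth) > 0
depth_dup

# Combining label and depth gives a unique key
scans$uid <- paste0(scans$label, "_", scans$depth)
uid_dup <- anyDuplicated(scans$uid) > 0
uid_dup

# Row names must be unique, so uid can be used here where depth could not
rownames(scans) <- scans$uid
```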

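CD166_19_xrf_acomp <- CD166_19_xrf %>%
  filter(qc == TRUE) %>%                        # keep observations that passed quality control
  select(any_of(c(elementsList, "uid"))) %>%    # keep the element columns and the identifier
  column_to_rownames("uid") %>%                 # move the identifier into the row names
  # code zeros as -1, i.e. below a detection limit of 1
  mutate(across(everything(), function(x){ifelse(x == 0, -1, x)})) %>%
  acomp()
head(CD166_19_xrf_acomp)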

##      Al             Si            P               S               Cl           
## S1_1 "0.0004510065" "0.001631931" "<5.934295e-06" " 5.934295e-05" "0.007821401"
## S1_2 "0.0002804828" "0.001505750" "<4.920751e-06" "<4.920751e-06" "0.007445097"
## S1_3 "0.0003343092" "0.001490838" "<4.517691e-06" " 1.445661e-04" "0.006641006"
## S1_4 "0.0001227370" "0.000972455" "<4.720655e-06" " 9.913376e-05" "0.006193500"
## S1_5 "0.0002582984" "0.001135538" "<4.873556e-06" " 1.803216e-04" "0.008479987"
## S1_6 "0.0001333485" "0.001713775" "<4.938833e-06" " 1.382873e-04" "0.007783600"
##      Ar            K            Ca          Sc              Ti           
## S1_1 "0.003584314" "0.01522147" "0.6689969" "<5.934295e-06" "0.009856865"
## S1_2 "0.002927847" "0.01293173" "0.7100004" " 6.889052e-05" "0.008886877"
## S1_3 "0.002285952" "0.01074307" "0.7361036" "<4.517691e-06" "0.009582023"
## S1_4 "0.002619964" "0.01069228" "0.7231761" "<4.720655e-06" "0.009587651"
## S1_5 "0.002880271" "0.01436237" "0.6622139" " 5.165969e-04" "0.008899113"
## S1_6 "0.002489172" "0.01570055" "0.6528297" " 6.914366e-05" "0.009497375"
##      V              Cr            Mn            Fe          Ni            
## S1_1 "3.026491e-04" "0.001531048" "0.003014622" "0.2086973" "0.0007299183"
## S1_2 "4.330261e-04" "0.001604165" "0.002750700" "0.1795779" "0.0005117581"
## S1_3 "9.938921e-05" "0.001359825" "0.002182045" "0.1624200" "0.0006053706"
## S1_4 "3.540491e-04" "0.001444520" "0.002289518" "0.1676399" "0.0006797744"
## S1_5 "4.337464e-04" "0.002144364" "0.003596684" "0.2159131" "0.0008090102"
## S1_6 "5.284551e-04" "0.001980472" "0.003911555" "0.2236452" "0.0007457637"
##      Cu             Zn            Ga              Ge              Br           
## S1_1 "0.0016437998" "0.001216531" " 2.373718e-05" "<5.934295e-06" "0.002913739"
## S1_2 "0.0011317728" "0.001318761" " 2.361961e-04" " 1.968301e-04" "0.002977055"
## S1_3 "0.0006776537" "0.000555676" "<4.517691e-06" " 1.942607e-04" "0.002484730"
## S1_4 "0.0009724550" "0.001128237" "<4.720655e-06" "<4.720655e-06" "0.002874879"
## S1_5 "0.0013451013" "0.001169653" "<4.873556e-06" "<4.873556e-06" "0.003211673"
## S1_6 "0.0012347081" "0.001452017" " 1.975533e-04" " 2.370640e-04" "0.003501632"
##      Rb             Sr           Y              Zr             Pd            
## S1_1 "0.0019523832" "0.04929026" "8.901443e-05" "0.0015132453" "0.0001839632"
## S1_2 "0.0003739771" "0.04517742" "6.889052e-04" "0.0015844819" "0.0003838186"
## S1_3 "0.0013327189" "0.04356861" "5.376053e-04" "0.0012649536" "0.0003388268"
## S1_4 "0.0014775651" "0.04692331" "5.051101e-04" "0.0022423112" "0.0003115632"
## S1_5 "0.0014815609" "0.05013427" "8.772400e-04" "0.0007797689" "0.0001510802"
## S1_6 "0.0015705488" "0.04897840" "5.235163e-04" "0.0015409158" "0.0003555959"
##      Cd             I              Cs              Ba            
## S1_1 "7.714584e-05" "0.0002255032" "<5.934295e-06" "0.0003738606"
## S1_2 "1.673055e-04" "0.0002361961" "<4.920751e-06" "0.0002361961"
## S1_3 "2.484730e-04" "0.0003614153" "<4.517691e-06" "0.0001129423"
## S1_4 "1.132957e-04" "0.0003162839" "<4.720655e-06" "0.0002832393"
## S1_5 "1.657009e-04" "0.0001120918" "<4.873556e-06" "0.0006433093"
## S1_6 "1.382873e-04" "0.0002864523" "<4.938833e-06" "0.0004593114"
##      Nd             Sm             Yb             Ta            W            
## S1_1 "0.0001008830" "2.551747e-04" "0.0008545385" "0.004682159" "0.011821117"
## S1_2 "0.0003296903" "2.263546e-04" "0.0013876519" "0.003818503" "0.009934997"
## S1_3 "0.0001987784" "1.761900e-04" "0.0007137952" "0.003175937" "0.009107666"
## S1_4 "0.0003540491" "3.021219e-04" "0.0010291028" "0.003748200" "0.010196615"
## S1_5 "0.0002046893" "1.754480e-04" "0.0007943896" "0.004478798" "0.011423614"
## S1_6 "0.0004790668" "5.926599e-05" "0.0009087452" "0.004207885" "0.011537113"
##      Pb              Bi            
## S1_1 " 4.569408e-04" "0.0004272693"
## S1_2 "<4.920751e-06" "0.0006692222"
## S1_3 " 5.872999e-05" "0.0008990206"
## S1_4 " 7.175396e-04" "0.0006325678"
## S1_5 " 2.631720e-04" "0.0007651482"
## S1_6 " 2.666970e-04" "0.0008988675"
## attr(,"class")
## [1] "acomp"

Where it is necessary to keep some (or all) of the other information, the acomp object can be re-joined to it, but only after its acomp class has been removed (here using as.data.frame()). For example:

CD166_19_xrf_acomp_meta <- full_join(CD166_19_xrf_acomp %>% 
                                       as.data.frame() %>%
                                       rownames_to_column("uid"),
                                     CD166_19_xrf %>%
                                       select(-any_of(elementsList)),
                                     by = "uid"
                                     ) %>%
  arrange(depth, label)
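
One consequence of using a full join is worth checking: observations that were dropped by the qc filter before acomp() are still present in CD166_19_xrf, so they reappear in the joined table with NA for every element. The sketch below shows this with invented data, using base R merge() with all = TRUE as a stand-in for full_join(); the meta and parts objects are hypothetical.

```r
# Hypothetical metadata: three observations, one of which failed quality control
meta <- data.frame(uid   = c("S1_1", "S1_2", "S1_3"),
                   depth = c(1, 2, 3),
                   qc    = c(TRUE, FALSE, TRUE))

# Stand-in for the acomp object converted back to a data frame;
# it holds only the rows that passed quality control
parts <- data.frame(Ca = c(0.7, 0.8),
                    Fe = c(0.3, 0.2),
                    row.names = c("S1_1", "S1_3"))
parts$uid <- rownames(parts)

# all = TRUE keeps unmatched rows from both tables, like a full join
joined <- merge(parts, meta, by = "uid", all = TRUE)

# The failed observation is retained, with NA for every element
joined[is.na(joined$Ca), ]
```

If those rows are not wanted, filtering on qc after the join, or using an inner join instead, would remove them.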