## 6.1 Preparing Data for Multivariate Data Analysis

Where multivariate methods (cluster analysis, principal components analysis, correlation matrices) are required, it is usually necessary to transform the data and treat them in a special way to avoid violating the assumptions of a method or drawing erroneous conclusions. This is because of the constant sum constraint that defines compositional data: put simply, if one element increases in proportion, an equal decrease must occur across the remaining elements that make up the composition. This is further complicated by unobserved elements, dimensionless (i.e. calibrated) data, and observations below the limit of detection. The `compositions` package deals with many of these issues, including any necessary transformation. It sets a `class` attribute on the data and provides modified versions of functions like `princomp()` and `dist()` (e.g. `compositions::princomp.acomp()`), so that compositional data are treated appropriately.
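The constant sum constraint can be seen in a small base R sketch. The counts below are made up for illustration and are not taken from `CD166_19_xrf`: only the Ca count changes between the two observations, yet once each row is closed to sum to 1 the Fe and Ti proportions fall as well.

```r
# Hypothetical raw counts for three elements in two observations.
# Only Ca differs between the rows; Fe and Ti counts are identical.
counts <- rbind(
  obs1 = c(Ca = 100, Fe = 50, Ti = 50),
  obs2 = c(Ca = 200, Fe = 50, Ti = 50)
)

# Closure: divide each row by its total so that every row sums to 1
closed <- counts / rowSums(counts)

print(round(closed, 3))
print(rowSums(closed))
```

This apparent change in Fe and Ti comes entirely from closure, which is why correlations computed directly on proportions can be misleading.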

It’s worth spending some time getting familiar with the documentation for `compositions`, and possibly the wider literature on compositional data analysis. Several different methods are provided depending on the nature of the compositional data; in these examples we use `acomp()`, but other methods are available. The package also handles the different types of zero and missing values. For these count data, we deal with values below the limit of detection by telling `compositions` what the detection limit is: here it is `1`, so such values are coded as `-1`. This allows `zeroreplace()` to deal with them correctly.
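The coding convention can be illustrated on a small made-up table without loading `compositions`. This is only a sketch of the recoding step: the data frame `toy` and its values are hypothetical, and the later handling of the flagged values is left to the package.

```r
# Hypothetical counts for three elements; a zero means nothing was detected
toy <- data.frame(
  Ca = c(1200, 980, 1100),
  P  = c(0, 3, 0),
  S  = c(12, 0, 25)
)

# Code zeros as -1: the negative sign flags "below detection limit",
# and the magnitude (1) records the detection limit itself
coded <- as.data.frame(lapply(toy, function(x) ifelse(x == 0, -1, x)))

# Count the flagged values per element
below_limit <- colSums(coded < 0)

print(coded)
print(below_limit)
```

Keeping the detection limit in the data like this means it travels with each observation, rather than having to be supplied separately later.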

In order to trace observations back to their original data source, it is necessary to use row names that uniquely identify observations. For single scans this is not usually an issue, because the `depth` or `position` variable can be used. However, where a dataset is a composite of multiple cores, you may find multiple observations for a particular depth where cores overlap. We don’t usually use row names when working in the `tidyverse` style, but because we’re going to be using base functions like `princomp()` and `dist()`, they may become necessary. Here we create a unique identifier in `CD166_19_xrf` called `uid` that we will use to identify observations throughout the rest of our analysis.

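``````# Build a unique identifier from the core label and the depth
CD166_19_xrf <- CD166_19_xrf %>%
  mutate(uid = paste0(label, "_", depth))

# Check for duplicated identifiers; this should return FALSE
any(duplicated(CD166_19_xrf$uid))``````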
``## [1] FALSE``
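To see why depth alone can fail as an identifier, consider a minimal base R sketch of two overlapping cores. The `scans` data frame, its labels and its depths are hypothetical and serve only to illustrate the check.

```r
# Hypothetical composite of two cores that overlap at 100 mm depth
scans <- data.frame(
  label = c("S1", "S1", "S2", "S2"),
  depth = c(99, 100, 100, 101)
)

# Depth alone is not unique where the cores overlap
depth_has_duplicates <- any(duplicated(scans$depth))

# Combining the core label with the depth gives one identifier per row
scans$uid <- paste0(scans$label, "_", scans$depth)
uid_has_duplicates <- any(duplicated(scans$uid))

print(depth_has_duplicates)
print(uid_has_duplicates)
```

If the second check ever reports duplicates in real data, the identifier needs a further distinguishing variable before it can be used as a row name.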
``````CD166_19_xrf_acomp <- CD166_19_xrf %>%
  filter(qc == TRUE) %>%
  select(any_of(c(elementsList, "uid"))) %>%
  column_to_rownames("uid") %>%
  # code zero counts as -1, i.e. below a detection limit of 1
  mutate(across(everything(), function(x){ifelse(x == 0, -1, x)})) %>%
  acomp()

head(CD166_19_xrf_acomp)``````
``````##      Al             Si            P               S               Cl
## S1_1 "0.0004510065" "0.001631931" "<5.934295e-06" " 5.934295e-05" "0.007821401"
## S1_2 "0.0002804828" "0.001505750" "<4.920751e-06" "<4.920751e-06" "0.007445097"
## S1_3 "0.0003343092" "0.001490838" "<4.517691e-06" " 1.445661e-04" "0.006641006"
## S1_4 "0.0001227370" "0.000972455" "<4.720655e-06" " 9.913376e-05" "0.006193500"
## S1_5 "0.0002582984" "0.001135538" "<4.873556e-06" " 1.803216e-04" "0.008479987"
## S1_6 "0.0001333485" "0.001713775" "<4.938833e-06" " 1.382873e-04" "0.007783600"
##      Ar            K            Ca          Sc              Ti
## S1_1 "0.003584314" "0.01522147" "0.6689969" "<5.934295e-06" "0.009856865"
## S1_2 "0.002927847" "0.01293173" "0.7100004" " 6.889052e-05" "0.008886877"
## S1_3 "0.002285952" "0.01074307" "0.7361036" "<4.517691e-06" "0.009582023"
## S1_4 "0.002619964" "0.01069228" "0.7231761" "<4.720655e-06" "0.009587651"
## S1_5 "0.002880271" "0.01436237" "0.6622139" " 5.165969e-04" "0.008899113"
## S1_6 "0.002489172" "0.01570055" "0.6528297" " 6.914366e-05" "0.009497375"
##      V              Cr            Mn            Fe          Ni
## S1_1 "3.026491e-04" "0.001531048" "0.003014622" "0.2086973" "0.0007299183"
## S1_2 "4.330261e-04" "0.001604165" "0.002750700" "0.1795779" "0.0005117581"
## S1_3 "9.938921e-05" "0.001359825" "0.002182045" "0.1624200" "0.0006053706"
## S1_4 "3.540491e-04" "0.001444520" "0.002289518" "0.1676399" "0.0006797744"
## S1_5 "4.337464e-04" "0.002144364" "0.003596684" "0.2159131" "0.0008090102"
## S1_6 "5.284551e-04" "0.001980472" "0.003911555" "0.2236452" "0.0007457637"
##      Cu             Zn            Ga              Ge              Br
## S1_1 "0.0016437998" "0.001216531" " 2.373718e-05" "<5.934295e-06" "0.002913739"
## S1_2 "0.0011317728" "0.001318761" " 2.361961e-04" " 1.968301e-04" "0.002977055"
## S1_3 "0.0006776537" "0.000555676" "<4.517691e-06" " 1.942607e-04" "0.002484730"
## S1_4 "0.0009724550" "0.001128237" "<4.720655e-06" "<4.720655e-06" "0.002874879"
## S1_5 "0.0013451013" "0.001169653" "<4.873556e-06" "<4.873556e-06" "0.003211673"
## S1_6 "0.0012347081" "0.001452017" " 1.975533e-04" " 2.370640e-04" "0.003501632"
##      Rb             Sr           Y              Zr             Pd
## S1_1 "0.0019523832" "0.04929026" "8.901443e-05" "0.0015132453" "0.0001839632"
## S1_2 "0.0003739771" "0.04517742" "6.889052e-04" "0.0015844819" "0.0003838186"
## S1_3 "0.0013327189" "0.04356861" "5.376053e-04" "0.0012649536" "0.0003388268"
## S1_4 "0.0014775651" "0.04692331" "5.051101e-04" "0.0022423112" "0.0003115632"
## S1_5 "0.0014815609" "0.05013427" "8.772400e-04" "0.0007797689" "0.0001510802"
## S1_6 "0.0015705488" "0.04897840" "5.235163e-04" "0.0015409158" "0.0003555959"
##      Cd             I              Cs              Ba
## S1_1 "7.714584e-05" "0.0002255032" "<5.934295e-06" "0.0003738606"
## S1_2 "1.673055e-04" "0.0002361961" "<4.920751e-06" "0.0002361961"
## S1_3 "2.484730e-04" "0.0003614153" "<4.517691e-06" "0.0001129423"
## S1_4 "1.132957e-04" "0.0003162839" "<4.720655e-06" "0.0002832393"
## S1_5 "1.657009e-04" "0.0001120918" "<4.873556e-06" "0.0006433093"
## S1_6 "1.382873e-04" "0.0002864523" "<4.938833e-06" "0.0004593114"
##      Nd             Sm             Yb             Ta            W
## S1_1 "0.0001008830" "2.551747e-04" "0.0008545385" "0.004682159" "0.011821117"
## S1_2 "0.0003296903" "2.263546e-04" "0.0013876519" "0.003818503" "0.009934997"
## S1_3 "0.0001987784" "1.761900e-04" "0.0007137952" "0.003175937" "0.009107666"
## S1_4 "0.0003540491" "3.021219e-04" "0.0010291028" "0.003748200" "0.010196615"
## S1_5 "0.0002046893" "1.754480e-04" "0.0007943896" "0.004478798" "0.011423614"
## S1_6 "0.0004790668" "5.926599e-05" "0.0009087452" "0.004207885" "0.011537113"
##      Pb              Bi
## S1_1 " 4.569408e-04" "0.0004272693"
## S1_2 "<4.920751e-06" "0.0006692222"
## S1_3 " 5.872999e-05" "0.0008990206"
## S1_4 " 7.175396e-04" "0.0006325678"
## S1_5 " 2.631720e-04" "0.0007651482"
## S1_6 " 2.666970e-04" "0.0008988675"
## attr(,"class")
## [1] "acomp"``````

Where it is necessary to keep some (or all) of the other information, it can be re-joined to the `acomp` object, but only once its `acomp` class has been removed (here by converting it to a data frame). For example:

``````CD166_19_xrf_acomp_meta <- full_join(
  CD166_19_xrf_acomp %>%
    as.data.frame() %>%
    rownames_to_column("uid"),
  CD166_19_xrf %>%
    select(-any_of(elementsList)),
  by = "uid"
) %>%
  arrange(depth, label)``````