Background

prcr is an R package for person-centered analysis. Person-centered analyses focus on clusters, or profiles, of observations, and their change over time or differences across factors. See Bergman and El-Khouri (1999) for a description of the analytic approach. See Corpus and Wormington (2014) for an example of person-centered analysis in psychology and education.

Example using built-in dataset mtcars

This example uses R's built-in mtcars data, which records fuel consumption and other information for 32 automobiles. The variables disp (engine displacement, in cu. in.), hp (gross horsepower), and wt (weight, in 1000 lbs.) are clustered, with a two-cluster solution specified. Because the variables are in very different units, the to_scale argument is set to TRUE.

library(prcr)
df <- mtcars
two_profile_solution <- create_profiles(df, disp, hp, wt, n_profiles = 2, to_scale = TRUE)
## Prepared data: Removed 0 incomplete cases
## Hierarchical clustering carried out on: 32 cases
## K-means algorithm converged: 1 iteration
## Clustered data: Using a 2 cluster solution
## Calculated statistics: R-squared = 0.756
summary(two_profile_solution)
## 2 cluster solution (R-squared = 0.756)
## 
## Profile n and means:
## 
## # A tibble: 2 × 4
##               Cluster     disp        hp       wt
##                 <chr>    <dbl>     <dbl>    <dbl>
## 1 Cluster 1 (18 obs.) 135.5389  98.05556 2.609056
## 2 Cluster 2 (14 obs.) 353.1000 209.21429 3.999214
print(two_profile_solution)
## $clustered_processed_data
## 
## # A tibble: 2 × 4
##               Cluster     disp        hp       wt
##                 <chr>    <dbl>     <dbl>    <dbl>
## 1 Cluster 1 (18 obs.) 135.5389  98.05556 2.609056
## 2 Cluster 2 (14 obs.) 353.1000 209.21429 3.999214
## 
## $clustered_raw_data
## 
## # A tibble: 32 × 4
##     disp    hp    wt cluster
##    <dbl> <dbl> <dbl>   <int>
## 1  160.0   110 2.620       1
## 2  160.0   110 2.875       1
## 3  108.0    93 2.320       1
## 4  258.0   110 3.215       1
## 5  360.0   175 3.440       2
## 6  225.0   105 3.460       1
## 7  360.0   245 3.570       2
## 8  146.7    62 3.190       1
## 9  140.8    95 3.150       1
## 10 167.6   123 3.440       1
## # ... with 22 more rows
plot(two_profile_solution)
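
The messages printed by create_profiles() mention scaling, hierarchical clustering, a k-means step, and an R-squared value. The sketch below reproduces that kind of two-step procedure with base R only, to make the steps concrete. It is not prcr's implementation: the Euclidean distance, Ward linkage, and the R-squared definition (between-cluster sum of squares divided by total sum of squares) are assumptions, so the numbers need not match the output above.

```r
# Base-R sketch of a two-step cluster analysis; NOT prcr's own code.
# Assumptions: Euclidean distance, Ward linkage, R-squared = betweenss / totss.

# Standardize the variables, as to_scale = TRUE is described as doing.
vars <- scale(mtcars[, c("disp", "hp", "wt")])

# Step 1: hierarchical clustering to obtain starting centroids.
hc <- hclust(dist(vars), method = "ward.D2")
initial_groups <- cutree(hc, k = 2)
start_centers <- apply(vars, 2, function(column) tapply(column, initial_groups, mean))

# Step 2: k-means, started from the hierarchical centroids.
km <- kmeans(vars, centers = start_centers)

# Proportion of variance explained by the cluster solution.
r_squared <- km$betweenss / km$totss

table(km$cluster)
r_squared
```

Starting k-means from the hierarchical centroids makes the result deterministic, which is one reason a two-step procedure may be preferred over random starts.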

The output has the class prcr and contains elements that can be extracted for further use: r_squared, the r-squared value (for comparing the relative fit of different cluster solutions); clustered_raw_data, the raw clustered data (for conducting statistical tests to determine whether the cluster centroids differ from one another, and for use in additional analyses); and clustered_processed_data, the processed data (for creating different plots of the cluster centroids). Perhaps the most important is data_with_dummy_code, the original data frame with a dummy-coded column added for each cluster.

two_profile_solution$r_squared
## [1] 0.7558058
two_profile_solution$clustered_raw_data
## # A tibble: 32 × 4
##     disp    hp    wt cluster
##    <dbl> <dbl> <dbl>   <int>
## 1  160.0   110 2.620       1
## 2  160.0   110 2.875       1
## 3  108.0    93 2.320       1
## 4  258.0   110 3.215       1
## 5  360.0   175 3.440       2
## 6  225.0   105 3.460       1
## 7  360.0   245 3.570       2
## 8  146.7    62 3.190       1
## 9  140.8    95 3.150       1
## 10 167.6   123 3.440       1
## # ... with 22 more rows
two_profile_solution$clustered_processed_data
## # A tibble: 2 × 4
##               Cluster     disp        hp       wt
##                 <chr>    <dbl>     <dbl>    <dbl>
## 1 Cluster 1 (18 obs.) 135.5389  98.05556 2.609056
## 2 Cluster 2 (14 obs.) 353.1000 209.21429 3.999214
two_profile_solution$data
## # A tibble: 32 × 13
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
## 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
## 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
## 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
## 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
## 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
## 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
## 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
## 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with 22 more rows, and 2 more variables: cluster_1 <dbl>,
## #   cluster_2 <dbl>
two_profile_solution$data_with_dummy_code
## # A tibble: 32 × 13
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
## 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
## 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
## 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
## 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
## 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
## 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
## 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
## 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with 22 more rows, and 2 more variables: cluster_1 <dbl>,
## #   cluster_2 <dbl>
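
As one possible way to test whether the cluster centroids differ, the sketch below runs a one-way ANOVA of each clustering variable on cluster membership. The helper compare_centroids() is hypothetical and not part of prcr. To keep the example runnable without the package, it uses a stand-in data frame with the same columns as clustered_raw_data (disp, hp, wt, cluster), built with base R k-means; with prcr loaded, two_profile_solution$clustered_raw_data could presumably be passed instead.

```r
# Hypothetical helper (not part of prcr): one-way ANOVA of each variable
# on cluster membership, returning one p-value per variable.
compare_centroids <- function(clustered, variables, cluster_col = "cluster") {
  groups <- factor(clustered[[cluster_col]])
  p_values <- vapply(variables, function(v) {
    d <- data.frame(y = clustered[[v]], g = groups)
    anova(lm(y ~ g, data = d))[["Pr(>F)"]][1]
  }, numeric(1))
  data.frame(variable = variables, p_value = p_values, row.names = NULL)
}

# Stand-in for two_profile_solution$clustered_raw_data (an assumption):
# same columns, cluster labels taken from base R k-means on scaled data.
set.seed(1)
vars <- c("disp", "hp", "wt")
stand_in <- mtcars[, vars]
stand_in$cluster <- kmeans(scale(stand_in), centers = 2, nstart = 25)$cluster

centroid_tests <- compare_centroids(stand_in, vars)
centroid_tests
```

Because the clusters were formed from these same variables, such tests are descriptive rather than confirmatory; they are more informative for variables that were not used in the clustering.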

Functions for easily comparing the r-squared value for a range of cluster solutions, and for carrying out cross-validation of the clustering solution, will be added in future updates to the package.
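
Until such functions exist, the r-squared of several solutions can be compared by hand. The sketch below does this with base R k-means as a stand-in, again assuming r-squared is the between-cluster sum of squares divided by the total sum of squares. With prcr, one could presumably call create_profiles() once per value of n_profiles and read the r_squared element from each result.

```r
# Base-R stand-in for comparing fit across cluster solutions; NOT prcr code.
# Assumption: r-squared = betweenss / totss on the scaled variables.
set.seed(1)
vars <- scale(mtcars[, c("disp", "hp", "wt")])
n_profiles <- 2:5

r_squared <- vapply(n_profiles, function(k) {
  fit <- kmeans(vars, centers = k, nstart = 25)
  fit$betweenss / fit$totss
}, numeric(1))

comparison <- data.frame(n_profiles = n_profiles, r_squared = r_squared)
comparison
```

R-squared generally rises as clusters are added, so the point where the gains level off is usually more informative than the largest value.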