Title: | Utilizing Automated Text Analysis to Support Interpretation of Narrative Feedback |
Version: | 1.0.0 |
Description: | Combine topic modeling and sentiment analysis to identify individual students' gaps, and highlight their strengths and weaknesses across predefined competency domains and professional activities. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | data.table, dplyr, jsonlite, magrittr, rlang, udpipe, reticulate, stats, stringr, textclean, tibble, tidyr, tidytext, topicmodels, utils |
Suggests: | matrixStats, reshape2, sentimentr, slam, stopwords, stringi, textstem, tokenizers |
LazyData: | true |
Depends: | R (≥ 3.5) |
NeedsCompilation: | no |
Packaged: | 2025-10-20 09:03:19 UTC; J.Moonen |
Author: | Joyce Moonen - van Loon
|
Maintainer: | Joyce Moonen - van Loon <j.moonen@maastrichtuniversity.nl> |
Repository: | CRAN |
Date/Publication: | 2025-10-23 13:50:02 UTC |
sumup: Utilizing Automated Text Analysis to Support Interpretation of Narrative Feedback
Description
Combine topic modeling and sentiment analysis to identify individual students' gaps, and highlight their strengths and weaknesses across predefined competency domains and professional activities.
Author(s)
Maintainer: Joyce Moonen - van Loon j.moonen@maastrichtuniversity.nl (ORCID)
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs)
.
Append Stopwords
Description
This function is used to append additional stopwords to an existing list.
Usage
append_stopwords(data_stopwords, stopwords_to_append)
Arguments
data_stopwords |
A character vector containing the initial set of stopwords to which more stopwords can be appended. |
stopwords_to_append |
A character vector containing additional stopwords to be appended to the existing |
Details
The append_stopwords
function takes an existing list of stopwords (data_stopwords
) and appends a new set of stopwords (stopwords_to_append
) to it. Additionally, stopwords from the stopwords
package (using the "dutch" language option) are also appended to the list. The final list is then returned with duplicate stopwords removed by calling unique()
.
Value
Returns a character vector of unique stopwords after appending the specified stopwords and additional stopwords from the stopwords
package.
Check all setting
Description
Check for every setting the type and value
Usage
check_all_settings(settings)
Arguments
settings |
A current list of settings (list) |
Value
Either an error or TRUE
Check setting
Description
Check the value of one particular setting
Usage
check_setting(settings, setting_name, setting_value)
Arguments
settings |
A current list of settings (list) |
setting_name |
The name of the setting that needs updating (string) |
setting_value |
The value of the setting |
Value
TRUE if correct, FALSE if incorrect
Clean Text
Description
Cleans and standardizes text by handling HTML tags, punctuation, spacing, and abbreviations.
Usage
clean_text(input_text, corrections)
Arguments
input_text |
A character vector containing the text to be cleaned. |
corrections |
A data frame containing typo corrections. |
Value
A cleaned character vector.
Correct Text
Description
Applies typo corrections to an input text using a given correction table.
Usage
correct_text(input_text, corrections)
Arguments
input_text |
A character vector containing the text to be corrected. |
corrections |
A data frame containing two columns: |
Value
A character vector with corrected text.
Create Narratives Dataset
Description
Restructures a dataset by normalizing and transforming its textual content.
Usage
create_dataset_narratives(data)
Arguments
data |
A data.table or data.frame containing narrative data. |
Value
A structured data.table with cleaned text.
Create Output for Sum Up
Description
Processes sentence-level Sum Up's output, assigns sentences to topics, calculates topic strengths, and organizes data into structured feedback categories.
Usage
create_output(sentence_complete_output, sentences_lda, settings)
Arguments
sentence_complete_output |
A data frame containing sentence-level topic polarity scores. |
sentences_lda |
An LDA model object containing topic-term distributions. |
settings |
A list of settings containing parameters such as the number of topics, threshold values, scoring scale, and output format. |
Details
This function processes Sum Up's output to assign sentences to topics based on a probability threshold. It calculates the topic strength based on sentence polarity and recency, categorizes topics by feedback quantity and quality, and includes unassigned feedback in a separate category.
Value
A structured list containing topic-wise feedback, strength, and quality assessments.
If settings$output_json
is TRUE
, returns a JSON string.
Load Stopwords
Description
Load stopwords from a file
Usage
default_stopwords(stopwords_file)
Arguments
stopwords_file |
A string representing the path to a file containing a list of stopwords, with one word per line. This file is expected to have tab-separated values with the stopwords in the first column. |
Details
The default_stopwords
function loads a list of stopwords from a specified file, with each stopword being listed on a new line. It returns a character vector of these stopwords.
Value
Returns a character vector of stopwords read from the specified file.
Example example_data
Description
example_data
contains an example of narrative feedback data and scores of formative assessments
from a student's portfolio in the Master education in Medicine from a Dutch university, translated to English.
Usage
example_data
Format
A data frame with 6 rows and 15 columns:
- submissionid
ID of the assessment form from which the data originates (NB: as the data is gathered per competency, the same submissionid can be used for multiple rows of data).
- formid
ID of the type of assessment form (NB: set to 1 if not applicable).
- templatename
Name of the type of assessment form (NB: set to same 'dummy' name for all rows, if not applicable).
- assistant
Identifier of the student. Needed when data from multiple portfolios is loaded. Set all to same value if not applicable.
- authority
Identifier of the assessor. NB: set all to same value if not applicable.
- specialty
Textual identifier of the specialty from which the assessment originates (NB: set to same 'dummy' name for all rows if not applicable).
- hospital
Textual identifier of the hospital from which the assessment originates (NB: set to same 'dummy' name for all rows if not applicable).
- portfolioid
Identifier of the sub-portfolio, e.g. when using rotations/clerkships. Define the IDs in the settings. NB: set all to 1 if not applicable.
- competencyid
Identifier of the competency used in the educational framework. Define the IDs in the settings. NB: set all to 1 if not applicable.
- sterk
Narrative feedback in case it is described in a predefined area for 'strengths' in the performance of the student.
- verbeter
Narrative feedback in case it is described in a predefined area for 'improvement areas' in the performance of the student.
- feedback
Narrative feedback in case it is NOT described in a predefined area for 'strengths' or 'improvement area'.
- score
Numeric score on the same assessment form. The scale can be set in settings. NB: set to NA or NULL if not applicable.
- datereferenced
Date of the assessment in format yyyy-mm-dd.
- zorgpunten
Comma-separated list of competencies that are marked as 'concerns' in the performance of the student. Make sure the names of competencies match the values in setting 'competencies'. NB: leave empty if not applicable.
Source
Example data for package documentation
Load Corrections
Description
Reads a typo correction file and prepares it for use in text cleaning.
Usage
get_corrections(corrections_file)
Arguments
corrections_file |
Path to a CSV file with typo corrections. |
Value
A data frame containing typoEdit
(pattern) and replacement
(correction).
Get Polarity from Grasp
Description
This function computes the polarity score for a given sentence using the Grasp sentiment analysis model. It applies specific logic based on the comment type (e.g., improvement suggestions) and language (Dutch, English, German, or French).
Usage
get_polarity(grasp, sentence, comment_type, language)
Arguments
grasp |
The imported Grasp model, which is used to compute sentiment polarity. |
sentence |
A character string representing the sentence for which sentiment polarity needs to be calculated. |
comment_type |
A string that indicates the type of feedback (e.g., "verbeter" for improvement). This influences sentiment adjustment in Dutch language. |
language |
A string indicating the language of the sentence ("nl" for Dutch, "en" for English, "de" for German, "fr" for French). |
Details
For Dutch sentences, the function applies specific transformations (such as removing certain words like "naar" and "gemaakt") before calculating sentiment. It also handles feedback sentences marked as "verbeter" (improvement suggestions) by adding the word "suboptimal" to adjust the sentiment analysis for improvement-type feedback.
Value
A numeric value representing the sentiment polarity of the sentence. Positive values indicate positive sentiment, negative values indicate negative sentiment, and zero indicates neutral sentiment.
Lemmatize Sentences Using a UDPipe Model
Description
This function processes a dataset of sentences using a UDPipe model to perform lemmatization, applies corrections to the lemmas, and associates metadata (e.g., submission ID, document) with the processed sentences. If UDPipie is not available, package NLP is used.
Usage
lemmatize(
sentences,
udpipe_model_file,
corrections_file,
language,
use_seeds = TRUE
)
Arguments
sentences |
A data frame containing sentences, with at least a |
udpipe_model_file |
A character string representing the path to the UDPipe model file used for annotation. |
corrections_file |
A character string representing the path to a CSV file containing corrections to be applied to the lemmas. |
language |
A character string representing the language of the dataset ('nl', 'en', 'de' or 'fr') |
use_seeds |
A logical value indicating whether to use the seeds (e.g. of an educational framework) |
Details
This function loads a UDPipe model, annotates the input sentences, corrects lemmas based on a provided corrections file, and optionally filters by noun tokens. It returns a data frame with the sentence-level annotations.
Value
A data frame containing the lemmatized sentences with the following columns:
-
doc_id
: Document identifier. -
lemma
: The lemmatized form of each word. -
upos
: Universal part-of-speech tag. -
sentenceid
: The sentence ID for the sentence from which the lemma was extracted. -
document
: Document associated with the sentence. -
submissionid
: Submission ID associated with the sentence.
Obtain Word Counts
Description
This function calculates the word counts for each document in the annotated dataset, excluding stopwords. It unites the annotated dataset, applies stopwords filtering, and counts the occurrences of each word.
Usage
obtain_word_counts(data_annotated, data_stopwords, stopwords_to_append)
Arguments
data_annotated |
A data frame containing the annotated text data. The data must include columns like |
data_stopwords |
A vector of stopwords that will be excluded from the word count. |
stopwords_to_append |
A vector of additional stopwords to be appended to |
Details
This function works by first uniting the document
, submissionid
, and sentenceid
columns into a new identifier mID
. Then, it filters out stopwords from the word count, counts the frequency of words, and returns the result.
Value
A tibble containing the word counts for each unique combination of mID
(document, submissionid, sentenceid) and word
. The tibble has columns:
-
mID
: A unique identifier combiningdocument
,submissionid
, andsentenceid
. -
word
: The word itself. -
n
: The frequency of the word.
Replace Abbreviations
Description
Replaces periods in abbreviations to ensure correct spacing and formatting.
Usage
replace_abbr(abbr)
Arguments
abbr |
A character string containing the abbreviation. |
Value
A corrected abbreviation string.
Run Sum Up
Description
This function runs a series of text processing and analysis steps including text cleaning, tokenization, lemmatization, topic modeling, and sentiment analysis. It then classifies sentences into topics and generates an output summarizing the results.
Usage
run_sumup(dataset, settings = NULL)
Arguments
dataset |
A data frame containing the text data to be analyzed. It should include at least the following columns: |
settings |
A list containing settings for various processing steps. If not provided, default settings are used. |
Details
This function performs the following steps:
Cleans the input text data using
text_clean
.Tokenizes the text into sentences and removes stopwords.
Lemmatizes and annotates the sentences using a UDPipe model.
Counts word frequencies and excludes stopwords.
Performs topic modeling on the word counts.
Runs sentiment analysis based on the specified method (Grasp or SentimentR).
Classifies sentences into topics using the topic classification model.
Generates output summarizing the topics and sentiment.
Value
A list or JSON output (depending on settings) containing the processed text data classified by topics and sentiment, as well as various metrics related to topics, such as strength and feedback count.
Examples
data(example_data)
ex_data <- example_data
ex_settings <- set_default_settings()
ex_settings <- update_setting(ex_settings , "language", "en")
ex_settings <- update_setting(ex_settings , "use_sentiment_analysis", "sentimentr")
result <- run_sumup(ex_data, ex_settings )
Sentiment Analysis using Grasp
Description
This function performs sentiment analysis on a collection of sentences using the Grasp sentiment analysis model. It computes the polarity of each sentence and returns a table with sentiment scores.
Usage
sentiment_analysis_grasp(documents, sentences, language, grasp_folder)
Arguments
documents |
A character vector of document identifiers for which sentiment analysis needs to be performed. |
sentences |
A data frame containing sentence-level data. This should include columns like |
language |
A string indicating the language of the sentences to analyze. Supported languages include "nl" (Dutch), "de" (German), and "fr" (French). |
grasp_folder |
The folder path where the Grasp Python module is located. This is required to run the Grasp sentiment analysis. |
Details
The function loads the Grasp sentiment analysis model via reticulate
and computes sentiment polarity for each sentence. It performs a special handling for Dutch language sentences that start with certain keywords to prepend "suboptimal" to improve sentiment accuracy. The sentiment is calculated using the pov
function from the Grasp model. To use the sentiment setting for Dutch medical education, copy file nl_pov.json
, included in this package, and paste it in grasp_folder/lm.
Value
A data table with the following columns:
-
sentenceid
: The ID of the sentence. -
sentence
: The sentence being analyzed. -
polarity
: The sentiment polarity score for the sentence. -
document
: The document identifier to which the sentence belongs.
Sentiment Analysis using sentimentr
Description
This function performs sentiment analysis on sentences using a predefined lexicon and the sentimentr
package. The sentiment score is calculated based on a dictionary of words with associated sentiment values.
Usage
sentiment_analysis_sentimentr(
documents,
sentences,
dictionary_file,
use_dictionary_file_in_sentimentr
)
Arguments
documents |
A character vector of document identifiers for which sentiment analysis needs to be performed. This is included for consistency, but is not directly used in the analysis. |
sentences |
A data frame containing sentence-level data. It should include the following columns:
|
dictionary_file |
A string representing the path to a CSV file containing the sentiment dictionary. The file should have two columns:
|
use_dictionary_file_in_sentimentr |
A boolean determining whether the dictionary file is used (TRUE) or the built-in lexicon (FALSE) |
Details
This function loads a sentiment dictionary, processes the sentences, and calculates a sentiment score for each sentence using the sentimentr
package. The dictionary is expected to have words associated with sentiment values that influence the sentiment score calculation. Positive sentiment scores indicate a positive sentiment, while negative values indicate negative sentiment.
The dictionary file (SumUp_Dictionary.csv
) is read from the data/
directory by default. The file should have a column of words and corresponding sentiment values.
Value
A data table with the following columns:
-
sentenceid
: The ID of the sentence. -
sentence
: The sentence being analyzed. -
polarity
: The sentiment polarity score for the sentence. -
document
: The document identifier to which the sentence belongs.
Settings functionality for package 'sumup'
Description
This file contains helper functions used throughout package 'sumup' to set and update the settings of the algorithm.
Create a list of settings for the algorithms in 'sumup'
Usage
set_default_settings()
Value
A list containing all settings used in 'sumup'
Author(s)
Joyce M.W. Moonen - van Loon
This file defines functions used to set and update the settings used in the 'sumup' algorithms Functions in this file:
set_default_settings(): Set the default settings list of the package (optimized for a Dutch Master in Medicine education using CanMeds competencies and professional activities)
update_educational_framework_settings(settings): Update settings related to the educational framework
update_setting(settings, setting_name, new_value): Updates setting 'setting_name' with value 'new_value', returning the new list of settings
check_all_settings(settings): Check whether all settings are of the correct type and fit the required ranges
check_setting(settings, setting_name, setting_value): Check whether setting_value is allowed for setting 'setting_name'
Usage: Load this package and call the functions directly.
Note: This file is part of the 'sumup' package.
Text Cleaning and Processing Functions
Description
This file contains helper functions for text processing and cleaning used in the 'sumup' package. The functions provide capabilities for text correction, cleaning, dataset preparation, and handling abbreviation replacements.
Cleans narrative text columns (sterk
, verbeter
, feedback
) in a dataset.
Usage
text_clean(data, corrections_file)
Arguments
data |
A data.table or data.frame containing text data. |
corrections_file |
Path to a CSV file containing typo corrections. |
Value
A cleaned dataset with standardized text.
Author(s)
Joyce M.W. Moonen - van Loon
This file defines several functions:
-
text_clean()
: Cleans multiple columns in a dataset. -
correct_text()
: Applies typo corrections using a predefined correction list. -
clean_text()
: Cleans and formats input text, handling HTML tags, spacing, punctuation, and typos. -
get_corrections()
: Reads a correction file and processes typo replacements. -
create_dataset_narratives()
: Prepares a dataset by restructuring and normalizing its textual content. -
replace_abbr()
: Corrects abbreviations.
Tidy and Split Narrative Text
Description
This function processes narrative data by splitting the text into sentences or simply subsetting the data based on specific comment types. It ensures consistency across various comment types and removes unwanted columns and duplicates.
Usage
tidy_text(narratives, split_in_sentences = TRUE)
Arguments
narratives |
A data frame or data.table containing the narratives to be processed. The dataset should include columns representing different types of comments (e.g., |
split_in_sentences |
A logical value indicating whether to split the text into individual sentences. If |
Details
The tidy_text
function processes a dataset of narratives, splitting them into individual sentences (if split_in_sentences = TRUE
) or subsetting them based on comment types (if split_in_sentences = FALSE
). The comment types are predefined as sterk
, verbeter
, and feedback
.
When
split_in_sentences = TRUE
, the function unnests the narrative data into sentences using regular expressions to identify sentence boundaries (periods, question marks, and exclamation marks).When
split_in_sentences = FALSE
, the function subsets the data by comment type and ensures that the relevant columns are kept. The function also performs text cleaning by squishing spaces and removing HTML tags.
Value
A data.table containing sentences (or narrative data) with the following columns:
-
document
: The document ID. -
submissionid
: The submission ID. -
competencyid
: The competency ID. -
assistant
: The assistant information. -
portfolioid
: The portfolio ID. -
sentenceid
: A unique identifier for each sentence (or narrative entry). -
sentence
: The cleaned-up sentence text.
Topic Classification for Narrative Data
Description
This function classifies sentences from narrative data into topics based on Latent Dirichlet Allocation (LDA) topic modeling results. It integrates additional features like sentiment analysis, annotated data, and seeded topics to enhance the classification.
Usage
topic_classification(
data,
narratives,
sentences,
sentences_lda,
sentence_polarity,
data_annotated,
use_beta,
use_seeds,
nr_topics,
seeded_topics,
competencies
)
Arguments
data |
A data frame or data.table containing metadata and other relevant data for topic classification. |
narratives |
A data frame or data.table containing the narrative data, including comments and feedback. |
sentences |
A data frame or data.table containing the sentences to be classified, including the text and metadata. |
sentences_lda |
A result object from LDA topic modeling, which contains the topic distribution for each term. |
sentence_polarity |
A data frame or data.table containing sentence-level polarity scores (sentiment analysis results). |
data_annotated |
A data frame or data.table containing the lemmatized text and other annotations. |
use_beta |
A logical value indicating whether to use the |
use_seeds |
A logical value indicating whether to incorporate seeded topics for topic classification. |
nr_topics |
An integer specifying the number of topics to classify. |
seeded_topics |
A list of character vectors representing the topics with predefined seed words. |
competencies |
A list of mames of the competencies per id as used in the data set |
Details
The topic_classification
function integrates several sources of data to classify sentences into topics:
It uses LDA topic modeling results to assign a probability for each topic.
The function can incorporate seeded topics, which enhance the classification by matching predefined keywords to the topics.
Sentiment analysis is used to add polarity scores to the sentence data.
The output includes additional metadata for each sentence, such as feedback type, score, and associated comments.
Value
A data.table containing the classified sentences with the following columns:
-
sentenceid
: Unique identifier for each sentence. -
sentence
: The sentence text. -
polarity
: The sentiment polarity score of the sentence. -
max_probability
: The highest topic probability for each sentence. Additional columns corresponding to each topic, representing the probability of the sentence belonging to that topic.
Metadata fields such as
document
,submissionid
,competencyid
,feedbacktype
,score
, andcomment
.
Topic Modeling with Latent Dirichlet Allocation (LDA)
Description
This function performs topic modeling on word count data using Latent Dirichlet Allocation (LDA). It supports both standard LDA and seeded LDA, where predefined topics can guide the topic modeling process.
Usage
topic_modeling(
word_counts,
seeded_topics,
seed_weight,
nr_topics,
set_seed,
lda_seed,
lda_alpha,
lda_best,
lda_burnin,
lda_verbose,
lda_iter,
lda_thin
)
Arguments
word_counts |
A data frame or data.table containing word counts, with columns for the document ID ( |
seeded_topics |
A list of character vectors representing predefined terms for each seed topic. If provided, seeded LDA will be performed. |
seed_weight |
A numeric value indicating the weight assigned to the seeded terms in the LDA model. This parameter influences how strongly the predefined seed topics affect the topic modeling. |
nr_topics |
An integer specifying the number of topics to be modeled by the LDA algorithm. |
set_seed |
A numeric value setting the seed for the topic modeling algorithm. Default set to 1234. |
lda_seed |
A numeric seed to be set for Gibbs Sampling. Default set to 1000. |
lda_alpha |
A numeric value that set the initial value for alpha. |
lda_best |
If TRUE only the model with the maximum (posterior) likelihood is returned, by default equals TRUE. |
lda_burnin |
A number of omitted Gibbs iterations at beginning, by default equals 0. |
lda_verbose |
A numeric value. If a positive integer, then the progress is reported every verbose iterations. If 0 (default), no output is generated during model fitting. |
lda_iter |
Number of Gibbs iterations (after omitting the burnin iterations), by default equals 2000. |
lda_thin |
Number of omitted in-between Gibbs iterations, by default equals iter. |
Details
The topic_modeling
function performs topic modeling using Latent Dirichlet Allocation (LDA) on a document-term matrix (DTM). If seeded_topics
is provided, a seeded LDA approach is used where predefined topics help guide the model's generation of topics. The function supports:
-
Standard LDA: Uses the traditional Gibbs sampling approach to estimate topics from word counts.
-
Seeded LDA: Incorporates predefined seed terms into the LDA model by assigning a weight (
seed_weight
) to these terms.
The function uses the topicmodels
package for LDA and the slam
package to manipulate sparse matrices. The results are captured in an LDA model object, which contains topic-word distributions and document-topic assignments.
Value
A topicmodels LDA object containing the result of the topic modeling. This object includes the topic distribution for each document and the terms associated with each topic.
Update settings related to the educational framework
Description
When changing the seeds_file or use_seeds settings, update all parameters relating to the educational framework
Usage
update_educational_framework_settings(settings)
Arguments
settings |
A current list of settings (list) |
Value
An updated list containing all settings used in 'sumup'
Update setting
Description
Replace the value of one particular setting value
Usage
update_setting(settings, setting_name, new_value)
Arguments
settings |
A current list of settings (list) |
setting_name |
The name of the setting that needs updating (string) |
new_value |
The new value of the setting |
Value
A list containing all settings used in 'sumup'