Multiple document comparison for textual overlap

multi_doc_compare(texts, n_grams, sd_criterion)

Arguments

texts

Character vector of texts; each element of the vector is one text.

n_grams

Integer specifying the n-gram size used as the unit of comparison.

sd_criterion

Numeric. Standard deviation criterion for returning document pairs that are unusually similar; values of 2 to 3 generally work well.

Value

A list with the following elements:

  • dtm (matrix): document-term matrix for all texts

  • histogram: a histogram of the cosine similarity values between every pair of texts

  • similarities (matrix): cosine similarities between every pair of texts

  • mean_similarity (numeric): the mean similarity across all pairs of texts

  • sd_similarity (numeric): the standard deviation of the similarities

  • check_these (data frame): document pairs whose similarity was above the criterion and that may deserve a closer look
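The relationship between the returned elements can be illustrated with a minimal base-R sketch. It is not the function's actual implementation. The names `word_ngrams` and `sketch_compare` are hypothetical, the whitespace-and-punctuation tokenizer is an assumption, and so is the rule that a pair is flagged when its similarity exceeds the mean plus `sd_criterion` standard deviations.

```r
# Hypothetical helper: split one text into word n-grams (tokenization is an assumption).
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical sketch of the comparison; NOT the package's own code.
sketch_compare <- function(texts, n_grams = 2, sd_criterion = 2) {
  grams <- lapply(texts, word_ngrams, n = n_grams)
  vocab <- sort(unique(unlist(grams)))

  # Document-term matrix: one row per text, one column per n-gram.
  dtm <- matrix(0, nrow = length(texts), ncol = length(vocab),
                dimnames = list(NULL, vocab))
  for (i in seq_along(grams)) {
    counts <- table(grams[[i]])
    if (length(counts) > 0) dtm[i, names(counts)] <- as.numeric(counts)
  }

  # Cosine similarity between every pair of rows
  # (a text with no n-grams has norm 0 and yields NaN).
  norms <- sqrt(rowSums(dtm^2))
  similarities <- tcrossprod(dtm) / outer(norms, norms)

  # Use each unordered pair once (upper triangle, column-major order).
  upper <- upper.tri(similarities)
  pairs <- which(upper, arr.ind = TRUE)
  values <- similarities[upper]

  mean_similarity <- mean(values)
  sd_similarity <- sd(values)

  # Assumed flagging rule: more than sd_criterion SDs above the mean.
  flagged <- values > mean_similarity + sd_criterion * sd_similarity
  check_these <- data.frame(doc1 = pairs[flagged, "row"],
                            doc2 = pairs[flagged, "col"],
                            similarity = values[flagged])

  list(dtm = dtm,
       histogram = hist(values, plot = FALSE),
       similarities = similarities,
       mean_similarity = mean_similarity,
       sd_similarity = sd_similarity,
       check_these = check_these)
}

# Invented example texts: the first two differ by a single word.
example_texts <- c(
  "the quick brown fox jumps over the lazy dog",
  "the quick brown fox jumps over the sleepy dog",
  "cosine similarity compares two vectors of counts",
  "students submitted their essays before the deadline",
  "rainfall was unusually heavy across the region last spring"
)
res <- sketch_compare(example_texts, n_grams = 2, sd_criterion = 2)
print(res$check_these)
```

With these invented texts, only the first two documents share any bigrams (6 of 8 each, a cosine similarity of 0.75), so under the assumed rule that pair is the only one flagged.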

Examples
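A hedged usage sketch follows. The example texts are invented, the argument values are illustrative only, and the call is guarded so the snippet still runs when the package providing `multi_doc_compare` is not attached.

```r
# Invented example texts; the first two are near-duplicates.
texts <- c(
  "the quick brown fox jumps over the lazy dog",
  "the quick brown fox jumps over the sleepy dog",
  "students submitted their essays before the deadline",
  "rainfall was unusually heavy across the region last spring"
)

# Only call the function if it is available in the current session.
if (exists("multi_doc_compare", mode = "function")) {
  results <- multi_doc_compare(texts, n_grams = 2, sd_criterion = 2)

  # Element names below are those listed under Value.
  print(results$mean_similarity)
  print(results$sd_similarity)
  print(results$check_these)
}
```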