library(playjareyesores)
library(nomnoml)

Some brief background. Occasionally, I have students cheat on their essays or long-answer questions, typically by copying their answers from somewhere. My university provides access to plagiarism-detection services like Turnitin and SafeAssign; however, these tools can be cumbersome to use. I have also sometimes resorted to using R for plagiarism detection. I am writing this package to collect my code for this purpose and put it in one place.

Also, I am using this vignette as a pseudo blog to work out which kinds of functions to focus on building. So, this is a tutorial for myself.

I’m reviewing a paper right now. It’s very similar to another paper I’ve read by the same author. Whole chunks are copied from one previous paper to another. I can’t be bothered to “manually” figure out which parts are the same. It would be nice to have some R functions for this.

Ok, I did some googling and tried out a bunch of things. I found some great packages that already exist, so I am going to play around with them a bit. These include textreuse and tabulizer (which unfortunately depends on Java, but allows for importing PDFs with two columns, useful for scientific papers).

## Import and cleaning functions

Here are some wrapper functions for importing PDFs or plain text and doing some basic cleaning: deleting new lines, deleting the “-” that appears when words run over a line break in a .pdf, and converting everything to lowercase.

p1 <- clean_1_col_pdf(file="data/test1col.pdf")
p2 <- clean_2_col_pdf(file="data/CGM2006.pdf")
p3 <- clean_plain_txt(file="data/sometext.txt")
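For reference, here is a minimal sketch of the kind of cleaning these wrappers perform, using only base R string functions (`clean_text` is a hypothetical helper for illustration; the package wrappers may differ in detail):

```r
# Hypothetical sketch of the basic cleaning steps described above:
# rejoin hyphenated line breaks, delete new lines, lowercase everything.
clean_text <- function(txt) {
  txt <- gsub("-\n", "", txt)    # rejoin words split by a "-" at line end
  txt <- gsub("\n", " ", txt)    # delete remaining new lines
  txt <- tolower(txt)            # convert everything to lowercase
  txt <- gsub("\\s+", " ", txt)  # collapse repeated whitespace
  trimws(txt)
}

clean_text("The Stroop ef-\nfect was mea-\nsured twice.\n")
#> [1] "the stroop effect was measured twice."
```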

## Compare documents using textreuse

The textreuse package can do some heavy lifting in terms of comparing documents for overlap. Below I use the align_local function to compare two of my papers. Looks like my self-plagiarism is limited to one of the methods sections, which makes sense because these two papers would have used the same methods.

library(textreuse)
p1 <- clean_2_col_pdf(file="data/CGM2006.pdf")
p2 <- clean_2_col_pdf(file="data/Crump et al. - 2008.pdf")
test <- align_local(p1,p2)

## Crump et al. 2006

[1] “participants were seated approximately 57 cm from the computer monitor at the beginning of each trial participants were presented with a fixation cross displayed in white against a black background for 1,000 #### msec ## followed by a blank interval of 250 msec ## next a color word prime was centrally displayed in white against a black background ### ######### ######### for 100 msec ## ########### following the prime display a colored ##### shape ##### probe display appeared participants were instructed to name the color of the probe as quickly and accurately as possible the probe was presented on the screen until the participant made a vocal response vocal response latencies were recorded with ##### a microphone and a voice activated relay timed the response from the onset of the probe display an experimenter coded each response as correct incorrect or spoil a spoil was defined as a trial in which noise unrelated to the onset of the intended response triggered the voice key after ### ####### the completion #### of #### the ### experiment participants were ## shown ########## pictures ## of congruent and incongruent ### trial ########### types in both ########## the ## high and low proportion congruent conditions the participants were then ######## asked #### to ### estimate ######## the ####### percentages of congruent ## and ######### incongruent ####### trial ########## types #### that occurred in both the high ##### and ### low #### proportion ## congruent ####### conditions ######### ### the ######### ## participants were ## asked #### to ########## give ### estimates ####### that #### summed to 100 for ## each #### of ####### the ###### proportion congruent conditions results for each participant response times rts for each condition were submitted to an outlier elimination procedure van selst jolicœur ######### 1994 mean rts were then computed from ##### those ### which ######### remained ############ these ### means ####### #### #### ########### were 
submitted to a repeated # measures anova that included proportion congruent high vs low and ## # congruency congruent vs incongruent as ######## within ######## participants ##### factors the mean rts in ### ##### ##### ### each condition collapsed across participants ## #### ########## are displayed in table 1.1 # the ##### ########## ## ######## ##### ### # ########### main effect of congruency was signif icant f 1,15 # 95.79 ## ###### mse 1,393.96 ###### p 0001 responses on ### congruent trials 486 msec were faster ### ## than responses on ### incongruent trials 577 ### msec ## more important the proportion congruent ## congruency interaction was significant f 1,15 # 7.87 ## ##### mse 133.11 ###### p 05 ### the stroop effect for the high proportion ######## condition was larger 99 ### msec ## than the stroop effect for the low proportion”

## Crump et al. 2008

[1] “participants were seated approximately 57 cm from the computer monitor at the beginning of each trial participants were presented with a fixation cross displayed in white against a black background for ##### 1000 #### ms followed by a blank interval of 250 #### ms next a color word prime ### ######### displayed in white against a black background was presented centrally for 100 #### ms immediately following the prime display a ####### color ##### patch probe display appeared participants were instructed to name the color of the probe as quickly and accurately as possible the probe was presented on the screen until the participant made a vocal response vocal response latencies were recorded #### using a microphone and a voice activated relay timed the response from the onset of the probe display an experimenter coded each response as correct incorrect or spoil a spoil was defined as a trial in which noise unrelated to the onset of the intended response triggered the voice key ##### 2.2 results the ########## data ## from ### two ########## participants #### in ##### experiment ######## 1a ## ######### and ########### one ##### participant ##### in #### experiment ### 1b #### ### ### ########## ######### ########## ### ############ were #### excluded ##### from ## all ######## analyses ### because ########### of ######### an ### equipment ########### failure ##### associated ##### with #### ######## ## #### the #### voice ### key ### used ########## to ######### collect ########## responses for the remaining 15 participants #### in ##### each ## experiment #### rts ######### greater #### than ###### ## 100 ### ms #### from ## correct ### trials ########## ######### ########## ####### for #### ########### ######## ##### ### ### each condition were submitted to an outlier elimination procedure van selst ######## jolicoeur 1994 mean rts were then computed #### using ##### the ##### remaining ######## observations ##### the ##### results from both experiments were 
submitted to a ######## 2 ######## ##### #### ######## proportion congruent high vs low ### by 2 congruency congruent vs incongruent ## repeated ###### measures ############ anova ####### ### #### rts ## and error rates for each condition collapsed across participants in each experiment are displayed in table ### 2 ### 2.2.1 experiment 1a location there was a significant main effect of congruency ### ###### ##### f #### 1 ##### 14 386.70 mse ######## 403.17 p 0001 responses ## for congruent trials ### #### were faster 468 ms than responses ## for incongruent trials ### 570 #### ms more important the proportion congruent by congruency interaction was significant f #### 1 #### 14 12.11 mse ###### 184.02 p ## 005 the stroop effect for the high proportion location condition was larger ## 114 #### ms than the stroop effect for the low proportion”

## Estimating number of same words

The align_local function is interesting: it produces two strings, one for each document, that are lined up as well as possible and identical in length. When words don’t line up, there is some fudging, and missing, deleted, or different words are replaced with a hash symbol (#). Here’s a quick and dirty way to figure out how many of the words line up exactly.

al_summary <- align_local_sum(test)
al_summary$sum
#> [1] 242
al_summary$sentence

[1] “participants were seated approximately 57 cm from the computer monitor at the beginning of each trial participants were presented with a fixation cross displayed in white against a black background for followed by a blank interval of 250 next a color word prime displayed in white against a black background for 100 following the prime display a probe display appeared participants were instructed to name the color of the probe as quickly and accurately as possible the probe was presented on the screen until the participant made a vocal response vocal response latencies were recorded a microphone and a voice activated relay timed the response from the onset of the probe display an experimenter coded each response as correct incorrect or spoil a spoil was defined as a trial in which noise unrelated to the onset of the intended response triggered the voice key the participants and in were of the the participants 100 for each condition were submitted to an outlier elimination procedure van selst 1994 mean rts were then computed were submitted to a proportion congruent high vs low congruency congruent vs incongruent rts each condition collapsed across participants are displayed in table main effect of congruency f mse p 0001 responses congruent trials were faster than responses incongruent trials more important the proportion congruent congruency interaction was significant f mse p the stroop effect for the high proportion condition was larger than the stroop effect for the low proportion”

This seems useful to me. The sentence you are looking at didn’t necessarily appear as consecutive words; however, it gives a reasonable bird’s-eye view of the overlap between two documents. If there are big chunks here, one document was copied from the other.
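The counting step can be sketched in base R, assuming every mismatch in the aligned string is marked by a run of `#` characters (`count_aligned` is a hypothetical stand-in for illustration, not the package’s align_local_sum):

```r
# Hypothetical sketch: keep every token in the aligned string that is
# not a "#" placeholder, then count and rejoin the surviving words.
count_aligned <- function(aligned) {
  words <- strsplit(aligned, "\\s+")[[1]]
  matched <- words[!grepl("^#+$", words)]  # drop the hash placeholders
  list(sum = length(matched),
       sentence = paste(matched, collapse = " "))
}

res <- count_aligned("the stroop effect ## was #### larger")
res$sum       # 5
res$sentence  # "the stroop effect was larger"
```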

## N gram methods

Another useful way to detect overlap is to use n-gram methods. Here, I use the tm package to create a document term matrix for two texts. The function allows you to set the size of the n-grams. All unique n-grams across the two texts are computed and counted for each text. Then I find the proportion of n-grams common to the two texts. In general, if texts use the same words they will have some overlap, but as the n-grams grow longer, texts that are not the same will have vanishingly small overlap.

ngram_proportion_same(p1,p2,1)
#> [1] 0.3208059
ngram_proportion_same(p1,p2,2)
#> [1] 0.1612376
ngram_proportion_same(p1,p2,3)
#> [1] 0.09709452
ngram_proportion_same(p1,p2,4)
#> [1] 0.07081316
ngram_proportion_same(p1,p2,5)
#> [1] 0.05574508
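The underlying idea can be sketched in base R, assuming shared n-grams are counted out of all unique n-grams across both texts (these helper names are hypothetical; the package function builds a document term matrix with tm instead):

```r
# Hypothetical sketch: build each text's set of n-grams, then take the
# proportion of shared n-grams out of all unique n-grams in both texts.
ngrams <- function(txt, n) {
  words <- strsplit(tolower(txt), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

proportion_same_sketch <- function(a, b, n) {
  ga <- unique(ngrams(a, n))
  gb <- unique(ngrams(b, n))
  length(intersect(ga, gb)) / length(union(ga, gb))
}

proportion_same_sketch("the stroop effect was larger",
                       "the stroop effect was smaller", 2)
#> [1] 0.6
```

As n grows, the shared set shrinks much faster for unrelated texts than for copied ones, which is the pattern in the output above.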

By default, the overlapping n-grams are not returned, but they can be using show="ngrams". Note that I’ve done some further cleaning of the texts; this was necessary after I added another feature to show="ngrams" (see the next section). In general, it’s not clear to me how “clean” the text needs to be, and I’ll try to get back to this and figure it out.

p1 <- qdapRegex::rm_non_ascii(p1) %>%
  tm::removeNumbers() %>%
  qdapRegex::rm_non_words()
p2 <- qdapRegex::rm_non_ascii(p2) %>%
  tm::removeNumbers() %>%
  qdapRegex::rm_non_words()
out <- ngram_proportion_same(p1,p2,10,show="ngrams")
out$the_grams[1:10] # show first 10 ngrams
#> [1] "a century of research on the stroop effect an integrative"
#> [2] "a circle in diameter or a square in width that"
#> [3] "a first language had normal color vision and had normal"
#> [4] "a fixation cross displayed in white against a black background"
#> [5] "a microphone and a voice activated relay timed the response"
#> [6] "a mixed design anova with experiment location vs shape as"
#> [7] "a parallel distributed processing model of the stroop effect psychological"
#> [8] "a significant main effect of congruency f mse p responses"
#> [9] "a simple priming procedure involving the presentation of a color"
#> [10] "a solution to the effect of sample size on outlier"

These two papers were on the same topic, and had many of the same references, which account for much of the overlap.

## Reporting

t1 <- "here is some text. I'd like to write about a few things. Then I'm going to compare what I wrote here with what I'm going to write in a little bit. After that, I'm going to make a function to look at the documents side by side, and bold the ngrams that are the same between the texts."
t2 <- "And some more text for you. This time I'm not as certain what I'm going to say, but I'm going to compare what I write here with what I wrote before. I'll do that in a little. The purpose is to get some text that I can use to make a report that lines up the documents, showing which ngrams were the same."

out <- ngram_proportion_same(t1,t2,3,show="ngrams", meta=c("A title","B title"))

## A

[1] “here is some text. i’d like to write about a few things. then I’M GOING TO COMPARE WHAT I wrote HERE WITH WHAT I’M GOING TO write in a little bit. after that, I’M GOING TO make a function to look at the documents side by side, and bold the ngrams that are the same between the texts.”

## B

[1] “and some more text for you. this time i’m not as certain what I’M GOING TO say, but I’M GOING TO COMPARE WHAT I write HERE WITH WHAT i wrote before. i’ll do that in a little. the purpose is to get some text that i can use TO MAKE A report that lines up the documents, showing which ngrams were the same.”

### Range of n grams

out <- ngrams_analysis(t1,t2,range = 2:5)
attributes(out)
#> $names
#> [1] "ngram2" "ngram3" "ngram4" "ngram5"
attributes(out$ngram2)
#> $names
#> [1] "proportion" "the_grams"  "a_title"    "b_title"    "a_print"
#> [6] "b_print"

## Full report using ngrams_report

The ngrams_report function takes the output from the ngrams_analysis function and returns a print object for use in an .Rmd document that renders to HTML. The printout is a summary of the analysis.

Here is an example using two papers published by Robert Sternberg, who has been identified as self-plagiarizing in some of his work. I thought this would be a good test case. I obtained two papers that Sternberg published in 2010, between which there was supposed overlap. Let’s use the ngrams_report function to find out. Remember to set the knitr chunk option results="asis".

# load in papers
p1 <- clean_1_col_pdf("data/Sternberg2010.pdf")
p2 <- clean_1_col_pdf("data/Sternberg2010b.pdf")

# clean them up a bit more
p1 <- LSAfun::breakdown(p1) %>%
  qdapRegex::rm_white()
p2 <- LSAfun::breakdown(p2) %>%
  qdapRegex::rm_white()

# run the ngram analysis
meta_titles <- c("Sternberg 2010A, School Psychology International",
                 "Sternberg 2010B, Journal of Cognitive Education and Psychology")
out <- ngrams_analysis(p1,p2,3:8, meta = meta_titles)

# print out results below

ngrams_report(out, print_n = 5, highlight_n = "ngram5", color="red")

## Ngram descriptives

| ngram  | proportion | count | total_unique |
|--------|-----------:|------:|-------------:|
| ngram3 | 0.4012662  |  3993 |         9951 |
| ngram4 | 0.3447862  |  3733 |        10827 |
| ngram5 | 0.2995246  |  3402 |        11358 |
| ngram6 | 0.2600509  |  3066 |        11790 |
| ngram7 | 0.2244881  |  2730 |        12161 |
| ngram8 | 0.1924646  |  2406 |        12501 |

## Ngram examples

### ngram3

a analytically sound

a apply b

a basis for

a berkeley and

a better world

### ngram4

a analytically sound b

a apply b use

a basis for augmenting

a berkeley and harvard

a better world rather

### ngram5

a analytically sound b balanced

a apply b use c

a basis for augmenting ap

a better world rather than

a book of case studies

### ngram6

a analytically sound b balanced c

a apply b use c put

a basis for augmenting ap exams

a better world rather than destroy

a book of case studies of

### ngram7

a analytically sound b balanced c logical

a basis for augmenting ap exams in

a better world rather than destroy it

a chance to develop and also to

a classroom team project succeed physical education

### ngram8

a analytically sound b balanced c logical and

a basis for augmenting ap exams in psychology

a better world rather than destroy it originally

a chance to develop and also to challenge

a classroom team project succeed physical education d

## Reporting style options

The color option changes the font color of the highlighted text. It should respect any color that would normally work with CSS. Note that the highlighted words are also wrapped in an HTML span with id = "ngram". As a result, CSS could be used to further style the highlighted text.
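The span-wrapping can be sketched with a simple gsub over each matched n-gram (`highlight_ngrams` is a hypothetical helper for illustration, not the package internals):

```r
# Hypothetical sketch: wrap each matched n-gram in an HTML span with
# id = "ngram" and an inline font color, as in the report output.
highlight_ngrams <- function(txt, grams, color = "red") {
  for (g in grams) {
    span <- sprintf('<span id="ngram" style="color: %s;">%s</span>', color, g)
    txt <- gsub(g, span, txt, fixed = TRUE)
  }
  txt
}

highlight_ngrams("going to compare what i wrote", "going to compare", color = "blue")
#> [1] "<span id=\"ngram\" style=\"color: blue;\">going to compare</span> what i wrote"
```

A CSS rule targeting `span#ngram` could then restyle all of the highlights at once.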

t1 <- "here is some text. I'd like to write about a few things. Then I'm going to compare what I wrote here with what I'm going to write in a little bit. After that, I'm going to make a function to look at the documents side by side, and bold the ngrams that are the same between the texts."
t2 <-"And some more text for you. This time I'm not as certain what I'm going to say, but I'm going to compare what I write here with what I wrote before. I'll do that in a little. The purpose is to get some text that I can use to make a report that lines up the documents, showing which ngrams were the same."

out <- ngrams_analysis(t1,t2,range = 2:5, meta=c("A title","B title"))

# print out results below

ngrams_report(out, print_n = 5, highlight_n = "ngram5", color="blue")

## Ngram descriptives

| ngram  | proportion | count | total_unique |
|--------|-----------:|------:|-------------:|
| ngram2 | 0.1212121  |    12 |           99 |
| ngram3 | 0.0754717  |     8 |          106 |
| ngram4 | 0.0360360  |     4 |          111 |
| ngram5 | 0.0180180  |     2 |          111 |

## Ngram examples

### ngram2

compare what

going to

here with

i wrote

i’m going

### ngram3

compare what i

going to compare

here with what

i’m going to

to compare what

### ngram4

going to compare what

i’m going to compare

to compare what i

what i’m going to

NA

### ngram5

going to compare what i

i’m going to compare what

NA

NA

NA

## A title

here is some text. i’d like to write about a few things. then i’m GOING TO COMPARE WHAT I wrote here with what i’m going to write in a little bit. after that, i’m going to make a function to look at the documents side by side, and bold the ngrams that are the same between the texts.

## B title

and some more text for you. this time i’m not as certain what i’m going to say, but i’m GOING TO COMPARE WHAT I write here with what i wrote before. i’ll do that in a little. the purpose is to get some text that i can use to make a report that lines up the documents, showing which ngrams were the same.