Deduplication
Locating duplicates
Duplicate identification and removal is a large field in itself, but the method applied in revtools is intentionally simple. You can use the function find_duplicates to look for repeated information within a single column of a data.frame:
data <- read_bibliography("my_data.ris")
matches <- find_duplicates(data)
From version 0.4.1, find_duplicates searches for identical DOIs by default, but another common approach is to look for similar titles:
matches <- find_duplicates(data, match_variable = "title")
Using title matching changes the defaults from exact matching to fuzzy matching using the ‘stringdist’ package. However, you can specify these arguments manually if you’d prefer:
matches <- find_duplicates(data,
  match_variable = "title",
  method = "lv",
  threshold = 2
)
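To give a sense of what method = "lv" and threshold = 2 mean in practice, here is a minimal sketch using base R's adist, which computes Levenshtein edit distances. It stands in for the 'stringdist' package purely for illustration, and the example titles are invented.

```r
# Invented example titles; not taken from any real dataset
titles <- c(
  "Effects of fire on plant diversity",
  "Effects of fire on plant diversty",     # one letter missing
  "Effects of grazing on plant diversity"  # a different study
)

# Pairwise Levenshtein distances (number of single-character edits)
distances <- utils::adist(titles)

# Pairs within an edit distance of 2 would be flagged as likely duplicates
threshold <- 2
is_match <- distances <= threshold

is_match[1, 2]  # TRUE: the titles differ by a single deletion
is_match[1, 3]  # FALSE: "fire" vs "grazing" needs more than two edits
```

A larger threshold catches more typographical variants, but it also starts to merge titles that are genuinely different, which is why the result is worth checking by eye.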
Once you have searched for potential duplicates, you can use extract_unique_references to automatically extract one reference from every ‘group’ of matched documents. This function picks the document with the most text from each group.
data_unique <- extract_unique_references(data, matches)
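The idea of keeping the entry with the most text from each group can be sketched in base R as follows. This is only an illustration of the principle, not the revtools implementation: the refs data and the matches vector are invented, and "most text" is assumed here to mean the largest total number of characters across all columns.

```r
# Invented references; rows 1 and 2 describe the same article
refs <- data.frame(
  title = c("Fire and plants", "Fire and plants", "Grazing and soil"),
  abstract = c("Short.", "A much longer abstract with more detail.", NA),
  stringsAsFactors = FALSE
)

# Hypothetical grouping: one identifier per row, duplicates share a value
matches <- c(1, 1, 2)

# Total number of characters in each row, counting missing values as empty
text_length <- rowSums(
  sapply(refs, function(col) nchar(ifelse(is.na(col), "", col)))
)

# Within each group, keep the row that contains the most text
keep <- unlist(lapply(
  split(seq_len(nrow(refs)), matches),
  function(rows) rows[which.max(text_length[rows])]
))

refs_unique <- refs[sort(unname(keep)), ]
```

Here refs_unique retains rows 2 and 3: the second copy of the first article (it has the longer abstract) and the only entry for the second article.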
Although this works, it relies very heavily on the user to check the results. The settings that you choose in find_duplicates strongly affect the accuracy of the result, so it is risky to simply rely on extract_unique_references to do the work for you. A safer choice is to use screen_duplicates, which lets you try different string matching algorithms and inspect the resulting matches interactively.
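If you prefer to stay in the console, one simple check is to list the entries that were grouped together before extracting anything. The sketch below assumes that the matching step returns one group identifier per row, with duplicates sharing an identifier; the titles and the matches vector are invented so that the example runs without revtools.

```r
# Invented titles and a hypothetical vector of group identifiers
titles <- c(
  "Fire and plants", "Fire and plant", "Grazing and soil",
  "Soil carbon", "Soil carbon."
)
matches <- c(1, 1, 2, 3, 3)

# Groups that contain more than one entry are the candidate duplicates
group_sizes <- table(matches)
duplicated_groups <- as.numeric(names(group_sizes)[as.vector(group_sizes) > 1])

# List the titles in each candidate group for manual inspection
flagged <- matches %in% duplicated_groups
candidates <- split(titles[flagged], matches[flagged])
print(candidates)
```

Reading through candidates makes it easier to spot groups that were merged too eagerly before any entries are dropped.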
Screening duplicates
You can launch screen_duplicates in one of three different ways. First, if you want to import data within the app, then you just run the function by itself:
screen_duplicates()
Second, you can launch the app using data from the workspace:
data <- read_bibliography("my_data.ris")
screen_duplicates(data)
Finally, if you want to save results from the app back to the workspace, then you need to specify an object where that data can be returned:
data <- read_bibliography("my_data.ris")
result <- screen_duplicates(data)
Specifying variables
The ‘Data’ tab contains four menus:

If you launched screen_duplicates without supplying any data, then the first menu allows you to drag-and-drop a dataset directly into the app.
Is there a variable describing duplicates in this dataset?: If you have already identified duplicates in your dataset, either manually or using find_duplicates, then you can use this to specify where those data are located in your data.frame.
Select column to search for duplicates: This specifies which data should be searched for matches. Most often this will be the article title, but you might want to search for matches in DOIs, or even journal titles.
Select grouping Variable(s): If no variables are specified, then the matching function (find_duplicates) will search every value against every other value in a while loop. This is computationally expensive for large datasets, so this menu allows you to limit the search for matches. The default is to search for titles, but only among entries that share the same journal and year.
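To see why grouping helps, the base R sketch below counts how many pairwise comparisons are needed with and without grouping by journal and year. The refs data are invented, and the counting is only a back-of-the-envelope illustration rather than a description of how find_duplicates is implemented.

```r
# Invented references with only the two grouping columns
refs <- data.frame(
  journal = c("A", "A", "A", "B", "B", "C"),
  year = c(2001, 2001, 2002, 2001, 2001, 2003),
  stringsAsFactors = FALSE
)

# Without grouping, every entry is compared against every other entry
all_pairs <- choose(nrow(refs), 2)

# With grouping, comparisons only happen within each journal-year combination
group_sizes <- as.vector(table(paste(refs$journal, refs$year)))
grouped_pairs <- sum(choose(group_sizes, 2))

all_pairs      # 15
grouped_pairs  # 2
```

The saving grows quickly with dataset size, at the cost of missing duplicates whose journal or year was recorded differently in each source.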
String distances
This section has 5 options:

Select method allows you to select a matching algorithm for the specified function.
Select maximum distance sets the threshold for matching pairs of strings.
Make lower case and remove punctuation do just that.
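The effect of the last two options can be illustrated with a short base R sketch. It is an approximation of the idea, not the code that the app runs, and the two titles are invented.

```r
# Two invented versions of the same title, differing in case and punctuation
titles <- c(
  "Fire, Plants and Soil: A Review.",
  "fire plants and soil a review"
)

# As entered, the strings are several edits apart
raw_distance <- utils::adist(titles[1], titles[2])[1, 1]

# Remove punctuation and convert to lower case before comparing
clean <- tolower(gsub("[[:punct:]]", "", titles))
clean_distance <- utils::adist(clean[1], clean[2])[1, 1]

raw_distance > 2  # TRUE: would be missed with a maximum distance of 2
clean_distance    # 0: identical after cleaning
```

Cleaning strings first means that a small maximum distance can be reserved for genuine spelling differences rather than being used up by formatting.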
App behaviour
If the selected algorithm doesn’t detect any duplicates with the specified settings, then it will show a warning message to that effect. If it does locate some potential duplicates, then it will present them to you in pairs and invite you to select which entry you would like to keep.
Once you have checked all possible duplicates, the app will prompt you to save your data as an exported file. Alternatively, you can exit the ‘save’ screen and click ‘exit app’ to return your saved results to the workspace.
Working with large datasets
To follow
Classifying duplicates by source
To follow