Duplicate identification and removal is a large field in itself, but the method applied in revtools is intentionally simple. You can use the function find_duplicates to look for repeated information within a single column of a data.frame:
data <- read_bibliography("my_data.ris")
matches <- find_duplicates(
  data = data,
  match_variable = "title",
  group_variable = NULL,
  match_function = "fuzzdist",
  method = "fuzz_partial_ratio",
  threshold = 0
)
Once you have searched for potential duplicates, you can use
extract_unique_references to automatically extract one reference from every ‘group’ of matched documents. This function simply picks the document with the most text from each group.
data_unique <- extract_unique_references(data, matches)
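The selection rule described above can be sketched in base R. This is a minimal illustration, not the revtools implementation: the toy data frame `refs` and the group vector `groups` are hypothetical, and "most text" is approximated here as the number of characters across all non-missing fields of a row.

```r
# Toy bibliography: rows 1 and 2 are the same article, row 3 is unique.
refs <- data.frame(
  title = c("Fuzzy matching of titles", "Fuzzy matching of titles", "A different study"),
  abstract = c("Short abstract.", NA, "Another abstract."),
  stringsAsFactors = FALSE
)

# Hypothetical group labels, one per row; equal values mark suspected duplicates.
groups <- c(1, 1, 2)

# Paste the non-missing fields of each row together and count the characters.
row_text <- apply(refs, 1, function(row) paste(row[!is.na(row)], collapse = " "))
n_chars <- nchar(row_text)

# Within each group, keep the row with the most text (the first row wins ties).
keep <- vapply(
  split(seq_len(nrow(refs)), groups),
  function(rows) rows[which.max(n_chars[rows])],
  integer(1)
)
refs_unique <- refs[sort(unname(keep)), ]
print(refs_unique)
```

Note that a rule like this never compares the content of the grouped rows, so a wrongly grouped article is silently dropped.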
Although this works, it relies very heavily on the user to check the results. The settings that you choose in
find_duplicates strongly affect the accuracy of the result, so it is risky to simply rely on
extract_unique_references to do the work for you. A safer choice is to use
screen_duplicates to investigate different string matching algorithms, and to interrogate the results, via an interactive interface generated by Shiny.
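Even without the app, it is worth reading the matched records before discarding anything. The sketch below uses base R only and assumes that `matches` holds one group label per row of the data, with rows that share a label being suspected duplicates; the `titles` and `matches` values are invented for illustration.

```r
titles <- c(
  "Effects of fire on birds",
  "Effects of fire on birds.",
  "Bird responses to fire",
  "Effects of fire on bats"
)

# Hypothetical result of a loose matching run: rows 1, 2 and 4 were grouped.
matches <- c(1, 1, 2, 1)

# How many rows fall in each group?
group_sizes <- table(matches)

# Keep only groups with more than one member and list their titles together.
flagged <- names(group_sizes)[group_sizes > 1]
in_flagged <- matches %in% flagged
to_check <- split(titles[in_flagged], matches[in_flagged])
print(to_check)
```

In this toy case, reading the flagged group shows that the "bats" article is a false positive that an automatic extraction step would have removed.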
You can launch screen_duplicates in one of three different ways. First, if you want to import data within the app, then you just run the function by itself, with no arguments:
screen_duplicates()
Second, you can launch the app using data from the workspace:
data <- read_bibliography("my_data.ris")
screen_duplicates(data)
Finally, if you want to save results from the app back to the workspace, then you need to specify an object where that data can be returned:
data <- read_bibliography("my_data.ris")
result <- screen_duplicates(data)
The ‘Data’ tab contains four menus:
Import: If you haven’t passed any data to screen_duplicates, then this allows you to drag-and-drop a dataset directly into the app.
Is there a variable describing duplicates in this dataset?: If you have already identified duplicates in your dataset, either manually or using find_duplicates, then you can use this menu to specify where those data are located in your dataset.
Select column to search for duplicates: This specifies which data should be searched for matches. Most often this will be the article title, but you might want to search for matches in DOIs, or even journal titles.
Select grouping Variable(s): If no variables are specified, then the matching function (
find_duplicates) will search every value against every other value in a
while loop. This is computationally expensive for large datasets, so this menu allows you to limit the search for matches. The default is to search for titles, but only within the entries with the same journal and year.
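To see why grouping helps, it is enough to count the pairwise comparisons involved. The sketch below is plain arithmetic in base R rather than anything taken from revtools; the toy `journal` and `year` columns and the helper `n_pairs()` are hypothetical.

```r
# Number of distinct pairs among n items.
n_pairs <- function(n) n * (n - 1) / 2

# Toy dataset: 1000 references spread evenly over 4 journals and 5 years.
refs <- data.frame(
  journal = rep(c("Ecology", "Oikos", "Nature", "Science"), each = 250),
  year = rep(rep(2015:2019, each = 50), times = 4)
)

# Without grouping, every title is compared with every other title.
ungrouped <- n_pairs(nrow(refs))

# With grouping, comparisons happen only within each journal-year combination.
group_sizes <- as.vector(table(refs$journal, refs$year))
grouped <- sum(n_pairs(group_sizes))

cat("Comparisons without grouping:", ungrouped, "\n")
cat("Comparisons with grouping:   ", grouped, "\n")
```

The saving comes at a cost: two copies of the same article whose journal name or year was recorded differently land in different groups and are never compared.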
This section has 5 options:
Select function has three options:
- fuzzdist: fuzzy string matching based on the fuzzywuzzy Python library
- stringdist: fuzzy string matching using the stringdist R package
- exact: match strings exactly
Select method allows you to select a matching algorithm for the specified function.
Select maximum distance sets the threshold for matching pairs of strings.
Make lower case and remove punctuation do just that.
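The effect of these options can be illustrated with base R alone. In this sketch, `adist()` from the utils package (Levenshtein edit distance) stands in for the matching methods offered by the app, and the helper `normalise()` and the `max_distance` value are hypothetical. The methods in fuzzdist and stringdist score strings differently, so the numbers here are not comparable with theirs.

```r
# Lower-case a string and strip punctuation before comparing.
normalise <- function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

a <- "Fire effects on Birds: a review."
b <- "fire effects on birds a review"
other <- "Fire effects on bats: a review"

# Exact matching fails on the raw strings but succeeds after normalisation.
exact_raw <- a == b
exact_clean <- normalise(a) == normalise(b)

# Edit distance: the number of single-character edits separating two strings.
d_raw <- drop(adist(a, b))
d_clean <- drop(adist(normalise(a), normalise(b)))
d_other <- drop(adist(normalise(a), normalise(other)))

# A pair counts as a match only if its distance is within the threshold.
max_distance <- 2
is_match <- c(same_article = d_clean, other_article = d_other) <= max_distance
print(is_match)
```

Raising `max_distance` catches more typographical variants but also starts to merge titles that are genuinely different, which is why the threshold needs checking against real results.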
If the selected algorithm doesn’t detect any duplicates with the specified settings, then it will show a warning message to that effect. If it does locate some potential duplicates, then it will present them to you in pairs and invite you to select which one you would like to keep.
Once you have checked all possible duplicates, the app will prompt you to save your data as an exported file. Alternatively, you can exit the ‘save’ screen and click ‘exit app’ to return your saved results to the workspace.