Deduplication

Locating duplicates

Duplicate identification and removal is a large field in its own right, but the approach taken in revtools is intentionally simple. You can use the function find_duplicates to look for repeated information within a single column of a data.frame:

data <- read_bibliography("my_data.ris")
matches <- find_duplicates(data)

From version 0.4.1, find_duplicates searches for identical DOIs by default, but another common approach is to look for similar titles:

matches <- find_duplicates(data, match_variable = "title")

Using title matching changes the default from exact matching to fuzzy matching via the ‘stringdist’ package. However, you can specify these arguments manually if you’d prefer:

matches <- find_duplicates(data,
  match_variable = "title",
  method = "lv",
  threshold = 2
)

Once you have searched for potential duplicates, you can use extract_unique_references to automatically extract one reference from each ‘group’ of matched documents. This function keeps the document with the most text from each group.
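To make that selection rule concrete, here is a rough sketch of the ‘most text’ heuristic in base R. This is illustrative only - it is not the package’s internal code - and it assumes that matches is a vector of group IDs with one entry per row of data:

# total characters in each reference, as a rough proxy for ‘most text’
text_length <- apply(data, 1, function(x) sum(nchar(x), na.rm = TRUE))

# within each match group, keep the row with the longest text
keep_rows <- vapply(
  split(seq_len(nrow(data)), matches),
  function(rows) rows[which.max(text_length[rows])],
  integer(1)
)
data_most_text <- data[keep_rows, ]

In practice, though, the call is much simpler: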

data_unique <- extract_unique_references(data, matches)
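Before accepting the output, it is worth a quick sanity check. Assuming matches is a vector of group IDs (one per reference), you can count how many references were removed and inspect the titles within a matched group:

# number of references removed by deduplication
nrow(data) - nrow(data_unique)

# titles from the first group that contains more than one record
duplicated_groups <- unique(matches[duplicated(matches)])
data$title[matches == duplicated_groups[1]]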

Although this works, it relies heavily on the user to check the results. The settings that you choose in find_duplicates strongly affect the accuracy of the outcome, so it is risky to rely on extract_unique_references to do the work for you. A safer approach is to use screen_duplicates to trial different string matching algorithms and to interrogate the results manually.

Screening duplicates

You can launch screen_duplicates in one of three ways. First, if you want to import data within the app, you can simply run the function by itself:

screen_duplicates()

Second, you can launch the app using data from the workspace:

data <- read_bibliography("my_data.ris")
screen_duplicates(data)

Finally, if you want to save results from the app back to the workspace, you need to assign the output to an object to which those data can be returned:

data <- read_bibliography("my_data.ris")
result <- screen_duplicates(data)

Specifying variables

The ‘Data’ tab contains four menus:

Import: If you haven’t passed any data to screen_duplicates, this menu allows you to drag-and-drop a dataset directly into the app.

Is there a variable describing duplicates in this dataset?: If you have already identified duplicates in your dataset - either manually or using find_duplicates - then you can use this menu to specify where those data are located in your data.frame.

Select column to search for duplicates: This specifies which data should be searched for matches. Most often this will be the article title, but you might want to search for matches in DOIs, or even journal titles.

Select grouping variable(s): If no variables are specified, the matching function (find_duplicates) searches every value against every other value in a while loop. This is computationally expensive for large datasets, so this menu allows you to limit the search for matches. The default is to search for matching titles, but only within entries that share the same journal and year; a command-line sketch of this grouped search is shown below.
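The same grouped search can be run outside the app. This sketch assumes a recent version of find_duplicates that accepts a group_variables argument, and that your data contain ‘journal’ and ‘year’ columns:

matches <- find_duplicates(data,
  match_variable = "title",
  group_variables = c("journal", "year"),
  method = "lv",
  threshold = 2
)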


String distances

This section has five options:

Select function has three options:

- fuzzdist: fuzzy string matching, based on the fuzzywuzzy Python library
- stringdist: string matching via the stringdist package in R
- exact: match strings exactly

Select method allows you to select a matching algorithm for the specified function.

Select maximum distance sets the threshold for deciding whether a pair of strings counts as a match; lower values require closer matches (see the example below).

Make lower case and remove punctuation do just that.
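To build intuition for the distance threshold, note that the Levenshtein (‘lv’) method, for example, counts the minimum number of single-character edits separating two strings, so a maximum distance of 2 will only match near-identical titles. You can check individual pairs with the stringdist package:

library(stringdist)

# one deletion separates these strings, so the distance is 1
stringdist("systematic review", "systematic revew", method = "lv")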


App behaviour

If the selected algorithm doesn’t detect any duplicates with the specified settings, the app shows a warning message to that effect. If it does locate potential duplicates, it presents each pair in turn and invites you to select which version you would like to keep.

Once you have checked all possible duplicates, the app will prompt you to save your data to a file. Alternatively, you can exit the ‘save’ screen and click ‘exit app’ to return your results to the workspace.

Working with large datasets

To follow

Classifying duplicates by source

To follow