Deduplication of EndNote and Zotero RIS files:
- deduplicate one file: produces a new RIS file with the unique records
- deduplicate two files (NEW-RECORDS and OLD-RECORDS): deduplicates both files and produces a RIS file with the unique records from NEW-RECORDS
- mark the duplicates of one file: produces a RIS file with the Label field containing the ID of the duplicate record
DedupEndNote is available at http://dedupendnote.nl:9777
- Export one or two EndNote or Zotero databases as RIS file(s)
- Upload the file(s)
- Choose the action
- Download the result file (RIS)
- Import the result file into a new EndNote or Zotero database
DedupEndNote is a Java web application (Java 21, Spring Boot 3.5.6, fat jar). It can be started locally with:
java -jar DedupEndNote-[VERSION].jar
and the application will be available at
http://localhost:9777
Deduplication in EndNote and Zotero misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but there remain many cases where EndNote is too unforgiving when comparing records. Some bibliographic databases offer deduplication for their own databases (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.
DedupEndNote deduplicates an EndNote or Zotero RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote or Zotero database. It is more forgiving than EndNote and Zotero themselves when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").
The program has been tested on EndNote databases with records from:
- CINAHL (EBSCOHost)
- ClinicalTrials.gov
- Cochrane Library (Trials)
- EMBASE (OVID)
- Medline (OVID)
- PsycINFO (OVID)
- PubMed
- Scopus
- Web of Science
The program has been tested with files of up to 50,000 records.
Each pair of records is compared in 5 different ways. The general rule is:
| Comparison | Result | Action |
|---|---|---|
| 1 ... 5 | YES | go to next comparison if present, else mark the records as duplicates |
| | (insufficient data for comparison) | same as YES |
| | NO | stop comparisons for this pair of records |
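The general rule can be sketched in a few lines of Java. This is only an illustration, not DedupEndNote's actual code: the names `ComparisonCascade`, `Rec`, `Result` and `STEPS` are hypothetical, and only two simplified comparisons (publication year and starting page, without the DOI logic) stand in for the five real ones.

```java
import java.util.List;
import java.util.function.BiFunction;

final class ComparisonCascade {
    enum Result { YES, INSUFFICIENT_DATA, NO }

    /** Hypothetical minimal record: only the fields used in this sketch. */
    record Rec(Integer year, String startPage) {}

    /** Simplified stand-in for comparison 1: at most 1 year apart. */
    static Result compareYear(Rec a, Rec b) {
        if (a.year() == null || b.year() == null) return Result.INSUFFICIENT_DATA;
        return Math.abs(a.year() - b.year()) <= 1 ? Result.YES : Result.NO;
    }

    /** Simplified stand-in for comparison 2: starting pages compared by digits only. */
    static Result compareStartPage(Rec a, Rec b) {
        String pa = a.startPage() == null ? "" : a.startPage().replaceAll("\\D", "");
        String pb = b.startPage() == null ? "" : b.startPage().replaceAll("\\D", "");
        if (pa.isEmpty() || pb.isEmpty()) return Result.INSUFFICIENT_DATA;
        return pa.equals(pb) ? Result.YES : Result.NO;
    }

    /** The comparisons in the order in which they are applied. */
    static final List<BiFunction<Rec, Rec, Result>> STEPS =
            List.of(ComparisonCascade::compareYear, ComparisonCascade::compareStartPage);

    /** A pair is a duplicate unless some comparison answers NO. */
    static boolean isDuplicate(Rec a, Rec b, List<BiFunction<Rec, Rec, Result>> steps) {
        for (BiFunction<Rec, Rec, Result> step : steps) {
            if (step.apply(a, b) == Result.NO) {
                return false; // stop comparisons for this pair
            }
            // YES or INSUFFICIENT_DATA: continue with the next comparison
        }
        return true;
    }
}
```

Stopping at the first NO is what makes the order of the comparisons matter for performance: cheap comparisons that reject many pairs are best placed first.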
The following comparisons are used (in this order, chosen for performance reasons):
- Publication year: Are they at most 1 year apart?
- Preprocessing: publication years before 1800 are removed (see insufficient data)
- Insufficient data: Records without a publication year are compared to all records unless they have been identified as a duplicate.
- Special cases: Cochrane Reviews are compared for the same publication year only.
- Starting page or DOI: Are they the same?
If the starting and ending page of at least one of the publications are more than 2 pages apart, the DOIs are compared first; if the DOIs are different or one or both are absent, the starting pages are then compared. Otherwise the starting pages are compared first.
- Preprocessing: Article number is treated as a starting page if starting page itself is empty or contains "-".
- Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
- Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out. URL- and HTML-encoded DOIs are decoded (`10.1002/(SICI)1098-1063(1998)8:6&lt;627::AID-HIPO5&gt;3.0.CO;2-X` becomes `10.1002/(SICI)1098-1063(1998)8:6<627::AID-HIPO5>3.0.CO;2-X`). DOIs are lowercased.
- Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
- Special cases: Cochrane Reviews: if both records have a DOI, only the DOIs are compared, otherwise the starting pages are compared.
- Authors: Is the Jaro-Winkler similarity of the authors > 0.67?
- Preprocessing: The author "Anonymous," is treated as no author.
- Preprocessing: Group author names are removed. "Author" names which contain "consortium", "grp", "group", "nct" or "study" are considered group author names.
- Preprocessing: Only the first 40 authors are retained.
- Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." becomes "Moorthy, R. K.").
- Preprocessing: All authors from each record are joined by "; ".
- Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
- Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.89?
The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles. Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.
- Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
- Preprocessing: In the titles of retracted publications all parts which refer to the retraction are removed. For example, in "RETRACTED: Response of Breast Cancer Cells and Cancer Stem Cells to Metformin and Hyperthermia Alone or Combined (Retracted article. See vol. 20, 2025)" the parts "RETRACTED:" and "(Retracted article. See vol. 20, 2025)" are removed. A publication is considered a retraction if the Title starts with "retracted", "removed" or "withdrawn", or contains "retracted article" (all case insensitive).
- Insufficient data: If one of the records is a reply, erratum or comment (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
- ISBN, ISSN or Journal: Are they the same (ISBN, ISSN) or similar (Journal)?
This rule is skipped if both records have the same DOI (that comparison was made in step 2).
The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals, ISBNs as ISSNs. All ISBNs, ISSNs and journal titles (including abbreviations) in the records are used.
If both records have an ISBN, only the ISBNs are compared; if both have an ISSN, only the ISSNs are compared; otherwise the journal titles are compared.
Abbreviated and full journal titles are compared in a sensible way (see examples below).
- Preprocessing: ISBNs and ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
- Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
- Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
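The author and title comparisons above rely on Jaro-Winkler similarity. As a rough illustration, the sketch below implements a textbook Jaro-Winkler and applies it to normalized titles in both forward and reversed form. The class and method names are hypothetical, the normalization is a simplified subset of the rules listed above, and accepting a pair when either the forward or the reversed comparison exceeds 0.89 is an assumption about how the reversed titles are used. DedupEndNote's actual implementation may differ (library implementations often apply the prefix bonus only above a minimum Jaro score, which this sketch omits).

```java
import java.util.Locale;

final class TitleSimilarity {

    /** Jaro similarity of two strings, between 0.0 and 1.0. */
    static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        if (s.isEmpty() || t.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(t.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Walk the matched characters of both strings in order and count mismatches.
        int k = 0;
        int outOfOrder = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a common prefix of up to 4 characters. */
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int max = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < max && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    /** Simplified normalization: lower case, drop "<...>", non-alphanumerics to spaces. */
    static String normalize(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replaceAll("<[^>]*>", "")
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
    }

    /** Assumption: titles are similar if the forward OR the reversed comparison is > 0.89. */
    static boolean similarTitles(String a, String b) {
        String na = normalize(a);
        String nb = normalize(b);
        String ra = new StringBuilder(na).reverse().toString();
        String rb = new StringBuilder(nb).reverse().toString();
        return jaroWinkler(na, nb) > 0.89 || jaroWinkler(ra, rb) > 0.89;
    }
}
```

Comparing the reversed strings turns a shared ending into a shared prefix, so two titles that differ only at the start are no longer penalized by the prefix-weighted score.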
Remarks:
- Comment: a publication is considered a comment if the title (fields ST and TI) contains words such as "comment" or "commentary".
- Erratum: a publication is considered an erratum if the title (fields ST and TI) contains "Correction", "Corrigendum" or "Erratum".
- Reply: a publication is considered a reply if the title (fields ST and TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
- T3 field: EMBASE (OVID) in particular uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
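The title-based definitions of comment, erratum and reply lend themselves to regular expressions. The following is a minimal sketch under stated assumptions: the class `PublicationKind` and its patterns are hypothetical, "author(...)respon(...)" is read as "author" followed anywhere later by "respon", the comment test matches any word starting with "comment", and all three tests are treated as case insensitive (the text states this only for replies).

```java
import java.util.regex.Pattern;

final class PublicationKind {
    // Assumption: "author(...)respon(...)" means "author", any text, then "respon".
    static final Pattern REPLY = Pattern.compile(
            "reply|author.*respon|^\\s*response\\s*$", Pattern.CASE_INSENSITIVE);

    // Assumption: case-insensitive matching is used for errata as well.
    static final Pattern ERRATUM = Pattern.compile(
            "correction|corrigendum|erratum", Pattern.CASE_INSENSITIVE);

    // Assumption: any word starting with "comment" counts (comment, comments, commentary).
    static final Pattern COMMENT = Pattern.compile(
            "\\bcomment", Pattern.CASE_INSENSITIVE);

    static boolean isReply(String title) {
        return REPLY.matcher(title).find();
    }

    static boolean isErratum(String title) {
        return ERRATUM.matcher(title).find();
    }

    static boolean isComment(String title) {
        return COMMENT.matcher(title).find();
    }
}
```

With these patterns a title that is only "Response" counts as a reply, while "Response of Breast Cancer Cells to Metformin" does not.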
When writing the output file (except in Mark Mode), the following fields can be changed:
- Author (AU):
- if the (only) author is "Anonymous", the author is omitted
- DOI (DO):
- the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
- DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
- Publication year (PY):
- if the saved record has no value for its Publication year but one of the removed duplicate records has, the first non-empty Publication year of the duplicates is copied to the saved record.
- Starting page (SP) and Article Number (C7):
- the article number from field C7 is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
- the article number field (C7) is omitted
- if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first non-empty pages of the duplicates are copied to the saved record.
- the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
- if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
- Title (TI):
- If the publication is a reply / erratum / comment / retraction, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")
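Two of the field rewrites above are simple string transformations. The sketch below shows one possible way to expand an abbreviated page range and to put a DOI in the "https://doi.org/" form. The class `OutputFields` and its methods are hypothetical and cover only the cases mentioned in the list; they are not taken from DedupEndNote's source.

```java
final class OutputFields {

    /** Expands "482-91" to "482-491" and collapses "192-192" to "192". */
    static String expandPages(String pages) {
        String trimmed = pages.trim();
        String[] parts = trimmed.split("-", 2);
        if (parts.length != 2) return trimmed;       // no range, e.g. "192"
        String start = parts[0].trim();
        String end = parts[1].trim();
        if (start.isEmpty() || end.isEmpty()) return trimmed;
        // Only purely numeric ranges are expanded in this sketch.
        if (start.matches("\\d+") && end.matches("\\d+") && end.length() < start.length()) {
            end = start.substring(0, start.length() - end.length()) + end;
        }
        return start.equals(end) ? start : start + "-" + end;
    }

    /** Rewrites a bare DOI or a (dx.)doi.org URL as "https://doi.org/<DOI>". */
    static String canonicalDoi(String doi) {
        String bare = doi.trim()
                .replaceFirst("(?i)^(https?://)?(dx\\.)?doi\\.org/", "");
        return "https://doi.org/" + bare;
    }
}
```

For example, `OutputFields.expandPages("482-91")` returns "482-491", and both "10.1038/ctg.2014.12" and "http://dx.doi.org/10.1038/ctg.2014.12" become "https://doi.org/10.1038/ctg.2014.12".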
The output file is a new RIS file which can be imported into a new EndNote or Zotero database.
DedupEndNote is slower than EndNote in deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).
Data are from:
- [SRA] Rathbone, J., Carter, M., Hoffmann, T. et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6
  The data sets are available at https://osf.io/dyvnj/
- [McKeown] McKeown, S., Mir, Z.M. Considerations for conducting systematic reviews: evaluating the performance of different methods for de-duplicating references. Syst Rev 10, 38 (2021). https://doi.org/10.1186/s13643-021-01583-y
- [BIG_SET] Own test database for DedupEndNote on portal vein thrombosis (52,828 records, with 5078 records validated)
| Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy |
|---|---|---|---|---|---|---|---|---|
| SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0% |
| | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5% |
| | DedupEndNote | 1361 | 59 | 95.8% | 436 | 0 | 100.0% | 96.8% |
| SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6% |
| | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3% |
| | DedupEndNote | 225 | 11 | 95.2% | 1177 | 2 | 99.8% | 99.1% |
| SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2% |
| | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 766 | 34 | 95.8% | 1184 | 4 | 99.7% | 98.1% |
| SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5% |
| | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 497 | 13 | 97.4% | 782 | 0 | 100.0% | 98.9% |
| McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1% |
| | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4% |
| | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2% |
| | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2% |
| | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1% |
| | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8% |
| | DedupEndNote | 2018 | 56 | 97.3% | 1056 | 0 | 100.0% | 98.2% |
| BIG_SET (4926 rec) | DedupEndNote | 3737 | 176 | 95.7% | 959 | 10 | 99.0% | 96.3% |
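The percentage columns follow from the four counts by the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / (TP + FN + TN + FP). The small sketch below (the class `DedupMetrics` is hypothetical, added here only for illustration) recomputes them, for instance for the EndNote X9 row of the Cytology screening set.

```java
final class DedupMetrics {

    /** Fractions between 0.0 and 1.0; multiply by 100 for the table's percentages. */
    record Metrics(double sensitivity, double specificity, double accuracy) {}

    static Metrics of(int truePos, int falseNeg, int trueNeg, int falsePos) {
        double sensitivity = (double) truePos / (truePos + falseNeg);
        double specificity = (double) trueNeg / (trueNeg + falsePos);
        double accuracy = (double) (truePos + trueNeg)
                / (truePos + falseNeg + trueNeg + falsePos);
        return new Metrics(sensitivity, specificity, accuracy);
    }
}
```

`DedupMetrics.of(885, 518, 452, 1)` yields roughly 0.631, 0.998 and 0.720, matching the 63.1%, 99.8% and 72.0% in that row.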
The following list is outdated (it dates from version 1.0.0) and will be updated.
- wrong DOI and journal in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different journal and starting page
- Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Transplantation 103(7 Supplement 2): S143. DOI: 10.1097/01.tp.0000576288.84252.91
- Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576288.84252.91
- wrong DOI and journal in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different journal and starting page
- Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Transplantation 103(7 Supplement 2): S171. DOI: 10.1097/01.tp.0000576492.69414.80
- Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576492.69414.80
- reversed title 3 seen as similar to reversed title of 1 and 2: same authors, year, title, starting page
- Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1178. [Web of Science]
- Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6 Supplement 1): S-1178. DOI: 10.1016/S0016-5085%2818%2933901-5 [Embase OVID]
- Cool, J., et al. (2018). "THE ASSOCIATION BETWEEN PORTAL VEIN THROMBOSIS AND OTHER VENOUS THROMBOEMBOLISM IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1179. [Web of Science]
- ClinicalTrials.gov records
Records from ClinicalTrials.gov are also available within the Cochrane Library and EMBASE, but the format of the data is quite different. DedupEndNote changes the data of these records to a common format when it imports them so that deduplication can work. The deduplicated output is also standardized:
- Reference Type: Journal Article
- Authors: (empty)
- Journal: https://clinicaltrials.gov
- Pages: the ClinicalTrials.gov ID (e.g. NCT06923007)
- URL: the first URL is for ClinicalTrials.gov (e.g. https://clinicaltrials.gov/study/NCT06923007)
- the other fields are from the first record in a deduplication set
- Input file size: The maximum size of the input file is limited to 150MB.
- Input file format: only EndNote RIS file (at present)
- Input file encoding: The program assumes that the input file is encoded as UTF-8.
- The program uses a bibliographic point of view: an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
- If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
- Each input file must be an export from ONE EndNote database: the ID fields are used internally for identifying the records, so they have to be unique. When comparing 2 files the ID fields may be common between the 2 files.
- The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). Deduplicating records from other databases is not guaranteed to work.
- Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared (and possibly deduplicated).
- Bibliographic databases are not always very accurate in the starting page of a publication. Because the starting page is part of the comparisons, DedupEndNote misses duplicates when bibliographic databases don't agree on the starting page (and one or both records have no DOI).
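The publication-year limitation above can be made concrete with a sketch of how candidate pairs might be generated year by year. This is an assumption-laden illustration, not the actual implementation: the class `YearWindow`, the record `Rec` and the "id1|id2" pair format are hypothetical, and records without a publication year (which are compared to all records) are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class YearWindow {

    /** Hypothetical minimal record: an ID and a publication year. */
    record Rec(String id, int year) {}

    /** Pairs each record with records from the same year and from the following year. */
    static List<String> candidatePairs(List<Rec> records) {
        TreeMap<Integer, List<Rec>> byYear = new TreeMap<>();
        for (Rec r : records) {
            byYear.computeIfAbsent(r.year(), y -> new ArrayList<>()).add(r);
        }
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<Integer, List<Rec>> entry : byYear.entrySet()) {
            List<Rec> current = entry.getValue();
            List<Rec> next = byYear.getOrDefault(entry.getKey() + 1, List.of());
            for (int i = 0; i < current.size(); i++) {
                // same publication year
                for (int j = i + 1; j < current.size(); j++) {
                    pairs.add(current.get(i).id() + "|" + current.get(j).id());
                }
                // following publication year
                for (Rec n : next) {
                    pairs.add(current.get(i).id() + "|" + n.id());
                }
            }
        }
        return pairs;
    }
}
```

With records from 2013, 2015, 2016 and 2017, only the 2015/2016 and 2016/2017 pairs are produced; the 2013 record is never paired with the 2017 one, which is the ahead-of-print case described above.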