DedupEndNote

Deduplication of EndNote and Zotero RIS files:

deduplicate one file: produces a new RIS file with the unique records
deduplicate two files (NEW-RECORDS and OLD-RECORDS): deduplicates both files and produces a RIS file with the unique records from NEW-RECORDS
mark the duplicates of one file: produces a RIS file with the Label field containing the ID of the duplicate record

DedupEndNote is available at http://dedupendnote.nl:9777

Actions

Export one or two EndNote or Zotero databases as RIS file(s)
Upload the file(s)
Choose the action
Download the result file (RIS)
Import the result file into a new EndNote or Zotero database

Building your own version

DedupEndNote is a Java web application (Java 21, Spring Boot 3.5.6, fat jar). It can be started locally with:

    java -jar DedupEndNote-[VERSION].jar

and the application will be available at

    http://localhost:9777

Why DedupEndNote?

Deduplication in EndNote and Zotero misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but there remain many cases where EndNote is too unforgiving when comparing records. Some bibliographic databases offer deduplication for their own databases (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.

DedupEndNote deduplicates an EndNote or Zotero RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote or Zotero database. It is more forgiving than EndNote and Zotero themselves when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").

The program has been tested on EndNote databases with records from:

CINAHL (EBSCOHost)
ClinicalTrials.gov
Cochrane Library (Trials)
EMBASE (OVID)
Medline (OVID)
PsycINFO (OVID)
PubMed
Scopus
Web of Science

The program has been tested with files with up to 50,000 records.

What does DedupEndNote do?

1. Deduplicate

Each pair of records is compared in 5 different ways. The general rule is:

Comparison	Result	Action
1 ... 5	YES	go to next comparison if present, else mark the records as duplicates
	(insufficient data for comparison)
	NO	stop comparisons for this pair of records
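
The general rule can be sketched as a short-circuiting chain. This is a minimal illustration only, not DedupEndNote's actual code: the class `ComparisonChain`, the `Outcome` enum and the method `isDuplicate` are hypothetical names, and the sketch assumes that insufficient data is handled like YES, as the individual comparisons describe.

```java
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of the general rule; names are illustrative, not DedupEndNote's API.
final class ComparisonChain {

    enum Outcome { YES, INSUFFICIENT_DATA, NO }

    // Runs the comparisons in order and stops at the first NO.
    // YES and INSUFFICIENT_DATA both move on to the next comparison.
    static <R> boolean isDuplicate(R a, R b, List<BiFunction<R, R, Outcome>> comparisons) {
        for (BiFunction<R, R, Outcome> comparison : comparisons) {
            if (comparison.apply(a, b) == Outcome.NO) {
                return false; // stop comparisons for this pair of records
            }
        }
        return true; // no comparison answered NO: mark the records as duplicates
    }
}
```

Putting the cheapest comparison first means most non-duplicate pairs are rejected before the more expensive string similarity checks run.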

The following comparisons are used (in this order, chosen for performance reasons):

Publication year: Are they at most 1 year apart?

Preprocessing: Publication years before 1800 are removed (see insufficient data)
Insufficient data: Records without a publication year are compared to all records unless they have been identified as a duplicate.
Special cases: Cochrane Reviews are compared for the same publication year only.
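
A minimal sketch of the publication year comparison, under stated assumptions: the class and method names are hypothetical, a missing year is represented as 0, and the `cochraneReview` flag stands for "this pair consists of Cochrane Reviews" (how the program detects that is not shown here).

```java
// Hypothetical sketch; not DedupEndNote's actual code.
final class YearComparison {

    // Assumption: 0 means "no publication year". Years before 1800 are removed.
    static int preprocess(int year) {
        return year < 1800 ? 0 : year;
    }

    static boolean yearsMatch(int yearA, int yearB, boolean cochraneReview) {
        int a = preprocess(yearA);
        int b = preprocess(yearB);
        if (a == 0 || b == 0) {
            return true; // insufficient data: the answer is YES
        }
        if (cochraneReview) {
            return a == b; // Cochrane Reviews: same publication year only
        }
        return Math.abs(a - b) <= 1; // at most 1 year apart
    }
}
```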

Starting page or DOI: Are they the same?
If the starting and ending page of at least one of the publications are more than 2 pages apart, the DOIs are compared first; if the DOIs differ, or one or both are absent, the starting pages are compared. Otherwise, the starting pages are compared first.

Preprocessing: Article number is treated as a starting page if starting page itself is empty or contains "-".
Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out. URL- and HTML-encoded DOIs are decoded ('10.1002/(SICI)1098-1063(1998)8:6&lt;627::AID-HIPO5&gt;3.0.CO;2-X' becomes '10.1002/(SICI)1098-1063(1998)8:6<627::AID-HIPO5>3.0.CO;2-X'). DOIs are lowercased.
Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
Special cases: Cochrane Reviews: if both records have a DOI, only the DOIs are compared, otherwise the starting pages are compared.
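
The preprocessing steps for DOIs and starting pages could look roughly like this. It is a sketch under assumptions, not the program's implementation: the class and method names are hypothetical, only three HTML entities are decoded, and a literal "+" is protected before URL decoding because `URLDecoder` would otherwise turn it into a space.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of the DOI and starting page preprocessing.
final class StartPageAndDoi {

    static String normalizeDoi(String raw) {
        String doi = raw.trim().replace("+", "%2B"); // keep a literal '+' intact
        try {
            doi = URLDecoder.decode(doi, StandardCharsets.UTF_8); // "%28" becomes "("
        } catch (IllegalArgumentException e) {
            // Malformed percent escape: keep the value as it is.
        }
        // Assumption: only these HTML entities are handled in this sketch.
        doi = doi.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        doi = doi.replaceFirst("(?i)^https?://(dx\\.)?doi\\.org/", "");
        return doi.toLowerCase(Locale.ROOT);
    }

    // Starting pages are compared only for number: "S123" and "123" are the same.
    static String pageNumber(String startPage) {
        return startPage.replaceAll("[^0-9]", "");
    }
}
```

For example, the URL-encoded DOI "10.1016/S0016-5085%2818%2933901-5" is normalized to "10.1016/s0016-5085(18)33901-5".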

Authors: Is the Jaro-Winkler similarity of the authors > 0.67?

Preprocessing: The author "Anonymous," is treated as no author.
Preprocessing: Group author names are removed. "Author" names which contain "consortium", "grp", "group", "nct" or "study" are considered group author names.
Preprocessing: Only the first 40 authors are retained.
Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." becomes "Moorthy, R. K.").
Preprocessing: All authors from each record are joined by "; ".
Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
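
The author preprocessing can be illustrated as follows. This is a sketch with hypothetical names; the order of the steps (removing group authors before keeping the first 40) and the handling of names without a comma are assumptions.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of the author preprocessing.
final class AuthorPreprocessing {

    private static final List<String> GROUP_MARKERS =
            List.of("consortium", "grp", "group", "nct", "study");

    static boolean isGroupAuthor(String author) {
        String lower = author.toLowerCase(Locale.ROOT);
        return GROUP_MARKERS.stream().anyMatch(lower::contains);
    }

    // "Moorthy, Ranjith K." becomes "Moorthy, R. K."
    static String toInitials(String author) {
        int comma = author.indexOf(',');
        if (comma < 0) {
            return author.trim(); // assumption: names without a comma are left as they are
        }
        StringBuilder result = new StringBuilder(author.substring(0, comma).trim()).append(',');
        for (String part : author.substring(comma + 1).trim().split("[\\s.]+")) {
            if (!part.isEmpty()) {
                result.append(' ').append(part.charAt(0)).append('.');
            }
        }
        return result.toString();
    }

    // Returns the string that would be fed into the similarity comparison.
    static String joinAuthors(List<String> authors) {
        return authors.stream()
                .map(String::trim)
                .filter(a -> !a.isEmpty() && !a.replaceAll(",$", "").equalsIgnoreCase("Anonymous"))
                .filter(a -> !isGroupAuthor(a))
                .limit(40)
                .map(AuthorPreprocessing::toInitials)
                .collect(Collectors.joining("; "));
    }
}
```

An empty result corresponds to the "no authors" case, for which the answer is YES.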

Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.89?
The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles. Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.

Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
Preprocessing: In the titles of retracted publications all parts which refer to the retraction are removed. Example of such a title: "RETRACTED: Response of Breast Cancer Cells and Cancer Stem Cells to Metformin and Hyperthermia Alone or Combined (Retracted article. See vol. 20, 2025)". A publication is considered a retraction if the Title starts with "retracted", "removed" or "withdrawn", or contains "retracted article" (all case insensitive).
Insufficient data: If one of the records is a reply, erratum or comment (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).
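
To make the role of the reversed titles concrete, here is a self-contained sketch. It uses a textbook Jaro-Winkler implementation (prefix scale 0.1, prefix length at most 4), which may differ in detail from the library the program uses. The names are hypothetical, the normalization covers only the steps listed above, and treating a match in either direction as sufficient is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch: title normalization plus forward and reversed Jaro-Winkler.
final class TitleSimilarity {

    static String normalizeTitle(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replaceAll("<[^>]*>", " ")          // text between "<...>" is removed
                .replaceAll("[^\\p{L}\\p{N}]+", " ") // anything but letters and numbers becomes a space
                .trim();
    }

    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1], matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int prefix = 0, max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Assumption: the titles match if either the forward or the reversed comparison exceeds 0.89.
    static boolean titlesSimilar(String a, String b) {
        String na = normalizeTitle(a), nb = normalizeTitle(b);
        return jaroWinkler(na, nb) > 0.89 || jaroWinkler(reverse(na), reverse(nb)) > 0.89;
    }
}
```

Reversing helps when two titles differ only at the start (for example a leading article or a "RETRACTED:" prefix), because the shared ending then earns the prefix bonus.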

ISBN, ISSN or Journal: Are they the same (ISBN, ISSN) or similar (Journal)?
This rule is skipped if both records have the same DOI (that comparison was made in step 2).
The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals, ISBNs as ISSNs. All ISBNs, ISSNs and journal titles (including abbreviations) in the records are used.
If both records have an ISBN, the ISBNs are compared (stop); if both have an ISSN, the ISSNs are compared (stop); otherwise the journal titles are compared.
Abbreviated and full journal titles are compared in a sensible way (see examples below).

Preprocessing: ISBNs and ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
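
The ISBN / ISSN normalization and the splitting of combined journal titles can be sketched as below. The names are hypothetical and the regular expressions are an assumption that covers only the three forms quoted above.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the ISBN / ISSN and journal title preprocessing.
final class JournalPreprocessing {

    private static final Pattern BRACKETED = Pattern.compile("^(.+?)\\s*\\[(.+)\\]\\s*$");

    // Dashes are removed and the value is lowercased. ISBN-10 keeps its first 9 digits,
    // ISBN-13 the 9 digits starting at position 4, so both forms of one ISBN become equal.
    static String normalizeIsbnOrIssn(String raw) {
        String s = raw.replace("-", "").trim().toLowerCase(Locale.ROOT);
        if (s.length() == 13) return s.substring(3, 12);
        if (s.length() == 10) return s.substring(0, 9);
        return s; // ISSN, or a value of unexpected length
    }

    // "A [B]", "A = B" and "A / B" are split into 2 journal titles.
    static List<String> splitJournal(String journal) {
        Matcher m = BRACKETED.matcher(journal);
        if (m.matches()) {
            return List.of(m.group(1).trim(), m.group(2).trim());
        }
        String[] parts = journal.split("\\s+[=/]\\s+", 2);
        if (parts.length == 2) {
            return List.of(parts[0].trim(), parts[1].trim());
        }
        return List.of(journal.trim());
    }
}
```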

Remarks:

Comment : a publication is considered a comment if the title (fields ST and TI) contains words such as "comment" or "commentary".

Erratum : a publication is considered an erratum if the title (fields ST and TI) contains "Correction", "Corrigendum" or "Erratum".

Reply : a publication is considered a reply if the title (fields ST and TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).

T3 field : Especially EMBASE (OVID) uses this field for (1) Conference title (majority of cases), (2) an alternative journal title, and (3) original (non English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as Journals and as titles.
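
The comment, erratum and reply rules above translate naturally into regular expressions. The sketch below uses hypothetical names and assumes that all three checks are case insensitive (the text states this only for replies).

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the title-based classification.
final class PublicationKind {

    // "comment" also matches "commentary".
    private static final Pattern COMMENT =
            Pattern.compile("comment", Pattern.CASE_INSENSITIVE);
    private static final Pattern ERRATUM =
            Pattern.compile("correction|corrigendum|erratum", Pattern.CASE_INSENSITIVE);
    // Contains "reply", or "author ... respon", or is nothing but "response".
    private static final Pattern REPLY =
            Pattern.compile("reply|author.*respon|^\\s*response\\s*$", Pattern.CASE_INSENSITIVE);

    static boolean isComment(String title) {
        return COMMENT.matcher(title).find();
    }

    static boolean isErratum(String title) {
        return ERRATUM.matcher(title).find();
    }

    static boolean isReply(String title) {
        return REPLY.matcher(title).find();
    }
}
```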

2. Enrich the records

When writing the output file (except in Mark Mode), the following fields can be changed:

Author (AU):
- if the (only) author is "Anonymous", the author is omitted
DOI (DO):
- the DOIs of the removed duplicate records are copied to the saved record and deduplicated. The DOI field is important for finding the full text in EndNote.
- DOIs of the form "10.1038/ctg.2014.12", "http://dx.doi.org/10.1038/ctg.2014.12", ... are rewritten in the prescribed form "https://doi.org/10.1038/ctg.2014.12". DOIs of this form are clickable links in EndNote.
Publication year (PY):
- if the saved record has no value for its Publication year but one of the removed duplicate records has, the first non-empty Publication year of the duplicates is copied to the saved record.
Starting page (SP) and Article Number (C7):
- the article number from field C7 is put in the Pages field (SP) if the Pages field is empty or does not contain a "-", overwriting the Pages field content.
- the article number field (C7) is omitted
- if the saved record has no value for its Pages field (e.g. PubMed ahead of print publications) but one of the removed duplicate records has, the first non-empty pages of the duplicates are copied to the saved record.
- the Pages field gets an unabbreviated form: e.g. "482-91" is rewritten as "482-491".
- if the ending page is the same as the starting page, only the starting page is written ("192" instead of "192-192").
Title (TI):
- If the publication is a reply / erratum / comment / retraction, the title is replaced with the longest title from the duplicates (e.g. "Reply from the authors" is replaced by "Coagulation parameters and portal vein thrombosis in cirrhosis Reply")
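
Two of the rewrites above, the unabbreviated Pages field and the prescribed DOI form, are easy to illustrate. This is a sketch with hypothetical names; it only handles purely numeric page ranges and leaves anything else unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of two enrichment steps.
final class EnrichSketch {

    private static final Pattern PAGE_RANGE = Pattern.compile("^(\\d+)\\s*-\\s*(\\d+)$");
    private static final Pattern DOI_PREFIX =
            Pattern.compile("^https?://(dx\\.)?doi\\.org/", Pattern.CASE_INSENSITIVE);

    // "482-91" becomes "482-491"; "192-192" becomes "192".
    static String expandPages(String pages) {
        Matcher m = PAGE_RANGE.matcher(pages.trim());
        if (!m.matches()) {
            return pages; // assumption: non-numeric forms such as "S1178" are left alone
        }
        String start = m.group(1);
        String end = m.group(2);
        if (end.length() < start.length()) {
            // Borrow the missing leading digits from the starting page.
            end = start.substring(0, start.length() - end.length()) + end;
        }
        return start.equals(end) ? start : start + "-" + end;
    }

    // Rewrites a bare or prefixed DOI to the form "https://doi.org/...".
    static String toDoiUrl(String doi) {
        return "https://doi.org/" + DOI_PREFIX.matcher(doi.trim()).replaceFirst("");
    }
}
```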

The output file is a new RIS file which can be imported into a new EndNote or Zotero database.

DedupEndNote is slower than EndNote in deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).

Performance

Data are from:

[SRA] Rathbone, J., Carter, M., Hoffmann, T. et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6
The data sets are available at https://osf.io/dyvnj/
[McKeown] McKeown, S., Mir, Z.M. Considerations for conducting systematic reviews: evaluating the performance of different methods for de-duplicating references. Syst Rev 10, 38 (2021). https://doi.org/10.1186/s13643-021-01583-y
[BIG_SET] Own test database for DedupEndNote on portal vein thrombosis (52,828 records, with 5078 records validated)

Name	Tool	True pos	False neg	Sensitivity	True neg	False pos	Specificity	Accuracy
SRA: Cytology screening (1856 rec)	EndNote X9	885	518	63.1%	452	1	99.8%	72.0%
	SRA-DM	1265	139	90.1%	452	0	100.0%	92.5%
	DedupEndNote	1361	59	95.8%	436	0	100.0%	96.8%

SRA: Haematology (1415 rec)	EndNote	159	87	64.6%	1165	4	99.7%	93.6%
	SRA-DM	208	38	84.6%	1169	0	100.0%	97.3%
	DedupEndNote	225	11	95.2%	1177	2	99.8%	99.1%

SRA: Respiratory (1988 rec)	EndNote X9	410	391	51.2%	1185	2	99.8%	80.2%
	SRA-DM	674	125	84.4%	1189	0	100.0%	93.7%
	DedupEndNote	766	34	95.8%	1184	4	99.7%	98.1%

SRA: Stroke (1292 rec)	EndNote X9	372	134	73.5%	784	2	99.7%	89.5%
	SRA-DM	426	81	84.0%	785	0	100.0%	93.7%
	DedupEndNote	497	13	97.4%	782	0	100.0%	98.9%

McKeown (3130 rec)	OVID	1982	90	95.7%	1058	0	100.0%	97.1%
	EndNote	1541	531	74.4%	850	208	80.3%	76.4%
	Mendeley	1877	195	90.6%	1041	17	98.4%	93.2%
	Zotero	1473	599	71.1%	1038	20	98.1%	80.2%
	Covidence	1952	120	94.2%	1056	2	99.8%	96.1%
	Rayyan	2023	49	97.6%	1006	52	95.1%	96.8%
	DedupEndNote	2018	56	97.3%	1056	0	100.0%	98.2%

BIG_SET (4926 rec)	DedupEndNote	3737	176	95.7%	959	10	99.0%	96.3%
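
The percentage columns appear to follow the usual definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / all records. The sketch below (the record name `ConfusionCounts` is hypothetical) recomputes them from the four counts; for the DedupEndNote row of the McKeown set it yields 97.3%, 100.0% and 98.2%, matching the table.

```java
import java.util.Locale;

// Hypothetical helper for recomputing the percentage columns from the counts.
record ConfusionCounts(int truePos, int falseNeg, int trueNeg, int falsePos) {

    double sensitivity() {
        return (double) truePos / (truePos + falseNeg);
    }

    double specificity() {
        return (double) trueNeg / (trueNeg + falsePos);
    }

    double accuracy() {
        return (double) (truePos + trueNeg) / (truePos + falseNeg + trueNeg + falsePos);
    }

    // Formats a fraction as a percentage with one decimal.
    static String percent(double fraction) {
        return String.format(Locale.ROOT, "%.1f%%", 100 * fraction);
    }
}
```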

The false positive cases from BIG_SET:

The following list is old (version 1.0.0), and should / will be updated

wrong DOI and journal in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different journal and starting page
- Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Transplantation 103(7 Supplement 2): S143. DOI: 10.1097/01.tp.0000576288.84252.91
- Segovia, M. C., et al. (2019). "Combined multivisceral and renal transplant in a patient with JAK-2 mutation." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576288.84252.91
wrong DOI and journal in second of 2 EMBASE (OVID) records: same authors, year, title, DOI, different journal and starting page
- Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Transplantation 103(7 Supplement 2): S171. DOI: 10.1097/01.tp.0000576492.69414.80
- Galvao, F. H., et al. (2019). "Intestinal and multivisceral transplantation at hospital dasclinicas da faculdade De medicina da universidade De Sao Paulo (HC-FMUSP)-Brazil." Neurology 92(15 Supplement 1). DOI: 10.1097/01.tp.0000576492.69414.80
reversed title 3 seen as similar to reversed title of 1 and 2: same authors, year, title, starting page
- Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1178. [Web of Science]
- Cool, J., et al. (2018). "TRENDS IN THE PREVALENCE OF PORTAL VEIN THROMBOSIS AND ASSOCIATED MORTALITY IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6 Supplement 1): S-1178. DOI: 10.1016/S0016-5085%2818%2933901-5 [Embase OVID]
- Cool, J., et al. (2018). "THE ASSOCIATION BETWEEN PORTAL VEIN THROMBOSIS AND OTHER VENOUS THROMBOEMBOLISM IN CIRRHOSIS: ANALYSIS OF A NATIONALLY REPRESENTATIVE INPATIENT COHORT." Gastroenterology 154(6): S1178-S1179. [Web of Science]

Special cases

ClinicalTrials.gov records

Records from ClinicalTrials.gov are also available within the Cochrane Library and EMBASE, but the format of the data is quite different. DedupEndNote changes the data of these records to a common format when it imports them so that deduplication can work. The deduplicated output is also standardized:

Reference Type: Journal Article
Authors: (empty)
Journal: https://clinicaltrials.gov
Pages: the ClinicalTrials.gov ID (e.g. NCT06923007)
URL: the first URL is for ClinicalTrials.gov (e.g. https://clinicaltrials.gov/study/NCT06923007)
the other fields are from the first record in a deduplication set

Limitations

Input file size: The maximum size of the input file is limited to 150MB.
Input file format: only EndNote RIS file (at present)
Input file encoding: The program assumes that the input file is encoded as UTF-8.
The program uses a bibliographic point of view: an article or conference abstract that has been published in more than one (issue of a) journal is not considered a duplicate publication.
If authors AND (all) titles AND (all) journal names for a record use a non-Latin script, results for this record may be inaccurate.
Each input file must be an export from ONE EndNote database: the ID fields are used internally for identifying the records, so they have to be unique. When comparing 2 files the ID fields may be common between the 2 files.
The program has been developed and tested for biomedical databases (PubMed, EMBASE, ...) and some general databases (Web of Science, Scopus). Deduplicating records from other databases is not guaranteed to work.
Records for each publication year are compared to records from the same and the following year: a record from 2016 is compared to the records from 2015 (when treating the records from 2015) and from 2016 and 2017 (when treating the records from 2016). A PubMed ahead-of-print record from 2013 and a corresponding record from 2017 (when it was 'officially' published) will not be compared, and so will not be deduplicated.
Bibliographic databases are not always very accurate in the starting page of a publication. Because starting page is part of the comparisons, DedupEndNote misses the duplicates when bibliographic databases don't agree on the starting page (and one or both records have no DOIs).
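
The publication year limitation can be made concrete with a small sketch of how candidate pairs might be generated. This is an illustration under assumptions, not the program's code: the names are hypothetical, records are represented by their IDs only, and records without a publication year (which are compared to all records) are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Hypothetical sketch: each year is compared with itself and with the following year.
final class YearWindow {

    // Returns the candidate pairs as "idA|idB" strings.
    static List<String> candidatePairs(SortedMap<Integer, List<String>> idsByYear) {
        List<String> pairs = new ArrayList<>();
        for (var entry : idsByYear.entrySet()) {
            List<String> current = entry.getValue();
            List<String> next = idsByYear.getOrDefault(entry.getKey() + 1, List.of());
            for (int i = 0; i < current.size(); i++) {
                for (int j = i + 1; j < current.size(); j++) {
                    pairs.add(current.get(i) + "|" + current.get(j)); // same year
                }
                for (String other : next) {
                    pairs.add(current.get(i) + "|" + other); // following year
                }
            }
        }
        return pairs;
    }
}
```

With one record in 2013, two in 2016 and one in 2017, the 2013 record is never paired with the 2017 record, which is the ahead-of-print case described above.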
