--- Datasets/Corpora ---
|
|
-- Microsoft Research Paraphrase Corpus --
|
|
|- Primary Usage -
|
|
Paraphrase Identification
|
|
|- Content -
|
|
~6,000 sentence pairs from online news sources, labeled (1,0)
|
|
Relatively small and limited to news articles
|
|
|- Best For -
|
|
Initial development, benchmarking
|
|
|
|
-- PAN Plagiarism Detection Corpus --
|
|
|- Primary Usage -
|
|
Plagiarism Detection Research
|
|
|- Content -
|
|
Different years of competitions with various text types
|
|
PAN-PC-10/11: External plagiarism detection
|
|
PAN-SS-13: Single-source plagiarism with complexity levels
|
|
Academic texts, web content, student essays
|
|
|- Best For -
|
|
Advanced evaluation on realistic plagiarism
|
|
|
|
-- Quora Question Pairs --
|
|
|- Primary Usage -
|
|
Duplicate question detection (paraphrased, likely unintentional)
|
|
|- Content -
|
|
400,000+ question pairs from Quora
|
|
Labeled duplicate or not duplicate
|
|
Focuses on questions, not general statements
|
|
|- Best For -
|
|
Training data-intensive models
|
|
|
|
-- SemEval Paraphrase Datasets --
|
|
|- Primary Usage -
|
|
Paraphrase and semantic similarity
|
|
|- Content -
|
|
Datasets from SemEval competitions
|
|
SemEval-2012 Task 6: Semantic textual similarity
|
|
SemEval-2015 Task 1: Paraphrase & Semantic similarity
|
|
SemEval-2017 Task 1: Semantic similarity
|
|
News, headlines, image captions
|
|
Well-annotated, multiple languages
|
|
Fragmented across "Tasks"
|
|
|- Best For -
|
|
Multi-domain evaluation (different criteria)
|
|
|
|
-- P4P (Paraphrase for Plagiarism) Corpus --
|
|
|- Primary Usage -
|
|
Plagiarism detection with paraphrasing
|
|
|- Content -
|
|
Academic texts with paraphrased plagiarism
|
|
Source-plagiarism mappings, paraphrase types
|
|
Academic writing
|
|
Limited availability + academic focus
|
|
Access available on request only
|
|
|- Best For -
|
|
Paraphrase-specific plagiarism research
|
|
|
|
-- ParaBank 2.0 --
|
|
|- Primary Usage -
|
|
Paraphrase generation and evaluation
|
|
|- Content -
|
|
Large-scale paraphrase pairs generated from parallel text
|
|
Multiple paraphrase candidates per sentence
|
|
Machine generated, may contain noise
|
|
|- Best For -
|
|
Large-scale training and data augmentation
|
|
|
|
-- Twitter Paraphrase Corpus --
|
|
|- Primary Usage -
|
|
Short text paraphrase detection
|
|
|- Content -
|
|
Tweet pairs annotated for paraphrase relationship
|
|
Paraphrase scores and binary labels
|
|
~20,000 pairs
|
|
Informal language, real-world usage
|
|
Short text, informal grammar (difficult to parse)
|
|
|- Best For -
|
|
Informal language and social media applications
|
|
|
|
-- UW-Stanford Paraphrase Corpus --
|
|
|- Primary Usage -
|
|
Paraphrase detection
|
|
|- Content -
|
|
Sentence pairs from news & web text
|
|
Paraphrase judgments
|
|
~3,000 pairs (very small dataset)
|
|
High-quality human judgments (good test set)
|
|
|- Best For -
|
|
High-precision evaluation, testing
|
|
|
|
-- Sheffield Plagiarism Corpus --
|
|
|- Primary Usage -
|
|
Academic plagiarism detection research
|
|
|- Content -
|
|
Original academic texts and publications
|
|
Modified documents with various types of plagiarism
|
|
Detailed markup of plagiarised sections with source mappings
|
|
Plagiarism types
|
|
Verbatim copying, paraphrasing, structural plagiarism
|
|
Academic writing and student essays
|
|
Realistic plagiarism + obfuscation types
|
|
|- Best For -
|
|
Evaluating real academic plagiarism detection
|
|
|
|
-- New PAN25 --
|
|
|- 3 Parts -
|
|
spot_check
|
|
train
|
|
validation
|