Research text

This commit is contained in:
Henry Dowd
2025-11-29 14:54:32 +00:00
parent fb68bc869a
commit 02cdc7bac6
7 changed files with 447 additions and 210 deletions

--- Datasets/Corpus ---
-- Microsoft Research Paraphrase Corpus --
|- Primary Usage -
Paraphrase Identification
|- Content -
~6,000 sentence pairs from online news sources, labeled 1/0 (paraphrase / not paraphrase)
Relatively small and limited to news articles
|- Best For -
Initial development, benchmarking
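MRPC ships as tab-separated text (columns: Quality, #1 ID, #2 ID, #1 String, #2 String), so loading it needs only the stdlib. A minimal sketch; the two sample rows below are fabricated stand-ins for real corpus lines:

```python
import csv
import io

# Fabricated sample in the MRPC file layout; the real data lives in
# msr_paraphrase_train.txt / msr_paraphrase_test.txt from the corpus download.
sample = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t100\t101\tThe cat sat on the mat.\tA cat was sitting on the mat.\n"
    "0\t102\t103\tStocks fell sharply today.\tThe museum opens at nine.\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# (label, sentence_1, sentence_2) triples; label 1 = paraphrase, 0 = not
pairs = [(int(row["Quality"]), row["#1 String"], row["#2 String"])
         for row in reader]
```

Swapping `io.StringIO(sample)` for an `open(...)` on the downloaded file reads the real corpus the same way.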
-- PAN Plagiarism Detection Corpus --
|- Primary Usage -
Plagiarism Detection Research
|- Content -
Different years of competitions with various text types
PAN-PC-10/11 External plagiarism detection
PAN-SS-13 Single source plagiarism with complexity levels
Academic texts, web content, student essays
|- Best For -
Advanced evaluation on realistic plagiarism
-- Quora Question Pairs --
|- Primary Usage -
Duplicate question detection (paraphrases, likely unintentional)
|- Content -
400,000+ question pairs from Quora
Labeled duplicate / not duplicate
Focuses on questions, not general statements
|- Best For -
Training data-intensive models
-- SemEval Paraphrase Datasets --
|- Primary Usage -
Paraphrase and semantic similarity
|- Content -
Datasets from SemEval competitions
SemEval-2012 Task 6: Semantic textual similarity
SemEval-2015 Task 1: Paraphrase & Semantic similarity
SemEval-2017 Task 1: Semantic similarity
News, headlines, image captions
Well annotated, multiple languages
Fragmented across "Tasks"
|- Best For -
Multi domain evaluation (different criteria)
-- P4P (Paraphrase for Plagiarism) Corpus --
|- Primary Usage -
Plagiarism detection with paraphrasing
|- Content -
Academic texts with paraphrased plagiarism
Source-plagiarism mappings, paraphrase types
Academic writing
Limited availability + academic focus
Access by request only
|- Best For -
Paraphrase-specific plagiarism research
-- ParaBank 2.0 --
|- Primary Usage -
Paraphrase generation and evaluation
|- Content -
Large-scale paraphrase pairs generated from parallel text
Multiple paraphrase candidates per sentence
Machine generated, may contain noise
|- Best For -
Large scale training and data augmentation
-- Twitter Paraphrase Corpus --
|- Primary Usage -
Short text paraphrase detection
|- Content -
tweet pairs annotated for paraphrase relationship
Paraphrase scores and binary labels
~20,000 pairs
Informal language, real-world usage
Short text, informal grammar (difficult to parse)
|- Best For -
Informal language and social media applications
-- UW-Stanford Paraphrase Corpus --
|- Primary Usage -
Paraphrase detection
|- Content -
Sentence pairs from news & web text
Paraphrase judgments
~3,000 pairs (Very small dataset)
High quality human judgments (good test set)
|- Best For -
High-precision evaluation, testing
-- Sheffield Plagiarism Corpus --
|- Primary Usage -
Academic plagiarism detection research
|- Content -
Original academic texts and publications
Modified documents with various types of plagiarism
Detailed markup of plagiarised sections with source mappings
Plagiarism types
Verbatim copying, paraphrasing, structural plagiarism
Academic writing and student essays
Realistic plagiarism + obfuscation types
|- Best For -
Evaluating real academic plagiarism detection
-- New PAN25 --
|- 3 parts -
spot_check
train
validation

--- Parsers ---
-- SpaCy --
|- Philosophy -
Fast, easy to use, Industrial Strength
|- Models -
Pre-trained, "en_core_web_trf", "en_core_web_sm/md/lg"
|- Outputs -
Provides Universal Dependencies (UD) labels by default
|- Use -
Very easy to use; a few lines of code parse a sentence and its dependencies
|- Integration -
Works very well with the Python data science stack; networkx integrates easily
|- Performance -
Larger models are very accurate, smaller ones are very fast
-- Stanford Stanza --
|- Philosophy -
Pure Python + modern successor to Stanford CoreNLP
Research oriented, highly accurate
|- Models -
Pre-trained models on different treebanks; can handle complex grammatical structures
|- Output -
Universal dependencies
|- Use -
Clean API that fits well with Python
|- Integration -
Pure Python, integrates well with other Python libraries
|- Performance -
Accuracy is among the best available; speed is slower than spaCy's non-transformer models
-- Allen NLP --
|- Philosophy -
Research first, built on Python
Designed for state-of-the-art deep learning models in NLP; the go-to choice if you plan to modify or train your own models
|- Models -
Biaffine dependency parser is the most widely used (highly accurate)
|- Output -
Universal Dependencies
|- Use -
More difficult than SpaCy or Stanza; requires a better understanding of the library's abstractions
|- Integration -
Excellent in the Python ecosystem; overkill if you only need a pre-trained model
|- Performance -
State-of-the-art accuracy, inference speed can be slower due to model complexity
-- Spark NLP --
|- Philosophy -
Built on Apache Spark for scalable, distributed NLP processing
For massive datasets in distributed computing environments
|- Models -
Provides its own annotated models
Often transformer architectures
|- Output -
Universal Dependencies
|- Use -
Good if familiar with spark ML API
Setup more involved than pure python libraries
|- Integration -
Ideal for big data pipelines
Unnecessarily heavy for single-corpus analysis
|- Performance -
Very high accuracy
Designed for speed and scale on clusters
-- Overall --
|- SpaCy or Stanza -
SpaCy is much simpler to set up and use - a robust, highly accurate system can be running quickly and relatively simply
Stanza is more complex to set up - maximises baseline parsing accuracy in exchange for speed and simplicity
-- Choice --
|- SpaCy -
Use SpaCy initially; if parsing errors appear, switch to Stanza to check the issues

--- Project Overview ---
3 Layer detection
|- 1. Surface-Level similarity (Direct copying)
|- 2. Text Analysis
\_ (a) Semantic Similarity (Keywords/contextual meaning)
\_ (b) Syntactic Similarity (grammatical structure)
|- 3. Paraphrase Detection
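Layer 1 (surface-level similarity) needs no parsing at all; a stdlib-only sketch using word n-gram Jaccard overlap (function names hypothetical):

```python
def ngrams(text, n=3):
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard overlap of word n-grams; 1.0 = identical surface form."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

High scores flag near-verbatim copying cheaply; layers 2 and 3 then handle the paraphrased pairs this measure misses.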
-- 0. Foundation --