research/Corpus_research.txt

--- Datasets/Corpus ---
-- Microsoft Research Paraphrase Corpus --

|- Primary Usage -
Paraphrase Identification

|- Content -
~6,000 sentence pairs from online news sources, labeled (1, 0)
Relatively small and limited to news articles

|- Best For -
Initial development, benchmarking

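For (1, 0)-labeled pairs like these, a word-overlap baseline is a common first benchmark before heavier models; a minimal sketch (the 0.5 threshold and whitespace tokenisation are illustrative assumptions, not part of the corpus):

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def predict_paraphrase(a: str, b: str, threshold: float = 0.5) -> int:
    """Label a pair 1 (paraphrase) or 0 via word overlap."""
    return 1 if jaccard(a, b) >= threshold else 0

pair = ("The company posted record profits this quarter.",
        "Record profits were posted by the company this quarter.")
print(predict_paraphrase(*pair))  # high lexical overlap -> 1
```

Such a baseline is weak on true paraphrases with little lexical overlap, which is exactly the gap heavier models are meant to close.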
-- PAN Plagiarism Detection Corpus --

|- Primary Usage -
Plagiarism Detection Research

|- Content -
Different years of competitions with various text types
PAN-PC-10/11: External plagiarism detection
PAN-SS-13: Single-source plagiarism with complexity levels
Academic texts, web content, student essays

|- Best For -
Advanced evaluation on realistic plagiarism

-- Quora Question Pairs --

|- Primary Usage -
Duplicate questions (paraphrased, likely unintentional)

|- Content -
400,000+ question pairs from Quora
Labeled duplicate / not duplicate
Focuses on questions, not general statements

|- Best For -
Training data-intensive models

-- SemEval Paraphrase Datasets --

|- Primary Usage -
Paraphrase and semantic similarity

|- Content -
Datasets from SemEval competitions
SemEval-2012 Task 6: Semantic textual similarity
SemEval-2015 Task 1: Paraphrase & semantic similarity
SemEval-2017 Task 1: Semantic similarity
News, headlines, image captions
Well annotated, multiple languages
Fragmented across "Tasks"

|- Best For -
Multi-domain evaluation (different criteria)

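The STS tasks score each pair on a graded similarity scale and are conventionally evaluated with the Pearson correlation between system scores and gold annotations; a stdlib-only sketch (the score lists below are invented examples):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [0.0, 1.5, 2.0, 3.5, 5.0]   # annotator similarity scores (0-5 scale)
pred = [0.2, 1.0, 2.5, 3.0, 4.8]   # hypothetical system scores
print(round(pearson(gold, pred), 3))
```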
-- P4P (Paraphrase for Plagiarism) Corpus --

|- Primary Usage -
Plagiarism detection with paraphrasing

|- Content -
Academic texts with paraphrased plagiarism
Source-plagiarism mappings, paraphrase types
Academic writing
Limited availability + academic focus
Access limited to requests

|- Best For -
Paraphrase-specific plagiarism research

-- ParaBank 2.0 --

|- Primary Usage -
Paraphrase generation and evaluation

|- Content -
Large-scale paraphrase pairs generated from parallel text
Multiple paraphrase candidates per sentence
Machine generated, may contain noise

|- Best For -
Large-scale training and data augmentation

-- Twitter Paraphrase Corpus --

|- Primary Usage -
Short-text paraphrase detection

|- Content -
Tweet pairs annotated for paraphrase relationship
Paraphrase scores and binary labels
~20,000 pairs
Informal language, real-world usage
Short text, informal grammar (difficult to parse)

|- Best For -
Informal language and social media applications

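Because of the informal language, tweet pairs usually need light normalisation before any matching; a minimal sketch (the specific rules are assumptions for illustration, not something the corpus prescribes):

```python
import re

def normalize_tweet(text: str) -> str:
    """Lowercase, strip URLs and @mentions, drop '#' from hashtags, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = re.sub(r"#(\w+)", r"\1", text)       # keep hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tweet("@user Check this out!! #Breaking https://t.co/xyz"))
# -> "check this out!! breaking"
```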
-- UW-Stanford Paraphrase Corpus --

|- Primary Usage -
Paraphrase detection

|- Content -
Sentence pairs from news & web text
Paraphrase judgments
~3,000 pairs (very small dataset)
High-quality human judgments (good test set)

|- Best For -
High-precision evaluation, testing

-- Sheffield Plagiarism Corpus --

|- Primary Usage -
Academic plagiarism detection research

|- Content -
Original academic texts and publications
Modified documents with various types of plagiarism
Detailed markup of plagiarised sections with source mappings
Plagiarism types:
Verbatim copying, paraphrasing, structural plagiarism
Academic writing and student essays
Realistic plagiarism + obfuscation types

|- Best For -
Evaluating real academic plagiarism detection

-- New PAN25 --

|- 3 parts -
spot_check
train
validation


research/Parser_research.txt

--- Parsers ---
-- SpaCy --

|- Philosophy -
Fast, easy to use, industrial strength

|- Models -
Pre-trained: "en_core_web_trf", "en_core_web_sm/md/lg"

|- Outputs -
Provides Universal Dependencies (UD) labels by default

|- Use -
Very easy to use; a few lines of code to parse a sentence and its dependencies

|- Integration -
Works very well with the Python data science stack; networkx integrates easily

|- Performance -
Larger models are very accurate, smaller ones are very fast

-- Stanford Stanza --

|- Philosophy -
Pure Python + modern version of Stanford CoreNLP
Research oriented, highly accurate

|- Models -
Pre-trained models on different treebanks can handle complex grammatical structures

|- Output -
Universal Dependencies

|- Use -
Good API; clean and fits well with Python

|- Integration -
Pure Python, integrates well with other Python libraries

|- Performance -
Accuracy is among the best available; speed is slower than SpaCy's non-transformer models

-- AllenNLP --

|- Philosophy -
Research first, built on Python
Designed for state-of-the-art deep learning models in NLP; "go-to choice if you plan to modify or train your own models"

|- Models -
Biaffine dependency parser is most widely used (highly accurate)

|- Output -
Universal Dependencies

|- Use -
More difficult than SpaCy or Stanza; requires a better understanding of the library's abstractions

|- Integration -
Excellent in the Python ecosystem; overkill for pre-trained models

|- Performance -
State-of-the-art accuracy; inference speed can be slower due to model complexity

-- Spark NLP --

|- Philosophy -
Built on Apache Spark for scalable, distributed NLP processing
For massive datasets in a distributed computing environment

|- Models -
Provides its own annotator models
Often transformer architectures

|- Output -
Universal Dependencies

|- Use -
Good if familiar with the Spark ML API
Setup more involved than pure Python libraries

|- Integration -
Ideal for big data pipelines
Unnecessarily heavy for single-corpus analysis

|- Performance -
Very high accuracy
Designed for speed and scale on clusters

-- Overall --

|- SpaCy or Stanza -
SpaCy is much simpler to set up and use - a robust, highly accurate system can be set up quickly and relatively simply
Stanza is more complex and requires more setup - maximises baseline parsing accuracy in exchange for speed and simplicity

-- Choice --

|- SpaCy -
Use SpaCy initially; if parsing errors appear, switch to Stanza to check the issues
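Whichever parser ends up being used, downstream comparison code can stay parser-agnostic by consuming plain (child, deprel, head) triples, since all of the candidates emit Universal Dependencies; a toy sketch with hand-written triples (not real parser output):

```python
# Compare two sentences' dependency structures using plain UD-style triples.
# The triples below are illustrative, hand-written parses.

def relation_set(triples):
    """Reduce a parse to a set of (deprel, head_word, child_word) relations."""
    return {(dep, head.lower(), child.lower()) for child, dep, head in triples}

def structural_overlap(t1, t2):
    """Jaccard overlap of the two parses' labeled relations."""
    a, b = relation_set(t1), relation_set(t2)
    return len(a & b) / len(a | b) if a | b else 0.0

# (child, deprel, head) triples, as any UD parser (SpaCy, Stanza, ...) could emit
sent1 = [("cat", "nsubj", "sat"), ("the", "det", "cat"), ("mat", "obl", "sat")]
sent2 = [("cat", "nsubj", "sat"), ("a", "det", "cat"), ("rug", "obl", "sat")]
print(round(structural_overlap(sent1, sent2), 2))  # shared nsubj relation only
```

Because the measure only sees triples, swapping SpaCy for Stanza later would not change this comparison code.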

research/project_overview.txt

--- Project Overview ---

3-layer detection

|- 1. Surface-level similarity (direct copying)
|- 2. Text analysis
\_ (a) Semantic similarity (keywords/contextual meaning)
\_ (b) Syntactic similarity (grammatical structure)
|- 3. Paraphrase detection

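Layer 1 (surface-level similarity) is often approximated with character n-gram overlap; a minimal sketch (the n-gram size and Jaccard measure are illustrative choices, not a fixed design decision):

```python
def char_ngrams(text: str, n: int = 5):
    """Set of character n-grams, a common fingerprint for direct-copy detection."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def surface_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard overlap of character n-gram sets (1.0 = likely verbatim copy)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(surface_similarity("copied text here", "copied text here"))  # 1.0
```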
-- 0. Foundation --