Research text

This commit is contained in:
Henry Dowd
2025-11-29 14:54:32 +00:00
parent fb68bc869a
commit 02cdc7bac6
7 changed files with 447 additions and 210 deletions

--- Datasets/Corpus ---
-- Microsoft Research Paraphrase Corpus --
|- Primary Usage -
Paraphrase Identification
|- Content -
~6,000 sentence pairs from online news sources, labeled 1/0 (paraphrase / not paraphrase)
Relatively small and limited to news articles
|- Best For -
Initial development, benchmarking
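MRPC ships as tab-separated text (columns: Quality, #1 ID, #2 ID, #1 String, #2 String), so loading it needs only the stdlib. A minimal sketch; the two sample rows below are fabricated stand-ins for real corpus lines:

```python
import csv
import io

# Fabricated sample in the MRPC file layout; the real data lives in
# msr_paraphrase_train.txt / msr_paraphrase_test.txt from the corpus download.
sample = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t100\t101\tThe cat sat on the mat.\tA cat was sitting on the mat.\n"
    "0\t102\t103\tStocks fell sharply today.\tThe museum opens at nine.\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# (label, sentence_1, sentence_2) triples; label 1 = paraphrase, 0 = not
pairs = [(int(row["Quality"]), row["#1 String"], row["#2 String"])
         for row in reader]
```

Swapping `io.StringIO(sample)` for an `open(...)` on the downloaded file reads the real corpus the same way.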
-- PAN Plagiarism Detection Corpus --
|- Primary Usage -
Plagiarism Detection Research
|- Content -
Different years of competitions with various text types
PAN-PC-10/11 External plagiarism detection
PAN-SS-13 Single source plagiarism with complexity levels
Academic texts, web content, student essays
|- Best For -
Advanced evaluation on realistic plagiarism
-- Quora Question Pairs --
|- Primary Usage -
Duplicate question detection (paraphrases, likely unintentional)
|- Content -
400,000+ question pairs from Quora
Labeled duplicate / not duplicate
Focuses on questions, not general statements
|- Best For -
Training data-intensive models
-- SemEval Paraphrase Datasets --
|- Primary Usage -
Paraphrase and semantic similarity
|- Content -
Datasets from SemEval competitions
SemEval-2012 Task 6: Semantic textual similarity
SemEval-2015 Task 1: Paraphrase & Semantic similarity
SemEval-2017 Task 1: Semantic similarity
News, headlines, image captions
Well annotated, multiple languages
Fragmented across "Tasks"
|- Best For -
Multi domain evaluation (different criteria)
-- P4P (Paraphrase for Plagiarism) Corpus --
|- Primary Usage -
Plagiarism detection with paraphrasing
|- Content -
Academic texts with paraphrased plagiarism
Source-plagiarism mappings, paraphrase types
Academic writing
Limited availability + academic focus
Access by request only
|- Best For -
Paraphrase-specific plagiarism research
-- ParaBank 2.0 --
|- Primary Usage -
Paraphrase generation and evaluation
|- Content -
Large-scale paraphrase pairs generated from parallel text
Multiple paraphrase candidates per sentence
Machine generated, may contain noise
|- Best For -
Large scale training and data augmentation
-- Twitter Paraphrase Corpus --
|- Primary Usage -
Short text paraphrase detection
|- Content -
tweet pairs annotated for paraphrase relationship
Paraphrase scores and binary labels
~20,000 pairs
Informal language, real-world usage
Short text, informal grammar (difficult to parse)
|- Best For -
Informal language and social media applications
-- UW-Stanford Paraphrase Corpus --
|- Primary Usage -
Paraphrase detection
|- Content -
Sentence pairs from news & web text
Paraphrase judgments
~3,000 pairs (Very small dataset)
High quality human judgments (good test set)
|- Best For -
High-precision evaluation, testing
-- Sheffield Plagiarism Corpus --
|- Primary Usage -
Academic plagiarism detection research
|- Content -
Original academic texts and publications
Modified documents with various types of plagiarism
Detailed markup of plagiarised sections with source mappings
Plagiarism types
Verbatim copying, paraphrasing, structural plagiarism
Academic writing and student essays
Realistic plagiarism + obfuscation types
|- Best For -
Evaluating real academic plagiarism detection
-- New PAN25 --
|- 3 parts -
spot_check
train
validation

--- Parsers ---
-- SpaCy --
|- Philosophy -
Fast, easy to use, Industrial Strength
|- Models -
Pre-trained, "en_core_web_trf", "en_core_web_sm/md/lg"
|- Outputs -
Provides Universal Dependencies (UD) labels by default
|- Use -
Very easy to use; a few lines of code parse a sentence and its dependencies
|- Integration -
Works very well with the Python data science stack; networkx integrates easily
|- Performance -
Larger models are very accurate, smaller ones are very fast
-- Stanford Stanza --
|- Philosophy -
Pure Python + modern successor to Stanford CoreNLP
Research oriented, highly accurate
|- Models -
Pre-trained models on different treebanks; can handle complex grammatical structures
|- Output -
Universal dependencies
|- Use -
Clean API that fits well with Python
|- Integration -
Pure Python, integrates well with other Python libraries
|- Performance -
Accuracy is among the best available; speed is slower than spaCy's non-transformer models
-- Allen NLP --
|- Philosophy -
Research first, built on Python
Designed for state-of-the-art deep learning models in NLP; the go-to choice if you plan to modify or train your own models
|- Models -
Biaffine dependency parser is the most widely used (highly accurate)
|- Output -
Universal Dependencies
|- Use -
More difficult than SpaCy or Stanza; requires a better understanding of the library's abstractions
|- Integration -
Excellent in the Python ecosystem; overkill if you only need a pre-trained model
|- Performance -
State-of-the-art accuracy, inference speed can be slower due to model complexity
-- Spark NLP --
|- Philosophy -
Built on Apache Spark for scalable, distributed NLP processing
For massive datasets in distributed computing environments
|- Models -
Provides its own annotated models
Often transformer architectures
|- Output -
Universal Dependencies
|- Use -
Good if familiar with spark ML API
Setup more involved than pure python libraries
|- Integration -
Ideal for big data pipelines
Unnecessarily heavy for single-corpus analysis
|- Performance -
Very high accuracy
Designed for speed and scale on clusters
-- Overall --
|- SpaCy or Stanza -
SpaCy is much simpler to set up and use - a robust, highly accurate system can be running quickly and relatively simply
Stanza is more complex to set up - maximises baseline parsing accuracy in exchange for speed and simplicity
-- Choice --
|- SpaCy -
Use SpaCy initially; if parsing errors appear, switch to Stanza to check the issues

--- Project Overview ---
3 Layer detection
|- 1. Surface-Level similarity (Direct copying)
|- 2. Text Analysis
\_ (a) Semantic Similarity (Keywords/contextual meaning)
\_ (b) Syntactic Similarity (grammatical structure)
|- 3. Paraphrase Detection
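Layer 1 (surface-level similarity) needs no parsing at all; a stdlib-only sketch using word n-gram Jaccard overlap (function names hypothetical):

```python
def ngrams(text, n=3):
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard overlap of word n-grams; 1.0 = identical surface form."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

High scores flag near-verbatim copying cheaply; layers 2 and 3 then handle the paraphrased pairs this measure misses.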
-- 0. Foundation --