research text

2025-11-29 14:54:32 +00:00
parent fb68bc869a
commit 02cdc7bac6
7 changed files with 447 additions and 210 deletions
--- a/research/Parser_research.txt
+++ b/research/Parser_research.txt
@@ -0,0 +1,75 @@
+--- Parsers ---
+    -- spaCy --
+        |- Philosophy -
+            Fast, easy to use, industrial-strength
+        |- Models -
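+            Pre-trained pipelines: "en_core_web_sm/md/lg" (CPU-optimised) and "en_core_web_trf" (transformer-based)
+        |- Output -
+            English pipelines emit ClearNLP-style dependency labels, which are close to but not identical to Universal Dependencies (UD); coarse POS tags (token.pos_) follow UD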
+        |- Use -
+            Very easy to use; a few lines of code parse a sentence and expose its dependencies
+        |- Integration -
+            Works very well with the Python data science stack; networkx integrates easily
+        |- Performance -
+            Larger models are very accurate; smaller models are very fast
+
+    -- Stanford Stanza --
+        |- Philosophy -
+            Pure Python; the modern successor to Stanford CoreNLP
+            Research-oriented, highly accurate
+        |- Models -
+            Pre-trained models for different treebanks; can handle complex grammatical structures
+        |- Output -
+            Universal dependencies
+        |- Use -
+            Clean API that fits well with Python
+        |- Integration -
+            Pure Python; integrates well with other Python libraries
+        |- Performance -
+            Accuracy is among the best available; speed is slower than spaCy's non-transformer models
+    
+    -- AllenNLP --
+        |- Philosophy -
+            Research-first, built on Python (PyTorch)
+            Designed for state-of-the-art deep learning models in NLP; the "go-to choice if you plan to modify or train your own models"
+        |- Models -
+            Biaffine dependency parser is the most widely used (highly accurate)
+        |- Output -
+            Universal Dependencies
+        |- Use -
+            More difficult than spaCy or Stanza; requires a better understanding of the library's abstractions
+        |- Integration -
+            Excellent within the Python ecosystem, but overkill if only a pre-trained model is needed
+        |- Performance -
+            State-of-the-art accuracy, inference speed can be slower due to model complexity
+    
+    -- Spark NLP --
+        |- Philosophy -
+            Built on Apache Spark for scalable, distributed NLP processing
+            Aimed at massive datasets in a distributed computing environment
+        |- Models -
+            Provides its own pre-trained annotator models
+            Often transformer-based architectures
+        |- Output -
+            Universal Dependencies
+        |- Use -
+            Good if familiar with the Spark ML API
+            Setup is more involved than for pure-Python libraries
+        |- Integration -
+            Ideal for big data pipelines
+            Unnecessarily heavy for single-corpus analysis
+        |- Performance -
+            Very high accuracy
+            Designed for speed and scale on clusters
+    
+    -- Overall --
+        |- spaCy or Stanza -
+            spaCy is much simpler to set up and use: a robust, highly accurate system that can be running quickly and with relatively little effort
+            Stanza is more complex and requires a more involved setup: it maximises baseline parsing accuracy at the cost of speed and simplicity
+
+    -- Choice --
+        |- spaCy -
+            Use spaCy initially; if parsing errors appear, switch to Stanza to check the issues
+
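The choice recorded in the notes (start with spaCy, switch to Stanza when a parse looks wrong) could be sketched in a parser-agnostic way. This is only a minimal sketch under stated assumptions: `Parse`, `looks_suspect`, `parse_with_fallback` and both stub parsers are hypothetical names made up for illustration, not part of spaCy or Stanza. Real code would wrap each library behind the same one-argument callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (head_index, relation, child_index). The root token uses head_index -1,
# which is a convention of this sketch, not of any library.
Arc = Tuple[int, str, int]


@dataclass
class Parse:
    parser: str        # name of the parser that produced this result
    tokens: List[str]
    arcs: List[Arc]


Parser = Callable[[str], Parse]


def looks_suspect(parse: Parse) -> bool:
    """Hypothetical sanity check: a well-formed dependency tree has exactly
    one root and exactly one incoming arc per token."""
    roots = [arc for arc in parse.arcs if arc[0] == -1]
    children = sorted(arc[2] for arc in parse.arcs)
    return len(roots) != 1 or children != list(range(len(parse.tokens)))


def parse_with_fallback(
    sentence: str,
    primary: Parser,
    fallback: Parser,
    is_suspect: Callable[[Parse], bool] = looks_suspect,
) -> Parse:
    """Run the primary parser; re-parse with the fallback if the result looks wrong."""
    result = primary(sentence)
    if is_suspect(result):
        result = fallback(sentence)
    return result


def fast_stub(sentence: str) -> Parse:
    """Stand-in for the fast parser; deliberately returns no arcs."""
    return Parse("fast_stub", sentence.split(), [])


def careful_stub(sentence: str) -> Parse:
    """Stand-in for the slower parser; attaches every token to the first one."""
    tokens = sentence.split()
    arcs = [(-1, "root", 0)] + [(0, "dep", i) for i in range(1, len(tokens))]
    return Parse("careful_stub", tokens, arcs)


result = parse_with_fallback("Dogs chase cats", fast_stub, careful_stub)
print(result.parser, result.arcs)
```

Keeping both parsers behind one callable signature means the check in `looks_suspect` can be replaced later (for example by comparing the two parsers' outputs) without touching the calling code.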
+
+