research text

Henry Dowd
2025-11-29 14:54:32 +00:00
parent fb68bc869a
commit 02cdc7bac6
7 changed files with 447 additions and 210 deletions

@@ -0,0 +1,75 @@
--- Parsers ---
-- spaCy --
|- Philosophy -
Fast, easy to use, industrial strength
|- Models -
Pre-trained: "en_core_web_trf" (transformer), "en_core_web_sm/md/lg"
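|- Output -
Provides dependency labels by default; the English models use ClearNLP-style labels (e.g. "dobj", "pobj") that are close to, but not identical to, Universal Dependencies (UD)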
|- Use -
Very easy to use, a few lines of code parse a sentence and expose its dependencies
|- Integration -
Works very well with the Python data science stack, networkx integrates easily
|- Performance -
Larger models are very accurate, smaller models are very fast
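|- Example -
A minimal sketch of the "few lines of code" claim, assuming spaCy, networkx and the "en_core_web_sm" model are installed. The helper names (dependency_edges, parse_with_spacy, to_graph) are made up for these notes and are not part of either library.

```python
def dependency_edges(tokens):
    """Return (head, label, child) triples, skipping the root token.

    Works on any objects exposing .text, .i, .dep_ and .head, which is the
    shape of spaCy Token objects. In spaCy the root token is its own head.
    Nodes are written as "text-index" so repeated words stay distinct.
    """
    return [
        (f"{tok.head.text}-{tok.head.i}", tok.dep_, f"{tok.text}-{tok.i}")
        for tok in tokens
        if tok.head.i != tok.i
    ]


def parse_with_spacy(text, model="en_core_web_sm"):
    """Parse text with spaCy and return its dependency triples."""
    # Imported here so dependency_edges stays usable without spaCy installed.
    import spacy

    nlp = spacy.load(model)  # assumes the model has already been downloaded
    return dependency_edges(nlp(text))


def to_graph(edges):
    """Build a directed head -> child graph with networkx."""
    import networkx as nx

    graph = nx.DiGraph()
    for head, label, child in edges:
        graph.add_edge(head, child, label=label)
    return graph
```

Calling to_graph(parse_with_spacy("The cat sat on the mat.")) should give one node per token and one edge per non-root token; the exact labels depend on the model used.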
-- Stanford Stanza --
|- Philosophy -
Pure Python, the modern successor to Stanford CoreNLP
Research oriented, highly accurate
|- Models -
Pre-trained models for different treebanks, can handle complex grammatical structures
|- Output -
Universal Dependencies
|- Use -
Good API, clean and fits well with Python
|- Integration -
Pure Python, integrates well with other Python libraries
|- Performance -
Accuracy is among the best available, speed is slower than spaCy non-transformer models
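|- Example -
A rough sketch of reading dependencies out of Stanza, assuming the package is installed and stanza.download("en") has been run once. The helper names (sentence_edges, parse_with_stanza) are made up for these notes. Note that Stanza word ids are 1-based and a head of 0 marks the root.

```python
def sentence_edges(sentence):
    """Return (head, label, child) triples for one Stanza-style sentence.

    Expects sentence.words, where each word has .id (1-based), .text,
    .head (id of the governing word, 0 for the root) and .deprel.
    """
    words = sentence.words
    return [
        (f"{words[w.head - 1].text}-{w.head}", w.deprel, f"{w.text}-{w.id}")
        for w in words
        if w.head != 0
    ]


def parse_with_stanza(text, lang="en"):
    """Parse text with the default Stanza pipeline and return all triples."""
    # Imported here so sentence_edges stays usable without Stanza installed.
    import stanza

    nlp = stanza.Pipeline(lang)  # assumes the models were downloaded earlier
    doc = nlp(text)
    return [edge for sent in doc.sentences for edge in sentence_edges(sent)]
```

The default pipeline loads more processors than parsing needs; restricting them via the processors argument is possible but the required list varies by language and Stanza version.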
-- AllenNLP --
|- Philosophy -
Research first, built on Python and PyTorch
Designed for "state of the art" deep learning models in NLP: "Go to choice if you plan to modify or train your own models"
|- Models -
Biaffine dependency parser is the most widely used (highly accurate)
|- Output -
Universal Dependencies
|- Use -
More difficult than spaCy or Stanza, requires a better understanding of the library's abstractions
|- Integration -
Excellent in the Python ecosystem, but overkill if only a pre-trained model is needed
|- Performance -
State-of-the-art accuracy, inference speed can be slower due to model complexity
-- Spark NLP --
|- Philosophy -
Built on Apache Spark for scalable, distributed NLP processing
For massive datasets in a distributed computing environment
|- Models -
Provides its own pre-trained models
Often transformer architectures
|- Output -
Universal Dependencies
|- Use -
Good if familiar with the Spark ML API
Setup is more involved than for pure Python libraries
|- Integration -
Ideal for big data pipelines
Unnecessarily heavy for single corpus analysis
|- Performance -
Very high accuracy
Designed for speed and scale on clusters
-- Overall --
|- spaCy or Stanza -
spaCy is much simpler to set up and use: a robust, highly accurate system that can be set up quickly and relatively simply
Stanza requires a more complex setup: it maximises baseline parsing accuracy at the cost of speed and simplicity
-- Choice --
|- spaCy -
Use spaCy initially; if parsing errors appear, switch to Stanza to check the issues
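|- Example -
One possible way to do the "switch to Stanza to check" step: compare which head each parser assigns to each token. This is a sketch only. The function names are made up for these notes, it assumes both parsers produced the same tokenisation, and it compares heads rather than labels because the two parsers may use different label sets.

```python
def attachment_disagreements(heads_a, heads_b):
    """Return the token positions where two parsers chose different heads.

    heads_a and heads_b map each token position to the position of its head
    (None for the root). Both must use the same indexing convention, so
    Stanza's 1-based ids need shifting before comparing with spaCy's 0-based .i.
    """
    if heads_a.keys() != heads_b.keys():
        raise ValueError("parsers tokenised the sentence differently")
    return sorted(pos for pos in heads_a if heads_a[pos] != heads_b[pos])


def agreement_rate(heads_a, heads_b):
    """Fraction of tokens given the same head by both parsers."""
    if not heads_a:
        return 1.0
    diffs = attachment_disagreements(heads_a, heads_b)
    return 1 - len(diffs) / len(heads_a)
```

Sentences with a low agreement rate are the ones worth inspecting by hand; agreement alone does not say which parser is right.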