research text
This commit is contained in:

2  .gitignore  vendored
@@ -4,3 +4,5 @@ venv
 .idea
 __pycache__
 data/raw
+MSR
+PAN25
41  interim_report  Normal file

@@ -0,0 +1,41 @@
1. Project Objectives

Primary Aim: The main goal of this project is to build a software system that can identify highly similar sentences, with a focus on catching plagiarism and paraphrased text. Most current detection tools are good at flagging direct text matches, but they often fail when words are swapped out even though the core message stays the same. To address this, I am building a detector that examines two things: syntactic structure (grammar) and semantic context (meaning).
Technical Approach: The system uses a dual-branch architecture. The structural branch uses a dependency parser (likely spaCy or Stanford CoreNLP) to extract lexical dependencies. These dependencies are converted into a graph representation, which effectively turns the sentence into a tree structure for comparison. Similarity is then assessed by computing the largest common mapping between the two parse trees.
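The graph construction step can be sketched without running a parser. A minimal sketch, with hand-written (child, head) pairs standing in for spaCy's `token.head` output (the pairs below are illustrative, not real parser output):

```python
# Build a head -> children adjacency list from (child, head) dependency pairs.
# The pairs are hand-written stand-ins for parser output; no model is loaded.
from collections import defaultdict

def build_dependency_graph(deps):
    """deps: iterable of (child, head) pairs; returns {head: [children]}."""
    graph = defaultdict(list)
    for child, head in deps:
        graph[head].append(child)
    return dict(graph)

# "the cat sat on the mat" with heads chosen by hand for illustration
deps = [("the", "cat"), ("cat", "sat"), ("on", "sat"),
        ("the", "mat"), ("mat", "on")]
graph = build_dependency_graph(deps)
# The root "sat" dominates the subject and the prepositional phrase:
# {"cat": ["the"], "sat": ["cat", "on"], "mat": ["the"], "on": ["mat"]}
```

Each head keys its list of dependents, which is the tree structure the comparison step needs.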
The semantic branch addresses the limitations of structural analysis. Since syntax can vary significantly while meaning remains constant, this branch compares the semantic content of the sentences. This ensures the system captures "conceptual plagiarism" that purely syntactic or lexical methods might miss.
Deliverables: The primary artifact will be a software tool that accepts sentence pairs and outputs a similarity score derived from a weighted fusion of graph matching and semantic analysis. The system will be benchmarked against the Microsoft Research Paraphrase Corpus (MSRP) to quantify its accuracy in real-world scenarios. A secondary deliverable is an evaluation of different graph comparison algorithms to determine the most effective method for NLP-based structural matching.
2. Description of Work Completed

The development environment is established in Python. The data pipeline is operational, with the Microsoft Research Paraphrase Corpus (MSRP) selected as the ground-truth dataset, ingested, and pre-processed for analysis. A technical review of dependency parsers was conducted to select the optimal tool for the syntactic branch. Initial coding phases are complete, including notebooks for data exploration, baseline semantic experiments, and the skeleton structure for the fusion model.
2.1 Evidence of Work Completed
Data Engineering: The MSRP dataset was successfully ingested and cleaned. Exploratory Data Analysis (EDA) showed a significant class imbalance: roughly 67% of the sentence pairs are labeled as paraphrases, leaving only 33% as non-paraphrases. This matters for evaluation because a trivial classifier could guess "paraphrase" every time and still achieve 67% accuracy. I will therefore rely on F1-score and Precision/Recall rather than raw accuracy to judge performance.
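The imbalance argument can be made concrete. A minimal sketch with toy labels mirroring the 67/33 split (no real MSRP data is used):

```python
# A majority-class baseline on a 67/33 split: accuracy looks decent,
# but the F1-score of the non-paraphrase class exposes the failure.
labels = [1] * 67 + [0] * 33   # 1 = paraphrase, 0 = non-paraphrase
preds = [1] * 100              # always guess "paraphrase"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

def f1(labels, preds, positive):
    tp = sum(p == y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive != y for p, y in zip(preds, labels))
    fn = sum(y == positive != p for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy)              # 0.67 despite never spotting a non-paraphrase
print(f1(labels, preds, 1))  # inflated for the majority class
print(f1(labels, preds, 0))  # 0.0 for the minority class
```

Per-class F1 (or macro-averaged F1) therefore gives the honest picture here.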
Pre-processing functions were written to clean the text, including tokenization and handling of special characters. I also checked the dataset complexity. The creators of the corpus removed any pairs with a Levenshtein distance lower than 8, which means there are no "trivial" paraphrases where only one or two words differ.
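For reference, the edit-distance measure mentioned above can be sketched with the standard dynamic-programming recurrence (shown here over characters for brevity):

```python
# Levenshtein distance: minimum insertions, deletions, and substitutions
# needed to turn string a into string b (classic two-row DP).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```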
Architecture Design: A modular architecture was designed to process the data:

Syntactic Branch: Research was conducted on parsers to evaluate their ability to generate robust dependency graphs. The focus was on finding a parser that balances speed with the detail required for tree comparison.

Semantic Branch: Baseline strategies were implemented using vector-based approaches.

Fusion Model: The code structure for the final classifier has been initiated.
2.2 Literature Review

Corpora: The Microsoft Research Paraphrase Corpus (MSRP) was identified as the primary benchmark. It is a standard dataset for sentence-level similarity tasks, consisting of 5,801 sentence pairs extracted from news sources. Crucially, the "non-paraphrase" examples in this dataset still exhibit high lexical overlap, making them "hard negatives" that confuse simple string-matching algorithms. This characteristic makes MSRP an ideal stress test for my proposed structure-plus-meaning approach.
Parsers: A comparative review of NLP libraries (spaCy, Stanford CoreNLP, AllenNLP) highlighted a critical trade-off between processing speed and the richness of the linguistic annotations. This review informed the selection of tools capable of supporting the "largest common subtree" analysis. The decision process prioritized libraries that allow easy extraction of dependency heads and children, a prerequisite for the planned graph construction.
3. Future Work

Semantic Analysis Implementation: One major issue is polysemy, where words have different meanings depending on the context. Simple word vectors struggle with this. To address it, I plan to upgrade the semantic module to use contextual embeddings from a pre-trained Transformer model such as BERT or RoBERTa. Unlike static embeddings, these models create dynamic representations of words based on their surroundings, which should provide much better precision. I also plan to experiment with SIF (Smooth Inverse Frequency) sentence embeddings as a computationally lighter alternative to compare against the Transformer results.
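As a reference point for the SIF experiments, here is a minimal sketch of the weighting step. The two-dimensional word vectors and word frequencies are made up for illustration, and the full method also subtracts the first principal component across sentences, which is omitted here:

```python
# SIF-style sentence embedding: average word vectors weighted by a / (a + p(w)),
# so frequent words like "the" contribute less. Toy vectors/frequencies only.
A = 1e-3  # SIF smoothing parameter

def sif_embedding(words, vectors, freq):
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for w in words:
        weight = A / (A + freq[w])          # rare words get weight near 1
        for k, x in enumerate(vectors[w]):
            out[k] += weight * x
    return [x / len(words) for x in out]

vectors = {"cat": [1.0, 0.0], "sit": [0.0, 1.0], "the": [0.5, 0.5]}
freq = {"cat": 0.001, "sit": 0.002, "the": 0.05}  # "the" is frequent
emb = sif_embedding(["the", "cat", "sit"], vectors, freq)
```

Sentence similarity would then be the cosine between two such embeddings.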
Syntactic Analysis Implementation: The syntactic module relies on a graph comparison algorithm. My main plan is to use Tree Edit Distance (TED), which calculates the minimum number of edits needed to turn one parse tree into another. I also plan to look into the Largest Common Substructure algorithm as an alternative. This part of the project involves high algorithmic complexity, so the calculations will need to be optimized to run efficiently on the full corpus.
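The TED idea can be sketched with the classic forest recursion over ordered trees (memoized but worst-case exponential; an optimized algorithm such as Zhang-Shasha would be needed for the full corpus). Nodes are hypothetical (label, children) tuples with unit edit costs:

```python
# Ordered tree edit distance via the standard forest recursion:
# at each step, delete the rightmost root, insert it, or match both roots.
from functools import lru_cache

def size(tree):
    label, children = tree
    return 1 + sum(size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert everything remaining
    if not f2:
        return sum(size(t) for t in f1)   # delete everything remaining
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + c1, f2) + 1,    # delete root of last tree in f1
        forest_dist(f1, f2[:-1] + c2) + 1,    # insert root of last tree in f2
        forest_dist(f1[:-1], f2[:-1])         # match the two roots
        + forest_dist(c1, c2) + (l1 != l2),   # relabel cost 1 if labels differ
    )

def ted(t1, t2):
    return forest_dist((t1,), (t2,))

cat_mat = ("sit", (("cat", ()), ("mat", ())))
cat_rug = ("sit", (("cat", ()), ("rug", ())))
print(ted(cat_mat, cat_mat))  # 0
print(ted(cat_mat, cat_rug))  # 1
```

A distance of 0 means identical parse trees; normalizing by tree size would turn this into a similarity in [0, 1] for the fusion step.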
Model Training and Evaluation: The final step is training a supervised machine learning model, such as Logistic Regression or an SVM. This model will take the outputs from the semantic and syntactic modules and use them as features to generate a final probability score (0-100%). Evaluation will cover standard metrics (Precision, Recall, and F1-score) to ensure the system catches plagiarism without triggering too many false positives.
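The fusion step can be illustrated with a hand-weighted logistic combination; the weights below are placeholders for what Logistic Regression or an SVM would learn from MSRP training data:

```python
# Combine semantic and syntactic similarities (both in [0, 1]) into one
# probability with a logistic function. Weights/bias are illustrative only.
import math

def fused_score(semantic_sim, syntactic_sim, w_sem=4.0, w_syn=2.0, bias=-3.0):
    z = w_sem * semantic_sim + w_syn * syntactic_sim + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid -> probability in (0, 1)

print(round(fused_score(0.9, 0.8) * 100))  # 90 (agreeing branches -> high %)
print(round(fused_score(0.1, 0.2) * 100))  # 10 (low similarity -> low %)
```

Thresholding this probability gives the paraphrase/non-paraphrase decision, and sweeping the threshold trades precision against recall.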
@@ -10,6 +10,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"id": "12579bf734bb1a92",
"metadata": {
"ExecuteTime": {
@@ -17,11 +18,13 @@
"start_time": "2025-11-23T13:53:56.325948Z"
}
},
"outputs": [],
"source": [
"import spacy\n",
"from spacy import displacy\n",
"from IPython.display import display, HTML\n",
"import torch\n",
"\n",
"nlp = spacy.load(\"en_core_web_md\") # Medium size model\n",
"\n",
@@ -30,9 +33,7 @@
" \"On the mat, the cat was sitting.\",\n",
" \"A completely different sentence about something else.\"\n",
"]"
],
"outputs": [],
"execution_count": 1
]
},
{
"cell_type": "markdown",
@@ -44,6 +45,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e003ac06a58cfbb4",
"metadata": {
"ExecuteTime": {
@@ -51,14 +53,6 @@
"start_time": "2025-11-23T13:54:12.896440Z"
}
},
"source": [
"\n",
"for sent in test_sentences:\n",
" doc = nlp(sent)\n",
" print(f\"Sentence: {sent}\")\n",
" print(f\"Tokens: {[token.text for token in doc]}\")\n",
" print(\"---\")\n"
],
"outputs": [
{
"name": "stdout",
@@ -76,10 +70,18 @@
]
}
],
"execution_count": 2
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5e488a878a5cfccb",
"metadata": {
"ExecuteTime": {
@@ -87,6 +89,39 @@
"start_time": "2025-11-23T13:55:22.744266Z"
}
},
|
||||
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"Sentence: The cat sat on the mat.\n",
"--- Direct Sentence ---\n",
"the cat sat on the mat.\n",
"--- Semantic Sentence ---\n",
"cat sit mat\n",
"--- Syntactic Sentence ---\n",
"the cat sit on the mat .\n",
"--------------------------------------------------\n",
"Sentence: On the mat, the cat was sitting.\n",
"--- Direct Sentence ---\n",
"on the mat, the cat was sitting.\n",
"--- Semantic Sentence ---\n",
"mat cat sit\n",
"--- Syntactic Sentence ---\n",
"on the mat , the cat be sit .\n",
"--------------------------------------------------\n",
"Sentence: A completely different sentence about something else.\n",
"--- Direct Sentence ---\n",
"a completely different sentence about something else.\n",
"--- Semantic Sentence ---\n",
"completely different sentence\n",
"--- Syntactic Sentence ---\n",
"a completely different sentence about something else .\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"\n",
"class TextPreprocessor:\n",
@@ -157,44 +192,11 @@
"# print(\"--- Syntactic Analysis ---\")\n",
"# print(f\"Preprocessed Sentence: {preprocessor.syntactic_analysis(sent)}\")\n",
"# print(\"-\" * 50)"
],
|
||||
"execution_count": 3
]
},
|
||||
{
"cell_type": "code",
"execution_count": 6,
"id": "83fc18c9de2e354",
"metadata": {
"ExecuteTime": {
@@ -202,24 +204,6 @@
"start_time": "2025-11-23T13:55:33.565711Z"
}
},
"source": [
"\n",
"def extract_parse_tree(text):\n",
" doc = nlp(text)\n",
"\n",
" print(f\"Sentence: {text}\")\n",
" print(\"\\nDependency Parse Tree:\")\n",
" print(\"-\" * 50)\n",
"\n",
" for token in doc:\n",
" print(f\"{token.text:<12} {token.dep_:<12} {token.head.text:<12} {[child.text for child in token.children]}\")\n",
"\n",
" return doc\n",
"\n",
"for sentence in processed_syntactic:\n",
" doc = extract_parse_tree(sentence)\n",
" print(\"\\n\" + \"=\"*60 + \"\\n\")"
],
"outputs": [
{
"name": "stdout",
@@ -273,7 +257,24 @@
]
}
],
"execution_count": 4
]
},
|
||||
{
"cell_type": "markdown",
@@ -285,6 +286,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e413238c1af12f62",
"metadata": {
"ExecuteTime": {
@@ -292,22 +294,6 @@
"start_time": "2025-11-23T13:56:21.702279Z"
}
},
"source": [
"\n",
"\n",
"def visualize_parse_tree(text):\n",
" doc = nlp(text)\n",
" html = displacy.render(doc, style=\"dep\", jupyter=False, options={\"distance\": 100})\n",
" display(HTML(html))\n",
"\n",
"\n",
"\n",
"for sentence in processed_syntactic:\n",
" print(f\"Sentence: {sentence}\")\n",
" print(\"---\")\n",
" print(f\"Processed Sentence: {sentence}\")\n",
" visualize_parse_tree(sentence)"
],
"outputs": [
{
"name": "stdout",
||||
@@ -320,11 +306,8 @@
},
{
"data": {
"text/plain": [
"<IPython.core.display.HTML object>"
],
"text/html": [
[displacy dependency-parse SVG omitted: only the auto-generated element ids changed in this commit]
]
},
"metadata": {},
"output_type": "display_data"
},
|
||||
{
"name": "stdout",
@@ -414,11 +397,8 @@
},
{
"data": {
"text/plain": [
"<IPython.core.display.HTML object>"
],
"text/html": [
[displacy dependency-parse SVG omitted: only the auto-generated element ids changed in this commit]
]
},
"metadata": {},
"output_type": "display_data"
},
|
||||
{
"name": "stdout",
@@ -521,11 +501,8 @@
},
{
"data": {
"text/plain": [
"<IPython.core.display.HTML object>"
],
"text/html": [
[displacy dependency-parse SVG omitted: only the auto-generated element ids changed in this commit]
]
|
||||
" <textPath xlink:href=\"#arrow-e2bf7e2546b1463e841e53291bfb9bb2-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
|
||||
" </text>\n",
|
||||
" <path class=\"displacy-arrowhead\" d=\"M445.0,104.0 L453.0,92.0 437.0,92.0\" fill=\"currentColor\"/>\n",
|
||||
"</g>\n",
|
||||
"\n",
|
||||
"<g class=\"displacy-arrow\">\n",
|
||||
" <path class=\"displacy-arc\" id=\"arrow-645c3d0343ff46cfb12d7ba372193893-0-4\" stroke-width=\"2px\" d=\"M470,102.0 C470,52.0 545.0,52.0 545.0,102.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||||
" <path class=\"displacy-arc\" id=\"arrow-e2bf7e2546b1463e841e53291bfb9bb2-0-4\" stroke-width=\"2px\" d=\"M470,102.0 C470,52.0 545.0,52.0 545.0,102.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||||
" <textPath xlink:href=\"#arrow-645c3d0343ff46cfb12d7ba372193893-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
|
||||
" <textPath xlink:href=\"#arrow-e2bf7e2546b1463e841e53291bfb9bb2-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
|
||||
" </text>\n",
|
||||
" <path class=\"displacy-arrowhead\" d=\"M545.0,104.0 L553.0,92.0 537.0,92.0\" fill=\"currentColor\"/>\n",
|
||||
"</g>\n",
|
||||
"\n",
|
||||
"<g class=\"displacy-arrow\">\n",
|
||||
" <path class=\"displacy-arc\" id=\"arrow-645c3d0343ff46cfb12d7ba372193893-0-5\" stroke-width=\"2px\" d=\"M570,102.0 C570,52.0 645.0,52.0 645.0,102.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||||
" <path class=\"displacy-arc\" id=\"arrow-e2bf7e2546b1463e841e53291bfb9bb2-0-5\" stroke-width=\"2px\" d=\"M570,102.0 C570,52.0 645.0,52.0 645.0,102.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||||
" <textPath xlink:href=\"#arrow-645c3d0343ff46cfb12d7ba372193893-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">advmod</textPath>\n",
|
||||
" <textPath xlink:href=\"#arrow-e2bf7e2546b1463e841e53291bfb9bb2-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">advmod</textPath>\n",
|
||||
" </text>\n",
|
||||
" <path class=\"displacy-arrowhead\" d=\"M645.0,104.0 L653.0,92.0 637.0,92.0\" fill=\"currentColor\"/>\n",
|
||||
"</g>\n",
|
||||
"</svg>"
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.HTML object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data",
|
||||
"jetTransient": {
|
||||
"display_id": null
|
||||
}
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"execution_count": 6
|
||||
"source": [
|
||||
"\n",
|
||||
"\n",
|
||||
"def visualize_parse_tree(text):\n",
|
||||
" doc = nlp(text)\n",
|
||||
" html = displacy.render(doc, style=\"dep\", jupyter=False, options={\"distance\": 100})\n",
|
||||
" display(HTML(html))\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"for sentence in processed_syntactic:\n",
|
||||
" print(f\"Sentence: {sentence}\")\n",
|
||||
" print(\"---\")\n",
|
||||
    "    print(f\"Processed Sentence: {sentence}\")\n",
|
||||
" visualize_parse_tree(sentence)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
@@ -631,9 +623,21 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -1,35 +1,55 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1638b7b97e3bd6f",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-11-22T11:40:21.711998Z",
|
||||
"start_time": "2025-11-22T11:40:20.129376Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import spacy\n",
|
||||
"nlp = spacy.load(\"en_core_web_md\") # Medium model"
|
||||
],
|
||||
"id": "1638b7b97e3bd6f",
|
||||
"outputs": [],
|
||||
"execution_count": 11
|
||||
    "nlp = spacy.load(\"en_core_web_lg\") # Large model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "markdown",
|
||||
"source": "Test word vectors",
|
||||
"id": "b79941bf4553fd6"
|
||||
"id": "b79941bf4553fd6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Test word vectors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "8a3c4314a90086fe",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-11-22T11:47:39.286432Z",
|
||||
"start_time": "2025-11-22T11:47:39.271377Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"outputs": [
|
||||
{
|
||||
"ename": "ValueError",
|
||||
"evalue": "[E010] Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the docs:\nhttps://spacy.io/usage/models",
|
||||
"output_type": "error",
|
||||
"traceback": [
|
||||
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
||||
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
|
||||
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 9\u001b[39m\n\u001b[32m 7\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m word2 \u001b[38;5;129;01min\u001b[39;00m words:\n\u001b[32m 8\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m word1 != word2:\n\u001b[32m----> \u001b[39m\u001b[32m9\u001b[39m similarity = \u001b[43mnlp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mvocab\u001b[49m\u001b[43m[\u001b[49m\u001b[43mword1\u001b[49m\u001b[43m]\u001b[49m\u001b[43m.\u001b[49m\u001b[43msimilarity\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnlp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mvocab\u001b[49m\u001b[43m[\u001b[49m\u001b[43mword2\u001b[49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 10\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mword1\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m - \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mword2\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00msimilarity\u001b[38;5;132;01m:\u001b[39;00m\u001b[33m.3f\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m)\n",
|
||||
"\u001b[36mFile \u001b[39m\u001b[32m~/code/plagiarism-detector/.venv/lib/python3.13/site-packages/spacy/lexeme.pyx:146\u001b[39m, in \u001b[36mspacy.lexeme.Lexeme.similarity\u001b[39m\u001b[34m()\u001b[39m\n",
|
||||
"\u001b[36mFile \u001b[39m\u001b[32m~/code/plagiarism-detector/.venv/lib/python3.13/site-packages/spacy/lexeme.pyx:164\u001b[39m, in \u001b[36mspacy.lexeme.Lexeme.vector_norm.__get__\u001b[39m\u001b[34m()\u001b[39m\n",
|
||||
"\u001b[36mFile \u001b[39m\u001b[32m~/code/plagiarism-detector/.venv/lib/python3.13/site-packages/spacy/lexeme.pyx:176\u001b[39m, in \u001b[36mspacy.lexeme.Lexeme.vector.__get__\u001b[39m\u001b[34m()\u001b[39m\n",
|
||||
"\u001b[31mValueError\u001b[39m: [E010] Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the docs:\nhttps://spacy.io/usage/models"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def test_word_vectors(word):\n",
|
||||
" print(word, nlp.vocab[word].vector.shape)\n",
|
||||
@@ -43,62 +63,27 @@
|
||||
" print(f\"{word1} - {word2}: {similarity:.3f}\")\n",
|
||||
"\n",
|
||||
"\n"
|
||||
],
|
||||
"id": "8a3c4314a90086fe",
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"cat - dog: 1.000\n",
|
||||
"cat - feline: 0.363\n",
|
||||
"cat - feral: 0.483\n",
|
||||
"cat - vehicle: 0.078\n",
|
||||
"cat - car: 0.193\n",
|
||||
"dog - cat: 1.000\n",
|
||||
"dog - feline: 0.363\n",
|
||||
"dog - feral: 0.483\n",
|
||||
"dog - vehicle: 0.078\n",
|
||||
"dog - car: 0.193\n",
|
||||
"feline - cat: 0.363\n",
|
||||
"feline - dog: 0.363\n",
|
||||
"feline - feral: 0.412\n",
|
||||
"feline - vehicle: 0.180\n",
|
||||
"feline - car: 0.050\n",
|
||||
"feral - cat: 0.483\n",
|
||||
"feral - dog: 0.483\n",
|
||||
"feral - feline: 0.412\n",
|
||||
"feral - vehicle: 0.175\n",
|
||||
"feral - car: 0.161\n",
|
||||
"vehicle - cat: 0.078\n",
|
||||
"vehicle - dog: 0.078\n",
|
||||
"vehicle - feline: 0.180\n",
|
||||
"vehicle - feral: 0.175\n",
|
||||
"vehicle - car: 0.205\n",
|
||||
"car - cat: 0.193\n",
|
||||
"car - dog: 0.193\n",
|
||||
"car - feline: 0.050\n",
|
||||
"car - feral: 0.161\n",
|
||||
"car - vehicle: 0.205\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"execution_count": 15
|
||||
]
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "markdown",
|
||||
"source": "Simple averaging",
|
||||
"id": "8f32b5695f554268"
|
||||
"id": "8f32b5695f554268",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Simple averaging"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "68a6757447e4a1c7",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-11-18T23:45:03.085563Z",
|
||||
"start_time": "2025-11-18T23:45:03.082190Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def sentence_similarity_avg(sent1, sent2):\n",
|
||||
" doc1 = nlp(sent1)\n",
|
||||
@@ -118,35 +103,46 @@
|
||||
" #cosine similarity\n",
|
||||
" from sklearn.metrics.pairwise import cosine_similarity\n",
|
||||
" return cosine_similarity([avg1], [avg2])[0][0]\n"
|
||||
],
|
||||
"id": "68a6757447e4a1c7",
|
||||
"outputs": [],
|
||||
"execution_count": 3
|
||||
]
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "markdown",
|
||||
    "source": "SIF - Smooth Inverse Frequency",
|
||||
"id": "a9c3aa050f5bc0fe"
|
||||
"id": "a9c3aa050f5bc0fe",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
    "SIF - Smooth Inverse Frequency"
|
||||
]
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "code",
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"id": "c100956f89d9b581",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def sentence_similarity_sif(sent1, sent2):\n",
|
||||
" doc1 = nlp(sent1)\n",
|
||||
" doc2 = nlp(sent2)"
|
||||
],
|
||||
"id": "c100956f89d9b581"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"name": "python3",
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"display_name": "Python 3 (ipykernel)"
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
109
research/Corpus_research.txt
Normal file
@@ -0,0 +1,109 @@
--- Datasets/Corpus ---

-- Microsoft Research Paraphrase Corpus --
|- Primary Usage -
Paraphrase Identification
|- Content -
~6,000 sentence pairs from online news sources, labeled (1, 0)
Relatively small and limited to news articles
|- Best For -
Initial development, benchmarking
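Benchmarking against a labeled corpus such as MSRP reduces to thresholding the similarity score and scoring the predictions. A minimal sketch (the scores, labels, and threshold below are illustrative placeholders, not real MSRP results):

```python
# Hypothetical sketch: scoring a similarity function against MSRP-style
# labeled pairs (1 = paraphrase, 0 = not). Scores/labels are made up.

def evaluate(scores, labels, threshold=0.5):
    """Threshold similarity scores into binary predictions, return (accuracy, F1)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

scores = [0.91, 0.40, 0.75, 0.20, 0.66]  # placeholder detector outputs
labels = [1, 0, 1, 0, 0]                 # placeholder gold labels
acc, f1 = evaluate(scores, labels)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

The threshold itself would be tuned on a held-out split rather than fixed at 0.5.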

-- PAN Plagiarism Detection Corpus --
|- Primary Usage -
Plagiarism Detection Research
|- Content -
Different years of competitions with various text types
PAN-PC-10/11: External plagiarism detection
PAN-SS-13: Single-source plagiarism with complexity levels
Academic texts, web content, student essays
|- Best For -
Advanced evaluation on realistic plagiarism

-- Quora Question Pairs --
|- Primary Usage -
Duplicate Questions (paraphrased, likely unintentional)
|- Content -
400,000+ question pairs from Quora
Labeled duplicate / not duplicate
Focuses on questions, not general statements
|- Best For -
Training data-intensive models

-- SemEval Paraphrase Datasets --
|- Primary Usage -
Paraphrase and semantic similarity
|- Content -
Datasets from SemEval competitions
SemEval-2012 Task 6: Semantic textual similarity
SemEval-2015 Task 1: Paraphrase & semantic similarity
SemEval-2017 Task 1: Semantic similarity
News, headlines, image captions
Well annotated, multiple languages
Fragmented across "Tasks"
|- Best For -
Multi-domain evaluation (different criteria)

-- P4P (Paraphrase for Plagiarism) Corpus --
|- Primary Usage -
Plagiarism detection with paraphrasing
|- Content -
Academic texts with paraphrased plagiarism
Source-plagiarism mappings, paraphrase types
Academic writing
Limited availability + academic focus
Access limited to requests
|- Best For -
Paraphrase-specific plagiarism research

-- ParaBank 2.0 --
|- Primary Usage -
Paraphrase generation and evaluation
|- Content -
Large-scale paraphrase pairs generated from parallel text
Multiple paraphrase candidates per sentence
Machine generated, may contain noise
|- Best For -
Large-scale training and data augmentation

-- Twitter Paraphrase Corpus --
|- Primary Usage -
Short-text paraphrase detection
|- Content -
Tweet pairs annotated for paraphrase relationship
Paraphrase scores and binary labels
~20,000 pairs
Informal language, real-world usage
Short text, informal grammar (difficult to parse)
|- Best For -
Informal language and social media applications

-- UW-Stanford Paraphrase Corpus --
|- Primary Usage -
Paraphrase detection
|- Content -
Sentence pairs from news & web text
Paraphrase judgments
~3,000 pairs (very small dataset)
High-quality human judgments (good test set)
|- Best For -
High-precision evaluation, testing

-- Sheffield Plagiarism Corpus --
|- Primary Usage -
Academic plagiarism detection research
|- Content -
Original academic texts and publications
Modified documents with various types of plagiarism
Detailed markups of plagiarised sections with source mappings
Plagiarism types:
Verbatim copying, paraphrasing, structural plagiarism
Academic writing and student essays
Realistic plagiarism + obfuscation types
|- Best For -
Evaluating real academic plagiarism detection

-- New PAN25 --
|- 3 parts -
spot_check
train
validation
75
research/Parser_research.txt
Normal file
@@ -0,0 +1,75 @@
--- Parsers ---

-- SpaCy --
|- Philosophy -
Fast, easy to use, industrial strength
|- Models -
Pre-trained: "en_core_web_trf", "en_core_web_sm/md/lg"
|- Outputs -
Provides Universal Dependencies (UD) labels by default
|- Use -
Very easy to use; a few lines of code to parse a sentence and its dependencies
|- Integration -
Works very well with the Python data science stack; networkx integrates easily
|- Performance -
Larger models very accurate, smaller are very fast
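Since UD parsers emit (head, relation, child) dependencies, a first-pass structural comparison can be sketched as set overlap between dependency triples. The triples below are hand-written stand-ins for real parser output, not actual spaCy results:

```python
# Hypothetical sketch: comparing two sentences' dependency structure as
# sets of (head, relation, child) triples, like those a UD parser emits.

def dependency_overlap(triples_a, triples_b):
    """Jaccard overlap between two sets of dependency triples."""
    a, b = set(triples_a), set(triples_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hand-written triples for illustration only
sent1 = [("ran", "nsubj", "cat"), ("cat", "det", "the"), ("ran", "advmod", "quickly")]
sent2 = [("ran", "advmod", "quickly"), ("ran", "nsubj", "cat"), ("cat", "det", "the")]
sent3 = [("sat", "nsubj", "dog"), ("dog", "det", "a")]

print(dependency_overlap(sent1, sent2))  # same edges, different order -> 1.0
print(dependency_overlap(sent1, sent3))  # no shared edges -> 0.0
```

A fuller version would match subtrees rather than flat triples, which is where the graph/tree mapping algorithms come in.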

-- Stanford Stanza --
|- Philosophy -
Pure Python + modern version of Stanford CoreNLP
Research oriented, highly accurate
|- Models -
Pre-trained models on different treebanks; can handle complex grammatical structures
|- Output -
Universal Dependencies
|- Use -
Good API; clean and fits well with Python
|- Integration -
Pure Python, integrates well with other Python libraries
|- Performance -
Accuracy is among the best available; speed is slower than spaCy's non-transformer models

-- AllenNLP --
|- Philosophy -
Research first, built on Python
Designed for state-of-the-art deep learning models in NLP; "go-to choice if you plan to modify or train your own models"
|- Models -
Biaffine dependency parser is most widely used (highly accurate)
|- Output -
Universal Dependencies
|- Use -
More difficult than SpaCy or Stanza; requires a better understanding of the library's abstractions
|- Integration -
Excellent in the Python ecosystem; for pre-trained models it is overkill
|- Performance -
State-of-the-art accuracy; inference speed can be slower due to model complexity

-- Spark NLP --
|- Philosophy -
Built on Apache Spark for scalable, distributed NLP processing
For massive datasets in a distributed computing environment
|- Models -
Provides its own annotated models
Often transformer-based architectures
|- Output -
Universal Dependencies
|- Use -
Good if familiar with the Spark ML API
Setup more involved than pure Python libraries
|- Integration -
Ideal for big data pipelines
Unnecessarily heavy for single-corpus analysis
|- Performance -
Very high accuracy
Designed for speed and scale on clusters

-- Overall --
|- SpaCy or Stanza -
SpaCy is much simpler to set up and use - a robust, highly accurate system can be set up quickly and relatively simply
Stanza is more complex and requires more setup - maximises baseline parsing accuracy in exchange for speed and simplicity

-- Choice --
|- SpaCy -
Use SpaCy initially; if parsing errors appear, switch to Stanza to check the issues
10
research/project_overview.txt
Normal file
@@ -0,0 +1,10 @@
--- Project Overview ---
3 Layer detection
|- 1. Surface-Level similarity (Direct copying)
|- 2. Text Analysis
\_ (a) Semantic Similarity (Keywords/contextual meaning)
\_ (b) Syntactic Similarity (grammatical structure)
|- 3. Paraphrase Detection
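The layer scores eventually need to be combined into one similarity; a minimal sketch of a weighted fusion (the weight alpha and the example scores below are illustrative placeholders, to be tuned later):

```python
# Hypothetical sketch: fuse a syntactic (structural) score and a
# semantic score via a weighted average. Weight/scores are made up.

def fused_similarity(syntactic, semantic, alpha=0.4):
    """Weighted fusion: alpha * syntactic + (1 - alpha) * semantic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * syntactic + (1 - alpha) * semantic

# A paraphrase: different surface structure, nearly identical meaning
print(fused_similarity(0.5, 0.9))
```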
-- 0. Foundation --