This repository has been archived by the owner on Dec 25, 2023. It is now read-only.

Laboratory work #4, Marina Mokina - 22FPL1 #175

Closed · wants to merge 102 commits
102 commits
2b71dbe
add main
mmarina2004 Sep 15, 2023
be684a1
add git
mmarina2004 Sep 15, 2023
b6368f8
git commit
mmarina2004 Sep 20, 2023
3c9a3dd
Merge branch 'fipl-hse:main' into main
artyomtugaryov Sep 21, 2023
323fc7c
added fixes
mmarina2004 Sep 21, 2023
c44b914
Merge branch 'main' of https://github.com/mmarina2004/2023-2-level-labs
mmarina2004 Sep 21, 2023
01ea1a1
add fixes
mmarina2004 Sep 27, 2023
497bfeb
Merge branch 'fipl-hse:main' into main
artyomtugaryov Sep 28, 2023
05842a0
i start do lab
mmarina2004 Sep 29, 2023
5604898
i start do lab
mmarina2004 Sep 29, 2023
9af8720
added fixes
mmarina2004 Oct 2, 2023
4c329e5
added fixes
mmarina2004 Oct 3, 2023
78e5f4a
added fixes in start
mmarina2004 Oct 3, 2023
b8f85b9
added fixes
mmarina2004 Oct 3, 2023
b9f7567
added fixes in start
mmarina2004 Oct 3, 2023
85ff8e2
added fixes
mmarina2004 Oct 3, 2023
c0b809f
Delete requirements.txt
mmarina2004 Oct 3, 2023
9586a87
recovery
mmarina2004 Oct 3, 2023
928cc9c
recovery
mmarina2004 Oct 3, 2023
72e3aff
added fixes for 10
mmarina2004 Oct 4, 2023
e902027
mark 10
mmarina2004 Oct 4, 2023
81a86ba
added fixes for 10
mmarina2004 Oct 4, 2023
ae7ffdf
added fixes
mmarina2004 Oct 5, 2023
b7516ad
added fixes
mmarina2004 Oct 5, 2023
0fdf5bd
added fixed
mmarina2004 Oct 5, 2023
fd4ce28
added fixed
mmarina2004 Oct 5, 2023
3d3664d
added fixes
mmarina2004 Oct 5, 2023
ab3519f
added fixes
mmarina2004 Oct 5, 2023
0ce585b
added fixes
mmarina2004 Oct 5, 2023
516127c
start
mmarina2004 Oct 5, 2023
8a4856c
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 6, 2023
edd3e29
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Oct 11, 2023
9de3541
checkout labs from the origin repository
artyomtugaryov Oct 11, 2023
813de00
lab2
mmarina2004 Oct 12, 2023
eb3cec0
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 13, 2023
3af2ea9
i start do lab
mmarina2004 Oct 13, 2023
7cc7340
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 19, 2023
61a1603
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 20, 2023
5b7b064
merge
mmarina2004 Oct 20, 2023
5186fbc
1 func
mmarina2004 Oct 20, 2023
3f6df3f
change for 6
mmarina2004 Oct 25, 2023
182aa44
score
mmarina2004 Oct 25, 2023
07eb01a
added fixes
mmarina2004 Oct 25, 2023
bb6b622
change for 8
mmarina2004 Oct 26, 2023
146a4e2
score 8
mmarina2004 Oct 26, 2023
4e5bc34
change for 10
mmarina2004 Oct 28, 2023
a9de02a
score 10
mmarina2004 Oct 28, 2023
0154867
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 31, 2023
f6ff2bb
revert practice
mmarina2004 Nov 1, 2023
c12b292
Merge branch 'main' of https://github.com/mmarina2004/2023-2-level-labs
mmarina2004 Nov 1, 2023
006ec4f
revert practice
mmarina2004 Nov 1, 2023
f2d0f46
changes for checks
mmarina2004 Nov 1, 2023
64399ca
start
mmarina2004 Nov 2, 2023
2664576
start
mmarina2004 Nov 2, 2023
de9a0c6
start
mmarina2004 Nov 2, 2023
2b5df06
add fixes
mmarina2004 Nov 2, 2023
29376f3
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Nov 3, 2023
fdada45
checkout labs from the origin repository
artyomtugaryov Nov 3, 2023
e680ce3
checkout labs from the origin repository
artyomtugaryov Nov 3, 2023
5217fa8
changes for 4
mmarina2004 Nov 7, 2023
8426d68
start
mmarina2004 Nov 7, 2023
d396d1e
score
mmarina2004 Nov 7, 2023
f6ed07b
start
mmarina2004 Nov 7, 2023
fb0476f
Merge branch 'fipl-hse:main' into main
artyomtugaryov Nov 10, 2023
ed8da3e
added fixes
mmarina2004 Nov 10, 2023
4c5d0c6
Merge branch 'main' of https://github.com/mmarina2004/2023-2-level-labs
mmarina2004 Nov 10, 2023
baab5b6
added fixes
mmarina2004 Nov 15, 2023
b5a6ad7
score
mmarina2004 Nov 15, 2023
449f2ac
added fixes
mmarina2004 Nov 15, 2023
a63eab0
Merge branch 'fipl-hse:main' into main
artyomtugaryov Nov 17, 2023
99153b2
added fixes
mmarina2004 Nov 17, 2023
30c7852
added fixes
mmarina2004 Nov 21, 2023
6061770
changes for 8
mmarina2004 Nov 22, 2023
45bcbe2
mark 8
mmarina2004 Nov 22, 2023
3686071
some changes
mmarina2004 Nov 22, 2023
a9a62ee
added fixes
mmarina2004 Nov 22, 2023
03d9c2f
added fixes
mmarina2004 Nov 22, 2023
fd83d60
added fixes
mmarina2004 Nov 23, 2023
e85ad1d
added fixes
mmarina2004 Nov 23, 2023
8592c3b
start
mmarina2004 Nov 23, 2023
9067d6d
some changes
mmarina2004 Nov 23, 2023
3262c60
changes
mmarina2004 Nov 23, 2023
eb51ae3
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Nov 24, 2023
034b0f8
second step
mmarina2004 Nov 29, 2023
2e8a2a3
Merge branch 'fipl-hse:main' into main
artyomtugaryov Dec 1, 2023
00824c5
Merge branch 'fipl-hse:main' into main
artyomtugaryov Dec 4, 2023
f3cd6f8
changes for 6
mmarina2004 Dec 6, 2023
4b796cf
Merge branch 'main' of https://github.com/mmarina2004/2023-2-level-labs
mmarina2004 Dec 6, 2023
3e3f063
score 6
mmarina2004 Dec 6, 2023
bc5e553
start
mmarina2004 Dec 6, 2023
9d6216b
Merge branch 'fipl-hse:main' into main
mmarina2004 Dec 8, 2023
5098f1a
changes for 8
mmarina2004 Dec 12, 2023
b6cff90
mark 8
mmarina2004 Dec 12, 2023
5fc6613
corrections
mmarina2004 Dec 12, 2023
ce22058
corrections
mmarina2004 Dec 12, 2023
f96b2ad
corrections
mmarina2004 Dec 12, 2023
67eb317
changes for 10
mmarina2004 Dec 16, 2023
8399714
mark 10
mmarina2004 Dec 16, 2023
165bf26
start
mmarina2004 Dec 17, 2023
b713d63
corrections
mmarina2004 Dec 17, 2023
0e7f98f
corrections
mmarina2004 Dec 17, 2023
a48fe26
corrections
mmarina2004 Dec 18, 2023
165 changes: 165 additions & 0 deletions lab_4_fill_words_by_ngrams/main.py
@@ -4,6 +4,10 @@
Top-p sampling generation and filling gaps with ngrams
"""
# pylint:disable=too-few-public-methods, too-many-arguments
import json
import math
import random

from lab_3_generate_by_ngrams.main import (BeamSearchTextGenerator, GreedyTextGenerator,
NGramLanguageModel, TextProcessor)

@@ -28,6 +32,20 @@ def _tokenize(self, text: str) -> tuple[str, ...]: # type: ignore
Raises:
ValueError: In case of inappropriate type input argument or if input argument is empty.
"""
if not isinstance(text, str) or not text:
raise ValueError('Incorrect input')

tokens = []
punctuation = '!?.'
for word in text.lower().split():
if word[-1] in punctuation:
tokens.extend([word[:-1], self._end_of_word_token])
else:
cleaned_word = [letter for letter in word if letter.isalpha()]
if cleaned_word:
tokens.append(''.join(cleaned_word))

return tuple(tokens)
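For reference, the tokenization rule above can be exercised in isolation. The following standalone sketch (not part of the diff) mirrors the same steps — lowercase, whitespace split, a sentence-final `!`, `?` or `.` replaced by an end-of-word marker, and non-letter characters dropped; the `'<eow>'` marker value is an assumption borrowed from the `start.py` changes below.

```python
def tokenize(text: str, end_of_word: str = '<eow>') -> tuple[str, ...]:
    """Minimal sketch of the WordProcessor._tokenize rule above."""
    tokens = []
    for word in text.lower().split():
        if word[-1] in '!?.':
            # Sentence-final punctuation becomes an end-of-word marker.
            tokens.extend([word[:-1], end_of_word])
        else:
            # Keep only alphabetic characters of the word.
            cleaned = ''.join(ch for ch in word if ch.isalpha())
            if cleaned:
                tokens.append(cleaned)
    return tuple(tokens)
```

Note that only words without sentence-final punctuation are filtered to letters, matching the lab code rather than a fully general cleaner.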

def _put(self, element: str) -> None:
"""
@@ -39,6 +57,11 @@ def _put(self, element: str) -> None:
Raises:
ValueError: In case of inappropriate type input argument or if input argument is empty.
"""
if not isinstance(element, str) or not element:
raise ValueError('Incorrect input')

if element not in self._storage:
self._storage[element] = len(self._storage)

def _postprocess_decoded_text(self, decoded_corpus: tuple[str, ...]) -> str: # type: ignore
"""
@@ -56,6 +79,18 @@ def _postprocess_decoded_text(self, decoded_corpus: tuple[str, ...]) -> str: #
Raises:
ValueError: In case of inappropriate type input argument or if input argument is empty.
"""
if not isinstance(decoded_corpus, tuple) or not decoded_corpus:
raise ValueError('Incorrect input')

words_list = list(decoded_corpus)
sentences = ' '.join(words_list).split(self._end_of_word_token)
cleaned_sentences = [sentence.strip().capitalize()
for sentence in sentences if sentence.strip()]
return '. '.join(cleaned_sentences) + '.'


class TopPGenerator:
@@ -80,6 +115,9 @@ def __init__(
word_processor (WordProcessor): WordProcessor instance to handle text processing
p_value (float): Collective probability mass threshold
"""
self._model = language_model
self._word_processor = word_processor
self._p_value = p_value

def run(self, seq_len: int, prompt: str) -> str: # type: ignore
"""
@@ -98,6 +136,36 @@ def run(self, seq_len: int, prompt: str) -> str: # type: ignore
or if sequence has inappropriate length,
or if methods used return None.
"""
if (not isinstance(seq_len, int) or not isinstance(prompt, str)
or seq_len <= 0 or not prompt):
raise ValueError("Incorrect input")
encoded = self._word_processor.encode(prompt)
if not encoded:
raise ValueError("Encoded is None")

for i in range(seq_len):
next_tokens = self._model.generate_next_token(encoded)
if next_tokens is None:
raise ValueError('Next tokens are None')
if not next_tokens:
break

sorted_dict = dict(sorted(next_tokens.items(),
key=lambda x: (x[1], x[0]), reverse=True))
probability = 0
possible_tokens = tuple()
for word, value in sorted_dict.items():
probability += value
possible_tokens += (word,)
if probability >= self._p_value:
break
encoded += (random.choice(possible_tokens),)

decoded = self._word_processor.decode(encoded)
if not decoded:
raise ValueError('Decoded is None')

return decoded
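The sampling loop above implements the top-p (nucleus) candidate selection. As a minimal sketch (not part of the diff), the selection step alone looks like this; note the lab code then draws uniformly from the nucleus with `random.choice`, not proportionally to token probability.

```python
def top_p_candidates(probs: dict[str, float], p: float) -> tuple[str, ...]:
    """Collect the smallest high-probability set whose mass reaches p."""
    # Sort by probability, ties broken by token, highest first —
    # the same ordering used in TopPGenerator.run above.
    ordered = sorted(probs.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ordered:
        mass += prob
        nucleus.append(token)
        if mass >= p:
            break
    return tuple(nucleus)
```

With `p = 0.5` and probabilities `{'a': 0.5, 'b': 0.3, 'c': 0.2}` the nucleus is just `('a',)`; raising `p` widens the candidate set.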


class GeneratorTypes:
@@ -114,6 +182,14 @@ def __init__(self) -> None:
"""
Initialize an instance of GeneratorTypes.
"""
self.greedy = 0
self.top_p = 1
self.beam_search = 2
self._types = {
self.greedy: 'Greedy Generator',
self.top_p: 'Top-P Generator',
self.beam_search: 'Beam Search Generator'
}

def get_conversion_generator_type(self, generator_type: int) -> str: # type: ignore
"""
@@ -125,6 +201,7 @@ def get_conversion_generator_type(self, generator_type: int) -> str: # type: ig
Returns:
(str): Name of the generator.
"""
return self._types[generator_type]


class GenerationResultDTO:
@@ -147,6 +224,9 @@ def __init__(self, text: str, perplexity: float, generation_type: int):
generation_type (int):
Numeric type of the generator for which perplexity was calculated
"""
self.__text = text
self.__perplexity = perplexity
self.__type = generation_type

def get_perplexity(self) -> float: # type: ignore
"""
@@ -155,6 +235,7 @@ def get_perplexity(self) -> float: # type: ignore
Returns:
(float): Perplexity value
"""
return self.__perplexity

def get_text(self) -> str: # type: ignore
"""
@@ -163,6 +244,7 @@ def get_text(self) -> str: # type: ignore
Returns:
(str): Text for which the perplexity was count
"""
return self.__text

def get_type(self) -> int: # type: ignore
"""
@@ -171,6 +253,7 @@ def get_type(self) -> int: # type: ignore
Returns:
(int): Numeric type of the generator
"""
return self.__type

def __str__(self) -> str: # type: ignore
"""
@@ -179,6 +262,9 @@ def __str__(self) -> str: # type: ignore
Returns:
(str): String with report
"""
return (f'Perplexity score: {self.__perplexity}\n'
f'{GeneratorTypes().get_conversion_generator_type(self.__type)}\n'
f'Text: {self.__text}\n')


class QualityChecker:
@@ -203,6 +289,9 @@ def __init__(
NGramLanguageModel instance to use for text generation
word_processor (WordProcessor): WordProcessor instance to handle text processing
"""
self._generators = generators
self._language_model = language_model
self._word_processor = word_processor

def _calculate_perplexity(self, generated_text: str) -> float: # type: ignore
"""
@@ -220,6 +309,27 @@ def _calculate_perplexity(self, generated_text: str) -> float: # type: ignore
or if methods used return None,
or if nothing was generated.
"""
if not isinstance(generated_text, str) or not generated_text:
raise ValueError('Incorrect input')

encoded = self._word_processor.encode(generated_text)
if not encoded:
raise ValueError('Encoded is None')

ngram_size = self._language_model.get_n_gram_size()
log_prob_sum = 0.0
for index in range(ngram_size - 1, len(encoded)):
context = tuple(encoded[index - ngram_size + 1: index])
next_tokens = self._language_model.generate_next_token(context)
if not next_tokens:
raise ValueError('Next_tokens is None')

prob = next_tokens.get(encoded[index])
if prob:
log_prob_sum += math.log(prob)
if not log_prob_sum:
raise ValueError('No token probabilities were accumulated')
return math.exp(-log_prob_sum / (len(encoded) - ngram_size))
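The value returned above is the standard n-gram perplexity: the exponential of the negative mean log-probability over the scored positions. A minimal sketch (not part of the diff) of just that formula, given the per-position log-probabilities:

```python
import math

def perplexity(log_probs: list[float]) -> float:
    """exp of the negative mean log-probability — lower is better."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

For example, if every token has probability 0.5, the perplexity is exactly 2: the model is, on average, as uncertain as a fair coin flip.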

def run(self, seq_len: int, prompt: str) -> list[GenerationResultDTO]: # type: ignore
"""
@@ -239,6 +349,21 @@ def run(self, seq_len: int, prompt: str) -> list[GenerationResultDTO]: # type:
or if sequence has inappropriate length,
or if methods used return None.
"""
if not isinstance(seq_len, int) or seq_len < 0 or not isinstance(prompt, str) or not prompt:
raise ValueError('Incorrect input')

results = []
for num_type, generator in self._generators.items():
text = generator.run(prompt=prompt, seq_len=seq_len)
if not text:
raise ValueError('Text is None')

perplexity = self._calculate_perplexity(text)
if not perplexity:
raise ValueError('Perplexity is None')

results.append(GenerationResultDTO(text, perplexity, num_type))
return sorted(results, key=lambda item: (item.get_perplexity(), item.get_type()))


class Examiner:
@@ -258,6 +383,8 @@ def __init__(self, json_path: str) -> None:
Args:
json_path (str): Local path to assets file
"""
self._json_path = json_path
self._questions_and_answers = self._load_from_json()

def _load_from_json(self) -> dict[tuple[str, int], str]: # type: ignore
"""
@@ -273,6 +400,15 @@ def _load_from_json(self) -> dict[tuple[str, int], str]: # type: ignore
or if attribute _json_path has inappropriate extension,
or if inappropriate type loaded data.
"""
if (not isinstance(self._json_path, str) or not self._json_path
or not self._json_path.endswith('.json')):
raise ValueError('Incorrect input')

with open(self._json_path, 'r', encoding='utf-8') as file:
question_and_answers = json.load(file)
if not isinstance(question_and_answers, list):
raise ValueError('Loaded data is not a list')
return {(i['question'], i['location']): i['answer'] for i in question_and_answers}

def provide_questions(self) -> list[tuple[str, int]]: # type: ignore
"""
@@ -282,6 +418,7 @@ def provide_questions(self) -> list[tuple[str, int]]: # type: ignore
list[tuple[str, int]]:
List in the form of [(question, position of the word to be filled)]
"""
return list(self._questions_and_answers.keys())

def assess_exam(self, answers: dict[str, str]) -> float: # type: ignore
"""
@@ -296,6 +433,13 @@ def assess_exam(self, answers: dict[str, str]) -> float: # type: ignore
Raises:
ValueError: In case of inappropriate type input argument or if input argument is empty.
"""
if not isinstance(answers, dict) or not answers:
raise ValueError('Incorrect input')

right_answers = [key for key, answer in self._questions_and_answers.items()
if answers.get(key[0]) == answer]

return len(right_answers) / len(self._questions_and_answers)
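Examiner.assess_exam above scores the share of questions whose filled-in answer matches the reference answer. A simplified sketch (not part of the diff) of that accuracy computation — here keyed by question string alone, whereas the lab code keys by `(question, position)` tuples:

```python
def accuracy(reference: dict[str, str], answers: dict[str, str]) -> float:
    """Fraction of questions answered exactly as in the reference."""
    correct = sum(1 for question, answer in reference.items()
                  if answers.get(question) == answer)
    return correct / len(reference)
```

Missing answers simply count as wrong, which is the forgiving behavior a grader usually wants.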


class GeneratorRuleStudent:
@@ -318,6 +462,11 @@ def __init__(
NGramLanguageModel instance to use for text generation
word_processor (WordProcessor): WordProcessor instance to handle text processing
"""
self._generator_type = generator_type
generators = (GreedyTextGenerator(language_model, word_processor),
TopPGenerator(language_model, word_processor, 0.5),
BeamSearchTextGenerator(language_model, word_processor, 5))
self._generator = generators[self._generator_type]

def take_exam(self, tasks: list[tuple[str, int]]) -> dict[str, str]: # type: ignore
"""
@@ -335,6 +484,20 @@ def take_exam(self, tasks: list[tuple[str, int]]) -> dict[str, str]: # type: ig
or if input argument is empty,
or if methods used return None.
"""
if not isinstance(tasks, list) or not tasks:
raise ValueError('Incorrect input')

answers = {}
for (question, position) in tasks:
next_sequence = self._generator.run(seq_len=1, prompt=question[:position])
if not next_sequence:
raise ValueError('Next sequence is None')

if next_sequence[-1] == '.':
next_sequence = next_sequence[:-1] + ' '
answers.update({question: next_sequence + question[position:]})

return answers

def get_generator_type(self) -> str: # type: ignore
"""
@@ -343,3 +506,5 @@ def get_generator_type(self) -> str: # type: ignore
Returns:
str: Generator type
"""
generator = GeneratorTypes()
return generator.get_conversion_generator_type(self._generator_type)
26 changes: 25 additions & 1 deletion lab_4_fill_words_by_ngrams/start.py
@@ -2,6 +2,7 @@
Filling word by ngrams starter
"""
# pylint:disable=too-many-locals,unused-import
import lab_4_fill_words_by_ngrams.main as main_py


def main() -> None:
@@ -10,7 +11,30 @@ def main() -> None:
"""
with open("./assets/Harry_Potter.txt", "r", encoding="utf-8") as text_file:
text = text_file.read()
result = None
word_processor = main_py.WordProcessor('<eow>')
encoded_text = word_processor.encode(text)
model = main_py.NGramLanguageModel(encoded_text, 2)
model.build()
top_p = main_py.TopPGenerator(model, word_processor, 0.5)
top_p_result = top_p.run(51, 'Vernon')
print(top_p_result)
generator_types = main_py.GeneratorTypes()
generators = {generator_types.top_p: main_py.TopPGenerator(model, word_processor, 0.5),
generator_types.beam_search:
main_py.BeamSearchTextGenerator(model, word_processor, 5)}
quality_check = main_py.QualityChecker(generators, model, word_processor)
quality_result = quality_check.run(100, 'The')
print(quality_result)
examiner = main_py.Examiner('./assets/question_and_answers.json')
questions = examiner.provide_questions()
students = [main_py.GeneratorRuleStudent(i, model, word_processor) for i in range(3)]
for student in students:
answers = student.take_exam(questions)
result = examiner.assess_exam(answers)
generator_type = student.get_generator_type()
print('Type of generator:', generator_type)
print('Answers:', ''.join(answers.values()))
print('Accuracy:', result)
assert result


2 changes: 1 addition & 1 deletion lab_4_fill_words_by_ngrams/target_score.txt
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0
10