This repository has been archived by the owner on Dec 25, 2023. It is now read-only.

Laboratory work #4, Anna Vorontsova - 22FPL2 #184

Closed
88 commits
c1d362b
my first commit
vorontsann Sep 8, 2023
ac7d44b
my second commit
vorontsann Sep 8, 2023
d4d0735
my second commit
vorontsann Sep 9, 2023
3b28642
Merge branch 'fipl-hse:main' into main
artyomtugaryov Sep 15, 2023
3d2c99c
Merge branch 'fipl-hse:main' into main
artyomtugaryov Sep 21, 2023
7b609f5
file deleted
vorontsann Sep 22, 2023
833c00a
file deleted
vorontsann Sep 26, 2023
c5f68a1
Merge branch 'main' of https://github.com/vorontsann/2023-2-level-labs
vorontsann Sep 26, 2023
5d2bdbe
file deleted
vorontsann Sep 26, 2023
6bee9e2
Merge branch 'fipl-hse:main' into main
artyomtugaryov Sep 28, 2023
42feb0c
file deleted
vorontsann Sep 28, 2023
39c7792
Merge branch 'main' of https://github.com/vorontsann/2023-2-level-labs
vorontsann Sep 28, 2023
658e324
file deleted
vorontsann Sep 28, 2023
6e3a82b
calculated frequencies
vorontsann Oct 1, 2023
58f3cd5
profiles created
vorontsann Oct 1, 2023
fc862ff
calculate_frequencies fixed
vorontsann Oct 1, 2023
b3a2359
mse calculated
vorontsann Oct 2, 2023
c7fdd72
corrected conditions in compare_profiles
vorontsann Oct 2, 2023
4e69cd7
mentor's corrections fixed and some things changed
vorontsann Oct 3, 2023
573c087
trying to fix tests
vorontsann Oct 3, 2023
953323b
tokenize simplified
vorontsann Oct 5, 2023
01ba88e
added spaces
vorontsann Oct 5, 2023
965b5e0
added spaces
vorontsann Oct 5, 2023
74636cd
added spaces
vorontsann Oct 5, 2023
5a089b2
FIXED CALCULATE_FREQUENCIES
vorontsann Oct 6, 2023
2db5574
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 6, 2023
43ea98c
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Oct 11, 2023
c049b1f
checkout labs from the origin repository
artyomtugaryov Oct 11, 2023
0b02ae3
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 13, 2023
16ed42b
code for 4
vorontsann Oct 18, 2023
f19e77e
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 19, 2023
9595d29
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 20, 2023
62fc719
fixed mentor's comments
vorontsann Oct 25, 2023
aaa23f3
fixed unittests
vorontsann Oct 25, 2023
2083fa5
code for 6
vorontsann Oct 30, 2023
fab91f4
forgot to change the score meow
vorontsann Oct 30, 2023
4ff90c2
fixed style i hope
vorontsann Oct 30, 2023
9ecb29a
fixed style i hope [2]
vorontsann Oct 30, 2023
246a888
Merge branch 'fipl-hse:main' into main
artyomtugaryov Oct 31, 2023
870ab1b
fixed part of comments
vorontsann Oct 31, 2023
9bf2ae8
Merge branch 'main' of https://github.com/vorontsann/2023-2-level-labs
vorontsann Oct 31, 2023
272d467
trying to fix all.............
vorontsann Nov 1, 2023
82363ad
trying to fix all.............
vorontsann Nov 2, 2023
454a136
fixed merge tokens
vorontsann Nov 2, 2023
4dd99da
Artem Mikhailovich do not worry please
vorontsann Nov 2, 2023
f6e49a9
no more bad check
vorontsann Nov 2, 2023
2275a7e
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Nov 3, 2023
1cfb086
checkout labs from the origin repository
artyomtugaryov Nov 3, 2023
accf69e
checkout labs from the origin repository
artyomtugaryov Nov 3, 2023
e51b4a8
Merge branch 'fipl-hse:main' into main
artyomtugaryov Nov 10, 2023
5cd71b6
code for 4
vorontsann Nov 15, 2023
7cfc18d
Merge branch 'fipl-hse:main' into main
artyomtugaryov Nov 17, 2023
79ff708
fixed tests
vorontsann Nov 18, 2023
571c01f
Merge branch 'main' of https://github.com/vorontsann/2023-2-level-labs
vorontsann Nov 18, 2023
7908ca5
code dor 6
vorontsann Nov 18, 2023
f47f868
trying to fix and fix and fix.......
vorontsann Nov 19, 2023
aded7e1
fixed comments and tests (except filter im trying to deal with it.....)
vorontsann Nov 21, 2023
455e3fd
filter (finally!!) (im not sure but..)
vorontsann Nov 21, 2023
ca0b278
filter (finally!!) (im not sure but..)
vorontsann Nov 21, 2023
36ea7e8
code for 8
vorontsann Nov 22, 2023
c8f5553
code style fixing
vorontsann Nov 22, 2023
8e61901
code style and import fixing
vorontsann Nov 22, 2023
a4fd670
import style fixing
vorontsann Nov 22, 2023
79520f1
mypy fixing
vorontsann Nov 22, 2023
6fe7fdc
i dont understand....
vorontsann Nov 22, 2023
950d83b
mypy fixing
vorontsann Nov 22, 2023
51d425f
mypy fixing
vorontsann Nov 22, 2023
a114cdc
mypy fixing
vorontsann Nov 22, 2023
6948e04
mypy fixing
vorontsann Nov 23, 2023
908cacb
fixing all
vorontsann Nov 23, 2023
4be43b7
fixing start
vorontsann Nov 23, 2023
d1bc24d
fixing return
vorontsann Nov 24, 2023
43e0cea
fixing return
vorontsann Nov 24, 2023
aef5122
Merge remote-tracking branch 'origin/main' into HEAD
artyomtugaryov Nov 24, 2023
26c5222
checkout labs from the origin repository
artyomtugaryov Nov 24, 2023
3e9c76d
Merge branch 'fipl-hse:main' into main
artyomtugaryov Dec 1, 2023
c82abda
Merge branch 'fipl-hse:main' into main
artyomtugaryov Dec 4, 2023
90e62b1
Merge branch 'fipl-hse:main' into main
vorontsann Dec 8, 2023
2c8daca
code for 6
vorontsann Dec 10, 2023
7b1f90d
import style fixing
vorontsann Dec 10, 2023
342556d
import style fixing
vorontsann Dec 10, 2023
a421f39
import style fixing
vorontsann Dec 10, 2023
621dcf5
import style fixing
vorontsann Dec 10, 2023
a9a085d
import style fixing
vorontsann Dec 10, 2023
3000112
import style and tests fixing
vorontsann Dec 10, 2023
1aecdd0
import style fixing
vorontsann Dec 16, 2023
7daa0b5
tests fixing
vorontsann Dec 16, 2023
ba55b5a
fixing
vorontsann Dec 18, 2023
67 changes: 67 additions & 0 deletions lab_4_fill_words_by_ngrams/main.py
@@ -4,6 +4,8 @@
Top-p sampling generation and filling gaps with ngrams
"""
# pylint:disable=too-few-public-methods, too-many-arguments
from random import choice

from lab_3_generate_by_ngrams.main import (BeamSearchTextGenerator, GreedyTextGenerator,
                                           NGramLanguageModel, TextProcessor)

@@ -28,6 +30,19 @@ def _tokenize(self, text: str) -> tuple[str, ...]: # type: ignore
        Raises:
            ValueError: In case of inappropriate type input argument or if input argument is empty.
        """
        if not isinstance(text, str) or not text:
            raise ValueError('Type input is inappropriate or input argument is empty.')

        tokens = []
        punctuation_signs = '?!.'
        for word in text.lower().split():
            cleaned_word = [letter for letter in word if letter.isalpha()]
            if not cleaned_word:
                continue
            tokens.append(''.join(cleaned_word))
            if word[-1] in punctuation_signs:
                tokens.append(self._end_of_word_token)
        return tuple(tokens)
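
For reference, the tokenization logic in this hunk can be tried outside the class. The sketch below is only an illustration: the function name `tokenize_words` and the default marker `"<eos>"` are assumptions made for the example, not names from the lab code.

```python
def tokenize_words(text: str, end_of_sentence: str = "<eos>") -> tuple[str, ...]:
    """Lowercase the text, keep alphabetic characters only, mark sentence ends.

    Hypothetical standalone version of the method in the diff above.
    """
    if not isinstance(text, str) or not text:
        raise ValueError("text must be a non-empty string")

    tokens: list[str] = []
    for word in text.lower().split():
        # Drop digits and punctuation, keep letters only.
        letters = "".join(ch for ch in word if ch.isalpha())
        if not letters:
            continue
        tokens.append(letters)
        # A word ending in '?', '!' or '.' closes the sentence.
        if word[-1] in "?!.":
            tokens.append(end_of_sentence)
    return tuple(tokens)


print(tokenize_words("Mr. Dursley was proud! Very proud"))
```

Note that with this rule an abbreviation such as "Mr." is also treated as a sentence end, which may or may not be what the lab expects.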

    def _put(self, element: str) -> None:
        """
@@ -39,6 +54,11 @@ def _put(self, element: str) -> None:
        Raises:
            ValueError: In case of inappropriate type input argument or if input argument is empty.
        """
        if not isinstance(element, str) or not element:
            raise ValueError('Type input is inappropriate or input argument is empty.')

        if element not in self._storage:
            self._storage[element] = len(self._storage)
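
A small sketch of how this id assignment behaves, assuming the storage is a plain `dict[str, int]` that starts empty (the real class may pre-populate it; the free function `put` is hypothetical):

```python
def put(storage: dict[str, int], element: str) -> None:
    """Assign the next free integer id to an element not seen before."""
    if not isinstance(element, str) or not element:
        raise ValueError("element must be a non-empty string")
    if element not in storage:
        # The current size of the storage is the next unused id.
        storage[element] = len(storage)


storage: dict[str, int] = {}
for token in ("harry", "potter", "harry", "<eos>"):
    put(storage, token)
print(storage)
```

Repeated tokens keep their first id, so ids stay dense as long as nothing is removed from the storage.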

    def _postprocess_decoded_text(self, decoded_corpus: tuple[str, ...]) -> str: # type: ignore
        """
@@ -56,6 +76,16 @@ def _postprocess_decoded_text(self, decoded_corpus: tuple[str, ...]) -> str: #
        Raises:
            ValueError: In case of inappropriate type input argument or if input argument is empty.
        """
        if not isinstance(decoded_corpus, tuple) or not decoded_corpus:
            raise ValueError('Type input is inappropriate or input argument is empty.')

        words = " ".join(decoded_corpus)
        sentences = words.split(self._end_of_word_token)
        resulted_text = ". ".join([sentence.strip().capitalize() for sentence in sentences])

        if resulted_text[-1] == ' ':
            return resulted_text[:-1]
        return f"{resulted_text}."
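
The post-processing step can be checked in isolation. This is a minimal sketch that mirrors the hunk above; the function name `postprocess` and the `"<eos>"` default are assumptions for the example only.

```python
def postprocess(decoded: tuple[str, ...], end_token: str = "<eos>") -> str:
    """Join tokens into capitalized sentences, each ending with a period."""
    if not isinstance(decoded, tuple) or not decoded:
        raise ValueError("decoded must be a non-empty tuple")

    # Split the joined words at every end-of-sentence marker.
    sentences = " ".join(decoded).split(end_token)
    text = ". ".join(sentence.strip().capitalize() for sentence in sentences)

    # A trailing marker leaves an empty last sentence, so the text ends in ". ".
    if text[-1] == " ":
        return text[:-1]
    return f"{text}."


print(postprocess(("harry", "potter", "<eos>", "he", "lived", "<eos>")))
print(postprocess(("harry", "potter")))
```

Both branches produce text that ends with a single period, whether or not the token sequence itself ends with the marker.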


class TopPGenerator:
@@ -80,6 +110,9 @@ def __init__(
            word_processor (WordProcessor): WordProcessor instance to handle text processing
            p_value (float): Collective probability mass threshold
        """
        self._model = language_model
        self._word_processor = word_processor
        self._p_value = p_value

    def run(self, seq_len: int, prompt: str) -> str: # type: ignore
        """
@@ -98,6 +131,40 @@ def run(self, seq_len: int, prompt: str) -> str: # type: ignore
                or if sequence has inappropriate length,
                or if methods used return None.
        """
        if not (isinstance(seq_len, int) and isinstance(prompt, str) and
                seq_len > 0 and prompt):
            raise ValueError('Type input is inappropriate or input argument is empty.')

        encoded_prompt = self._word_processor.encode(prompt)
        if encoded_prompt is None:
            raise ValueError('None is returned')

        encoded_list = list(encoded_prompt)
        for _ in range(seq_len):
            candidates = self._model.generate_next_token(encoded_prompt)
            if candidates is None:
                raise ValueError('None is returned.')
            if not candidates:
                break
            sorted_candidates = sorted(candidates.items(),
                                       key=lambda pair: (pair[1], pair[0]), reverse=True)
            sum_freq = 0
            num_candidates = 0
            for _, freq in sorted_candidates:
                if sum_freq >= self._p_value:
                    break
                sum_freq += freq
                num_candidates += 1

            random_token = choice(sorted_candidates[:num_candidates])[0]
            encoded_list.append(random_token)
            encoded_prompt = tuple(encoded_list)

        decoded = self._word_processor.decode(encoded_prompt)
        if decoded is None:
            raise ValueError('None is returned')

        return decoded
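
The candidate selection inside the loop is the core of top-p sampling, and it may be easier to review as a separate function. The sketch below assumes the model returns a `dict` mapping token ids to probabilities; `top_p_candidates` is a hypothetical helper and is not part of the lab API.

```python
from random import choice


def top_p_candidates(probabilities: dict[int, float], p_value: float) -> list[int]:
    """Return the most probable tokens whose cumulative mass reaches p_value."""
    # Highest probability first; ties are broken by the larger token id.
    ranked = sorted(probabilities.items(),
                    key=lambda pair: (pair[1], pair[0]), reverse=True)
    nucleus: list[int] = []
    cumulative = 0.0
    for token, probability in ranked:
        # Stop once the tokens already taken cover the threshold.
        if cumulative >= p_value:
            break
        cumulative += probability
        nucleus.append(token)
    return nucleus


candidates = {1: 0.5, 2: 0.3, 3: 0.15, 4: 0.05}
nucleus = top_p_candidates(candidates, 0.7)
next_token = choice(nucleus)  # uniform choice among the selected tokens
print(nucleus, next_token)
```

As in the diff, the final pick is uniform over the selected tokens rather than weighted by their probabilities; whether that matches the lab requirements is for the mentor to confirm.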


class GeneratorTypes:
11 changes: 10 additions & 1 deletion lab_4_fill_words_by_ngrams/start.py
@@ -2,6 +2,7 @@
Filling word by ngrams starter
"""
# pylint:disable=too-many-locals,unused-import
from lab_4_fill_words_by_ngrams.main import NGramLanguageModel, TopPGenerator, WordProcessor


def main() -> None:
@@ -10,7 +11,15 @@ def main() -> None:
    """
    with open("./assets/Harry_Potter.txt", "r", encoding="utf-8") as text_file:
        text = text_file.read()
    result = None
    word_processor = WordProcessor("<eos>")
    encoded_text = word_processor.encode(text)
    lang_model = NGramLanguageModel(encoded_text, 2)
    lang_model.build()

    top_p_generator = TopPGenerator(lang_model, word_processor, 0.5)
    result = top_p_generator.run(51, "Vernon")
    print(result)

    assert result


2 changes: 1 addition & 1 deletion lab_4_fill_words_by_ngrams/target_score.txt
@@ -1 +1 @@
0
6