diff --git a/.gitignore b/.gitignore
index 7aa9cd7..d1eded8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,6 @@
+/build/
+/venv/
+/dist/harmonized-company-names.json
+/dist/.env
 .env
 harmonized-company-names.json
diff --git a/README.md b/README.md
index 47e57f7..476d399 100644
--- a/README.md
+++ b/README.md
@@ -1,82 +1,129 @@
-AIAutoRename
-============
+# autorename-pdf
 
-AIAutoRename is a Python script that automatically renames PDF files based on their content. It leverages the power of the OpenAI GPT Chat API to extract relevant information, such as the document date, company name, and document type, from the PDF's text. This tool is designed to simplify the organization and management of your PDF files by automating the renaming process.
+**autorename-pdf** is a highly efficient tool designed to automatically rename and archive PDF documents based on their content. By leveraging OCR technology, it extracts critical information such as the company name, document date, and document type to create well-organized filenames. This tool simplifies document management and ensures consistency, especially for businesses handling large volumes of PDFs.
 
-Installation
-------------
+---
 
-To use AIAutoRename, you'll need Python 3.6 or later. You can download it from the [official Python website](https://www.python.org/downloads/) or the Microsoft Store.
+## Features
 
-1. Clone or download this repository and navigate to the root directory of the project in your terminal.
+- **Automatic PDF Renaming**: Extracts metadata from PDFs (company name, date, document type) and renames them accordingly.
+- **Organized Archiving**: Ensures consistency in document naming and file storage, streamlining archiving processes.
+- **Batch Processing**: Rename multiple PDFs within a folder in one go.
+- **Context Menu Integration**: Easily right-click on files or folders to trigger renaming actions.
+- **Powerful OCR Support**: Uses Tesseract and advanced AI via OpenAI for highly accurate text recognition from scanned PDFs.
 
-    ```
-    git clone https://github.com/ptmrio/AIAutoRename.git
-    cd AIAutoRename
-    ```
-2. Install the required python packages using the `requirements.txt` file:
+---
 
+## Installation Guide
+
+### Prerequisites
+
+Ensure you have the following installed on your system:
+
+1. **Python (optional)**: Download and install a recent version of Python 3 (for example, 3.11):
+   ```powershell
+   winget install Python.Python
+   ```
+
+
+2. **Chocolatey**: Required for installing dependencies on Windows. Install it using PowerShell (run as administrator):
+   ```powershell
+   Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
+   ```
+
+3. **Tesseract OCR**: Required for extracting text from images in PDFs. Install it using Chocolatey:
+   ```powershell
+   choco install tesseract
+   ```
+
+4. **Poppler**: Required for converting PDF pages into images. Install via Chocolatey or manually:
+   ```powershell
+   choco install poppler
     ```
-    pip install -r requirements.txt
-    ```
-3. Install [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/) for Windows by following the installation instructions on their GitHub page. After installation, add the folder of the installed Tesseract directory (typicalls `C:\Program Files\Tesseract-OCR`) to your PATH environment variable.
-
-4. Download and extract [poppler for Windows](https://github.com/oschwartz10612/poppler-windows). After installation, add the `bin` folder (e.g. `C:\poppler\Library\bin`) of the installed poppler directory to your PATH environment variable.
+### Setup Instructions
+
+1. **Download or clone the Repository**:
+   ```cmd
+   git clone https://github.com/ptmrio/autorename-pdf.git
+   cd autorename-pdf
+   ```
 
-Here's a [guide](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/) on how to add directories to the PATH variable on Windows 10.
-
+2. **Edit the `.env` File**:
+   Configure your API key and company name: copy the `.env.example` file into the `dist` folder as `.env`, open it in any text editor, and set the following:
+   - Add your OpenAI API key:
+     ```
+     OPENAI_API_KEY=your-api-key
+     ```
+   - Specify your preferred OpenAI model:
+     ```
+     OPENAI_MODEL=gpt-4o
+     ```
+   - Enter your company name (this prevents it from being extracted):
+     ```
+     MY_COMPANY_NAME=your-company-name
+     ```
+   Save the file as `.env` after making these changes.
 
-Configuration
--------------
+3. **Run the Context Menu Setup (Administrator Required)**:
+   The app includes pre-built executables, so there is no need to install Python dependencies. Simply add the app to your context menu by running the following command (make sure to **run as admin**):
+   ```cmd
+   add-to-context-menu.exe
+   ```
 
-AIAutoRename uses environment variables to configure the OpenAI API key and the name of your company. Before running the script, you'll need to create a file named `.env` in the root directory of the project and add the following lines:
+   This will add options to your right-click context menu for both individual PDFs and folders.
 
-```
-OPENAI_API_KEY=
-OPENAI_MODEL=gpt-3.5-turbo
-MY_COMPANY_NAME=
-```
+---
 
-Replace `` with your OpenAI API key, which can be obtained from the [OpenAI website](https://platform.openai.com/account/api-keys). Set `` to your company's name. This information will help the OpenAI API to better understand the context and decide whether to use the sender or recipient of the PDF document.
+## Usage
 
-Usage
------
+### Context Menu (Recommended)
 
-### Renaming a single PDF file
+After installation, autorename-pdf can be accessed by right-clicking files or folders:
 
-To rename a single PDF file, run the following command in your terminal (cmd on Windows, terminal on Mac):
+1. **Rename a Single PDF**: Right-click a PDF file and select `Auto Rename PDF` to automatically rename it.
+2. **Batch Rename PDFs in Folder**: Right-click a folder and choose `Auto Rename PDFs in Folder` to process all PDFs within.
+3. **Rename PDFs from Folder Background**: Right-click the background of a folder and select `Auto Rename PDFs in This Folder` to rename every PDF inside the folder.
 
-```
-python autorename.py "C:\Users\username\Downloads\invoice123.pdf"
-```
+### Command-Line Usage (Optional)
 
-Replace `C:\Users\username\Downloads\invoice123.pdf` with the path to your PDF file.
+If you prefer using the terminal, autorename-pdf can be executed as a command-line tool:
 
-**Example:**
+- **Rename a single PDF**:
+  ```bash
+  autorename-pdf.exe "C:\path\to\file.pdf"
+  ```
 
-Suppose your PDF file is named `invoice123.pdf` and is located in the `invoices` folder on your desktop. After running AIAutoRename, the file might be renamed to something like `20220101 ACME ER.pdf`, where `20220101` is the document date, `ACME` is the company name, and `ER` is the document type (incoming invoice).
+- **Rename all PDFs in a folder**:
+  ```bash
+  autorename-pdf.exe "C:\path\to\folder"
+  ```
 
-### Renaming all PDF files in a folder
+---
 
-To rename all PDF files in a folder and its subfolders, run the following command in your terminal:
+## Examples
 
-```
-python autorename.py "C:\Users\username\Downloads"
-```
+Here are some real-world examples of how autorename-pdf can simplify your file management:
 
-Replace `C:\Users\username\Downloads` with the path to your folder (no trailing slash).
+1. **Input**: `invoice_123.pdf`
+   **Output**: `20230901 ACME ER.pdf`
+   - Explanation: The file is renamed using the date `20230901` (1st September 2023), `ACME` as the company name, and `ER` for an incoming invoice.
 
-**Example:**
+2. **Input**: `payment_invoice.pdf`
+   **Output**: `20231015 XYZ AR.pdf`
+   - Explanation: The system extracts `20231015` (15th October 2023), `XYZ` as the company, and `AR` for an outgoing invoice.
 
-Suppose you downloaded a batch of documents into your `Downloads` folder. After running AIAutoRename on the folder, all PDF files within the folder will be renamed according to their content, such as document date, company name, and document type. For example, a file originally named `invoice123.pdf` might be renamed to `20220215 MegaCorp PO.pdf`, where `20220215` is the document date, `MegaCorp` is the company name, and `PO` is the document type (purchase order).
+3. **Batch Renaming**:
+   - **Input**: A folder containing `invoice1.pdf`, `invoice2.pdf`, `invoice3.pdf`.
+   - **Output**: Renamed files inside the folder as:
+     - `20230712 CompanyA ER.pdf`
+     - `20230713 CompanyB AR.pdf`
+     - `20230714 CompanyC ER.pdf`
 
-Contributing
-------------
+---
 
-We welcome contributions from everyone! If you find a bug or have a feature request, please open an issue on our [GitHub repository](https://github.com/ptmrio/AIAutoRename). If you'd like to contribute code, please open a pull request with your changes. We appreciate your support in making AIAutoRename even better!
+## Contribution and Support
 
-Support
--------
+We welcome contributions and feedback. If you have ideas or encounter issues, please submit a pull request or open an issue on [GitHub](https://github.com/ptmrio/autorename-pdf).
 
-If you encounter any issues or need assistance using AIAutoRename, please don't hesitate to reach out by opening an issue on our [GitHub repository](https://github.com/ptmrio/AIAutoRename). We'll do our best to help you as soon as possible.
\ No newline at end of file
+For any questions or support, please reach out through our GitHub page.
\ No newline at end of file
diff --git a/add-to-context-menu.py b/add-to-context-menu.py
new file mode 100644
index 0000000..779e4f1
--- /dev/null
+++ b/add-to-context-menu.py
@@ -0,0 +1,123 @@
+import os
+import sys
+import winreg as reg
+import ctypes
+
+def is_admin():
+    try:
+        return ctypes.windll.shell32.IsUserAnAdmin()
+    except:
+        return False
+
+def add_registry_entries():
+    if not is_admin():
+        print("This script requires administrator privileges. Please run as administrator.")
+        return
+
+    # Get the current directory
+    current_directory = os.path.dirname(os.path.abspath(__file__))
+
+    # Check if we're running from source or as a built executable
+    if getattr(sys, 'frozen', False):
+        # We're running in a bundle (built executable)
+        current_directory = os.path.dirname(sys.executable)
+        main_script = os.path.join(current_directory, "autorename-pdf.exe")  # autorename-pdf.exe should be alongside this executable
+    else:
+        # We're running in a normal Python environment
+        executable = os.path.join(current_directory, "venv", "Scripts", "python.exe")
+        main_script = os.path.join(current_directory, "autorename.py")
+
+    # Command for folders (using the main script directly)
+    if getattr(sys, 'frozen', False):
+        autorename_command = f'"{main_script}" "%1"'
+    else:
+        autorename_command = f'"{executable}" "{main_script}" "%1"'
+
+    # Confirm with the user
+    confirm = input("This will add 'Auto Rename PDF' to your context menus. Continue? (y/n): ")
+    if confirm.lower() != 'y':
+        print("Operation cancelled.")
+        return
+
+    try:
+        # Add registry entries for PDFs (using the wrapper)
+        add_menu_for_file_type("SystemFileAssociations\\.pdf", "Auto Rename PDF", autorename_command)
+
+        # Add registry entries for Folders (using the main script)
+        add_menu_for_folder("Auto Rename PDFs in Folder", autorename_command)
+
+        # Add registry entries for Directory Background (using the main script)
+        add_menu_for_directory_background("Auto Rename PDFs in This Folder", autorename_command)
+
+        print("Registry entries added successfully.")
+    except Exception as e:
+        print(f"An error occurred: {e}")
+
+def add_menu_for_file_type(file_type_key, menu_name, command):
+    key_path = f"{file_type_key}\\shell\\AutoRenamePDF"
+    key_command_path = f"{key_path}\\command"
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, menu_name)
+        reg.SetValueEx(key, "Icon", 0, reg.REG_SZ, "shell32.dll,71")
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_command_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, command)
+
+def add_menu_for_folder(menu_name, command):
+    key_path = r"Directory\shell\AutoRenamePDFs"
+    key_command_path = f"{key_path}\\command"
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, menu_name)
+        reg.SetValueEx(key, "Icon", 0, reg.REG_SZ, "shell32.dll,71")
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_command_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, command)
+
+def add_menu_for_directory_background(menu_name, command):
+    key_path = r"Directory\Background\shell\AutoRenamePDFs"
+    key_command_path = f"{key_path}\\command"
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, menu_name)
+        reg.SetValueEx(key, "Icon", 0, reg.REG_SZ, "shell32.dll,71")
+
+    with reg.CreateKey(reg.HKEY_CLASSES_ROOT, key_command_path) as key:
+        reg.SetValueEx(key, None, 0, reg.REG_SZ, command.replace('"%1"', '"%V"'))
+
+def remove_registry_entries():
+    if not is_admin():
+        print("This script requires administrator privileges. Please run as administrator.")
+        return
+
+    confirm = input("This will remove 'Auto Rename PDF' from your context menus. Continue? (y/n): ")
+    if confirm.lower() != 'y':
+        print("Operation cancelled.")
+        return
+
+    try:
+        # Remove entries for PDFs
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"SystemFileAssociations\.pdf\shell\AutoRenamePDF\command")
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"SystemFileAssociations\.pdf\shell\AutoRenamePDF")
+
+        # Remove entries for Folders
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"Directory\shell\AutoRenamePDFs\command")
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"Directory\shell\AutoRenamePDFs")
+
+        # Remove entries for Directory Background
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"Directory\Background\shell\AutoRenamePDFs\command")
+        reg.DeleteKey(reg.HKEY_CLASSES_ROOT, r"Directory\Background\shell\AutoRenamePDFs")
+
+        print("Registry entries removed successfully.")
+    except Exception as e:
+        print(f"An error occurred: {e}")
+
+if __name__ == "__main__":
+    action = input("Do you want to (a)dd or (r)emove registry entries? ").lower()
+    if action == 'a':
+        add_registry_entries()
+    elif action == 'r':
+        remove_registry_entries()
+    else:
+        print("Invalid option. Please choose 'a' to add or 'r' to remove.")
\ No newline at end of file
diff --git a/add-to-context-menu.spec b/add-to-context-menu.spec
new file mode 100644
index 0000000..7581b42
--- /dev/null
+++ b/add-to-context-menu.spec
@@ -0,0 +1,38 @@
+# -*- mode: python ; coding: utf-8 -*-
+
+
+a = Analysis(
+    ['add-to-context-menu.py'],
+    pathex=[],
+    binaries=[],
+    datas=[],
+    hiddenimports=[],
+    hookspath=[],
+    hooksconfig={},
+    runtime_hooks=[],
+    excludes=[],
+    noarchive=False,
+    optimize=0,
+)
+pyz = PYZ(a.pure)
+
+exe = EXE(
+    pyz,
+    a.scripts,
+    a.binaries,
+    a.datas,
+    [],
+    name='add-to-context-menu',
+    debug=False,
+    bootloader_ignore_signals=False,
+    strip=False,
+    upx=True,
+    upx_exclude=[],
+    runtime_tmpdir=None,
+    console=True,
+    disable_windowed_traceback=False,
+    argv_emulation=False,
+    target_arch=None,
+    codesign_identity=None,
+    entitlements_file=None,
+)
diff --git a/autorename-pdf.spec b/autorename-pdf.spec
new file mode 100644
index 0000000..f476380
--- /dev/null
+++ b/autorename-pdf.spec
@@ -0,0 +1,38 @@
+# -*- mode: python ; coding: utf-8 -*-
+
+
+a = Analysis(
+    ['autorename.py'],
+    pathex=[],
+    binaries=[],
+    datas=[],
+    hiddenimports=[],
+    hookspath=[],
+    hooksconfig={},
+    runtime_hooks=[],
+    excludes=[],
+    noarchive=False,
+    optimize=0,
+)
+pyz = PYZ(a.pure)
+
+exe = EXE(
+    pyz,
+    a.scripts,
+    a.binaries,
+    a.datas,
+    [],
+    name='autorename-pdf',
+    debug=False,
+    bootloader_ignore_signals=False,
+    strip=False,
+    upx=True,
+    upx_exclude=[],
+    runtime_tmpdir=None,
+    console=True,
+    disable_windowed_traceback=False,
+    argv_emulation=False,
+    target_arch=None,
+    codesign_identity=None,
+    entitlements_file=None,
+)
diff --git a/autorename.py b/autorename.py
index e35eeaf..1c4c0df 100644
--- a/autorename.py
+++ b/autorename.py
@@ -1,171 +1,208 @@
-from jellyfish import jaro_winkler_similarity
 import os
-from dotenv import load_dotenv
 import sys
+import logging
+from typing import Dict, Tuple, Optional
+from jellyfish import jaro_winkler_similarity
+from dotenv import load_dotenv
 from pdf2image import convert_from_path
+import datetime
 import pytesseract
-import openai
+from openai import OpenAI
 import json
 import dateparser
 import re
+from PIL import Image
+import cv2
+import numpy as np
+from pydantic import BaseModel, Field
 
-load_dotenv()
-openai.api_key = os.getenv("OPENAI_API_KEY")
-openai_model = os.getenv("OPENAI_MODEL")
-my_company_name = os.getenv("MY_COMPANY_NAME")
-
-
-def pdf_to_text(pdf_path):
-    images = convert_from_path(pdf_path, first_page=1, last_page=1)
-    text = ''
-    for image in images:
-        text += pytesseract.image_to_string(image)
-    return text
+# Constants
+PDF_EXTENSION = ".pdf"
+UNKNOWN_VALUE = "Unknown"
+DEFAULT_DATE = "00000000"
+CONFIDENCE_THRESHOLD = 0.85
 
+# Configure logging
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 
-def truncate_text(text, max_tokens=2048):
-    tokens = text.split()
-    truncated_text = ' '.join(tokens[:max_tokens])
-    return truncated_text
+if getattr(sys, 'frozen', False):
+    current_directory = os.path.dirname(sys.executable)  # Path to the folder containing the .exe
+else:
+    current_directory = os.path.dirname(os.path.abspath(__file__))  # Path to the script file
 
+# Define the path to the .env file
+env_path = os.path.join(current_directory, '.env')
 
-def is_valid_filename(filename: str) -> bool:
-    forbidden_characters = r'[<>:"/\\|?*]'
-    return not re.search(forbidden_characters, filename)
-
+# Load environment variables from the .env file
+load_dotenv(env_path)
 
-def get_openai_response(text):
-    max_attempts = 3
-    attempt = 0
+openai_model = os.getenv("OPENAI_MODEL")
+my_company_name = os.getenv("MY_COMPANY_NAME")
 
-    while attempt < max_attempts:
-        print(f'Attempt {attempt+1}/{max_attempts}')
-        print('---------------------------------')
+client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
 
-        print('PDF text (preview):')
-        print({text[:100]})
-        print('---------------------------------')
+class DocumentResponse(BaseModel):
+    company_name: str = Field(..., description="Name of the company in the document")
+    document_date: str = Field(..., description="Date of the document in format dd.mm.yyyy")
+    document_type: str = Field(..., description="Type of the document (ER, AR, etc.)")
 
-        completion = openai.ChatCompletion.create(
-            model=openai_model,
+def is_valid_filename(filename: str) -> bool:
+    forbidden_chars = r'[<>:"/\\|?*]'
+
+    if re.search(forbidden_chars, filename):
+        return False
+
+    if not filename or filename.isspace():
+        return False
+
+    if len(filename) > 255:
+        return False
+
+    return True
+
+def preprocess_image(image):
+    # Convert to grayscale
+    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
+    # Apply thresholding to preprocess the image
+    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
+    # Apply dilation to remove noise
+    kernel = np.ones((1, 1), np.uint8)
+    gray = cv2.dilate(gray, kernel, iterations=1)
+    return Image.fromarray(gray)
+
+def pdf_to_text(pdf_path: str, start_page: int = 1, end_page: int = 1) -> str:
+    """Convert a range of pages from a PDF to text using OCR with preprocessing."""
+    try:
+        images = convert_from_path(pdf_path, first_page=start_page, last_page=end_page)
+        text = ""
+        for image in images:
+            # Preprocess the image
+            processed_image = preprocess_image(image)
+            # Perform OCR with specific configuration
+            page_text = pytesseract.image_to_string(
+                processed_image,
+                config='--psm 6 --oem 3 -c preserve_interword_spaces=1'
+            )
+            text += page_text + "\n\n"
+        return text.strip()
+    except Exception as e:
+        logging.error(f"Error converting PDF to text: {e}")
+        return ""
+
+def get_openai_response(pdf_path: str) -> Dict[str, str]:
+    """Get structured information from OpenAI API, potentially using multiple pages."""
+    text = pdf_to_text(pdf_path, start_page=1, end_page=1)
+    response = process_text_with_openai(text)
+
+    if response['company_name'] == UNKNOWN_VALUE or response['document_date'] == DEFAULT_DATE or response['document_type'] == UNKNOWN_VALUE:
+        logging.info("Insufficient information from first page. Checking second page.")
+        text += "\n\n" + pdf_to_text(pdf_path, start_page=2, end_page=2)
+        response = process_text_with_openai(text)
+
+    if response['company_name'] == UNKNOWN_VALUE or response['document_date'] == DEFAULT_DATE or response['document_type'] == UNKNOWN_VALUE:
+        logging.info("Still insufficient information. Checking third page.")
+        text += "\n\n" + pdf_to_text(pdf_path, start_page=3, end_page=3)
+        response = process_text_with_openai(text)
+
+    return response
+
+def process_text_with_openai(text: str) -> Dict[str, str]:
+    """Process the extracted text with OpenAI API."""
+    try:
+        response = client.chat.completions.create(
+            model=os.getenv("OPENAI_MODEL", "gpt-4o"),
             messages=[
                 {
-                    "role": "system",
-                    "content":
-                    "You will be asked to extract the company name, document date, and document type from a PDF document." +
-                    "Due to the nature of OCR, the text will be very noisy and might contain spelling errors, handle those as good as possible." +
-                    "You will only return a JSON object with these properties only \"company_name\", \"document_date\", \"document_type\"." +
+                    "role": "system",
+                    "content": "You will extract the company name, document date, and document type from the following PDF text. " +
+                    "Adhere to the following JSON format: company_name, document_date, document_type. " +
                     "No additional text and no formatting. Only the JSON object." +
-                    "If the text language is German, assume a European date format (dd.mm.YYYY or dd/mm/YYYY or reverse) in the text. Return format: dd.mm.YYYY" +
+                    "Due to the nature of OCR Text detection, the text will be very noisy and might contain spelling errors, handle those as good as possible." +
+                    "For the company_name you always strip the legal form (e.U., SARL, GmbH, AG, Lmt, Limited etc.)" +
+                    "If the text language is German, assume a European date format (dd.mm.YYYY or dd/mm/YYYY or reverse) in the provided text. Return format: dd.mm.YYYY" +
                     "Valid document types are: For incoming invoices (invoices my company receives) use the term 'ER' only, nothing more. For outgoing invoices (invoices my company sends) use the term 'AR', nothing more." +
                     "For all other documents, find a short descriptive summary/subject in german language." +
-                    "My company name is: \"" + my_company_name + "\", avoid using my company name as company_name in the response." +
-                    "Here are three example responses for training purpose only:" +
-                    "Example incoming invoice: {\"company_name\": \"ACME\", \"document_date\": \"01.01.2021\", \"document_type\": \"ER\"} " +
-                    "Example outgoing invoice: {\"company_name\": \"ACME\", \"document_date\": \"01.01.2021\", \"document_type\": \"AR\"} " +
-                    "Example document: {\"company_name\": \"ACME\", \"document_date\": \"01.01.2021\", \"document_type\": \"Angebot\"}"
-                    "Example if date is unavailable: {\"company_name\": \"ACME\", \"document_date\": \"00.00.0000\", \"document_type\": \"Angebot\"}"
-                },
-                {"role": "user", "content": f"Extract the \"company_name\", \"document_date\", \"document_type\" from this PDF document and return a JSON object:\n\n{text}"},
-            ]
+                    "If a value is not found, leave it empty." +
+                    f"My company name is: \"{os.getenv('MY_COMPANY_NAME')}\", avoid using my company name as company_name in the response."
+                },
+                {"role": "user", "content": f"Extract the information from the text:\n\n{text}"}
+            ],
+            response_format={ "type": "json_object" }
         )
+
+        content = response.choices[0].message.content
+        parsed_response = json.loads(content)
+        logging.info(f'API Extract Response: {parsed_response}')
 
-        response = completion.choices[0].message["content"]
-
-        print('API Extract Response:')
-        print(response)
-        print('---------------------------------')
-
-        try:
-            json_response = json.loads(response)
-            if ('company_name' in json_response and 'document_date' in json_response and 'document_type' in json_response):
-                company_name = json_response['company_name']
-                document_date = json_response['document_date']
-                document_type = json_response['document_type']
-
-                if (is_valid_filename(company_name) and is_valid_filename(document_type) and document_date):
-                    break
+        company_name = parsed_response.get('company_name', UNKNOWN_VALUE)
+        document_date = parsed_response.get('document_date', DEFAULT_DATE)
+        document_type = parsed_response.get('document_type', UNKNOWN_VALUE)
 
-        except json.JSONDecodeError:
-            pass
+        if not is_valid_filename(company_name):
+            company_name = UNKNOWN_VALUE
+        if not is_valid_filename(document_type):
+            document_type = UNKNOWN_VALUE
+        if not is_valid_filename(document_date):
+            document_date = DEFAULT_DATE
 
-        attempt += 1
+        return {"company_name": company_name, "document_date": document_date, "document_type": document_type}
 
-    if attempt == max_attempts:
-        return {"company_name": "Unknown", "document_date": "00000000", "document_type": "Unknown"}
-
-    return json_response
+    except Exception as e:
+        logging.error(f"Error during OpenAI API call: {e}")
+
+    return {"company_name": UNKNOWN_VALUE, "document_date": DEFAULT_DATE, "document_type": UNKNOWN_VALUE}
 
 
-def harmonize_company_name(company_name):
+def harmonize_company_name(company_name: str) -> str:
+    """Harmonize company name based on predefined mappings."""
     company_name = company_name.strip()
-
     if not os.path.exists("harmonized-company-names.json"):
-        print(
-            f'harmonized-company-names.json not found, using original name: {company_name}')
+        logging.warning(f'harmonized-company-names.json not found, using original name: {company_name}')
         return company_name
 
     with open("harmonized-company-names.json", "r", encoding='utf-8') as file:
         harmonized_names = json.load(file)
 
-    best_match = company_name
-    best_similarity = 0
+    best_match = max(
+        ((harmonized_name, max(jaro_winkler_similarity(company_name.lower(), synonym.lower()) for synonym in synonyms))
+         for harmonized_name, synonyms in harmonized_names.items()),
+        key=lambda x: x[1]
+    )
 
-    for harmonized_name, synonyms in harmonized_names.items():
-        for synonym in synonyms:
-            similarity = jaro_winkler_similarity(
-                company_name.lower(), synonym.lower())
-            if similarity > best_similarity:
-                best_similarity = similarity
-                best_match = harmonized_name
+    if best_match[1] > CONFIDENCE_THRESHOLD:
+        logging.info(f'Using harmonized company name: {best_match[0]}')
+        return best_match[0]
 
-    confidence_threshold = 0.85
-    if best_similarity > confidence_threshold:
-        print(f'Using harmonized company name: {best_match}')
-        return best_match
-
-    print(
-        f'No harmonized company name found, using original name: {company_name}')
+    logging.info(f'No harmonized company name found, using original name: {company_name}')
     return company_name
 
+def parse_openai_response(response: Dict[str, str]) -> Tuple[str, Optional[datetime.date], str]:
+    """Parse the OpenAI response and extract relevant information."""
+    company_name = response.get('company_name', UNKNOWN_VALUE)
+    document_date = response.get('document_date', DEFAULT_DATE)
+    document_type = response.get('document_type', UNKNOWN_VALUE)
 
-def parse_openai_response(response):
-    company_name = response.get('company_name', 'Unknown')
-
-    document_date = response.get('document_date', '00000000')
-    if document_date is None or document_date.strip() == '' or document_date.strip().lower() == 'unbekannt':
-        document_date = "00000000"
-
-    parsed_document_date = dateparser.parse(str(document_date), settings={
-        'DATE_ORDER': 'DMY'
-    })
-
-    if parsed_document_date is None:
-        document_date = dateparser.parse('00000000', settings={
-            'DATE_ORDER': 'DMY'
-        })
-    else:
-        document_date = parsed_document_date
-
-    document_type = response.get('document_type', 'Unknown')
-
-    return company_name, document_date, document_type
+    parsed_date = dateparser.parse(document_date, settings={'DATE_ORDER': 'DMY'})
+    if parsed_date is None:
+        parsed_date = dateparser.parse(DEFAULT_DATE, settings={'DATE_ORDER': 'DMY'})
 
+    return company_name, parsed_date, document_type
 
-def rename_invoice(pdf_path, company_name, document_date, document_type):
-    if document_date is not None:
+def rename_invoice(pdf_path: str, company_name: str, document_date: Optional[datetime.date], document_type: str) -> None:
+    """Rename the document based on extracted information."""
+    if document_date:
         base_name = f'{document_date.strftime("%Y%m%d")} {company_name} {document_type}'
     else:
         base_name = f'{company_name} {document_type}'
 
-    counter = 0
-    new_name = base_name + '.pdf'
+    new_name = f"{base_name}.pdf"
     new_path = os.path.join(os.path.dirname(pdf_path), new_name)
 
     if pdf_path == new_path:
-        print(f'File "{new_name}" is already correctly named.')
+        logging.info(f'File "{new_name}" is already correctly named.')
         return
 
+    counter = 0
     while os.path.exists(new_path):
         counter += 1
         new_name = f'{base_name} ({counter}).pdf'
@@ -173,41 +210,39 @@ def rename_invoice(pdf_path, company_name, document_date, document_type):
 
     try:
         os.rename(pdf_path, new_path)
-        print(f'Invoice renamed to: {new_name}')
+        logging.info(f'Document renamed to: {new_name}')
     except Exception as e:
-        print(f'Error renaming {pdf_path}: {str(e)}')
-
-
-def process_folder(folder_path):
-    for root, _, files in os.walk(folder_path):
-        for file in files:
-            if file.lower().endswith(".pdf"):
-                pdf_path = os.path.join(root, file)
-                text = pdf_to_text(pdf_path)
-                openai_response = get_openai_response(text)
-                company_name, document_date, document_type = parse_openai_response(
-                    openai_response)
-                company_name = harmonize_company_name(company_name)
-                rename_invoice(pdf_path, company_name,
-                               document_date, document_type)
-
-
-if __name__ == '__main__':
+        logging.error(f'Error renaming {pdf_path}: {str(e)}')
+
+def process_pdf(pdf_path: str) -> None:
+    """Process a single PDF file."""
+    logging.info("---")
+    logging.info(f"Processing {pdf_path}")
+    openai_response = get_openai_response(pdf_path)
+    company_name, document_date, document_type = parse_openai_response(openai_response)
+    company_name = harmonize_company_name(company_name)
+    rename_invoice(pdf_path, company_name, document_date, document_type)
+
+def process_input(input_paths):
+    """Process multiple input paths, which can be files or folders."""
+    for input_path in input_paths:
+        if os.path.isfile(input_path):
+            if input_path.lower().endswith(PDF_EXTENSION):
+                process_pdf(input_path)
+            else:
+                logging.warning(f"{input_path} is not a valid PDF.")
+        elif os.path.isdir(input_path):
+            for root, _, files in os.walk(input_path):
+                for file in files:
+                    if file.lower().endswith(PDF_EXTENSION):
+                        process_pdf(os.path.join(root, file))
+        else:
+            logging.error(f"{input_path} is not a valid file or folder.")
+
+if __name__ == "__main__":
     if len(sys.argv) < 2:
-        print('Usage: python autorename.py or ')
+        logging.error('Usage: python autorename.py [ ...]')
         sys.exit(1)
 
-    input_path = sys.argv[1]
-
-    if os.path.isfile(input_path) and input_path.lower().endswith('.pdf'):
-        text = pdf_to_text(input_path)
-        openai_response = get_openai_response(text)
-        company_name, document_date, document_type = parse_openai_response(
-            openai_response)
-        company_name = harmonize_company_name(company_name)
-        rename_invoice(input_path, company_name, document_date, document_type)
-    elif os.path.isdir(input_path):
-        process_folder(input_path)
-    else:
-        print('Invalid input. Please provide a path to a PDF file or a folder.')
-        sys.exit(1)
+    input_paths = sys.argv[1:]
+    process_input(input_paths)
\ No newline at end of file
diff --git a/dist/add-to-context-menu.exe b/dist/add-to-context-menu.exe
new file mode 100644
index 0000000..025d26a
Binary files /dev/null and b/dist/add-to-context-menu.exe differ
diff --git a/dist/autorename-pdf.exe b/dist/autorename-pdf.exe
new file mode 100644
index 0000000..2f8fe31
Binary files /dev/null and b/dist/autorename-pdf.exe differ
diff --git a/doskey-alias.bat b/doskey-alias.bat
new file mode 100644
index 0000000..b0b9ba1
--- /dev/null
+++ b/doskey-alias.bat
@@ -0,0 +1,18 @@
+@echo off
+REM Batch file to create a doskey macro for running a Python script
+REM Alias name: autorename
+REM Usage: autorename TARGET_DIR
+REM This command will run the Python script located at "G:\My Drive\System\Autorename" and pass the TARGET_DIR as an argument.
+
+doskey autorename=python "G:\My Drive\System\Autorename\autorename.py" $*
+
+REM Instructions to make this doskey macro permanent:
+REM 1. Save this script as set_aliases.bat in a permanent location on your computer.
+REM 2. Press Win + R, type regedit, and press Enter to open the Registry Editor.
+REM 3. Navigate to the following key:
+REM    HKEY_CURRENT_USER\Software\Microsoft\Command Processor
+REM 4. Right-click on the right pane and choose New > String Value.
+REM 5. Name this new string value 'AutoRun'.
+REM 6. Double-click on 'AutoRun' and set its value to the full path of your set_aliases.bat file,
+REM    for example, C:\Path\To\Your\set_aliases.bat.
+REM 7. Close the Registry Editor and restart your Command Prompt to apply the changes.
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
index b92c169..766fa16 100644
Binary files a/requirements.txt and b/requirements.txt differ
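
The naming scheme that `rename_invoice` applies in this diff (date, company name, document type, plus a ` (n)` counter when the target name is already taken) can be tried in isolation, without OCR or an API key. The following is a minimal sketch, assuming the same `YYYYMMDD Company Type.pdf` pattern shown in the README examples; the helper names `build_base_name` and `next_free_path` are hypothetical and do not exist in the repository.

```python
import datetime
import os
import tempfile
from typing import Optional


def build_base_name(company: str, doc_type: str,
                    doc_date: Optional[datetime.date]) -> str:
    """Compose 'YYYYMMDD Company Type', or 'Company Type' when no date is known."""
    parts = [company, doc_type]
    if doc_date is not None:
        parts.insert(0, doc_date.strftime("%Y%m%d"))
    return " ".join(parts)


def next_free_path(folder: str, base_name: str) -> str:
    """Return a .pdf path in `folder` that does not exist yet.

    Mirrors the while-loop in rename_invoice: on a collision, ' (1)',
    ' (2)', ... is appended until an unused name is found.
    """
    candidate = os.path.join(folder, f"{base_name}.pdf")
    counter = 0
    while os.path.exists(candidate):
        counter += 1
        candidate = os.path.join(folder, f"{base_name} ({counter}).pdf")
    return candidate


if __name__ == "__main__":
    # Hypothetical demo values, taken from the README example.
    base = build_base_name("ACME", "ER", datetime.date(2023, 9, 1))
    with tempfile.TemporaryDirectory() as folder:
        first = next_free_path(folder, base)
        open(first, "w").close()  # simulate an already renamed document
        second = next_free_path(folder, base)
        print(os.path.basename(first))
        print(os.path.basename(second))
```

Keeping the name construction separate from `os.rename` makes the collision handling easy to check on a temporary folder before pointing the tool at real documents.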