Company URL Finder is a robust Python application designed to help you efficiently search and extract company website URLs using multiple strategies. The project provides two main search approaches:
- Selenium Web Scraping: Uses Selenium WebDriver to perform direct Google searches
- Google Custom Search API: Leverages Google's official Custom Search API for precise URL retrieval (a minimal request sketch follows the feature list below)
Key features include:

- Parallel processing of company searches
- Multiple search strategies
- Adaptive URL ranking algorithm
- Error handling and logging
- Flexible configuration options
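As a minimal sketch of the API-based approach, assuming the standard Custom Search JSON API endpoint and the credentials configured later in this README, a single query could look like the following (the `requests` call and result handling are illustrative, not the project's actual implementation):

```python
import os
import requests

def search_company_url(company_name):
    """Return the top result URL for a company via the Custom Search JSON API."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": os.environ["GOOGLE_CUSTOM_SEARCH_API_KEY"],
            "cx": os.environ["CUSTOM_SEARCH_ENGINE_ID"],
            "q": f"{company_name} official website",
        },
        timeout=10,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None
```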
Prerequisites:

- Python 3.8+
- Chrome browser (for Selenium)
- ChromeDriver (matching your Chrome version)
Install the required dependencies using pip:

```bash
pip install -r requirements.txt
```
Configure the Google Custom Search credentials:

- Create a `.env` file in the project root
- Add the following environment variables:

```
GOOGLE_CUSTOM_SEARCH_API_KEY=your_google_api_key
CUSTOM_SEARCH_ENGINE_ID=your_custom_search_engine_id
```
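A minimal sketch of loading these values, assuming the project reads them with `python-dotenv` (the package and variable handling here are illustrative):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load variables from the .env file in the project root
load_dotenv()

API_KEY = os.environ["GOOGLE_CUSTOM_SEARCH_API_KEY"]
SEARCH_ENGINE_ID = os.environ["CUSTOM_SEARCH_ENGINE_ID"]
```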
- Clone the repository:

  ```bash
  git clone https://github.com/XenosWarlocks/company-url-finder.git
  cd company-url-finder
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install ChromeDriver:
  - Download a version compatible with your installed Chrome browser
  - Add it to your system PATH or specify its location in the script
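If you prefer to point Selenium at a specific ChromeDriver binary instead of relying on PATH, a minimal Selenium 4 sketch looks like this (the path below is a placeholder, not something the project prescribes):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path: adjust to wherever your ChromeDriver binary lives
CHROMEDRIVER_PATH = "/usr/local/bin/chromedriver"

service = Service(executable_path=CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service)
driver.get("https://www.google.com")
driver.quit()
```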
Prepare an Excel file (`companies.xlsx`) with a column named "Company Name" containing the list of companies you want to search.
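For example, a minimal input file with the expected column could be generated with pandas (assuming `openpyxl` is installed for Excel output; the company names are placeholders):

```python
import pandas as pd

# Hypothetical example companies; replace with your own list
companies = pd.DataFrame({"Company Name": ["Acme Corp", "Globex", "Initech"]})
companies.to_excel("companies.xlsx", index=False)
```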
Run the application:

```bash
python main.py
```
The application offers three search strategies:

- Selenium Google Search (Option 1):
  - Faster, web-scraping approach
  - Parallel processing (see the sketch after this list)
  - Suitable for smaller lists
- Google Custom Search API (Option 2):
  - More precise results
  - Limited by API quota
  - Better for comprehensive searches
- Combined Strategy (Option 3):
  - First uses Selenium
  - Then validates/processes the results with the API
  - Most thorough, but slower
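As a rough illustration of the parallel Selenium mode, a worker pool over the company list might look like the following sketch (the `search_company` helper is hypothetical and stands in for the project's Selenium search logic):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_company(company):
    """Hypothetical stand-in for the project's Selenium-based Google search."""
    # ... drive a (headless) Chrome instance and scrape candidate URLs ...
    return company, None

companies = ["Acme Corp", "Globex", "Initech"]
results = {}

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(search_company, name): name for name in companies}
    for future in as_completed(futures):
        company, url = future.result()
        results[company] = url
```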
The application produces the following output files:

- `google_results.csv`: Successful company URL matches
- `cant_find_urls.csv`: Companies without URL matches
- `api_results.csv`: Custom Search API results
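For post-processing, the CSVs can be loaded with pandas; a small sketch (the column layout is whatever the application writes, so only row counts are shown here):

```python
import pandas as pd

matches = pd.read_csv("google_results.csv")
missing = pd.read_csv("cant_find_urls.csv")

print(f"{len(matches)} companies matched, {len(missing)} without a URL")
```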
Customize the Selenium searcher in `selenium_searcher.py`:

- `headless`: Run the browser invisibly
- `max_workers`: Control the number of parallel search threads
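For reference, a typical way these options map onto Selenium (a sketch under the assumption that the project wires them up roughly like this):

```python
from selenium import webdriver

HEADLESS = True   # run Chrome without a visible window
MAX_WORKERS = 4   # passed to a ThreadPoolExecutor, as in the parallel sketch above

def make_driver():
    options = webdriver.ChromeOptions()
    if HEADLESS:
        options.add_argument("--headless=new")  # modern Chrome headless mode
    return webdriver.Chrome(options=options)
```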
Adjust the URL ranking weights in `google_algo.py`:

- `URL_COUNT_WEIGHT`
- `URL_ORDER_WEIGHT`
- `URL_LEN_WEIGHT`
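The exact scoring formula lives in `google_algo.py`; as a purely illustrative sketch, a weighted score combining how often a URL appears, how early it ranks, and how short it is might look like this (weights and formula are assumptions, not the project's actual algorithm):

```python
URL_COUNT_WEIGHT = 0.5   # how often the URL appears across results
URL_ORDER_WEIGHT = 0.3   # how early the URL appears (lower index is better)
URL_LEN_WEIGHT = 0.2     # shorter URLs tend to be root domains

def score_url(count, order_index, url):
    """Illustrative weighted score; not the project's actual formula."""
    return (
        URL_COUNT_WEIGHT * count
        - URL_ORDER_WEIGHT * order_index
        - URL_LEN_WEIGHT * len(url)
    )
```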
You can extend functionality by:
- Creating custom URL matching algorithms
- Adding more web scraping strategies
- Implementing additional ranking methods
Example extension structure:

```python
class CustomURLFinder:
    def __init__(self, parent_finder):
        # Keep a reference to the main finder so its helpers can be reused
        self.parent = parent_finder

    def custom_url_matching_method(self, company, urls):
        # Implement custom matching logic here
        pass
```
If you run into issues:

- Ensure your ChromeDriver version matches your Chrome version
- Check that the API key and Search Engine ID are set correctly
- Verify the input file format (a "Company Name" column is required)
- Monitor your API usage quotas
Contributions are welcome:

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
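A typical command sequence for those steps (the branch name is just an example):

```bash
git checkout -b feature/my-improvement
# ... make your changes ...
git commit -am "Describe your change"
git push origin feature/my-improvement
```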