A Python script for creating image datasets for AI/ML applications
(Optional) These repositories need to be downloaded manually and placed alongside the DatasetCreator.py file if required:
- ImageSetCleaner by Guillaume Erhard (https://github.com/GuillaumeErhard/ImageSetCleaner), licensed under the GPL-3.0 license, is used for semi-supervised image cleaning. For fine-tuning the predictions, read the README at the link above.
- labelImg by Tzutalin (https://github.com/tzutalin/labelImg), licensed under the MIT license, is used for drawing bounding boxes on the images. For detailed instructions and shortcuts, read the README at the link above.
- Python >= 3.6.0, <= 3.6.8 (64-bit only)
- Libraries: Pillow, selenium, requests, imagehash, tensorflow, numpy, matplotlib, PyQt5, six, scikit_learn, lxml
Note: the libraries can be installed with:
$ pip install -r requirements.txt
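For reference, a requirements.txt covering the library list above could look like this (versions unpinned here; pin them to releases compatible with the Python 3.6 constraint as needed):

```
Pillow
selenium
requests
imagehash
tensorflow
numpy
matplotlib
PyQt5
six
scikit-learn
lxml
```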
# optional
$ git clone https://github.com/GuillaumeErhard/ImageSetCleaner
$ git clone https://github.com/tzutalin/labelImg
$ cd labelImg
$ pyrcc5 -o libs/resources.py resources.qrc
Set your preferences in the settings.json file, then run the script:
$ python DatasetCreator.py
On the prompt for search term, enter the word/sentence as you would in a search bar.
Note: each term is searched separately. Type ... to finish the input.
Search Term: cat eating grass
Search Term: <- Empty lines and lines containing only spaces are ignored
Search Term: . <- Lines with only full stops are ignored
Search Term: .. <- Ignored
Search Term: why is a cat eating grass
Search Term: ... <- This signifies the end of input; the script will start fetching images
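The filtering above boils down to a small input loop; a minimal sketch (read_search_terms is an illustrative name, not necessarily the function in DatasetCreator.py):

```python
def read_search_terms():
    """Collect search terms, skipping blank and dot-only lines, until '...'."""
    terms = []
    while True:
        term = input("Search Term: ").strip()
        if term == "...":                    # three dots end the input
            break
        if not term or term in (".", ".."):  # blank or dot-only lines are ignored
            continue
        terms.append(term)
    return terms
```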
- The script first accesses google.com and extracts a Selenium object for each image thumbnail
- The URL of each image is then extracted from the thumbnail, and the image is downloaded to dataset/search_term/
- Hashes are calculated for each image using the phash algorithm, and duplicates are deleted (see the first sketch below)
- ImageSetCleaner is used to filter out bad images from the dataset (optional)
- The images are then converted to a square shape while maintaining the aspect ratio (see the second sketch below)
- The square images are resized, mirrored, and distributed into train/valid/test folders in the dataset/search_term/ directory (see the third sketch below)
- The images are renamed sequentially from 1 to n, separately for each of the train, valid, and test folders
- Finally, the images are labelled in PASCAL VOC/YOLO format using labelImg (optional)
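As a rough illustration of the duplicate-removal step, a minimal sketch using the imagehash library (remove_duplicates is an illustrative name; the script's actual implementation may differ):

```python
import os
from PIL import Image
import imagehash

def remove_duplicates(folder):
    """Delete images whose perceptual hash (phash) has already been seen."""
    seen = set()
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        try:
            h = imagehash.phash(Image.open(path))
        except OSError:        # unreadable or corrupt download
            os.remove(path)
            continue
        if h in seen:          # identical phash -> near-duplicate image
            os.remove(path)
        else:
            seen.add(h)
```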
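The squaring step keeps the aspect ratio by padding the shorter side; a minimal Pillow sketch, assuming black padding (the fill colour actually used by the script may differ):

```python
from PIL import Image

def make_square(img, fill=(0, 0, 0)):
    """Pad the shorter side so the image becomes square, preserving aspect ratio."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # paste centred so the original content sits in the middle of the square
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas
```

Mirroring, when enabled, is a single call per image: PIL.ImageOps.mirror(img).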
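The distribution step follows the image_distribution ratio from settings.json (e.g. "70/15/15"); a sketch of that split logic, with distribute as an illustrative name:

```python
import os
import random
import shutil

def distribute(folder, ratio="70/15/15"):
    """Move images into train/valid/test subfolders according to the ratio."""
    files = [f for f in os.listdir(folder)
             if os.path.isfile(os.path.join(folder, f))]
    random.shuffle(files)
    train_pct, valid_pct, _ = (int(p) for p in ratio.split("/"))
    n = len(files)
    cut1 = n * train_pct // 100
    cut2 = cut1 + n * valid_pct // 100
    splits = {"train": files[:cut1], "valid": files[cut1:cut2], "test": files[cut2:]}
    for split, names in splits.items():
        os.makedirs(os.path.join(folder, split), exist_ok=True)
        for name in names:
            shutil.move(os.path.join(folder, name), os.path.join(folder, split, name))
```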
The settings can be changed via the settings.json file (a sample file is shown after the table).
Setting | Description |
---|---|
no_img | The number of images to download (approximate) |
target_url | The base URL to search from |
stealth | Spoof the user agent using the user_agent value below |
user_agent | UA string to be used in stealth mode; use any valid UA string you like |
image_dimension | Dimension of the images in the dataset if "resize_images" is true |
image_distribution | Ratio of images in the train/valid/test sets, e.g. "70/15/15" |
driver | Path to the webdriver for the browser, e.g. driver/geckodriver.exe for Firefox |
logging | Enable logging of events and errors in log/run and log/err |
download_images | Whether to download images via the browser |
remove_duplicate | Delete duplicate images using the phash algorithm |
clean_images | Use ImageSetCleaner by Guillaume Erhard to filter out bad images (optional) |
resize_images | Resize images to image_dimension x image_dimension pixels |
mirror_images | Mirror every image in the dataset (optional) |
move_images | Distribute images into train/valid/test folders based on the image_distribution value |
rename_images | Rename images as 'first search term_(image_no)' |
label_images | Label images using labelImg by Tzutalin (optional) |
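For reference, a settings.json assembled from the table might look like the following; all values here, including the target_url, are illustrative, and the key set should match the file shipped with the script:

```json
{
    "no_img": 200,
    "target_url": "https://www.google.com/search?tbm=isch&q=",
    "stealth": false,
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "image_dimension": 256,
    "image_distribution": "70/15/15",
    "driver": "driver/geckodriver.exe",
    "logging": true,
    "download_images": true,
    "remove_duplicate": true,
    "clean_images": false,
    "resize_images": true,
    "mirror_images": false,
    "move_images": true,
    "rename_images": true,
    "label_images": false
}
```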
- If you require images smaller than 300 px, you can use Beautiful Soup instead of Selenium for much faster execution; you will need to change the code in the 'fetch_img_urls' function (see the sketch below).
- In case of pre-downloaded images, place the folder containing the images in the 'dataset' folder, enter the folder name as the first search term, and set the 'download_images' setting to false.
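A rough idea of that replacement, assuming target_url ends with the query parameter and the static results page exposes thumbnail <img> tags (Google's markup changes over time, so the selection logic may need adjusting):

```python
import requests
from bs4 import BeautifulSoup

def fetch_img_urls(term, target_url, user_agent):
    """Collect thumbnail URLs from the static (JavaScript-free) results page."""
    resp = requests.get(target_url + requests.utils.quote(term),
                        headers={"User-Agent": user_agent}, timeout=10)
    soup = BeautifulSoup(resp.text, "lxml")
    # keep only real http(s) thumbnail sources, skipping inline data: URIs
    return [img["src"] for img in soup.find_all("img")
            if img.get("src", "").startswith("http")]
```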
Released under the GPL-3.0 license