-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PDF extractor #557
Add PDF extractor #557
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution as usual! Could you please add information about the feature to CHANGELOG.md?
from pdf2image import convert_from_path | ||
self._temp_directory = tempfile.mkdtemp(prefix='cvat-') | ||
super().__init__( | ||
source_path=source_path[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why source_path[0]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the implementation for VideoExtractor
See: https://github.com/opencv/cvat/blob/5104cc08c13d325e172d8523b7d2b7347c61c87d/cvat/apps/engine/media_extractors.py#L111
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why the source_path
is a list here, but I believe due to the implementation, this will get a list with a single item in it. I didn't dive into the overall architecture for custom extractors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benhoff , you specified in description of pdf extractor that multiple pdf documents can be uploaded. For video extractor unique flag is True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extractor's constructor always receive a list as source_path
argument. I don't see any problem here, but the extractor is responsible to correctly handle passed source list. Could you please adjust description according to extractor behaviour? I mean case if you try to create task with several pdf files but only one will be used.
'pdf': {
...
'unique': **False**
},
Maybe it will be better to change behaviour and pass to the constructor a list or single item according its description. I'll think about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed unique
to True
. For my case, PDF's can have multiple pages and I want to be able to flip through them, like a video. Let me know if you think the implementation should be different.
@@ -72,6 +72,48 @@ def save_image(self, k, dest_path): | |||
image.close() | |||
return width, height | |||
|
|||
class PDFExtractor(MediaExtractor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would look at ArchiveExtractor implementation and inherit the class from DirectoryExtractor. Let's implement here _extract method ... What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using VideoExtractor
as a basis here because in my case, PDF's could have multiple pages. Is there a better way to handle multiple page PDF's?
Added! |
@azhavoro , please review the patch and leave your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@azhavoro what is
_dest_path
for?
https://github.com/opencv/cvat/blob/5104cc08c13d325e172d8523b7d2b7347c61c87d/cvat/apps/engine/media_extractors.py#L109
I'll fix it as soon as possible.
cvat/apps/engine/media_extractors.py
Outdated
return self._get_imagepath(k) | ||
|
||
def __len__(self): | ||
return len(os.listdir(self._temp_directory)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's calculate the length in the __init__ method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
from pdf2image import convert_from_path | ||
self._temp_directory = tempfile.mkdtemp(prefix='cvat-') | ||
super().__init__( | ||
source_path=source_path[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extractor's constructor always receive a list as source_path
argument. I don't see any problem here, but the extractor is responsible to correctly handle passed source list. Could you please adjust description according to extractor behaviour? I mean case if you try to create task with several pdf files but only one will be used.
'pdf': {
...
'unique': **False**
},
Maybe it will be better to change behaviour and pass to the constructor a list or single item according its description. I'll think about that.
* develop: (112 commits) fixed attribute processing in auto_annotation (cvat-ai#577) CVAT.js API Tests (cvat-ai#578) Fixed exception in attribute annotation mode (cvat-ai#571) CVAT.js API methods were implemented (cvat-ai#572) Dashboard components basic styles (cvat-ai#574) Handle invalid json labelmap file case correctly during create/update DL model stage. (cvat-ai#573) Upgrade Numpy to avoid Arbitrary Code Execution. Upgrade Django to avoid MitM (cvat-ai#575) Run functional tests for REST API during a build (cvat-ai#506) CVAT.js other implemented API methods and bug fixes (cvat-ai#569) CVAT.js implemented API methods and bug fixes (cvat-ai#564) added in handeling for openvino 2019 (cvat-ai#545) added in command line auto annotation runner (cvat-ai#563) Fixed PDF extractor syntax error (cvat-ai#565) Update README.md added in pdf extractor (cvat-ai#557) Basic dashboard components (cvat-ai#562) Saving of annotations on the server (cvat-ai#561) Code was devided by files (cvat-ai#558) CVAT.js: Save and delete for shapes/tracks/tags (cvat-ai#555) Fixed '=' to '==' for numpy in requirments (cvat-ai#556) ... # Conflicts: # .gitignore
No description provided.