Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate script - S3 bucket support is no longer compatible with options for translating a file or sequence of files #685

Closed
mmartin9684-sil opened this issue Mar 24, 2025 · 11 comments · Fixed by #689
Assignees
Labels
bug Something isn't working pipeline 6: infer Issue related to using a trained model to translate.

Comments

@mmartin9684-sil
Copy link
Collaborator

With the move from Gutenberg to the S3 bucket, the translate script is no longer usable with the option for translating a single specific source file (--src and --trg) or a series of source files (--src-prefix, --trg-prefix, --start-seq, --end-seq).

  • Calls to copy_experiment_from_bucket are only coded to copy the model checkpoints from the S3 bucket to the local drive. The source files for these 2 translation options are not copied.
  • Calls to copy_experiment_to_bucket are only coded to copy *.SFM files from the local drive to the S3 bucket. No other files types (e.g., *.txt, *.docx) supported by these 2 translation options are not targeted.
  • Only files in the experiment folder are targeted for these copy operations. Source and target files specified with these command line parameters are not handled unless they happen to be in the experiment folder.

An additional improvement would be an intuitive scheme for specifying the source and target files on the command line, or help text of some kind to explain how to place the source files in the S3 bucket, where to find the target files once they are created, and how to specify these files as command line arguments for this command.

Note that this issue is not due to the introduction of MinIO.

@mmartin9684-sil mmartin9684-sil added bug Something isn't working pipeline 6: infer Issue related to using a trained model to translate. labels Mar 24, 2025
@ddaspit ddaspit assigned TaperChipmunk32 and unassigned ddaspit Mar 24, 2025
@ddaspit ddaspit moved this from 🆕 New to 🔖 Ready in SIL-NLP Research Mar 24, 2025
@ddaspit
Copy link
Collaborator

ddaspit commented Mar 24, 2025

@mmartin9684-sil Are you currently blocked, because of this issue?

@TaperChipmunk32 Can you take a look at this when you get a chance?

@TaperChipmunk32
Copy link
Collaborator

@mmartin9684-sil Are you currently blocked, because of this issue?

@TaperChipmunk32 Can you take a look at this when you get a chance?

Yes, I can look into this tomorrow.

@mmartin9684-sil
Copy link
Collaborator Author

@mmartin9684-sil Are you currently blocked, because of this issue?

@TaperChipmunk32 Can you take a look at this when you get a chance?

@ddaspit - at the moment, I just need support for the --src/--trg option for text files, and am trying workarounds with local changes, but have not been 100% successful.

@mmartin9684-sil
Copy link
Collaborator Author

It might be good to check the hybrid drafting feature and this issue. I believe the hybrid drafting feature modifies the file names of the target language drafts that get created, and it would be good to be sure that those target language drafts with those modified file names can get copied back back from the remote server.
Not a blocker though.

@benjaminking
Copy link
Collaborator

When we produce multiple translations, we add a draft number to the output file name. For example, if your single-draft output was 3JN.SFM, the multiple-draft output files will be called 3JN.1.SFM, 3JN.2.SFM, etc.

@TaperChipmunk32
Copy link
Collaborator

@mmartin9684-sil @ddaspit

The first two bullet points appear to be straight-forward to address, and would just be behind the scenes changes and not affect how you would run the command. But for the third, I have a question.

I believe I can get --src and --trg to be independent of the experiment directory specified with no new arguments. You would just need to specify the absolute path to the src and trg files in the bucket for them to be copied. Then, the results would be in the experiment directory. However, to make this work with --src-prefix, --trg-prefix, --start-seq, --end-seq, I would add a new flag, named something like --directories or --data-dirs to list every directory that the src and trg files are coming from for them to be copied. Does adding this extra flag sound intuitive?

Alternatively, we could just add a message specifying that all files must be in the --experiment directory.

@mmartin9684-sil
Copy link
Collaborator Author

If the --src and --trg option was set up to default to using the experiment folder for the source and target files, that would be helpful. With the current logic, I found that it was necessary to specify the full path to the source and target files.

For the --src-prefix et al option, it would be enough to plan for a single source/target directory; no need to support multiple directories. If you wanted to support an option for a different target directory (instead of placing the drafts in the same directory as the source files), that would be a nice flexibility.

@ddaspit
Copy link
Collaborator

ddaspit commented Mar 25, 2025

Could we make --src, --trg, --src-prefix, and --trg-prefix all relative to the root or the MT directory of the bucket?

@TaperChipmunk32
Copy link
Collaborator

If the --src and --trg option was set up to default to using the experiment folder for the source and target files, that would be helpful. With the current logic, I found that it was necessary to specify the full path to the source and target files.

For the --src-prefix et al option, it would be enough to plan for a single source/target directory; no need to support multiple directories. If you wanted to support an option for a different target directory (instead of placing the drafts in the same directory as the source files), that would be a nice flexibility.

I can do both of those things. I will also go ahead and support multiple directories, it does not add much complexity.

@TaperChipmunk32
Copy link
Collaborator

Could we make --src, --trg, --src-prefix, and --trg-prefix all relative to the root or the MT directory of the bucket?

Yes, I could do that

@TaperChipmunk32
Copy link
Collaborator

@mmartin9684-sil Do you have an example project/command I could use for testing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pipeline 6: infer Issue related to using a trained model to translate.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants