Missing texts and invalid characters #3

Open
Dev-iL opened this issue Aug 22, 2017 · 2 comments

@Dev-iL
Contributor

Dev-iL commented Aug 22, 2017

Following this pull-request comment, please find attached two files (with renamed extensions) that help reproduce and demonstrate the problem:

Some noticeable problems:

  1. The 1st subtitle (line 9 in the .caption; line 3 in the .srt) gets a line break at ].
  2. The 2nd subtitle (line 13 in the .caption; line 7 in the .srt) has a stray b=+ that shouldn't be there.
  3. After subtitle 25 (lines 103 / 109), the .srt file has stray characters between the subtitles.
  4. (...and more of the same.)
@Dev-iL
Contributor Author

Dev-iL commented Aug 22, 2017

I have simplified the PrepareSrt logic somewhat, and it now seems to work properly. Before I submit a PR, here's my version:

        public string PrepareSrt()
        {
            const int METADATA_LINES = 7, CHARS_BEFORE_TIMESTAMP = 13, CHARS_AFTER_TIMESTAMP = 14;
            // Read the whole file into memory:
            string content = File.ReadAllText(filePath);

            // Discard the first lines, containing metadata used by Lynda desktop app to link subtitle to video:
            string output = RemoveFirstLines(content, METADATA_LINES);

            // Before every timestamp there is a fixed number of characters (starting with [NUL][SOH] and ending with a newline):
            output = Regex.Replace(output, @"\u0000\u0001[\s\S]{" + CHARS_BEFORE_TIMESTAMP + "}[\r\n]*", "");
            
            // After every timestamp there is also a fixed number of characters:
            output = Regex.Replace(output, @"(?<=\[\d\d:\d\d:\d\d\.\d\d\])[\s\S]{" + CHARS_AFTER_TIMESTAMP + "}", "");

            // Remove any remaining characters other than printable ASCII, CR and LF:
            output = Regex.Replace(output, @"[^\u0020-\u007E\r\n]+", "");

            return output;
        }
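
For anyone who wants to try the idea in isolation, here is a minimal sketch of two of the steps above. Note that the RemoveFirstLines body below is only a hypothetical stand-in (the real helper is not shown in this thread), and CaptionCleanupSketch, StripAfterTimestamp and the sample string are names and data made up for illustration, not part of the project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptionCleanupSketch
{
    // Hypothetical stand-in for the RemoveFirstLines helper called by PrepareSrt;
    // the real implementation is not shown in this thread.
    public static string RemoveFirstLines(string text, int count)
    {
        int index = 0;
        for (int i = 0; i < count; i++)
        {
            int next = text.IndexOf('\n', index);
            if (next < 0) return string.Empty; // fewer lines than requested
            index = next + 1;                  // continue after the newline
        }
        return text.Substring(index);
    }

    // Drops a fixed number of characters that directly follow each
    // [hh:mm:ss.ff] timestamp, using the same lookbehind as PrepareSrt.
    public static string StripAfterTimestamp(string text, int charCount)
    {
        string pattern = @"(?<=\[\d\d:\d\d:\d\d\.\d\d\])[\s\S]{" + charCount + "}";
        return Regex.Replace(text, pattern, "");
    }

    public static void Main()
    {
        // Synthetic input: two "metadata" lines, then a timestamp followed by 4 junk chars.
        string sample = "meta1\nmeta2\n[00:00:01.50]XXXXHello\n";
        string body = RemoveFirstLines(sample, 2);
        Console.Write(StripAfterTimestamp(body, 4)); // prints "[00:00:01.50]Hello" and a newline
    }
}
```

The lookbehind keeps the timestamp itself in the output and consumes only the characters after it, which is why the junk count has to be exact for a given file format.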

@mdomnita
Owner

Thanks, @Dev-iL. I stopped working on this because my free Lynda subscription ended a while ago and I don't get a free subscription through my new workplace. Maybe I will create a new account with another email address and check whether it still works.
