Add support for controlling how segments are combined
- `combineEqualTimes`: Combine segments if the `startTime`, `endTime`, and `speaker` match between the current and prior segments. Resolves #19
- `speakerChange`: Only include `speaker` when speaker changes. Resolves #20
- `combineSegments`: Replaces `combineSingleWordSegments` function. Combine segments where speaker is the same and concatenated `body` fits in the `combineSegmentsLength`
stevencrader committed May 16, 2023
1 parent 5735bb7 commit f316ebc
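
The `combineEqualTimes` rule in the commit message can be illustrated with a minimal sketch. This is not the library's implementation: `TimedSegment` and `combineEqualTimesSketch` are hypothetical names, the segment shape is reduced to the fields the rule mentions, and the `\n` separator simply mirrors the documented default of `combineEqualTimesSeparator`.

```typescript
// Hypothetical, reduced segment shape (the real `Segment` type has more fields).
type TimedSegment = {
    startTime: number
    endTime: number
    speaker?: string
    body: string
}

// Sketch of the rule: merge a segment into the prior one when
// startTime, endTime, and speaker all match.
const combineEqualTimesSketch = (
    segments: Array<TimedSegment>,
    separator = "\n" // assumption: mirrors the documented default separator
): Array<TimedSegment> => {
    const out: Array<TimedSegment> = []
    for (const segment of segments) {
        const prior = out[out.length - 1]
        if (
            prior !== undefined &&
            prior.startTime === segment.startTime &&
            prior.endTime === segment.endTime &&
            prior.speaker === segment.speaker
        ) {
            // replace the prior entry instead of mutating the caller's object
            out[out.length - 1] = { ...prior, body: `${prior.body}${separator}${segment.body}` }
        } else {
            out.push({ ...segment })
        }
    }
    return out
}
```

Only consecutive segments are compared here, which matches the "current and prior segments" wording of the commit message.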
Showing 31 changed files with 25,474 additions and 2,669 deletions.
56 changes: 50 additions & 6 deletions README.md
@@ -49,25 +49,24 @@ yarn add transcriptator

There are three primary methods and two types. See the jsdoc for additional information.

The `convertFile` function accepts the transcript file data and parses it in to an array of `Segment`. If `transcriptFormat` is not defined, will use `determineFormat` to attempt to identify the type.
The `convertFile` function accepts the transcript file data and parses it in to an array of `Segment`.
If `transcriptFormat` is not defined, will use `determineFormat` to attempt to identify the type.

convertFile(data: string, transcriptFormat: TranscriptFormat = undefined): Array<Segment>

The `determineFormat` function accepts the transcript file data and attempts to identify the `TranscriptFormat`.

determineFormat(data: string): TranscriptFormat

The `combineSingleWordSegments` function is a helper function for combining the previously parsed `Segment` objects together. The only allowable use case is when the existing `Segment` only contain a single word in the `body`.

combineSingleWordSegments(segments: Array<Segment>, maxLength = 32): Array<Segment>

The `TranscriptFormat` enum defines the allowable transcript types supported by Transcriptator.

The `Segment` type defines the segment/cue of the transcript.

### Custom timestamp formatter

To change the way the `startTime` and `endTime` are formatted in `startTimeFormatted` and `endTimeFormatted`, register a custom formatter to be used instead.
To change the way the `startTime` and `endTime` are formatted in `startTimeFormatted` and `endTimeFormatted`,
register a custom formatter to be used instead.

The formatter function shall accept a single argument as a number and return the value formatted as a string.

```javascript
@@ -80,6 +79,51 @@ function customFormatter(timestamp) {
timestampFormatter.registerCustomFormatter(customFormatter)
```
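
As an illustration of the formatter contract described above (one number in, one string out), here is a minimal sketch. `formatAsClock` is a hypothetical name, and the sketch assumes timestamps are expressed in seconds, which this README excerpt does not state.

```typescript
// Hypothetical formatter: renders a timestamp as zero-padded HH:MM:SS.
// Assumption: `timestamp` is a non-negative number of seconds.
const formatAsClock = (timestamp: number): string => {
    const totalSeconds = Math.floor(timestamp)
    const hours = Math.floor(totalSeconds / 3600)
    const minutes = Math.floor((totalSeconds % 3600) / 60)
    const seconds = totalSeconds % 60
    // pad each part to at least two digits
    return [hours, minutes, seconds].map((part) => (part < 10 ? "0" : "") + part).join(":")
}
```

A function like this could then be passed to `timestampFormatter.registerCustomFormatter`, the call used in the example above.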

### Options for segments

Additional options are available for combining or formatting two or more segments

To change the options, use the `Options.setOptions` function.

The options only need to be specified once and will be used when parsing any transcript data.

To restore options to their default value, call `Options.restoreDefaultSettings`.

The `IOptions` interface used by `Options` defines options for combining and formatting parsed segments.

- `combineEqualTimes`: boolean
- Combine segments if the `Segment.startTime`, `Segment.endTime`, and `Segment.speaker` match between the current and prior segments
- Cannot be used with `combineSegments` or `combineSpeaker`
- Default: false
- `combineEqualTimesSeparator`: string
- Character to use when `combineEqualTimes` is true.
- Default: `\n`
- `combineSegments`: boolean
- Combine segments where speaker is the same and concatenated `body` fits in the `combineSegmentsLength`
- Cannot be used with `combineEqualTimes` or `combineSpeaker`
- Default: false
- `combineSegmentsLength`: number
- Max length of body text to use when `combineSegments` is true
- Default: See `DEFAULT_COMBINE_SEGMENTS_LENGTH`
- `combineSpeaker`: boolean
- Combine consecutive segments from the same speaker.
- Note: this will override `combineSegments` and `combineSegmentsLength`
- Warning: if the transcript does not contain speaker information, resulting segment will contain entire transcript text.
- Default: false
- `speakerChange`: boolean
- Only include `Segment.speaker` when speaker changes
- May be used in combination with `combineEqualTimes` and `combineSegments`
- Default: false

```javascript
import { Options } from "transcriptator"

Options.setOptions({
combineSegments: true,
combineSegmentsLength: 32,
})
```
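
To make the `combineSegments` and `speakerChange` descriptions above more concrete, here is a rough sketch of both rules applied to an already-parsed list. It is not the library's code: `SpokenSegment`, `combineSegmentsSketch`, and `speakerChangeSketch` are hypothetical names, and joining bodies with a single space and comparing the joined length with `<=` are assumptions.

```typescript
// Hypothetical, reduced segment shape (the real `Segment` type has more fields).
type SpokenSegment = { speaker?: string; body: string }

// Sketch of `combineSegments`: merge consecutive segments from the same speaker
// while the concatenated body still fits within `maxLength` characters.
const combineSegmentsSketch = (segments: Array<SpokenSegment>, maxLength = 32): Array<SpokenSegment> => {
    const out: Array<SpokenSegment> = []
    for (const segment of segments) {
        const prior = out[out.length - 1]
        // assumption: bodies are joined with a single space
        const joined = prior === undefined ? "" : `${prior.body} ${segment.body}`
        if (prior !== undefined && prior.speaker === segment.speaker && joined.length <= maxLength) {
            out[out.length - 1] = { ...prior, body: joined }
        } else {
            out.push({ ...segment })
        }
    }
    return out
}

// Sketch of `speakerChange`: keep `speaker` only on segments where it differs
// from the speaker of the previous segment.
const speakerChangeSketch = (segments: Array<SpokenSegment>): Array<SpokenSegment> => {
    let lastSpeaker: string | undefined
    return segments.map((segment): SpokenSegment => {
        if (segment.speaker === lastSpeaker) {
            return { body: segment.body } // same speaker as before: omit the name
        }
        lastSpeaker = segment.speaker
        return { ...segment }
    })
}
```

Running `speakerChangeSketch` on the output of `combineSegmentsSketch` corresponds to enabling both options together, which the option list says is allowed.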

## Supported File Formats

### SRT
1 change: 1 addition & 0 deletions jest.config.js
@@ -2,5 +2,6 @@
module.exports = {
preset: "ts-jest",
testEnvironment: "node",
setupFilesAfterEnv: ["<rootDir>/test/setup.ts"],
coveragePathIgnorePatterns: ["test/*"],
}
7 changes: 5 additions & 2 deletions package.json
@@ -8,7 +8,7 @@
"scripts": {
"build": "run-s -cln 'build:*'",
"build:clean": "shx rm -rf dist",
"build:tsc": "tsc",
"build:tsc": "tsc --sourceMap false",
"build:cp": "shx cp package.json dist && shx cp README.md dist && shx cp LICENSE.md dist",
"build:replace": "shx sed -i 's/\"main\": \"index.ts\"/\"main\": \"index.js\"/g' dist/package.json > /dev/null",
"lint": "run-p -cln 'lint:*'",
@@ -30,7 +30,10 @@
"podcasting",
"transcript"
],
"author": "Steven Crader",
"author": {
"name": "Steven Crader",
"url": "https://steven.crader.co"
},
"license": "MIT",
"bugs": {
"url": "https://github.com/stevencrader/transcriptator/issues"
22 changes: 9 additions & 13 deletions src/formats/html.ts
@@ -1,5 +1,6 @@
import { HTMLElement, parse } from "node-html-parser"

import { addSegment } from "../segments"
import { parseTimestamp, timestampFormatter } from "../timestamp"
import { Segment } from "../types"

@@ -87,25 +88,20 @@ const updateSegmentPartFromElement = (
*
* @param segmentPart HTML segment data
* @param lastSpeaker Name of last speaker. Will be used if no speaker found in `segmentLines`
* @returns Created {@link Segment} and updated speaker
* @returns Created segment
*/
const createSegmentFromSegmentPart = (
segmentPart: HTMLSegmentPart,
lastSpeaker: string
): { segment: Segment; speaker: string } => {
const createSegmentFromSegmentPart = (segmentPart: HTMLSegmentPart, lastSpeaker: string): Segment => {
const calculatedSpeaker = segmentPart.cite ? segmentPart.cite : lastSpeaker
const startTime = parseTimestamp(segmentPart.time)

const segment: Segment = {
return {
startTime,
startTimeFormatted: timestampFormatter.format(startTime),
endTime: 0,
endTimeFormatted: timestampFormatter.format(0),
speaker: calculatedSpeaker.replace(":", "").trimEnd(),
body: segmentPart.text,
}

return { segment, speaker: calculatedSpeaker }
}

/**
@@ -115,7 +111,7 @@ const createSegmentFromSegmentPart = (
* @returns Segments created from HTML data
*/
const getSegmentsFromHTMLElements = (elements: Array<HTMLElement>): Array<Segment> => {
const outSegments: Array<Segment> = []
let outSegments: Array<Segment> = []
let lastSpeaker = ""

let segmentPart: HTMLSegmentPart = {
@@ -142,19 +138,19 @@ const getSegmentsFromHTMLElements = (elements: Array<HTMLElement>): Array<Segmen
if (segmentPart.time === "") {
console.warn(`Segment ${count} does not contain time information, ignoring`)
} else {
const s = createSegmentFromSegmentPart(segmentPart, lastSpeaker)
lastSpeaker = s.speaker
const segment = createSegmentFromSegmentPart(segmentPart, lastSpeaker)
lastSpeaker = segment.speaker

// update endTime of previous Segment
const totalSegments = outSegments.length
if (totalSegments > 0) {
outSegments[totalSegments - 1].endTime = s.segment.startTime
outSegments[totalSegments - 1].endTime = segment.startTime
outSegments[totalSegments - 1].endTimeFormatted = timestampFormatter.format(
outSegments[totalSegments - 1].endTime
)
}

outSegments.push(s.segment)
outSegments = addSegment(segment, outSegments)
}

// clear
29 changes: 17 additions & 12 deletions src/formats/json.ts
@@ -1,3 +1,4 @@
import { addSegment } from "../segments"
import { parseSpeaker } from "../speaker"
import { timestampFormatter } from "../timestamp"
import { Segment } from "../types"
@@ -17,7 +18,7 @@ export type JSONSegment = {
/**
* Name of speaker for `body`
*/
speaker: string
speaker?: string
/**
* Text of transcript for segment
*/
@@ -73,17 +74,20 @@ export const isJSON = (data: string): boolean => {
* @returns An array of Segments from the parsed data
*/
const parseDictSegmentsJSON = (data: JSONTranscript): Array<Segment> => {
const outSegments: Array<Segment> = []
let outSegments: Array<Segment> = []

data.segments.forEach((segment) => {
outSegments.push({
startTime: segment.startTime,
startTimeFormatted: timestampFormatter.format(segment.startTime),
endTime: segment.endTime,
endTimeFormatted: timestampFormatter.format(segment.endTime),
speaker: segment.speaker,
body: segment.body,
})
outSegments = addSegment(
{
startTime: segment.startTime,
startTimeFormatted: timestampFormatter.format(segment.startTime),
endTime: segment.endTime,
endTimeFormatted: timestampFormatter.format(segment.endTime),
speaker: segment.speaker,
body: segment.body,
},
outSegments
)
})

return outSegments
@@ -153,7 +157,7 @@ const getSegmentFromSubtitle = (data: SubtitleSegment): Segment => {
* @throws {TypeError} When item in `data` does not match the {@link SubtitleSegment} format
*/
const parseListJSONSubtitle = (data: Array<SubtitleSegment>): Array<Segment> => {
const outSegments: Array<Segment> = []
let outSegments: Array<Segment> = []

let lastSpeaker = ""

@@ -162,7 +166,8 @@ const parseListJSONSubtitle = (data: Array<SubtitleSegment>): Array<Segment> =>
if (subtitleSegment !== undefined) {
lastSpeaker = subtitleSegment.speaker ? subtitleSegment.speaker : lastSpeaker
subtitleSegment.speaker = lastSpeaker
outSegments.push(subtitleSegment)

outSegments = addSegment(subtitleSegment, outSegments)
} else {
throw new TypeError(`Unable to parse segment for item ${count}`)
}
23 changes: 9 additions & 14 deletions src/formats/srt.ts
@@ -1,3 +1,4 @@
import { addSegment } from "../segments"
import { parseSpeaker } from "../speaker"
import { parseTimestamp, timestampFormatter } from "../timestamp"
import { PATTERN_LINE_SEPARATOR, Segment } from "../types"
@@ -91,23 +92,19 @@ export const parseSRTSegment = (lines: Array<string>): SRTSegment => {
*
* @param segmentLines Lines containing SRT data
* @param lastSpeaker Name of last speaker. Will be used if no speaker found in `segmentLines`
* @returns Created {@link Segment} and updated speaker
* @returns Created segment
*/
const createSegmentFromSRTLines = (
segmentLines: Array<string>,
lastSpeaker: string
): { segment: Segment; speaker: string } => {
const createSegmentFromSRTLines = (segmentLines: Array<string>, lastSpeaker: string): Segment => {
const srtSegment = parseSRTSegment(segmentLines)
const calculatedSpeaker = srtSegment.speaker ? srtSegment.speaker : lastSpeaker
const segment: Segment = {
return {
startTime: srtSegment.startTime,
startTimeFormatted: timestampFormatter.format(srtSegment.startTime),
endTime: srtSegment.endTime,
endTimeFormatted: timestampFormatter.format(srtSegment.endTime),
speaker: calculatedSpeaker,
body: srtSegment.body,
}
return { segment, speaker: calculatedSpeaker }
}

/**
@@ -136,7 +133,7 @@ export const parseSRT = (data: string): Array<Segment> => {
throw new TypeError(`Data is not valid SRT format`)
}

const outSegments: Array<Segment> = []
let outSegments: Array<Segment> = []
let lastSpeaker = ""

let segmentLines = []
@@ -146,9 +143,8 @@ export const parseSRT = (data: string): Array<Segment> => {
// handle consecutive multiple blank lines
if (segmentLines.length !== 0) {
try {
const s = createSegmentFromSRTLines(segmentLines, lastSpeaker)
lastSpeaker = s.speaker
outSegments.push(s.segment)
outSegments = addSegment(createSegmentFromSRTLines(segmentLines, lastSpeaker), outSegments)
lastSpeaker = outSegments[outSegments.length - 1].speaker
} catch (e) {
console.error(`Error parsing SRT segment lines (source line ${count}): ${e}`)
console.error(segmentLines)
@@ -164,9 +160,8 @@ export const parseSRT = (data: string): Array<Segment> => {
// handle data when trailing line not included
if (segmentLines.length !== 0) {
try {
const s = createSegmentFromSRTLines(segmentLines, lastSpeaker)
lastSpeaker = s.speaker
outSegments.push(s.segment)
outSegments = addSegment(createSegmentFromSRTLines(segmentLines, lastSpeaker), outSegments)
lastSpeaker = outSegments[outSegments.length - 1].speaker
} catch (e) {
console.error(`Error parsing final SRT segment lines: ${e}`)
console.error(segmentLines)