Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clipper: Add capability of limiting downloads #9788

Merged
merged 17 commits into from
Mar 9, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
implementing downloadLimiter to extractNote
  • Loading branch information
pedr committed Jan 26, 2024
commit 11d2be78e367b22a716327c1f4c07a425f25fae2
1 change: 1 addition & 0 deletions .eslintignore
Original file line number Diff line number Diff line change
Expand Up @@ -677,6 +677,7 @@ packages/lib/database-driver-better-sqlite.js
packages/lib/database.js
packages/lib/debug/DebugService.js
packages/lib/dom.js
packages/lib/downloadController.js
packages/lib/errorUtils.js
packages/lib/errors.js
packages/lib/eventManager.js
Expand Down
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -657,6 +657,7 @@ packages/lib/database-driver-better-sqlite.js
packages/lib/database.js
packages/lib/debug/DebugService.js
packages/lib/dom.js
packages/lib/downloadController.js
packages/lib/errorUtils.js
packages/lib/errors.js
packages/lib/eventManager.js
Expand Down
49 changes: 31 additions & 18 deletions packages/lib/services/rest/routes/notes.ts
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@ const { MarkupToHtml } = require('@joplin/renderer');
const { ErrorNotFound } = require('../utils/errors');
import { fileUriToPath } from '@joplin/utils/url';
import { NoteEntity } from '../../database/types';
import { DownloadController, DummyDownloadController } from '../../../downloadController';
import { ErrorCode } from '../../../errors';
import { PromisePool } from '@supercharge/promise-pool';

const logger = Logger.create('routes/notes');

Expand Down Expand Up @@ -223,7 +226,7 @@ const isValidUrl = (url: string, isDataUrl: boolean, urlProtocol?: string, allow
return isAllowedProtocol;
};

export async function downloadMediaFile(url: string, fetchOptions?: FetchOptions, allowedProtocols?: string[]) {
export async function downloadMediaFile(url: string, downloadController: DownloadController, fetchOptions?: FetchOptions, allowedProtocols?: string[]) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the downloadController optional.

I know we have this rule to avoid optional parameters, but when it means having to create a dummy object on each call it's better to make it optional

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why do we need this dummy controller? Isn't it possible to make the code work with a simple null controller? I guess there will be a few null checks here and there.

Especially since fetchBlob is heavily used for sync it's better to avoid creating unnecessary objects on each call.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if it's optional, make it part of the "fetchOptions" object

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I move the downloadController to be a property of the other options object and instead of creating a dummy class I'm checking before using downloadController like you suggested.

logger.info('Downloading media file', url);

// The URL we get to download have been extracted from the Markdown document
Expand All @@ -247,7 +250,7 @@ export async function downloadMediaFile(url: string, fetchOptions?: FetchOptions
const localPath = fileUriToPath(url);
await shim.fsDriver().copy(localPath, mediaPath);
} else {
const response = await shim.fetchBlob(url, { path: mediaPath, maxRetry: 1, ...fetchOptions });
const response = await shim.fetchBlob(url, { path: mediaPath, maxRetry: 1, ...fetchOptions }, downloadController);

if (!fileExt) {
// If we could not find the file extension from the URL, try to get it
Expand All @@ -262,27 +265,29 @@ export async function downloadMediaFile(url: string, fetchOptions?: FetchOptions
}
}

async function downloadMediaFiles(urls: string[], fetchOptions?: FetchOptions, allowedProtocols?: string[]) {
const PromisePool = require('es6-promise-pool');

async function downloadMediaFiles(urls: string[], downloadController: DownloadController, fetchOptions: FetchOptions, allowedProtocols?: string[]) {
const output: any = {};

const downloadOne = async (url: string) => {
const mediaPath = await downloadMediaFile(url, fetchOptions, allowedProtocols);
downloadController.imagesCount += 1;
const mediaPath = await downloadMediaFile(url, downloadController, fetchOptions, allowedProtocols);
if (mediaPath) output[url] = { path: mediaPath, originalUrl: url };
};

let urlIndex = 0;
const promiseProducer = () => {
if (urlIndex >= urls.length) return null;

const url = urls[urlIndex++];
return downloadOne(url);
};
await PromisePool
.withConcurrency(10)
.for(urls)
.handleError(async (error: any, _url, pool) => {
if (error.code !== ErrorCode.DownloadLimiter) {
throw error;
}
logger.warn(error);
pool.stop();
})
.process(downloadOne);

const concurrency = 10;
const pool = new PromisePool(promiseProducer, concurrency);
await pool.start();
downloadController.imageCountExpected = urls.length;
downloadController.printStats(urls.length);

return output;
}
Expand Down Expand Up @@ -397,6 +402,7 @@ export const extractNoteFromHTML = async (
requestNote: RequestNote,
requestId: number,
imageSizes: any,
downloadController: DownloadController,
fetchOptions?: FetchOptions,
allowedProtocols?: string[],
) => {
Expand All @@ -406,7 +412,7 @@ export const extractNoteFromHTML = async (

logger.info(`Request (${requestId}): Downloading media files: ${mediaUrls.length}`);

const mediaFiles = await downloadMediaFiles(mediaUrls, fetchOptions, allowedProtocols);
const mediaFiles = await downloadMediaFiles(mediaUrls, downloadController, fetchOptions, allowedProtocols);

logger.info(`Request (${requestId}): Creating resources from paths: ${Object.getOwnPropertyNames(mediaFiles).length}`);

Expand Down Expand Up @@ -459,7 +465,14 @@ export default async function(request: Request, id: string = null, link: string
logger.info('Images:', imageSizes);

const allowedProtocolsForDownloadMediaFiles = ['http:', 'https:', 'file:', 'data:'];
const extracted = await extractNoteFromHTML(requestNote, requestId, imageSizes, undefined, allowedProtocolsForDownloadMediaFiles);
const extracted = await extractNoteFromHTML(
requestNote,
requestId,
imageSizes,
new DummyDownloadController,
undefined,
allowedProtocolsForDownloadMediaFiles,
);

let note = await Note.save(extracted.note, extracted.saveOptions);

Expand Down
7 changes: 6 additions & 1 deletion packages/lib/shim-init-node.ts
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import * as fs from 'fs-extra';
import * as pdfJsNamespace from 'pdfjs-dist';
import { writeFile } from 'fs/promises';
import { ResourceEntity } from './services/database/types';
import { DownloadController, DummyDownloadController } from './downloadController';

const { FileApiDriverLocal } = require('./file-api-driver-local');
const mimeUtils = require('./mime-utils.js').mime;
Expand Down Expand Up @@ -471,9 +472,10 @@ function shimInit(options: ShimInitOptions = null) {
}, options);
};

shim.fetchBlob = async function(url: any, options) {
shim.fetchBlob = async function(url: any, options: any, downloadController?: DownloadController) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The options parameter by convention should be the last one, so it means adding downloadController as an option

if (!options || !options.path) throw new Error('fetchBlob: target file path is missing');
if (!options.method) options.method = 'GET';
if (!downloadController) downloadController = new DummyDownloadController();
// if (!('maxRetry' in options)) options.maxRetry = 5;

// 21 maxRedirects is the default amount from follow-redirects library
Expand Down Expand Up @@ -549,6 +551,9 @@ function shimInit(options: ShimInitOptions = null) {
});

const request = http.request(requestOptions, (response: any) => {

response.on('data', downloadController.handleNewChunk(request));

response.pipe(file);

const isGzipped = response.headers['content-encoding'] === 'gzip';
Expand Down
2 changes: 1 addition & 1 deletion packages/lib/shim.ts
Original file line number Diff line number Diff line change
Expand Up @@ -273,7 +273,7 @@ const shim = {
throw new Error('Not implemented');
},

fetchBlob: function(_url: string, _options: any = null): any {
fetchBlob: function(_url: string, _options: any = null, _downloadLimiter: any = null): any {
throw new Error('Not implemented');
},

Expand Down