Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add synchronous hooks proposal #198

Merged
merged 10 commits into from
Nov 5, 2024
331 changes: 331 additions & 0 deletions doc/design/proposal-synchronous-hooks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,331 @@
# Universal, synchronous and in-thread loader hooks

For background and motivation, see https://github.com/nodejs/node/issues/52219. Prototype implementation is in https://github.com/joyeecheung/node/tree/sync-hooks.

TL;DR: the top priority of this proposal is to allow sun-setting CJS loader monkey patching as non-breakingly as possible (which is why it needs to be synchronous and in-thread because that's how the CJS loader and `require()` works). Then it's API consistency with the existing `module.register()` off-thread hooks.
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved

Existing users of CJS loader monkey patching to look into:

- pirates (used by Babel and nyc)
- require-in-the-middle (used by many tracing agents)
- Yarn PnP
- tsx
- ts-node
- proxyquire, quibble, mockery

## API design

A high-level overview of the API.

```js
const { registerHooks } = require('module');

function resolve(specifier, context, nextResolve) {
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
const resolved = nextResolve(specifier, context);
resolved.url = resolved.url.replaceAll(/foo/g, 'bar');
return resolved;
}

function load(url, context, nextLoad) {
const loaded = nextLoad(specifier, context);
loaded.source = loaded.source.toString().replaceAll(/foo/g, 'bar');
return loaded;
}

const hook = registerHooks({ resolve, load });
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
hook.deregister(); // Calls hook[Symbol.dispose]()
```

* The name `registerHooks` and `deregister` can be changed.

## Hooks

### `resolve`: from specifier to url

```js
/**
* @typedef {{
* parentURL?: string,
* conditions?: string[],
* importAttributes?: Record<string, string>,
* }} ModuleResolveContext
*/
/**
* @typedef {{
* url: string,,
* format?: string,
* shortCircuit?: boolean
* }} ModuleResolveResult
*/
/**
* @param {string} specifier
* @param {ModuleResolveContext} context
* @param {(specifier: string, context: ModuleResolveContext) => ModuleResolveResult} nextResolve
* @returns {ModuleResolveResult}
*/
function resolve(specifier, context, nextResolve) {
if (shouldOverride(specifier, context.parentURL)) {
const url = customResolve(specifier, context.parentURL);
return { url, shortCircuit: true };
}

const resolved = nextResolve(specifier, context);
if (resolved.url.endsWith('.zip')) {
resolved.format = 'zip';
return resolved;
}
return resolved; // no override
}
```

Notes:

1. Example use case: yarn pnp ([esm hooks](https://github.com/yarnpkg/berry/blob/master/packages/yarnpkg-pnp/sources/esm-loader/hooks/resolve.ts), [cjs hooks](https://github.com/yarnpkg/berry/blob/master/packages/yarnpkg-pnp/sources/loader/applyPatch.ts))
2. `importAttributes` are only available when the module request is initiated with `import`.
3. For the CJS loader, `Module._cache` is keyed using file names, and it's used in the wild (usually in the form of `require.cache`) so changing it to key on URL would be breaking. We'll need to figure out another way to map it with an url.
GeoffreyBooth marked this conversation as resolved.
Show resolved Hide resolved
1. It might be non-breaking to maintain an additional map on the side for mapping the same url with different searches and hashes to a different module instance, or leave `Module._cache` as an alias map for the first module instance mapped by the URL minus search/hash. If in the same realm, `Module._cache` gets directly accessed from the outside, we can emit a warning about the incompatibility. Down the road this may require making `Module._cache` a proxy to propagate delete requests to the actual URL-based map.
2. Changing CJS modules to be mapped with URL will have performance consequences from but might also help with the hot module replacement use case. Note that for parent packages with `exports` or `imports` in their `package.json` the URL conversion is already done (and not even cached) even in the CJS loader.

### `load`: from url to source code

When `--experimental-network-import` is enabled, the default `load` step throws an error when encountering network imports. Although, users can implement blocking network imports by spawning a worker and use `Atomics.wait` to wait for the results. An example for doing blocking fetch synchronously can be found [here](https://github.com/addaleax/synchronous-worker?tab=readme-ov-file#how-can-i-avoid-using-this-package).

```js
/**
* @typedef {{
* format?: string,
* conditions?: string[],
* importAttributes?: object[]
* }} ModuleLoadContext
*/
/**
* @typedef {{
* format?: string,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format is required, if we’re copying the existing load.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this bit seems problematic if we want to delegate ESM detection to the default load - if hooks don't care about the format but simply want to override the loading (e.g. don't load from disk), they need to detect the format themselves, in the absence of context.format and call to nextLoad(). But that's a pre-existing problem.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With format detection, can we make format optional, since we will have a source to perform the check on?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively if we want to be explicit, perhaps context.detectFormat(source)?

Copy link
Member Author

@joyeecheung joyeecheung Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They not only need to know how to detect the format, but also need to know when the detection is needed (for example, if there is a --no-detect-format, they need to take care of it). At that point we might as well just make the format optional, and Node.js will decide whether format detection is needed after the entire load chain finishes, not during any particular load step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all unrelated. Basically currently there’s no way for hooks to both transform a file and then have Node run detection on it; that’s perhaps something we want to enable somehow, but how to do so is a separate concern that isn’t very related to moving the hooks to be sync. One obvious way that comes to mind is to just expose the containsModuleSyntax method as a helper, that hooks can run themselves to take advantage of Node’s built-in detection ability.

But anyway I stand by my comment, that for the moment this is required and there’s a good chance it will remain so.

* source: string | Buffer
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
* }} ModuleLoadResult
*/
/**
* @param {string} url
* @param {ModuleLoadContext} context
* @param {(context: ModuleLoadContext) => {ModuleLoadResult}} nextLoad
* @returns {ModuleLoadResult}
*/

// Mini TypeScript transpiler:
function load(url, context, nextLoad) {
const loaded = nextLoad(context);
const { source: rawSource, format } = loaded;
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
if (url.endsWith('.ts')) {
const transpiled = ts.transpileModule(rawSource, {
compilerOptions: { module: ts.ModuleKind.NodeNext }
});

loaded.format = 'commonjs';
loaded.source = transpiled.outputText;
return loaded;
}

return loaded;
}
```

Notes:

1. `context.format` is only present when the format is already determined by Node.js or a previous hook.
2. It seems useful for the default load to always return a buffer, or add an option to `context` for the default hook to load it as a buffer, in case the resolved file point to a binary file (e.g. a zip file, a wasm, an addon). For the ESM loader it's (almost?) always a buffer. For CJS loader some changes are needed to keep the content in a buffer.
3. Some changes may be needed in both the CJS and ESM loader to allow loading arbitrary format in a raw buffer in a way that plays well with the internal cache and format detection.
4. This may allow us to finally deprecate `Module.wrap` properly.
5. It may be useful to provide the computed extension in the context. An important use case is module format override (based on extensions?).

Example mock of pirates:

```js
function addHook(hook, options) {
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
function load(url, context, nextLoad) {
const loaded = nextLoad(url, context);
const index = url.lastIndexOf('.');
const ext = url.slice(index);
if (!options.exts.includes(ext)) {
return loaded;
}
const filename = fileURLToPath(url);
if (!options.matcher(filename)) {
return loaded;
}
loaded.source = hook(loaded.source, filename);
return loaded;
}

const hook = module.registerHooks({ load });

return function revert() {
hook.deregister();
};
}
```

## `exports` (require-only): from compiled CJS wrapper to exports object
Copy link
Member Author

@joyeecheung joyeecheung Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some experiments with this, and I think this can probably be repurposed as an "execute"/"evaluate" hook.

/**
 * @typedef {{
 *  thisValue: object,
 *  parameters: { exports: object, require: function, module: object, __filename: string, __dirname: string }
 * }} ModuleExecuteContext
 */
/**
 * @typedef {{
*   exports: object,
*   shortCircuit?: boolean
* }} ModuleExecuteResult
*/
function execute(url, context, nextExecute) {
  // ...
}

This would still work basically the same for the example mock of require-in-the-middle below, but it also allows users to override how the CJS wrapper gets invoked. This is still CJS-only, because in the case of ESM, V8 doesn't give us a way to hook into the evaluation of every module in the graph - we only get to control the evaluation of the root module.

In any case this might need a bit more thought, so I'll keep it off the initial PR.


This only runs for `require()` including `require(esm)`. It can only be meaningfully implemented for modules loaded by `require()`, since for ESM Node.js only gets to control the timing of the evaluation of the root module. The inner module evaluation is currently completely internal to V8. It may be possible to upstream a post-evaluation hook to V8, but calling from C++ to JS would also incur a non-trivial performance cost, especially if it needs to be done for every single module. Also, since the ESM namespace is specified to be immutable, what users can do after ESM evaluation is very limited - they cannot replace anything in the namespace or switch it to a different module. That's why the `link` hook was devised below to better work with the design of ESM (before linking completes it's possible to swap the module resolved by `import` to a different module).

Example mocking require-in-the-middle:

```js
/**
* @typedef {{
* format: string,
* source: string | Buffer,
* }} ModuleExportsContext
*/
/**
* @typedef {{
* exports: any
GeoffreyBooth marked this conversation as resolved.
Show resolved Hide resolved
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
* }} ModuleExportsResult
*/
/**
* @param {string} url
* @param {ModuleExportsContext} context
* @param {(context: ModuleExportsContext) => {ModuleExportsResult}} nextExports
* @returns {ModuleExportsResult}
*/
function exports(url, context, nextExports) {
let exported = nextExports(url, context);
const stats = getStats(url);
if (stats.name && modules.includes(stats.name)) {
const newExports = userCallback(originalExports, stats.name, stats.basedir);
return { exports: newExports }
}
return {
exports: originalExports
};
}
```

If the module loaded is CJS (`context.format` is `"commonjs"`), If the default step is run via `nextExports` run but the `exports` returned later is not reference equal to the original exports object, it will only affect later `module.exports` access in the original module, but does not affect contextual `exports.foo` accesses within the module. There are two options to address this:

1. We allow the default `exports` step to take an optional `context.exports`. If it's configured, it will be used during the wrapper invocation, which unifies the contextual access. Hooks will need to use a proxy if they want the new exports object to be based on the result from default evaluation (since by the time the new overridden exports object is created, the module hasn't been evaluated yet).
2. We expose the compiled CJS wrapper function in `context` so that users get to invoke it themselves and skip the default `exports` step.

1 may provide better encapsulation, because with 2 would mean that changes to the CJS wrapper function parameters becomes breaking.

If the module loaded is ESM (`context.format` is `"module"`) all the direct modification to `exports` are no-ops because ESM namespaces are not mutable. Returning a new `exports` for `require(esm)` only affects the caller of `require(esm)`, unless the hook returns a proxy to connect the changes between the new exports object and the module namespace object.

## `link` (import-only): from source code to compiled module records
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved

This needs to work with experimental vm modules for passing module instances around.

Example mock of import-in-the-middle:

```js
/**
* @typedef {{
* specifier: string,
* source: string | Buffer,
* }} ModuleLinkContext
*/
/**
* @typedef {{
* module: vm.Module
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
* }} ModuleLinkResult
*/
/**
* @param {string} url
* @param {ModuleLinkContext} context
* @param {(context: ModuleLinkContext) => {ModuleLinkResult}} nextLink
* @returns {ModuleLinkResult}
*/
function link(url, context, nextLink) {
const linked = nextLink(url, context);
if (!shouldOverride(url, context.specifier)) {
return linked;
}
assert.strictEqual(linked.module.status, 'linked'); // Original module is linked at this point
let source = `import * as original from 'original';`;
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
source += `import { userCallback, name, basedir } from 'util'`;
source += `const exported = {}`;
for (const key of linked.module.namespace) {
source += `let $${key} = original.${key};`;
source += `export { $${key} as ${key} }`;
source += `Object.defineProperty(exported, '${key}', { get() { return $${key}; }, set (value) { $${key} = value; }});`;
}
source += `userCallback(exported, name, basedir);`;
// This is not yet implemented but should be trivial to implement.
linked.module = new vm.SourceTextModuleSync(source, function linkSync(specifier) {
if (specifier === 'original') return linked;
// Contains a synthetic module with userCallback, name & basedir computed from url
if (specifier === 'util') return util;
});
return linked;
}
```

## Invocation order of `resolve` and `load` in ESM

With synchronous hooks, when overriding loading in a way may take a significant amount of time (for example, involving network access), the module requests are processed sequentially, losing the ability to make processing concurrent with a thread pool or event loop, which was what the design of ESM took care to make possible thanks to the `import` syntax. One trick can be used to start concurrent loading requests in `resolve` before blocking for the results in the `load` hook.
GeoffreyBooth marked this conversation as resolved.
Show resolved Hide resolved

```js
function resolve(specifier, context, nextResolve) {
const resolved = nextResolve(specifier, context);
if (resolved.url.startsWith('http')) {
// This adds a fetch request to a pool
const syncResponse = startFetch(resolved.url);
responseMap.set(url, syncResponse);
}
return resolved;
}

function load(url, context, nextLoad) {
const syncResponse = responseMap.get(url);
if (syncResponse) {
const source = syncResponse.wait();
return { source, shortCircuit: true };
}
return nextLoad(url, context);
}
```

However, to reuse `resolve` for concurrent loading, we need to implement it to be run in BFS order - that is, for a module like this:

```js
import 'a';
import 'b';
import 'c';
```

Instead of running the hooks as:

1. `resolve(a)` -> `load(a)` -> `link(a)`
2. `resolve(b)` -> `load(b)` -> `link(a)`
3. `resolve(c)` -> `load(c)` -> `link(a)`

We need to run the hook as:

1. `resolve(a)` -> `resolve(b)` -> `resolve(c)` (starts the off-thread fetching requests concurrently)
2. `load(a)` -> `load(b)` -> `load(c)` (block on the concurrently fetched results)
3. `link(a)` -> `link(b)` -> `link(c)`

Otherwise, we need to invent another hook that gets run in BFS order before both `resolve` and `load`, but in that case, the hook could only get the specifiers in its arguments, which may or may not be a problem. For network imports it's okay, because the specifiers are supposed to be URLs already, but other forms of concurrent processing (e.g. parallel transpilation with a worker pool) may not be possible without the result of `resolve`.

On the other hand, for CJS we have to run the hooks in DFS order since that's how CJS module requests have to be resolved.

## Child worker/process hook inheritance

Unlike the old `module.register()` design, this new API takes an object with functions so that it's possible to migrate existing monkey-patching packages over. One reason for the `path`/`url` parameter in the `module.register()` and the `initialize` was to aid hook inheritance in child worker and processes. With the new API, we can compose it differently by adding a new API specifically for registering a script that should be preloaded by child workers or processes. Hook authors can either put the call together with the hook registration (so that the inheritance is recursive automatically for grandchildren etc.), or put the hook registration code in a different file and configure a preload of that file.

```js
process.preloadInChildren(__filename); // Or `import.meta.filename`
joyeecheung marked this conversation as resolved.
Show resolved Hide resolved
module.registerHooks({ load });
GeoffreyBooth marked this conversation as resolved.
Show resolved Hide resolved
```

If the hook doesn't want it to be inherited in child workers or processes, it can opt out by not calling the preload API, or only call it under certain conditions.

## Additional use cases to consider

### Source map support

In most cases, source maps for Node.js apps are consumed externally (e.g. in a client to the inspector protocol) instead of in the Node.js instance that loads the modules, with the notable exception being error stack trace translation, enabled via `--enable-source-maps`. In this case, Node.js loads the source maps from disk using the URLs from the magic comments (currently, some of these are extracted using regular expressions, some are parsed by V8. It was discussed to rely on V8 instead of parsing them ourselves). To allow hooks to customize this behavior, we would need a hook that runs after module parsing and ideally before evaluation (otherwise, if an error is thrown during evaluation, the hook won't be able to run in time to affect the stack traces of the error being thrown). This can be done in the `link` or `exports` hooks proposed above if they take an optional property for source maps in the hook result, and the implementation can put the overriden sourcemap into the internal cache for later source position translations.

### Cache manipulation

It's common for users of existing CJS hooks to wipe the `require.cache` in their tests. Some tools may also update the `require.cache` for use cases like hot module replacement. This should probably be invented as a separate API to manipulate the cache (both the CJS and the ESM one).