
refactor!(splitter): abstract io deps #197

Open
ShaMan123 wants to merge 11 commits into ThatOpen:main from ShaMan123:refactor/splitter#180

Conversation

@ShaMan123
Contributor

@ShaMan123 ShaMan123 commented Apr 30, 2026

Description

This PR is the first step outlined by #180 (comment)
It replaces the path/fs dependency interface with a lightweight, specialized io interface based on ReadableStream/WritableStream.
The goal is to create an agnostic API that can run in both node and the browser,
hence the usage of ReadableStream/WritableStream and EventTarget.

All console logs have been removed:

  • console.error now throws.
  • console.warn emits a warning event.
  • console.time emits a progress event.
  • There is another data event that is emitted once all groups have been constructed, see IfcSplitterEvents.
  • The rest have been removed.

Standalone functions have been scoped under IfcSplitter for ease of use and overall code quality.

BREAKING changes

This change is breaking:
split is now async and its signature has changed to align with environment-agnostic requirements.
extract is now async.

Additional context


What is the purpose of this pull request?

  • Bug fix
  • New Feature
  • Documentation update
  • Other

Before submitting the PR, please make sure you do the following:

  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Follow the Conventional Commits v1.0.0 standard for PR naming (e.g. feat(examples): add hello-world example).
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. fixes #123).
  • Ideally, include relevant tests that fail without this PR but pass with it.

@ShaMan123 ShaMan123 changed the title refactor(splitter): abstract io deps refactor(splitter)!: abstract io deps Apr 30, 2026
@ShaMan123 ShaMan123 force-pushed the refactor/splitter#180 branch 2 times, most recently from 0652ae8 to cc77210 Compare April 30, 2026 14:08
@ShaMan123 ShaMan123 changed the title refactor(splitter)!: abstract io deps refactor!(splitter): abstract io deps Apr 30, 2026
@ShaMan123 ShaMan123 force-pushed the refactor/splitter#180 branch from 3a3cda4 to 92414d6 Compare April 30, 2026 14:39
@ShaMan123 ShaMan123 force-pushed the refactor/splitter#180 branch from 92414d6 to eb4073f Compare May 1, 2026 02:47
Contributor Author

@ShaMan123 ShaMan123 left a comment
This PR is ready for review, @agviegas.

A few things:

  • extract and split differ in logging. split is more verbose. I wish to unify them. Which do you prefer? Search for occurrences of IfcSplitter#mark.
  • I think we should extract forEachLine to a superclass IfcStream; perhaps we can offer an abstraction for writeLine as well. It is a useful util.
  • I wish to remove bitwise operations from the code. It is an unnecessary complication with no real gain in this case.
  • I wish to add unit tests. How do I do that? I see jest as a dep but it doesn't seem to be used. I would suggest dropping it in favor of vitest. A simple snapshot test of the data event would be very valuable for us now that we plan to change behavior. This can be done in a follow-up, since I have already run a snapshot test locally comparing both the data event (groupsData) and the split file contents.
  • Should we extract all helpers to a standalone util file?
  • Warning events currently fire only from extract. I could type it so that is clear.
  • I scoped emitSplitLine and emitExtractLine under IfcSplitter, but after completing the work I see it is not a must.

What do you think?

let tail = "";
const readableStream = await this.io.readableStream(filePath);

for await (const chunk of streamAsyncIterator(readableStream)) {
Contributor Author

Suggested change
for await (const chunk of streamAsyncIterator(readableStream)) {
for await (const chunk of readableStream) {

This should be sufficient though TS throws:

 error TS2504: Type 'ReadableStream<any>' must have a '[Symbol.asyncIterator]()' method that returns an async iterator.

I am not sure why TS throws; maybe it is the lib/target baseline (though async iteration of ReadableStream seems widely adopted). Updating @types/node made no difference.
So this is good enough IMO.
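For reference, such a shim can be a plain reader loop; this is a generic sketch, not necessarily the PR's actual streamAsyncIterator:

```typescript
// Minimal async-iterator shim for ReadableStream, useful where the
// TS lib / @types/node baseline doesn't expose Symbol.asyncIterator
// on the stream type. Generic sketch, not the PR's exact helper.
async function* streamAsyncIterator<T>(stream: ReadableStream<T>): AsyncGenerator<T> {
  const reader = stream.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      yield value as T;
    }
  } finally {
    reader.releaseLock();
  }
}
```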

const groupElementIds = groups[g];
if (groupElementIds.size === 0) {
groupsData.push(null);
console.log(` Group ${g + 1}: SKIPPED (empty)`);
Contributor Author

Should this become a warning?

Comment on lines -1181 to 1185
if (groupElementIds.size > requestedIds.size) {
console.log(
` Expanded to ${groupElementIds.size} elements (void/fill + aggregation coupling)`,
);
return { header, footer, index };
}
Contributor Author

Should this become a warning?
I decided to return the extracted ids instead.

return this.specialRaws.get(id);
}

getAll(types: Set<string>) {
Contributor Author

Future-proofing by passing types.

includeSet: Set<number>,
rewrittenLines: Map<number, string>,
): Promise<void> {
if (raw.charCodeAt(0) !== 35) return; // '#'
Contributor Author

For readability should I:

Suggested change
if (raw.charCodeAt(0) !== 35) return; // '#'
if (raw.charCodeAt(0) !== '#'.charCodeAt(0)) return; // '#'

Or even better

Suggested change
if (raw.charCodeAt(0) !== 35) return; // '#'
if (!raw.startsWith('#')) return;

There are a few occurrences. Machine code 👎

if (groupSizes[g] < groupSizes[minIdx]) minIdx = g;

this.emitProgressEvent("resolve");
this.dispatch("data", groupsData);
Contributor Author

Naming is not great since this event is only dispatched from split.

Suggested change
this.dispatch("data", groupsData);
this.dispatch("splits", groupsData);

@ShaMan123 ShaMan123 marked this pull request as ready for review May 1, 2026 03:41
@ShaMan123
Contributor Author

ShaMan123 commented May 1, 2026

Expanding on IfcStream
I wish to refactor forEachLine into a stream transformer so that anyone could write:

const stream = (await openAsBlob(path, { type: "text/plain" }))
    .stream()
    .pipeThrough(new IfcDecoderStream());

for await (const line of stream) {
  // actual raw line
}

Claude said:

class IfcDecoderStream extends TransformStream<Uint8Array, string> {
  constructor(encoding = "utf-8") {
    let tail = "";
    const decoder = new TextDecoder(encoding);

    super({
      transform(chunk, controller) {
        const text = decoder.decode(chunk, { stream: true });
        if (!text) return;
        let start = 0;
        let idx = text.indexOf("\n");

        if (idx !== -1) {
          let end = idx;
          if (end > 0 && text.charCodeAt(end - 1) === 13) end--;
          controller.enqueue(
            tail ? tail + text.substring(start, end) : text.substring(start, end),
          );
          tail = "";
          start = idx + 1;
        } else {
          tail += text;
          return;
        }

        while ((idx = text.indexOf("\n", start)) !== -1) {
          let end = idx;
          if (end > start && text.charCodeAt(end - 1) === 13) end--;
          controller.enqueue(text.substring(start, end));
          start = idx + 1;
        }

        if (start < text.length) tail = text.substring(start);
      },

      flush(controller) {
        const remaining = decoder.decode();
        const full = tail + remaining;
        if (full) controller.enqueue(full);
      },
    });
  }
}

We could even enqueue a parsed line instead of only raw text.

@ShaMan123 ShaMan123 force-pushed the refactor/splitter#180 branch from bab6690 to fc382e2 Compare May 1, 2026 05:58
@ShaMan123
Contributor Author

I really like the stream transformer.
Should I commit build changes?

@ShaMan123
Contributor Author

I see I missed this spec

2. Lift logging out of the deps object entirely and onto an EventEmitter-style API on the splitter itself: splitter.onProgress.add(...), splitter.onItemProcessed.add(...). That would match how the rest of the engine surfaces lifecycle events (Hoverer, Highlighter, Views, etc.) and avoids the "is this a console or a logger" awkwardness.

Awaiting your review before acting

@ShaMan123
Contributor Author

ShaMan123 commented May 1, 2026

Could we use web-ifc classes to parse lines?
I am thinking of another stream transformer on top of the raw one that accepts the raw line and returns a web-ifc class.
I will definitely use it.

      const stream = (await openAsBlob(path, { type: "text/plain" }))
        .stream()
        .pipeThrough(new IfcDecoderStream())
        .pipeThrough(new IfcParserStream());

      for await (const ifcEntity of stream) {
        // parsed line entity, flat
      }

@ShaMan123 ShaMan123 force-pushed the refactor/splitter#180 branch from fc382e2 to 3dde676 Compare May 2, 2026 05:35
@ShaMan123
Contributor Author

ShaMan123#1

I want to merge that on top of this. Should we first merge this and then open a dedicated PR?

@agviegas
Contributor

agviegas commented May 3, 2026

Direction is great. Stripping the Node fs/path injection in favor of web streams unlocks the browser, and the EventTarget class is a much better story than scattered console calls. Few things I'd want sorted before this lands.

Bugs that need fixing

  1. performance.measure(stage) with one argument doesn't do what you think. The spec says one-arg measure runs from navigationStart to now, not from the matching mark. So timeElapsed is wall-clock since the page loaded, not the stage duration. Want performance.measure(stage, stage) or performance.now() deltas.

    Worth clearing the marks and measures after each emit too, otherwise the buffer just grows on repeated calls.

  2. The final close loop in writeSplitOutput drops its inner promises:

    await Promise.all(
      writers.map(async (writer) => {
        if (!writer) return;
        writer.write(footerStr);
        writer.close();
      }),
    );

    The inner async function returns instantly, so Promise.all resolves before the bytes flush. Callers see truncated files. Both write and close need an await.

  3. extract no longer creates the output directory. Old code did mkdirSync({ recursive: true }); the new path calls open(path, "w") straight away and fails if the dir's missing. Either bake the mkdir into the Node writableStream impl, or document it loud and clear.
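The loop from point 2 with the missing awaits restored could look like this sketch (the writer type is illustrative, standing in for a WritableStreamDefaultWriter-like object):

```typescript
// Sketch: awaiting both write and close so Promise.all only resolves
// after every writer has flushed. The writer shape is illustrative.
async function closeWriters(
  writers: Array<{ write(s: string): Promise<void>; close(): Promise<void> } | null>,
  footerStr: string,
): Promise<void> {
  await Promise.all(
    writers.map(async (writer) => {
      if (!writer) return;
      await writer.write(footerStr);
      await writer.close();
    }),
  );
}
```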

Worth a second look

  1. extract skips mark("parse") before parseIfc. split has it. Easy inconsistency to fix by moving the mark inside parseIfc.

  2. The IO interface is asymmetric: ReadableStream<Uint8Array> for reads, WritableStream<string> for writes. Node's Writable.toWeb happens to accept strings, but that's not portable across all WritableStream implementations. Either take Uint8Array on both sides and pipe through TextEncoderStream inside the splitter, or encode before writing. Mirrors the read path's "pipe through IfcDecoderStream" pattern nicely.

  3. Writable.toWeb is still flagged experimental in Node. Worth setting an engines.node floor, or noting the minimum version in the README.
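The "pipe through TextEncoderStream inside the splitter" option from point 2 could be sketched like this (stringWriterOver is a hypothetical helper name, not part of the PR):

```typescript
// Sketch: keep the splitter's internal writes as strings but expose
// WritableStream<Uint8Array> at the IO boundary by piping through
// TextEncoderStream. stringWriterOver is a hypothetical name.
function stringWriterOver(byteSink: WritableStream<Uint8Array>): WritableStream<string> {
  const encoder = new TextEncoderStream();
  // The encoder's readable side feeds the byte sink; callers write strings.
  encoder.readable.pipeTo(byteSink);
  return encoder.writable;
}
```

This mirrors the read path's IfcDecoderStream pattern: bytes at the IO boundary, strings inside.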

Polish

  1. IfcDecoderStream swapped charCodeAt(i) === 13 for charAt(i) === "\r". The old version was numeric and free; the new one allocates a one-char string per check. In the hottest loop in the splitter, that adds up:

    if (end > 0 && text.charCodeAt(end - 1) === 13) end--;
  2. data is awfully generic as an event name. groups or groupsResolved would age better.

  3. Per-line await inside forEachLine adds a microtask hop per line vs the old sync loop. Probably unavoidable now that reading is streamed, but worth measuring on a big IFC.

Nits

  1. JSDoc on IfcDecoderStream has an empty @example block followed by @example node. Second tag isn't standard JSDoc and won't render cleanly in TypeDoc.
  2. IfcSplitterIO doesn't say what should happen if the path doesn't exist. Worth a sentence on the interface.
  3. CHANGELOG mentions the async/signature breaks but not the "no more console.*, listen to events" behavior shift, which is also user-visible.

Bugs 1, 2, 3 are blocking from my side. The rest are suggestions. Let me know when you push the next round and I'll take another pass.

@agviegas
Contributor

agviegas commented May 3, 2026

Sorry, missed your questions on the first pass. Going through them:

Logging unification. With the event API the verbosity isn't really logging anymore. Apps subscribe to whatever they care about. So I'd unify by having both split and extract emit progress for whichever stages they actually run, using the same stage names where they overlap. Kill the asymmetry. The "one is verbose, one isn't" feel is a leftover from the console.* era.

IfcStream super class for forEachLine / writeLine. Makes sense as soon as there's a second consumer. If the splitter is the only one today, I'd defer until something else shows up that wants the same abstraction. Easy to extract later, hard to undo if we abstract too early.

Remove bitwise ops. I'd push back. The Uint32Array bitmask is doing real O(1) membership checks in the write loop across up to 32 groups. Replacing with Set<number>[] per group costs N hash lookups per line and bumps memory. Worth keeping unless we benchmark and find the difference is in the noise. The 32-group ceiling is a separate thing to document.
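For context, the bitmask membership check being defended is roughly this (a sketch of the idea, not the splitter's exact code):

```typescript
// Sketch of the O(1) membership idea: one Uint32Array slot per line,
// one bit per group (hence the 32-group ceiling mentioned above).
function makeMask(lineCount: number): Uint32Array {
  return new Uint32Array(lineCount);
}
function addToGroup(mask: Uint32Array, line: number, group: number): void {
  mask[line] |= 1 << group;
}
function inGroup(mask: Uint32Array, line: number, group: number): boolean {
  return (mask[line] & (1 << group)) !== 0;
}
```

A Set<number>[] per group would replace each bit test in the write loop with a hash lookup per group.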

Tests, drop jest for vitest. Yes. Vitest is the right pick today. A snapshot of the data event plus content snapshots of the output files (which you've already got locally) is exactly what we want. Follow-up PR is fine.

Extract helpers to a standalone util file. Yes. The pure functions (extractRefs, splitIfcArgs, extractArgsString, etc.) don't need to live in the splitter. An ifc-parsing-utils.ts next to ifc-stream.ts is a good home.

Warning event typing. A discriminated union or just a typed comment that says "fires from extract when an id is missing or wrong type" is enough for now.

emit*Line scoped under IfcSplitter. Doesn't matter much. If they touch this (events / IO), keep them as methods. If they don't, hoist. Not worth a long discussion either way.

Stream transformer expansion. Love what you've got. IfcDecoderStream is exactly the right shape. I'd hold off on IfcParserStream backed by web-ifc for a follow-up PR: pulling web-ifc into the splitter's parse path is a bigger architectural decision and deserves its own thread.

Build changes commit. No on dist artifacts. Yes on tooling configs (vitest, lint, etc.).

EventTarget vs Event<T>. The rest of the engine uses Event<T> (Hoverer, Highlighter, Views). For consistency I'd flip to that: splitter.onProgress.add(...), splitter.onWarning.add(...), splitter.onGroupsResolved.add(...). Native EventTarget is web-standard but reads inconsistent next to the rest of the codebase. Not blocking, but a sharp diff.
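A minimal Event<T> in that add/trigger style might look like this (a sketch; the engine's actual Event<T> implementation may differ):

```typescript
// Sketch of an Event<T> with .add/.remove/.trigger, mirroring the
// splitter.onProgress.add(...) usage suggested above. Not the
// engine's actual implementation.
class Event<T> {
  private handlers = new Set<(payload: T) => void>();

  add(handler: (payload: T) => void): void {
    this.handlers.add(handler);
  }

  remove(handler: (payload: T) => void): void {
    this.handlers.delete(handler);
  }

  trigger(payload: T): void {
    for (const handler of this.handlers) handler(payload);
  }
}
```

The splitter would then expose fields like readonly onProgress = new Event<...>() and call this.onProgress.trigger(...) internally instead of dispatchEvent.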

web-ifc parser stream. Cool idea, but its own PR. Pinning the splitter to web-ifc's class shapes is worth a deliberate decision.

Merge order. This one first, your next as a dedicated PR. Single-purpose merges.

What do you think?

@ShaMan123
Contributor Author

I agree with everything.
Writable.toWeb is great, but there are backward-compat polyfills if needed. Whatever you think is best; I am not a fan of backward-compat artifacts.

The final close loop in writeSplitOutput drops its inner promises:

Lint should catch this kind of error.

5. The IO interface is asymmetric: ReadableStream<Uint8Array> for reads, WritableStream<string> for writes. Node's Writable.toWeb happens to accept strings, but that's not portable across all WritableStream implementations. Either take Uint8Array on both sides and pipe through TextEncoderStream inside the splitter, or encode before writing. Mirrors the read path's "pipe through IfcDecoderStream" pattern nicely.

What is better? Uint8Array or string? I guess Uint8Array to abstract handling from the consumer. Will do.

@ShaMan123
Contributor Author

ShaMan123 commented May 3, 2026

What is better? Uint8Array or string? I guess Uint8Array to abstract handling from the consumer. Will do.

Let's think about this. What use cases will the splitter cater for? Will a consumer want to run another stream transformer after the splitter?
If so, we might want to stick to string.

@ShaMan123
Contributor Author

ShaMan123 commented May 3, 2026

I am rethinking:

export interface IfcSplitterIO {
  /**
   * Streams ifc lines
   * @param path
   * @throws if {@link path} doesn't exist
   */
  readableStream(path: string): Promise<ReadableStream<string>>;

  /**
   * Streams ifc lines
   * @param path
   */
  writableStream(path: string): Promise<WritableStream<string>>;
}

Maybe on web we won't use the path arg. I think I will try to move it up to the caller.

export interface IfcSplitterIO {
  /**
   * Streams ifc lines
   * @throws if the source doesn't exist
   */
  readableStream(): Promise<ReadableStream<string>>;

  /**
   * Streams ifc lines
   * @param groupId
   */
  writableStream(groupId: number): Promise<WritableStream<string>>;
}

@ShaMan123
Contributor Author

ShaMan123 commented May 3, 2026

I was toying with an idea to stream the ifc to IndexedDB, setting each localId/line as a key/value, before you suggested OPFS.
Not sure if it is viable, but it would be an argument for a string stream. I was also considering it because IndexedDB has built-in indexing that could replace web-ifc queries.
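That idea could be prototyped as a WritableStream sink that stores each entity line keyed by its local id; here the put callback stands in for an IndexedDB objectStore.put in the browser, and all names are hypothetical:

```typescript
// Sketch: sink that stores each line keyed by its local id ("#123=...").
// `put` abstracts the storage; in a browser it could wrap an IndexedDB
// objectStore.put(line, localId). Names and parsing are illustrative.
function lineStoreSink(
  put: (localId: number, line: string) => void | Promise<void>,
): WritableStream<string> {
  return new WritableStream<string>({
    async write(line) {
      const match = /^#(\d+)\s*=/.exec(line);
      if (!match) return; // skip header/footer lines
      await put(Number(match[1]), line);
    },
  });
}
```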

@ShaMan123
Contributor Author

Regarding mkdir, I am not sure. What do you prefer/think?
It is easy enough:

export class IfcSplitterNode extends IfcSplitter {
  constructor() {
    super({
      // ...
      writableStream: async (path) => {
+       await mkdir(dirname(path), { recursive: true });
        const fileHandle = await open(path, "w");
        const nodeWritable = fileHandle.createWriteStream();
        return Writable.toWeb(nodeWritable) as WritableStream<string>;
      },
    });
  }
}

ShaMan123 and others added 2 commits May 3, 2026 17:21
Co-authored-by: Copilot <copilot@github.com>
@ShaMan123
Contributor Author

ShaMan123 commented May 3, 2026

9. Per-line await inside forEachLine adds a microtask hop per line vs the old sync loop. Probably unavoidable now that reading is streamed, but worth measuring on a big IFC.

forEachLine has degraded (145.8 MB ifc file):

pass              sync   async
1 (index)          805    1785
2 (write splits)  1505    5120

Contributor Author

@ShaMan123 ShaMan123 left a comment

  • extracted emit*Line to top level
  • refactored events
  • fix + standardize progress events: extract fires resolve before relations. Not sure what to do, please take a look. The code is very fragile so I don't want to move things around without tests.

ready for another review

Comment on lines 1005 to +1041
@@ -1057,11 +1038,12 @@ export class IfcSplitter {
}
}
}
this.emitProgressEvent("relations", relationsStart);
Contributor Author

wrong order
