fix: support new webpack chunk format for ondemand.s lookup #416

steverex169 wants to merge 1 commit into d60:main from
Conversation
Reviewer's Guide

Updates ondemand JavaScript asset discovery to handle Twitter/X's new webpack chunk mapping format while keeping backward compatibility with the previous inline hash format.

Class diagram for ClientTransaction get_indices hash resolution

classDiagram
class ClientTransaction {
+home_page_response
+get_indices(home_page_response, session, headers)
}
class RegexUtilities {
+ON_DEMAND_FILE_REGEX
+CHUNK_NAME_REGEX
+INDICES_REGEX
}
ClientTransaction ..> RegexUtilities : uses
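In Python, the regex collaborators named in the diagram map to module-level compiled patterns along these lines. The two patterns below are copied from the diff in this PR; INDICES_REGEX does not appear in the excerpt, so it is left out rather than guessed:

```python
import re

# Old inline format: 'ondemand.s': 'hash' (pattern as it appears in the PR diff)
ON_DEMAND_FILE_REGEX = re.compile(
    r"""['|\"]{1}ondemand\.s['|\"]{1}:\s*['|\"]{1}([\w]*)['|\"]{1}""",
    flags=(re.VERBOSE | re.MULTILINE),
)

# New webpack format: a numeric chunk ID maps to the chunk name,
# and a separate map holds the hash under the same chunk ID
CHUNK_NAME_REGEX = re.compile(r'(\d+):"ondemand\.s"')
```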
Flow diagram for ondemand.s hash resolution in get_indices

flowchart TD
A["Start get_indices"] --> B["Validate response and select home_page_response"]
B --> C["Convert response to string response_str"]
C --> D["Search response_str with ON_DEMAND_FILE_REGEX"]
D --> E{Old format match?}
E -->|Yes| F["Extract file_hash from on_demand_file.group(1)"]
F --> M["Build ondemand.s URL with file_hash"]
E -->|No| G["Search response_str with CHUNK_NAME_REGEX"]
G --> H{Chunk ID match?}
H -->|No| L["file_hash remains None"]
H -->|Yes| I["Extract chunk_id from chunk_id_match.group(1)"]
I --> J["Compile hash_pattern using chunk_id"]
J --> K["Iterate all hash_pattern matches in response_str"]
K --> N{Valid hash candidate?}
N -->|Yes| O["Set file_hash to candidate value"]
N -->|No| P["Continue iterating matches"]
P --> K
L --> Q{file_hash is set?}
O --> Q
F --> Q
Q -->|No| R["Abort: cannot resolve ondemand.s hash"]
Q -->|Yes| M
M --> S["GET ondemand.s file via session.request"]
S --> T["Extract key_byte_indices with INDICES_REGEX"]
T --> U["Return key_byte_indices"]
No actionable comments were generated in the recent review.
📝 Walkthrough

Updated webpack manifest parsing in the transaction module to handle a new response format. Added a chunk-name regex and fallback logic in

Changes

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning).
Twitter changed the x.com HTML structure from the old format:

    'ondemand.s': 'hash'

to a new webpack chunk map format:

    chunk_id:"ondemand.s"  (name map)
    chunk_id:"hash"        (separate hash map)

The old ON_DEMAND_FILE_REGEX no longer matches, causing "Couldn't get KEY_BYTE indices" on every API call. This fix detects both formats: it tries the old regex first, then falls back to extracting the chunk ID from the name map and resolving its hash from the separate hash map.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
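As a quick illustration of the two layouts (the chunk ID and hash below are invented, and the patterns are simplified versions of the PR's regexes), the old regex fires only on the first snippet, while the chunk-ID fallback is needed for the second:

```python
import re

# hypothetical snippets; real chunk IDs and hashes differ
old_html = "'ondemand.s':'a1b2c3d4e5f6'"
new_html = '{20113:"ondemand.s"} ... {20113:"a1b2c3d4e5f6"}'

ON_DEMAND_FILE_REGEX = re.compile(r"""['"]ondemand\.s['"]:\s*['"](\w*)['"]""")
CHUNK_NAME_REGEX = re.compile(r'(\d+):"ondemand\.s"')

assert ON_DEMAND_FILE_REGEX.search(old_html)       # old format: direct hit
assert not ON_DEMAND_FILE_REGEX.search(new_html)   # old regex misses new format

# new format: chunk ID from the name map, hash from the separate hash map;
# the name entry itself cannot match (\w+) because of the dot in "ondemand.s"
chunk_id = CHUNK_NAME_REGEX.search(new_html).group(1)
file_hash = re.search(rf'{chunk_id}:"(\w+)"', new_html).group(1)
```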
Force-pushed from f367875 to a7099a7 (Compare)
Hey - I've found 1 issue, and left some high level feedback:

- The new chunk/hash extraction logic relies on a fairly loose `hash_pattern` that will match any `{chunk_id:"..."}` pair; consider tightening this (e.g., via surrounding context or restricting the object scope) to reduce the risk of accidentally picking up unrelated values.
- The heuristic `val != 'ondemand' and len(val) <= 12` is somewhat opaque and fragile; extracting these constants into named variables or adding a small helper with a descriptive name would make the intent and constraints clearer and easier to adjust when Twitter changes formats again.
- The code currently recompiles `hash_pattern` on every call to `get_indices`; if this pattern is stable, precompiling it (or using a function that builds it once per `chunk_id`) would avoid repeated compilation and make the code more consistent with the other module-level regexes.
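One way to act on the third point, sketched here rather than taken from the project, is to memoize the compiled per-chunk pattern so repeated get_indices calls with the same chunk_id reuse it:

```python
import re
from functools import lru_cache


@lru_cache(maxsize=8)
def hash_pattern_for(chunk_id: str) -> re.Pattern:
    """Compile the hash-map lookup pattern once per chunk_id."""
    return re.compile(rf'{re.escape(chunk_id)}:"(\w+)"')
```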
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new chunk/hash extraction logic relies on a fairly loose `hash_pattern` that will match any `{chunk_id:"..."}` pair; consider tightening this (e.g., via surrounding context or restricting the object scope) to reduce the risk of accidentally picking up unrelated values.
- The heuristic `val != 'ondemand' and len(val) <= 12` is somewhat opaque and fragile; extracting these constants into named variables or adding a small helper with a descriptive name would make the intent and constraints clearer and easier to adjust when Twitter changes formats again.
- The code currently recompiles `hash_pattern` on every call to `get_indices`; if this pattern is stable, precompiling it (or using a function that builds it once per `chunk_id`) would avoid repeated compilation and make the code more consistent with the other module-level regexes.
## Individual Comments
### Comment 1
<location path="twikit/x_client_transaction/transaction.py" line_range="59-64" />
<code_context>
+ if chunk_id_match:
+ chunk_id = chunk_id_match.group(1)
+ hash_pattern = re.compile(rf'{chunk_id}:"([\w]+)"')
+ all_matches = list(hash_pattern.finditer(response_str))
+ file_hash = None
+ for m in all_matches:
+ val = m.group(1)
+ if val != 'ondemand' and len(val) <= 12:
+ file_hash = val
+ break
+ else:
</code_context>
<issue_to_address>
**suggestion (performance):** Collecting all matches into a list is unnecessary and slightly wasteful for large responses.
Because you only need the first matching `val` that satisfies `val != 'ondemand' and len(val) <= 12`, you can iterate directly over `hash_pattern.finditer(response_str)` and break on the first suitable match instead of building `all_matches` as a list. This avoids the intermediate list and reduces work/memory usage for large `response_str` values.
</issue_to_address>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@twikit/x_client_transaction/transaction.py`:
- Line 18: CHUNK_NAME_REGEX is too strict and only matches unquoted, no-space
forms like 20113:"ondemand.s"; update CHUNK_NAME_REGEX to allow optional
single/double quotes around the numeric key and the value and permit arbitrary
spacing around the colon (e.g. use a pattern like
r'["\']?(\d+)["\']?\s*:\s*["\']?ondemand\.s["\']?' as the new regex), and apply
the same tolerant regex update to the other similar regexes/usages referenced
around lines 55-58 so all key/value formatting variants (quoted keys, spaces)
are matched.
- Around line 61-64: The loop that assigns file_hash is rejecting candidates by
a hard-coded length check ("len(val) <= 12"), which can drop valid webpack chunk
hashes; remove that arbitrary constraint in the block that iterates over
all_matches (the for m in all_matches loop) and instead accept any
non-'ondemand' match (val != 'ondemand') or replace the check with a proper
validation (e.g., match against a hex/base62 regex or a configurable
max_hash_length) before assigning file_hash; update references to file_hash
accordingly so downstream logic performs definitive validation rather than
relying on the 12-character heuristic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1aafa861-b8d8-4e35-8df6-540f0cdd32d9
📒 Files selected for processing (1)
twikit/x_client_transaction/transaction.py
ON_DEMAND_FILE_REGEX = re.compile(
    r"""['|\"]{1}ondemand\.s['|\"]{1}:\s*['|\"]{1}([\w]*)['|\"]{1}""", flags=(re.VERBOSE | re.MULTILINE))
# New webpack format: chunk ID maps to name, separate hash map
CHUNK_NAME_REGEX = re.compile(r'(\d+):"ondemand\.s"')
Make chunk-ID regex tolerant to key/value formatting variants.
The current pattern only matches 20113:"ondemand.s" exactly. If the runtime emits quoted keys or spacing (e.g., "20113": "ondemand.s"), this will fail and break index resolution again.
Proposed robust pattern update
-CHUNK_NAME_REGEX = re.compile(r'(\d+):"ondemand\.s"')
+CHUNK_NAME_REGEX = re.compile(
+ r"""['"]?(\d+)['"]?\s*:\s*['"]ondemand\.s['"]"""
+)
...
- hash_pattern = re.compile(rf'{chunk_id}:"([\w]+)"')
+ hash_pattern = re.compile(
+ rf"""['"]?{re.escape(chunk_id)}['"]?\s*:\s*['"]([A-Za-z0-9]+)['"]"""
+    )

Also applies to: 55-58
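Under the proposed tolerant pattern, key/value variants with quotes and spacing all match (sample inputs invented):

```python
import re

# the tolerant pattern proposed in the review comment above
CHUNK_NAME_REGEX = re.compile(r"""['"]?(\d+)['"]?\s*:\s*['"]ondemand\.s['"]""")

for sample in ('20113:"ondemand.s"',      # current unquoted form
               '"20113": "ondemand.s"',   # quoted key with spacing
               "'20113' : 'ondemand.s'"): # single quotes
    m = CHUNK_NAME_REGEX.search(sample)
    assert m and m.group(1) == "20113"
```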
for m in all_matches:
    val = m.group(1)
    if val != 'ondemand' and len(val) <= 12:
        file_hash = val
Avoid hard-coding max hash length (<= 12) for candidate selection.
Webpack chunk hashes are not guaranteed to stay at or below 12 chars. This heuristic can silently reject valid hashes and reintroduce the "Couldn't get KEY_BYTE indices" failure.
Safer candidate filter
- if val != 'ondemand' and len(val) <= 12:
+ # prefer hex-like hash candidates; tolerate future length changes
+ if re.fullmatch(r"[0-9a-fA-F]{6,64}", val):
file_hash = val
break
Problem

Twitter changed the structure of x.com's HTML, breaking the `ClientTransaction.get_indices()` method for all users.

The old format was:

The current format uses a webpack chunk map split across two objects:

The existing `ON_DEMAND_FILE_REGEX` no longer matches, causing this error on every single API call:

Fix

- `CHUNK_NAME_REGEX` to extract the chunk ID from the name map

Testing

Verified locally against live x.com: list scraping, tweet fetching, and search all work correctly after the fix.

Summary by Sourcery

Bug Fixes:

Summary by CodeRabbit