Speedup gem loading#8010
Open
byroot wants to merge 1 commit into
byroot
commented
Jun 9, 2026
    .idea
    benchmark/
    lib/linguist/samples.json
    lib/linguist/samples_data.rb
Author
NB: since the code is pretty-printed, it could make sense to commit it, with some tests to ensure it's up to date.
github-linguist is pretty slow to load, in large part because it
has to parse a fairly large JSON blob (~35ms on my machine).
Instead of shipping this data as JSON, we could ship it directly
as Ruby code. By default, the Ruby parser performs about as well
as JSON & Yajl:
```
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +YJIT +PRISM [arm64-darwin25]
Calculating -------------------------------------
json 27.363 (± 3.7%) i/s (36.55 ms/i) - 138.000 in 5.043304s
yajl 28.111 (± 3.6%) i/s (35.57 ms/i) - 142.000 in 5.051466s
load 27.719 (± 3.6%) i/s (36.08 ms/i) - 140.000 in 5.050659s
Comparison:
json: 27.4 i/s
yajl: 28.1 i/s - same-ish: difference falls within error
load: 27.7 i/s - same-ish: difference falls within error
```
But with Bootsnap, which many users can probably be assumed to have,
loading the data is over 4 times faster:
```
ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +YJIT +PRISM [arm64-darwin25]
Calculating -------------------------------------
json 28.350 (± 3.5%) i/s (35.27 ms/i) - 142.000 in 5.008777s
yajl 30.316 (± 3.3%) i/s (32.99 ms/i) - 152.000 in 5.013793s
load+bootsnap 128.326 (± 4.7%) i/s (7.79 ms/i) - 650.000 in 5.065217s
Comparison:
json: 28.4 i/s
load+bootsnap: 128.3 i/s - 4.53x faster
yajl: 30.3 i/s - same-ish: difference falls within error
```
This approach could be made even faster by directly generating Ruby code that
calls `Language.create` with the relevant arguments, but I decided to scope
this change to the smallest possible one to test the waters.
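As a rough illustration of the data-as-Ruby idea (not the code in this PR), the sketch below pretty-prints a hash as a Ruby literal, writes it to a file, and loads it back the same way the benchmark does. The sample hash, the `dump_as_ruby` helper, and the temporary path are all hypothetical; only the `DATA` constant name and the `load(path, mod)` call mirror the benchmark. Passing a module to `load` assumes Ruby 3.1 or later.

```ruby
require 'pp'
require 'stringio'
require 'tmpdir'

# Hypothetical stand-in for the real samples data.
data = {
  'tokens_total' => 3,
  'centroids' => { 'Ruby' => { 1 => 0.5, 2 => 0.25 } }
}

# Hypothetical helper: render a hash as Ruby source that assigns it to a constant.
def dump_as_ruby(data, const_name: 'DATA')
  out = StringIO.new
  out << "# frozen_string_literal: true\n"
  out << "#{const_name} = "
  PP.pp(data, out) # pretty-printed output is a valid Ruby literal for plain hashes
  out.string
end

source = dump_as_ruby(data)

# Write the generated source to a temporary file and load it under an
# anonymous module, so the constant does not leak into the global namespace.
loaded = Dir.mktmpdir do |dir|
  path = File.join(dir, 'samples_data.rb')
  File.write(path, source)
  mod = Module.new
  load(path, mod)
  mod::DATA
end
```

This only works for values whose `inspect` output is valid Ruby (strings, integers, finite floats, arrays, hashes), which is an assumption about the shape of the data.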
Benchmark:
```ruby
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'yajl-ruby'
  gem 'json'
  gem 'benchmark-ips'
  gem 'bootsnap'
end

require 'json'
require 'yajl'
require 'benchmark/ips'

# Bootsnap is only enabled when a cache directory is provided.
if ENV["BOOTSNAP_CACHE_DIR"]
  require 'bootsnap/setup'
end

Benchmark.ips do |x|
  x.report('json') { JSON.parse(File.read("lib/linguist/samples.json")) }
  x.report('yajl') { Yajl.load(File.read("lib/linguist/samples.json")) }
  x.report(ENV["BOOTSNAP_CACHE_DIR"] ? 'load+bootsnap' : 'load') do
    # Load under an anonymous module so each iteration re-evaluates the file.
    mod = Module.new
    load("lib/linguist/samples_data.rb", mod)
    mod::DATA
  end
  x.compare!(order: :baseline)
end
```
byroot
commented
Jun 9, 2026
Comment on lines -27 to -33

    serializer = defined?(Yajl) ? Yajl : JSON
    data = serializer.load(File.read(PATH, encoding: 'utf-8'))
    # JSON serialization does not allow integer keys, we fix them here
    for lang in data['centroids'].keys
      fixed = data['centroids'][lang].to_a.map { |k,v| [k.to_i, v] }
      data['centroids'][lang] = Hash[fixed]
    end
Author
This is no longer needed, and was about 5% of load time.
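For context, here is a minimal sketch of why the fix-up existed and why a Ruby literal makes it unnecessary. The `centroids` hash is a made-up example, and `eval` on `inspect` output merely stands in for loading a generated Ruby file.

```ruby
require 'json'

# Made-up example data with integer keys, like the centroids table.
centroids = { 'Ruby' => { 1 => 0.5, 2 => 0.25 } }

# JSON object keys are always strings, so integer keys come back as strings.
from_json = JSON.parse(JSON.generate(centroids))
json_keys = from_json['Ruby'].keys

# Equivalent of the removed fix-up: convert the keys back to integers.
fixed = from_json.transform_values { |counts| counts.transform_keys(&:to_i) }

# A Ruby literal keeps integer keys as-is, so no post-processing is required.
from_ruby = eval(centroids.inspect)
ruby_keys = from_ruby['Ruby'].keys
```

After the JSON round trip `json_keys` holds strings, while `ruby_keys` holds the original integers.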