Speedup gem loading #8010

Open
byroot wants to merge 1 commit into github-linguist:main from byroot:json-to-ruby

Conversation

@byroot byroot commented Jun 9, 2026

github-linguist is pretty slow to load, in large part because it has to parse a pretty large JSON blob (~35 ms on my machine).

Instead of shipping this data as JSON, we could ship it directly as Ruby code. By default, the Ruby parser performs about as well as JSON & Yajl:

ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +YJIT +PRISM [arm64-darwin25]
Calculating -------------------------------------
                json     27.363 (± 3.7%) i/s   (36.55 ms/i) -    138.000 in   5.043304s
                yajl     28.111 (± 3.6%) i/s   (35.57 ms/i) -    142.000 in   5.051466s
                load     27.719 (± 3.6%) i/s   (36.08 ms/i) -    140.000 in   5.050659s

Comparison:
json:       27.4 i/s
yajl:       28.1 i/s - same-ish: difference falls within error
load:       27.7 i/s - same-ish: difference falls within error

But with Bootsnap, which many users can probably be assumed to have enabled, loading the data is over 4 times faster:

ruby 4.0.5 (2026-05-20 revision 64336ffd0e) +YJIT +PRISM [arm64-darwin25]
Calculating -------------------------------------
                json     28.350 (± 3.5%) i/s   (35.27 ms/i) -    142.000 in   5.008777s
                yajl     30.316 (± 3.3%) i/s   (32.99 ms/i) -    152.000 in   5.013793s
       load+bootsnap    128.326 (± 4.7%) i/s    (7.79 ms/i) -    650.000 in   5.065217s

Comparison:
         json:       28.4 i/s
load+bootsnap:      128.3 i/s - 4.53x  faster
         yajl:       30.3 i/s - same-ish: difference falls within error

This approach could be made even faster by directly generating Ruby code that calls Language.create with the relevant arguments, but I decided to scope this change to the smallest possible one to test the waters.
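To make the idea concrete, here is a minimal sketch of the JSON-to-Ruby conversion, assuming the generated file defines a single `DATA` constant as in the benchmark below. The `dump_as_ruby` and `load_ruby_data` helpers and the tiny sample hash are hypothetical and only illustrate the technique; they are not the code in this PR.

```ruby
# frozen_string_literal: true

require 'pp'
require 'tmpdir'

# Hypothetical helper: render a plain Ruby structure (hashes, arrays,
# strings, numbers) as pretty-printed Ruby source that defines DATA.
def dump_as_ruby(data)
  "# frozen_string_literal: true\n\nDATA = #{data.pretty_inspect}"
end

# Hypothetical helper: evaluate the file inside an anonymous module so that
# DATA does not leak into the top-level namespace (needs Ruby 3.1+).
def load_ruby_data(path)
  mod = Module.new
  load(path, mod)
  mod::DATA
end

# Made-up sample; the real data set is far larger.
sample = { 'centroids' => { 'Ruby' => { 1 => 0.5, 2 => 0.25 } } }

Dir.mktmpdir do |dir|
  path = File.join(dir, 'samples_data.rb')
  File.write(path, dump_as_ruby(sample))
  loaded = load_ruby_data(path)
  puts loaded == sample
end
```

Because the file is ordinary Ruby source, a bytecode cache such as Bootsnap can skip parsing and compiling it on later loads, which is where the speedup in the second benchmark comes from.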

Benchmark

# frozen_string_literal: true

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'yajl-ruby'
  gem 'json'
  gem 'benchmark-ips'
  gem 'bootsnap'
end

require 'json'
require 'yajl'
require 'benchmark/ips'
if ENV["BOOTSNAP_CACHE_DIR"]
  require 'bootsnap/setup'
end

Benchmark.ips do |x|
  x.report('json') { JSON.parse(File.read("lib/linguist/samples.json"))}
  x.report('yajl') { Yajl.load(File.read("lib/linguist/samples.json"))}
  x.report(ENV["BOOTSNAP_CACHE_DIR"] ? 'load+bootsnap' : 'load') { mod = Module.new; load("lib/linguist/samples_data.rb", mod); mod::DATA }
  x.compare!(order: :baseline)
end

@byroot byroot requested a review from a team as a code owner June 9, 2026 11:45
Comment thread .gitignore
.idea
benchmark/
lib/linguist/samples.json
lib/linguist/samples_data.rb

Author

NB: since the code is pretty-printed, it could make sense to commit it, with some tests to ensure it's up to date.
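One way such a freshness test could look is sketched below: re-render the data and compare it with the committed file. `expected_source` and `up_to_date?` are hypothetical names, and the sample hash is invented for illustration; the real generator in the PR may differ.

```ruby
# frozen_string_literal: true

require 'pp'
require 'tmpdir'

# Hypothetical generator: the Ruby source we would expect to find on disk.
def expected_source(data)
  "DATA = #{data.pretty_inspect}"
end

# Hypothetical check: true when the committed file matches a fresh render.
def up_to_date?(path, data)
  File.exist?(path) && File.read(path) == expected_source(data)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'samples_data.rb')
  data = { 'extnames' => { 'Ruby' => ['.rb'] } }

  # Freshly generated file: the check passes.
  File.write(path, expected_source(data))
  puts up_to_date?(path, data)

  # The data changes but the file is not regenerated: the check fails.
  data['extnames']['Python'] = ['.py']
  puts up_to_date?(path, data)
end
```

Pretty-printed output keeps such a comparison, and the resulting diffs, readable when the data changes.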

Comment thread lib/linguist/samples.rb
Comment on lines -27 to -33
serializer = defined?(Yajl) ? Yajl : JSON
data = serializer.load(File.read(PATH, encoding: 'utf-8'))
# JSON serialization does not allow integer keys, we fix them here
for lang in data['centroids'].keys
  fixed = data['centroids'][lang].to_a.map { |k,v| [k.to_i, v] }
  data['centroids'][lang] = Hash[fixed]
end

Author

This is no longer needed, since a Ruby hash literal can carry integer keys directly, and it was about 5% of load time.
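The key conversion can be illustrated with a short sketch. The `centroids` hash here is a made-up miniature, not the real sample data, and `eval` on `inspect` output stands in for loading a generated Ruby file.

```ruby
# frozen_string_literal: true

require 'json'

centroids = { 'Ruby' => { 1 => 0.5, 2 => 0.25 } }

# A JSON round trip turns the integer keys into strings...
from_json = JSON.parse(JSON.generate(centroids))
puts from_json['Ruby'].keys.inspect

# ...so the removed code had to convert them back on every load.
fixed = from_json.transform_values { |h| h.transform_keys(&:to_i) }
puts fixed == centroids

# Ruby source keeps the integer keys as written, so no fix-up is needed.
from_ruby = eval(centroids.inspect)
puts from_ruby == centroids
```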
