rubygem index scaling evolution

Evolution of Rubygem index from Marshal.4.8.gz, specs.4.8.gz, latest_specs.4.8.gz, Bundler API to Compact Index

Summary

The Ruby programming language has a rich ecosystem of software libraries, known as Rubygems, which can be reused to significantly reduce development time.

Ruby software library management tools like ‘gem’ and ‘bundler’ make using Rubygems simple; they take care of your Rubygems' dependencies without you having to lift a finger. Behind the scenes, however, these tools do a lot of heavy lifting and rely on Rubygem repository indexes to understand:

  • What Rubygems are available in a repository

  • What versions of a specific Rubygem are available in the repository

  • The dependencies of each Rubygem version

Keeping the interaction between Rubygem management tools and Rubygem repositories fast is critical for swift software development cycles and deployments. This becomes an interesting challenge over time as the number of Rubygems in each repository, and the number of Rubygems that each piece of software depends on, increases.

This need has driven the evolution of the Rubygem index:

  • Marshal.4.8.gz

  • specs.4.8.gz

  • latest_specs.4.8.gz

  • Bundler API

  • Compact index

This article shares some insights into the Rubygem index evolution.

Packagecloud supports all of these, with Compact Index support launched on 30 Aug 2023. If you are using 'bundler', you will see a significant improvement when you run `bundle install` for Rubygems stored in Packagecloud. Don't just take our word for it! See it for yourself by signing up for an account.

Rubygem Dependencies

 

If you use Ruby, then you have used `gem install <rubygem>` or `bundle install` (both classified as Rubygem management tools) to install Ruby software packages / artifacts (known as Rubygems), which are reusable building blocks for Ruby software development. The official Rubygem repository is https://rubygems.org.

 

Each Rubygem may, in turn, be composed of (thus depend on) other Rubygems, making the Rubygem ecosystem an intricate web of acyclic dependencies.

Thus the somewhat innocuous-looking `gem install <rubygem>` or `bundle install` triggers a huge amount of work behind the scenes to ‘resolve’ the dependencies required not only by each Rubygem you want to install, but also by the Rubygems already on your system. Resolving dependencies means the Rubygem management tool recursively finds the dependencies required by the Rubygems in these two sets. For every dependency uncovered, the tool chooses the specific version you specified, if any, or otherwise the highest version that does not conflict with any other Rubygem's dependency requirements. If no such version exists, you get a dependency error message.

To resolve dependencies, the Rubygem management tool communicates with Rubygem repositories to retrieve the index generated by each repository, which contains various information about the Rubygems they store:

  • The Rubygem names

  • The available versions of each specific Rubygem

  • Each Rubygem version’s dependency requirements
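The resolution process described above can be sketched in a few lines of Ruby. This is a simplified illustration and not the algorithm that 'gem' or 'bundler' actually use: the `INDEX` hash, the gem names in it, and the `resolve` helper are all hypothetical, and the sketch simply re-picks versions until the set of requirements stops changing, whereas real resolvers backtrack. Only `Gem::Version` and `Gem::Requirement` come from the Rubygems library.

```ruby
require "rubygems" # Gem::Version and Gem::Requirement ship with Ruby

# Hypothetical in-memory index: name => { version => { dependency => requirement } }
INDEX = {
  "app-helper" => { "1.0.0" => { "rack" => ">= 1.0", "json-lite" => "~> 2.0" } },
  "rack"       => { "1.0.0" => {}, "2.0.0" => {} },
  "json-lite"  => { "2.0.0" => { "rack" => "< 2.0" }, "2.1.0" => { "rack" => "< 2.0" } }
}.freeze

# Picks, for every gem, the highest version that satisfies all requirements
# collected so far, and repeats until the requirements stop changing.
def resolve(index, wanted, max_passes: 20)
  requirements = wanted.transform_values { |req| [req] }

  max_passes.times do
    chosen = requirements.to_h do |name, reqs|
      requirement = Gem::Requirement.new(*reqs)
      best = index.fetch(name, {}).keys
                  .map { |v| Gem::Version.new(v) }
                  .select { |v| requirement.satisfied_by?(v) }
                  .max
      raise "no version of #{name} satisfies #{reqs.join(', ')}" unless best
      [name, best.to_s]
    end

    # Rebuild the requirement set from the top-level wants plus the
    # dependencies declared by the versions chosen in this pass.
    updated = wanted.transform_values { |req| [req] }
    chosen.each do |name, version|
      index[name][version].each do |dep_name, req|
        updated[dep_name] = ((updated[dep_name] || []) + [req]).uniq
      end
    end

    return chosen if updated == requirements
    requirements = updated
  end

  raise "did not settle after #{max_passes} passes"
end

p resolve(INDEX, { "app-helper" => ">= 0" })
```

In this made-up index, "rack" ends up at the older version because "json-lite" requires `< 2.0`, which shows why the tool needs the dependency requirements of every candidate version and not just the list of names.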

 

Fast dependency resolution is paramount for quick software development cycles and deployments. The design of the Rubygem index structure is therefore critical if the index is to keep satisfying this need for speed in the future as:

  • The number of Rubygems in repositories increases

  • The number of Rubygems depended on by software increases as software complexity increases

 

The 3 factors of the Rubygem index design that have been tweaked over time to meet this need for speed are:

  • Split boundaries of Rubygem information into multiple payloads

  • Payload size of each Rubygem information transmitted

  • Number of round-trip fetches required to retrieve all the Rubygem information payloads needed to completely resolve dependencies

 

Let us look at each of these 3 factors in more detail before analyzing how the Rubygem index has evolved across them:

  • Split boundaries of Rubygem information into multiple payloads

    • In general, the Rubygem information is split into versions information (lightweight) and dependency information (heavyweight) so that they can be transmitted separately

  • Payload size

    • The amount of versions information and dependency information transmitted can be tweaked to find the sweet spot between many round-trip fetches of minimal, just-sufficient, on-demand payloads and fewer round-trip fetches of bigger, overfetched payloads

  • Round-trip fetches

    • See the need to balance with payload size above
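The balance between payload size and round-trip fetches can be made concrete with a rough cost model. This is only a back-of-the-envelope sketch: the `estimated_seconds` helper is hypothetical, and the latency, bandwidth, and payload figures below are made-up numbers chosen purely for illustration, not measurements of any real repository.

```ruby
# Rough cost model: every round trip pays a fixed latency,
# and every byte transferred pays for bandwidth.
def estimated_seconds(round_trips:, payload_bytes:, latency_s:, bytes_per_s:)
  round_trips * latency_s + payload_bytes.to_f / bytes_per_s
end

# Made-up network conditions, for illustration only.
network = { latency_s: 0.1, bytes_per_s: 1_000_000 }

# One big overfetched payload versus many small on-demand payloads.
one_big    = estimated_seconds(round_trips: 1,  payload_bytes: 5_000_000, **network)
many_small = estimated_seconds(round_trips: 20, payload_bytes: 50_000,    **network)

puts format("one big fetch:     %.2f s", one_big)
puts format("many small fetches: %.2f s", many_small)
```

Changing the assumed latency or payload sizes shifts which strategy wins, which is the trade-off that each index design tries to balance.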

The image below depicts how the split of Rubygem information payload, the payload size, and the round-trip fetches required to retrieve all Rubygem information for complete dependency resolution has evolved over time.

[Image: evolution of the Rubygem index across payload split, payload size, and round-trip fetches]

Based on the image, the 'Bundler API' index likely has the smallest and fewest payloads required to retrieve all the information needed for dependency resolution. However, the 'Bundler API' index has the unfortunate side effect of overloading the servers with API requests and CPU usage. The 'Compact Index' is currently a good compromise between small-ish payloads (one big-ish, overfetched payload of Rubygem versions plus multiple small on-demand payloads, one for each dependency requirement) and few-ish round-trip fetches. In the next sections, we cover the details of each Rubygem index. Feel free to use them as a reference and skip to the conclusion.

Marshal.4.8.gz Index

Rubygem Information Boundary Split

  • None

  • A single index containing all information for each Rubygem is stuffed into one payload

 

Examples

  • This is a Gem::Specification for a single Rubygem version, e.g., RedCloth 4.2.1, and there are as many of these in Marshal.4.8.gz as there are Rubygem versions in the repository

Gem::Specification.new do |s|
  s.name = "RedCloth"
  s.version = Gem::Version.new("4.2.1")
  s.installed_by_version = Gem::Version.new("0")
  s.authors = ["Jason Garber"]
  s.date = Time.utc(2009, 12, 3)
  s.description = "RedCloth-4.2.1 - Textile parser for Ruby.\nhttp://redcloth.org/"
  s.email = "redcloth-upwards@rubyforge.org"
  s.homepage = "http://redcloth.org"
  s.metadata = nil
  s.require_paths = ["lib"]
  s.required_ruby_version = Gem::Requirement.new([">= 1.8.4"])
  s.required_rubygems_version = Gem::Requirement.new([">= 1.2"])
  s.rubygems_version = "1.3.5"
  s.specification_version = 3
  s.summary = "RedCloth-4.2.1 - Textile parser for Ruby. http://redcloth.org/"
end

Dependency Resolution Flow

  • With a single download, the Rubygem management tool has all the versions and dependency information required to completely resolve dependencies

Pros

  • With a single round-trip fetch, all the Rubygem information can be retrieved, and dependency resolution can be completed

Cons

  • There is superfluous information in Marshal.4.8.gz that is not needed for dependency resolution
  • Downloading the single overfetched payload makes up the bulk of the time needed to perform dependency resolution, and this is exacerbated as the number of Rubygem versions in the repository increases

  • The CPU/memory required to process the big payload may slow down the system and negate speed gains from doing a single round-trip fetch

Notes

In 2014, it was reported that there were 100,000 Rubygems in the official Rubygems repository, and by 13 Sep 2023 the Rubygem count had ballooned to 190,000. This will increase further over time.

specs.4.8.gz Index

Information Boundary Split

  • A single versions index with all Rubygem versions info

  • A Rubygem specification file for each Rubygem version

Examples

  • The versions index containing all the Rubygem versions of each Rubygem in the repository

[
  ["rack", Gem::Version.new("1.0.0"), "ruby"],
  ["rack", Gem::Version.new("2.0.0"), "ruby"],
  ["rubocop", Gem::Version.new("3.2.0"), "ruby"],
]
  • Gem::Specification for every single Rubygem version in the repository, e.g., RedCloth, which can be retrieved on-demand

Gem::Specification.new do |s|
  s.name = "RedCloth"
  s.version = Gem::Version.new("4.2.1")
  s.installed_by_version = Gem::Version.new("0")
  s.authors = ["Jason Garber"]
  s.date = Time.utc(2009, 12, 3)
  s.description = "RedCloth-4.2.1 - Textile parser for Ruby.\nhttp://redcloth.org/"
  s.email = "redcloth-upwards@rubyforge.org"
  s.homepage = "http://redcloth.org"
  s.metadata = nil
  s.require_paths = ["lib"]
  s.required_ruby_version = Gem::Requirement.new([">= 1.8.4"])
  s.required_rubygems_version = Gem::Requirement.new([">= 1.2"])
  s.rubygems_version = "1.3.5"
  s.specification_version = 3
  s.summary = "RedCloth-4.2.1 - Textile parser for Ruby. http://redcloth.org/"
end

Dependency Resolution Flow

  • Download the versions index

  • Recursively download each required Rubygem dependency file

  • Resolve all version dependencies using both pieces of information
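The first step of the flow above can be sketched as follows, assuming the versions index is a gzip-compressed, Marshal-dumped array of `[name, version, platform]` entries as in the example. The payload is built in memory so the sketch runs without a network connection, and the `versions_by_name` helper is a hypothetical name, not part of Rubygems.

```ruby
require "rubygems"
require "zlib"

# Build a tiny payload shaped like the versions index example above.
entries = [
  ["rack", Gem::Version.new("1.0.0"), "ruby"],
  ["rack", Gem::Version.new("2.0.0"), "ruby"],
  ["rubocop", Gem::Version.new("3.2.0"), "ruby"]
]
payload = Zlib.gzip(Marshal.dump(entries))

# What a client might do after downloading the payload: decompress it,
# unmarshal it, and group the available versions by Rubygem name.
# Note: Marshal.load should only ever be used on data from a trusted source.
def versions_by_name(gz_payload)
  Marshal.load(Zlib.gunzip(gz_payload))
         .group_by(&:first)
         .transform_values { |rows| rows.map { |_, version, _| version }.sort }
end

index = versions_by_name(payload)
puts index["rack"].map(&:to_s).join(", ")
```

Even in this toy form, the whole index has to be downloaded and decoded before a single version can be looked up, which is the cost noted under Cons.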

Pros

  • With a single download of versions information, and a handful of downloads of the relevant Rubygem dependency specifications, dependency resolution can be completed

Cons

  • The versions index, although trimmed down compared to Marshal.4.8.gz, is still too big, and downloading it makes up the bulk of the time needed to perform dependency resolution

latest_specs.4.8.gz

Information Boundary Split

  • A single versions index with ONLY the latest Rubygem versions info for each Rubygem

  • Rubygem specification file for each Rubygem version

Examples

  • Versions index (only one entry per Rubygem - the latest version of that Rubygem)

[
  ["rack", Gem::Version.new("2.0.0"), "ruby"],
  ["rubocop", Gem::Version.new("3.2.0"), "ruby"],
]
  • Gem::Specification for every single Rubygem version in the repository, e.g., RedCloth, which can be retrieved on-demand
Gem::Specification.new do |s|
  s.name = "RedCloth"
  s.version = Gem::Version.new("4.2.1")
  s.installed_by_version = Gem::Version.new("0")
  s.authors = ["Jason Garber"]
  s.date = Time.utc(2009, 12, 3)
  s.description = "RedCloth-4.2.1 - Textile parser for Ruby.\nhttp://redcloth.org/"
  s.email = "redcloth-upwards@rubyforge.org"
  s.homepage = "http://redcloth.org"
  s.metadata = nil
  s.require_paths = ["lib"]
  s.required_ruby_version = Gem::Requirement.new([">= 1.8.4"])
  s.required_rubygems_version = Gem::Requirement.new([">= 1.2"])
  s.rubygems_version = "1.3.5"
  s.specification_version = 3
  s.summary = "RedCloth-4.2.1 - Textile parser for Ruby. http://redcloth.org/"
end

Dependency Resolution Flow

  • Download the versions index

  • Recursively download each Rubygem dependency file

  • Resolve all version dependencies, if possible, using both pieces of information (only the latest version of each Rubygem is available)
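The relationship between the full versions index and the latest-only index can be illustrated with a short sketch. The `latest_only` helper is hypothetical and only keeps the highest version per name and platform; how a real repository generates latest_specs.4.8.gz may involve additional rules that are not modeled here.

```ruby
require "rubygems"

# Keeps only the highest version for each [name, platform] pair.
def latest_only(entries)
  entries.group_by { |name, _version, platform| [name, platform] }
         .map { |_key, rows| rows.max_by { |_name, version, _platform| version } }
end

all_versions = [
  ["rack", Gem::Version.new("1.0.0"), "ruby"],
  ["rack", Gem::Version.new("2.0.0"), "ruby"],
  ["rubocop", Gem::Version.new("3.2.0"), "ruby"]
]

latest_only(all_versions).each do |name, version, platform|
  puts "#{name} #{version} (#{platform})"
end
```

The smaller index is cheaper to download, but any requirement that needs an older version, such as `rack < 2.0`, can no longer be satisfied from it.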

Pros

  • With a single download of versions information, and a handful of downloads of the relevant Rubygem dependency specifications, dependency resolution can be completed

Cons

  • Limited use because it assumes that the user wants to install only the latest versions of Rubygems, which is rarely the case, and is impossible if a Rubygem you want to install, or its dependencies, has a dependency requirement that mandates an older version of another Rubygem

Bundler API

Information Boundary Split

  • On-demand Rubygem version info and dependency info

Examples

  • Rubygem version info and dependency info for a single Rubygem requested on demand

[
  {
    "name":"rubygem-test-package",
    "number":"0.1.0",
    "platform":"ruby",
    "rubygems_version":"2.5.1",
    "ruby_version":"\u003e=0",
    "checksum":"89d17a80f4e79d8f8918ffcced10f792428d73a7ab881b75eaeb22c28d1aa2f4",
    "created_at":"2022-04-15T03:53:45.000Z",
    "dependencies":[]
  }
]

Dependency Resolution Flow

  • Get a list of names of all Rubygems to be installed

  • Make a single on-demand call to Bundler API to retrieve both the versions and dependencies for the entire list of Rubygem names

  • Repeat recursively by building a list of names of all the dependencies found, and making an on-demand call to Bundler API

  • Resolve all version dependencies based on all the retrieved information
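The recursive fetch loop in the flow above can be sketched like this. The remote API is replaced by a hypothetical in-memory hash (`FAKE_API`) with made-up gem names, and `fetch_all` is an illustrative helper, so nothing here reflects the actual request format of the Bundler API. The point is only to show how each batch of newly discovered dependency names costs one more round trip.

```ruby
# Hypothetical stand-in for the remote API: name => list of version records.
FAKE_API = {
  "web-app"  => [{ "number" => "1.0.0",
                   "dependencies" => [["router", ">= 1.0"], ["logger-x", "~> 2.0"]] }],
  "router"   => [{ "number" => "1.2.0", "dependencies" => [["logger-x", ">= 1.0"]] }],
  "logger-x" => [{ "number" => "2.0.1", "dependencies" => [] }]
}.freeze

# Fetches records for the requested names, then keeps fetching the names of
# any dependencies not seen yet, one batched "request" per pass.
def fetch_all(names, api)
  known = {}
  round_trips = 0
  pending = names.uniq

  until pending.empty?
    round_trips += 1
    response = pending.to_h { |name| [name, api.fetch(name, [])] } # one batched call
    known.merge!(response)

    discovered = response.values.flatten(1)
                         .flat_map { |record| record["dependencies"].map(&:first) }
    pending = discovered.uniq - known.keys
  end

  [known, round_trips]
end

records, trips = fetch_all(["web-app"], FAKE_API)
puts "fetched #{records.keys.sort.join(', ')} in #{trips} round trips"
```

Because each request depends on the exact set of names asked for, two clients rarely ask for the same thing, which hints at why the responses are hard to cache.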

Pros

  • Retrieves only required information

Cons

  • The contents returned by the Bundler API are not cache-friendly, so the Rubygem repository ends up overburdened by lots of these on-demand requests

Notes

Due to the CPU/memory burden that the Bundler API generated on rubygems.org, it was deprecated in 2014 in favor of the Compact Index described next.

Compact Index

Information Boundary Split

  • All versions info is trimmed down to only versions, excluding platform, etc.

  • On-demand Rubygem dependency info for each Rubygem

Examples

  • '/versions' API endpoint returns the versions information containing all the Rubygem versions of each Rubygem in the repository

rack 0.9.2,1.0.0,1.0.1,1.1.0
sinatra 1.0,1.0.1,1.0.1-jruby,1.1
  • '/info/<GEM_NAME>' API endpoint returns the dependency info for a single Rubygem, e.g., nokogiri, which can be retrieved on-demand

1.1.5 |checksum:abc123
1.1.6 rake:>= 0.7.1,activesupport:= 1.3.1|ruby:> 1.8.7,checksum:bcd234
1.1.7.rc2 rake:>= 0.7.1|ruby:>= 1.8.7,rubygems:> 1.3.1,checksum:cde345
1.1.7.rc3 |rubygems:> 1.3.1,checksum:def456
1.2.0-java mini_portile:~> 0.5.0|checksum:fgh567
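To show how little a client has to do with the '/info' payload, here is a minimal parsing sketch. It assumes the line layout seen in the example above (a version, a space, comma-separated `name:requirement` dependencies, a `|`, then comma-separated `key:value` requirements) and does not cover anything beyond that; `parse_info_line` is a hypothetical helper, not Bundler's own parser.

```ruby
# Parses one line of the '/info/<GEM_NAME>' example format into a Hash.
def parse_info_line(line)
  version, rest = line.chomp.split(" ", 2)
  deps_part, reqs_part = rest.to_s.split("|", 2)

  # Turns "a:x,b:y" into { "a" => "x", "b" => "y" }; empty input gives {}.
  to_pairs = lambda do |part|
    part.to_s.split(",").to_h { |item| item.split(":", 2).map(&:strip) }
  end

  {
    version: version,
    dependencies: to_pairs.call(deps_part),
    requirements: to_pairs.call(reqs_part)
  }
end

sample = <<~INFO
  1.1.5 |checksum:abc123
  1.1.6 rake:>= 0.7.1,activesupport:= 1.3.1|ruby:> 1.8.7,checksum:bcd234
  1.2.0-java mini_portile:~> 0.5.0|checksum:fgh567
INFO

sample.each_line { |line| p parse_info_line(line) }
```

Each line carries just the version, its dependencies, and a few requirements, which is far less data than a full Gem::Specification.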

Pros

  • The versions index is append-only, so a combination of HTTP caching headers (‘ETag’ and ‘Range’) allows a client to retrieve only the parts of the versions index that have been appended since its last retrieval

  • Instead of returning the full Gem::Specification, the '/info/<GEM_NAME>' API returns just the relevant dependency information, which makes the payload smaller
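The incremental retrieval described in the first point can be simulated without a network. In this sketch, `FakeVersionsServer` and `CachingClient` are hypothetical classes: the server's `get(offset)` stands in for an HTTP request with a 'Range' header, and the ETag is assumed to be an MD5 digest of the file, which is an illustrative choice and not a statement about how any particular repository computes it.

```ruby
require "digest/md5"

# Hypothetical server holding an append-only versions file.
class FakeVersionsServer
  def initialize(body)
    @body = body.dup
  end

  def append(lines)
    @body << lines
  end

  # Assumed ETag scheme for this sketch: a digest of the current file.
  def etag
    Digest::MD5.hexdigest(@body)
  end

  # Mimics a ranged GET: returns only the bytes from `offset` onward.
  def get(offset)
    @body.byteslice(offset, @body.bytesize - offset)
  end
end

# Hypothetical client that keeps a local copy and fetches only what is new.
class CachingClient
  attr_reader :local, :bytes_downloaded

  def initialize
    @local = +""
    @etag = nil
    @bytes_downloaded = 0
  end

  def refresh(server)
    return if @etag == server.etag # local copy is current; nothing to download

    chunk = server.get(@local.bytesize) # ask only for the appended tail
    @bytes_downloaded += chunk.bytesize
    @local << chunk
    @etag = server.etag
  end
end

server = FakeVersionsServer.new("rack 0.9.2,1.0.0,1.0.1,1.1.0\n")
client = CachingClient.new
client.refresh(server)
server.append("sinatra 1.0,1.0.1,1.0.1-jruby,1.1\n")
client.refresh(server)
puts "downloaded #{client.bytes_downloaded} bytes in total"
```

The second refresh transfers only the appended line, so the cost of staying up to date is proportional to what changed and not to the size of the whole index.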

Cons

  • It requires more round-trip fetches to retrieve all information to completely perform dependency resolution

Notes

A replacement for the Bundler API, which imposed a high CPU/memory load on rubygems.org. It outperforms the earlier indexes because the Rubygem versions information is trimmed, cacheable, and incrementally retrievable, which keeps payloads small, and because each Rubygem's dependency information is retrievable on demand and trimmed down to only dependency information.

Conclusion

As the number of Rubygems grows, the Rubygem index has to continue to evolve to enable Rubygem management tools to quickly resolve Rubygem dependencies. In the evolution thus far, the maintainers have carefully tweaked the boundaries of Rubygem index information transmitted in each payload to balance between the payload size (only-required-information vs overfetching-information) and round-trip fetch counts needed to retrieve all information to completely resolve dependencies. In the future, perhaps a hierarchical Rubygem index or a distributed Rubygem index to improve speed and resiliency may be on the cards.

 
