On "Service Model"
helping librarians to preserve all human knowledge
While reading “Service Model”, a book by Adrian Tchaikovsky, I was first fascinated by the librarians, and then later awestruck by the elegance and silliness of their idea. This post will explore a potential implementation of their algorithm, take a small detour into Async in Ruby, and include some spoilers for that part of the book. You have been warned.
“All human knowledge,” the Chief Librarian declaimed, “retrieved from the fallen world outside by our gallant librarians, restored, decoded, and read in the First Hall. Rendered into a common binary code, then placed before the copyists, who translate it into our secure system here. Our master server, which catalogues and files each individual bit of information, to be stored in order within our archive for ease of cross-reference and retrieval. Is this not the greatest endeavor of all human history?”
Service Model by Adrian Tchaikovsky
What exactly are we building?
To understand the main idea, we have to read a bit further. In the book, librarians went out into the world to collect material they wanted to preserve (which is all digital knowledge). They brought it back to the library, where it was loaded or read in the First Hall on computers not connected to the internal network to ensure the archive’s safety. It was then copied by copyists from the monitor into a secure system, where it was sorted in the most literal sense of that word: all the bits were sorted in order before being stored in the archive. And here is the main idea behind doing that:
Instead, we determined that if we filed all our data in this universal manner, then the Library could become more than the sum of the information placed within it. Our Archive not only preserves all the learning that we have encoded into it. Because our zeroes and ones may be retrieved in any order, rather than simply that of the original documents, it contains all possible knowledge! We here preserve every conceivable book, manual, tract, recording, and program that could ever have been created, not merely all those that were. We are the greatest repository of potential knowledge in the history of history itself.
Let’s help the librarians and write a script that archives itself along with all necessary dependencies so that we can restore it in case of a disaster. Here is the plan:
- We will write a single-file script.
- It will declare all necessary dependencies as an inline bundle.
- It will archive its own source and the binary versions of all dependencies.
- The main character in the book asked to see data at 25% and 75% of the archive, so we will do the same.
Implementing archiver v1, single-threaded
We will start by declaring the necessary dependencies. We will use Bundler to manage them, benchmark to measure steps (after all, we want the best performance!), faraday to abstract the HTTP client, and zlib for giggles — just to see how well our pure archive compresses.
require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "benchmark"
gem "faraday"
gem "zlib"
end
Reading our own source is as simple as File.read(__FILE__). We can enumerate all loaded dependencies via Gem::Specification, and then download them from https://rubygems.org via its API:
def download_gem(name)
  response = Faraday.get(
    "https://rubygems.org/downloads/#{name}.gem",
    {},
    user_agent: "ServiceModel-Librarian/1.0"
  )
  raise "Failed to download: #{name}" unless response.success?

  response.body
end
data = [File.read(__FILE__)]

Gem::Specification.each do |spec|
  data << download_gem(spec.full_name)
end
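A gem’s full_name is essentially its "name-version" string, which is also how the .gem files are named under rubygems.org downloads. As a quick sanity check (the exact list and versions will vary with your environment), we can peek at what is going to be archived:
Gem::Specification.map(&:full_name).sort
# => every loaded gem as a "name-version" string,
#    e.g. faraday plus its own dependencies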
So far, pretty straightforward. Let’s sort the data into the archive! We will need to sort the bits, so first we have to extract them from the data. We can use pack and unpack with the B modifier:
"hello".unpack("B*")
# => ["0110100001100101011011000110110001101111"]
["0110100001100101011011000110110001101111"].pack("B*")
# => "hello"
All we need to do now is to sort those bits:
"01101000".chars.sort.join
# => "00000111"
The full sorting algorithm looks like this:
sorted = data.join
  .unpack("B*") # take bits
  .map { _1.chars.sort.join } # sort them bits
  .pack("B*") # pack into bytes
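Just to sanity-check the pipeline on something tiny before feeding it the whole archive (this snippet is for illustration only and is not part of the script):
["hello"].join
  .unpack("B*")               # ["0110100001100101011011000110110001101111"]
  .map { _1.chars.sort.join } # 19 zeroes followed by 21 ones
  .pack("B*")
# => five bytes: 0x00, 0x00, 0x07, 0xFF, 0xFF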
Let’s put it all together and add some benchmarking:
#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"

  gem "benchmark"
  gem "faraday"
  gem "zlib"
end

def download_gem(name)
  response = Faraday.get(
    "https://rubygems.org/downloads/#{name}.gem",
    {},
    user_agent: "ServiceModel-Librarian/1.0"
  )
  raise "Failed to download: #{name}" unless response.success?

  response.body
end

data = [File.read(__FILE__)]
sorted = nil

Benchmark.bm do |x|
  x.report("download:") do
    Gem::Specification.each do |spec|
      data << download_gem(spec.full_name)
    end
  end

  x.report(" sort:") do
    sorted = data.join
      .unpack("B*") # take bits
      .map { _1.chars.sort.join } # sort them bits
      .pack("B*") # pack to bytes
  end
end

puts <<~MSG
  \nResults:
  Original: #{data.sum { _1.size }} bytes in #{data.count} pieces
  Sorted: #{sorted.size} bytes
  Archived: #{Zlib.deflate(sorted).size} bytes
  25%: #{sorted[(sorted.size * 0.25).to_i].unpack("B*").first}
  75%: #{sorted[(sorted.size * 0.75).to_i].unpack("B*").first}
MSG
Here is what we see when the script is run:
$ ruby service-model-archive-v1.rb
user system total real
download: 0.021359 0.010195 0.031554 ( 0.296913)
sort: 0.562677 0.110775 0.673452 ( 0.706862)
Results:
Original: 674458 bytes in 9 pieces
Sorted: 674458 bytes
Archived: 679 bytes
25%: 00000000
75%: 11111111
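Those last two lines are exactly what we should expect: after sorting, the whole archive is one run of zeroes followed by one run of ones, so any byte sampled before the boundary reads 00000000 and any byte after it reads 11111111. A quick way to see where that boundary sits (illustration only):
zero_fraction = sorted.unpack1("B*").count("0").fdiv(sorted.size * 8)
# as long as 0.25 < zero_fraction < 0.75, the byte at 25% is all zeroes
# and the byte at 75% is all ones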
The script works, but it does not scale well, and it does not really model what happens in the book: there were multiple librarians out collecting materials in the world, while in our case a single thread downloads all the dependencies. Let’s fix that.
Implementing archiver v2, async fibers
We will parallelize the collection using the amazing async gem. The simplest solution would be to spawn a fiber for each dependency:
Sync do |parent|
  gems = Gem::Specification.map do |spec|
    parent.async do
      download_gem(spec.full_name)
    end
  end.map(&:wait)

  data.concat(gems)
end
This works perfectly fine, but the more dependencies we have, the more fibers we spawn and the more network connections we open. Normally, we do not want an unbounded number of tasks. To introduce a limit, we can use the Async::Semaphore primitive:
Sync do
  semaphore = Async::Semaphore.new(10)

  gems = Gem::Specification.map do |spec|
    semaphore.async do
      download_gem(spec.full_name)
    end
  end.map(&:wait)

  data.concat(gems)
end
Waiting for the tasks to complete is pretty trivial: we just have to call wait on them in order. A more elegant solution is to use the Async::Barrier primitive. Every task spawned under the barrier (here, through the semaphore) is registered with it, so a single barrier.wait covers them all:
Sync do
  barrier = Async::Barrier.new
  semaphore = Async::Semaphore.new(10, parent: barrier)

  Gem::Specification.each do |spec|
    semaphore.async do
      data << download_gem(spec.full_name)
    end
  end

  barrier.wait
end
The full script:
#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"

  gem "async", require: ["async", "async/barrier", "async/semaphore"]
  gem "benchmark"
  gem "faraday"
  gem "zlib"
end

def download_gem(name)
  response = Faraday.get(
    "https://rubygems.org/downloads/#{name}.gem",
    {},
    user_agent: "ServiceModel-Librarian/1.0"
  )
  raise "Failed to download: #{name}" unless response.success?

  response.body
end

data = [File.read(__FILE__)]
sorted = nil

Benchmark.bm do |x|
  x.report("download:") do
    Sync do
      barrier = Async::Barrier.new
      semaphore = Async::Semaphore.new(10, parent: barrier)

      Gem::Specification.each do |spec|
        semaphore.async do
          data << download_gem(spec.full_name)
        end
      end

      barrier.wait
    end
  end

  x.report(" sort:") do
    sorted = data.join
      .unpack("B*") # take bits
      .map { _1.chars.sort.join } # sort them bits
      .pack("B*") # pack to bytes
  end
end

puts <<~MSG
  \nResults:
  Original: #{data.sum { _1.size }} bytes in #{data.count} pieces
  Sorted: #{sorted.size} bytes
  Archived: #{Zlib.deflate(sorted).size} bytes
  25%: #{sorted[(sorted.size * 0.25).to_i].unpack("B*").first}
  75%: #{sorted[(sorted.size * 0.75).to_i].unpack("B*").first}
MSG
Let’s see how it performs:
$ ruby service-model-archive-v2.rb
user system total real
download: 0.029931 0.009220 0.039151 ( 0.160382)
sort: 0.643403 0.164919 0.808322 ( 0.845208)
Results:
Original: 868246 bytes in 16 pieces
Sorted: 868246 bytes
Archived: 867 bytes
25%: 00000000
75%: 11111111
We have archived more data faster, which will definitely speed up our initiative to preserve all knowledge. Most of the time is still spent on sorting, which can theoretically be optimized simply by counting bits, but that was not the point of the article. To be fair, this article has no point, but I hope you had at least a fraction of the fun I had writing it (of which I had a lot).
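For the curious, a minimal sketch of that counting approach (illustration only, not something the scripts above do): since the result is always a run of zeroes followed by a run of ones, it is enough to count the one bits and build the two runs directly.
bits = data.join.unpack1("B*") # one long string of "0"s and "1"s
ones = bits.count("1")
sorted = ["0" * (bits.size - ones) + "1" * ones].pack("B*")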
“Every bit of information duly filed, in order,” the Chief Librarian announced proudly. “An Archive perfect in its logical construction.”
You will have to read the book to see how this plan plays out for them :)
The full source code can be found in the blog repository.