
Python PyPI repository internals

TL;DR

This blog post dives into the Python PyPI repository metadata format. We’ll cover the different metadata files that make up a PyPI repository, what the files mean, and how you can inspect the metadata yourself.

 

What is a Python PyPI repository?

A Python PyPI repository is a collection of Python packages and metadata that is readable by the pip command line tool. Most Python programmers will be familiar with running pip install to install Python packages.

Python PyPI root repository metadata

Python PyPI repository metadata is a set of two simple HTML documents describing the available Python packages and the versions of each package.

The first metadata file is located at the /simple/ endpoint of a PyPI repository URL. This document contains an API version number and one HTML link for each package available in the repository.

For example, a PyPI repository on packagecloud would have its main metadata available at https://packagecloud.io/joe/hi/pypi/simple/.

You can use curl to download the metadata and examine it. For example:

$ curl -L https://packagecloud.io/joe/hi/pypi/simple
<!DOCTYPE html>
<html>
  <head>
    <title>Simple Index</title>
    <meta name="api-version" value="2" />
  </head>
  <body>
    <a href="/simple/packagecloud-test">packagecloud-test</a>
  </body>
</html>

This repository contains a single package called packagecloud-test. The HTML link for that package points to a second document, which lists the versions of packagecloud-test that are available.
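The root index shown above can also be read programmatically. The following is a minimal sketch, assuming the page has the same shape as the curl output above; it uses only Python's standard html.parser module. The names SimpleIndexParser and parse_simple_index are illustrative and are not part of pip or packagecloud.

```python
from html.parser import HTMLParser


class SimpleIndexParser(HTMLParser):
    """Collect the api-version and (name, href) pairs from a root index page."""

    def __init__(self):
        super().__init__()
        self.api_version = None
        self.packages = []   # list of (package name, href) tuples
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "api-version":
            self.api_version = attrs.get("value")
        elif tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.packages.append(("".join(self._text).strip(), self._href))
            self._href = None


def parse_simple_index(html):
    """Return (api_version, [(name, href), ...]) for a root index document."""
    parser = SimpleIndexParser()
    parser.feed(html)
    parser.close()
    return parser.api_version, parser.packages


# The document below is copied from the curl output in this post.
SAMPLE_INDEX = """<!DOCTYPE html>
<html>
  <head>
    <title>Simple Index</title>
    <meta name="api-version" value="2" />
  </head>
  <body>
    <a href="/simple/packagecloud-test">packagecloud-test</a>
  </body>
</html>
"""

api_version, packages = parse_simple_index(SAMPLE_INDEX)
print(api_version, packages)
```

In practice you would feed the parser the body returned by curl (or any HTTP client) instead of the embedded sample string.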

The metadata format is relatively simple and is documented in the Warehouse docs.

This documentation shows that the server should include a response header named X-PyPI-Last-Serial. It appears that this header was originally added to support a PyPI mirroring tool called bandersnatch.

The need for this header was later removed from bandersnatch. As far as we can tell, it is not necessary for any other application, but is probably still necessary for backward compatibility with older versions of bandersnatch.

 

Python PyPI package metadata

Individual Python package metadata can also be retrieved by following the links in the root metadata or by constructing the metadata URL manually. Package metadata is located at the URL endpoint /simple/packagename.

Following our previous example, we can request the metadata for the Python package packagecloud-test by using curl:

$ curl -L https://packagecloud.io/joe/hi/pypi/simple/packagecloud-test
<!DOCTYPE html>
<html>
  <head>
    <title>Links for packagecloud-test</title>
    <meta name="api-version" value="2" />
  </head>
  <body>
    <h1>Links for packagecloud-test</h1>
    <a rel="internal" href="/joe/hi/pypi/packages/packagecloud_test-0.9.7b1-py2.py3-none-any.whl#sha1=25416aa58f9a5e7a6be23e084c6cbfdd392ab3b2">packagecloud_test-0.9.7b1-py2.py3-none-any.whl</a>
  </body>
</html>

The metadata for individual packages is relatively straightforward as well. It contains one HTML link for each available file of the package, such as the wheel for version 0.9.7b1 above. Each link can also include a checksum in the URL fragment (the part after the #) that clients like pip can verify. According to the documentation, the supported checksum algorithms are: md5, sha1, sha224, sha256, sha384, and sha512.
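To illustrate how a client might use the checksum, here is a hedged sketch that splits a link of the form shown above into its URL, algorithm, and digest, and then checks some downloaded bytes against that digest. The function names split_checksum and verify_checksum are hypothetical, and this is not how pip itself is implemented; it only demonstrates the idea using Python's standard hashlib and urllib.parse modules.

```python
import hashlib
from urllib.parse import urldefrag

# Algorithms listed as supported in the documentation referenced by this post.
SUPPORTED_ALGORITHMS = {"md5", "sha1", "sha224", "sha256", "sha384", "sha512"}


def split_checksum(href):
    """Split 'url#algo=hexdigest' into (url, algo, hexdigest).

    algo and hexdigest are None when the link carries no usable checksum.
    """
    url, fragment = urldefrag(href)
    algo, sep, digest = fragment.partition("=")
    if not sep or algo not in SUPPORTED_ALGORITHMS:
        return url, None, None
    return url, algo, digest.lower()


def verify_checksum(data, algo, expected_hex):
    """Return True if the digest of data under algo matches expected_hex."""
    return hashlib.new(algo, data).hexdigest() == expected_hex


# Link copied from the package metadata shown in this post.
href = (
    "/joe/hi/pypi/packages/packagecloud_test-0.9.7b1-py2.py3-none-any.whl"
    "#sha1=25416aa58f9a5e7a6be23e084c6cbfdd392ab3b2"
)
url, algo, digest = split_checksum(href)
print(url, algo, digest)

# Stand-in bytes for a downloaded file; a real client would hash the wheel itself.
payload = b"example file contents"
print(verify_checksum(payload, "sha1", hashlib.sha1(payload).hexdigest()))
```

A client that finds a mismatch would typically refuse to install the file, since the download may be corrupted or tampered with.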

The rel attribute has special meaning. Per the documentation, internal refers to a URL that is a direct package link. There are other possible values for rel which include things like a homepage for the project.
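As a small sketch of how the rel attribute could be used, the code below collects every link on a package page and keeps only those marked rel="internal". The sample page includes a second link with rel="homepage" pointing at https://example.com/; that link is an assumption added purely for illustration and does not appear in the real metadata above. The names PackageLinkParser and direct_package_links are also hypothetical.

```python
from html.parser import HTMLParser


class PackageLinkParser(HTMLParser):
    """Record the rel and href attributes of every <a> tag on a package page."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (rel, href) tuples; rel is None if absent

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "href" in attrs:
                self.links.append((attrs.get("rel"), attrs["href"]))


def direct_package_links(html):
    """Return the hrefs of links marked rel="internal" (direct package links)."""
    parser = PackageLinkParser()
    parser.feed(html)
    parser.close()
    return [href for rel, href in parser.links if rel == "internal"]


# The first link is copied from this post; the homepage link is a made-up example.
SAMPLE_PAGE = """<html>
  <body>
    <h1>Links for packagecloud-test</h1>
    <a rel="internal" href="/joe/hi/pypi/packages/packagecloud_test-0.9.7b1-py2.py3-none-any.whl#sha1=25416aa58f9a5e7a6be23e084c6cbfdd392ab3b2">packagecloud_test-0.9.7b1-py2.py3-none-any.whl</a>
    <a rel="homepage" href="https://example.com/">Project home page</a>
  </body>
</html>
"""

for link in direct_package_links(SAMPLE_PAGE):
    print(link)
```

Filtering on rel this way keeps a client from treating a project home page as if it were a downloadable package file.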

Like the root metadata, this page also includes the X-PyPI-Last-Serial header.

There is one important thing to note when examining PyPI metadata: package names in metadata URLs are case insensitive, and hyphens and underscores should be treated as interchangeable.

This means that requests for https://packagecloud.io/joe/hi/pypi/simple/packagecloud-test, https://packagecloud.io/joe/hi/pypi/simple/packagecloud-TEST, and https://packagecloud.io/joe/hi/pypi/simple/packagecloud_test should all return the same metadata.
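One way to see why these URLs are equivalent is to normalize the package name before building the URL. The sketch below uses the normalization rule given in PEP 503 (lowercase the name and collapse runs of hyphens, underscores, and periods into a single hyphen); note that PEP 503 also treats periods as equivalent, which goes slightly beyond what this post describes. The helper package_metadata_url is a hypothetical name introduced for illustration.

```python
import re


def normalize_name(name):
    """Normalize a package name using the rule given in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def package_metadata_url(simple_url, name):
    """Build the package metadata URL from a repository's /simple/ URL."""
    return simple_url.rstrip("/") + "/" + normalize_name(name)


base = "https://packagecloud.io/joe/hi/pypi/simple/"
for requested in ("packagecloud-test", "packagecloud-TEST", "packagecloud_test"):
    # Each spelling resolves to the same metadata URL.
    print(requested, "->", package_metadata_url(base, requested))
```

A server can apply the same function to the name in an incoming request path so that every spelling maps to one stored package.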

This and other important information about Python package metadata is documented in PEP 0426. The name normalization rule used by the simple repository API is also specified in PEP 503.

 

Conclusion

Python PyPI metadata consists of a set of two HTML documents. The HTML documents describe the packages available, the versions of those packages, and a URL for downloading each of the package files.

Python PyPI metadata can be manually retrieved and examined on the command line using curl. This is useful if you need to debug an issue with your repository or are curious about the inner workings of Python PyPI repositories.
