Automatically parsing metadata when distributing packages

Mar 11, 2022 · 11 min read

Introduction

Packages are reusable software modules bundled with the metadata and configuration information that users need in order to work with them. Packages save developers the effort of reinventing the wheel. To accomplish this, a package must be distributed in a way that lets developers understand what it is about and how to use it.

Package installers or package management systems act as a bridge between developers who are searching for packages and the distributors who make them available. They keep track of all the information regarding a package and automate the installation process. Package managers are usually tightly coupled with languages and frameworks.

As a result, they can only work with packages meant for those specific languages. If you want a package management system that can work with multiple languages, frameworks, and operating systems, check out Packagecloud. This post explains how package metadata is parsed automatically when packages are distributed.

What is Package Distribution?

Package distribution is the process of making packages available to the general public in a formal way. This involves properly documenting the packages, adding metadata, and then bundling all of this together in a format that package installers can understand. Since attackers often use malicious packages to gain access to secure systems, it is very important to ensure the security of packages, and package distribution plays a big role in this. If you are not familiar with the subject, refer to our article on dependency confusion.

Different languages and frameworks specify different rules and standards for distributing packages. These rules focus on making the information about packages available as accurately as possible and on helping package installers automate the search and install process. For example, in the case of the Debian Linux distribution, packages are distributed as .deb files that contain the actual software, its metadata, and information on how to use it. The packages are then distributed through APT repositories that are maintained by the community. Installation is then done using the apt-get install command or the dpkg utility.

In the case of community-maintained public package repositories, the job of ensuring security and making all information about the package available then becomes the responsibility of the community. With packages being a very vulnerable point of attack to gain entry to a software system, enterprise organizations generally maintain private package repositories so that the content can be internally verified and curated.

For organizations using multiple languages and technologies for developing their systems, this will be a serious effort. They will have to maintain multiple repositories catering to different languages and frameworks unless they use a repository manager like Packagecloud that can handle packages from multiple languages at the same time.

What is Package Metadata?

Other than maintaining security, another important responsibility of package managers is to process the information that describes the packages. This information is generally termed package metadata. A reusable package is only useful if the people who want to use it are able to find it easily. Hence package metadata is a very important aspect of streamlining the package distribution process.

Package metadata helps users understand the purpose of the package, how to use it, and the prerequisites for using it. While different languages have varied requirements about the metadata that must be included when publishing a package, most of them have the following attributes.

1. Package name: A meaningful name that depicts what the package does.

2. Description: Description helps users to understand what the package does.

3. Version: The version number of the package. This is very important since it is closely tied with the specific functionalities that are supported by your packages. As packages evolve, functionalities may get added or removed and versions are the only way to track the supported functions.

4. Installation requirements: These are prerequisites of the working environment where the package will work. This may also include OS versions, required auxiliary services, and framework versions.

5. Build dependencies: Most packages depend on other packages or need other packages to be built as independent executables. Since most packages will use other commonly used packages as dependencies, these are not bundled with each package to avoid duplication.

6. Suggested dependencies: This section includes dependencies that are not required by the package for normal operation, but can help the package perform better in specific functionalities. For example, there may be packages that, if installed, will make specific functionalities faster, but are not mandatory for the typical operation of the package.

7. License: Details about who can use the package without legal issues.

8. Author: The development organization or developer who implemented the package. This is to enable the users to have a support contact in case the packages do not work as expected.
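To make the list above concrete, here is a minimal sketch of how these common attributes could be modelled as a single record. The class name, the field names, and the sample values are our own assumptions for illustration; no packaging tool defines exactly this structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackageMetadata:
    """Hypothetical record holding the attributes most package formats share."""
    name: str
    version: str
    description: str = ""
    license: str = ""
    author: str = ""
    # Prerequisites of the environment, e.g. OS or framework versions.
    installation_requirements: List[str] = field(default_factory=list)
    # Packages that must be present for the package to build and run.
    build_dependencies: List[str] = field(default_factory=list)
    # Optional packages that improve specific functionality.
    suggested_dependencies: List[str] = field(default_factory=list)


# Sample values are invented for the example.
example = PackageMetadata(
    name="example-pkg",
    version="1.2.0",
    description="A hypothetical example package",
    license="MIT",
    author="Jane Doe",
    build_dependencies=["requests>=2.0"],
)
```

Only the name and version are mandatory in this sketch; real formats differ on which attributes they insist on.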

Now that we understand what package metadata is, let us focus on the need for automatically parsing it. Parsing the metadata happens in two places: first while creating the distribution module, and second while installing the package. While creating the package, the distribution script needs access to the metadata so that it can be validated and bundled with the distribution file. The build and distribution scripts of a technology stack will generally have a configuration section that allows the developer to specify this. Most build utilities will autogenerate a template for adding this information. We will examine how some well-known distribution tools do this later in this post.

While installing a package, the installation script needs the package metadata to search for and find the right one. It then uses this information during installation and makes it available to the developer through specific methods. The script fetches all the unmet dependencies and installs them so that the package has the required working environment. If a dependency cannot be satisfied, it throws an error and fails gracefully. When developers start using the packages, they can use OS commands or language-specific constructs to query the metadata of the packages that are imported into the project.
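As one example of a language-specific construct, Python's standard library ships importlib.metadata (Python 3.8 and later), which reads the metadata that the installer stored alongside a package. The sketch below is a minimal illustration; the helper name describe_installed and the shape of the returned dictionary are our own choices.

```python
from importlib import metadata


def describe_installed(name):
    """Return a few metadata fields of an installed package, or None if it is not installed."""
    try:
        meta = metadata.metadata(name)
    except metadata.PackageNotFoundError:
        return None
    return {
        "name": meta.get("Name"),
        "version": meta.get("Version"),
        "summary": meta.get("Summary"),
        # requires() returns None when the package declares no dependencies.
        "requires": metadata.requires(name) or [],
    }
```

Calling describe_installed with the name of any package installed in the current environment returns its recorded name, version, summary, and declared dependencies.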

How package metadata is handled during distribution

Different languages and frameworks handle package metadata differently. Broadly every language requires you to have a file in a specific format with must-have attributes. The build and distribution systems verify the presence of these must-have attributes before distribution. Let’s examine how the package metadata is handled from the perspective of two popular languages - Python and R.
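The check for must-have attributes can be pictured roughly as follows. This is only a sketch: the function name and the REQUIRED tuple are hypothetical, and every real build system defines its own list of mandatory fields.

```python
def missing_attributes(metadata, required):
    """Return the required attribute names that are absent or empty in metadata."""
    missing = []
    for key in required:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical list of must-have attributes, chosen for illustration only.
REQUIRED = ("name", "version", "description")

draft = {"name": "example-pkg", "version": "", "author": "Jane Doe"}
problems = missing_attributes(draft, REQUIRED)
```

Here problems lists the version, which is empty, and the description, which is absent. A build tool would typically refuse to create the distribution until such a list is empty.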

Handling Metadata in Python Packages

In the case of Python, the recommended way of distributing a package is by using setuptools. It was originally built on top of distutils and bundles everything that the package manager requires to install and configure the package. Python uses pip as its package manager.

The most important part of the Python packaging process is the project configuration that includes metadata. This configuration is done by the setup.py script that exists in the root directory of the project you want to build and distribute as a package. The setup.py contains a setup function that takes the metadata attributes. The commonly used attributes are explained below.

- name

- version

- description

- author

- license

- classifiers: This section generally includes the keys that denote development maturity, intended audience, other details of license, etc.

- install_requires: Lists the packages that are mandatory for the package to function properly.

- python_requires: Specifies the Python versions on which the package can operate properly, most often as a minimum version.

- project_urls: A mapping of labels to URLs where you have provided package information.

- package_data: In case you need to bundle data files along with Python packages, specify it here.

- data_files: This is used in case data files external to the package are used by the package functions.
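Put together, the attributes above might look like the following sketch. Every value is a made-up placeholder, and the packages entry is an extra argument we added so that package_data has a package to refer to. The setup call is wrapped in a function here so that the snippet can be loaded without triggering a build.

```python
# setup.py (sketch). Every value below is a hypothetical placeholder.
SETUP_KWARGS = {
    "name": "example-pkg",
    "version": "0.1.0",
    "description": "A hypothetical example package",
    "author": "Jane Doe",
    "license": "MIT",
    "classifiers": [
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: MIT License",
    ],
    "install_requires": ["requests>=2.0"],
    "python_requires": ">=3.8",
    "project_urls": {"Documentation": "https://example.com/docs"},
    "packages": ["example_pkg"],
    "package_data": {"example_pkg": ["data/*.json"]},
}


def run_setup():
    """Hand the metadata to setuptools (imported lazily, so setuptools is only needed here)."""
    from setuptools import setup
    setup(**SETUP_KWARGS)
```

A real setup.py would normally call setup with these keyword arguments directly at module level, so that the build tool runs it when the file is executed.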

Besides setup.py, Python packaging also involves the following files.

- setup.cfg: An ini-style file that holds default options for the setup.py commands.

- README.md: The file contains information about the package and how to use it.

- MANIFEST.in: This is needed only if you need to package additional files that are not included automatically.

- LICENSE.txt: Full license text that specifies who is authorized to use the package.

Once all the files are set up, you can package it by using the below command.

python3 -m build --sdist

If you have an account, you can then upload the package to PyPI, for example with the twine utility.
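The source distribution produced by the build step carries its metadata in a file called PKG-INFO, which is a plain-text list of "Key: value" headers. Because this layout follows the style of email headers, Python's standard email parser can read it. The sketch below parses a hand-written sample; the sample contents and the helper name parse_core_metadata are assumptions made for illustration.

```python
from email.parser import Parser

# Hand-written sample in the style of a PKG-INFO file; the values are invented.
SAMPLE_PKG_INFO = """\
Metadata-Version: 2.1
Name: example-pkg
Version: 0.1.0
Summary: A hypothetical example package
Author: Jane Doe
License: MIT
Requires-Python: >=3.8
Requires-Dist: requests>=2.0
Requires-Dist: click
"""


def parse_core_metadata(text):
    """Parse header-style package metadata into a plain dictionary."""
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "summary": msg["Summary"],
        "requires_python": msg["Requires-Python"],
        # A header may repeat, so collect every Requires-Dist line.
        "requires_dist": msg.get_all("Requires-Dist") or [],
    }


info = parse_core_metadata(SAMPLE_PKG_INFO)
```

A repository or installer can run this kind of parsing on every uploaded file, which is what makes searching by name, version, or dependency possible without opening the package by hand.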

Handling metadata in R packages

R is a very popular programming language and software environment for statistical computing. R’s package repository is called the Comprehensive R Archive Network (CRAN). R provides a built-in function called install.packages to install packages from CRAN. For an R package, metadata is specified through a DESCRIPTION file. This file is the identity of every package and consists of mandatory attributes that help the package manager and the user to set up the package. To create a package, the usethis package, which is loaded along with devtools, provides a template function called create_package. When executed, this function creates a bare-bones DESCRIPTION file.

R’s description file consists of the following key attributes.

- Package: Name of the package

- Title: Usually a single line that describes the function of the package.

- Version

- License

- Author: This section can take a name, an email address, and a three-letter code denoting the role of the author. The role can be author (aut), creator (cre), contributor (ctb), or copyright holder (cph).

- Description: A paragraph description of what the package does.

- Imports: The list of other packages that are required for the package to function.

- Suggests: The list of packages that the package can take advantage of. This is different from the ‘Imports’ attribute in the sense that packages mentioned here are not a must-have for the normal operation of the package. The package installer will not install the packages in this list automatically.
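A DESCRIPTION file is plain text with one "Field: value" entry per attribute, where a long value continues on indented lines. To show how a tool outside R could parse it, here is a minimal Python sketch. The sample file and the helper names parse_description and split_packages are hypothetical, and the parser ignores details that a production implementation would need to handle.

```python
# Hand-written sample DESCRIPTION file; all values are invented for illustration.
SAMPLE_DESCRIPTION = """\
Package: examplepkg
Title: A Hypothetical Example Package
Version: 0.1.0
Authors@R: person("Jane", "Doe", email = "jane@example.com", role = c("aut", "cre"))
Description: Shows how the fields of a DESCRIPTION file
    can be read by a tool outside R.
License: MIT + file LICENSE
Imports: dplyr, rlang (>= 1.0.0)
Suggests: testthat
"""


def parse_description(text):
    """Parse 'Field: value' lines, joining indented continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and key is not None:
            # Indented line: continuation of the previous field.
            fields[key] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def split_packages(value):
    """Split a comma-separated dependency field into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


fields = parse_description(SAMPLE_DESCRIPTION)
imports = split_packages(fields["Imports"])
```

With the dependencies split out like this, a repository can tell the mandatory packages in Imports apart from the optional ones in Suggests.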

Similar to Python’s setuptools, R offers a package called devtools that provides options to set up all the above attributes. The devtools package will then generate the DESCRIPTION file and can generate a build file. This build file can then be submitted to CRAN, which parses the DESCRIPTION file automatically.

Conclusion

Automatically parsing package metadata is an important function of build tools and package repository managers. Without metadata, developers will not be able to find what they want. For repositories that cater to specific languages, this is easier to accomplish since the parsing logic can be tailor-made for specific package formats. Repository managers like Packagecloud handle packages from multiple languages and frameworks, so they have to parse metadata automatically from multiple package formats.

This ability is key in enabling support for multiple CI/CD utilities and package formats. If you are looking for a package repository manager that can handle all your package hosting needs securely, check out Packagecloud here.