A comparison between different dependency management tools for Python projects
Introduction
Dependency management is an important aspect of any programming project that requires us to use functionalities in external libraries and packages (libraries and packages that do not come with our programming environment by default). Developing complex applications with the Python programming language is one such case. There are various tools out there that manage dependencies for Python projects, such as Pip, Conda and even Poetry. While it is understandable that different people have different preferences for dependency management tools, I would like to share more about why you might be better off using Poetry instead of Pip or Conda. I only include Pip, Conda and Poetry in this article as these are the tools that I have personally used.
Starters
Before we go further, I think it would be good to understand a bit more about the different dependency management tools that we will be comparing.
- Conda: Conda is a dependency management tool that comes with Anaconda. Anaconda is typically used by beginners in data science who are starting out in Python programming and do not want to worry too much about installing the common dependencies needed for data science work, such as numpy, pandas, jupyter and scikit-learn. More information about Conda can be found in its documentation, while Anaconda has its own homepage too.
- Pip: Pip is a dependency management tool that comes together with the standard Python installation for Windows, and can be installed via Homebrew for macOS or the distribution's package manager on Linux systems (e.g. apt for Debian and Ubuntu). This article from Real Python gives a pretty good description of how one could get started with Pip.
- Poetry: Poetry is a newer dependency management tool that is gaining visibility and popularity among Python users. Its use of pyproject.toml and poetry.lock files makes it similar to the way the Node Package Manager (npm) for Node.js works. More information about Poetry can be found in its documentation.
The Fun Part
Now this is where I try to convince you that Poetry is the best choice out of the 3 dependency management tools I described earlier.
Ditch Conda totally if you plan to do Python programming for production
I know, “ditch xx totally” is a pretty strong phrase right there, but I have my reasons for saying this.
Pre-installed packages in Anaconda
If we ran the command conda list in the base environment, we might see something like this in the terminal:
(base) user:~$ conda list
# packages in environment at /home/user/anaconda3:
#
# Name Version Build Channel
_ipyw_jlab_nb_ext_conf 0.1.0 py39h06a4308_0
_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
alabaster 0.7.12 pyhd3eb1b0_0
...
jupyter 1.0.0 py39h06a4308_7
...
numpy 1.20.3 py39hf144106_0
...
pandas 1.3.4 py39h8c16a72_0
...
scikit-learn 0.24.2 py39ha9443f7_0
...
This is useful for a beginner who has just started learning how to use Python to do data science related work, and might not want to spend too much time trying to figure out how to get the dependencies they need. However, when it comes to an actual project in production, we typically do not want to have so many packages that we do not actually need. Having all these unnecessary packages will only take up extra disk space and memory, which could be better used for something more important. Of course, there is also Miniconda, which uses Conda underneath too without all these additional packages, but…
Unnecessary packages added during installation
For example, if we ran the command conda install -c conda-forge numpy==1.22.3, we would see the following in the terminal:
The following NEW packages will be INSTALLED:
_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
bzip2 conda-forge/linux-64::bzip2-1.0.8-h7f98852_4
ca-certificates conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.36.1-hea4e1c9_2
libblas conda-forge/linux-64::libblas-3.9.0-14_linux64_openblas
libcblas conda-forge/linux-64::libcblas-3.9.0-14_linux64_openblas
libffi conda-forge/linux-64::libffi-3.4.2-h7f98852_5
libgcc-ng conda-forge/linux-64::libgcc-ng-11.2.0-h1d223b6_15
libgfortran-ng conda-forge/linux-64::libgfortran-ng-11.2.0-h69a702a_15
libgfortran5 conda-forge/linux-64::libgfortran5-11.2.0-h5c6108e_15
libgomp conda-forge/linux-64::libgomp-11.2.0-h1d223b6_15
liblapack conda-forge/linux-64::liblapack-3.9.0-14_linux64_openblas
libnsl conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
libopenblas conda-forge/linux-64::libopenblas-0.3.20-pthreads_h78a6416_0
libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-11.2.0-he4da1e4_15
libuuid conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
libzlib conda-forge/linux-64::libzlib-1.2.11-h166bdaf_1014
ncurses conda-forge/linux-64::ncurses-6.3-h27087fc_1
numpy conda-forge/linux-64::numpy-1.22.3-py310h45f3432_2
openssl conda-forge/linux-64::openssl-3.0.2-h166bdaf_1
pip conda-forge/noarch::pip-22.0.4-pyhd8ed1ab_0
python conda-forge/linux-64::python-3.10.4-h2660328_0_cpython
python_abi conda-forge/linux-64::python_abi-3.10-2_cp310
readline conda-forge/linux-64::readline-8.1-h46c0cb4_0
setuptools conda-forge/linux-64::setuptools-62.1.0-py310hff52083_0
sqlite conda-forge/linux-64::sqlite-3.38.2-h4ff8645_0
tk conda-forge/linux-64::tk-8.6.12-h27826a3_0
tzdata conda-forge/noarch::tzdata-2022a-h191b570_0
wheel conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
xz conda-forge/linux-64::xz-5.2.5-h516909a_1
zlib conda-forge/linux-64::zlib-1.2.11-h166bdaf_1014
I am pretty sure that libraries like ca-certificates and openssl are not actually needed by numpy, but let us see what we get when we try to install numpy in a fresh virtual environment using Pip instead. I will also run the pip list command to see what packages get installed together with numpy:
user:~$ pip install numpy==1.22.3
Collecting numpy==1.22.3
Downloading numpy-1.22.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
|████████████████████████████████| 16.8 MB 10.2 MB/s
Installing collected packages: numpy
Successfully installed numpy-1.22.3
user:~$ pip list
Package Version
---------- -------
numpy 1.22.3
pip 21.2.2
setuptools 57.4.0
wheel 0.36.2
Wow, that is nowhere near the number of additional packages Conda attempted to install when we tried to install numpy. This dependency bloat is one big turn-off for me when it comes to dependency management in production code.
Weird behavior when displaying installed dependencies
For example, if I installed numpy via Pip in a fresh Conda virtual environment, I would see something like this:
(test-env) user:~$ conda list
# packages in environment at /home/user/anaconda3/envs/test-env:
#
# Name Version Build Channel
(test-env) user:~$ pip list
Package Version
---------- -------
numpy 1.22.3
pip 21.2.2
setuptools 57.4.0
wheel 0.36.2
Hmm, strange. numpy does not show up when we run the conda list command, but does show up when we run the pip list command.
But that is not all. If I now install pandas via conda-forge, we will see something like this when we run the conda list and pip list commands:
(test-env) user:~$ conda list
# packages in environment at /home/user/anaconda3/envs/test-env:
#
# Name Version Build Channel
...
numpy 1.22.3 pypi_0 pypi
...
pandas 1.4.2 pypi_0 pypi
...
(test-env) user:~$ pip list
Package Version
--------------- -------
numpy 1.22.3
pandas 1.4.2
...
It seems that packages installed via Pip only show up in the output of the conda list command once some other package has been installed via conda-forge. This can be rather confusing for users, as:
- We need to use two different commands to check all of our dependencies
- We might miss some dependencies if we only use conda list to check for them
I hope these 3 points are sufficient to help you understand why we should ditch Conda completely if we plan to do serious work on production code. Having gotten that out of the way, we can now focus on the comparison between Poetry and Pip.
Why not Pip then?
I understand that there are people who prefer using Pip for various reasons. For instance, Pip has been around longer than Poetry, and people tend to stick with what is more familiar to them. Here, I would like to share some observations of mine that will hopefully convince you that Poetry is the better choice.
Better handling of dependency conflicts
As an example, let us see what happens when we use Pip to install a different version of numpy in a virtual environment that has pandas installed:
(test-env) user:~$ pip list
Package Version
--------------- -------
numpy 1.22.3
pandas 1.4.2
pip 21.1.1
python-dateutil 2.8.2
pytz 2022.1
setuptools 56.0.0
six 1.16.0
(test-env) user:~$ pip install "numpy<1.18.5"
Collecting numpy<1.18.5
Downloading numpy-1.18.4-cp38-cp38-manylinux1_x86_64.whl (20.7 MB)
|████████████████████████████████| 20.7 MB 10.9 MB/s
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.22.3
Uninstalling numpy-1.22.3:
Successfully uninstalled numpy-1.22.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas 1.4.2 requires numpy>=1.18.5; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10", but you have numpy 1.18.4 which is incompatible.
Successfully installed numpy-1.18.4
(test-env) user:~$ pip list
Package Version
--------------- -------
numpy 1.18.4
pandas 1.4.2
pip 21.1.1
python-dateutil 2.8.2
pytz 2022.1
setuptools 56.0.0
six 1.16.0
Pip complains that the version of numpy to be installed conflicts with the dependency requirements specified by the pandas library, but goes ahead and installs that version of numpy anyway. This could cause bugs to surface at runtime, which is definitely not what we want.
Let us now see what happens if we try to do something similar with Poetry:
user:~$ poetry show
numpy 1.22.3 NumPy is the fundamental package for array computing with Python.
pandas 1.4.2 Powerful data structures for data analysis, time series, and statistics
python-dateutil 2.8.2 Extensions to the standard Python datetime module
pytz 2022.1 World timezone definitions, modern and historical
six 1.16.0 Python 2 and 3 compatibility utilities
user:~$ poetry add "numpy<1.18.5"
Updating dependencies
Resolving dependencies... (53.1s)
SolverProblemError
Because pandas (1.4.2) depends on numpy (>=1.18.5)
and no versions of pandas match >1.4.2,<2.0.0, pandas (>=1.4.2,<2.0.0) requires numpy (>=1.18.5).
So, because dependency-manager-test depends on both pandas (^1.4.2) and numpy (<1.18.5), version solving failed.
at ~/.local/share/pypoetry/venv/lib/python3.8/site-packages/poetry/puzzle/solver.py:241 in _solve
237│ packages = result.packages
238│ except OverrideNeeded as e:
239│ return self.solve_in_compatibility_mode(e.overrides, use_latest=use_latest)
240│ except SolveFailure as e:
→ 241│ raise SolverProblemError(e)
242│
243│ results = dict(
244│ depth_first_search(
245│ PackageNode(self._package, packages), aggregate_package_nodes
user:~$ poetry show
numpy 1.22.3 NumPy is the fundamental package for array computing with Python.
pandas 1.4.2 Powerful data structures for data analysis, time series, and statistics
python-dateutil 2.8.2 Extensions to the standard Python datetime module
pytz 2022.1 World timezone definitions, modern and historical
six 1.16.0 Python 2 and 3 compatibility utilities
Before installing or updating any libraries, Poetry checks the dependency requirements of all the libraries already installed, and any dependency conflict it discovers stops the installation process. Although this can mean a bit more initial effort to resolve conflicting versions of libraries, it also ensures that there are no dependency conflicts within the project that could lead to bugs later on.
Easier to organize dependencies for development and production
It is not unusual to have dependencies used in development but not in production. For example, we might use libraries like black or isort to reformat our code and make it more readable. Or we might have libraries like pytest that we use for unit testing. Typically these libraries are not used in production, so we do not want to have them installed during the production runtime.
If we are to organize our dependencies in a way that suits Pip, we will need two different requirements files to record our development and production dependencies, which may look something like this:
# requirements.txt
numpy
pandas
# requirements-dev.txt
-r requirements.txt
black
isort
pytest
We would then use the following commands to install our dependencies:
# Installing only production dependencies
(test-env) user:~$ pip install -r requirements.txt
# Installing both development and production dependencies
(test-env) user:~$ pip install -r requirements-dev.txt
This setup makes it tedious to manage dependencies, as we need to keep track of them in two different files. We might even accidentally list the same library in both files without noticing it.
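Since nothing enforces consistency between the two files, accidental duplicates are easy to miss. A quick sanity check might look like the following sketch (the file contents here are hypothetical, and the parsing assumes simple "name" or "name==version" lines):

```python
# Sketch: flag packages listed in both a production and a development
# requirements file. Real requirements files can also contain options,
# environment markers and other syntax this simple parser ignores.
def package_names(requirements_text):
    """Extract lowercase package names, skipping comments and -r includes."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "-r")):
            continue
        names.add(line.split("==")[0].strip().lower())
    return names

prod = """numpy
pandas
"""
dev = """-r requirements.txt
black
isort
pandas
"""

duplicates = package_names(prod) & package_names(dev)
print(duplicates)  # {'pandas'} -> pandas is accidentally listed twice
```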
Poetry, however, makes it much easier to organize dependencies for development and production. The example below shows how easy it is to do this in the pyproject.toml file:
...
[tool.poetry.dependencies]
numpy = "^1.22.3"
pandas = "^1.4.2"
[tool.poetry.dev-dependencies]
black = "^21.7b0"
isort = "^5.9.3"
pytest = "^6.0"
...
You can see how easy it is to tell from pyproject.toml that the production libraries we want are numpy and pandas, and the ones for development only are black, isort and pytest. We could also use the following commands to install the libraries:
# Installing only production dependencies
poetry install --no-dev
# Installing both development and production dependencies
poetry install
We can also specify whether a library should be installed as a development dependency when using the poetry add command, as shown:
# Install production dependency
poetry add numpy
# Install development dependency
poetry add pytest --dev
Reproducible library installations
Imagine this scenario. You installed pandas via Pip using the requirements.txt file shown below (using the command pip install -r requirements.txt):
pandas==1.4.2
When you first run the pip install command, you notice that version 1.20.0 of numpy was installed together with pandas. However, when you run the pip install command again half a year later, you find that the version of numpy installed has changed to 1.22.3, even though you are using the same requirements.txt file. This could potentially cause dependency conflicts if your project contains other dependencies that use numpy too (e.g. scikit-learn, tensorflow).
We could of course use the command pip freeze > requirements.txt to persist the metadata of installed dependencies (i.e. package names and version numbers) to the requirements.txt file, but this can get rather tedious as we start to use more dependencies in our projects. Also, since Pip does not handle dependency conflicts that well (as mentioned earlier), we might end up persisting dependencies that conflict with one another.
With Poetry, we have the poetry.lock file, which stores only the metadata of dependencies that do not conflict with one another. A poetry.lock file looks something like this:
[[package]]
name = "numpy"
version = "1.22.3"
description = "NumPy is the fundamental package for array computing with Python."
category = "main"
optional = false
python-versions = ">=3.8"
[[package]]
name = "pandas"
version = "1.4.2"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
python-versions = ">=3.8"
[package.dependencies]
numpy = [
{version = ">=1.18.5", markers = "platform_machine != \"aarch64\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
{version = ">=1.19.2", markers = "platform_machine == \"aarch64\" and python_version < \"3.10\""},
{version = ">=1.20.0", markers = "platform_machine == \"arm64\" and python_version < \"3.10\""},
{version = ">=1.21.0", markers = "python_version >= \"3.10\""},
]
python-dateutil = ">=2.8.1"
pytz = ">=2020.1"
[package.extras]
test = ["hypothesis (>=5.5.3)", "pytest (>=6.0)", "pytest-xdist (>=1.31)"]
...
The poetry.lock file is created automatically when we run poetry install for the first time. It is also updated automatically whenever we run poetry add to install new dependencies, poetry update to update dependency versions, or poetry lock to re-resolve the dependencies listed in pyproject.toml. With the poetry.lock file, we can be sure that we are always installing the same versions of libraries whenever we run the poetry install command.
Better support for installing libraries from private repositories
In some projects, we might need to install libraries that are published only in private repositories (repositories other than PyPI that store Python libraries). With Pip, we could use the --index-url or --extra-index-url option of the pip install command to specify the URL of the private repository. The command would look something like this:
pip install --index-url url-of-private-repo library-name
This would work fine if we were installing our Python libraries one by one, but it could cause some issues if we were installing dependencies using the requirements.txt file, and we will see why in a moment.
If we used the pip freeze command to persist the installed dependencies, we would see something like this in the requirements.txt file:
library-1==0.1.0
library-2==1.1.0
...
If library-1 only exists in a private repository, we would get an error like this if we tried to install the libraries without specifying the --index-url option:
(test-env) user:~$ pip install -r requirements.txt
ERROR: Could not find a version that satisfies the requirement library-1==0.1.0 (from versions: none)
ERROR: No matching distribution found for library-1==0.1.0
We will need to add the --index-url option to the command above to install library-1 from the private repository. Now, to complicate the story further, let us say that version 1.1.0 of library-2 exists on PyPI but not in the private repository (which is entirely possible if the private repository is not kept up to date with the latest libraries from PyPI). If we tried to run the command pip install -r requirements.txt --index-url url-of-private-repo, we would get the same error message as before. This happens because Pip searches for and installs libraries from only the private repository when the --index-url option is used, and from only PyPI when it is not. This can be quite troublesome, as we would need to check which library versions are actually available in the private repository whenever we decide to install libraries from there.
A better approach is to install libraries from both PyPI and the private repository, depending on where each library can be found. Pip's --extra-index-url option does let it search multiple indexes, but Pip treats all of them as equal candidates with no defined priority, which can cause the wrong package to be picked up (and is a known vector for dependency confusion attacks). Poetry, on the other hand, lets us state the priority explicitly. We could specify the following configuration in the pyproject.toml file to tell Poetry to search both PyPI and the private repository:
[[tool.poetry.source]]
name = "name-of-private-repo"
url = "url-of-private-repo"
secondary = true
The secondary parameter basically tells Poetry to search for and install libraries from PyPI first, and only fall back to the private repository for libraries that cannot be found on PyPI. This gives us the best of both worlds: we can tap the wide range of libraries on PyPI while also getting access to niche libraries that might only be available in a private repository.
Conclusion
With all the details I have mentioned above, I hope it is clear why I think Poetry is the better choice compared to Pip or Conda for production code. I encourage you to try out Poetry if you have not used it before, and I would love to hear what you think in the comments. Happy programming 😃