Fed up with managing your host OS for your Docker environment? Try booting your containers directly from a lightweight initramfs! Flash a USB pendrive with the kernel and initramfs, or netboot them locally or from the internet, and configure everything from the kernel command line. Bonus: it also supports syncing volumes with S3-compatible cloud storage, making provisioning and back-ups a breeze!
Containers have been an effective way to share reproducible environments for services, CI pipelines, or even user applications.
In the high-availability world, orchestration can then be used to run multiple instances of the same service. However, if your goal is to run these containers on your local machines, you would first need to provision them with an operating system capable of connecting to the internet and then downloading, extracting, and running the containers. This operating system would then need to be kept up to date across all your machines, which is error-prone and can lead to subtle differences in the run environment that may impact your services.
In order to lower this maintenance cost and improve the reproducibility of the run environment, it would be best if we could drop this operating system and directly boot the containers you want to run. With newer versions of podman, it is even painless to run systemd as the entrypoint, so why not create an initramfs that would perform the simple duty of connecting to the internet and downloading a “root” container which can be shared between all the machines? If the size could be kept reasonable, both the kernel and the initramfs could then be downloaded at boot time via iPXE, either locally via PXE or from the internet.
It is with this line of reasoning that we started working on a new project called boot2container, which receives its configuration via the kernel command line and constructs a pipeline of containers. Additionally, we added support for volumes, optionally synced with any S3-compatible cloud storage.
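To make this concrete, here is a minimal sketch of how such a configuration could be parsed out of the kernel command line. The parameter names (`b2c.container`, `b2c.volume`), the registry URL, and the mirror syntax are illustrative assumptions for this example, not necessarily boot2container’s actual interface:

```python
# Illustrative sketch: turning kernel command-line parameters into a
# pipeline of containers and a list of volumes, in the spirit of what an
# initramfs like boot2container does. The b2c.* parameter names used here
# are hypothetical, chosen for the example.
import shlex

def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into container and volume directives,
    preserving their order so containers can be run as a pipeline."""
    config = {"containers": [], "volumes": []}
    for token in shlex.split(cmdline):  # shlex honours the quoting
        key, _, value = token.partition("=")
        if key == "b2c.container":
            config["containers"].append(value)
        elif key == "b2c.volume":
            config["volumes"].append(value)
    return config

# Example: two containers run back to back, one volume mirrored to S3
cmdline = ('console=ttyS0 b2c.volume="results,mirror=s3://bucket/results" '
           'b2c.container="docker://registry.example.com/setup:latest" '
           'b2c.container="docker://registry.example.com/tests:latest"')
config = parse_cmdline(cmdline)
print(config["containers"])
print(config["volumes"])
```

Keeping the whole configuration on the kernel command line means the initramfs itself stays generic: the same kernel/initramfs pair can boot very different container pipelines simply by changing the script that assembles the command line.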
This project was then used in a bare-metal CI, both for the test machines and the gateways connecting them to the outside world. There, boot2container helps to provide the much-needed reproducibility of the test environment while also making it extremely easy to replicate this infrastructure in multiple locations to maximize availability.
The traditional approach of testing software on real hardware usually involves creating a rootfs which contains the test suites that need to be run, along with their run-time dependencies (network, mounting drives, time synchronization, …).
Maintaining one rootfs per test suite is a significant packaging burden, and it also prevents running multiple test suites back to back, which slows down testing. The alternative, a single shared rootfs, is not a clear win either: conflicting requirements between test suites make its creation harder, and a test suite may silently modify some configuration that impacts other test suites, potentially leading to test failures being mis-attributed.
Fortunately, Linux namespaces and OCI containers are becoming commonplace and can now be used to package our test suites along with their dependencies, without having to integrate them all in one image. Provided that you have a well-configured host kernel and OS, this enables running test suites in relative isolation, thus reducing the chances of interference between them. Finally, the packaging problem can be alleviated by having the test suites provide releases as containers, thus allowing re-use in many CI systems without modifications.
In this presentation, we will further present the benefits of containers, and introduce boot2container: A podman-powered initramfs that gets configured declaratively using the kernel command line, is deployable via PXE thanks to its small size (<20 MB), and that makes it easy to share files with/from the test machine via an S3-compatible object storage.
With Freedesktop’s move to Gitlab every project not only got access to a lot of machine time, but they also got all the infrastructure to automate their runs, inspect the results, and provide automated testing reports of merge requests. This has led to a lot of projects adopting it to reduce regressions and maintenance costs to the point of almost bankrupting Freedesktop.org! The only downside of the current testing infrastructure is that it is meant to run in the cloud, not on the GPUs we develop drivers for! Of course, some efforts are underway to make even the DRM subsystem testable in the cloud (VKMS) but if we are to prevent regressions through pre-merge testing, we need at some point to run on the real hardware!
Hardware-testing labs do exist, but they rarely seem to happen without a corporation to back them up, as only corporations have the resources to pay for the development of the system interfacing with the hardware, its hosting, and its maintenance. In order to be within the reach of hobbyist projects, we estimate the cost should be limited to $1k USD, one weekend of hardware set-up time, a couple of evenings of tweaking before reaching stability, and no more than an hour per week of maintenance after that. To reach this goal, we need to make deployment as easy as assembling plastic bricks, keep maintenance costs down through self-configuration and self-healing, and make running Gitlab CI jobs in the farm as easy as inheriting from a CI template and setting a couple of environment variables!
While we have not yet fully reached this lofty goal, we are already operating 3 farms in 3 locations with the above properties mostly implemented \o/ In this talk, we will present how easy it is to deploy a kernel and run containers in our farm, show what it takes to set up a test farm at home, and discuss what can be done to get hobbyist projects like Nouveau tested!
With the ever-increasing focus on testing found in our community, let’s try to coordinate the efforts of every individual.
The main focus for this workgroup will be two-fold:
Most GPUs now have open source drivers, and the trend is for all of them to be treated not as curiosities but as full-featured drivers providing an excellent user experience. To further push the open source philosophy, we need to look at the next frontier: Open Source Hardware.
While usual hardware development is prohibitively expensive, reconfigurable hardware (FPGA) is accessible to every hobbyist! This type of hardware has historically been very expensive and unable to provide the performance necessary for any sort of satisfactory user experience, but the cost has dropped dramatically in the past 20 years, and the rise of hardware blocks such as PCIe, DDR memory controllers, and ultra-fast transceivers has enabled the creation of open PCIe display controllers capable of reaching 4K and more for a reasonable amount of money.
Writing open source drivers for such hardware is however a little tricky since users will likely want to mix-and-match the different open source blocks to tailor the features to their liking, and even do this at run time!
In this talk, I will introduce the idea behind LiteDIP, my project of creating a library of discoverable IP blocks for FPGAs along with their Linux driver which would enable users to configure and deploy their own System on Chip in ~10 minutes.
There are many Linux kernel-testing projects, most of which are modeled on proven software-testing workflows. These workflows however often rely on a stable host platform and stable test results and, as such, don’t apply to testing development versions of Linux, where the filesystem, network, boot, or suspend might be unreliable.
The Intel GFX CI debuted in 2016 with a different workflow: providing curated pre-merge testing results in a timely manner for all patch series posted on the intel-gfx mailing list. The IGT tests get executed on Intel platforms spanning from 2004 to upcoming platforms. Known issues are automatically associated with bugs to focus the report on what the developer is trying to change, making the change easier to review.
After years of experimenting with and refining this workflow, the GFX CI team became confident that it was generic enough and went on to standardize interfaces between the different components in order to enable other drivers to reproduce the testing workflow and collaborate on the development of IGT and related tools.
An example of related tools comes from Google’s ChromeOS validation hardware (Chamelium), which acts as an open-hardware re-programmable screen with DP, HDMI, and VGA inputs. After initial work from Red Hat to support the Chamelium in IGT, Intel took on the project and has achieved a level of testing for DisplayPort and HDMI comparable to their official conformance test suites. This massively increases the level of testing achievable in an automated testing system, and not just for Intel, but for any GPU supporting DP and/or HDMI.
Finally, a new test suite for the KMS interface is being designed around VKMS in order to test how Xorg and Wayland compositors behave in the presence of GPU (un)hotplugging, bandwidth limitations for planes, DP link status issues, etc. This should further improve the reliability of userspace when it comes to hard-to-reproduce events, regardless of the GPU driver being used!
In this talk, I will compare the different Linux testing projects, and introduce the i915 CI workflow and tools, the open sourcing and standardization effort going on in i915-infra, the recent developments in IGT/Chamelium, and the plan to test Wayland compositors. Let’s work together on standardizing our testing, and on moving to a model where not only the i915 driver but all the drivers would be validated before every commit!
The Linux community is slowly moving towards better quality through automated testing to prevent regressions in the mainline and stable trees. However, Linux is full of HW-specific code, which makes validation of patches impossible for individual developers and leads to regressions. In this talk, we will explain how we solved these issues by taking inspiration from Linux’s development model, and how we extended it to the development of our test suite, CI infrastructure, and bug handling.
After 2 years of activity, this led Linus Torvalds to say that i915’s quality has greatly improved compared to other graphics drivers.
Linux’s development model has been described as being akin to a bazaar, where any developer can make changes to Linux as long as they strictly improve the state of Linux, without regressing any application that currently runs on it. This allows Linux users to update their kernels and benefit from the work of all developers, without having to fix anything in their applications when a new version comes out. Unfortunately, it is impossible for developers to try their changes on all the different hardware and userspace combinations being used in the wild.
Typically, a developer will mostly test the feature he/she is working on with the hardware at hand before submitting the patch for review. Once reviewed, the patch can land in a staging repository controlled by the maintainer of the subsystem the patch is changing. Validation of the staging tree is then performed ahead of sending these changes to Linus Torvalds (or one of his maintainers). Regressions caught at this point require bisecting the issue, which is time-consuming and usually done by a separate team, which may become a bottleneck. Sometimes maintainers let regressions through, hoping to be able to fix them during the -rc cycles.
To address this bottleneck, the developer should be responsible for validating the change completely. This leads to a virtuous cycle: not only can developers rework their patches until they do not break anything (saving other people’s time), but they also become more aware of the interactions their changes have with userspace, which improves their understanding of the driver and leads to better future patches.
To enable putting the full cost of integration on developers, validation needs to become 100% automated, have 100% code/HW coverage of the userspace usecases, and provide timely validation results to even the most time-pressured developers. To reach these really ambitious objectives, driver developers and validation engineers need to be considered as one team. The CI system developers need to provide a system capable of reaching the objectives, and driver developers need to develop a test suite capable of reaching the goal of having 100% code coverage of the whole driver on the CI system provided to them.
Finally, this increase in understanding of how validation is done allows developers to know if their patch series will be properly validated, which reduces the risk of letting regressions land in Linux.
The devil however lies in the details, so in this talk, we will explain how we are going from theory to practice, what our current status is, and what we are doing to get closer to our ambitious goal! We will describe the current developer workflow and demonstrate how we empowered developers by providing timely testing as a transparent service to anyone sending patches to our mailing lists.
A quick update on the consensus we reached at the Gfx Testing Workshop.
Brief update on the Nouveau project.
We want to talk about what we have done over the past year(s) and what we are planning to work on in the near future. The main topics will be power management, Nvidia, and the community.
In this talk, we highlight how important automated testing is to sustaining the upstream development model of Linux, especially when products are involved. We will then give you a tour of our system, how it integrates with the developers’ workflow, what we are working on, and how we managed to grow from a couple of thousand tests executed per week to over 4 million!
We will also present the changes we have done to the IGT test suite to become less Intel-specific and serve the needs of multiple drivers.
Upstream development requires never regressing the features that were already present in previous versions. We believe that this is not only the right thing to do, but also that it is increasing in relevance as more and more products move towards this model. To embrace this model, we try to catch unintentional regressions as early as possible through our CI system. This improves the productivity of our developers (fewer bugs coming months after the code was committed), and provides users with a smoother upgrade path (either through the product manufacturer or by upgrading the kernel themselves).
The Intel GFX CI has grown massively in the past 1.5 years: the number of machines doubled; the test coverage went from 260 tests/machine up to over 4k; the number of tests executed went from 100k/week to over 4M; and pre-merge testing time dropped to an average of 30 minutes despite the increased usage.
We will then walk the audience through our different services, how they integrate with the developers’ workflow, and how we manage to keep track of failures (and file them).
Finally, we will share our current developments, our goals, and what we are doing to the IGT test suite in order to make it useful for more drivers than Intel’s.
Linux has almost achieved world domination. However, most of the world is stuck on ancient releases, which is not only a security issue: it also fragments the development effort, as features and bug fixes need to be backported to ancient releases, and additional support written for these kernels is unlikely to be upstreamed.
At best, when making a new device, device vendors fork the upstream Linux kernel. At worst, they have to base their work on a non-upstream kernel, probably coming from the company providing the SoC being used, which is often already ancient, even before the device comes out. In both cases, they then add support for the missing features, do the device validation and ship the kernel (and its source) in their devices.
Some device vendors are nice enough to also upstream their changes, but this process is time-consuming and may require a re-design of the code. When the changes actually land, it is quite likely that they will get broken despite the strong non-regression rule of Linux. This is because of the limited number of users running the upstream kernel on their devices, which means developers are not aware that they are regressing some platforms. This pretty much means that device vendors have to re-do the bring-up of the device on a subsequent release, which makes the upstreaming investment close to useless.
In this presentation, I will motivate the testing of the kernel from a graphics perspective, present the current projects doing kernel testing, and propose ways of providing pre-merge and post-merge testing on vendor devices.
In the past year, the continuous integration of the i915 driver has been picking up and is massively improving the state of the driver by doing both pre- and post-merge testing.
During the summer, we made a big effort to publicly provide timely, stable, and actionable reports in order to improve developers’ involvement. Not only did it successfully change the way Intel developers work (more focus on bug fixing) and make the drm-tip tree more stable, but it also led to greater involvement in the development of the IGT test suite.
In this talk, I would like to show i915 developers what is currently available to them, what’s coming next, and inspire other teams working on other drivers to provide similar service through sharing our key learnings and tools.
An article about this presentation was written by LWN.
Our society relies more and more on smart devices to ease communication and improve efficiency. Smart devices are transforming both industries and personal lives. Smart and self-organising wide-area sensor networks are now used to increase the efficiency of farms, cities, supply chains, or power grids. Because they are always connected to the Internet, they can constantly and accurately monitor assets and help deliver what is required precisely when and where it is needed. The general public has also seen the transition to smart devices, with cell phones being switched to smartphones, TVs to smart TVs, and cars to semi-autonomous cars.
This “Internet of Things” (IoT) revolution is happening at a frantic pace as companies digitalize the physical world. Gartner estimated that there were 4.9 billion smart devices deployed in 2015, with this number expected to grow to 25 billion by 2020. With such high numbers, IoT devices have the potential to create significant amounts of waste, which may exceed their potential to reduce resource consumption thanks to their ability to keep the state of every asset of interest up to date. In this article, I discuss how smart devices’ software is an artificial cause that limits their lifetime. I then explain the need for an alternative model that decouples the software from the hardware, to allow the software to be changed according to its owner’s needs. Finally, I explain how the Open Source movement has already solved software’s planned obsolescence for personal computers and servers, and how this model also naturally applies to IoT devices.
As distributions decided to move to the modesetting driver, we started looking into its 2D performance and power efficiency.
The results were interesting: using the CPU to render cairo demos could be up to 10 times faster than using X, or 3 times more power efficient. More importantly, most of the tested cases were comparable in performance with the CPU backend.
For the tested cases, the 2D acceleration primitives provided by the DDX also failed to reduce CPU usage, because the overhead of feeding commands to the GPU is not amortised by having a huge number of primitives to render. This explains the resulting poor GPU utilisation, which in turn prevents power-gating the GPU.
In light of these issues, and given that Qt mostly uses its own rasterizer, that GTK is moving towards its own GL-based 2D acceleration, and that Wayland applications cannot rely on X to provide 2D acceleration, client-side rendering libraries should now be recommended to application developers, and we should make sure that the performance of current applications stays good enough.
This talk will present the results and hopes to spark discussions about this widely-ignored topic.
Last year at XDC, I announced the EzBench project that was aiming at making running benchmarks as easy as possible.
Fast forward a year, and EzBench has grown quite a lot. It became an official Freedesktop project and now provides automated bisecting of performance, unit-test, or rendering changes. It now aims at being a fully-automated CI system that performs all the low-level tasks automatically and generates trustworthy reports that are directly actionable by developers. More importantly, the system is usable directly on developers’ machines, without requiring external servers.
Developers can then take this report and use it to reproduce the issue on their machine and get the differences between their machine and the reporter’s.
In this talk, I will present the architecture of the project and the reasons for it along with how to use it and how it can be set up to provide continuous integration, in conjunction with systems like Jenkins.
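As a rough illustration of the automated bisecting mentioned above, the sketch below bisects a performance regression over a commit range. Unlike a boolean `git bisect`, each step compares averaged benchmark samples against a baseline. The commit list, the benchmark function, and the threshold are all made up for this example; the real EzBench builds and runs each commit and deals with measurement noise far more carefully:

```python
# Sketch of performance bisection: find the first commit in an ordered
# range whose averaged benchmark score drops below a baseline. All names
# and numbers here are illustrative, not taken from EzBench itself.
from statistics import mean

def run_benchmark(commit, samples=5):
    # Stand-in for deploying `commit` and running a real benchmark: we
    # simulate a 20% regression introduced at commit index 6.
    return [100.0 if commit < 6 else 80.0 for _ in range(samples)]

def bisect_perf(commits, threshold=0.05):
    """Return the first commit whose mean score drops by more than
    `threshold` relative to the oldest commit. Assumes the oldest commit
    in the range is fast and the newest is slow."""
    baseline = mean(run_benchmark(commits[0]))
    lo, hi = 0, len(commits) - 1  # invariant: lo is fast, hi is slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        score = mean(run_benchmark(commits[mid]))
        if (baseline - score) / baseline > threshold:
            hi = mid  # regression already present at mid
        else:
            lo = mid  # still fast at mid
    return commits[hi]

commits = list(range(10))  # hypothetical commit IDs, oldest first
print(bisect_perf(commits))
```

Averaging several samples per step is what makes such a bisection usable in practice: a single noisy run can easily cross the threshold and send the bisection down the wrong half of the range.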
In this talk, we will give you a status update of the Nouveau driver since XDC 2015. We will talk about the following main topics: kernelspace changes, OpenGL 4.3, SPIR-V and OpenCL. We will sum up with what we plan to do in the future.
At XDC2015, I presented some of the pitfalls of benchmarking Graphics applications and announced the Open Source EzBench project which aims at automating data collection and bisecting performance changes while avoiding those pitfalls.
This presentation will recap some of the nasty issues with benchmarking and present EzBench along with how to use it.
Tracking the performance of our complex graphics stack is a necessity to avoid unintentional regressions in the performance of the most important games and benchmarks. Such regressions lead to unhappy open source gaming enthusiasts and to wasted developer time, as developers need to track down performance regressions sometimes months after they were introduced. Fixing such performance issues may also be a challenge, as the code that introduced them may have become a dependency for newer features.
The need for performance tracking is becoming more and more critical as the complexity of our open source drivers increases to reach a performance comparable to their closed-source equivalents. This increased complexity makes it more and more likely for commits to accidentally break the performance of some benchmarks/games on some platforms that the developer may not have convenient access to.
In an effort to detect performance regressions before they even hit mainline, automation should be increased. However, when it can take up to an hour to test the performance of one commit on one benchmark, it becomes clear that we will never have the necessary hardware to be able to test all the commits found on the mailing lists and we will have to be smarter than this.
In this presentation, I will describe the different challenges found in benchmarking, some surprising results, some tricks to reduce the variance between runs, and my current plan for improving our performance QA by automatically tracking performance, bisecting performance changes, and letting everyone know about them by automatically replying on the mailing list.
This presentation was also covered on LWN.