Thursday, March 7, 2024

Improving license detection when generating SBOMs

I blogged last year about generating a Software Bill of Materials (SBOM) for an Apache Maven project using the cyclonedx-maven-plugin. Generating an SBOM at build time in this way is ideal, as you have access to an accurate dependency graph (from Maven in this case). However, sometimes you want to create an SBOM from a third-party binary artifact, such as a jar, zip or Docker image. Anchore Syft is well suited to this purpose, but I found that it generated somewhat limited licensing information for jars. In this post I'll examine a series of contributions I made to Syft towards the end of 2023 that significantly improved this. As an aside, I found the Syft community to be very helpful and responsive, so it was an enjoyable process!

Improvements Contributed

The initial Syft release I looked at (Syft v0.92.0) only detected a license for a jar if the MANIFEST.MF contained in the jar had an OSGi Bundle-License header detailing the license used. As many projects don't support OSGi, this meant that relatively few licenses were detected. Here are the improvements I contributed and the versions they were released in:

An additional improvement (not by me) was made in a subsequent release (v0.103.1) to fix a bug with underscores in artifacts that resulted in licenses not being found.
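To make the starting point concrete, here is a minimal Python sketch of the manifest-only detection described above: open the jar, read META-INF/MANIFEST.MF and look for a Bundle-License header. This is an illustration only and not Syft's code (Syft is written in Go); the function names parse_manifest and bundle_license are hypothetical.

```python
from __future__ import annotations

import zipfile


def parse_manifest(text: str) -> dict[str, str]:
    """Parse the main section of a MANIFEST.MF into a dict of headers."""
    headers: dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the main section
        if line.startswith(" ") and last_key is not None:
            # Continuation line: long values wrap onto lines starting with a space.
            headers[last_key] += line[1:]
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            headers[last_key] = value.lstrip(" ")
    return headers


def bundle_license(jar) -> str | None:
    """Return the Bundle-License header of a jar (path or file object), or None."""
    with zipfile.ZipFile(jar) as zf:
        try:
            raw = zf.read("META-INF/MANIFEST.MF")
        except KeyError:
            return None  # jar has no manifest at all
    manifest = parse_manifest(raw.decode("utf-8", errors="replace"))
    return manifest.get("Bundle-License")
```

A jar built without OSGi metadata simply returns None here, which is why so few licenses were found with this approach alone.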

One point to note is that going to Maven Central to find poms with license information is not enabled by default. This is what I have in a local .syft.yaml:

java:
   maven-url: "https://repo1.maven.org/maven2"
   max-parent-recursive-depth: 8
   use-network: true
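To show roughly what the network lookup involves, here is a small Python sketch that builds the standard Maven repository URL for a pom and reads the license names and parent coordinates out of it. It is an assumption-laden illustration, not Syft's implementation: the helper names (pom_url, pom_licenses, parent_coordinates) are hypothetical, and the sketch only parses XML that has already been downloaded. The parent lookup hints at why a setting like max-parent-recursive-depth matters: many poms declare no license themselves and inherit it from a parent pom.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

MAVEN_URL = "https://repo1.maven.org/maven2"  # same base URL as in the config above


def pom_url(group_id: str, artifact_id: str, version: str, base: str = MAVEN_URL) -> str:
    """Build the standard Maven repository path for an artifact's pom."""
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.pom"


def _local(tag: str) -> str:
    """Strip the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def pom_licenses(pom_xml: str) -> list[str]:
    """Return the license names declared directly in the pom's <licenses> element."""
    names = []
    for child in ET.fromstring(pom_xml):
        if _local(child.tag) != "licenses":
            continue
        for lic in child:
            for field in lic:
                if _local(field.tag) == "name" and field.text:
                    names.append(field.text.strip())
    return names


def parent_coordinates(pom_xml: str) -> tuple | None:
    """Return (groupId, artifactId, version) of the <parent> element, or None."""
    for child in ET.fromstring(pom_xml):
        if _local(child.tag) == "parent":
            fields = {_local(f.tag): (f.text or "").strip() for f in child}
            return fields.get("groupId"), fields.get("artifactId"), fields.get("version")
    return None
```

When pom_licenses returns an empty list, a resolver would fetch the pom at pom_url(*parent_coordinates(...)) and try again, stopping once the configured depth is reached.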

Testcase

As a test-case, I chose Apache Spark as an example of a project containing a large number of third-party (Java-based) dependencies, specifically the distribution spark-3.5.1-bin-hadoop3.tgz. Using Syft v0.92.0 as a starting point, I generated a cyclonedx-json SBOM using Syft via:

  • syft packages ./spark-3.5.1-bin-hadoop3.tgz -o cyclonedx-json > spark.json (note: newer versions of Syft use "scan" instead of "packages")

Then I used jq to generate a CSV consisting of the dependencies found in the SBOM and their license detected, or "unknown-license" if no license was found:

  • jq -r '.components[] | .group + "/" + .name + ":" + .version + "," + try(.licenses[] | .license? | flatten | join(" ")) // .group + "/" + .name + ":" + .version + "," + .licenses?[]?.expression // .group + "/" + .name + ":" + .version +  ",unknown-license"' spark.json
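For readers who find the jq expression hard to follow, the same idea can be sketched in Python. This is a rough equivalent rather than an exact port: it assumes the usual CycloneDX JSON shape, where each entry in a component's licenses array is either a license object (with an id or a name) or an expression, and it always emits exactly one row per component. The function names are hypothetical.

```python
import json
import sys


def component_id(component: dict) -> str:
    """Format a component as group/name:version (group may be absent)."""
    group = component.get("group") or ""
    name = component.get("name") or ""
    version = component.get("version") or ""
    return f"{group}/{name}:{version}"


def license_of(component: dict) -> str:
    """Join all detected licenses with spaces, or return 'unknown-license'."""
    found = []
    for entry in component.get("licenses") or []:
        lic = entry.get("license")
        if isinstance(lic, dict):
            value = lic.get("id") or lic.get("name")
            if value:
                found.append(value)
        elif entry.get("expression"):
            found.append(entry["expression"])
    return " ".join(found) if found else "unknown-license"


def sbom_to_csv_rows(sbom: dict) -> list:
    """Return one 'group/name:version,license' row per SBOM component."""
    return [f"{component_id(c)},{license_of(c)}" for c in sbom.get("components", [])]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sbom_licenses.py spark.json
    with open(sys.argv[1], encoding="utf-8") as f:
        print("\n".join(sbom_to_csv_rows(json.load(f))))
```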

Results

For the Apache Spark distribution detailed above, these are the results:

Syft version | Dependencies detected | Unknown licenses | % licenses detected
v0.92.0      | 440                   | 306              | 30.4%
v0.93.0      | 442                   | 245              | 44.5%
v0.94.0      | 470                   | 203              | 56.8%
v0.95.0      | 444                   | 157              | 64.6%
v0.96.0      | 468                   | 32               | 93.1%
v0.97.0      | 468                   | 27               | 94.2%
v0.103.1     | 467                   | 11               | 97.4%
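The figures in each row can be derived from the CSV produced by the jq command: count the rows, count those ending in "unknown-license", and take the remainder as a percentage. A minimal Python sketch of that arithmetic follows; the function name license_coverage is hypothetical, and it assumes one CSV row per dependency.

```python
def license_coverage(csv_rows):
    """Return (total, unknown, percent_detected) for 'id,license' CSV rows."""
    rows = [row.strip() for row in csv_rows if row.strip()]
    total = len(rows)
    unknown = sum(1 for row in rows if row.endswith(",unknown-license"))
    # Guard against an empty SBOM to avoid dividing by zero.
    percent = 100.0 * (total - unknown) / total if total else 0.0
    return total, unknown, percent


if __name__ == "__main__":
    sample = [
        "org.example/a:1.0,Apache-2.0",
        "org.example/b:2.0,MIT",
        "/c:3.0,unknown-license",
        "org.example/d:4.0,BSD-3-Clause",
    ]
    total, unknown, percent = license_coverage(sample)
    print(f"{total} dependencies, {unknown} unknown, {percent:.1f}% detected")
```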

Going from less than a third of dependencies getting their license detected correctly to almost 100% is pretty good! The remaining 11 dependencies don't contain a pom.xml, a pom.properties or any other metadata that allows Syft to find the correct pom.xml in Maven Central. Some further improvement might be possible by looking at the package names to try to find the correct path in Maven Central.