Open Source Security: Improving license detection when generating SBOMs

I blogged last year about generating a Software Bill of Materials (SBOM) for an Apache Maven project using the cyclonedx-maven-plugin. Generating an SBOM at build time in this way is ideal, as you have access to an accurate dependency graph (from Maven in this case). Sometimes, however, you want to create an SBOM from a third-party binary artifact, such as a jar, zip or docker image. Anchore Syft is well suited to this purpose, but I found that it generated somewhat limited licensing information for jars. In this post I'll walk through a series of contributions I made to Syft towards the end of 2023 that significantly improved this. As an aside, I found the Syft community to be very helpful and responsive, so it was an enjoyable process!

Improvements Contributed

The initial Syft release I looked at (v0.92.0) only detected a license for a jar if the MANIFEST.MF contained in the jar had an OSGi Bundle-License header detailing the license used. As many projects don't support OSGi, relatively few licenses were detected. Here are the improvements I contributed and the versions they were released in:

v0.93.0: Added support to get the license if specified in a pom.xml included in the jar.
v0.94.0: Added support to read a license file in the root directory or in META-INF and added support for different common license filenames.
v0.95.0: Added case-insensitive matching of Java license file names, support for parsing multiple poms in a jar, and a lookup in Maven Central to find a license defined in a parent pom, recursing through further parent poms where needed.
v0.96.0: Also check Maven Central for licenses defined in parent poms for embedded dependencies.
v0.97.0: If the jar contains no pom.xml or pom.properties, fall back to using the Java metadata to find the correct artifact in Maven Central.
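
To make the different license sources listed above more concrete, here is a minimal Python sketch that inspects a jar for the same kinds of hints: a Bundle-License manifest header, embedded pom.xml files, and license files in the root directory or META-INF matched case-insensitively. This is not how Syft implements it (Syft is written in Go). The function name license_hints and the set of file name stems are assumptions made for illustration, and wrapped manifest continuation lines are ignored.

```python
import io
import re
import zipfile

# Assumed set of common license file stems; Syft's real list may differ.
LICENSE_STEMS = {"license", "licence", "copying"}


def license_hints(jar_bytes):
    """Collect license hints from a jar, grouped by where they were found."""
    hints = {"manifest": None, "pom_files": [], "license_files": []}
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            lower = name.lower()  # compare paths case-insensitively
            directory, _, filename = lower.rpartition("/")
            stem = filename.split(".")[0]
            if lower == "meta-inf/manifest.mf":
                text = jar.read(name).decode("utf-8", errors="replace")
                # Simplification: manifest continuation lines are not joined.
                match = re.search(r"^Bundle-License:\s*(.+)$", text, re.MULTILINE)
                if match:
                    hints["manifest"] = match.group(1).strip()
            elif lower.startswith("meta-inf/maven/") and filename == "pom.xml":
                hints["pom_files"].append(name)
            elif directory in ("", "meta-inf") and stem in LICENSE_STEMS:
                hints["license_files"].append(name)
    return hints
```

Calling license_hints on the bytes of a jar returns a dictionary showing which of the three sources, if any, could yield a license for that jar.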

An additional improvement (not by me) was made in a subsequent release (v0.103.1) to fix a bug with underscores in artifacts that resulted in licenses not being found.

Note that going to Maven Central to find poms with license information is not enabled by default. This is what I have in a local .syft.yaml:

java:
   maven-url: "https://repo1.maven.org/maven2"
   max-parent-recursive-depth: 8
   use-network: true

Testcase

As a test case, I chose Apache Spark, a project containing a large number of third-party (Java-based) dependencies, specifically the distribution spark-3.5.1-bin-hadoop3.tgz. Using Syft v0.92.0 as a starting point, I generated a cyclonedx-json SBOM via:

syft packages ./spark-3.5.1-bin-hadoop3.tgz -o cyclonedx-json > spark.json

(Note: newer versions of Syft use "scan" instead of "packages".)

Then I used jq to generate a CSV consisting of the dependencies found in the SBOM and their license detected, or "unknown-license" if no license was found:

jq -r '.components[] | .group + "/" + .name + ":" + .version + "," + try(.licenses[] | .license? | flatten | join(" ")) // .group + "/" + .name + ":" + .version + "," + .licenses?[]?.expression // .group + "/" + .name + ":" + .version + ",unknown-license"' spark.json
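
If the jq filter is hard to read, the same idea can be sketched in Python. This is a simplified approximation rather than an exact port of the filter above: for each component it emits one row per license entry, using the license object's string fields or a license expression, and falls back to "unknown-license" when neither is present. The function names license_rows and rows_from_file are my own.

```python
import json


def license_rows(sbom):
    """Return 'group/name:version,license' rows for a parsed CycloneDX SBOM."""
    rows = []
    for comp in sbom.get("components", []):
        coord = f"{comp.get('group') or ''}/{comp.get('name') or ''}:{comp.get('version') or ''}"
        found = []
        for entry in comp.get("licenses") or []:
            lic = entry.get("license")
            if isinstance(lic, dict):
                # Join the string fields of the license object (for example id or name).
                found.append(" ".join(v for v in lic.values() if isinstance(v, str)))
            elif entry.get("expression"):
                found.append(entry["expression"])
        rows.extend(f"{coord},{text}" for text in (found or ["unknown-license"]))
    return rows


def rows_from_file(path):
    """Load an SBOM such as spark.json from disk and build its rows."""
    with open(path, encoding="utf-8") as handle:
        return license_rows(json.load(handle))
```

Printing "\n".join(rows_from_file("spark.json")) gives a CSV in the same shape as the jq output.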

Results

For the Apache Spark distribution detailed above, these are the results:

Syft version	Dependencies detected	Unknown licenses	% licenses detected
v0.92.0	440	306	30.4%
v0.93.0	442	245	44.5%
v0.94.0	470	203	56.8%
v0.95.0	444	157	64.6%
v0.96.0	468	32	93.1%
v0.97.0	468	27	94.2%
v0.103.1	467	11	97.6%

Going from less than a third of dependencies getting their license detected correctly to almost 100% is pretty good! The remaining 11 dependencies don't contain any pom.xml or pom.properties or any other metadata that allow Syft to find the correct pom.xml in Maven Central. Possibly some improvements could be made in looking at the package names to try to find the correct path in Maven Central.
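
For anyone wanting to reproduce this kind of summary from the CSV, here is a small Python sketch of the arithmetic: the detection percentage is (dependencies - unknown) / dependencies * 100. It assumes each distinct group/name:version coordinate counts once and that a dependency is unknown only if none of its rows carries a license. The function name detection_rate is hypothetical, and this may not be exactly how the figures above were tallied.

```python
def detection_rate(rows):
    """Return (dependencies, unknown, percent detected) for CSV rows.

    Each row looks like 'group/name:version,license'. License text may itself
    contain commas, so only the first comma is treated as the separator.
    """
    licensed = set()
    seen = set()
    for row in rows:
        coord, _, license_text = row.partition(",")
        seen.add(coord)
        if license_text != "unknown-license":
            licensed.add(coord)
    total = len(seen)
    unknown = total - len(licensed)
    percent = 100.0 * (total - unknown) / total if total else 0.0
    return total, unknown, percent
```

The rows can come straight from the jq output, for example by reading its lines from a file and stripping the trailing newlines.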

Thursday, March 7, 2024