Defining the barrier for Open Source AI components

Have you ever wondered about the fate of your Open Source AI software when it interfaces with other language models, inference code, and supporting libraries and tools? What if these components lack an Open Source AI-compliant license? Which connections can you safely establish without compromising code, content, data, models, or configurations?

As these questions swirl in my mind, I find myself searching for answers in the form of a concept. Is there a term within the realm of AI and data science that captures this essence? It seems to live between the lines of the draft Open Source AI Definition (OSAID), touching a crucial aspect of how its components relate to one another.

This concept is analogous to the blood-brain barrier, and it exists between components in AI systems. By naming this barrier and its various permutations, we make its presence visible. Recognizing its existence enables us to examine its permeability and discern what travels across it.

The significance of this barrier lies in its role in managing critical concerns: security, data privacy, personal privacy, and intellectual property (IP) rights. Could this same barrier between AI components serve as a guardian for each of these areas?

The necessity of naming this barrier revolves around IP rights, intersecting with the definition of Open Source AI that the OSI is refining this year. While we’re somewhat accustomed to this with Open Source software, Open Source AI introduces a complexity of a different magnitude.

If you thought the transition to Open Source companies operating as SaaS businesses was monumental, prepare for the intricacies awaiting in Open Source AI.

For Open Source AI, the permeability test for this barrier determines whether it leaks IP rights and what the repercussions are. When this barrier is properly permeable, components can integrate seamlessly without concerns about sharing or creating derivative works. This matters for people at both ends of the licensing spectrum, whether or not they engage with Open Source AI.

This barrier may determine whether different or incompatible legal terms for the original sources of components matter when those components come into contact. Sources could range from software to raw data, from data represented as knowledge in an LLM to configurations and tunables in the neural network or in a containing RAG system.
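As a thought experiment, this permeability test could be sketched as a simple checklist audit over a system's components and their licenses. Everything in this sketch is a hypothetical illustration, not real OSI tooling: the component names follow the draft table that follows, while the license allow-list and the `is_osaid_compliant()` function are my own assumptions.

```python
# Hypothetical sketch of a "barrier permeability" audit for an AI system.
# Component names come from the draft OSAID required-components table;
# the license allow-list and this function are illustrative assumptions only.

REQUIRED_COMPONENTS = {
    "data pre-processing",
    "training, validation, and testing",
    "inference code",
    "supporting libraries and tools",
    "model architecture",
    "model parameters",
}

# A toy allow-list standing in for "OSI-compliant license or TBD".
COMPLIANT_LICENSES = {"Apache-2.0", "MIT", "CC-BY-4.0"}

def is_osaid_compliant(system: dict) -> list:
    """Return the required components whose license leaks across the barrier."""
    leaks = []
    for component in REQUIRED_COMPONENTS:
        license_id = system.get(component)
        if license_id not in COMPLIANT_LICENSES:
            leaks.append(component)
    return sorted(leaks)

example_system = {
    "data pre-processing": "Apache-2.0",
    "training, validation, and testing": "Apache-2.0",
    "inference code": "MIT",
    "supporting libraries and tools": "MIT",
    "model architecture": "Apache-2.0",
    "model parameters": "proprietary",  # this component leaks IP terms
}

print(is_osaid_compliant(example_system))  # -> ['model parameters']
```

In this toy model, a system "passes the barrier" only when every required component carries a compliant license; real legal analysis is, of course, far messier than a set lookup.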

Referencing the draft table of items for evaluating future legal documents under the Open Source AI definition, here’s the list of required and optional (good citizen) components:

Required to be under an OSI-compliant license or TBD:

  • Data pre-processing
  • Training, validation, and testing
  • Inference code
  • Supporting libraries and tools
  • Model architecture
  • Model parameters (including weights)

Good-citizen optional:

  • Code used for inference benchmark tests
  • Evaluation code
  • All data sets: training, testing, validation, and benchmarking data sets, data cards, evaluation metrics and results, and all other data documentation
  • All model elements: model card and sample model outputs
  • Any other documentation or tools produced or used: thorough research papers, usage documentation, technical report, supporting tools

Struggling to grasp this concept? Allow me to extend the metaphor:

In physiology, the blood-brain barrier shields the brain from toxins in the blood while supplying essential nutrients. Similarly, in engineering, control surfaces safeguard users from inadvertently damaging products. Think of networking firewalls as barriers, preserving the integrity of internet traffic without altering its content (or creating a derivative work!).

Let me know if I’m straying from the mark or merely circling it. Your thoughts on this topic are invaluable. Thanks!

(I fully drafted this article to the point where I was happy with it, then constructed a prompt for ChatGPT 3.5: "Edit this article In the style of Karsten Wade, making subtle changes so it performs higher in SEO rankings:" I then incorporated many of the suggestions, as they were fairly good copyedits.)

Full text of the 0.0.6 version of the Open Source AI Definition (OSAID) from https://hackmd.io/@opensourceinitiative/osaid-0-0-6

What is Open Source AI

An Open Source AI is an AI system made available to the public under terms that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

Precondition to exercise these freedoms is to have access to the preferred form to make modifications to the system. For machine learning systems that means having public access to:

  • Data: Sufficiently detailed information on how the system was trained, including the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, the labeling procedures and data cleaning methodologies.
  • Code: The code used for pre-processing data, the code used for training, validation and testing, the supporting libraries like tokenizers and hyperparameters search code (if used), the inference code, and the model architecture.
  • Model: The model parameters, including weights. Where applicable, these should include checkpoints from key intermediate stages of training as well as the final optimizer state.