The term “garbage in, garbage out” is especially apt for the datasets used to train AI models. Training draws on a mix of tools, including algorithms, statistics, and rules-based systems. As training progresses, patterns emerge, and with enough repetition the result is a model that “models” the training dataset, enabling the AI to make predictions.
To make sure datasets are usable, it’s crucial to follow best practices for data management. These practices help ensure that data is diverse, balanced, and representative of the real-world population.
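One simple best practice is to measure how groups are represented in a dataset before training on it. The sketch below is a minimal, hypothetical example (the record fields and sample data are invented for illustration) that computes each group’s share of a dataset:

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the dataset for a given attribute."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical sample: each record carries a label and a demographic attribute.
data = [
    {"label": "approved", "region": "north"},
    {"label": "approved", "region": "north"},
    {"label": "denied",   "region": "south"},
    {"label": "approved", "region": "north"},
]

print(group_shares(data, "region"))  # north dominates 3:1 — a balance red flag
```

A skewed share like this doesn’t prove the data is biased, but it flags where the sample may not represent the real-world population.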
Across data collection, curation, and training, one of the main issues to watch for and address quickly is bias. Biased data becomes skewed and unbalanced, leading an AI model to make inaccurate, biased predictions. Bias can be especially glaring in AI because, to make predictions, a model must disregard outlying information that might otherwise deserve consideration.
Some forms of bias include:
Confirmation Bias: This type of bias occurs when data that confirms existing beliefs is weighted more heavily than data that is unfamiliar or contrary to the data curator’s understanding.
Historical Bias: Historical bias reflects pre-existing cultural and systemic prejudices. It can influence both how data is collected and how it is used, producing samples that are not modern or up to date.
Selection Bias: Selection bias occurs mostly during data collection, when the selected sample is too small or lacks diversity. It leads a model to infer incorrect relationships among variables in the data.
Selection bias itself comes in several forms:
- Sampling bias: At a systemic level, sampling bias occurs when collection isn’t sufficiently randomized. Because groups end up weighted differently in the sample than in the population, the AI model can’t properly generalize the data to make accurate predictions.
- Coverage bias: Coverage bias occurs when data is not collected in a representative way. An example would be a Western pen maker surveying the ink-drying speed of its pens mostly among left-handed writers, excluding most right-handed writers.
- Participation bias: Participation bias can be more insidious and harder to spot. It occurs when participants self-select into groups, skewing results and leaving the data less randomized and balanced than it should be.
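The effect of non-randomized collection can be shown with a small simulation. In this hypothetical sketch (the population, groups, and sampling probabilities are all invented), a biased collection over-samples one group, so the sample’s outcome rate drifts away from the true population rate, while a uniformly random sample stays close to it:

```python
import random

random.seed(0)

# Hypothetical population: two equal-sized groups with different outcomes.
population = (
    [{"group": "A", "outcome": 1} for _ in range(500)]
    + [{"group": "B", "outcome": 0} for _ in range(500)]
)

def outcome_rate(sample):
    """Average outcome across a sample of records."""
    return sum(r["outcome"] for r in sample) / len(sample)

# Biased collection: group A is four times more likely to be sampled.
biased = [
    r for r in population
    if random.random() < (0.8 if r["group"] == "A" else 0.2)
]

# Randomized collection: every record has the same chance of inclusion.
uniform = random.sample(population, 200)

print(outcome_rate(population))  # true rate: 0.5
print(outcome_rate(biased))      # skews toward group A's outcome
print(outcome_rate(uniform))     # stays near the true rate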
Survivorship Bias: Survivorship bias occurs when narrative and inference creep into data collection, leading to incorrect results. It shows up in results that fit a certain narrative or selection process while ignoring the results that don’t fit that same process. A data analyst then infers incorrect connections between data points.
A good example of this bias is focusing on successful entrepreneurs who didn’t go to college as proof that college is unnecessary for success. This narrative ignores those who went to college and were successful, as well as those who didn’t go to college and were not successful.
Inference figures heavily in survivorship bias. Two main ways to reach inaccurate conclusions through it are:
- Inferring causality: Believing an outcome is caused by a specific variable when there isn’t enough information to reach that conclusion, and there may not even be a connection there to begin with.
- Inferring a norm: Believing data that survives the narrative represents a past norm, rather than looking at the data that did not survive over time.
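The college example above can be put in numbers. This sketch uses an invented cohort (the counts are hypothetical, chosen only for illustration): looking just at the “survivors” — the successes we hear about — shows that famous dropouts exist, but including the failures the narrative ignores reveals the opposite base rates:

```python
# Hypothetical cohort of founders: college attendance and success are invented.
cohort = (
    [{"college": False, "success": True}] * 5     # the famous dropouts
    + [{"college": False, "success": False}] * 45
    + [{"college": True, "success": True}] * 20
    + [{"college": True, "success": False}] * 30
)

def success_rate(records):
    """Fraction of records marked successful."""
    return sum(r["success"] for r in records) / len(records)

# Survivor-only view: we only hear about the successes.
survivors = [r for r in cohort if r["success"]]
dropout_share_of_survivors = sum(not r["college"] for r in survivors) / len(survivors)

# Full-cohort view: include the failures the narrative ignores.
no_college = [r for r in cohort if not r["college"]]
college = [r for r in cohort if r["college"]]

print(dropout_share_of_survivors)  # 5/25 = 0.2
print(success_rate(no_college))    # 5/50 = 0.1
print(success_rate(college))       # 20/50 = 0.4
```

The survivor-only view makes dropouts look like a viable path, while the full cohort shows college founders succeeding at four times the rate — the inference only held because the failures were never counted.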
Availability Bias: Availability bias involves the use of heuristics as a mental shortcut to reach conclusions. With heuristics, incorrect conclusions are reached based on how well a certain scenario is remembered. A good example is when travelers overestimate the chances of a catastrophe happening based on recently consumed information about that catastrophe. Availability bias favors short-term memory over randomized, diverse data.
Data bias comes in many forms and can find its way into training datasets in insidious ways. The result is skewed datasets and, in turn, a less-than-ideal AI model. It’s important to stay vigilant about data diversity and favor data samples that are sufficiently randomized to avoid bias.