Unlocking Communication: AI Decodes Speech from Brain Activity
Meta AI has unveiled an innovative machine learning model that can interpret speech from non-invasive brain recordings. This advancement holds the promise of helping many individuals with brain injuries regain their ability to communicate.
Reading Minds: A New Frontier
Annually, numerous individuals lose their capacity to communicate due to brain injuries, strokes, or degenerative diseases. The United Nations estimates that over 1 billion people live with some form of disability. Recently, brain-computer interfaces (BCIs) have shown success in alleviating some of these challenges, allowing those with speech paralysis to communicate at rates of up to 15 words per minute.
BCIs typically involve implanting electrodes into the brain, which is an invasive procedure that carries risks such as scar tissue formation and potential rejection of foreign objects. In light of these dangers, researchers have turned their attention to non-invasive methods for decoding language from brain activity. Two primary techniques have emerged:
- Magnetoencephalography (MEG): This functional neuroimaging technique uses magnetometers to map brain activity.
- Electroencephalography (EEG): A diagnostic method that records electrical activity in the brain via electrodes.
Despite advancements in these technologies, they often produce signals that are noisy and variable among individuals. Due to this complexity, researchers have favored extracting specific features from the signals rather than using them in their raw form.
Traditionally, many studies focused on feature extraction before training a model, often customizing models for individual patients, which presents significant scalability challenges.
Can We Decode Brain Activity?
A recent publication aimed to develop a model capable of decoding speech from recorded brain signals. The primary challenge lies in the unknown representation of spoken words in the brain. Therefore, the researchers began their experiments with healthy participants who were listening to audio in their native language.
Previous methodologies treated decoding as a regression problem; the authors instead proposed a contrastive loss (a CLIP-style loss, originally designed to align text and image representations), aiming to align the latent representations of brain activity with those of the corresponding sounds.
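To make the idea concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive loss between brain and speech latents (the two encoders producing them are described below). This is my own illustration, not the authors' code; the tensor shapes and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(brain_latents, speech_latents, temperature=0.07):
    """CLIP-style loss: each brain segment should match its own speech segment.

    brain_latents, speech_latents: (batch, dim) tensors from the two encoders
    (shapes and temperature are illustrative assumptions).
    """
    # L2-normalise so the dot product becomes a cosine similarity
    brain = F.normalize(brain_latents, dim=-1)
    speech = F.normalize(speech_latents, dim=-1)
    # similarity matrix: entry (i, j) compares brain segment i with speech segment j
    logits = brain @ speech.t() / temperature
    # matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric cross-entropy: retrieve speech from brain and brain from speech
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```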
The authors introduced a new brain module for processing the brain recordings (MEG or EEG), which combines a spatial attention layer with participant-specific convolution layers to yield a latent representation of the brain signals.
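As a rough illustration of such a brain module, the sketch below applies a learned spatial layer over the sensor dimension, a participant-specific 1x1 convolution, and a shared convolutional encoder. The layer names, sizes, and the simplified stand-in for the spatial attention mechanism are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BrainModule(nn.Module):
    """Hypothetical sketch: sensors -> spatial layer -> subject-specific conv -> encoder."""

    def __init__(self, n_sensors=273, n_subjects=175, hidden=320, latent_dim=768):
        super().__init__()
        # learned projection over the sensor dimension, standing in for the
        # paper's spatial attention layer (a simplification)
        self.spatial = nn.Conv1d(n_sensors, hidden, kernel_size=1)
        # one 1x1 convolution per participant to absorb inter-subject variability
        self.subject_layers = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=1) for _ in range(n_subjects)]
        )
        # shared temporal convolutional encoder
        self.encoder = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, x, subject_id):
        # x: (batch, n_sensors, time) raw MEG/EEG segment
        h = self.spatial(x)
        h = self.subject_layers[subject_id](h)
        return self.encoder(h)  # (batch, latent_dim, time)
```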
In parallel, they employed wav2vec 2.0 to analyze the audio and derive a speech representation, maximizing the alignment between the sound and brain activity representations.
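For the speech side, a pretrained wav2vec 2.0 model can serve as a frozen feature extractor, for example via torchaudio as sketched below; the specific checkpoint, layer choice, and file name are assumptions rather than the paper's exact setup.

```python
import torch
import torchaudio

# load a pretrained wav2vec 2.0 bundle (the checkpoint choice is an assumption)
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# "story_segment.wav" is a hypothetical audio file of a story segment
waveform, sr = torchaudio.load("story_segment.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # features: one tensor of shape (batch, frames, dim) per transformer layer
    features, _ = model.extract_features(waveform)

# e.g. take the last layer as the speech representation to align with brain latents
speech_latents = features[-1]
```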
Once the model was established, the authors compiled a dataset of MEG and EEG recordings from 175 participants listening to short stories. Across four datasets, they evaluated the model's ability to identify the audio segment corresponding to each of 1,500 brain recording segments.
The results revealed that MEG significantly outperformed EEG, with the model achieving strong accuracy in pinpointing the exact matching audio segment (top-1) and even higher accuracy when the correct segment only needed to appear among its top-ranked candidates (top-10).
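Viewed as a retrieval task, this evaluation scores every candidate audio segment against each brain segment and checks whether the true one ranks first (top-1) or among the top few (top-10). A minimal sketch, assuming precomputed, L2-normalised latents:

```python
import torch

def retrieval_accuracy(brain_latents, speech_latents, k=10):
    """brain_latents, speech_latents: (n_segments, dim); row i of each is a matching pair."""
    sims = brain_latents @ speech_latents.t()          # pairwise similarity matrix
    ranking = sims.argsort(dim=-1, descending=True)    # candidates sorted per brain segment
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # index of the true segment
    top1 = (ranking[:, :1] == targets).any(dim=1).float().mean().item()
    topk = (ranking[:, :k] == targets).any(dim=1).float().mean().item()
    return top1, topk
```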
The authors conducted further analyses to identify the model's critical components. A regression-based baseline already outperformed a random model, adding the contrastive loss improved results further, and aligning against learned latent speech representations rather than the Mel spectrogram itself yielded the best performance.
Key takeaways include:
- All components of the brain module (e.g., convolution, spatial attention) are essential.
- MEG generally outperforms EEG, though this depends on the specific EEG equipment used.
- Increased participant numbers contribute to improved model performance, as this helps account for individual variability.
What Does the Model Learn?
Understanding what the model decodes from brain signals remains a challenge, but it is crucial for its interpretability.
The authors examined the probabilities the model assigned, participant by participant, for the phrase "Thank you for coming, Ed," revealing variations in model performance across participants. The critical question is: when the model makes errors, what is their source? Are they related to phonetics or to the sentence's semantics? Answering these questions could help improve model performance.
To explore these questions, the authors trained a linear regressor to predict, from linguistic features, the probability the model assigned to the correct word. This analysis aimed to discern whether low-level (phoneme) or high-level (word- and phrase-level) representations drive the model's predictions. The results indicated that part-of-speech, word embedding, and phrase embedding features correlated strongly with the predictions, suggesting the model relies more on semantic and syntactic context than on individual words.
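A hedged sketch of this kind of probing analysis: fit one linear probe per feature level (phoneme-, word-, and phrase-level features) against the probability the model assigned to the correct word, and compare how much each level explains. The feature names, dimensions, and the use of scikit-learn's Ridge regression are my assumptions, illustrated here on random placeholder data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data: one row per decoded word (a real analysis would use the
# model's outputs and linguistic annotations of the stories).
rng = np.random.default_rng(0)
n_words = 500
y = rng.random(n_words)  # probability assigned to the correct word
feature_sets = {
    "phonemes": rng.normal(size=(n_words, 40)),           # low-level features
    "part_of_speech": rng.normal(size=(n_words, 12)),
    "word_embedding": rng.normal(size=(n_words, 300)),
    "phrase_embedding": rng.normal(size=(n_words, 300)),  # high-level features
}

# One linear probe per feature level; a higher cross-validated R^2 means that
# level of representation explains more of the model's behaviour.
for name, X in feature_sets.items():
    r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.3f}")
```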
The authors have made their code available:
GitHub: facebookresearch/brainmagick, a training and evaluation pipeline for MEG and EEG brain signal encoding and decoding using deep learning (github.com).
Parting Thoughts
The model successfully matches segments of speech to brain activity, an impressive feat given the inherent noise in the data. Traditionally, analyzing such data required intricate pipelines tailored for each participant. However, deep learning now allows for more streamlined analyses. The authors propose an efficient end-to-end architecture that minimizes preprocessing efforts.
While we still have much to learn about how the brain encodes language, the authors' approach of aligning brain signals with representations from a model trained on large amounts of speech data is innovative.
Nevertheless, it is premature to consider clinical applications of this model. The technology is not yet portable, and further refinement is needed to enhance its understanding of more complex sentences and improve accuracy.
What are your thoughts? Feel free to comment below!
If you found this article engaging:
- You can explore my other writings and connect with me on LinkedIn.
- Here’s the link to my GitHub repository, where I plan to compile code and resources related to machine learning, AI, and more: https://github.com/SalvatoreRa/tutorial.
You may also find my recent articles interesting:
- Lord of Vectors: One Embedder to Rule Them All. Embedders are back in vogue, so why not have a universal one? (levelup.gitconnected.com)
- Mistral 7B: a New Wind Blowing Away Other Language Models. Mistral 7B is more performant and faster than other LLMs. (levelup.gitconnected.com)
- Scaling Data, Scaling Bias: A Deep Dive into Hateful Content and Racial Bias in Generative AI. Scaling seems the solution for every issue in machine learning: but is it true? (levelup.gitconnected.com)
- Grokking: Learning Is Generalization and Not Memorization. Understanding how a neural network learns helps us prevent the model from forgetting what it learns. (levelup.gitconnected.com)