How to Convert a List of Strings to a Tensor in PyTorch

PyTorch tensors hold only numeric data, so strings must be encoded as numbers before they can be stored in a tensor. The conversion takes three steps:

  1. Tokenize the strings.
  2. Convert the tokens to numerical values.
  3. Create a tensor from the numerical values.

Step 1: Tokenize the strings

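Tokenization splits a string into smaller units called tokens. Common approaches are:

  1. Character-level tokenization: Each character is treated as a token. E.g., “hello” -> [“h”, “e”, “l”, “l”, “o”]
  2. Word-level tokenization: Each word is treated as a token. E.g., “hello world” -> [“hello”, “world”]
  3. Subword tokenization: Words are split into smaller pieces. This more advanced method is used mainly in NLP models like BERT.

The example below applies word-level tokenization to a single string by splitting it on whitespace:
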
import torch

string = "Today is Mahavir Jayanti"

# Word-level tokenization: split the string on whitespace
tokens = string.split()
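
Since the goal is to handle a list of strings, the same idea can be applied to every string in the list. The following is a minimal sketch comparing character-level and word-level tokenization; the sample strings and the variable names `char_tokens` and `word_tokens` are illustrative assumptions, not part of the original example.

```python
# Hypothetical sample data chosen for illustration
strings = ["hello world", "hi"]

# Character-level: every character (including spaces) becomes a token
char_tokens = [list(s) for s in strings]

# Word-level: each string is split on whitespace
word_tokens = [s.split() for s in strings]

print(char_tokens[1])
print(word_tokens)
```

Note that each string produces a token list of a different length, which matters when the tensor is created in Step 3.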

Step 2: Convert the tokens to numerical values

Once tokenized, you must map each unique token to a unique integer. This often involves creating a vocabulary of all unique tokens and assigning each token an integer ID.

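# Build the vocabulary from unique tokens (in first-seen order) so that
# repeated words share one ID and the IDs stay contiguous
word_to_ids = {word: i for i, word in enumerate(dict.fromkeys(tokens))}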

numerical_values = [word_to_ids[word] for word in tokens]
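
When several strings share one vocabulary, repeated words should receive the same ID. The sketch below shows one way to do this under that assumption; the sample sentences and the names `vocab` and `encoded` are hypothetical.

```python
# Hypothetical sample data with words repeated across strings
sentences = ["the cat sat", "the dog sat down"]
tokenized = [s.split() for s in sentences]

# Assign the next free integer to each token the first time it appears
vocab = {}
for sentence_tokens in tokenized:
    for token in sentence_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)

# Replace every token with its ID, keeping one list per input string
encoded = [[vocab[t] for t in sentence_tokens] for sentence_tokens in tokenized]

print(vocab)
print(encoded)
```

Here "the" and "sat" map to the same integer in both sentences, so the vocabulary has five entries rather than seven.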

Step 3: Create a tensor from the numerical values

After converting the tokens to integers, you can create a PyTorch tensor. If working with sequences of varying lengths (like sentences), you might need to pad the sequences to make them the same length.

tensor = torch.tensor(numerical_values)

print(tensor)
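
For a list of strings with different lengths, the encoded sequences need padding before they fit in one rectangular tensor. The following sketch uses `pad_sequence` from `torch.nn.utils.rnn`; the sample sentences and the choice to reserve ID 0 for a `<pad>` token are assumptions made for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sample data with sentences of different lengths
sentences = ["the cat sat", "the dog sat down"]

# Reserve ID 0 for padding (an assumed convention), then add real tokens
vocab = {"<pad>": 0}
for sentence in sentences:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

# Encode each sentence as its own 1-D tensor of token IDs
sequences = [torch.tensor([vocab[t] for t in s.split()]) for s in sentences]

# Pad shorter sequences with the <pad> ID so all rows have equal length
padded = pad_sequence(sequences, batch_first=True, padding_value=vocab["<pad>"])

print(padded)
print(padded.shape)
```

With `batch_first=True`, the result has one row per input string, and the shorter sentence is filled out with the padding ID at the end.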

Here is the complete code.

import torch

string = "Today is Mahavir Jayanti"

tokens = string.split()

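# Build the vocabulary from unique tokens (in first-seen order) so that
# repeated words share one ID and the IDs stay contiguous
word_to_ids = {word: i for i, word in enumerate(dict.fromkeys(tokens))}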

numerical_values = [word_to_ids[word] for word in tokens]

tensor = torch.tensor(numerical_values)

print(tensor)

Output

tensor([0, 1, 2, 3])

That’s it!
