BertTokenizer - extra spaces appear when encoding and decoding sequences


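If you are using BERT for token classification to find a span in your original string, one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True. This returns, for each token, its start and end character positions in the original string, so you can slice the original text instead of decoding token ids.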

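from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]

# some_model is a placeholder for your token classification model;
# it returns the token indices where the predicted span starts and stops.
span_start_index, span_stop_index = some_model(input_ids)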

Then once you get the token classification results, you can do something like

offsets = tokens["offset_mapping"]  # one (start, end) character pair per token
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
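To see why offsets avoid the extra-space problem, here is a minimal sketch that does not use the transformers library. The function tokenize_with_offsets is a hypothetical toy stand-in for a fast tokenizer (it only splits words and punctuation), and the span indices are made-up model output. It is meant to illustrate the idea, not to reproduce BertTokenizerFast exactly.

```python
import re


def tokenize_with_offsets(text):
    """Toy stand-in for a fast tokenizer (an assumption for illustration):
    split into word and punctuation tokens and record each token's
    (start, end) character offsets in the original string."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


test_string = "text with percentage%"
pairs = tokenize_with_offsets(test_string)
tokens = [tok for tok, _ in pairs]
offsets = [span for _, span in pairs]

# Naive decoding joins tokens with spaces, which inserts a space before '%'.
naive_decoded = " ".join(tokens)

# Hypothetical model output: the span covers the tokens 'percentage' and '%'.
span_start_index, span_stop_index = 2, 3

# Offsets index into the original string, so slicing recovers the exact text.
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]

print(repr(naive_decoded))
print(repr(predicted_span))
```

The joined tokens no longer match the input, while the slice taken through the offsets is an exact substring of it.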

Suggestion : 2

  • Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
  • Return the list of tokens (sub-parts of the input strings after word/subword splitting and before conversion to integer indices) at a given batch index (only works for the output of a fast tokenizer).
  • Convert a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
  • Additional methods to map between the original string (characters and words) and the token space (e.g. getting the index of the token comprising a given character, or the span of characters corresponding to a given token).
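The round trip described above (tokens to ids, ids back to tokens, then decoding to a string) can be sketched with a toy vocabulary. Everything here is an assumption for illustration: the vocabulary, the helper names, and the small list of punctuation marks that the clean-up step handles. It is not the library's implementation, but it shows how a character such as '%' can end up with a space in front of it after decoding.

```python
# Hypothetical toy vocabulary; "##" marks a word-piece continuation.
vocab = {"[UNK]": 0, "text": 1, "with": 2, "percent": 3, "##age": 4, "%": 5, ".": 6}
id_to_token = {i: t for t, i in vocab.items()}


def convert_tokens_to_ids(tokens):
    """Map token strings to integer ids, falling back to [UNK]."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]


def convert_ids_to_tokens(ids):
    """Map integer ids back to token strings."""
    return [id_to_token[i] for i in ids]


def decode(ids, clean_up_tokenization_spaces=True):
    """Join tokens with spaces, merge word pieces, and optionally remove
    the space before a few punctuation marks (assumed list for this sketch)."""
    text = " ".join(convert_ids_to_tokens(ids)).replace(" ##", "")
    if clean_up_tokenization_spaces:
        for mark in (".", ",", "!", "?"):
            text = text.replace(" " + mark, mark)
    return text


ids = convert_tokens_to_ids(["text", "with", "percent", "##age", "%"])
decoded = decode(ids)
print(ids)
print(repr(decoded))
```

Because '%' is not in the clean-up list of this sketch, the decoded string keeps the space that the join introduced.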

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Push the tokenizer to your namespace with the name "my-finetuned-bert"
# and have a local clone in the *my-finetuned-bert* folder.
tokenizer.push_to_hub("my-finetuned-bert")

# Push the tokenizer to your namespace with the name "my-finetuned-bert"
# with no local clone.
tokenizer.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# Push the tokenizer to an organization with the name "my-finetuned-bert"
# and have a local clone in the *my-finetuned-bert* folder.
tokenizer.push_to_hub("my-finetuned-bert", organization="huggingface")

# Make a change to an existing repo that has been cloned locally in *my-finetuned-bert*.
tokenizer.push_to_hub("my-finetuned-bert", repo_url="https://huggingface.co/sgugger/my-finetuned-bert")

Suggestion : 3

09/15/2021

The following example uses a single byte array to encode strings in two separate operations. It maintains an index that indicates the starting position in the byte array for the next set of ASCII-encoded bytes. It calls the ASCIIEncoding.GetByteCount(String) method to ensure that the byte array is large enough to accommodate the encoded string. It then calls the ASCIIEncoding.GetBytes(String, Int32, Int32, Byte[], Int32) method to encode the characters in the string.

using System;
using System.Text;

public class Example {
   public static void Main() {
      string[] strings = {
         "This is the first sentence. ",
         "This is the second sentence. "
      };
      Encoding asciiEncoding = Encoding.ASCII;

      // Create array of adequate size.
      byte[] bytes = new byte[49];
      // Create index for current position of array.
      int index = 0;

      Console.WriteLine("Strings to encode:");
      foreach(var stringValue in strings) {
         Console.WriteLine("   {0}", stringValue);

         int count = asciiEncoding.GetByteCount(stringValue);
         if (count + index >= bytes.Length)
            Array.Resize(ref bytes, bytes.Length + 50);

         int written = asciiEncoding.GetBytes(stringValue, 0,
            stringValue.Length,
            bytes, index);

         index = index + written;
      }
      Console.WriteLine("\nEncoded bytes:");
      Console.WriteLine("{0}", ShowByteValues(bytes, index));
      Console.WriteLine();

      // Decode Unicode byte array to a string.
      string newString = asciiEncoding.GetString(bytes, 0, index);
      Console.WriteLine("Decoded: {0}", newString);
   }

   private static string ShowByteValues(byte[] bytes, int last) {
      string returnString = "   ";
      for (int ctr = 0; ctr <= last - 1; ctr++) {
         if (ctr % 20 == 0)
            returnString += "\n   ";
         returnString += String.Format("{0:X2} ", bytes[ctr]);
      }
      return returnString;
   }
}
// The example displays the following output:
//       Strings to encode:
//          This is the first sentence.
//          This is the second sentence.
//
//       Encoded bytes:
//
//          54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 73 65
//          6E 74 65 6E 63 65 2E 20 54 68 69 73 20 69 73 20 74 68 65 20
//          73 65 63 6F 6E 64 20 73 65 6E 74 65 6E 63 65 2E 20
//
//       Decoded: This is the first sentence. This is the second sentence.

Imports System.Text

Module Example
   Public Sub Main()
      Dim strings() As String = {
         "This is the first sentence. ",
         "This is the second sentence. "
      }
      Dim asciiEncoding As Encoding = Encoding.ASCII

      ' Create array of adequate size.
      Dim bytes(50) As Byte
      ' Create index for current position of array.
      Dim index As Integer = 0

      Console.WriteLine("Strings to encode:")
      For Each stringValue In strings
         Console.WriteLine("   {0}", stringValue)

         Dim count As Integer = asciiEncoding.GetByteCount(stringValue)
         If count + index >= bytes.Length Then
            Array.Resize(bytes, bytes.Length + 50)
         End If
         Dim written As Integer = asciiEncoding.GetBytes(stringValue, 0,
            stringValue.Length,
            bytes, index)

         index = index + written
      Next
      Console.WriteLine()
      Console.WriteLine("Encoded bytes:")
      Console.WriteLine("{0}", ShowByteValues(bytes, index))
      Console.WriteLine()

      ' Decode Unicode byte array to a string.
      Dim newString As String = asciiEncoding.GetString(bytes, 0, index)
      Console.WriteLine("Decoded: {0}", newString)
   End Sub

   Private Function ShowByteValues(bytes As Byte(), last As Integer) As String
      Dim returnString As String = "   "
      For ctr As Integer = 0 To last - 1
         If ctr Mod 20 = 0 Then returnString += vbCrLf + "   "
         returnString += String.Format("{0:X2} ", bytes(ctr))
      Next
      Return returnString
   End Function
End Module
' The example displays the following output:
'       Strings to encode:
'          This is the first sentence.
'          This is the second sentence.
'       
'       Encoded bytes:
'       
'          54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 73 65
'          6E 74 65 6E 63 65 2E 20 54 68 69 73 20 69 73 20 74 68 65 20
'          73 65 63 6F 6E 64 20 73 65 6E 74 65 6E 63 65 2E 20
'       
'       Decoded: This is the first sentence. This is the second sentence.
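
For readers following the Python suggestions on this page, the same pattern (one shared buffer, a running index, growing the buffer when needed) might look roughly like the sketch below. The 49-byte starting size and 50-byte growth step mirror the C# example. Using a bytearray is an assumption of this sketch, not something the original article covers.

```python
strings = [
    "This is the first sentence. ",
    "This is the second sentence. ",
]

# Shared buffer, deliberately too small for both strings.
buffer = bytearray(49)
# Index of the next free position in the buffer.
index = 0

for string_value in strings:
    encoded = string_value.encode("ascii")
    # Grow the buffer by 50 zero bytes if the next string would not fit.
    if index + len(encoded) > len(buffer):
        buffer.extend(bytes(50))
    # Copy the encoded bytes into place; the slice has the same length,
    # so the buffer size does not change here.
    buffer[index:index + len(encoded)] = encoded
    index += len(encoded)

# Decode only the part of the buffer that was actually written.
decoded = buffer[:index].decode("ascii")
print(buffer[:index].hex(" ").upper())
print("Decoded:", decoded)
```

Only the first index bytes are decoded, so the unused zero bytes at the end of the buffer never reach the output string.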

The following example encodes three strings and then decodes them into a single array of characters. It maintains an index that indicates the starting position in the character array for the next set of decoded characters. It calls the GetCharCount method to ensure that the character array is large enough to accommodate all the decoded characters. It then calls the ASCIIEncoding.GetChars(Byte[], Int32, Int32, Char[], Int32) method to decode the byte array.

using System;
using System.Text;

public class Example {
   public static void Main() {
      string[] strings = {
         "This is the first sentence. ",
         "This is the second sentence. ",
         "This is the third sentence. "
      };
      Encoding asciiEncoding = Encoding.ASCII;
      // Array to hold encoded bytes.
      byte[] bytes;
      // Array to hold decoded characters.
      char[] chars = new char[50];
      // Create index for current position of character array.
      int index = 0;

      foreach(var stringValue in strings) {
         Console.WriteLine("String to Encode: {0}", stringValue);
         // Encode the string to a byte array.
         bytes = asciiEncoding.GetBytes(stringValue);
         // Display the encoded bytes.
         Console.Write("Encoded bytes: ");
         for (int ctr = 0; ctr < bytes.Length; ctr++)
            Console.Write(" {0}{1:X2}",
               ctr % 20 == 0 ? Environment.NewLine : "",
               bytes[ctr]);
         Console.WriteLine();

         // Decode the bytes to a single character array.
         int count = asciiEncoding.GetCharCount(bytes);
         if (count + index >= chars.Length)
            Array.Resize(ref chars, chars.Length + 50);

         int written = asciiEncoding.GetChars(bytes, 0,
            bytes.Length,
            chars, index);
         index = index + written;
         Console.WriteLine();
      }

      // Instantiate a single string containing the characters.
      string decodedString = new string(chars, 0, index - 1);
      Console.WriteLine("Decoded string: ");
      Console.WriteLine(decodedString);
   }
}
// The example displays the following output:
//    String to Encode: This is the first sentence.
//    Encoded bytes:
//    54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 73 65
//    6E 74 65 6E 63 65 2E 20
//
//    String to Encode: This is the second sentence.
//    Encoded bytes:
//    54 68 69 73 20 69 73 20 74 68 65 20 73 65 63 6F 6E 64 20 73
//    65 6E 74 65 6E 63 65 2E 20
//
//    String to Encode: This is the third sentence.
//    Encoded bytes:
//    54 68 69 73 20 69 73 20 74 68 65 20 74 68 69 72 64 20 73 65
//    6E 74 65 6E 63 65 2E 20
//
//    Decoded string:
//    This is the first sentence. This is the second sentence. This is the third sentence.

If best-fit fallback is the default for an encoding object, you can choose another fallback strategy when you retrieve an Encoding object by calling the Encoding.GetEncoding(Int32, EncoderFallback, DecoderFallback) or Encoding.GetEncoding(String, EncoderFallback, DecoderFallback) overload. The following section includes an example that replaces each character that cannot be mapped to code page 1252 with an asterisk (*).

using System;
using System.Text;

public class Example {
   public static void Main() {
      Encoding cp1252r = Encoding.GetEncoding(1252,
         new EncoderReplacementFallback("*"),
         new DecoderReplacementFallback("*"));

      string str1 = "\u24C8 \u2075 \u221E";
      Console.WriteLine(str1);
      foreach(var ch in str1)
         Console.Write("{0} ", Convert.ToUInt16(ch).ToString("X4"));

      Console.WriteLine();

      byte[] bytes = cp1252r.GetBytes(str1);
      string str2 = cp1252r.GetString(bytes);
      Console.WriteLine("Round-trip: {0}", str1.Equals(str2));
      if (!str1.Equals(str2)) {
         Console.WriteLine(str2);
         foreach(var ch in str2)
            Console.Write("{0} ", Convert.ToUInt16(ch).ToString("X4"));

         Console.WriteLine();
      }
   }
}
// The example displays the following output:
//       Ⓢ ⁵ ∞
//       24C8 0020 2075 0020 221E
//       Round-trip: False
//       * * *
//       002A 0020 002A 0020 002A
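
A comparable replacement fallback can be sketched in Python with a custom codec error handler. The handler name "asterisk" and the function asterisk_fallback are hypothetical names chosen for this example. Python's built-in "replace" handler uses "?" when encoding, so a custom handler is one way to get an asterisk instead.

```python
import codecs


def asterisk_fallback(error):
    """Replace each character or byte that cannot be mapped with '*'
    and resume right after the offending range."""
    if isinstance(error, (UnicodeEncodeError, UnicodeDecodeError)):
        return ("*" * (error.end - error.start), error.end)
    raise error


# Register the handler under a hypothetical name for use with errors=...
codecs.register_error("asterisk", asterisk_fallback)

str1 = "\u24C8 \u2075 \u221E"
encoded = str1.encode("cp1252", errors="asterisk")
str2 = encoded.decode("cp1252", errors="asterisk")

print(" ".join(f"{ord(ch):04X}" for ch in str1))
print("Round-trip:", str1 == str2)
print(str2)
print(" ".join(f"{ord(ch):04X}" for ch in str2))
```

None of the three symbols exists in code page 1252, so each is replaced and the round trip does not return the original string.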

Suggestion : 4

Last updated 2022-05-31 UTC.

Setup

pip install tensorflow_datasets
pip install -U 'tensorflow-text==2.8.*'

import logging
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow as tf

# Import tf_text to load the ops used by the tokenizer saved model
import tensorflow_text  # pylint: disable=unused-import

logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

The tf.data.Dataset object returned by TensorFlow datasets yields pairs of text examples:

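# train_examples is assumed to be the tf.data.Dataset of (Portuguese, English)
# string pairs loaded earlier with TensorFlow Datasets.
for pt_examples, en_examples in train_examples.batch(3).take(1):
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print()

  for en in en_examples.numpy():
    print(en.decode('utf-8'))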

Suggestion : 5

Last Updated : 16 Jun, 2022
