Reading a CSV file in pandas with double 'double quotes' and embedded commas


This will work. It falls back to the Python parser because the separators are not regular (a comma, sometimes followed by a space). If the file had only commas, pandas would use the C parser, which is much faster.

In [1]: import csv

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv', sep=',\s+', quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]:
      "column1","column2"  "column3"   "column4"   "column5"   "column6"
"AM"                 "07"        "1"        "SD"        "SD"        "CR"
"AM"                 "08"    "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                 "01"        "2"        "SD"        "SD"        "SD"

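As an alternative to a regex separator, the blank after each comma can be skipped by the parser itself. The sketch below uses the standard library's csv module as a stand-in (it is not the pandas call from the answer above); pandas.read_csv accepts a skipinitialspace parameter as well. The sample text is an assumed copy of test.csv.

```python
import csv
from io import StringIO

# Assumed copy of test.csv from the session above.
text = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True drops the blank after each comma, so the quote that
# follows is recognised as the start of a quoted field and the commas inside
# "1,2,3" stay part of the value.
rows = list(csv.reader(StringIO(text), skipinitialspace=True))
header, records = rows[0], rows[1:]

print(header)
print(records[1])
```

Because the separator stays a plain comma, no regex is involved and the quotes are stripped from the values.
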
Suggestion : 2

A comma inside double quotes is fine; it is allowed by the RFC 4180 standard. As for " " inside data values (such as "value" "13"), you will need to clean up the source file before processing. If the double quotes sit together as "", that should not be an issue, because it complies with the CSV standard (these are called escaped double quotes). If there is a space between the double quotes, you need to clean it up.

There is no need to preprocess the CSV file; just use the Python engine:

dataset = pd.read_csv('sample.csv', sep = ',', engine = 'python')

In pandas, use sep=',\s*' instead of sep=',\s+'. This makes the space(s) after each comma optional:

file1 = pd.read_csv('sample.txt', sep=r',\s*', skipinitialspace=True, quoting=csv.QUOTE_ALL, engine='python')
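The difference between the two patterns can be seen with re.split, used here only as an illustration of the regular expressions and not as a reproduction of pandas internals. The sample strings are assumptions. The last line shows a caveat worth keeping in mind: the pattern itself knows nothing about quotes, and the pandas documentation warns that regex delimiters are prone to ignoring quoted data.

```python
import re

# Assumed sample lines: one header with a missing space, one row with an embedded comma.
header = '"column1","column2", "column3"'
row = '"AM", "1,2,3", "SD"'

# ',\s+' requires at least one whitespace character after the comma,
# so the first two header cells stay fused.
strict = re.split(r',\s+', header)

# ',\s*' makes the whitespace optional, so every comma splits.
relaxed = re.split(r',\s*', header)

# Neither pattern looks at quotes: the commas inside "1,2,3" split as well.
inside_quotes = re.split(r',\s*', row)

print(strict)
print(relaxed)
print(inside_quotes)
```
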

Use:

sed -r 's/"\s+"/""/g' src.csv > cleared.csv

before feeding the CSV to pandas. It removes the whitespace between the two quotes. Alternatively, to delete such a quote pair entirely, run:

sed -r 's/"\s+"//g' src.csv > cleared.csv
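The first sed command can be approximated in Python when a shell is not available. This is a minimal sketch: close_quote_gaps is a hypothetical helper name, the sample line is made up, and [ \t]+ is used instead of \s+ so that a gap never spans a line break when the whole file is processed as one string.

```python
import csv
import re

def close_quote_gaps(text):
    """Turn a quote, blanks, quote sequence into "" (hypothetical helper)."""
    # [ \t]+ rather than \s+ so that the match cannot cross a newline.
    return re.sub(r'"[ \t]+"', '""', text)

raw = '"id","value" "13","end"'      # made-up line with a gap between quotes
cleaned = close_quote_gaps(raw)
fields = next(csv.reader([cleaned]))

print(cleaned)
print(fields)
```

After the cleanup the pair reads as an escaped double quote, so the middle field is parsed as a single value containing one quote character.
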

Suggestion : 3

This is a line in a file named data.txt, which I am trying to read in pandas using read_csv:

"xxx"|"xxx"|"-xxxxxxx"|"xxxxx"|"x"|"xx"|""xxxxxx""|"x"|"xx"|"xxxxxxx"|""|"x"|"xxxxxx"|"X"|"xxxx"|"xxxxx"|""

All the other data is read into the data frame correctly, but pandas has an issue with ""xxxxxx"" (two double quotes on each side) and reads it as xxxxxx"" inside the dataframe. The seventh field in the line above is the item with double-double quotes, and that is the issue.

Please post something reproducible and your expected output. Also, please narrow your example down to the minimum necessary to reproduce.

This is exactly what I want. pandas cannot figure out the double quotes when they appear inside an item. https://stackoverflow.com/questions/58325337/reading-csv-file-in-pandas-with-double-double-quotes-and-embedded-commas

df = pd.read_csv('data.txt', names = columns, dtype = column_dict, na_values = [''], keep_default_na = False, sep = '|', encoding = 'cp1252', skiprows = 1)
df = pd.read_csv(StringIO(data), na_values = [''], keep_default_na = False, sep = '|', encoding = 'cp1252', skiprows = 1, quoting = csv.QUOTE_NONE)
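The second call above passes quoting=csv.QUOTE_NONE, which treats quote characters as ordinary data so that only the | character separates fields. A minimal sketch of that idea with the standard library's csv module (not pandas itself); the shortened sample line and the helper strip_outer_quotes are assumptions for illustration.

```python
import csv

# Shortened, assumed version of the line from data.txt.
line = '"xxx"|"-xxxxxxx"|""xxxxxx""|""|"X"'

def strip_outer_quotes(field):
    """Remove one pair of enclosing double quotes, if present (hypothetical helper)."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field

# QUOTE_NONE: quotes are kept as plain characters; only '|' splits fields.
raw_fields = next(csv.reader([line], delimiter='|', quoting=csv.QUOTE_NONE))
fields = [strip_outer_quotes(f) for f in raw_fields]

print(raw_fields)
print(fields)
```

With this approach the quotes survive parsing and are removed in a separate step, so the doubled quotes never confuse the parser.
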

Suggestion : 4

I need to read a CSV file in pandas in which one of the fields is wrapped in double 'double quotes'; the data is shown below. Is there a way for me to read this file as is, without having to preprocess it and remove the double 'double quotes' in the data?

When column2 has no commas, I'm able to read the data with some extra quotes, which I can replace in further processing steps. I'm having parsing problems only when column2 contains a comma.

"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"

Suggestion : 5

A file in which every field is quoted:

"ID","Name","CountryCode","District","Population"
"1","Kabul","AFG","Kabol","1780000"
"2","Qandahar","AFG","Qandahar","237500"
"3","Herat","AFG","Herat","186800"
"4","Mazar-e-Sharif","AFG","Balkh","127800"
"5","Amsterdam","NLD","Noord-Holland","731200"

A file in which only one field is quoted:

Name,Age,Address
Jerry,10,"2776 McDowell Street, Nashville, Tennessee"
Tom,20,"3171 Jessie Street, Westerville, Ohio"
Mike,30,"1818 Sherman Street, Hope, Kansas"

Notice that only the address field is wrapped in double quotes. This is because the quoting argument is set to QUOTE_MINIMAL by default. In other words, fields are quoted only when the quotechar or the delimiter appears in the data.

Similarly, if you have double quotes embedded inside a field, each one must be escaped with another double quote character. Otherwise, they will not be interpreted correctly. For example:

Id,User,Comment
1,Bob,"John said ""Hello World"""
2,Tom,""The Magician""

This file uses a double quote to escape the embedded double quote characters in the field. By default, doublequote is set to True. As a result, two consecutive double quotes are interpreted as one while reading.

A file may instead use the backslash (\) character to escape the embedded double quotes. By default, however, the csv module uses a double quote character to escape a double quote character.
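The three behaviours described above (QUOTE_MINIMAL when writing, doublequote when reading, and backslash escaping) can be checked with a short sketch using Python's csv module. The sample rows are taken from, or modelled on, the examples above; the backslash-escaped line is an assumed variant, since the source does not show that file.

```python
import csv
from io import StringIO

# Writing: with the default QUOTE_MINIMAL only the field that contains the
# delimiter is wrapped in quotes.
out = StringIO()
csv.writer(out).writerow(['Jerry', 10, '2776 McDowell Street, Nashville, Tennessee'])
written = out.getvalue()

# Reading: doublequote=True (the default) turns "" inside a quoted field into ".
doubled = '1,Bob,"John said ""Hello World"""'
row_doubled = next(csv.reader([doubled]))

# Reading a backslash-escaped variant (assumed sample): escapechar must be set.
escaped = '1,Bob,"John said \\"Hello World\\""'
row_escaped = next(csv.reader([escaped], escapechar='\\', doublequote=False))

print(written, end='')
print(row_doubled)
print(row_escaped)
```

Both reading styles produce the same field value; only the escaping convention in the file differs.
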

Suggestion : 6


You can process it with the following code (after https://stackoverflow.com/a/64456792/5660315):

from io import StringIO
import csv
import pandas as pd

s = """
Q,W,E,R
A,S,"D,F",G
Z,X,C,V
"""
df = pd.read_csv(StringIO(s),
   names = range(4),
   sep = ',',
   quoting = csv.QUOTE_ALL,
   quotechar = '"'
)
print(df)
#    0  1    2  3
# 0  Q  W    E  R
# 1  A  S  D,F  G
# 2  Z  X    C  V

Suggestion : 7

I need to read a CSV file in pandas which has data in the following format (double 'double quotes' for one of the fields). Is there a way for me to read this file as is, without having to preprocess it and remove the double 'double quotes' in the data?

Pandas may not be able to do this by itself, since the data has both unescaped separators and unescaped quotes. Pandas does have a doublequote parameter, but it is not flexible enough for this case or for other common issues such as extra quotes, spaces, and embedded commas.

The data:

"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"

I expect the correctly parsed dataframe to be like

column1  column2  column3   column4
10       AB       ABCD      abcd
11       "CD,E"   CDEF      abcd
12       WER      "DEF,31"  abcd

I tried using

df = pd.read_csv('sample.txt', quotechar = '""', quoting = csv.QUOTE_ALL)

but getting

TypeError: "quotechar" must be a 1-character string

However, you should be able to parse it after modifying the data with a regex that escapes the quotes that are part of the field.

import re
from io import StringIO

import pandas as pd

data = """
"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
"""

# Escape every quote that is neither at the start or end of a line
# nor directly next to a field-separating comma.
data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', data, flags = re.M)

pd.read_csv(StringIO(data), escapechar = '\\')

If you are reading from a file, then:

with open('path/to/csv', 'r') as f:
   data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', f.read(), flags = re.M)
df = pd.read_csv(StringIO(data), escapechar = '\\')
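To see what the regex does to each row, the escaped text can be inspected and parsed with the standard library's csv module, used here as a stand-in for pd.read_csv with the same escapechar. This is a sketch under the assumption that the file has no spaces between fields, which is what the lookarounds (?!,") and (?<!,") rely on.

```python
import csv
import re
from io import StringIO

# Assumed file contents, with no blanks between fields.
data = '''"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
'''

# Same pattern as in the answer above: a quote is escaped unless it starts or
# ends a line or sits directly next to a field-separating comma.
escaped = re.sub(r'(?<!^)"(?!,")(?<!,")(?!$)', r'\\"', data, flags=re.M)

rows = list(csv.reader(StringIO(escaped), escapechar='\\'))

print(escaped.splitlines()[2])
for row in rows:
    print(row)
```

The inner quotes remain part of the column2 values after parsing; removing them, for example with str.strip('"'), would be a separate step.
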

This should do the trick for you

df = pd.read_csv("so.txt", encoding = 'utf-8', names = ["column1", "column2", "column3", "column4"], sep = '",', header = 0, quoting = csv.QUOTE_ALL)

Using a system pipe, which should be efficient for large files on Linux:

import os
df = pd.read_csv(
   os.popen('sed -r "s/^\s+|(^[,[:space:]]*|\s*)(#.*)?$//g; s/\s+,/,/g; s/\\"\\"/\\"/g" %s' % fname),
   quotechar = '"', skipinitialspace = True)

OR: using a python pipe

import re
from io import StringIO

import pandas as pd

# fname is the path to the input file
with open(fname) as f:
   data = re.sub('""', '"', re.sub('[ \t]+,', ',',
      re.sub('^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', f.read(), flags = re.M)))
df = pd.read_csv(StringIO(data), quotechar = '"', skipinitialspace = True)

Input file with comments and issues

a, b, c, d # header w/ trailing spaces
, , , , , , # commas + spaces, no data
# extra space before data
1, 2, 3.5, 4k
3, " 5 ", 7.6, "n, m" # extra spaces, comma inside
10, "20", 30.5, w z
40, 60, 75, ""x, q"" # double quoting

It's now clean and properly formatted:

a int64
b int64
c float64
d object

list(df['d']): ['4k', 'n, m', 'w z', 'x, q']
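The three substitutions in the Python pipe can be followed step by step on a small input. The sketch below reuses the regular expressions from the answer but parses the result with the standard library's csv module instead of pandas; the function name clean and the sample text are assumptions.

```python
import csv
import re
from io import StringIO

# Made-up input with comments, stray blanks and doubled quotes.
raw = '''a,b,c,d   # header with trailing spaces
  1,2,3.5,4k
3, "5" ,7.6,"n, m"  # extra spaces, comma inside
40,60,75,""x, q""
'''

def clean(text):
    # 1. Drop leading blanks, trailing blanks and comments, and comma-only lines.
    text = re.sub(r'^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', text, flags=re.M)
    # 2. Drop blanks that sit directly in front of a comma.
    text = re.sub(r'[ \t]+,', ',', text)
    # 3. Collapse doubled quotes into single ones.
    return text.replace('""', '"')

# Lines emptied by the cleanup come back as empty lists and are skipped.
rows = [r for r in csv.reader(StringIO(clean(raw)), skipinitialspace=True) if r]

for row in rows:
    print(row)
```

Two limitations follow directly from the patterns: a # inside a quoted value would be treated as the start of a comment, and an empty quoted field "" would be collapsed to a single quote by the last step.
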

Suggestion : 8

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

using System.Collections.Generic;

public class Reader
    {
        public List<List<string>> ReadCsv(string[] data, char separator)
        {
            var rows = new List<List<string>>();
            foreach (string text in data)
            {
                bool isquotedString = false;
                int startIndex = 0;
                int endIndex = 0;
                var cols = new List<string>();
                foreach (char c in text)
                {
                    //in a correctly formatted csv file,
                    //when a separator occurs inside a quoted string 
                    //the number of '"' is always an odd number
                    if (c == '"')
                    {
                        //toggle isquotedString
                        isquotedString = !isquotedString;
                    }
                    //ignore separator if embedded within a quoted string
                    if (c == separator && !isquotedString)
                    {
                        //this will add all cols except the last
                        cols.Add(trimQuotes(text.Substring(startIndex, endIndex - startIndex)));
                        startIndex = endIndex + 1;
                    }

                    endIndex++;
                }
                //reached the last column so trim and add it
                cols.Add(trimQuotes(text.Substring(startIndex, endIndex - startIndex)));
                rows.Add(cols);
            }
            return rows;
        }
        private string trimQuotes(string text)
        {
            if (string.IsNullOrEmpty(text))
            {
                return text;
            }
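            //an empty quoted field "" has length 2, so compare with >= 2 and check the closing quote too
            return text.Length >= 2 && text[0] == '"' && text[text.Length - 1] == '"'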
                       //trim enclosing '"' and replace any embedded '""' with '"' 
                       ? text.Substring(1, text.Length - 2).Replace("\"\"", "\"")
                       //replace any embedded '""' with '"' 
                       : text.Replace("\"\"", "\"");
        }
    }
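The quote-toggling idea used by the reader above can be sketched in Python as well. This is a rough port under stated assumptions, not a replacement for a real CSV parser: the names split_csv_line and trim_quotes are hypothetical, each input line is assumed to hold one complete record, and an empty quoted field "" is treated as an empty string.

```python
def trim_quotes(field):
    """Strip one pair of enclosing quotes, then unescape doubled quotes."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        field = field[1:-1]
    return field.replace('""', '"')

def split_csv_line(text, separator=','):
    """Split one line on separators that are not inside a quoted string."""
    fields, start, in_quotes = [], 0, False
    for i, ch in enumerate(text):
        if ch == '"':
            # Every quote flips the state; an escaped "" flips it twice.
            in_quotes = not in_quotes
        elif ch == separator and not in_quotes:
            fields.append(trim_quotes(text[start:i]))
            start = i + 1
    # The last column has no trailing separator.
    fields.append(trim_quotes(text[start:]))
    return fields

print(split_csv_line('a,"b,c",d'))
print(split_csv_line('1,"say ""hi""",x'))
```

Because an escaped "" toggles the flag twice, the state after it is the same as before, which is why a simple boolean is enough for well-formed input.
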