If you need a Python script:
lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
If you're on *nix, try running the following command:
sort <file name> | uniq
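`sort -u` collapses the two steps into one; like the pipeline above, it reorders the lines. A quick demo (the file path is just an example):

```shell
# sort -u is equivalent to sort | uniq: sort, then drop adjacent duplicates
printf 'b\na\nb\n' > /tmp/demo.txt
sort -u /tmp/demo.txt
```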
uniqlines = set(open('/tmp/foo').readlines())
writing that back to some file would be as easy as:
with open('/tmp/bar', 'w') as bar:
    bar.writelines(uniqlines)
You can do:
import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")
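If you'd rather not spawn a shell, the same awk one-liner can be run through `subprocess` (a sketch; the demo paths here stand in for the real input and output files):

```python
import subprocess

# demo input: three lines, one duplicate
with open("/tmp/file", "w") as f:
    f.write("first\nsecond\nfirst\n")

# awk '!x[$0]++' prints each line only the first time it is seen,
# preserving the original order; no shell is involved here
with open("/tmp/rem-dups", "w") as out:
    subprocess.run(["awk", "!x[$0]++", "/tmp/file"], stdout=out, check=True)

print(open("/tmp/rem-dups").read())  # first\nsecond\n
```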
There is also another way:
with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())
with open('/tmp/rmdup.txt', 'w') as rmdup:
    rmdup.writelines(uniqlines)
Get all your lines into a list, make a set of the lines, and you are done. For example:
>>> x = ["line1", "line2", "line3", "line2", "line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
If you need to preserve the ordering of the lines (a set is an unordered collection), try this:
y = []
for l in x:
    if l not in y:
        y.append(l)
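On Python 3.7+, where plain dicts preserve insertion order, the same order-preserving deduplication can be written in one line (a sketch, not from the original answer):

```python
x = ["line1", "line2", "line3", "line2", "line1"]

# dict keys are unique and, since Python 3.7, keep insertion order
y = list(dict.fromkeys(x))
print(y)  # ['line1', 'line2', 'line3']
```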
It's a rehash of what's already been said here - here's what I use.
import optparse

def removeDups(inputfile, outputfile):
    lines = open(inputfile, 'r').readlines()
    lines_set = set(lines)
    out = open(outputfile, 'w')
    for line in lines_set:
        out.write(line)
    out.close()

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if (inputfile is None) or (outputfile is None):
        print(parser.usage)
        exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()
In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program will read the lines of an input text file and write them to an output text file, reading line by line and checking whether a similar line was already written to the output file. While writing, we will constantly check for duplicate lines; if a line was previously written, we will skip it. After everything is completed, the output file will contain all the contents of the input file without any duplicate lines. For example, for the following text file:

First Line
Second Line
First Line
First Line
First Line

the output file will contain:

First Line
Second Line
import hashlib

# 1: file paths
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

# 2: hashes of lines already written
completed_lines_hash = set()

# 3: open the output file
output_file = open(output_file_path, "w")

# 4: read the input file line by line
for line in open(input_file_path, "r"):
    # 5: hash the line, ignoring the trailing newline
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    # 6: write the line only if its hash has not been seen before
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

# 7: close the output file
output_file.close()
Some read_csv parameters are relevant here: skipfooter is the number of lines at the bottom of the file to skip (unsupported with engine='c'); nrows is the number of rows of the file to read, useful for reading pieces of large files; skiprows takes line numbers to skip (0-indexed) or a number of lines to skip (int) at the start of the file. pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model; for reading and writing other file formats into and from pandas, we recommend packages from the broader community.
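A quick sketch of two of those parameters on an in-memory CSV (this example is mine, not from the original text):

```python
import pandas as pd
from io import StringIO

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# nrows: read only the first two data rows
head = pd.read_csv(StringIO(data), nrows=2)
print(head)

# skiprows: skip line 1, the first data row (line 0 is the header)
rest = pd.read_csv(StringIO(data), skiprows=[1])
print(rest)
```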
In[1]: import pandas as pd
In[2]: from io import StringIO
In[3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
In[4]: pd.read_csv(StringIO(data))
Out[4]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In[5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]:
col1 col3
0 a 1
1 a 2
2 c 3
In[6]: data = "col1,col2,col3\na,b,1"
In[7]: df = pd.read_csv(StringIO(data))
In[8]: df.columns = [f"pre_{col}" for col in df.columns]
In[9]: df
Out[9]:
pre_col1 pre_col2 pre_col3
0 a b 1
In[10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
In[11]: pd.read_csv(StringIO(data))
Out[11]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In[12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[12]:
col1 col2 col3
0 a b 2
In[13]: import numpy as np
In[14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"
In[15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11
In[16]: df = pd.read_csv(StringIO(data), dtype=object)
In[17]: df
Out[17]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN
In[18]: df["a"][0]
Out[18]: '1'
In[19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})
In[20]: df.dtypes
Out[20]:
a int64
b object
c float64
d Int64
dtype: object
In [21]: data = "col_1\n1\n2\n'A'\n4.22"
In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})
In [23]: df
Out[23]:
col_1
0 1
1 2
2 'A'
3 4.22
In [24]: df["col_1"].apply(type).value_counts()
Out[24]:
<class 'str'> 4
Name: col_1, dtype: int64
In [25]: df2 = pd.read_csv(StringIO(data))
In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")
In [27]: df2
Out[27]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22
In [28]: df2["col_1"].apply(type).value_counts()
Out[28]:
<class 'float'> 4
Name: col_1, dtype: int64
Last Updated : 30 May, 2022
Examples:
Input: [2, 4, 10, 20, 5, 2, 20, 4]
Output: [2, 4, 10, 20, 5]
Input: [28, 42, 28, 16, 90, 42, 42, 28]
Output: [28, 42, 16, 90]
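A minimal sketch matching the examples above, preserving first-seen order (the function name is my own):

```python
def remove_duplicates(items):
    """Return the list with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(remove_duplicates([2, 4, 10, 20, 5, 2, 20, 4]))     # [2, 4, 10, 20, 5]
print(remove_duplicates([28, 42, 28, 16, 90, 42, 42, 28]))  # [28, 42, 16, 90]
```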