python - how to read html line by line [duplicate]


If you need a Python script:

lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

If you're on *nix, try running the following command:

sort <file name> | uniq
uniqlines = set(open('/tmp/foo').readlines())

writing that back to some file would be as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)
bar.close()

You can do:

import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")
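The awk one-liner `!x[$0]++` prints each line only the first time it is seen, preserving order. If you call it from Python, subprocess with an explicit argument list is a safer alternative to os.system, since it avoids shell-quoting problems with the file paths (the paths below are illustrative, and awk is assumed to be on PATH):

```python
import subprocess

# Create a small sample input file (illustrative path).
with open("/tmp/foo", "w") as f:
    f.write("a\nb\na\nc\nb\n")

# Redirect stdout to the output file instead of using shell redirection.
with open("/tmp/rem-dups", "w") as out:
    subprocess.run(["awk", "!x[$0]++", "/tmp/foo"], stdout=out, check=True)

print(open("/tmp/rem-dups").read())  # a, b, c each printed once
```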

There is also another way:

with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())
with open('/tmp/rmdup.txt', 'w') as rmdup:
    rmdup.writelines(uniqlines)

Get all your lines into a list, build a set from it, and you are done. For example:

>>> x = ["line1", "line2", "line3", "line2", "line1"]
>>> list(set(x))
['line3', 'line2', 'line1']

If you need to preserve the ordering of the lines (a set is an unordered collection), try this:

y = []
for l in x:
    if l not in y:
        y.append(l)
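Since Python 3.7, plain dicts preserve insertion order, so `dict.fromkeys` gives the same order-preserving dedup in one line:

```python
x = ["line1", "line2", "line3", "line2", "line1"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
y = list(dict.fromkeys(x))
print(y)  # ['line1', 'line2', 'line3']
```

Unlike the loop above, this runs in linear time, since dict membership checks are O(1) while `l not in y` scans the list.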

It's a rehash of what's already been said here; here's what I use.

import optparse

def removeDups(inputfile, outputfile):
    lines = open(inputfile, 'r').readlines()
    lines_set = set(lines)
    out = open(outputfile, 'w')
    for line in lines_set:
        out.write(line)
    out.close()

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if inputfile is None or outputfile is None:
        print(parser.usage)
        exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()
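optparse has been deprecated since Python 2.7; the same tool can be sketched with argparse. This version also preserves the original line order, since iterating a set yields lines in arbitrary order. The file paths in the demo call are illustrative:

```python
import argparse

def remove_dups(inputfile, outputfile):
    # Write each line the first time it appears, preserving order.
    seen = set()
    with open(inputfile) as src, open(outputfile, "w") as dst:
        for line in src:
            if line not in seen:
                dst.write(line)
                seen.add(line)

parser = argparse.ArgumentParser(description="Remove duplicate lines from a file")
parser.add_argument("-i", dest="inputfile", required=True, help="input file")
parser.add_argument("-o", dest="outputfile", required=True, help="output file")

# Demo with an explicit argv list so the snippet is self-contained.
with open("/tmp/in.txt", "w") as f:
    f.write("First Line\nSecond Line\nFirst Line\n")
args = parser.parse_args(["-i", "/tmp/in.txt", "-o", "/tmp/out.txt"])
remove_dups(args.inputfile, args.outputfile)
print(open("/tmp/out.txt").read())
```

In a real script you would call `parser.parse_args()` with no arguments to read `sys.argv`; argparse enforces the required options and prints usage for you.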

Suggestion : 2

In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program will read the lines of an input text file and write them to an output text file. While writing, we will constantly check for duplicate lines: we read line by line from the input file and check whether a similar line was already written to the output file; if it was, we skip it. After everything is completed, the output file will contain all the contents of the input file without any duplicate lines. For example, for the following text file:

First Line
Second Line
First Line
First Line
First Line
First Line
Second Line
import hashlib

# 1 - input and output paths
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

# 2 - hashes of lines already written
completed_lines_hash = set()

# 3 - open the output file for writing
output_file = open(output_file_path, "w")

# 4 - read the input file line by line
for line in open(input_file_path, "r"):
    # 5 - hash the line, ignoring trailing whitespace
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    # 6 - write the line only if its hash has not been seen yet
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)
# 7 - close the output file
output_file.close()
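Hashing stores a fixed-size digest (32 hex characters for MD5) per unique line instead of the line itself, which saves memory when lines are long. The same idea on in-memory data, as a self-contained sketch:

```python
import hashlib

lines = ["First Line\n", "Second Line\n", "First Line\n"]
seen = set()
kept = []
for line in lines:
    # Hash the stripped line; keep the line only on its first occurrence.
    digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        kept.append(line)
print(kept)  # ['First Line\n', 'Second Line\n']
```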

Suggestion : 3

From the pandas read_csv documentation:

  • skipfooter : number of lines at the bottom of the file to skip (unsupported with engine='c').
  • nrows : number of rows of file to read; useful for reading pieces of large files.
  • skiprows : line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading and writing other file formats into and from pandas, we recommend these packages from the broader community.

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [4]: pd.read_csv(StringIO(data))
Out[4]:
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]:
  col1  col3
0    a     1
1    a     2
2    c     3
In [6]: data = "col1,col2,col3\na,b,1"

In [7]: df = pd.read_csv(StringIO(data))

In [8]: df.columns = [f"pre_{col}" for col in df.columns]

In [9]: df
Out[9]:
  pre_col1 pre_col2  pre_col3
0        a        b         1
In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [11]: pd.read_csv(StringIO(data))
Out[11]:
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[12]:
  col1 col2  col3
0    a    b     2
In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]:
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={
   ....:     "b": object,
   ....:     "c": np.float64,
   ....:     "d": "Int64",
   ....: })

In [20]: df.dtypes
Out[20]:
a      int64
b     object
c    float64
d      Int64
dtype: object
In [21]: data = "col_1\n1\n2\n'A'\n4.22"

In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})

In [23]: df
Out[23]: 
  col_1
0     1
1     2
2   'A'
3  4.22

In [24]: df["col_1"].apply(type).value_counts()
Out[24]: 
<class 'str'>    4
Name: col_1, dtype: int64
In [25]: df2 = pd.read_csv(StringIO(data))

In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")

In [27]: df2
Out[27]:
   col_1
0   1.00
1   2.00
2    NaN
3   4.22

In [28]: df2["col_1"].apply(type).value_counts()
Out[28]:
<class 'float'>    4
Name: col_1, dtype: int64
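Tying read_csv back to the duplicate-removal theme: pandas can drop repeated rows directly with `DataFrame.drop_duplicates` (a sketch, assuming pandas is installed):

```python
import pandas as pd
from io import StringIO

data = "col1,col2,col3\na,b,1\na,b,2\na,b,1\nc,d,3"
df = pd.read_csv(StringIO(data))

# drop_duplicates keeps the first occurrence of each repeated row.
deduped = df.drop_duplicates()
print(len(deduped))  # 3
```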

Suggestion : 4

Last Updated : 30 May, 2022

Examples:  

Input: [2, 4, 10, 20, 5, 2, 20, 4]
Output: [2, 4, 10, 20, 5]

Input: [28, 42, 28, 16, 90, 42, 42, 28]
Output: [28, 42, 16, 90]

Output:  

[2, 4, 10, 20, 5]

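The examples above can be produced with an order-preserving dedup, for example this sketch:

```python
def remove_duplicates(lst):
    # Keep the first occurrence of each element, preserving order.
    seen = set()
    result = []
    for item in lst:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(remove_duplicates([2, 4, 10, 20, 5, 2, 20, 4]))       # [2, 4, 10, 20, 5]
print(remove_duplicates([28, 42, 28, 16, 90, 42, 42, 28]))  # [28, 42, 16, 90]
```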