python joining csv files where key is first column value


Something like

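import csv
from collections import OrderedDict

# Read b.csv into a dict keyed on the first column
with open('b.csv', newline='') as f:
    r = csv.reader(f)
    dict2 = {row[0]: row[1:] for row in r}

# Read a.csv the same way; OrderedDict keeps the row order of a.csv
with open('a.csv', newline='') as f:
    r = csv.reader(f)
    dict1 = OrderedDict((row[0], row[1:]) for row in r)

# Merge: rows from a.csv come first, then fields from b.csv are appended to matching keys
result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.items():
        result.setdefault(key, []).extend(value)

# Write one combined row per key
with open('ab_combined.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for key, value in result.items():
        w.writerow([key] + value)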

produces

john, red, 34
andrew, green, 18
tonny, black, 50, driver, new york
jack, yellow, 27
phill, orange, 45, scientist, boston
kurt, blue, 29
mike, pink, 61
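For comparison, the same join on the first column can be sketched with pandas. This is only an illustration: the in-memory sample data and the column names (name, color, age, job, city) are assumptions modelled on the output above, not the real contents of a.csv and b.csv.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a.csv and b.csv (no header row)
a_csv = io.StringIO("john,red,34\ntonny,black,50\nphill,orange,45\n")
b_csv = io.StringIO("tonny,driver,new york\nphill,scientist,boston\n")

# header=None tells pandas the files have no header line;
# names= assigns illustrative column names (assumed, not from the source files)
a = pd.read_csv(a_csv, header=None, names=["name", "color", "age"])
b = pd.read_csv(b_csv, header=None, names=["name", "job", "city"])

# how="outer" keeps every key that appears in either file
combined = a.merge(b, on="name", how="outer")
print(combined)
```

Unlike the csv-module version, which writes shorter rows for keys missing from b.csv, this produces a rectangular table in which the missing job and city values are NaN.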

Suggestion : 2

In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the data into Python and combine the files to make one new data frame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.

In many “real world” situations, the data that we want to use come in multiple files. We often need to combine these files into a single DataFrame to analyze the data. The pandas package provides various methods for combining DataFrames including merge and concat.

Have a look at the vertical_stack DataFrame. Notice anything unusual? The row indexes for the two data frames survey_sub and survey_sub_last10 have been repeated. We can reindex the new DataFrame using the reset_index() method.

To work through the examples below, we first need to load the species and surveys files into pandas DataFrames. Before we start, we will make sure that the required libraries are correctly installed.

!pip install pandas matplotlib
import pandas as pd

# Treat only empty strings as missing values
surveys_df = pd.read_csv("surveys.csv",
                         keep_default_na=False, na_values=[""])
surveys_df

When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will place the columns of the second DataFrame to the right of the first DataFrame. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. When we stack horizontally, we want to make sure what we are doing makes sense (i.e. the data are related in some way).

# Stack the DataFrames on top of each other
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)

# Place the DataFrames side by side
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
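To make the effect of the axis argument and of the repeated row indexes concrete, here is a small self-contained sketch. The two tiny DataFrames and their column names (record_id, weight) are hypothetical stand-ins for survey_sub and survey_sub_last10.

```python
import pandas as pd

# Hypothetical stand-ins for survey_sub and survey_sub_last10
first = pd.DataFrame({"record_id": [1, 2, 3], "weight": [40, 48, 29]})
last = pd.DataFrame({"record_id": [8, 9, 10], "weight": [22, 35, 41]})

# axis=0: rows of `last` go under rows of `first`; the index labels repeat
stacked = pd.concat([first, last], axis=0)
print(stacked.index.tolist())        # [0, 1, 2, 0, 1, 2]

# reset_index(drop=True) renumbers the rows and discards the old labels
stacked_clean = stacked.reset_index(drop=True)
print(stacked_clean.index.tolist())  # [0, 1, 2, 3, 4, 5]

# axis=1: columns are placed side by side, aligned on the index labels
side_by_side = pd.concat([first, last], axis=1)
print(side_by_side.shape)            # (3, 4)
```

Passing ignore_index=True to pd.concat is another way to get a fresh 0..n-1 index in a single step.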

We can use the to_csv method to export a DataFrame in CSV format. By default, to_csv('out.csv') saves the file into the current working directory. We can save it to a different folder by adding the folder name and a slash to the file name: vertical_stack.to_csv('foldername/out.csv'). We use index=False so that pandas doesn’t include the index number for each line.

# Write DataFrame to CSV (the 'output' folder must already exist)
vertical_stack.to_csv('output/out.csv', index=False)
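One way to check that an exported file reads back properly is a simple round trip: write the DataFrame, read it again, and compare. The sketch below uses a small made-up DataFrame and a temporary folder (both assumptions for illustration) so that it does not depend on the survey files.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for vertical_stack
df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["M", "F", "M"],
    "weight": [40.0, 48.0, 29.0],
})

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "out.csv")
    # index=False keeps the row labels out of the file
    df.to_csv(out_path, index=False)
    # Read the file back while the temporary folder still exists
    reloaded = pd.read_csv(out_path)

# equals() compares both the values and the column dtypes
round_trip_ok = df.equals(reloaded)
print(round_trip_ok)
```

If the file had been written without index=False, the reloaded DataFrame would gain an extra unnamed column holding the old index, and the comparison would fail.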

Suggestion : 3

Learn how to concatenate two DataFrames together (append one DataFrame to a second DataFrame).

The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

Learn how to join two DataFrames together using a uniqueID found in both DataFrames.
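A minimal sketch of an inner join with pd.merge follows. The miniature tables and the Journal_id column name are hypothetical; the real articles.csv and journals.csv may use a different key column.

```python
import pandas as pd

# Hypothetical miniature versions of the articles and journals tables
articles = pd.DataFrame({
    "Title": ["Paper A", "Paper B", "Paper C"],
    "Journal_id": [1, 2, 99],  # 99 has no matching journal
})
journals = pd.DataFrame({
    "id": [1, 2, 50],
    "Journal_Title": ["Agriculture", "Agronomy", "Water"],
})

# how="inner" keeps only rows whose key appears in both DataFrames
merged_inner = pd.merge(articles, journals,
                        left_on="Journal_id", right_on="id",
                        how="inner")
print(merged_inner[["Title", "Journal_Title"]])
```

"Paper C" is dropped because its key 99 is absent from journals, and the journal "Water" is dropped because no article refers to id 50.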

import pandas as pd

# Treat only empty strings as missing values
articles_df = pd.read_csv('articles.csv',
                          keep_default_na=False, na_values=[""])
articles_df
      id  Title  \
0      0  The Fisher Thermodynamics of Quasi-Probabilities
1      1  Aflatoxin Contamination of the Milk Supply: A...
2      2  Metagenomic Analysis of Upwelling-Affected Bra...
...
999    1  2015
1000  11  2015

[1001 rows x 16 columns]
journals_df = pd.read_csv('journals.csv')
journals_df
    id     ISSN-L      ISSNs  PublisherId  \
0    0  2056-9890  2056-9890            1
1    1  2077-0472  2077-0472            2
2    2  2073-4395  2073-4395            2
..
49  49  1999-4915  1999-4915            2
50  50  2073-4441  2073-4441            2

    Journal_Title
0   Acta Crystallographica Section E Crystallograp...
1   Agriculture
2   Agronomy
..
49  Viruses
50  Water
# read in the first 10 rows of the articles table
articles_sub = articles_df.head(10)
# grab the 10 rows that precede the final row
articles_sub_last10 = articles_df[-11:-1]
# reset the index values so the second DataFrame appends properly;
# drop=True avoids adding a new index column with the old index values
articles_sub_last10 = articles_sub_last10.reset_index(drop=True)
# stack the DataFrames on top of each other
vertical_stack = pd.concat([articles_sub, articles_sub_last10], axis=0)

# place the DataFrames side by side
horizontal_stack = pd.concat([articles_sub, articles_sub_last10], axis=1)