CSV processing using Python

Throughout this tutorial, we will explore methods for reading, writing, and editing CSV (Comma-Separated Values) files using the Python standard library “csv”.

Because CSV files are so widely used for storing and exchanging tabular data, these methods are useful to programmers across many fields of work.

CSV files are not strictly standardized. Regardless, there are some common structures seen in all sorts of CSV files. In most cases, the first line of a CSV file is reserved for the column headers.

Each following line forms a row of data whose fields are in the same order as the headers in the first row. As the name suggests, data values are usually separated by a comma; however, other delimiters can be used.

Lastly, some CSV files wrap a field in double quotes when it contains special characters, such as the delimiter itself.

All the examples used throughout this tutorial will be based on the following dummy data files: basic.csv, multiple_delimiters.csv, and new_delimiter.csv.

 

Read CSV (With Header or Without)

First, we will examine the simplest case: reading an entire CSV file and printing each item read in.

import csv

path = "data/basic.csv"

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile)

   for row in reader:

      for col in row:

         print(col,end=" ")

      print()

Let us break down this code. The only library needed to work with CSV files is Python's built-in “csv” library. After importing the library and setting the path of our CSV file, we call the “open()” function to open the file. Passing newline='' lets the “csv” library handle line endings itself, as its documentation recommends.

The parsing of the CSV file is handled by the “csv.reader()” method, which is discussed in detail later. The reader reads the file line by line as we loop over it.

Each row of our CSV file will be returned as a list of strings that can be handled in any way you please. Here is the output of the code above:

Output for reading the basic.csv with the header.

In practice, we often want to skip the column headers, which are conventionally stored on the first line of the CSV file.

Luckily, the reader object returned by “csv.reader()” tracks how many lines have been read in its “line_num” attribute. Using this attribute, we can simply skip the first line of the CSV file.

import csv

path = "data/basic.csv"

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile)

   for row in reader:

      if reader.line_num != 1:  # skip the header, which is on line 1

         for col in row:

            print(col,end=" ")

         print()

Output for reading the basic.csv without the header.
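An alternative to checking “line_num” on every iteration is to consume the header once with the built-in “next()” before looping. The sketch below is only an illustration: it reads hypothetical sample data from an in-memory string rather than the real basic.csv, whose contents are not reproduced here.

```python
import csv
import io

# Hypothetical stand-in for data/basic.csv (the real file contents are assumed).
sample = '"Name","Age","Favorite Color"\nAlice,30,Blue\nBob,25,Green\n'

with io.StringIO(sample, newline='') as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader, None)  # consume the header row once; None if the input is empty
    data_rows = list(reader)     # every remaining row is data

print(header)
print(data_rows)
```

This also keeps the header available in its own variable in case it is needed later, for example when writing the data back out.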

 

CSV Reader Encoding

In the code above, we create an object called “reader” which is assigned the value returned by “csv.reader()”.

reader = csv.reader(csvfile)

The “csv.reader()” method takes a few useful parameters. We will only focus on two: the “delimiter” parameter and the “quotechar” parameter. By default, these parameters take the values “,” and “"”, respectively.

We will discuss the delimiter parameter in the next section.

The “quotechar” parameter is a single character that is used to enclose fields containing special characters, such as the delimiter. In our example, all our header fields have these quote characters around them.

Quoting is only required when a field contains the delimiter, the quote character, or a line break, but it is harmless around other fields such as the header “Favorite Color”. Notice how the result changes if we change our “quotechar” to the “|” symbol.

import csv

path = "data/basic.csv"

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile, quotechar='|')

   for row in reader:

      for col in row:

         print(col,end="\t")

      print()

Result when changing the quotechar attribute

Changing the “quotechar” from “"” to “|” means the double quotes are no longer treated as quoting characters, so they appear as literal text around the headers.
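The effect can be reproduced without any file on disk. The following is a minimal sketch that parses one hypothetical header line (an assumption, not the actual contents of basic.csv) twice: once with the default “quotechar” and once with “|”.

```python
import csv
import io

# Hypothetical header line resembling the quoted headers described above.
line = '"Name","Favorite Color"\n'

# Default quotechar ('"'): the quotes are consumed by the parser.
default_row = next(csv.reader(io.StringIO(line)))

# quotechar='|': the double quotes are now ordinary data.
pipe_row = next(csv.reader(io.StringIO(line), quotechar='|'))

print(default_row)  # ['Name', 'Favorite Color']
print(pipe_row)     # ['"Name"', '"Favorite Color"']
```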

 

Reading a Single Column (Without Pandas)

Reading a single column from a CSV is simple using our method above. Each row is a list whose elements are the column values for that row.

Therefore, instead of printing out the entire row, we will only print out the desired column element from each row. For our example, we will print out the second column.

import csv

path = "data/basic.csv"

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile, delimiter=',')

   for row in reader:

      print(row[1])

Output for reading a single column of text
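Indexing with row[1] raises an IndexError if a row is blank or shorter than expected. As a hedged sketch, the hypothetical helper “read_column” below (not part of the “csv” library) collects one column into a list and skips rows that are too short. The sample data is made up for illustration.

```python
import csv
import io

# Hypothetical data; the second data row is deliberately short.
sample = 'Name,Age,City\nAlice,30,Paris\nBob,25\n'

def read_column(csvfile, index):
    """Return the values at position `index`, skipping rows that are too short."""
    return [row[index] for row in csv.reader(csvfile) if len(row) > index]

ages = read_column(io.StringIO(sample), 1)
cities = read_column(io.StringIO(sample), 2)

print(ages)    # ['Age', '30', '25']
print(cities)  # ['City', 'Paris']
```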

If you want to use pandas to read CSV files, you can check the pandas tutorial.

 

CSV Custom Delimiter

CSV files frequently use the “,” symbol to distinguish between data values. In fact, the comma symbol is the default delimiter for the csv.reader() method.

In practice though, data files may use other symbols to distinguish between data values. For example, examine the contents of a CSV file (called new_delimiter.csv) which uses “;” to delimit between data values.

Reading in this CSV file to Python is simple if we alter the “delimiter” parameter of the “csv.reader()” method.

reader = csv.reader(csvfile, delimiter=';')

Notice how we changed the delimiter argument from “,” to “;”. The “csv.reader()” method will parse our CSV file as expected with this simple change.

import csv

path = "data/new_delimiter.csv"

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile, delimiter=';')

   for row in reader:

      for col in row:

         print(col,end="\t")

      print()

Output when parsing with a custom delimiter
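When the delimiter is not known in advance, the “csv” library's “Sniffer” class can try to guess it from a sample of the text. The sketch below uses hypothetical in-memory data. The guess is heuristic, and “sniff()” raises “csv.Error” when it cannot decide, so treat this as a convenience rather than a guarantee.

```python
import csv
import io

# Hypothetical semicolon-delimited data.
sample = 'Name;Age;City\nAlice;30;Paris\nBob;25;Rome\n'

# Restrict the guess to the candidate delimiters we expect to see.
dialect = csv.Sniffer().sniff(sample, delimiters=';,')

# The detected dialect can be passed straight to csv.reader().
rows = list(csv.reader(io.StringIO(sample), dialect))

print(dialect.delimiter)
print(rows)
```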

 

CSV with Multiple Delimiters

The standard “csv” package in Python cannot handle multiple delimiters. To deal with such cases, we will use the standard package “re”.

The following example parses the CSV file “multiple_delimiters.csv”. Looking at the structure of the data in “multiple_delimiters.csv”, we see the headers are delimited with commas and the remaining rows are delimited with a comma, a vertical bar, and the text “Delimiter”.

The core of the parsing is the “re.split()” method, which takes two strings as arguments: a regular expression pattern describing the delimiters and the string to be split at those delimiters. First, let us see the code and output.

import re

path = "data/multiple_delimiters.csv"

with open(path, newline='') as csvfile:

   for row in csvfile:

      row = re.split('Delimiter|[|]|,|\n', row)

      for field in row:

         print(field, end='\t')

      print()

Output when handling multiple delimiters

The key component of this code is the first parameter of “re.split()”.

 'Delimiter|[|]|,|\n'

Within the pattern, the “|” symbol separates the alternatives, each of which is a split point. Since this symbol is also a delimiter in our text, we must put brackets around it to match it literally.

Lastly, we put the “\n” character as a delimiter so that the newline will not be included in the final field of each row. To see the importance of this, examine the result without “\n” included as a split point.

import re

path = "data/multiple_delimiters.csv"

with open(path, newline='') as csvfile:

   for row in csvfile:

      row = re.split('Delimiter|[|]|,', row)

      for field in row:

         print(field, end='\t')

      print()

Output 2 when handling multiple delimiters

Notice the extra spacing between each row of our output.
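Splitting on “\n” removes the newline but leaves an empty field at the end of each row. One possible alternative, sketched below on a hypothetical line of text, is to strip the line ending first and then split on the real delimiters only.

```python
import re

# Hypothetical row mixing all three delimiters.
line = 'a,b|cDelimiterd\n'

# Splitting on "\n" removes the newline but leaves an empty trailing field.
with_newline = re.split(r'Delimiter|[|]|,|\n', line)

# Stripping the line ending first avoids that empty field.
stripped = re.split(r'Delimiter|[|]|,', line.rstrip('\r\n'))

print(with_newline)  # ['a', 'b', 'c', 'd', '']
print(stripped)      # ['a', 'b', 'c', 'd']
```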

 

Writing to a CSV File

Writing to a CSV file will follow a similar structure to how we read the file. However, instead of printing the data, we will use the “writer” object within “csv” to write the data.

First, we will do the simplest example possible: creating a CSV file and writing a header and some data in it.

import csv

path = "data/write_to_file.csv"

with open(path, 'w', newline='') as csvfile:

   writer = csv.writer(csvfile)

   writer.writerow(['h1'] + ['h2'] + ['h3'])

   i = 0

   while i < 5:

      writer.writerow([i] + [i+1] + [i+2])

      i = i+1

Output in write_to_file.csv

In this example, we instantiate the “writer” object with the “csv.writer()” method. After doing so, simply calling the “writerow()” method will write the list of values onto the next row in our file, converting non-string values such as our integers to strings, with the default delimiter “,” placed between each field.
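To see exactly what the writer produces, including how it quotes a field that contains the delimiter, we can write to an in-memory buffer instead of a file. This is a small sketch with made-up values.

```python
import csv
import io

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)

writer.writerow(['h1', 'h2', 'h3'])
writer.writerow([0, 'a,b', 2])  # integers become strings; 'a,b' is quoted automatically

output = buffer.getvalue()
print(repr(output))  # 'h1,h2,h3\r\n0,"a,b",2\r\n'
```

Note that the writer ends each row with “\r\n” by default, which is why the file should be opened with newline='' to avoid extra blank lines on some platforms.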

Editing the contents of an existing CSV file will require the following steps: read in the CSV file data, edit the lists (update information, append new information, delete information), and then write the new data back to the CSV file.

For our example, we will be editing the file created in the last section “write_to_file.csv”.

Our goal will be to double the values of the first row of data, delete the second row, and append a row of data at the end of the file.

import csv

path = "data/write_to_file.csv"

#Read in Data
rows = []

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile)

   for row in reader:

      rows.append(row)

#Edit the Data
rows[1] = ['0','2','4']

del rows[2]

rows.append(['8','9','10'])

#Write the Data to File
with open(path, 'w', newline='') as csvfile:

   writer = csv.writer(csvfile)

   writer.writerows(rows)

Second output in write_to_file.csv

Using the techniques discussed in the prior sections, we read the data and stored the lists in a variable called “rows”. Since all the elements were Python lists, we made the edits using standard list methods.

We opened the file in the same manner as before. The only difference when writing was our use of the “writerows()” method instead of the “writerow()” method.

 

Search & Replace CSV File

We have created a natural way to search and replace a CSV file through the process discussed in the last section. In the example above, we read each line of the CSV file into a list of lists called “rows”.

Since “rows” is a list object, we can use Python's list methods to edit our CSV data before writing it back to a file. We used some list methods in the example, but another useful tool is the string method “str.replace()”, which can be applied to each field and takes two arguments: first the substring to be found, and then the string to replace it with.

For example, to replace all ‘3’s with ‘10’s we could have done

for row in rows:

   row[:] = [field.replace('3','10') for field in row]  # update the list in place
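Keep in mind that “str.replace()” works on substrings, so a field such as ‘13’ would also be changed. The sketch below, on hypothetical data, contrasts substring replacement with replacing only the fields that are exactly equal to ‘3’.

```python
# Hypothetical rows as they would come back from csv.reader().
rows = [['h1', 'h2'], ['3', '13'], ['30', '3']]

# Substring replacement also changes '13' and '30'.
substring = [[field.replace('3', '10') for field in row] for row in rows]

# Exact-match replacement only touches fields equal to '3'.
exact = [['10' if field == '3' else field for field in row] for row in rows]

print(substring)  # [['h1', 'h2'], ['10', '110'], ['100', '10']]
print(exact)      # [['h1', 'h2'], ['10', '13'], ['30', '10']]
```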

Similarly, if the data is imported as a dictionary object (as discussed later), we can use Python’s dictionary methods to edit the data before re-writing to the file.

 

Python Dictionary to CSV (DictWriter)

Python's “csv” library also provides a convenient method for writing dictionaries into a CSV file.

import csv

Dictionary1 = {'header1': '5', 'header2': '10', 'header3': '13'}

Dictionary2 = {'header1': '6', 'header2': '11', 'header3': '15'}

Dictionary3 = {'header1': '7', 'header2': '18', 'header3': '17'}

Dictionary4 = {'header1': '8', 'header2': '13', 'header3': '18'}

path = "data/write_to_file.csv"

with open(path, 'w', newline='') as csvfile:

   headers = ['header1', 'header2', 'header3']

   writer = csv.DictWriter(csvfile, fieldnames=headers)

   writer.writeheader()

   writer.writerow(Dictionary1)

   writer.writerow(Dictionary2)

   writer.writerow(Dictionary3)

   writer.writerow(Dictionary4)

Output of Dictionary write in write_to_file.csv

In this example, we have four dictionaries with the same keys. It is crucial that the keys match the header names you want in the CSV file.

Since we will be inputting our rows as dictionary objects, we instantiate our writer object with the “csv.DictWriter()” method and specify our headers.

After this is done, it is as simple as calling the “writerow()” method to begin writing to our CSV file.
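When the dictionaries are already collected in a list, “writerows()” can write them all in one call. The sketch below writes to an in-memory buffer and uses the “restval” parameter of “csv.DictWriter()” to fill in a key that is missing from one of the hypothetical records.

```python
import csv
import io

records = [
    {'header1': '5', 'header2': '10', 'header3': '13'},
    {'header1': '6', 'header3': '15'},  # 'header2' is missing on purpose
]

buffer = io.StringIO(newline='')
headers = ['header1', 'header2', 'header3']

# restval is the value written for keys missing from a dictionary.
writer = csv.DictWriter(buffer, fieldnames=headers, restval='')
writer.writeheader()
writer.writerows(records)  # write every dictionary in one call

output = buffer.getvalue()
print(repr(output))
```

By default, a dictionary containing a key that is not listed in “fieldnames” raises a ValueError. Passing extrasaction='ignore' drops such keys instead.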

 

CSV to Python Dictionary (DictReader)

The CSV library also provides an intuitive “csv.DictReader()” method which inputs the rows from a CSV file into a dictionary object. Here is a simple example.

import csv

path = "data/basic.csv"

with open(path, newline='') as csvfile:

   reader = csv.DictReader(csvfile, delimiter=',')

   for row in reader:

      print(row)

Output when reading with DictReader()

As we can see in the output, each row was stored as a dictionary object.
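Because each row is a dictionary keyed by the header names, fields can be accessed by name instead of by position. A minimal sketch with hypothetical data follows. Note that every value arrives as a string and must be converted explicitly.

```python
import csv
import io

# Hypothetical data; the first line supplies the dictionary keys.
sample = 'Name,Age\nAlice,30\nBob,25\n'

reader = csv.DictReader(io.StringIO(sample))
people = list(reader)

names = [row['Name'] for row in people]
ages = [int(row['Age']) for row in people]  # convert the string values to integers

print(reader.fieldnames)  # ['Name', 'Age']
print(names, ages)
```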

 

Split Large CSV File

If we wish to split a large CSV file into smaller CSV files, we use the following steps: input the file as a list of rows, write the first half of the rows to one file and write the second half of the rows to another.

Here is a simple example where we turn “basic.csv” into “basic_1.csv” and “basic_2.csv”.

import csv

path = "data/basic.csv"

#Read in Data
rows = []

with open(path, newline='') as csvfile:

   reader = csv.reader(csvfile)

   for row in reader:

      rows.append(row)

Number_of_Rows = len(rows)

#Write Half of the Data to a File
path = "data/basic_1.csv"

with open(path, 'w', newline='') as csvfile:

   writer = csv.writer(csvfile)

   writer.writerow(rows[0]) #Header

   for row in rows[1:int((Number_of_Rows+1)/2)]:

      writer.writerow(row)

#Write the Second Half of the Data to a File
path = "data/basic_2.csv"

with open(path, 'w', newline='') as csvfile:

   writer = csv.writer(csvfile)

   writer.writerow(rows[0]) #Header

   for row in rows[int((Number_of_Rows+1)/2):]:

      writer.writerow(row)

basic_1.csv:

Output stored in basic_1.csv

basic_2.csv:

Output stored in basic_2.csv

In these examples, no new methods were used. Instead, we used two separate for loops to write the first and second halves of the rows to the two CSV files.
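The same idea extends to more than two output files. As a sketch, the hypothetical helper “split_rows” below (a name introduced here for illustration) divides the data rows into chunks of a chosen size and repeats the header at the top of each chunk.

```python
def split_rows(rows, chunk_size):
    """Split rows (header first) into chunks of at most chunk_size data rows.

    Each returned chunk starts with a copy of the header row.
    """
    header, data = rows[0], rows[1:]
    return [[header] + data[i:i + chunk_size]
            for i in range(0, len(data), chunk_size)]

# Hypothetical rows as they would come back from csv.reader().
rows = [['h1', 'h2'], ['1', '2'], ['3', '4'], ['5', '6']]
chunks = split_rows(rows, 2)

print(chunks)
```

Each chunk can then be passed to “writerows()” on a writer opened for its own output file.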

