Introduction

I have a tedious task to perform which involves reading information from SVN and updating a review checklist with the information. The ‘hard’ part of this whole process is the actual review. The ‘easy’ part is looking up the revision information, paths, file names, etc and putting them into the spreadsheet. Despite the fact that this is the ‘easy’ part it’s also very easy to mess up. This tedious copy and paste process has many steps, requires significant working memory and can be easily confusing. The worst part is that even if you do the ‘hard’ part well, you need to do the ‘easy’ part perfectly because if you mess up a revision or a path there’s no way to link the work you did to the actual artifact being reviewed.

So, why not create a script?

Python can read SVN and filesystem information. It can read Excel files. It can write excel files. So, why not automate it?

To do so, I will need to be able to write Excel files with Python 3. More specifically, I will need to modify an existing template with information I glean from SVN and other sources.

Thus, this post will contain information on how to read, write and modify Excel files with Python 3

OpenPyXL

If you need a library that can read and write xlsx files (not just xls files), you want OpenPyXL. I don’t think there’s any disagreemnt on that on Stack Overflow. As far as I can tell, it’s the best way to write .xlsx files in Python.

Installation

OpenPyXL installed just fine using pip:

C:\users\sfrieder>pip3 install openpyxl
Collecting openpyxl
  Downloading https://files.pythonhosted.org/packages/f9/d8/be9dc2b17ba47f1db9032ed7e19915145b4c093f66bb36f0d919d2dc8ccf/openpyxl-3.0.4-py2.py3-none-any.whl (241kB)
     |████████████████████████████████| 245kB 172kB/s
Collecting jdcal (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/f0/da/572cbc0bc582390480bbd7c4e93d14dc46079778ed915b505dc494b37c57/jdcal-1.4.1-py2.py3-none-any.whl
Collecting et-xmlfile (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Installing collected packages: jdcal, et-xmlfile, openpyxl
  Running setup.py install for et-xmlfile ... done
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-3.0.4
WARNING: You are using pip version 19.2.3, however version 20.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Reading Excel Files

I’m going to base all of this off of a tutorial I found here.

It looks pretty simple for most things. Here’s their simple code on how to read a cell:

import openpyxl

book = openpyxl.load_workbook('sample.xlsx')

sheet = book.active

a1 = sheet['A1']
a2 = sheet['A2']
a3 = sheet.cell(row=3, column=1)

print(a1.value)
print(a2.value) 
print(a3.value)

Except I don’t see a good answer on how to select a sheet. I see the book.active thing and I don’t think it’s what I want. I want to select by name.

Selecting a Sheet by Name

Turns out this is simple too. See this StackOverflow answer. There’s a dictionary of sheets that you can index by name. Like this:

from openpyxl import load_workbook
wb2 = load_workbook('test.xlsx')
ws4 = wb2["New Title"]  

Getting a List of Sheet Names

This one’s easy too. I’m beginning to think using OpenPyXL will be easy.

from openpyxl import load_workbook
wb2 = load_workbook('test.xlsx')
print(wb2.sheetnames)
['Sheet2', 'New Title', 'Sheet1']

Writing Data to Cells

Once again, easy.

From the tutorial I mentioned earlier:

from openpyxl import Workbook

book = Workbook()
sheet = book.active

sheet['A1'] = 1
sheet.cell(row=2, column=2).value = 2

book.save('write2cell.xlsx')

It’s worth noting here that OpenPyXL uses 1-based indexing: the cell at row 2 column 2 is “B2”, not “C3”.

Creating a Sheet

Here’s how to create a new sheet in a workbook:

import openpyxl

wb = openpyxl.load_workbook("example.xlsx")
wb.create_sheet("Sheet2")

Writing a List of Values Into a Sheet

A good way to do this is:

sheet = budgetWb.sheets["My Sheet"]
header = ["Date","Description","Amount","Category","Comment"]
    for col,value in enumerate(headereader):
        sheet.cell(column = col+1,row = 1,value=str(value))

Looping Over All Rows in a Sheet

Getting a Range of Cells

There are times when you want to pluck a range of cells and work with them. Here’s how you can do that:

You can do it by rows. This code retrieves all rows between 3 and 10 (inclusive) and returns them as a list of lists :

import openpyxl

wbPath = "test.xls"

wb = openpyxl.load_workbook(wbPath)

sheet = wb.sheets["Sheet1"]

rows = testSheet.iter_rows(min_row = 3, max_row=10)

for row in rows:
    for cell in row:
        print(str(cell.value))
    

And, to do the same thing with columns:

import openpyxl

wbPath = "test.xls"

wb = openpyxl.load_workbook(wbPath)

sheet = wb.sheets["Sheet1"]

cols = testSheet.iter_cols(min_col = 3, max_col=10)

for col in cols:
    for cell in col:
        print(str(cell.value))
    

Retrieving Just Data (Not Formulas) From a Workbook

I had a bit of a shock when I opened a workbook, specified a range of cells and tried to get values: I was getting formulas instead of the result of formulas. There’s a way to avoid this: load the workbook as ‘data only’. You do that like this:

import openpyxl

wb = openpyxl.load_workbook(wbPath,data_only=True)

This way, when you acces the value of a cell, you get the result of the formula.

OpenPyXL Notes

OpenPyXL uses 1-based indexing: the cell at row 2 column 2 is “B2”, not “C3”.

xlrd, xlt and xlutils

There are multiple ways to read and write Excel files in Python. One way to do it is a collection of libraries that all work together:

  • xlrdfor reading
  • xlwt for writing, and
  • xlutils for miscellaneous functions

HOWEVER!!! MAJOR CAVEAT THAT BORKED MY ATTEMPT TO USE THESE LIBRARIES FOLLOWS!

xlwt CANNOT write xlsx or xlsm files - only the older xls files

I put a lot of effort into xlrd, xlwt and xlutils - all for nothing because my application requires that I write .xlsx files. The information here will stay, because it’s useful. How is it useful you ask? Well, if there’s one thing I’ve found it’s that you have to use the right tool for the job - and there’s sadly many tools out there to read and write Excel files in Python. It’s not enough to just pick one and say ‘this is the only one I’m going to use’. One big reason: speed. I’ve found that some libraries are much slower than others. Thus, if you only, for example, needed to read Excel files, xlrd might be perfect for you over another, more comprhensive library that is much slower. You really have to watch out on large files sometimes - using a different, less featureful, but faster library may be useful.

Installation

Here’s what I did on a Windows 10 PC with Python 3.81 installed:

pip3 install xlwt
pip3 install xlutils
pip3 install xlrd

I have both Python 2.7 and 3 installed, so in order to make sure that pip installed the libraries for Python 3 I had to use pip3 instead of pip.

Example Code

Rather than write a lot of notes about how to use the libraries, I’ll just show you an example program. It does two things: reads specific data from an Excel file, and finds and modifies a specific incorrect value within an Excel file.

For reference, the Excel data being used looks like this:

Name Telephone Address Notes
Don Juan 1234567890 1121 Nowhere St. Tokyo Japan I like this guy
Jean-Luc Picard NA France, somewhere Why does a Frenchman have an English accent?

The script will read the passed Excel file and print the data. Then, it will find and correct Jean-Luc’s Address entry and save the result in a new Excel file.

"""----------------------------------------------------------------------------
Excel Example
Author: Stephen Friederichs
Version 0.1

This script is an example of how to use Python to interact with Excel 
workbooks.

Example usage:

python xlExample.py -r <spreadsheet path>
 
Command-line switches:

-r, --read - Specify the path to an Excel file to read
-h, --help - Show this helpful information

"""

import xlwt
from xlutils.copy import copy as xlCopy   #I don't like just having a bare 'copy' around - make it specific
import textwrap
import getopt
import sys
import os
import logging
import xlrd
import shutil
import distutils.dir_util
import itertools

"""Global Variables"""
logFormatStr = '%(asctime)s -- %(funcName)s  - %(levelname)-8s %(message)s'
logLevel =  logging.DEBUG


"""Utility Functions - Common to all scripts"""

def prettyPrint(uglyString):
    """This function properly fomats docstrings for printing on the console"""
    
    #Remove all newlines
    uglyString = uglyString.replace('\n','').replace('\r','')
    #Use textwrap module to automatically wrap lines at 79 characters of text
    print(textwrap.fill(uglyString,width=79))
    
    
def help():
    for line in __doc__.splitlines()[4:]:
        prettyPrint(line)
    
def version():
    for line in __doc__.splitlines()[1:4]:
        prettyPrint(line)

"""Custom Functions"""

"""Return a list of tuples where each tuple is a row with the deata specified in the colHeaders list"""
def getSheetData(sheet,colHeaders):
    return zip(*[getColData(sheet,colHeader) for colHeader in colHeaders])
    
"""Return a column number for the column with the given colHeader"""
def getColNum(sheet,colHeader):
   
    headerRow=sheet.row(0)
                    
    for column,cell in enumerate(headerRow):
        if colHeader == str(cell.value):
            return int(column)
    
    raise ValueError("Cannot find " + str(colHeader) + " in sheet "+ str(sheet.name) +" among columns " + str(headerRow) )

"""Return a list of data values for the given header"""
def getColData(sheet,colHeader):
    return [str(cell.value) for cell in sheet.col(getColNum(sheet,colHeader))[1:]]
    
"""Return a list of sheet objects from the workbook object with the given sheetNames"""
def getWbSheets(xlWbPath,sheetNames):
    sheets = None
    xlWbPath = validatePath(xlWbPath)
    
    with xlrd.open_workbook(xlWbPath) as xlWbData:
        logging.debug("Reading spreadsheet @ %s",xlWbPath)
        sheets = {sheetName:sheet for (sheetName,sheet) in [(sheetName,xlWbData.sheet_by_name(sheetName)) for sheetName in sheetNames]}
    
    return sheets
        

"""Pass a list of sheet names to retrieve from an xlwt workbook object"""
def xlwtGetWbSheets(xlWbObj,sheets):
    retVal = {}
    try:
        for idx in itertools.count():
            sheet=xlWbObj.get_sheet(idx)
            if str(sheet.name) in sheets:
                retVal[sheet.name]=sheet
    except IndexError:
        pass
    return retVal
    
def formatPath(path):
    try:
        normPath = os.path.normpath(path)
        absPath = os.path.abspath(normPath)
    except ValueError as ex:
        logging.error("Couldn't format path %s : %s",str(path),str(ex))
        return None
    return absPath
        
def validatePath(path):
    formattedPath = formatPath(path)
    if not os.path.exists(path):
        raise ValueError("Path "+str(formattedPath)+" does not exist")
    else:
        return formattedPath
        
"""Main"""
        
def main():
    global logFormatStr
    global logLevel
    
    logging.basicConfig(format=logFormatStr,level=logLevel)
    
    """Main Function"""
    readPath=None
    
    version()

    try: 
        opts, args = getopt.getopt(sys.argv[1:], 'hr:', ['help','read'])    
    except getopt.GetoptError:
        print("Bad argument(s)")
        sys.exit(2)                 

    logging.debug("Opts/args: %s",str(opts))
    
    for opt, arg in opts:     
        logging.debug("Option %s, argument %s",str(opt),str(arg))
        if opt in ('-h', '--help'):     
            help()                         
            sys.exit(2)                 
        elif opt in ('-r','--read'):
            try:
                readPath = validatePath(formatPath(str(arg)))
            except ValueError:
                print("Specified path not found: "+str(arg))
                sys.exit(2)
        else: 
            print ("Invalid command-line option " + str(opt))
            help()
            sys.exit(2)

            
    #Validate command-line options
    
   
    if readPath is None:
        print("Must specify read path")
        help()
        sys.exit(2)
        
    try:
       
        if readPath is not None:
            #Approach one - not using any helper functions
            with xlrd.open_workbook(readPath) as xlWbData:
                print("Sheets in the workbook:")
                print(xlWbData.sheet_names())
                
                participantsSheet = xlWbData.sheet_by_name("Participants")
                print("\r\nHeader row data:")
                print(str(participantsSheet.row(0)))
                
                print("\r\nNumber of rows (total):")
                print(str(participantsSheet.nrows))
                
                print("\r\nNumber of data rows:")
                print(str(participantsSheet.nrows-1))
                
                print("\r\nData rows:")
                for i in range(1,participantsSheet.nrows):
                    print(str(participantsSheet.row(i)))
                
                print("\r\nNames only:")
                print(str(participantsSheet.col(0)[1:]))    #The indexing removes the header row
                
            #Approach two - using my helper functions
            #These functions help you read data from spreadsheets by names, not numbers
            sheets = getWbSheets(readPath,["Participants"])
            participants = getSheetData(sheets["Participants"],["Name","Telephone","Address","Notes"])
            
            for participant in participants:
                print(str(participant))
                
            #Step 2 - Modifying and saving in a new file
            
            print("Modifying " + str(readPath))
            #Open the original
            rb = xlrd.open_workbook(readPath)
            
            #Copy it to a new, writeable workbook object
            wb = xlCopy(rb)
            
            #Find the cell coordinates for Jean-Luc's address
            #Sadly, it doesn't appear that xlwt has the ability to read data - at least, not as well
            #as xlrd. Thus, you have to figure out where the data you want to change is located with xlrd
            #before changing it in the new workbook with xlwt
            
            #So, to do this, we'll find out which cell contains the data we want to change using xlrd
            
            participantsSheet = rb.sheet_by_name("Participants")
            
            colIdx = None
            rowIdx = None
            
            #Find the column number for 'Address'
            for idx,header in enumerate(participantsSheet.row(0)):
                if "Address" in str(header):
                    colIdx = idx
                    break
                    
            #Find the row that contains Jean-Luc        
            for idx in range(1,participantsSheet.nrows):
                if "Jean-Luc" in str(participantsSheet.row(idx)):
                    rowIdx = idx
                    break
            print("Jean-Luc's address cell is on the Participants sheet at "+str(rowIdx)+","+str(colIdx))
            
            #Find the participants sheet in the writeable object
            #Oops! xlwt doesn't have a sheet_by_name option, so you have to find the sheet manually
            #https://stackoverflow.com/questions/14587271/accessing-worksheets-using-xlwt-get-sheet-method
            participantsSheet = None
            try:
                for idx in itertools.count():
                    sheet=wb.get_sheet(idx)
                    if "Participants" in str(sheet.name):
                        participantsSheet=sheet
            except IndexError:
                pass
                
            if participantsSheet is None:
                raise ValueError("No Participants sheet in "+str(modifyPath))
                    
            participantsSheet.write(rowIdx,colIdx,"Le Barre, France")

            
            #Save the modified file
            wb.save("modified.xlsx")
            print("Saved modified file at "+str(formatPath("modified.xlsx")))
            
            pass 
    except Exception as e:
        raise

        
       
if __name__=="__main__":
    main()    
        

Notes

I will document anything interesting or unique about specific workflows with xlrd, xlwt and xlutils here.

Modifying Existing Excel Files With xlrd and xlwt

So, it seems, according to this answer, that the way to do this is to:

  1. Use xlrd to read the excel file (i.e., ‘workbook’)
  2. Copy the workbook with xlutils copy function
  3. Modify the workbook in memory
  4. Save it to a new file.

The code snippet they used is here:

#xlrd, xlutils and xlwt modules need to be installed.  
#Can be done via pip install <module>
from xlrd import open_workbook
from xlutils.copy import copy

rb = open_workbook("names.xls")
wb = copy(rb)

s = wb.get_sheet(0)
s.write(0,0,'A1')
wb.save('names.xls')

I’ll create a simple Excel file with some information in it, load it, modify it, and finally, save it in a new file.

Writing to Excel Files

xlwt CANNOT write .xlsx or .xlsm files, only .xls!!

It’s worth noting that xlwt and xlrd do not have equivalent APIs. I have no idea why this is, but the fact is that there’s no easy way to iterate over the sheets in a workbook or to retrieve a sheet by name with xlwt. xlrd has it just fine, but not xlwt. So you need a workaround, found this.

It’s likely you or I will find more issues such as this while working with xlwt.

The upshot of most of this is that you’ll have to do the heavy-duty reading using xlrd and then sparingly use xlwt to modify the data.

Resources

SO - Edit existing excel workbook and sheets with xlrd and xlwt

xkrd API Reference

SO - Workaround to get sheets by name with xlwt

OpenPyXL Examples