Tricks and Tweaks of Open-Source World

Monday, January 13, 2014

Create neat excel files in Python using XlsxWriter

Working with Excel files is not always a great pleasure for me, but regardless of my feelings, it's a great tool for lots of situations where dealing with data is easier and faster in Excel. As a programmer, I tend to avoid handling the data directly and instead build some kind of software layer around the Excel files. If you are in the same boat and love Python, then you must use XlsxWriter.

XlsxWriter, as the name suggests, is a Python package for creating Excel files (it writes new files; it cannot read or modify existing ones). It supports the latest 'xlsx' file format, which is a must-have feature as most Excel files are now in this format. The list of supported features is huge, but the ones that are important to me are below:

Compatibility with XLSX file format
Formulas
Formatting support
Data Validation
Defined names
Charts
Outlining and Grouping

With most open-source tools and packages you won't find a lot of documentation, and you have to dig through auto-generated API pages or the source code to find out which functions are available and what their arguments mean. That is not the case for XlsxWriter. It has a great set of 'Intro' documents (Getting Started, Tutorial-1, Tutorial-2, Tutorial-3) that help you get started quickly. Once you are hooked on XlsxWriter, you can go through the rest of the documentation to find out how to use the specific feature you are interested in. The package also ships with 40+ examples showing how to use its various features to generate great Excel files.

Tuesday, May 21, 2013

Advanced Argument Parsing in Python: Reading Args from a File

Python is a great environment for writing small scripts that make life a little easier. In no time these scripts can become a full-fledged tool used by a lot of people, and now you have to keep things organized in your script/tool. One of the most important parts of a tool's interface is how it handles command line arguments. Python's standard library provides the feature-rich argparse module, which gives your tool a standardized way to manage command line arguments. Some of the key features of the argparse module are:

Parse positional (required) arguments and optional arguments
Generate a nicely formatted -h (help) or usage output and customize it
Parse not only from sys.argv but also from a list or string in your script
Read arguments from one or multiple files
Custom argument types, for example reading csv style arguments or dictionary style arguments
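A couple of the features listed above can be sketched in a few lines. This is a minimal illustration, not part of any real tool: the csv_list helper, the argument names and the sample values are all made up for the example.

```python
import argparse


def csv_list(value):
    """Hypothetical custom type: turn 'a,b,c' into ['a', 'b', 'c']."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = argparse.ArgumentParser(description="Hypothetical demo tool")
parser.add_argument("input")                            # positional (required)
parser.add_argument("--verbose", type=int, default=0)   # optional, converted to int
parser.add_argument("--tags", type=csv_list, default=[])  # optional, csv style

# Parse an explicit list instead of sys.argv[1:].
args = parser.parse_args(["data.txt", "--verbose", "2", "--tags", "fast, debug,"])
print(args.input, args.verbose, args.tags)
```

Any callable that takes a string and returns a value can be used as type, which is what makes csv or dictionary style arguments easy to support.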

There are many more features not mentioned here, and you can always take a look at the argparse documentation to find out more. In this blog post I am going to cover a feature that is really useful: reading arguments from a file. In my experience, as a script/tool keeps growing, we keep adding more and more arguments until the final command line becomes bloated and spans more than a few lines of the terminal.

Enabling Argparse to read from file

The argparse module provides built-in support for reading arguments from a file. To enable this feature, pass the fromfile_prefix_chars argument with a valid character when the ArgumentParser object is created, as shown in the example below:

parser = argparse.ArgumentParser(fromfile_prefix_chars='@')

This enables the parser object to treat any argument starting with the '@' character as a file and read arguments from that file. For example, a default.args file has the following content:

--verbose=1
--debug=0
--enable-all
--out=output.txt

This default.args file can be passed to your tool with the '@' prefix, and the ArgumentParser object will read all the arguments in the file as if they had been passed on the command line:

$ ./my_tool.py @default.args

Running the above command sets verbose to 1 and debug to 0 on the namespace returned by parser.parse_args() (args.verbose and args.debug, if you name it args). Internally, the ArgumentParser object reads the file, converts its lines to arguments and places them into a common list that holds all arguments, both those passed on the command line and those read from default.args. One shortcoming of this feature is that, by default, each line of the file is read as exactly one argument, so an option and its value must be written as --verbose=1 rather than --verbose 1 (you can easily overcome this limitation by extending the ArgumentParser class, just keep reading..). Because all arguments end up in one list and a later value replaces an earlier one, we can override values from the file on the command line. For example,

$ ./my_tool.py @default.args --verbose 2

The above command sets args.verbose to 2 instead of the 1 specified in default.args. This lets users play with the arguments without changing the default argument files.
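The whole flow can be tried without touching the real command line. The sketch below is only an illustration: it writes a temporary default.args, assumes the four options are declared as shown (the post does not show the add_argument calls), and uses the --name=value form because the default parser reads one argument per line.

```python
import argparse
import os
import tempfile

parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
# Assumed option declarations, chosen to match the default.args example.
parser.add_argument("--verbose", type=int, default=0)
parser.add_argument("--debug", type=int, default=0)
parser.add_argument("--enable-all", action="store_true")
parser.add_argument("--out")

with tempfile.TemporaryDirectory() as tmp:
    args_file = os.path.join(tmp, "default.args")
    with open(args_file, "w") as f:
        # One argument per line, as the default file reader expects.
        f.write("--verbose=1\n--debug=0\n--enable-all\n--out=output.txt\n")

    # Equivalent of: ./my_tool.py @default.args
    from_file = parser.parse_args(["@" + args_file])
    # Equivalent of: ./my_tool.py @default.args --verbose 2
    overridden = parser.parse_args(["@" + args_file, "--verbose", "2"])

print(from_file.verbose, from_file.debug, from_file.enable_all, from_file.out)
print(overridden.verbose)
```

The later --verbose wins simply because it comes after the file's arguments in the combined argument list.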

Reading Multiple Files

Because all arguments are treated equally (whether they come from the command line or from a file), you can pass multiple file arguments, or create a hierarchy by adding a file argument within a file. As an example, let's create a debug.args file with the following content:

--debug=1
--loglevel=3

Now you can pass the two files on the command line as shown below:

$ ./my_tool.py @default.args @debug.args

This reads arguments from both default.args and debug.args, and because debug.args comes later, it overrides args.debug to be 1. Another option is to include @default.args at the top of debug.args, as shown below:

@default.args
--debug=1
--loglevel=3

Now you have to pass only one file on the command line, like this:

$ ./my_tool.py @debug.args
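Both styles can be checked with a short sketch. The option declarations and the debug_nested.args file name are assumptions made for this illustration; the nested reference is written as an absolute path here because argparse opens '@' files relative to the current working directory.

```python
import argparse
import os
import tempfile

parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
# Assumed option declarations for the example files.
parser.add_argument("--verbose", type=int, default=0)
parser.add_argument("--debug", type=int, default=0)
parser.add_argument("--loglevel", type=int, default=0)

with tempfile.TemporaryDirectory() as tmp:
    default_args = os.path.join(tmp, "default.args")
    debug_args = os.path.join(tmp, "debug.args")
    nested_args = os.path.join(tmp, "debug_nested.args")  # hypothetical name

    with open(default_args, "w") as f:
        f.write("--verbose=1\n--debug=0\n")
    with open(debug_args, "w") as f:
        f.write("--debug=1\n--loglevel=3\n")
    with open(nested_args, "w") as f:
        # First line pulls in default.args; later lines override it.
        f.write("@" + default_args + "\n--debug=1\n--loglevel=3\n")

    # Two files on the command line: the later file wins on conflicts.
    both = parser.parse_args(["@" + default_args, "@" + debug_args])
    # One file that includes the other.
    nested = parser.parse_args(["@" + nested_args])

print(both.verbose, both.debug, both.loglevel)
print(nested.verbose, nested.debug, nested.loglevel)
```

Either way the parser sees the same flat sequence of arguments, so both calls produce the same values.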

Supporting Comments in a File

Now that you have created multiple files, you'll find yourself adding and removing arguments in them while you play with your tool. It would be nice if argparse natively supported comments, but it does the second best thing: it lets users easily extend the ArgumentParser class to support comments and many more features.

The ArgumentParser class implements the convert_arg_line_to_args(..) function, which contains the default behavior of ArgumentParser's file reading feature. The default implementation is very simple: it returns the whole line as a single argument. In order to support comments in an args file, we can create our own parser class that extends ArgumentParser and overrides the convert_arg_line_to_args function.

import argparse

class CustomArgumentParser(argparse.ArgumentParser):
    def __init__(self, *args, **kwargs):
        super(CustomArgumentParser, self).__init__(*args, **kwargs)

    def convert_arg_line_to_args(self, line):
        # Called once per line of an args file; yields that line's arguments.
        for arg in line.split():
            # A token starting with '#' begins a comment: drop the rest of the line.
            if arg.startswith('#'):
                break
            yield arg

In the above code, the CustomArgumentParser class overrides the convert_arg_line_to_args function, which is called once for each line in a file. This example splits each line on whitespace and treats anything after '#' as a comment, just like the Python language. To detect comments, convert_arg_line_to_args checks whether the first character of each element is '#'. If it is, the rest of the elements on that line are ignored, which makes them a comment.

Now create your parser object from CustomArgumentParser and your tool will have one more feature: comments in args files. Comment support is just one example; you can also extend it to support multi-line comments, quoted string arguments, etc.

Python's standard library modules have lots of little gems like this that allow users to extend basic functionality and build robust, powerful tools.
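To see the comment handling in action, here is a small self-contained sketch. It repeats a list-returning variant of the class so it can run on its own; the option names and the file content are assumptions made for the illustration.

```python
import argparse
import os
import tempfile


class CustomArgumentParser(argparse.ArgumentParser):
    def convert_arg_line_to_args(self, arg_line):
        args = []
        for arg in arg_line.split():
            if arg.startswith("#"):
                break  # the rest of the line is a comment
            args.append(arg)
        return args


parser = CustomArgumentParser(fromfile_prefix_chars="@")
# Assumed option declarations for this example.
parser.add_argument("--verbose", type=int, default=0)
parser.add_argument("--out")

content = (
    "# defaults for my_tool\n"
    "--verbose 1   # option and value on one line now work\n"
    "\n"
    "--out output.txt\n"
)

with tempfile.TemporaryDirectory() as tmp:
    args_file = os.path.join(tmp, "commented.args")  # hypothetical file name
    with open(args_file, "w") as f:
        f.write(content)
    args = parser.parse_args(["@" + args_file])

print(args.verbose, args.out)
```

Because each line is split on whitespace, blank lines and full-line comments simply produce no arguments, and an option can sit on the same line as its value.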

Tuesday, December 20, 2011

Git patches for subtree

Git is arguably the most useful and popular tool for source control. I have been using Git for more than 3 years now, and the feature I like the most is 'subtree'. It's very useful when you work on a large project that includes multiple projects from different people, because you can merge changes from those remote projects into one repository.

I have used subtree for pulling changes from QEMU releases into MARSS. Sometimes we want to make a change to QEMU and send the patch upstream, but 'git format-patch' doesn't work for this by default because the patch is created with 'marss' as the top directory, as shown in the 'git diff' output below:

diff --git a/qemu/target-i386/cpu.h b/qemu/target-i386/cpu.h
index 7f2103f..7047115 100644
--- a/qemu/target-i386/cpu.h
+++ b/qemu/target-i386/cpu.h
@@ -636,6 +636,7 @@ typedef struct CPUX86State {
 #ifdef MARSS_QEMU
     target_ulong cr[8]; /* NOTE: cr1 is unused */
     uint8_t handle_interrupt; /* Simulater managed int enable flag */
+    uint64_t simpoint_decr;
 #else
     target_ulong cr[5]; /* NOTE: cr1 is unused */
 #endif

As the '---' and '+++' lines show, the paths in the diff start with the 'qemu' folder. If we want to submit this patch to QEMU mainline, it won't apply, because upstream paths do not start with 'qemu'. To solve this, git diff provides a command line flag, --relative[=<path>], which tells Git to generate the diff relative to the given folder. For example,

$ git diff --relative=qemu/

will show the diff as below:

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 7f2103f..7047115 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -636,6 +636,7 @@ typedef struct CPUX86State {
 #ifdef MARSS_QEMU
     target_ulong cr[8]; /* NOTE: cr1 is unused */
     uint8_t handle_interrupt; /* Simulater managed int enable flag */
+    uint64_t simpoint_decr;
 #else
     target_ulong cr[5]; /* NOTE: cr1 is unused */
 #endif

So use '--relative' to generate patches to submit upstream. Bonus Tip: You can also use '--relative' with 'git format-patch'.

Thursday, September 8, 2011

Awesome 'awk' : variables and conditions

I was working on generating some graphs for my research work, and one of the data sets I needed came from 'tcpdump' output. I had collected all the TCP requests between two simulated VMs using 'tcpdump', where one VM was running a LAMP server and the other was generating requests to the server. The output of the dump looked like this:

18:04:02.898609 IP foxhound.cs.binghamton.edu.40080 > foxhound.cs.binghamton.edu.50847: 
18:04:02.898636 IP foxhound.cs.binghamton.edu.50847 > foxhound.cs.binghamton.edu.40080: 
18:04:03.121414 IP foxhound.cs.binghamton.edu.40080 > foxhound.cs.binghamton.edu.50847: 
18:04:03.121439 IP foxhound.cs.binghamton.edu.50847 > foxhound.cs.binghamton.edu.40080:

From that I needed to collect the traffic (number of TCP packets per minute) between server and client. Every line uses the same format, with a time stamp at the start. The challenge was to count the number of TCP packets received/sent by the server in each minute. Of course it can be done in Python by reading the file, iterating over each line, splitting the line to get the current time and incrementing the counter for that minute. But I thought Python was overkill for this simple task, so I decided to use awesome 'awk'.

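#!/bin/bash

awk '
BEGIN { print "Time, Requests"; hr = 0; min = -1; count = 0 }
{
    split($1, a, ":")
    if (a[2] != min) {
        # Minute changed: print the finished minute, then reset the counter
        if (min >= 0) { print hr ":" min ", " count }
        min = a[2]; count = 0; hr = a[1]
    }
    count++
}
# Print the last minute, which no later line would otherwise flush
END { if (min >= 0) { print hr ":" min ", " count } }
' "$1"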

Within a couple of minutes I came up with the above bash script, which uses variables within awk to keep data across multiple lines and prints the count each time the minute changes, using conditional statements. The code shows how easy it is to declare and initialize variables in the 'BEGIN' block. By using a conditional if statement you can also easily control the output of your script. Next time you run into a similar task that requires some basic calculation on text files, let 'awk' be your Swiss Army knife.

Bonus: If you are using 'awk' in a bash script and want to pass a variable to 'awk', use the -v command line option to declare and initialize the variable, then use it within your 'awk' script.
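For comparison, the Python approach described earlier (read each line, split out the time stamp, increment a counter for that minute) could be sketched roughly as below. The function name and the sample lines are made up for this illustration; only the time stamp format follows the tcpdump output above.

```python
from collections import Counter


def packets_per_minute(lines):
    """Count lines per (hour, minute), keyed by the leading HH:MM:SS.usec stamp."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        parts = fields[0].split(":")
        if len(parts) < 2:
            continue  # not a time stamp, ignore the line
        counts[(parts[0], parts[1])] += 1
    return counts


# Hypothetical sample in the same shape as the tcpdump output.
sample = [
    "18:04:02.898609 IP client.40080 > server.50847:",
    "18:04:02.898636 IP server.50847 > client.40080:",
    "18:04:59.121414 IP client.40080 > server.50847:",
    "18:05:00.121439 IP server.50847 > client.40080:",
    "18:05:03.000001 IP client.40080 > server.50847:",
]

counts = packets_per_minute(sample)
print("Time, Requests")
for (hour, minute), count in counts.items():
    print(f"{hour}:{minute}, {count}")
```

Unlike the awk version, this keeps one counter per minute in memory instead of printing as soon as the minute changes, so it does not depend on the input being sorted by time.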

Wednesday, February 9, 2011

chmod to copy user permission to group

The issue I had was simple:
- I wanted to give read/write/execute permission to my group
- but I did not want to end up with 'executable' .cpp and .h files

If I run 'chmod g+rwx -R dir', all users in my group can access these files, but every file, including the .cpp and .h sources, ends up group-executable, which looked ugly to me. What I wanted was to copy the user permissions to the group, so that all users in my group have the same permissions as me.

The solution was simple (now I wonder why I didn't check the man pages earlier..)

chmod g+u -R dir

That's it, and it's done.
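If the same thing ever has to be done from a script, the effect of 'g+u' can be approximated in Python by shifting the user permission bits into the group position. This is only a sketch under POSIX assumptions: the function names are made up, and it ignores corner cases such as broken symlinks.

```python
import os
import stat
import tempfile


def add_user_perms_to_group(path):
    """Give the group every permission bit the owner already has."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    user_bits = mode & stat.S_IRWXU         # rwx bits of the owner
    os.chmod(path, mode | (user_bits >> 3))  # same bits, group position


def add_user_perms_recursive(top):
    """Apply add_user_perms_to_group to top and everything below it."""
    for root, _dirs, files in os.walk(top):
        add_user_perms_to_group(root)
        for name in files:
            add_user_perms_to_group(os.path.join(root, name))


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "main.cpp")
    script = os.path.join(tmp, "run.sh")
    for path, mode in ((source, 0o600), (script, 0o700)):
        with open(path, "w") as f:
            f.write("")
        os.chmod(path, mode)

    add_user_perms_recursive(tmp)

    source_mode = stat.S_IMODE(os.stat(source).st_mode)
    script_mode = stat.S_IMODE(os.stat(script).st_mode)

print(oct(source_mode), oct(script_mode))
```

The source file stays non-executable for the group because it was never executable for the owner, which is exactly the point of copying permissions instead of granting rwx.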

Wednesday, January 12, 2011

Setting up github ssh access behind proxy servers

At work, to access and update the public repo of MARSS, I finally set up a rather complicated proxy configuration for Git, which finally worked :).

I followed the steps from this tutorial and it worked without a glitch. The only change I made was that, instead of using the provided 'connect.c' file, I used standard 'nc' for basic proxy support. So my 'socks-gw' file looks like the following:

#!/bin/sh
# File ~/bin/socks-gw
# Connect a SOCKS 5 proxy using 'nc'
nc -X 5 -x proxy.server:1080 "$@"

Monday, January 3, 2011

Good Introductory Book on Issues of Parallel Programming

I found a link to this book on Reddit Programming. It is called 'Is Parallel Programming Hard, And, If So, What Can You Do About It?'.

I skimmed through the book, and trust me, if you are into systems programming you'll get hooked. I liked the way Paul (the author of the book) explains the key hardware obstacles to parallel programming. It's not too detailed, but he explains the basic issues in very simple language.

Here is a link to the author's official announcement of this book.