Design your code! (and write less of it): All in One View

Content from Introduction

Last updated on 2024-08-16

Overview

Questions

Why should you know about code design?

Objectives

Understand the 4 main concepts developed in this course: maintainability, readability, reusability and scalability

Why should you care?

Reproducibility and Reliability

Good code practices ensure that research results are reproducible and reliable. Research findings are often scrutinized and validated by others in the field, and well-written code facilitates this process. Clean, well-documented, and well-tested code allows other researchers to replicate experiments, verify results, and build upon existing work, thus advancing scientific knowledge.

Efficiency and Maintainability

Writing good code enhances efficiency and maintainability. Research projects can span several years and involve multiple collaborators. Readable and well-structured code makes it easier for current and future researchers to understand, modify, and extend the software. This reduces the time and effort required to troubleshoot issues, implement new features, or adapt the code for different datasets or experiments.

Collaboration and Community Contribution

Good coding practices facilitate collaboration and contribution from the wider research community. Open-source research software, written with clear, standardized coding practices, attracts contributions from other researchers and developers. This collaborative environment can lead to improvements in the software, innovative uses, and more robust and versatile tools, ultimately benefiting the entire research community.

Readability

Definition and key aspects

Readability in software refers to how easily a human reader can understand the purpose, control flow, and operation of the code. High readability means that the code is clear, easy to follow, and well-organized, which greatly enhances maintainability and collaboration and reduces the likelihood of bugs.

Key aspects:

Descriptive Naming: Use meaningful and descriptive names that convey the purpose of variables, functions and classes.
Consistent Formatting: Consistent indentation improves the visual structure of the code. Keeping lines of code within a reasonable length (usually 80-100 characters) prevents horizontal scrolling and improves readability.
Comments and documentation: Brief comments within the code explaining non-obvious parts. Detailed documentation at the beginning of modules, classes, and functions explaining their purpose, parameters, and return values.
Code structure: Breaking down code into functions, classes, and modules that each handle a specific task. Group related pieces of code together, and separate different functionalities clearly.

Benefits:

1 - Maintainability: Your code will be easier to understand and modify. It will also greatly reduce the risk of errors when introducing changes.

2 - Collaboration: writing readable code will enhance teamwork and make it easy for others to contribute. Code reviews will be easy!

3 - Efficiency: You are going to save a LOT of time. You will waste less time deciphering your code. That saved time will be used to develop the code.

4 - Quality: Reduces the likelihood of bugs and errors, leading to more reliable code

Reusability

Definition and Key aspects

Reusability in software refers to the ability to use existing software components (such as functions, classes, modules, or libraries) across multiple projects or in different parts of the same project without significant modification. Reusable code is designed to be generic and flexible, promoting efficiency, reducing redundancy, and enhancing maintainability.

Key aspects:

Modularity: Encapsulate functionality within well-defined modules or classes that can be independently reused.
Abstraction: Provide simple interfaces while hiding the complex implementation details.
Parametrization: Design functions and methods that accept parameters to make them adaptable to different situations.
Generic and Reusable Components: Develop generic libraries and utility functions that can be reused across multiple projects.
Documentation and Naming: Provide comprehensive documentation for modules, classes, and functions to explain their usage.
Avoid hardcoding values: Instead, use constants or configuration files.
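
The parametrization and "avoid hardcoding" points above can be sketched as follows. This is only an illustration: the names `DEFAULT_SCALE`, `DEFAULT_OFFSET` and `rescale` are hypothetical and not part of the lesson.

```python
# Hypothetical constants: defined once instead of being scattered through the code.
DEFAULT_SCALE = 2.0
DEFAULT_OFFSET = 3.0


def rescale(values, scale=DEFAULT_SCALE, offset=DEFAULT_OFFSET):
    """Return a new list with every value multiplied by scale and shifted by offset."""
    return [value * scale + offset for value in values]


# The same function serves different situations through its parameters.
print(rescale([1, 2, 3]))                      # uses the default constants
print(rescale([1, 2, 3], scale=10, offset=0))  # adapted without editing the function
```

Because the behaviour is controlled by parameters, the function can be reused in another project without touching its body.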

Benefits:

Time saving: Reusable components save development time. You don’t need to rewrite from scratch! Reusing existing solutions for common tasks avoids duplication of effort.
Consistency: Using the same code components across projects ensures consistency in functionality and behavior.
Maintainability: Reusable components can be maintained and updated independently, making it easier to manage large codebases.
Quality: Reusable components are often well-tested, leading to more reliable software with fewer bugs.

Scalability

Definition and key aspects

Scalability in software refers to the ability of a system, application, or process to handle increased loads or demands without compromising performance, reliability, or efficiency. This involves the capacity of the software to grow and manage higher demands by adding resources or optimizing the existing ones. Scalability is a critical consideration in software design and architecture, ensuring that the system can accommodate growth in users, transactions, data volume, or other metrics over time.

Multiple types of scalability can be considered. Here are a few examples:

Data scalability: The ability to efficiently store, retrieve, and process large volumes of data.
User scalability: Supporting an increasing number of simultaneous users without degradation of performance
Functional scalability: The ability to add new features or functionality to the software without affecting existing performance.

Benefits

Improved Performance: Scalable systems maintain or improve performance levels as the load increases.
Cost Efficiency: Scalability allows for gradual investment in additional resources as needed, rather than over-provisioning from the start.
Reliability and Availability: Scalable systems often include redundancy and failover mechanisms, improving overall system reliability and uptime.
User Satisfaction: Providing consistent and reliable performance even as user demand grows ensures a better user experience.
Future-Proofing: Designing for scalability ensures that the system can grow and adapt to future requirements without significant overhauls.

Maintainability

Definition and key aspects

Maintainability in software refers to the ease with which a software system can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment. Highly maintainable software is designed to be easily understood, tested, and updated by developers, ensuring that the software can evolve over time with minimal effort and cost.

Key aspects:

Code readability: Your code should be organized logically, with meaningful names for variables, functions and classes.
Modularity: If you divide your software into distinct modules or components, each responsible for a specific functionality, you will greatly reduce dependencies.
Documentation: The documentation of the code should be continuously updated to reflect the latest state of the software.
Automated testing: Testing your software is important to make sure that modifications and the implementation of new functionality do not break it.
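
As a minimal sketch of the automated testing point, a function can be paired with a small test that is rerun after every change. The function `celsius_to_kelvin` and its test are hypothetical examples; in a real project a test runner such as pytest would typically collect functions whose names start with `test_`.

```python
def celsius_to_kelvin(temperature_celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temperature_celsius + 273.15


def test_celsius_to_kelvin():
    """Check a few known reference points."""
    assert celsius_to_kelvin(0) == 273.15
    assert abs(celsius_to_kelvin(-273.15)) < 1e-9
    assert abs(celsius_to_kelvin(100) - 373.15) < 1e-9


# Run the check directly when no test runner is available.
test_celsius_to_kelvin()
print("All tests passed")
```

If a later modification breaks the conversion, the test fails immediately instead of the error surfacing in a result.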

Benefits

Reduce technical debt: Maintainable code is easier to refactor and improve over time, reducing the accumulation of technical debt. The cost and effort to maintain the software will be significantly reduced.
Faster development: If your code is maintainable, it will be easier to understand, modify and extend. It will also be easier to identify and fix bugs.
Increase collaboration: Having maintainable code will make it easier for people to join you!
Adaptability to new requirements: If your code is maintainable it will be easier to adapt it to changing (or new) requirements, as is often the case in research.

Quiz

Results of the quiz are in the Mentimeter slides. The question for each code is ‘Is this code readable, reusable, maintainable, scalable?’

Challenge

Code #1:

import numpy
def process_list(data):
    return [numpy.sqrt(x) * 2 + 3 for x in data if x * 1.5 < 5]

#Example usage
input_data = [1, 2, 3, 4, 5, 6]
result = process_list(input_data)
print("processed list:", result)

Show me the solution

Readable: The code is readable because it uses a list comprehension that is relatively straightforward to understand for someone familiar with Python.
Reusable: The function can be used with any list of non-negative numbers to filter and transform the data.
Scalable: The function uses a list comprehension, which is efficient for processing lists.

However, the code will be difficult to maintain because:

There are no comments explaining what the function is doing or why it’s doing it.
Constraints are not explained.
The logic includes “magic numbers” (2, 3, 1.5 and 5) without any explanation or named constants.
There is no error handling, which makes it harder to maintain when unexpected inputs occur.
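
One possible way to address these points is sketched below. It is not the lesson's reference solution: the constant names are hypothetical (the original code does not say what the numbers mean), and `math.sqrt` from the standard library is swapped in for `numpy.sqrt` so that the sketch has no external dependency.

```python
import math

# Hypothetical names: the original code does not explain these numbers.
SCALE_FACTOR = 2
OFFSET = 3
FILTER_WEIGHT = 1.5
FILTER_LIMIT = 5


def process_list(data):
    """Transform the values of data that pass the filter.

    A value x is kept when x * FILTER_WEIGHT < FILTER_LIMIT and is then
    mapped to sqrt(x) * SCALE_FACTOR + OFFSET.

    Raises:
        ValueError: If a value is negative (its square root is not real).
    """
    results = []
    for x in data:
        if x < 0:
            raise ValueError(f"negative value not supported: {x}")
        if x * FILTER_WEIGHT < FILTER_LIMIT:
            results.append(math.sqrt(x) * SCALE_FACTOR + OFFSET)
    return results


print("processed list:", process_list([1, 2, 3, 4, 5, 6]))
```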

Challenge

Code #2:

def b(m, n):
    if m == 0:
        return n + 1
    elif m > 0 and n == 0:
        return b(m - 1, 1)
    else:
        return b(m - 1, b(m, n - 1))

# Example usage
result = b(3, 2)
print("Result:", result)

Show me the solution

This code implements the Ackermann function, a classic example of a computationally intensive function.

Maintainable: The code is structured and easy to update.
Reusable: You can call the b function with different arguments to compute the Ackermann function for different inputs.
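Scalable: For the small inputs used here the function works as intended. Keep in mind that the Ackermann function grows extremely fast, so the deep recursion restricts it to small values of m and n.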

Nevertheless, the code is difficult to read:

The code may not be very readable to someone unfamiliar with the Ackermann function or the specific implementation details. The function name b and the lack of comments or descriptive variable names may make it difficult to understand at first glance.
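
A more readable variant could look like the sketch below. The name `ackermann` and the docstring are illustrative choices; the logic is the same as in the original function `b` for non-negative arguments.

```python
def ackermann(m, n):
    """Compute the Ackermann function A(m, n) for non-negative integers m and n.

    The recursion becomes very deep very quickly, so only small inputs
    are practical.
    """
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


# Example usage
print("Ackermann(3, 2):", ackermann(3, 2))
```

The descriptive name and the docstring tell the reader what is computed without requiring prior knowledge of the recursion.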

Challenge

Code #3:

def calculate_statistics():
    data = [23, 45, 12, 67, 34, 89, 23, 45, 23, 34]
    total_sum = sum(data)
    count = len(data)
    average = total_sum / count

    data_sorted = sorted(data)
    if count % 2 == 0:
        median = (data_sorted[count // 2 - 1] + data_sorted[count // 2]) / 2
    else:
        median = data_sorted[count // 2]

    occurrences = {}
    for num in data:
        if num in occurrences:
            occurrences[num] += 1
        else:
            occurrences[num] = 1
    mode = max(occurrences, key=occurrences.get)

    print("Sum:", total_sum)
    print("Average:", average)
    print("Median:", median)
    print("Mode:", mode)

# Calculate statistics for the specific data set
calculate_statistics()

Show me the solution

Maintainable: The code is well-structured, with clear variable names and straightforward logic. It’s easy to understand and modify if needed.
Readable: The code uses descriptive variable names and simple constructs, making it easy to follow.
Scalable: The code efficiently handles the data processing tasks (sum, average, median, mode) for a list of numbers.

However, the code is not reusable because the function calculate_statistics is hardcoded to work with a specific dataset defined within the function. It cannot be easily reused with different datasets without modifying the function itself.
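
A reusable variant is sketched below, assuming the dataset is passed as an argument and the results are returned instead of printed. The hand-written median and mode are replaced here by the standard library `statistics` module, which is a swapped-in technique rather than the original implementation.

```python
import statistics


def calculate_statistics(data):
    """Return the sum, average, median and mode of a non-empty list of numbers."""
    if not data:
        raise ValueError("data must contain at least one value")
    return {
        "sum": sum(data),
        "average": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }


# The same function now works for any dataset.
measurements = [23, 45, 12, 67, 34, 89, 23, 45, 23, 34]
for name, value in calculate_statistics(measurements).items():
    print(f"{name.capitalize()}: {value}")
```

Returning a dictionary leaves it to the caller to decide whether to print, store or further process the values.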

Challenge

Code #4:

def factorial(n):
    """
    Calculate the factorial of a non-negative integer n.

    Parameters:
    n (int): A non-negative integer whose factorial is to be computed.

    Returns:
    int: The factorial of the given number n.
    """
    # Base case: factorial of 0 or 1 is 1
    if n == 0 or n == 1:
        return 1
    # Recursive case: n * factorial of (n-1)
    return n * factorial(n - 1)

# Example usage
number = 5
result = factorial(number)
print(f"Factorial of {number} is {result}")

Show me the solution

Maintainable: The code is well-structured with a clear base case and recursive case. The function is documented, explaining what it does and the parameters it takes.
Readable: The variable names are descriptive, and the function logic is simple and easy to follow. The use of comments and a docstring further enhances readability.
Reusable: The factorial function can be reused to calculate the factorial of any non-negative integer.

However, the recursive approach to calculating factorial is not scalable for large values of n due to the risk of stack overflow and the inefficiency of repeated function calls. For large inputs, this implementation will not perform well and can cause a maximum recursion depth exceeded error in Python.
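
A loop-based version avoids the recursion limit. The sketch below is one way to write it; the standard library also provides `math.factorial`, which is usually the better choice in practice.

```python
def factorial(n):
    """Calculate the factorial of a non-negative integer n with a loop."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    result = 1
    for factor in range(2, n + 1):
        result *= factor
    return result


# Example usage
print(f"Factorial of 5 is {factorial(5)}")

# An input far beyond Python's default recursion limit works as well.
large_result = factorial(5000)
print("Number of bits in 5000!:", large_result.bit_length())
```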

Content from The Zen of Python

Last updated on 2024-08-19

Overview

Questions

What are PEPs?
How to write clean code?
How can I do this efficiently with Pylint?

Objectives

Understand why it is important to write good code
Write PEP8 compliant code
Use Pylint to help with code formatting and programmatic errors

Python Enhancement Proposals and the Zen of Python

The Python Enhancement Proposals (PEPs) are documents that provide information to the Python community or describe a new feature for Python, its processes or its environment. Some of them also focus on design and style:

The main one is PEP8. It lays out rules to write clean code in Python.
Docstring conventions are given in PEP257.
The Zen of Python in PEP20 gives principles for Python’s design. It is accessible in any Python distribution with:

In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Readability counts

As Guido van Rossum (Python creator and Benevolent Dictator For Life) once said: ‘Code is read much more often than it is written.’

While coding you may spend a few hours (or days) on a piece of code, and once you are done with it you will not write it again. Nevertheless, there is a very high chance that you will read it again. If the piece of code is part of an ongoing project you will have to remember what that code does and why you actually wrote it. Hence, readability counts! Remembering what a piece of code does after a few weeks or months is not easy. Following the standard guidelines will greatly help you (and save you a lot of time!).

In addition, if multiple people are looking at the code and developing with you, writing readable code is paramount. If people have to decipher your coding style before they can even try to understand what you are coding, work becomes very difficult for everybody. PEP8 provides a standardisation of the Python coding style.

Explicit is better than implicit.

Writing clear code is not complicated. It starts by giving meaningful names to variables, functions and classes. Avoid single-letter names like x or y. For example:

PYTHON

# This is bad:

x = 5
y = 10
z = 2*x + 2*y

# This is much better:

width = 5
height = 10
perimeter = 2*height + 2*width

Just by using descriptive names we can understand what the code is trying to do.

In addition, everything that you write (variables, constants, functions, classes…) comes with a naming convention. The main conventions are:

Variables, functions and methods use the snake_case convention: they should use lowercase letters, with words separated by underscores:

PYTHON

# This is bad
def ComputePerimeter(width, height):
    return 2*width + 2*height

# This is good
def compute_perimeter(width, height):
    return 2*width + 2*height

Class names follow the PascalCase convention (also known as CamelCase). In that convention, each word starts with a capital letter and there are NO underscores between words.

PYTHON

# This is bad
class example_class:
    pass

# This is good
class ExampleClass:
    pass

Constant names follow the UPPER_SNAKE_CASE convention. Constants, or variables that are intended to remain unchanged, should be written in all uppercase letters, with words separated by underscores.

PYTHON

# This is bad
speedoflight = 3e8
planckconstant = 6.626e-34

# This is good
SPEED_OF_LIGHT = 3e8
PLANCK_CONSTANT = 6.626e-34

Beautiful is better than ugly

In the context of Python, beautiful means that the code is clean, readable and well structured. Beautiful code is easy to understand, not only for you but also for other people who might have to maintain the code in the future. It uses meaningful names and clear logic and structure.

Challenge

What is this code doing?

PYTHON

print(sum(x**2 for x in range(2, 100) if all(x % d != 0 for d in range(2, int(x**0.5) + 1))))

Show me the solution

This one-liner finds all prime numbers less than 100, squares them, and returns the sum of these squares.

‘Beautiful is better than ugly’ means that developers should aim for simple and elegant solutions. Code becomes very difficult to maintain when the author tries to cram as much functionality as possible into a single line or function. Always try to break the code down into clear, single-purpose components.

Beautiful code is aesthetically pleasing because it follows good design principles (see next chapter). It is modular, reusable, and adheres to the DRY (Don’t Repeat Yourself) principle. It avoids unnecessary complexity and focuses on clarity.
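
As an illustration, the prime one-liner from the challenge above could be broken down as in the sketch below. The function names `is_prime` and `sum_of_squared_primes` are illustrative choices, not part of the original lesson.

```python
def is_prime(number):
    """Return True if number is prime, using trial division up to its square root."""
    if number < 2:
        return False
    for divisor in range(2, int(number**0.5) + 1):
        if number % divisor == 0:
            return False
    return True


def sum_of_squared_primes(upper_limit):
    """Return the sum of the squares of all primes strictly below upper_limit."""
    total = 0
    for number in range(2, upper_limit):
        if is_prime(number):
            total += number**2
    return total


print(sum_of_squared_primes(100))
```

Each function now does one thing, can be tested on its own, and has a name that states its purpose.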

Challenge

Rewrite the following one-liner:

PYTHON

words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape']
result = [len(word) for word in words if len(word) % 2 == 0 and 'a' in word.lower()]

You will create three functions: has_even_length, contains_letter_a and process_words, and you will pass the following list of words: words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape'].

Show me the solution

PYTHON

def has_even_length(word):
    """
    Check if the length of the word is even.
    
    Args:
        word (str): The word to check.
        
    Returns:
        bool: True if the length of the word is even, False otherwise.
    """
    return len(word) % 2 == 0

def contains_letter_a(word):
    """
    Check if the word contains the letter 'a' (case-insensitive).
    
    Args:
        word (str): The word to check.
        
    Returns:
        bool: True if the word contains 'a', False otherwise.
    """
    return 'a' in word.lower()

def process_words(words):
    """
    Process a list of words to return the lengths of words that are both even in length and contain the letter 'a'.
    
    Args:
        words (list of str): The list of words to process.
        
    Returns:
        list of int: A list of lengths of words that meet the criteria.
    """
    lengths = []
    for word in words:
        if has_even_length(word) and contains_letter_a(word):
            lengths.append(len(word))
    return lengths

# Example usage:
words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape']
result = process_words(words)
print(result)  # Output: [6, 4]

Sparse is better than dense.

When you write your code it is important to make it readable. Avoid cluttered code: making it sparse and spaced out makes it easier to read and increases clarity. Using whitespace, correct indentation and clear separation will make your code quicker to understand. Moreover, when code is spread out with proper comments and breaks it is easier to modify or debug. Let’s see an example:

Challenge

What is wrong with this code? Is it actually working?

PYTHON

def   example_function(param1,param2):print(param1+param2*2, end=' ')
print("The result is:",  param1,param2) 
def   another_function(x,y):return x+y
class  MyClass: def __init__(self,param):self.param=param
def  method(self):if self.param >10:print("Value is greater than 10")
else:print("Value is 10 or less") 
my_list=[1,2,3,4,5]
dictionary={'key1':'value1','key2':'value2'}
result=another_function(5,10) 
print(result)

So what are the rules?

Indentation: The convention is to use 4 spaces. Tabs are not recommended as they can lead to inconsistencies:

PYTHON

def example_function():
    if True:
        print("Indented correctly")

Whitespaces around operators: A single space on both sides of binary operators should be included (+, -, *, /, =, ==, !=, <, >, <=, >=, etc).

PYTHON

# This is bad
a=2
b=3
c=4
result=a+b+c


# This is good
a = 2
b = 3
c = 4
result = a + b + c

Comma and colon spacing: you should include a single space after a comma, and a single space after the colon in a dictionary:

PYTHON


# This is bad
dictionary={'key1':'value1','key2':'value2'}

# This is good
dictionary = {'key1': 'value1', 'key2': 'value2'}

Blank lines: Use two blank lines before a top-level function or class definition and use a single blank line between method definitions inside a class.

PYTHON

# This is bad
class MyClass:
    def method_one(self):
        pass
    def method_two(self):
        pass

# This is good

class MyClass:
    def method_one(self):
        pass

    def method_two(self):
        pass

Challenge

Based on what we saw up to now, rewrite this code to make it easier to understand.

PYTHON

def   example_function(param1,param2):print(param1+param2*2, end=' ')
print("The result is:",  param1,param2) 
def   another_function(x,y):return x+y
class  MyClass: def __init__(self,param):self.param=param
def  method(self):if self.param >10:print("Value is greater than 10")
else:print("Value is 10 or less") 
my_list=[1,2,3,4,5]
dictionary={'key1':'value1','key2':'value2'}
result=another_function(5,10) 
print(result)

Show me the solution

PYTHON

def calculate_adjusted_sum(base_value, multiplier):
    """
    Calculate and print the sum of the base_value and twice the multiplier.
    
    Args:
        base_value (int or float): The base value to which the adjusted multiplier will be added.
        multiplier (int or float): The value that will be doubled and added to the base value.
    """
    adjusted_sum = base_value + (multiplier * 2)
    print(adjusted_sum, end=' ')
    print("The adjusted sum is:", base_value, multiplier)


def add_two_numbers(x, y):
    """
    Return the sum of two numbers.
    
    Args:
        x (int or float): The first number.
        y (int or float): The second number.
    
    Returns:
        int or float: The sum of x and y.
    """
    return x + y


class ValueChecker:
    def __init__(self, value):
        """
        Initialize with a specific value.
        
        Args:
            value (int or float): The value to be checked.
        """
        self.value = value

    def check_and_print_message(self):
        """
        Print a message based on whether the value is greater than 10 or not.
        """
        if self.value > 10:
            print("The value is greater than 10.")
        else:
            print("The value is 10 or less.")


# Example usage
numbers_list = [1, 2, 3, 4, 5]
key_value_pairs = {'key1': 'value1', 'key2': 'value2'}

# Add two numbers and print the result
result = add_two_numbers(5, 10)
print("Sum of numbers:", result)

We can now see that neither the class nor the first function is used in the rest of the code. If the code stays like this, they can be removed.

If the implementation is hard to explain, it’s a bad idea… If the implementation is easy to explain, it may be a good idea.

If you follow this FAIR training program you might be interested in sharing your code with the wider research community. If that’s the case, people might want to have a look at your code. This aphorism tells you that how you implement your code matters! Code should always be easy to understand. If you are unable to explain what your code is doing then you should not leave it in your software. Conversely, if you can easily explain what your piece of code is doing, this is probably a good implementation. For example:

Challenge

What is this code doing?

PYTHON


def check_number(num):
    if num % 2 == 0:
        if num % 5 == 0:
            return True
        else:
            return False
    else:
        return False

How could you make it easier to understand?

Show me the solution

The function checks if a number is both even and a multiple of 5. A better way of doing it could be:

PYTHON

def check_number(num):
    return num % 2 == 0 and num % 5 == 0

In addition to writing simpler and more logical code, commenting your code is important. For more complex operations it is often useful to explain the logic behind the code and why a particular approach has been chosen.

There are a few rules for writing comments in Python:

Comments should be complete sentences and start with a capital letter.
Block comments apply to the code coming after it and are indented to the same level of that code. Each line should start with a # followed by a single space.
Inline comments should be separated by at least two spaces from the piece of code they relate to.
Comments should not state the obvious (it is distracting).

Finally, when you update your code you should always update the comment. ‘Comments that contradict the code are worse than no comments’ [PEP8].
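
The comment rules above can be illustrated with a short sketch. The function `average_speed` and the constant `SECONDS_PER_HOUR` are hypothetical and serve only to show a block comment and an inline comment.

```python
SECONDS_PER_HOUR = 3600


def average_speed(distance_km, duration_s):
    """Return the average speed in km/h for a distance in km and a duration in s."""
    # A zero duration would cause a division by zero below, so it is
    # rejected early with an explicit error message.
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    duration_h = duration_s / SECONDS_PER_HOUR  # Convert seconds to hours.
    return distance_km / duration_h


print(average_speed(10, 1800))
```

The block comment explains why the check exists, which the code alone does not reveal, while nothing comments on the obvious division in the last line.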

PyLint

PyLint is a tool that analyzes Python code to find programming errors, enforce a coding standard, and look for improvements. It provides a score based on the number of issues detected, helping you write clean and readable code.

Key Features of PyLint

Error Detection: Identifies syntax errors, undefined variables, unused imports, and other potential bugs.
Coding Standard Enforcement: Checks the code against PEP 8. Flags violations such as incorrect indentation, naming conventions, and line length.
Code Quality Metrics: Provides a detailed report with metrics like code complexity, number of lines, and number of classes. Offers a score that reflects the overall quality of the code.
Refactoring Suggestions: Suggests improvements to make the code cleaner and more efficient. Highlights duplicated code, unused variables, and functions that can be simplified.

Running pylint

To analyse a Python file you can simply run:

BASH

pylint your_python_file.py

When you run PyLint on a Python file, it provides an output with the following components:

Messages: Each detected issue is reported with a message ID, type, line number, and a brief description.
Statistics: Provides a summary of the issues found, such as the number of errors, warnings, and refactor suggestions.
Score: An overall score out of 10, reflecting the code quality based on the issues detected.

Challenge

Let’s have a look at an example: Consider that file here and run PyLint on it. Try to clean up the code according to the error messages you see.

Configuration:

PyLint can be configured to match your specific project requirements. You can create a configuration file (.pylintrc) to customize the behavior of PyLint, such as enabling/disabling certain checks, adjusting thresholds, and more. Generate a configuration file using:

BASH

pylint --generate-rcfile > .pylintrc

Integrating with IDEs

Many Integrated Development Environments (IDEs) and text editors, such as Visual Studio Code, PyCharm, and Sublime Text, support PyLint integration. This allows you to see linting results directly within your editor as you write code.

Content from Principles of Code design

Last updated on 2024-07-30

Overview

Questions

How to write maintainable, readable, reusable and scalable code?

Objectives

Be familiar with standard principles of code design
Understand what they mean and how to apply them

Don’t repeat yourself (DRY) - Rule of three

Keep it simple, Stupid (KISS) & Curly’s Law - Do one Thing

You aren’t gonna need it (YAGNI)

Principle of least astonishment (POLA)

Code for the maintainer

Content from Code structure

Last updated on 2024-06-11

Overview

Questions

How to structure code in a scalable and reusable way?

Objectives

Learn to use functions and classes
Understand how to organise your code in modules and packages


Using Configuration files in Python


Creating command line interface


Content from Don't touch your code anymore!

Last updated on 2024-08-05

Overview

Questions

How can you modify your code configuration without touching it?

Objectives

Learn how to set up configuration files using a simple INI file
Be able to create simple command line interfaces with argparse

Introduction

Research software is often based on trial-and-error loops. You will often find yourself rerunning code with different parameters to try different configurations of your experiment. So far, what we have seen deals with the design of the code itself and how to make it cleaner, more readable and maintainable. BUT! What if you need to try something new by changing a few parameters of your code? You will need to go and change the code itself! And it is very likely that you will do this a few times (or a lot!). Along the way, unless you are able to track all your trials very well, you will probably lose track of some of them. In addition, endlessly modifying the code greatly increases the risk of introducing errors…

In order to avoid such problems we are going to see a couple of options that are easily available and implementable:

Configuration files
Command line interface

Configuration files

Advantages of using configuration files

Using configuration files in a research context offers several specific benefits that can greatly enhance the efficiency, reproducibility, and manageability of research projects. Here are the key reasons why configuration files are beneficial in a research setting:

Reproducibility: Configuration files ensure that experiments can be easily replicated by maintaining consistent settings across different runs. This is critical for verifying results and peer review.
Parameter Management: Research often involves experimenting with various parameters. Configuration files allow researchers to manage and tweak these parameters without altering the core codebase, enabling easier experimentation and optimization.
Collaboration: Research projects often involve collaboration between multiple researchers. Configuration files provide a clear and centralized way to share settings, making it easier for team members to understand and modify the setup as needed.
Documentation: Well-structured configuration files serve as documentation for the experimental setup. They provide a clear and organized record of the settings used, which is crucial for understanding and interpreting results.
Version Control: Configuration files can be versioned alongside the code using version control systems like Git. This makes it easy to track changes in experimental setups over time and understand the impact of these changes on the results.

How to build configuration files? What library should I use?

As is often the case in Python, multiple options are available:

INI files are easy to read and parse. The module used to load these files is configparser, which is part of the Python standard library.

[section1]
key1 = value1
key2 = value2

#Comments

[section2]
key1 = value1


[Section3]
key = value3
    multiline

INI files are structured as sections, in which you can list keyword/value pairs (as in a dictionary) separated by either the = or : sign. Section names are case sensitive, whereas keys are not by default. Values can span multiple lines as long as the extra lines are indented with respect to the first line, and comments are accepted on lines starting with # or ;.
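As a minimal sketch of these rules, the snippet below parses a small INI string with configparser instead of a file. The section and key names are made up for illustration only.

```python
import configparser

# Hypothetical INI content, used only to illustrate the syntax rules
ini_text = """
[Section3]
key = value3
    multiline
other: 42
# This line is a comment and is ignored
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Section names are case sensitive
has_exact = parser.has_section('Section3')    # True
has_lower = parser.has_section('section3')    # False

# The indented line is appended to the value, after a newline character
multiline_value = parser['Section3']['key']   # 'value3\nmultiline'

# ':' works as a delimiter just like '='
other_value = parser['Section3']['other']     # '42'
```

Note that the comment line does not become a key, and that every value comes back as a string.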

JSON: Originally derived from JavaScript, JSON files are very popular in web applications. The module to read these files is json, which is also part of the standard library.

{
  "section1": {
    "key1": "value1",
    "key2": "value2"
  },
  "section2": {
    "key1": "value1"
  }
}

JSON files are also structured as sections and keyword/value pairs. A JSON file starts with an opening brace { and ends with a closing brace }. Each section is introduced by its name followed by :, and its key/value pairs are listed within their own pair of braces. Unlike INI files, JSON does not allow comments.
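For comparison, here is a small sketch of loading such a configuration with the json module. The content is parsed from a string to keep the example self-contained, and the numeric and boolean entries are hypothetical additions to show that JSON keeps the type of each value.

```python
import json

# Hypothetical JSON configuration, for illustration only
config_text = """
{
  "section1": {"key1": "value1", "key2": "value2"},
  "section2": {"key1": 1.5, "enabled": true}
}
"""

# json.loads returns nested dictionaries
config = json.loads(config_text)

text_value = config["section1"]["key1"]   # 'value1' (a string)
number = config["section2"]["key1"]       # 1.5 (a float, no conversion needed)
flag = config["section2"]["enabled"]      # True (a boolean)
```

To read from a file instead, json.load takes an open file object.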

YAML files are also a popular format (used for GitHub Actions, for example). In order to read (and write) YAML files, you will need to install a third-party package called PyYAML.

section1:
  key1: value1
  key2: value2

section2:
  key1: value1

# Comments

YAML files also work with sections and keyword/value pairs.

Configparser: loading and writing config files

In the following we will be using INI files. We will start with a simple exercise: writing a configuration file manually.

Challenge

Using the text editor of your choice, create an INI file with three sections: simulation, environment and initial_conditions. The first section has two parameters: time_step set to 0.01 s and total_time set to 100.0 s. The environment section also has two parameters, with gravity at 9.81 and air_resistance at 0.02. Finally, the initial conditions are: velocity at 10.0 km/s, angle at 45 degrees and height at 1 m.

Show me the solution

Create a file 'config.ini' with the following content:

[simulation]
time_step = 0.01
total_time = 100.0

[environment]
gravity = 9.81
air_resistance = 0.02

[initial_conditions]
velocity = 10.0
angle = 45.0
height = 1.0

Reading configuration files

Reading an INI file is very easy. It requires the configparser module. You do not need to install it because it is part of the standard library. To read a config file, import the module and create a parser object, which is then used to read the file we created just above, as follows:

PYTHON

##Import the library
import configparser 

##Create the parser object
parser = configparser.ConfigParser()

##Read the configuration file
parser.read('config.ini')

From there you can access everything that is in the configuration file. Firstly you can access the section names and check if sections are there or not (useful to check that the config file is compliant with what you would expect):

PYTHON

>>> print(parser.sections())
['simulation', 'environment', 'initial_conditions']

>>> print(parser.has_section('simulation'))
True

>>> print(parser.has_section('Finalstate'))
False
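The compliance check mentioned above could be sketched as follows. The helper name find_missing_sections is hypothetical, and the configuration is read from a string so that the example runs on its own.

```python
import configparser


def find_missing_sections(parser, required_sections):
    """Return the required section names that are absent from the parser."""
    return [name for name in required_sections if not parser.has_section(name)]


parser = configparser.ConfigParser()
# An incomplete configuration: the initial_conditions section is missing
parser.read_string("""
[simulation]
time_step = 0.01

[environment]
gravity = 9.81
""")

required = ['simulation', 'environment', 'initial_conditions']
missing = find_missing_sections(parser, required)
print(missing)   # ['initial_conditions']
```

Running such a check right after loading the file lets the program stop early with a clear message instead of failing later on a missing key.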

Eventually, you will need to extract the values in the configuration file. You can get all the keys inside a section at once:

PYTHON

>>> options = parser.options('simulation')
>>> print(options)
['time_step', 'total_time']

You can also extract everything at once; in that case each key/value pair is called an item:

PYTHON

>>> items_in_simulation = parser.items('simulation')
>>> print(items_in_simulation)
[('time_step', '0.01'), ('total_time', '100.0')]

That method returns a list of tuples, each containing a key/value pair. Values will always be of type string.

Finally, you can directly access the value of a key inside a given section like this:

PYTHON

>>> time_step = parser['simulation']['time_step']
>>> print(time_step)
0.01

By default, ALL values will be a string. Another option is to use the method .get():

PYTHON

>>> time_step_with_get = parser.get('simulation', 'time_step')
>>> print(time_step_with_get)
0.01

This also returns a string… That can be annoying when your values have other types, because you will have to convert everything to the right type yourself. Fortunately, other methods are available:

.getint() will extract the value and convert it to an integer
.getfloat() will extract the value and convert it to a float
.getboolean() will extract the value and convert it to a boolean. Interestingly, it returns True if the value is 1, yes, true or on, and False if the value is 0, no, false or off.
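A short sketch of these typed getters is given below. The keys n_steps, verbose and seed are hypothetical and only serve the illustration. The fallback keyword supplies a value when the key is absent.

```python
import configparser

parser = configparser.ConfigParser()
# Hypothetical configuration, read from a string to keep the example self-contained
parser.read_string("""
[simulation]
time_step = 0.01
n_steps = 10000
verbose = yes
""")

time_step = parser.getfloat('simulation', 'time_step')   # float
n_steps = parser.getint('simulation', 'n_steps')         # int
verbose = parser.getboolean('simulation', 'verbose')     # bool

# 'seed' is not in the file, so the fallback value is returned
seed = parser.getint('simulation', 'seed', fallback=42)

print(time_step, n_steps, verbose, seed)   # 0.01 10000 True 42
```

If a value cannot be converted (for example calling getint on 0.01), a ValueError is raised.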

Writing configuration files

On some occasions it can also be useful to write a configuration file programmatically. configparser allows the user to write INI files as well. As for reading them, everything starts with importing the module and creating an object:

#Let's import the ConfigParser object directly
from configparser import ConfigParser

# And create a config object
config = ConfigParser()

Filling in a configuration works like filling in a dictionary: each section is assigned a dictionary of key/value pairs:

config['simulation'] = {'time_step': 1.0, 'total_time': 200.0}
config['environment'] = {'gravity': 9.81, 'air_resistance': 0.02}
config['initial_conditions'] = {'velocity': 5.0, 'angle': 30.0, 'height': 0.5}

And finally you will have to save it:

with open('config_file_program.ini', 'w') as configfile: ## This opens config_file_program.ini in write mode
    config.write(configfile)

After running that piece of code, you will end up with a new file called config_file_program.ini with the following content:

[simulation]
time_step = 1.0
total_time = 200.0

[environment]
gravity = 9.81
air_resistance = 0.02

[initial_conditions]
velocity = 5.0
angle = 30.0
height = 0.5
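To check that what was written can be read back, here is a minimal round-trip sketch. It writes to an in-memory buffer (io.StringIO) rather than to a file on disk, which is a choice made here only to keep the example self-contained.

```python
import io
from configparser import ConfigParser

# Build a configuration programmatically
config = ConfigParser()
config['simulation'] = {'time_step': 1.0, 'total_time': 200.0}

# Write it to an in-memory text buffer instead of a file
buffer = io.StringIO()
config.write(buffer)

# Read the written text back into a fresh parser
reloaded = ConfigParser()
reloaded.read_string(buffer.getvalue())

raw_value = reloaded['simulation']['time_step']             # '1.0' (a string)
time_step = reloaded.getfloat('simulation', 'time_step')    # 1.0 (a float)
```

The numbers given in the dictionary are stored as strings, so they need to be converted back when the file is read.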

Using command line interfaces

Definition & advantages

A Command Line Interface (CLI) is a text-based interface used to interact with software and operating systems. It allows users to type commands into a terminal or command prompt to perform specific tasks, ranging from file manipulation to running scripts or programs.

CLIs are particularly well suited to research software:

Configuration: With a CLI, it is easy to modify the configuration of a program without having to touch the source code.
Batch Processing: Researchers often need to process large datasets or run simulations multiple times. CLI commands can be easily scripted to automate these tasks, saving time and reducing the potential for human error.
Quick Execution: Experienced users can perform complex tasks more quickly with a CLI compared to navigating through a GUI.
Adding New Features: Adding new arguments and options is straightforward, making it easy to extend the functionality of your software as requirements evolve.
Documentation: A CLI helps document the functionality of your script through its help option, making it clearer how different options and parameters affect the outcome.
Use on HPC systems: HPC systems are usually accessed through a terminal, which makes command line interfaces particularly useful for launching code on them.

Creating a command line interface in Python

In Python, there is a very nice module called argparse. It lets you write a nice command line user interface in very few lines. Again, that module is part of the standard library so you do not need to install anything.

As for the configuration files, we must start by importing the module and creating a parser object. The parser object can take a few arguments, the main ones are:

prog: The name of the program
description: A short description of the program.
epilog: Text displayed at the bottom of the help

We would proceed as follows:

PYTHON

###import the library
import argparse


###create the parser object
parser = argparse.ArgumentParser(prog='My program',
                                 description='This program is an example of command line interface in Python',
                                 epilog='Author: R. Thomas, 2024, UoS')

Now we need to add arguments. To do so we need to use the add_argument method, part of the parser object:

PYTHON

###Add positional argument
parser.add_argument('filename')
parser.add_argument('outputdir')

Arguments of this type (‘filename’ and ‘outputdir’) are positional and therefore mandatory. The user will have to pass a filename AND an output directory to the program, in the order in which they were declared. It is sometimes useful to create optional arguments. This is done by using a - sign as the first character of the argument name:

PYTHON

###Add optional arguments
parser.add_argument('-s', '--start')
parser.add_argument('-e')
parser.add_argument('--color')

You can use either the single dash (‘-s’), the double dash (‘--color’), or both. When an argument is declared with both forms, the user can call it with either one.
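The sketch below illustrates this with the three optional arguments defined above. Passing an explicit list to parse_args stands in for what the user would type in the terminal.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-s', '--start')
parser.add_argument('-e')
parser.add_argument('--color')

# The short and the long form fill the same attribute, named after the long form
short_form = parser.parse_args(['-s', '10'])
long_form = parser.parse_args(['--start', '10'])
print(short_form.start, long_form.start)   # 10 10

# Optional arguments that are not given are set to None
nothing_given = parser.parse_args([])
print(nothing_given.start, nothing_given.e, nothing_given.color)   # None None None
```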

It is possible to use extra options to define arguments, we list a few here:

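action: this option specifies what to do when the argument is found on the command line. For example, with action='store_true' the argument takes no value and is simply set to True when present: parser.add_argument('--verbose', action='store_true').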
default: This allows you to define a default value for the argument. If the user does not provide the argument, the default value is used: parser.add_argument('--color', default='blue').
type: By default, arguments are extracted as strings. Nevertheless, it is possible to have them interpreted as other types using the type argument: parser.add_argument('-i', type=int). If the user passes a value that cannot be converted to the expected type, an error is reported.
choices: If you want to restrict the values an argument can take, you can use the choices option to add this constraint: parser.add_argument('--color', choices=['blue', 'red', 'green']). If the user passes ‘purple’ as a value, an error is reported.
help: finally, and it is probably the most important option, you can provide a short description of the argument: parser.add_argument('--color', help='Color of the curve displayed in the plot')
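The options listed above can be combined on the same argument. The following sketch uses a hypothetical program name (plot_curve) and adds an illustrative --verbose flag with action='store_true'. An invalid choice makes argparse print an error and exit, which is caught here as SystemExit.

```python
import argparse

# 'plot_curve' is a hypothetical program name used for illustration
parser = argparse.ArgumentParser(prog='plot_curve')
parser.add_argument('-i', type=int, default=1, help='Number of iterations')
parser.add_argument('--color', default='blue', choices=['blue', 'red', 'green'],
                    help='Color of the curve displayed in the plot')
parser.add_argument('--verbose', action='store_true',
                    help='Print extra information')

# No argument given: the default values are used
defaults = parser.parse_args([])
print(defaults.i, defaults.color, defaults.verbose)   # 1 blue False

# Values given by the user: '-i' is converted to an integer
custom = parser.parse_args(['-i', '5', '--color', 'red', '--verbose'])
print(custom.i, custom.color, custom.verbose)         # 5 red True

# A value outside 'choices' is rejected: argparse reports the error and exits
try:
    parser.parse_args(['--color', 'purple'])
    rejected = False
except SystemExit:
    rejected = True
```

The help strings are displayed when the user runs the program with -h or --help.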

Finally you must be able to retrieve all the argument values:

###retrieve all arguments
args = parser.parse_args()
print(args.start, args.e, args.color)
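Putting the pieces together, a complete parser with the positional and optional arguments from this section might look like the sketch below. The function name build_parser and the values data.txt and results are hypothetical. Calling parse_args() without a list reads the real command line; the explicit list here only simulates it.

```python
import argparse


def build_parser():
    """Create the parser with two positional and three optional arguments."""
    parser = argparse.ArgumentParser(
        prog='My program',
        description='This program is an example of command line interface in Python')
    parser.add_argument('filename')
    parser.add_argument('outputdir')
    parser.add_argument('-s', '--start')
    parser.add_argument('-e')
    parser.add_argument('--color')
    return parser


# Simulates: python my_script.py data.txt results --color red
args = build_parser().parse_args(['data.txt', 'results', '--color', 'red'])
print(args.filename, args.outputdir, args.start, args.e, args.color)
# data.txt results None None red

# vars() turns the result into a plain dictionary
args_as_dict = vars(args)
```

Positional values are assigned in the order in which they were declared, so swapping data.txt and results on the command line would swap filename and outputdir.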

Final exercise: Mixing command line interface and configuration file

For this last part of the final lecture we will combine both modules we just reviewed: argparse and configparser. Find the instructions below:

Challenge

The program that you will create takes an optional configuration file. If no configuration file is given, the program loads an internal one that you can find here (you need to put this next to your code). To do this you will create an optional argument --file.

arguments:

--name: This argument requires a name. If it is used, the value given will replace the default name in the configuration file.
--save: This argument requires a directory as its value. If used, the configuration is saved into that directory under the name X_config.ini where X is the name of the user found in the configuration file OR the one given by the --name argument.

Show me the solution

AHAHAHAHA you really thought it would be that easy…. :)