How to Extract Numbers and Calculate Average using Regex in Python
Data processing often involves cleaning unstructured text to extract meaningful numerical data. Whether you are parsing logs, user input, or messy data files, Regular Expressions (Regex) are a compact and well-suited tool for this job.
This guide explains how to extract integers and floating-point numbers from a string, calculate their average, and format the output using Python.
Understanding the Regex Pattern
To extract numbers mixed with text (e.g., "Item: 5 cost: 12.99"), simple string splitting won't work because delimiters vary. We need a regex pattern that can match:
- Integers (e.g., 10, 5)
- Floats (e.g., 3.14, 0.5)
- Negative numbers (e.g., -5)
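The pattern r"[-+]?\d*\.\d+|\d+" covers most of these cases:
- [-+]?: Optional + or - sign.
- \d*\.\d+: Matches floats (zero or more digits before the dot, the dot, one or more digits after it).
- |: OR operator.
- \d+: Matches integers (one or more digits).
Because | has the lowest precedence, the optional sign belongs only to the float branch: -5.5 is captured with its sign, but -5 is captured as 5. Wrapping the alternatives in a non-capturing group, r"[-+]?(?:\d*\.\d+|\d+)", applies the sign to both branches.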
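To make the scope of the sign concrete, here is a minimal sketch comparing the pattern from the text with a grouped variant. The names ORIGINAL, GROUPED, and sample are illustrative assumptions, not part of the original script.

```python
import re

# Pattern from the text: the sign belongs only to the float branch.
ORIGINAL = r"[-+]?\d*\.\d+|\d+"
# Assumed variant: a non-capturing group lets the sign apply to both branches.
GROUPED = r"[-+]?(?:\d*\.\d+|\d+)"

sample = "delta -5 and -2.5, then +7"

original_matches = re.findall(ORIGINAL, sample)
grouped_matches = re.findall(GROUPED, sample)

print(original_matches)  # ['5', '-2.5', '7']
print(grouped_matches)   # ['-5', '-2.5', '+7']
```

The group is written as (?:...) rather than (...) because re.findall returns the contents of capturing groups instead of the whole match when the pattern contains any.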
Step 1: Extracting Numbers
We use re.findall() to retrieve all non-overlapping matches as a list of strings.
import re
text = "Temperature is -5.5, pressure is 1013, and humidity 40%."
# ⛔️ Incorrect: Splitting by space fails on "40%." or "-5.5,"
# parts = text.split(" ")
# ✅ Correct: Using regex to isolate numbers
# Finds floats OR integers
pattern = r"[-+]?\d*\.\d+|\d+"
matches = re.findall(pattern, text)
print(f"Extracted strings: {matches}")
Output:
Extracted strings: ['-5.5', '1013', '40']
Step 2: Calculating and Formatting
Once extracted, the data exists as strings. We must convert them to float, calculate the average, and then format the result to 2 decimal places.
matches = ['-5.5', '1013', '40']
# Convert to float
numbers = [float(n) for n in matches]
# Calculate Average
if len(numbers) > 0:
    average = sum(numbers) / len(numbers)
else:
    average = 0
# ✅ Correct: Format to 2 decimal places
# Methods: f-string (modern) or .format() (compatible)
formatted_avg = f"{average:.2f}"
print(f"Average: {formatted_avg}")
Output:
Average: 349.17
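The two formatting methods mentioned in the comment can be compared side by side. The sketch below is illustrative only: the variable names are assumptions, and the built-in format() call is shown as a third equivalent option.

```python
# Mean of -5.5, 1013 and 40, the values extracted in Step 1.
average = 1047.5 / 3

with_fstring = f"{average:.2f}"
with_format_method = "{:.2f}".format(average)
with_builtin = format(average, ".2f")

print(with_fstring, with_format_method, with_builtin)  # 349.17 349.17 349.17

# round() returns a float, so trailing zeros are not kept when it is printed.
print(round(7.0, 2))  # 7.0
print(f"{7.0:.2f}")   # 7.00
```

All three format calls use the same ".2f" format specification, so the choice between them is mostly a matter of style and of which Python versions need to be supported.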
Step 3: Handling Command Line Arguments
To make the script reusable from the terminal, we use the sys module to read arguments passed during execution (e.g., python script.py "some text").
- sys.argv[0]: The name of the script.
- sys.argv[1]: The first argument passed by the user.
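As a small sketch of how these two entries are typically used, the helper below reads the first user argument if one exists. The function name first_argument is a hypothetical helper introduced for illustration; it takes the argument list as a parameter so it can be tried without a terminal.

```python
import sys


def first_argument(argv):
    """Return the first user-supplied argument, or None if there is none."""
    # argv[0] is the script name, so user arguments start at index 1.
    return argv[1] if len(argv) > 1 else None


if __name__ == "__main__":
    text = first_argument(sys.argv)
    if text is None:
        print("No argument given")
    else:
        print(f"Received: {text}")
```

Checking the length before indexing avoids an IndexError when the script is run without arguments.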
Complete Code Solution
Here is the complete find_num.py script combining regex extraction, math, and CLI argument handling.
import re
import sys


def calculate_average(text):
    """
    Extracts numbers from text and returns their average
    formatted to 2 decimal places.
    """
    # 1. Find all numbers (integers and floats)
    # Pattern explanation:
    # [-+]?    -> Optional sign (binds to the float branch only)
    # \d*\.\d+ -> Floats (e.g., .5, 10.5)
    # |        -> OR
    # \d+      -> Integers (e.g., 50)
    numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)

    # 2. Check if numbers exist to avoid ZeroDivisionError
    if not numbers:
        return "0.00"

    # 3. Convert strings to floats
    numbers = [float(num) for num in numbers]

    # 4. Calculate Average
    average = sum(numbers) / len(numbers)

    # 5. Format output
    return "{:.2f}".format(average)


if __name__ == "__main__":
    # Ensure an argument was provided
    if len(sys.argv) > 1:
        input_text = sys.argv[1]
        result = calculate_average(input_text)
        print(result)
    else:
        print("Error: Please provide a string argument.")
Testing the Script
Test Case 1
python3 find_num.py "a11 b3.14c15 16"
Output:
11.29
Test Case 2
python3 find_num.py "a 5 b 6 c7 dd8 9"
Output:
7.00
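The same checks can also be run without the command line. The sketch below repeats the function's logic so that it runs on its own, which is an assumption made for self-containment; in practice one would import calculate_average from find_num.py instead.

```python
import re


def calculate_average(text):
    """Same logic as in find_num.py, repeated so this sketch runs on its own."""
    numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)
    if not numbers:
        return "0.00"
    values = [float(num) for num in numbers]
    return "{:.2f}".format(sum(values) / len(values))


# Test Case 2 from above: 5, 6, 7, 8 and 9 average to 7.
assert calculate_average("a 5 b 6 c7 dd8 9") == "7.00"
# No digits at all: the function falls back to "0.00".
assert calculate_average("no numbers here") == "0.00"
# The sample sentence from Step 1.
assert calculate_average(
    "Temperature is -5.5, pressure is 1013, and humidity 40%."
) == "349.17"
print("all checks passed")
```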
Conclusion
By combining re.findall with basic list comprehensions, you can extract numerical data from chaotic text strings efficiently.
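- Regex: r"[-+]?\d*\.\d+|\d+" captures unsigned integers and signed or unsigned decimals. If negative integers must keep their sign, group the alternatives: r"[-+]?(?:\d*\.\d+|\d+)".
- Conversion: Always convert regex matches (strings) to float or int before math.
- Formatting: Use "{:.2f}".format() to ensure clean, readable output.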