Optimising Parallel Python and the Billion Row Challenge
1brc, Multiprocessing, Python
Attempting the Billion Row Challenge but in python.
This blog post is a copy of 1brc or The Billion Row Challenge - The goal, as per Gunnar Morlings 1brc is stated as:
The text file contains temperature values for a range of weather stations.
Each row is one measurement in the format <string: station name>;<double: measurement>
, with the measurement value
having exactly one fractional digit.
The following shows ten rows as an example:
1 2 3 4 5 6 7 8 9 10
Hamburg;12.0 Bulawayo;8.9 Palembang;38.8 St. John's;15.2 Cracow;12.6 Bridgetown;26.9 Istanbul;6.2 Roseau;34.4 Conakry;31.2 Istanbul;23.0
The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per
weather station, and emits the results on stdout like this
(i.e. sorted alphabetically by station name, and the result values per station in the format <min>/<mean>/<max>
,
rounded to one fractional digit):
1
{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...}
Obviously, this will be done in Python in this case! As of writing the leaderboard is looking quite fast! With a result of 00:12.063/00:59.430/04:13.449.
Results are determined by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM)
I'm running everything locally on my M1 Mac with a M1 Max chip. According to my brief research into the chips, the Hertzner cloud runs on I'm going to assume my times will be faster by up to a factor of 2. That means to get anywhere near performant be aiming for ~2 mins and I'll be clinging on the slowest times of the original 1brc.
These are going to be some hard numbers to beat, especially in Python - but if we can get under our target that would be amazing. My approach will take a few stages, iteratively trying to get better with different optimisation approaches.
0 - ASAP With Polars
It's a nice place to start, a tool I'm familiar with and will likely be relatively fast thanks to its low-level implementations of operations using the Arrow in-memory storage.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
import logging from pathlib import Path import polars as pl from brc.util import DATA_DIR, timit_context def main(file_path: Path): lf = pl.scan_csv( file_path, has_header=False, separator=";", with_column_names=lambda _: ["city", "temp"], schema={"city": pl.Categorical, "temp": pl.Float64} ) q = ( lf.group_by("city") .agg( pl.min("temp").alias("min_temp"), pl.mean("temp").round(1).alias("mean_temp"), pl.max("temp").alias("max_temp"), ) .select( pl.col("city"), pl.concat_str([ pl.col("min_temp"), pl.col("mean_temp"), pl.col("max_temp") ], separator="/").alias("_res_str") ) .sort("city") .select( pl.concat_str([ pl.col("city"), pl.col("_res_str") ], separator="=").alias("res_str") ) ) res = q.collect() print(", ".join(res["res_str"])) if __name__ == '__main__': logging.basicConfig(level=logging.INFO) path = DATA_DIR / "data_1_000_000_000.txt" with timit_context(1_000_000_000): main(path)
In ~30 lines of code, we can do all the parsing, maths and string concat operation to create the result. On my machine, this took just over 1 minute*. Given how fast I was able to put this together using industry-standard Python libs is pretty sweet, that and the fact I'm now able to move on to other work rather than mess around using multiprocessing and parsing algos means the total cost of this solution is very cheap.
1 2 3
INFO:brc.util:benchmark >>> Starting timer INFO:brc.util:benchmark <<< Elapsed time: 0:01:00.263331 INFO:brc.util:benchmark <<< with 1,000,000,000 rows => 16,593,838.763 rows/s.
HOWEVER
Literally rule 2 mentions (rule 1 is about using Java… rules are meant to be broken right?)
No external library dependencies may be used
So let's do it with no libs...
1 - Make a Python Script to Read Then Calculate
The file I'm working with is ~13Gb, and my laptop has 64Gb, so I can do this all in memory. I'll probably end up not doing it all in memory as the target machine has less than 8Gb but I'll cross that bridge when I get there.
Here is the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
import csv import logging from dataclasses import dataclass from pathlib import Path from brc.util import DATA_DIR, timit_context def main(file_path: Path): @dataclass class CollectionStruct(): city: bytes min_temp: float max_temp: float sum_temps: float count: int def mean(self): return self.sum_temps / self.count def __repr__(self): return f"{self.city.decode("utf-8")}={self.min_temp}/{self.mean():,.1f}/{self.max_temp}" temps: dict[bytes, CollectionStruct] = {} with open(file_path, 'rb') as f: for line in f: # parse line data city, temp = line.split(b";") temp = float(temp) # grab existing data collect = temps.get(city, CollectionStruct(city, temp, temp, 0, 0)) # add line to the collection settings collect.count += 1 collect.sum_temps += temp collect.min_temp = min(collect.min_temp, temp) collect.max_temp = max(collect.max_temp, temp) # update the dictionary temps[city] = collect # ordering here is slightly quicker than the dataclass ordering print(" ".join((str(s) for s in sorted(temps.values(), key=lambda c: c.city)))) if __name__ == "__main__": logging.basicConfig(level=logging.INFO) # This solution will not be fast, 100mm records is enough for a sample at the moment. n = 100_000_000 path = DATA_DIR / f"data_{n:_}.txt" with timit_context(n, "basic"): main(path) with timit_context(n, "basic_profile", profile=True): main(path)
We use data classes and a non-parallelized approach, giving us a result for 100mm records. This first step is really to gauge how fast native Python is vs polars and estimate how long 1bn records would take. the results are below:
1 2 3 4
INFO:brc.util:basic >>> Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.4/75.2 Addis Ababa=-32.3/16.0/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.9/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.8/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.2/60.9 Bata=-21.1/25.1/72.3 Batumi=-28.4/14.0/58.0 Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize City=-17.6/26.7/72.0 Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.7/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.4/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/30.0/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/26.0/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/25.0/68.6 Bridgetown=-26.5/27.0/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.3/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.9/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.4/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.8/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.4/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.4/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.2/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.8/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/17.0/60.5 Dampier=-16.7/26.4/72.9 Dar es Salaam=-17.6/25.8/69.0 Darwin=-17.3/27.6/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/30.0/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.9/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.1/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.5/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.4/72.2 Frankfurt=-32.5/10.6/56.5 Fresno=-28.5/17.9/62.4 Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.5/62.2 Gagnoa=-17.9/26.0/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.9/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.4/69.7 Guatemala City=-21.5/20.4/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/27.0/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.4/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.5/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.3/64.4 Jakarta=-20.7/26.7/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.5/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/20.0/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.1/72.6 Karonga=-20.6/24.4/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.4/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/26.0/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.7/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.7/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.3/80.2 Lomé=-18.6/26.9/68.5 London=-40.4/11.4/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.8/66.9 Lusaka=-24.0/19.9/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.3/70.4 Makassar=-17.1/26.7/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.3/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.3/73.1 Manama=-18.8/26.5/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.4/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.6/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek'ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.1/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.4/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.3/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/28.0/73.1 Mzuzu=-34.9/17.7/61.8 N'Djamena=-13.8/28.3/72.5 Naha=-21.1/23.1/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.3/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.6/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.3/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.7/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.9/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.4/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.5/71.0 Palmerston North=-34.5/13.2/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.9/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.1/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.9/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.4/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.5/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.8/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.3/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/26.0/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.3/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.6/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.1/75.7 Sana'a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/27.0/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John's=-39.7/5.0/52.4 St. Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.1/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/23.0/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.9/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.9/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/24.0/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.4/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.9/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.9/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi'an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/26.0/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:basic <<< Elapsed time: 0:01:13.951427 INFO:brc.util:basic <<< with 100,000,000 rows => 1,352,238.944 rows/s. brc complete est 0:12:19.514273
! Note: the different values are due to producing the data of different lengths.
Profiling this code with cProfile shows the weak points, this result is logged to the screen using the Python builtins. Profiling the code does make it slower - so the below extended time is because of the attached profiler.
1 2 3 4 5 6 7 8 9 10 11
500002507 function calls (500002504 primitive calls) in 161.663 seconds Ordered by: cumulative time List reduced from 108 to 5 due to restriction <5> ncalls tottime percall cumtime percall filename:lineno(function) 1 96.971 96.971 161.663 161.663 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_basic.py:9(main) 100000000 15.139 0.000 15.139 0.000 {built-in method builtins.max} 100000000 15.119 0.000 15.119 0.000 {method 'split' of 'bytes' objects} 100000000 15.030 0.000 15.030 0.000 {built-in method builtins.min} 100000013 13.081 0.000 13.081 0.000 {method 'get' of 'dict' objects}
It looks like most of the time (main) is spent doing the min and max for the temp values, then parsing the file lines as if it were a CSV, and then doing the lookups for the collection objects. Fundamentally we will need to change the processing approach as these are all builtins of Python and are considered as fast as they can get. So let's try spreading these operations to other processes.
2 - Parallelize The Slow Bits
If we look to leverage the multiprocessing and multithreading modules we would be able to pass around data to separate processes to allow these min()
max()
and split()
lines to be run by various processes. In part 1 we saw the profile results showing the code is still CPU bound, not IO bound; this means threading isn't needed (yet) and we should opt for more processes. Multiprocessing is more heavyweight and takes more to launch a new process than a thread. However once the thread is up it's relatively fast and unbound by the GIL. (It's also my personal preference in Python to start with coroutines then processes and approach multiprocessing from a 1:1 core:process implementation design, then thread these processes if needed)
Below is the code which allows the summing of various read results to be paralleled across several processes. It essentially batches 100k records being read from the file and places them onto a worker to complete the aggregation process. The largest part of refactoring this code is moving data back and forth from different processes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
import logging from collections import defaultdict from dataclasses import dataclass from functools import lru_cache from itertools import batched, chain from multiprocessing import Pool from pathlib import Path from brc.util import DATA_DIR, timit_context @dataclass class CollectionStruct(): city: str min_temp: float max_temp: float sum_temps: float count: int def __init__(self, city: str = '', init_value: float = 0): self.city = city self.min_temp = init_value self.max_temp = init_value self.sum_temps = init_value self.count = 1 def mean(self): return self.sum_temps / self.count def __add__(self, other: "CollectionStruct"): # this is more of a merge function # the city line is to allow for the default dict default factory. there's probably a performance hit here self.city = other.city self.count += other.count self.sum_temps += other.sum_temps self.min_temp = min(self.min_temp, other.min_temp) self.max_temp = max(self.max_temp, other.max_temp) return self def __repr__(self): return f"{self.city}={self.min_temp}/{self.mean():.1f}/{self.max_temp}" @lru_cache(None) def parse_city(city: bytes) -> str: return city.decode("utf-8") @lru_cache def parse_temp(temp: bytes) -> float: return float(temp) def do_parse(line: bytes): city, temp = line.split(b";") temp = float(temp) city = parse_city(city) return CollectionStruct(city, temp) def process_lines(*lines: bytes): """ Takes a number of lines from the file and aggregates them to a single results dict """ totals = defaultdict(CollectionStruct) for line in (do_parse(l) for l in lines): totals[line.city] += line return totals def main(file_path: Path): totals = defaultdict(CollectionStruct) # read in file and batch lines in groups to a process to aggregate with open(file_path, 'rb') as f: # 8 to match the change target machine cores. with Pool(8) as pool: # this is slightly faster than pool.map(process_lines, f, n) thanks to serializing more lines to each # process at once (the slow bit is moving bin objects to python res = pool.starmap(process_lines, batched(f, 250_000)) # group all results (approx n_proc x m_distinct_stations elements) for c in chain.from_iterable((r.values() for r in res)): totals[c.city] += c print(" ".join((str(s) for s in sorted(totals.values(), key=lambda c: c.city)))) if __name__ == "__main__": logging.basicConfig(level=logging.INFO) n = 100_000_000 path = DATA_DIR / f"data_{n:_}.txt" with timit_context(n, "parallel"): main(path) with timit_context(n, "parallel_profile", profile=True): main(path)
Then we look at the results and profile of this run:
1 2 3 4
INFO:brc.util:parallel >>> Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.3/75.2 Addis Ababa=-32.3/15.9/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.8/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.7/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.1/60.9 Bata=-21.1/25.0/72.3 Batumi=-28.4/14.0/58.0 Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize City=-17.6/26.7/72.0 Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.6/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.3/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/29.9/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/25.9/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/24.9/68.6 Bridgetown=-26.5/26.9/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.2/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.8/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.3/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.7/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.3/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.3/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.1/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.7/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/16.9/60.5 Dampier=-16.7/26.3/72.9 Dar es Salaam=-17.6/25.7/69.0 Darwin=-17.3/27.5/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/29.9/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.8/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.0/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.4/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.3/72.2 Frankfurt=-32.5/10.6/56.5 Fresno=-28.5/17.8/62.4 Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.4/62.2 Gagnoa=-17.9/25.9/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.8/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.3/69.7 Guatemala City=-21.5/20.3/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/26.9/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.3/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.4/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.2/64.4 Jakarta=-20.7/26.6/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.4/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/19.9/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.0/72.6 Karonga=-20.6/24.3/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.3/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/25.9/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.6/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.6/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.2/80.2 Lomé=-18.6/26.9/68.5 London=-40.4/11.3/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.7/66.9 Lusaka=-24.0/19.8/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.2/70.4 Makassar=-17.1/26.6/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.2/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.2/73.1 Manama=-18.8/26.4/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.3/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.5/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek'ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.0/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.3/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.2/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/27.9/73.1 Mzuzu=-34.9/17.6/61.8 N'Djamena=-13.8/28.3/72.5 Naha=-21.1/23.0/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.2/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.5/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.2/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.6/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.8/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.3/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.4/71.0 Palmerston North=-34.5/13.1/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.8/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.0/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.8/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.3/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.4/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.7/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.2/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/25.9/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.2/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.5/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.0/75.7 Sana'a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/26.9/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John's=-39.7/5.0/52.4 St. Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.0/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/22.9/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.8/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.8/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/23.9/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.3/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.8/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.8/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi'an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/25.9/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:parallel <<< Elapsed time: 0:00:27.071418 INFO:brc.util:parallel <<< with 100,000,000 rows => 3,693,932.807 rows/s. brc complete est 0:04:30.714183
30 seconds isn't bad for 100mm records in Python, not amazing but it over halved the time from the non-multiprocess solution. It is however less of an improvement than I would have liked; only ~3x records processed per second with 8 more cores really should be a much larger improvement. This will also leave us with a very long time for the full billion too.
1 2 3 4 5 6 7 8 9 10 11
553270 function calls (553255 primitive calls) in 29.698 seconds Ordered by: cumulative time List reduced from 280 to 5 due to restriction <5> ncalls tottime percall cumtime percall filename:lineno(function) 249 0.001 0.000 60.440 0.243 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/pool.py:500(_wait_for_updates) 39 0.029 0.001 35.980 0.923 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:201(send) 44 0.000 0.000 31.395 0.714 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:389(_send_bytes) 75 0.000 0.000 31.394 0.419 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:364(_send) 107 2.057 0.019 31.394 0.293 {built-in method posix.write}
Looking at the profile We can see most of the code time has migrated to waiting for updates and sending data to and from each child process. This approach is taking a hacksaw to the problem and brute-forcing the code to run in more places. It's not a bad approach, but now we would have to optimise the heebie-jeebies of Python multiprocessing, which would be rather technical. One problem with this approach is that we haven't fine-tuned the single process first; which leads to just slamming the inefficient process.
Another approach to this multiprocessing module would be to try leveraging some of the python's shared state tools, such as the Value, Array, or in this case, a Manager would be best. (We could also look into using a Queue but this isn't really the right problem). In the end, this will likely end up still having the same problem as before; python objects are expensive to send and receive across processes.
3 - Change the Data Pipeline
As part 2 showed were still wasting a lot of time passing data back and forth across Python processes and we hadn't really thought of the underlying problem. This can be improved by partitioning the problem and allowing each process to reach the file at the same time then do its processing for its chunk then return the result. This allows us to focus on optimising a single partition flow and then distributing that to multiple cores, it will also allow us to reduce the data we send back and forth between the processes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
import logging import os from collections import defaultdict from functools import reduce from itertools import chain, pairwise from multiprocessing import Pool from pathlib import Path from brc.util import DATA_DIR, timit_context def do_parse(line: bytes): # this is basically O(n) every time. # todo: no improvement possible for parsing in python? city, temp = line.split(b";", maxsplit=1) temp = float(temp) return city, temp result_tuple = tuple[float, float, float, int] def collect_part(file_path: Path, start: int, end: int) -> dict[bytes, result_tuple]: data = defaultdict(list) part_data = read_file_part(file_path, start, end) for part in part_data: city, temp = do_parse(part) # todo: .append() is still slow, change this list approach somehow? data[city].append(temp) return { city: (min(items), max(items), sum(items), len(items)) for city, items in data.items() } def merge_result(one: dict[bytes, result_tuple], two: dict[bytes, result_tuple]): # todo: can this be optimised further? for k, two_v in two.items(): # if we have a common data, do the partial sum if k in one.keys(): one_v = one[k] one[k] = ( min(one_v[0], two_v[0]), max(one_v[1], two_v[1]), one_v[2] + two_v[2], one_v[3] + two_v[3] ) else: # if we don't have a common data, overwrite one[k] = two_v return one def read_file_part(file_path: Path, start: int, end: int) -> list[bytes]: """ Reads in the files bytes from a start position to the end position """ with open(file_path, 'rb') as f: f.seek(start) # we provide the offset form our data return f.readlines(end - start) def find_next_newline(file_path: Path, start: int) -> int: """ Finds the next newline in the file after the position given. """ # special case if were at the start of the file if start == 0: return 0 offset = start + 1 with open(file_path, 'rb') as f: # move to start place f.seek(start) # read byte by byte til a newline is found while f.read(1) != b'\n': offset += 1 return offset def main(file_path: Path): file_size = os.path.getsize(file_path) n_splits = 5000 split_size = file_size // n_splits partitions = (find_next_newline(file_path, n * split_size) for n in range(n_splits)) # extend partitions with the end of the file. partitions = chain(partitions, (file_size,)) # single threaded version for profiling # results = [] # for i, j in pairwise(partitions): # results.append(collect_part(file_path, i, j - 1)) # multithreading version distributing the loop/generator with Pool(8) as pool: partitions_ = ((file_path, i, j - 1) for i, j in pairwise(partitions)) results = pool.starmap(collect_part, partitions_, 10) parts = reduce(merge_result, results) sorted_items = sorted(parts.items()) print(" ".join(f"{k.decode("utf-8")}={p[0]}/{p[2] / p[3]:.1f}/{p[1]}" for k, p in sorted_items)) if __name__ == "__main__": logging.basicConfig(level=logging.INFO) n = 1_000_000_000 path = DATA_DIR / f"data_{n:_}.txt" with timit_context(n, f"parallel"): main(path)
The single process version produces the following result for 100mm rows:
1 2 3 4
INFO:brc.util:parallel2 >>> Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.4/75.2 Addis Ababa=-32.3/16.0/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.9/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.8/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.2/60.9 Bata=-21.1/25.1/72.3 Batumi=-28.4/14.0/58.0 Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize City=-17.6/26.7/72.0 Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.7/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.4/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/30.0/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/26.0/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/25.0/68.6 Bridgetown=-26.5/27.0/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.3/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.9/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.4/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.8/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.4/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.4/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.2/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.8/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/17.0/60.5 Dampier=-16.7/26.4/72.9 Dar es Salaam=-17.6/25.8/69.0 Darwin=-17.3/27.6/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/30.0/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.9/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.1/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.5/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.4/72.2 Frankfurt=-32.5/10.6/56.5 Fresno=-28.5/17.9/62.4 Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.5/62.2 Gagnoa=-17.9/26.0/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.9/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.4/69.7 Guatemala City=-21.5/20.4/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/27.0/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.4/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.5/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.3/64.4 Jakarta=-20.7/26.7/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.5/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/20.0/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.1/72.6 Karonga=-20.6/24.4/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.4/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/26.0/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.7/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.7/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.3/80.2 Lomé=-18.6/26.9/68.5 London=-40.4/11.3/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.8/66.9 Lusaka=-24.0/19.9/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.3/70.4 Makassar=-17.1/26.7/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.3/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.3/73.1 Manama=-18.8/26.5/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.4/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.6/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek'ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.1/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.4/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.3/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/28.0/73.1 Mzuzu=-34.9/17.7/61.8 N'Djamena=-13.8/28.3/72.5 Naha=-21.1/23.1/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.3/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.6/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.3/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.7/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.9/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.4/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.5/71.0 Palmerston North=-34.5/13.2/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.9/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.1/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.9/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.4/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.5/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.8/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.3/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/26.0/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.3/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.6/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.1/75.7 Sana'a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/27.0/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John's=-39.7/5.0/52.4 St. Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.1/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/23.0/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.9/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.9/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/24.0/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.4/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.9/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.9/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi'an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/26.0/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:parallel2 <<< Elapsed time: 0:00:37.261046 INFO:brc.util:parallel2 <<< with 100,000,000 rows => 2,683,767.901 rows/s. brc complete est 0:06:12.610463
An estimate of just over 6 minutes for a billion rows is decent for Python and beat our first attempt for 100mm by half. Also profiling a single-threaded version is much easier to understand:
1 2 3 4 5 6 7 8 9 10 11
302615043 function calls (302615041 primitive calls) in 87.055 seconds Ordered by: cumulative time List reduced from 51 to 5 due to restriction <5> ncalls tottime percall cumtime percall filename:lineno(function) 1 1.793 1.793 87.055 87.055 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:91(main) 999 27.433 0.027 84.822 0.085 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:33(collect_part) 99899999 24.782 0.000 39.515 0.000 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:21(do_parse) 99899999 14.733 0.000 14.733 0.000 {method 'split' of 'bytes' objects} 99900998 9.070 0.000 9.070 0.000 {method 'append' of 'list' objects}
This shows the slow bits are still down to Python builtin calls (split
and append
) which in their cases are because of the data structures we use. Changing the processing would be the solution to reduce this time, however, these are well-designed for the problem so some out-of-the-box thinking with lower-level stuff will be needed for these numbers to improve. I find that probably out of scope for "pure Python".
The multiprocessing result is, surprisingly, much faster than expected, even for 1 billion records!
1 2 3 4
INFO:brc.util:parallel >>> Starting timer profile=False Abha=-31.4/18.0/67.0 Abidjan=-26.3/26.0/75.2 Abéché=-23.3/29.4/79.7 Accra=-21.8/26.4/74.9 Addis Ababa=-31.6/16.0/63.5 Adelaide=-42.0/17.3/71.0 Aden=-18.2/29.1/80.3 Ahvaz=-23.9/25.4/75.4 Albuquerque=-34.7/14.0/64.6 Alexandra=-40.1/11.0/60.8 Alexandria=-26.7/20.0/75.5 Algiers=-32.1/18.2/69.0 Alice Springs=-27.2/21.0/69.9 Almaty=-40.9/10.0/61.2 Amsterdam=-38.4/10.2/59.4 Anadyr=-59.4/-6.9/44.8 Anchorage=-51.0/2.8/54.2 Andorra la Vella=-40.6/9.8/59.3 Ankara=-36.9/12.0/63.1 Antananarivo=-36.2/17.9/67.9 Antsiranana=-24.8/25.2/78.0 Arkhangelsk=-49.5/1.3/52.2 Ashgabat=-35.1/17.1/68.6 Asmara=-32.8/15.6/64.8 Assab=-20.6/30.5/80.5 Astana=-46.6/3.5/50.9 Athens=-31.1/19.2/68.3 Atlanta=-37.0/17.0/66.6 Auckland=-34.0/15.2/67.0 Austin=-32.6/20.7/70.0 Baghdad=-30.5/22.8/71.3 Baguio=-29.4/19.5/67.9 Baku=-32.9/15.1/63.9 Baltimore=-34.3/13.1/62.9 Bamako=-20.0/27.8/80.2 Bangkok=-20.5/28.6/76.5 Bangui=-24.1/26.0/75.0 Banjul=-23.8/26.0/72.0 Barcelona=-35.3/18.2/70.4 Bata=-22.7/25.1/74.2 Batumi=-33.6/14.0/64.4 Beijing=-36.1/12.9/66.3 Beirut=-29.6/20.9/70.0 Belgrade=-40.9/12.5/60.0 Belize City=-21.6/26.7/77.2 Benghazi=-29.3/19.9/71.9 Bergen=-40.8/7.7/59.0 Berlin=-38.7/10.3/60.8 Bilbao=-37.2/14.7/63.8 Birao=-23.8/26.5/78.9 Bishkek=-39.8/11.3/61.8 Bissau=-27.2/27.0/77.9 Blantyre=-26.4/22.2/75.1 Bloemfontein=-34.2/15.6/64.4 Boise=-36.0/11.4/65.7 Bordeaux=-44.7/14.2/62.3 Bosaso=-17.7/30.0/77.3 Boston=-36.8/10.9/57.7 Bouaké=-24.3/26.0/74.5 Bratislava=-38.4/10.5/62.8 Brazzaville=-24.1/25.0/73.5 Bridgetown=-20.0/27.0/78.7 Brisbane=-31.2/21.4/72.6 Brussels=-38.9/10.5/59.9 Bucharest=-35.3/10.8/64.3 Budapest=-36.5/11.3/62.1 Bujumbura=-32.7/23.8/70.4 Bulawayo=-29.8/18.9/66.8 Burnie=-38.5/13.1/68.0 Busan=-35.5/15.0/61.5 Cabo San Lucas=-24.8/23.9/77.8 Cairns=-29.9/25.0/76.7 Cairo=-28.0/21.4/71.8 Calgary=-46.8/4.4/54.4 Canberra=-35.6/13.1/60.7 Cape Town=-38.6/16.2/62.9 Changsha=-29.1/17.4/66.0 Charlotte=-32.5/16.1/63.1 Chiang Mai=-25.6/25.8/76.4 Chicago=-39.0/9.8/61.1 Chihuahua=-32.0/18.6/70.4 Chittagong=-30.0/25.9/76.8 Chișinău=-40.5/10.2/58.1 Chongqing=-34.4/18.6/67.3 Christchurch=-39.2/12.2/61.2 City of San Marino=-38.8/11.8/61.4 Colombo=-19.9/27.4/81.3 Columbus=-34.9/11.7/61.2 Conakry=-24.4/26.4/75.2 Copenhagen=-39.6/9.1/57.2 Cotonou=-23.4/27.2/78.0 Cracow=-40.2/9.3/62.0 Da Lat=-31.8/17.9/68.7 Da Nang=-24.0/25.8/74.3 Dakar=-24.0/24.0/72.7 Dallas=-31.0/19.0/66.7 Damascus=-33.4/17.0/69.7 Dampier=-23.8/26.4/77.3 Dar es Salaam=-24.8/25.8/79.6 Darwin=-22.0/27.6/81.9 Denpasar=-23.8/23.7/73.8 Denver=-38.1/10.4/62.3 Detroit=-36.2/10.0/58.3 Dhaka=-23.8/25.9/77.1 Dikson=-56.5/-11.1/40.8 Dili=-22.6/26.6/73.3 Djibouti=-24.2/29.9/81.2 Dodoma=-25.5/22.7/74.5 Dolisie=-27.3/24.0/74.0 Douala=-19.2/26.7/77.6 Dubai=-20.6/26.9/78.3 Dublin=-39.1/9.8/59.9 Dunedin=-37.4/11.1/61.9 Durban=-29.3/20.6/68.7 Dushanbe=-37.8/14.7/63.6 Edinburgh=-38.3/9.3/61.7 Edmonton=-46.1/4.2/53.2 El Paso=-35.7/18.1/69.3 Entebbe=-30.8/21.0/80.1 Erbil=-30.6/19.5/68.4 Erzurum=-44.7/5.1/57.5 Fairbanks=-55.5/-2.3/47.1 Fianarantsoa=-37.1/17.9/66.4 Flores, Petén=-24.1/26.4/74.5 Frankfurt=-37.8/10.6/57.6 Fresno=-38.2/17.9/66.2 Fukuoka=-34.2/17.0/69.1 Gaborone=-29.3/21.0/71.3 Gabès=-35.2/19.5/70.8 Gagnoa=-23.6/26.0/80.2 Gangtok=-36.2/15.2/63.2 Garissa=-22.2/29.3/78.3 Garoua=-25.9/28.3/77.9 George Town=-18.4/27.9/77.8 Ghanzi=-28.6/21.4/69.1 Gjoa Haven=-64.9/-14.4/34.7 Guadalajara=-28.6/20.9/70.7 Guangzhou=-29.4/22.4/76.6 Guatemala City=-28.9/20.4/71.2 Halifax=-43.1/7.5/57.3 Hamburg=-41.7/9.7/59.6 Hamilton=-34.9/13.8/63.4 Hanga Roa=-29.4/20.5/75.5 Hanoi=-24.3/23.6/71.8 Harare=-28.4/18.4/69.0 Harbin=-43.3/5.0/54.7 Hargeisa=-28.4/21.7/72.0 Hat Yai=-21.5/27.0/76.2 Havana=-25.9/25.2/76.8 Helsinki=-45.1/5.9/58.6 Heraklion=-29.7/18.9/70.6 Hiroshima=-34.5/16.3/67.5 Ho Chi Minh City=-23.2/27.4/80.8 Hobart=-39.6/12.7/63.8 Hong Kong=-25.3/23.3/75.1 Honiara=-24.5/26.5/76.0 Honolulu=-22.1/25.4/79.8 Houston=-32.3/20.8/67.8 Ifrane=-40.1/11.4/62.2 Indianapolis=-39.0/11.8/63.1 Iqaluit=-60.5/-9.3/45.3 Irkutsk=-49.2/1.0/49.9 Istanbul=-33.1/13.9/62.8 Jacksonville=-26.8/20.3/69.8 Jakarta=-25.3/26.7/76.2 Jayapura=-21.7/27.0/74.1 Jerusalem=-30.6/18.3/68.5 Johannesburg=-31.5/15.5/69.1 Jos=-27.6/22.8/71.4 Juba=-26.7/27.8/78.8 Kabul=-35.0/12.1/62.4 Kampala=-30.0/20.0/72.3 Kandi=-24.4/27.7/78.9 Kankan=-20.3/26.5/74.9 Kano=-22.1/26.4/77.1 Kansas City=-34.9/12.5/65.5 Karachi=-25.9/26.0/76.5 Karonga=-26.0/24.4/73.6 Kathmandu=-37.6/18.3/67.6 Khartoum=-21.1/29.9/81.3 Kingston=-24.1/27.4/81.6 Kinshasa=-27.8/25.3/75.0 Kolkata=-25.7/26.7/73.2 Kuala Lumpur=-22.0/27.3/78.3 Kumasi=-21.5/26.0/74.6 Kunming=-32.6/15.7/69.9 Kuopio=-49.4/3.4/51.8 Kuwait City=-25.1/25.7/76.6 Kyiv=-42.1/8.4/57.3 Kyoto=-37.9/15.8/63.3 La Ceiba=-22.1/26.2/81.9 La Paz=-24.0/23.7/72.3 Lagos=-31.0/26.8/77.0 Lahore=-22.9/24.3/78.6 Lake Havasu City=-30.2/23.7/73.0 Lake Tekapo=-43.5/8.7/61.8 Las Palmas de Gran Canaria=-29.5/21.2/70.0 Las Vegas=-29.9/20.3/72.9 Launceston=-34.0/13.1/64.6 Lhasa=-41.8/7.6/58.4 Libreville=-25.0/25.9/77.0 Lisbon=-34.7/17.5/67.3 Livingstone=-30.8/21.8/75.3 Ljubljana=-40.9/10.9/62.0 Lodwar=-27.1/29.3/78.5 Lomé=-21.9/26.9/80.6 London=-38.7/11.3/57.7 Los Angeles=-32.5/18.6/68.5 Louisville=-36.5/13.9/64.1 Luanda=-22.8/25.8/82.6 Lubumbashi=-30.9/20.8/71.3 Lusaka=-31.6/19.9/71.4 Luxembourg City=-44.6/9.3/62.2 Lviv=-40.7/7.8/55.9 Lyon=-37.9/12.5/66.8 Madrid=-36.7/15.0/62.8 Mahajanga=-32.1/26.3/78.5 Makassar=-24.4/26.7/78.3 Makurdi=-27.8/26.0/71.4 Malabo=-26.0/26.3/78.7 Malé=-19.7/28.0/81.3 Managua=-24.2/27.3/76.1 Manama=-26.7/26.5/79.2 Mandalay=-23.0/28.0/78.6 Mango=-22.4/28.1/78.5 Manila=-22.3/28.4/79.9 Maputo=-27.9/22.8/71.4 Marrakesh=-30.8/19.6/69.1 Marseille=-33.4/15.8/67.8 Maun=-28.1/22.4/68.7 Medan=-21.2/26.5/74.5 Mek'ele=-30.1/22.7/72.3 Melbourne=-37.1/15.1/66.4 Memphis=-33.9/17.2/69.7 Mexicali=-26.9/23.1/74.0 Mexico City=-30.9/17.5/74.1 Miami=-28.3/24.9/80.7 Milan=-37.3/13.0/61.8 Milwaukee=-38.7/8.9/58.3 Minneapolis=-41.4/7.8/57.4 Minsk=-45.5/6.7/58.7 Mogadishu=-25.3/27.1/80.7 Mombasa=-25.6/26.3/74.5 Monaco=-32.1/16.4/71.8 Moncton=-45.5/6.1/60.4 Monterrey=-28.2/22.3/72.0 Montreal=-43.6/6.8/57.6 Moscow=-44.4/5.8/55.6 Mumbai=-23.6/27.1/78.6 Murmansk=-47.9/0.6/47.7 Muscat=-22.7/28.0/80.7 Mzuzu=-31.9/17.7/67.9 N'Djamena=-20.6/28.3/77.9 Naha=-33.6/23.1/72.7 Nairobi=-33.3/17.8/66.5 Nakhon Ratchasima=-21.2/27.3/75.4 Napier=-43.3/14.6/66.6 Napoli=-31.8/15.9/68.4 Nashville=-35.0/15.4/62.4 Nassau=-23.5/24.6/76.2 Ndola=-31.2/20.3/67.7 New Delhi=-27.7/25.0/75.9 New Orleans=-26.6/20.7/72.3 New York City=-35.1/12.9/60.3 Ngaoundéré=-28.9/22.0/69.5 Niamey=-18.3/29.3/78.7 Nicosia=-27.6/19.7/73.7 Niigata=-35.1/13.9/63.7 Nouadhibou=-25.0/21.3/76.7 Nouakchott=-23.4/25.7/76.1 Novosibirsk=-48.6/1.7/52.4 Nuuk=-50.0/-1.4/53.2 Odesa=-38.8/10.7/60.1 Odienné=-24.0/26.0/75.6 Oklahoma City=-34.1/15.9/66.4 Omaha=-44.2/10.6/57.8 Oranjestad=-17.6/28.1/77.9 Oslo=-42.5/5.7/55.2 Ottawa=-48.4/6.6/55.4 Ouagadougou=-22.2/28.3/77.1 Ouahigouya=-19.7/28.6/80.5 Ouarzazate=-36.1/18.9/67.5 Oulu=-47.8/2.7/50.5 Palembang=-20.9/27.3/77.3 Palermo=-29.2/18.5/64.5 Palm Springs=-23.1/24.5/71.2 Palmerston North=-34.8/13.2/61.6 Panama City=-22.0/28.0/80.2 Parakou=-22.0/26.8/74.7 Paris=-41.2/12.3/62.9 Perth=-32.9/18.7/68.5 Petropavlovsk-Kamchatsky=-51.3/1.9/53.5 Philadelphia=-36.4/13.2/66.6 Phnom Penh=-22.9/28.3/76.7 Phoenix=-25.7/23.9/75.1 Pittsburgh=-42.2/10.8/61.6 Podgorica=-30.8/15.3/63.3 Pointe-Noire=-26.2/26.1/80.2 Pontianak=-22.6/27.7/79.2 Port Moresby=-24.1/26.9/74.8 Port Sudan=-23.4/28.4/76.4 Port Vila=-25.3/24.3/77.0 Port-Gentil=-23.0/26.0/80.0 Portland (OR)=-36.5/12.4/61.7 Porto=-31.8/15.7/64.9 Prague=-45.7/8.4/56.0 Praia=-26.0/24.4/73.6 Pretoria=-28.6/18.2/72.4 Pyongyang=-37.6/10.8/59.3 Rabat=-34.8/17.2/66.6 Rangpur=-31.0/24.4/74.4 Reggane=-22.1/28.3/80.7 Reykjavík=-44.8/4.3/54.1 Riga=-45.6/6.2/55.2 Riyadh=-28.5/26.0/75.6 Rome=-35.0/15.2/61.4 Roseau=-22.2/26.2/75.2 Rostov-on-Don=-43.9/9.9/60.8 Sacramento=-32.7/16.3/65.7 Saint Petersburg=-41.3/5.8/55.8 Saint-Pierre=-45.5/5.7/54.9 Salt Lake City=-42.0/11.6/59.1 San Antonio=-34.8/20.8/67.3 San Diego=-28.8/17.8/73.3 San Francisco=-32.7/14.6/65.4 San Jose=-33.4/16.4/68.8 San José=-27.1/22.6/71.6 San Juan=-21.9/27.2/76.6 San Salvador=-28.0/23.1/73.6 Sana'a=-29.7/20.0/68.7 Santo Domingo=-29.8/25.9/75.8 Sapporo=-41.4/8.9/57.4 Sarajevo=-38.8/10.1/61.0 Saskatoon=-44.9/3.3/60.7 Seattle=-37.4/11.3/66.6 Seoul=-35.2/12.5/63.2 Seville=-28.4/19.2/66.8 Shanghai=-38.4/16.7/71.4 Singapore=-24.3/27.0/75.5 Skopje=-39.3/12.4/62.6 Sochi=-38.9/14.2/62.4 Sofia=-39.0/10.6/62.5 Sokoto=-26.0/28.0/80.9 Split=-42.5/16.1/65.8 St. John's=-43.3/5.0/52.7 St. Louis=-39.6/13.9/64.0 Stockholm=-42.7/6.6/55.7 Surabaya=-21.5/27.1/75.4 Suva=-29.3/25.6/76.6 Suwałki=-40.3/7.2/60.9 Sydney=-31.3/17.7/70.7 Ségou=-19.5/28.0/75.7 Tabora=-24.6/23.0/71.2 Tabriz=-36.0/12.6/64.4 Taipei=-26.0/23.0/70.9 Tallinn=-44.1/6.4/57.0 Tamale=-23.6/27.9/81.3 Tamanrasset=-27.1/21.7/70.7 Tampa=-26.3/22.9/78.2 Tashkent=-31.7/14.8/65.4 Tauranga=-33.0/14.8/65.2 Tbilisi=-34.0/12.9/62.8 Tegucigalpa=-32.0/21.7/72.2 Tehran=-37.6/17.0/68.4 Tel Aviv=-34.6/20.0/69.4 Thessaloniki=-31.4/16.0/64.6 Thiès=-25.0/24.0/70.5 Tijuana=-34.2/17.8/68.1 Timbuktu=-22.0/28.0/83.3 Tirana=-36.5/15.2/64.3 Toamasina=-24.4/23.4/71.3 Tokyo=-32.3/15.4/62.8 Toliara=-26.4/24.1/75.9 Toluca=-35.9/12.4/62.8 Toronto=-43.4/9.4/59.0 Tripoli=-27.4/20.0/71.6 Tromsø=-54.1/2.9/55.5 Tucson=-28.1/20.9/70.9 Tunis=-29.0/18.4/67.1 Ulaanbaatar=-47.8/-0.4/47.5 Upington=-35.4/20.4/68.5 Vaduz=-41.1/10.1/57.8 Valencia=-30.2/18.3/66.6 Valletta=-29.8/18.8/69.3 Vancouver=-39.0/10.4/57.7 Veracruz=-23.5/25.4/79.1 Vienna=-40.9/10.4/62.1 Vientiane=-23.9/25.9/75.5 Villahermosa=-21.7/27.1/79.0 Vilnius=-41.5/6.0/56.3 Virginia Beach=-43.4/15.8/66.0 Vladivostok=-43.9/4.9/51.6 Warsaw=-44.2/8.5/54.5 Washington, D.C.=-34.4/14.6/63.6 Wau=-30.3/27.8/77.2 Wellington=-40.2/12.9/63.4 Whitehorse=-52.0/-0.1/57.3 Wichita=-38.1/13.9/66.5 Willemstad=-21.4/28.0/76.1 Winnipeg=-48.3/3.0/51.3 Wrocław=-40.2/9.6/61.4 Xi'an=-34.8/14.1/65.9 Yakutsk=-60.3/-8.8/41.0 Yangon=-21.2/27.5/75.5 Yaoundé=-24.1/23.8/74.1 Yellowknife=-52.2/-4.3/44.8 Yerevan=-39.1/12.4/62.7 Yinchuan=-40.9/9.0/63.3 Zagreb=-42.7/10.7/60.2 Zanzibar City=-21.2/26.0/75.2 Zürich=-41.1/9.3/58.6 Ürümqi=-40.3/7.4/61.7 İzmir=-29.6/17.9/69.3 INFO:brc.util:parallel <<< Elapsed time: 0:01:00.662966 INFO:brc.util:parallel <<< with 1,000,000,000 rows => 16,484,521.949 rows/s. brc complete est 0:01:00.662966
There are probably some improvements that can be made such as grid searching for n_splits
and chunksize
but mainly from the comments I have made in the code. Personally, I'm well-chuffed with the results I've achieved with multiprocessing and optimization at this level of Python. The next step would likely start looking at how to solve only using the C bindings (i.e. solve it in C) or rewrite in a faster language.
I also allowed the Pool()
to use all the cores on the machine which resulted in a sub-1min total time of 0:00:55.958772
- beating Polars (for this one very specific simple task) by a couple of seconds!!
Other Notable Tools
DuckDB
DuckDB took a whole 47 seconds to run the below query, of which loading the CSV took more than 40 seconds. The query itself only took 3.81 seconds on average! This is because the data format on disk took much less time to parse; It really shows the trade-off between having the data in the right format to start with.
1 2 3 4 5 6 7 8 9 10
CREATE TABLE brc AS SELECT * FROM read_csv_auto('/Users/toby.devlin/dev/projects/1brc/data/data_1_000_000_000.txt'); ALTER TABLE brc RENAME column0 TO city; ALTER TABLE brc RENAME column1 TO temp; select city, min(temp), avg(temp), max(temp) from brc group by city
As an aside I also tried this with SQLite for shits and gigs; it loaded the records in about an hour and then took over 5 minutes to compute the same query, so I gave up. Local ODBC has been won by DuckDB in this one.
Polars with Parquet
Step one was dumping the CSV over to a parquet file with a short bit of code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
import polars as pl from brc.util import DATA_DIR, timit_context def sink_to_parquet(): n = 1_000_000_000 with timit_context(n, "scan_parquet"): lf = pl.scan_csv( DATA_DIR / "data_1_000_000_000.txt", has_header=False, separator=";", with_column_names=lambda _: ["city", "temp"], schema={"city": pl.Categorical, "temp": pl.Float64} ) with timit_context(n, "sink_parquet"): lf.sink_parquet(DATA_DIR / "data_1_000_000_000.parquet") with timit_context(n, "sink_parquet_stats"): lf.sink_parquet(DATA_DIR / "data_1_000_000_000_stats.parquet", statistics=True)
This code is part of the benchmark_parquet.py
file and source -> sink time was 0:01:15.312101 for the non-stats file
and 0:01:24.109170 for the stats to be added and on disk both these files are only 3.6GB! Parquet has much better
compression and RLE to save space & is columnar which means it interfaces with Arrow very well (natively by design
actually). We can now rerun the benchmark code but pass in the parquet file as the source. This gives the result of
0:01:25.495492 and 0:01:28.913853 respectively, which is fascinating!
I'm probably doing something wrong here and will look into it. I was expecting an order of magnitude speed up.
ChatGPT
I did ask ChatGPT (GPT3.5) for a solution and, very unhelpfully, it provided code that errored and was fundamentally flawed. As with other large code tasks, ChatGPT fell flat; however, it did help with smaller more targeted feature requests and helped me with some of the aggregation code, once I gave it the gist and asked for a refactor. Personally, this approach is much better for feedback and a "give me a starting point" type
Final Results
- Single process: 0:06:12.610463 (estimated from 0:00:37.261046 for 100mm)
- Multi-process: 0:01:00.662966
- Multi-process, all cores: 0:00:55.958772
- Polars: 0:01:00.263331
- Other tools: Very fast once data is loaded.
Its official (on my machine) boys and girls, I beat polars in native Python. Although I did spend nearly 60x longer to write it.
All the code can be seen on the GitHub project