How would you approach this problem?

Hello everyone!

I'm new to machine learning, Python and data science. I've been programming for over 7 years, but I have never had any contact with these areas.

So I decided to build up some knowledge and set myself a project, and I'm looking for advice.

I want to process sustainability reports of all sorts from different companies like Apple, Procter & Gamble, Microsoft and so forth. I especially want to extract their GHG emissions from the .pdf files that they offer on their websites. I want to then create a comparison between them and create some sort of scoring.

So far I've come across a few problems:

- The data in the PDFs is not really standardized. I had thought I could rely on the GHG Protocol, but the companies report different metrics and units

- I'm trying to identify the pages that contain specific keywords like "scope 1" or "scope 2" via regex and PDF text extraction, but I have no idea how to process the extracted data
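
A minimal sketch of the keyword search, assuming the text of each page has already been extracted into a list of strings (one string per page). The names `SCOPE_PATTERN` and `find_scope_pages` are hypothetical and only for illustration. The pattern deliberately accepts a trailing digit, because extracted text tends to glue footnote markers onto labels (for example "Scope 24" for "Scope 2" with footnote 4):

```python
import re

# "Scope 1", "scope 2", "SCOPES 3"; only the first digit after the word is
# captured, so a glued-on footnote marker ("Scope 24") still counts as scope 2.
SCOPE_PATTERN = re.compile(r"\bscopes?\s*([123])", re.IGNORECASE)


def find_scope_pages(page_texts):
    """Map 1-based page number -> sorted list of scope numbers found on that page."""
    hits = {}
    for page_number, text in enumerate(page_texts, start=1):
        scopes = sorted({int(m.group(1)) for m in SCOPE_PATTERN.finditer(text)})
        if scopes:
            hits[page_number] = scopes
    return hits


# Made-up page texts standing in for real extraction output.
pages = [
    "Letter from our CEO",
    "Scope 1 emissions rose; Scope 24 fell (market-based).",
    "Water use in data centers",
]
print(find_scope_pages(pages))  # {2: [1, 2]}
```

Only the pages returned here would then need to be passed on to table extraction.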

I'm using tabula-py to extract data from tables and store it in a .csv file. An example dataset looks like this:

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Fiscal Year,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
,,KPI,Unit,,,,,,
,,,,2018 2017,2016,2015,2014,2013,2012
,Scope 1,,,"54,590 45,400","34,370","28,100","28,490","29,300","21,220"
,"Natural gas, diesel, propane2",,,"39,990 34,560","27,000","19,360","20,710","22,090","14,300"
,Fleet vehicles,,,"11,110 8,300","7,370","8,740","7,780","7,210","6,920"
,Process emissions3,,,"3,490 2,540",—,—,—,—,—
Greenhouse Gas,Scope 24,,metric tons,"8,730 36,250","41,000","42,460","63,210","91,510","139,160"
Emissions,Scope 35,,CO2e,"520,500 293,440","303,910","312,910","259,130","225,630","202,060"
,Business travel6,,,"337,340 121,000","117,550","139,940","110,940","90,948","85,090"
,Employee commute7,,,"183,160 172,440","186,360","172,970","148,190","134,685","116,970"
,Total facilities emissions,,,,,,,,
,,,,"583,820 375,090","379,280","383,470","350,830","346,440","362,440"
,"(Scopes 1, 2, 3)",,,,,,,,
,Electricity,,,"2,182 1,832","1,420",996,839,708,608*
,U.S.,,million kWh,"1,830 1,536","1,157",831,702,590,—
,International,,,351 296,262,166,137,118,—
Energy Use,,,,,,,,,
,Natural gas,,,"1,419,240 1,225,210","974,570","851,660","922,860","764,550","304,000"
,U.S.,,million BTU,"1,333,850 1,127,550","901,950","794,830","840,490","676,630","240,230"
,International,,,"85,390 97,660","72,620","56,830","82,370","87,920","63,770"
,Electricity saved per year as,,,,,,,,
,a result of energy efficiency,,kWh/year,"113,203,780 69,989,660","55,288,800","37,875,000","31,225,000","26,241,600","11,354,200"
Energy,measures,,,,,,,,
Efficiency8,Natual gas saved per year as,,,,,,,,
,a result of energy efficiency,,therms/year,"2,541,440 2,453,410","2,228,477","1,676,735","1,431,215","1,238,291","548,508"
,measures,,,,,,,,
,Renewable energy sourcing,,,,,,,,
,60 (fiscal year),9,%,99 97,96,93,87,73,
Renewable,,,,,,,,,
Energy,"Emissions avoided as a  metric tons result of renewable energy 116,000 CO2esourcing (fiscal year)10",,,"690,000 589,000","541,000","336,000","255,000","195,000",
,Total,,,"1,260 1,000",630,573,494,430,345
,Data centers11,,million,460 410,207,166,113,69,57
Water Use,,,,,,,,,
,Retail,,gallons,110 110,99,111,103,94,71
,Corporate12,,,690 480,324,296,278,267,217
,Landfilled,,,"36,553,900 31,595,200","21,618,850","13,110,880","6,833,000","5,923,810","4,850,160"
,Recycled,,,"108,515,200 68,509,300","28,198,560","19,599,570","14,621,940","15,866,650","11,464,020"
Waste,Composted,,pounds,"10,397,400 14,567,500","13,737,320","3,006,170",—,—,—
Generation13,Hazardous waste,,,"6,277,800 3,342,700","2,287,320","1,002,300","508,040","70,550","123,460"
,Waste to energy14,,,"1,105,100 645,000",—,—,—,—,—
,Landfill diversion rate,,%,74 71,66,63,68,73,70

Does anyone have an idea how to assign the numerical data to the specific scopes (1, 2 and 3)?
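
One possible starting point, as a rough sketch rather than a finished solution: read the CSV with the standard `csv` module, treat the row whose value cells contain only four-digit years as the header, split cells where two years were merged into one (such as "2018 2017"), and pick out rows whose label starts with "Scope". The helper names (`parse_scopes`, `split_values`, `to_number`) and the assumption that values start in column index 4 are mine and only fit this particular table layout:

```python
import csv
import io
import re

# A few rows copied from the extracted table above.
SAMPLE = '''\
,,,,2018 2017,2016,2015,2014,2013,2012
,Scope 1,,,"54,590 45,400","34,370","28,100","28,490","29,300","21,220"
Greenhouse Gas,Scope 24,,metric tons,"8,730 36,250","41,000","42,460","63,210","91,510","139,160"
Emissions,Scope 35,,CO2e,"520,500 293,440","303,910","312,910","259,130","225,630","202,060"
'''

# Only the first digit is the scope; a second digit is a glued-on footnote marker.
SCOPE_LABEL = re.compile(r"^scope\s*([123])", re.IGNORECASE)
YEAR = re.compile(r"^(19|20)\d{2}$")


def split_values(cells):
    """Flatten cells into tokens, undoing merged cells like '54,590 45,400'."""
    tokens = []
    for cell in cells:
        tokens.extend(cell.split())
    return tokens


def to_number(token):
    """'54,590' -> 54590; anything non-numeric (e.g. a dash placeholder) -> None."""
    cleaned = token.replace(",", "").rstrip("*")
    return int(cleaned) if cleaned.isdigit() else None


def parse_scopes(csv_text, value_start=4):
    """Return {'scope_1': {2018: 54590, ...}, ...} for rows labelled 'Scope N'."""
    years = []
    result = {}
    for row in csv.reader(io.StringIO(csv_text)):
        tokens = split_values(row[value_start:])
        if tokens and all(YEAR.match(t) for t in tokens):
            years = [int(t) for t in tokens]  # header row with the fiscal years
            continue
        label = row[1].strip() if len(row) > 1 else ""
        match = SCOPE_LABEL.match(label)
        # Skip rows whose value count does not line up with the years.
        if match and years and len(tokens) == len(years):
            values = [to_number(t) for t in tokens]
            result[f"scope_{match.group(1)}"] = dict(zip(years, values))
    return result


data = parse_scopes(SAMPLE)
print(data["scope_1"][2018])  # 54590
print(data["scope_2"][2017])  # 36250
```

This ignores the unit, which in the sample is split across two rows ("metric tons" and "CO2e"), so units would still have to be attached and normalized separately before comparing companies.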

I'm pretty clueless.