How To Create One Dataframe From Multiple CSV Files In A Folder

November 20, 2023

I have a list of CSV files (A1.csv, A2.csv, ..., D10.csv) in a folder, each of which contains two columns but several rows. Basically, I want to extract the values of last row and 2nd

Solution 1:

Here is an option in R.

Step 1: Prepare a vector with the file names. If there are many files in the folder, the list.files function can build this vector for you. Here, I just created it manually. I also assume that all the files are stored in the working directory. Otherwise, you will need to construct the full file paths.

file_vec <- c("A1.csv", "A2.csv", "A3.csv")

Step 2: Read all the CSV files listed in file_vec. The key is to use the lapply function to apply read.csv to every element of file_vec.

dt_list <- lapply(file_vec, read.csv, stringsAsFactors = FALSE)

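Step 3: Prepare a vector of the file names without the .csv extension. The pattern is a regular expression, so the dot is escaped and the match is anchored to the end of the name.

name_vec <- sub("\\.csv$", "", file_vec)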

Step 4: Create the data frame. x[nrow(x), 2] is a way to access the last value of the second column.

dt_final <- data.frame(File = name_vec,
                       Value = sapply(dt_list, function(x) x[nrow(x), 2]),
                       stringsAsFactors = FALSE)

dt_final is the final output.

Solution 2:

Here's another option using the tidyverse in R:

library(tidyverse)

# In my example, I'm using a folder with 4 Chicago Crime Datasets
setwd("INSERT/PATH/HERE")

files <- list.files()

# Name the column "filename" so it matches the output shown below
tibble(filename = files) %>%
  mutate(file_contents = map(filename, ~ read_csv(file.path(.), n_max = 10))) %>%
  unnest(file_contents) %>%
  group_by(filename) %>%
  slice(n()) %>%
  select(1:2)

Which returns:

# A tibble: 4 x 2
# Groups:   filename [4]
  filename                           X1
  <chr>                           <int>
1 Chicago_Crimes_2001_to_2004.csv  4904
2 Chicago_Crimes_2005_to_2007.csv    10
3 Chicago_Crimes_2008_to_2011.csv  5867
4 Chicago_Crimes_2012_to_2017.csv  1891

Note that the n_max = 10 argument isn't needed. I only included this because the files I was working with are pretty large.

For anyone interested, the dataset can be found here.

Also, you may want to avoid setting the working directory with setwd(). If so, you can pass the folder path to list.files() together with the argument full.names = TRUE, which returns full paths instead of bare file names:

path <- "INSERT/PATH/HERE"
files <- list.files(path, full.names = TRUE)

I'd recommend this approach, because scripts that call setwd() aren't flexible: paths change from user to user.

Solution 3:

Python Solution

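>>> import pandas as pd
>>> files = ['A1.csv', 'A2.csv', ... , 'D10.csv']
>>> # Last row, second column of each file, collected as one (File, Value) row per file
>>> df_final = pd.DataFrame([(fname, pd.read_csv(fname).iat[-1, 1]) for fname in files], columns=['File', 'Value'])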
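
If typing out the file list is impractical, the same idea can be wrapped in a small function that discovers the files itself. This is only a sketch: the function name last_values, the folder argument, and the File/Value column names are illustrative choices, not part of the original answer. It assumes every CSV has a header row, at least one data row, and at least two columns.

```python
from pathlib import Path

import pandas as pd


def last_values(folder, column=1):
    """Collect the last-row value of one column from every CSV in `folder`.

    `last_values` and its parameters are hypothetical names for this sketch.
    `column` is a zero-based position, so 1 means the second column.
    """
    rows = []
    # sorted() gives a stable, alphabetical order (so A10 sorts before A2)
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        # .iat[-1, column] reads the last row of the chosen column
        rows.append({"File": path.stem, "Value": frame.iat[-1, column]})
    return pd.DataFrame(rows, columns=["File", "Value"])
```

Calling last_values("INSERT/PATH/HERE") would return one row per file, with the file name (without .csv) in File and the extracted value in Value. Passing the folder as an argument avoids changing the working directory.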

Solution 4:

This is an easy case for bash and friends. This one-liner

for i in A*.csv B*.csv C*.csv D*.csv; do awk -F , 'END{ print $NF }' "$i"; done

extracts the bottom right field, no matter how many rows or columns, of any number of files that follow the pattern you have given. If all the files were in one folder, and they were the only .csv files in that folder, and you wanted to save the outcome in a new file, this would do the job:

for i in *.csv; do awk -F , 'END{ print $NF }' "$i"; done > extract.txt
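
One caveat with splitting on a bare comma is that a quoted field containing a comma gets cut in two. As a rough Python counterpart to the loop above, the following sketch uses the standard csv module, which parses quoted fields. The function names bottom_right_field and extract_all are made up for this example, and it assumes the files are in the default comma-separated dialect.

```python
import csv
from pathlib import Path


def bottom_right_field(path):
    """Return the last field of the last non-empty row of a CSV file."""
    last_row = None
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # skip blank lines
                last_row = row
    # None signals a file with no rows at all
    return last_row[-1] if last_row else None


def extract_all(folder):
    """Map each CSV file name in `folder` to its bottom-right field."""
    return {
        path.name: bottom_right_field(path)
        for path in sorted(Path(folder).glob("*.csv"))
    }
```

The values come back as strings, exactly as they appear in the files, so convert them with int() or float() if you need numbers.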
