Comments on: Data Coding 101 – Intro to Bash – ep6 (last episode)
https://data36.com/command-line-data-awk-sed-join-date/
Learn Data Science the Hard Way!
Last updated: Mon, 04 Apr 2022 09:36:58 +0000

By: Web Scraping Tutorial – episode #1 – Scraping a Webpage (with Bash) (Wed, 05 Feb 2020 23:39:00 +0000)
[…] Introduction to Bash episode #6 […]

By: Scraping Multiple Webpages with For Loops (Web Scraping Tutorial) (Wed, 05 Feb 2020 23:35:45 +0000)
[…] Introduction to Bash episode #6 […]

By: Tomi Mester (Thu, 08 Feb 2018 17:18:41 +0000)
In reply to Balint.

hi Balint,

spot on!

Though my goal was to make the task a bit more difficult than that. 🙂

(Assuming you’re talking about the “Test Yourself” section, see this part:
“For the sake of practicing, let’s say, you can only use these four columns:
1st column: Year
2nd column: Month
3rd column: Day
15th column: ArrDelay” 😉
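With only Year, Month and Day available, the weekday has to be derived rather than read from column 4 — for instance with GNU date’s -d option. A minimal sketch (the sample date 2007-01-08 is just an illustration, not from the dataset):

```shell
# Derive the weekday name from separate Year/Month/Day fields.
# Requires GNU date; LC_ALL=C keeps the weekday name in English.
year=2007; month=01; day=08
weekday=$(LC_ALL=C date -d "${year}-${month}-${day}" +%A)
echo "$weekday"   # Monday
```

From there the pipeline can filter on the derived name instead of a numeric DayOfWeek column.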

Cheers,
Tomi

By: Balint (Thu, 08 Feb 2018 09:35:39 +0000)
Hey,

I was wondering if it wouldn’t be easier to use column 4 as the weekday? So basically:

cut columns 4 and 15,
use awk to filter on the first column where it’s 1 (indicating Monday),
sum up column 15.
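In shell terms, that idea could look something like the sketch below. The file sample.csv and its 15-column layout are invented here to mimic the airline dataset, with DayOfWeek assumed in column 4 and ArrDelay in column 15:

```shell
# Toy stand-in for the airline data: 15 comma-separated fields per line,
# DayOfWeek in field 4, ArrDelay in field 15 (other fields are filler).
printf '2007,1,8,1,x,x,x,x,x,x,x,x,x,x,10\n'  >  sample.csv
printf '2007,1,9,2,x,x,x,x,x,x,x,x,x,x,99\n'  >> sample.csv
printf '2007,1,15,1,x,x,x,x,x,x,x,x,x,x,5\n'  >> sample.csv

# cut fields 4 and 15, keep Mondays (DayOfWeek == 1), sum the delays:
total=$(cut -d, -f4,15 sample.csv | awk -F, '$1 == 1 {sum += $2} END {print sum}')
echo "$total"   # 15  (10 + 5)
```

After cut, awk sees only two fields, so the filter is on $1 and the sum on $2.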

By: Tomi Mester (Tue, 10 Oct 2017 15:30:02 +0000)
In reply to Nishant.

Hey Nishant!
Thanks! Yeah, I meant something like “same as the -d option was for cut, for instance”.
But you’re right, it’s much easier to understand with the example of sort!
So I have updated the article!

Thanks for the input! And glad you like the blog! 😉

Cheers,
Tomi

By: Nishant (Tue, 10 Oct 2017 10:52:33 +0000)
Hi

A small mistake (perhaps):
Excerpt from your article above:
…”define this value with the -t option (same as it was with cut for instance)”…

I suppose you meant ‘sort’ instead of ‘cut’.

Thanks
Nishant

P.S.: Needless to say, your blog has been very helpful in getting me started with bash programming. Keep it up! 🙂

By: Variables, if statements and while loops in bash (Data Coding 101) (Mon, 02 Oct 2017 14:12:45 +0000)
[…] pet-project and learn more and more by doing and less and less by reading articles! Wanna continue? This way please! If you want to get notified about my upcoming articles, videos or webinars, subscribe to my […]

By: Tomi Mester (Tue, 18 Apr 2017 16:30:11 +0000)
In reply to Chris.

hey Chris,

thanks for the comment!
Hm, now you’ve made me think.
Let me check this with some other files myself, and if I find the same as you, I’ll definitely update my article!

I guess the difference here comes from the different hardware capacities. E.g. in my tutorials I use a very, very cheap computer with 512 MB of memory, a 20 GB SSD and only 1 CPU.
But let me check!

Cheers,
Tomi

By: Chris (Wed, 12 Apr 2017 18:23:41 +0000)
Hi Tomi,

This is a nice series of techniques on basic data processing on the Linux command line, and I’ll definitely point my students to it. However, I was surprised to see that you recommend the use of cut + awk rather than using awk alone when summing up a column of numbers. I’ve always used awk only under this circumstance, and never experienced the inefficiency that you mentioned. To check this out I thought I’d give it a try on some random files I have lying around. Here I’m adding up field #14 (in these files it’s the number of records in the line, so mostly in the range 1-48) in 27 CSV files with a combined total of more than 3 million lines.
This is how long it takes with cut and awk:

$ time cut -d, -f14 MDMR-*.DAT | awk '{sum += $1}END{print sum}'
40376443

real 1m3.617s
user 1m4.436s
sys 0m0.619s

and here it is all done in awk:

$ time awk -F, '{sum += $14}END{print sum}' MDMR-*.DAT
40376443

real 0m13.106s
user 0m12.361s
sys 0m0.743s

So it takes almost 6 times longer when using cut.

In case it matters, which I don’t think it does, I’m using GNU Awk 3.1.5 and cut from GNU coreutils 5.97.

Filtering records purely in awk (as in /re/{…}) vs using grep and awk is a different story, however, as grep is super well optimised for this particular task.
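To make that contrast concrete, here is a toy sketch of the two filtering styles on an invented file (airports.csv and its contents are made up for illustration; both pipelines should give the same sum):

```shell
# Two-column toy file: origin airport, delay in minutes.
printf 'ORD,10\nJFK,20\nORD,5\n' > airports.csv

# grep filters the rows, awk does the arithmetic:
with_grep=$(grep '^ORD' airports.csv | awk -F, '{sum += $2} END {print sum}')

# the same filter expressed as an awk /re/ pattern, no grep needed:
awk_only=$(awk -F, '/^ORD/ {sum += $2} END {print sum}' airports.csv)

echo "$with_grep" "$awk_only"   # 15 15
```

Which one is faster will depend on the data and the implementations involved, as the timing discussion above suggests; grep’s matching engine is heavily optimised, while the pure-awk form saves a process and a pipe.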
