About Dataset
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The Uncleaned Laptop Price dataset is a collection of laptop product listings scraped from an online e-commerce website. The dataset includes information about various laptop models, such as their brand, screen size, processor, memory, storage capacity, operating system, and price. However, the dataset is uncleaned, meaning that it contains missing values, inconsistent formatting, and other errors that need to be addressed before the data can be used for analysis or modeling.
The dataset contains both categorical and numerical variables, with the majority of variables being categorical, including brand, model name, screen resolution, processor type, and operating system. Some numerical variables include screen size, memory, and storage capacity. The target variable in the dataset is the price, which is a continuous variable.
The dataset contains over 1,300 laptop listings, making it a reasonably sized dataset for analysis and modeling. The dataset may be useful for machine learning projects related to predicting the price of a laptop based on its specifications. However, before using the dataset, it would be necessary to clean and preprocess the data to address the inconsistencies and missing values.
Dataset Links
I am going to clean this dataset with mysql queries . Although it is easy to clean dataset with help of python but as most cases in real world data lies within database so it is better to do initial level data level data cleaning through sql . Here I have used lots of string functions ,DDL and DML commands of mysql
steps to do
- first download dataset from above link
- create a database name project
- import dataset to mysql server through import wizard and name table laptop
Initial look of data
Select * from laptop
| index | Unnamed: 0 | Company | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | Price |
---|
0 | 1 | 0 | Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37kg | 71378.6832 |
---|
1 | 2 | 1 | Apple | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34kg | 47895.5232 |
---|
2 | 3 | 2 | HP | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86kg | 30636.0000 |
---|
count of rows in dataset
select count(*) from laptop
Let's create a backup data
create table laptop_backup like laptop;
insert into laptop_backup select * from laptop;
count of rows in backup dataset
select count(*) from laptop_backup
check for space occupied
select DATA_LENGTH/1024 from information_schema.tables
where table_schema = 'project'
and table_name = 'laptop';
drop unnecessary column
alter table laptop drop column `Unnamed: 0`;
Select * from laptop;
check for null rows
select `index` from laptop
where Company is null and TypeName is null and Inches is null
and ScreenResolution is null and `Cpu` is null and Ram is null
and `Memory` is null and Gpu is null and OpSys is null and
Weight is null and Price is null;
delete all the null row
delete from laptop where `index` in (
select `index` from laptop
where Company is null and TypeName is null and Inches is null
and ScreenResolution is null and `Cpu` is null and Ram is null
and `Memory` is null and Gpu is null and OpSys is null and
Weight is null and Price is null)
check for duplicate row
select Company ,TypeName ,Inches,ScreenResolution ,`Cpu`,Ram,`Memory`,
Gpu,OpSys,Weight,price,count(*) from laptop group by Company ,TypeName ,Inches,
ScreenResolution ,`Cpu`,Ram,`Memory`,Gpu,OpSys,Weight,price having count(*)>1;
delete duplicate row
with base_query as(select min(`index`)from laptopdata group by Company ,
TypeName ,Inches,ScreenResolution ,
`Cpu`,Ram,`Memory`,Gpu,OpSys,Weight,price)
delete from laptop where `index` not in (select * from base_query);
pd.read_sql_query("Select count(*) from laptop",conn)
Inches Column
changing Inches to datatpe to decimal
alter table laptop modify column Inches Decimal(10,1);
Ram Column
removing GB
update laptop set Ram = replace(Ram,'GB','') ;
Changing data type of Ram column to integer
alter table laptop modify column Ram INTEGER;
Weight Column
removing kg
update laptop set Weight = replace(Weight,'kg','') ;
changing data type of Weight column to decimal
alter table laptop modify column Weight Decimal(10,2);
delete record where weight is 0
delete from laptop where Weight = 0;
Price
rounding of price
update laptop set Price = round(Price);
changing price column datatype to integer
alter table laptop modify column Price INTEGER;
OpSYS
changing OpSYS values
there are values like windows 10,windows11,windows 7 ....,we are going to convert all to windows similarly to other
update laptop set OpSYS =
case
when OpSys like '%mac%' then 'Mac'
when OpSys like '%windows%' then 'Windows'
when OpSys like '%Linux%' then 'Linux'
when OpSys like '%chrome%' then 'Chrome'
when OpSys like '%android%' then 'Android'
when OpSys = 'No OS' then 'N/A'
else 'Other'
end;
Gpu
strategy is to extract Brand name from Gpu column and drop the Gpu column
alter table laptop
add column gpu_brand varchar(255) after Gpu;
update laptop set gpu_brand = substring_index(Gpu ,' ',1);
alter table laptop drop column Gpu;
Cpu
strategy is to make make three new column name cpu_brand,cpu_speed,cpu_name from cpu and finaly drop cpu column . also in cpu speed column I am going to remove GHz and convert that column to decimal
alter table laptop
add column cpu_brand varchar(255) after `Cpu`,
add column cpu_name varchar(255) after cpu_brand,
add column cpu_speed decimal(10,1) after cpu_name;
update laptop set cpu_brand = substring_index(`Cpu` ," ",1);
update laptop
set cpu_speed = cast( replace(substring_index(`Cpu`,' ',-1),'GHz',"") as decimal(10,1) );
update laptop
set cpu_name = replace(replace(`Cpu`,cpu_brand,''),substring_index(`Cpu`,' ',-1),'');
update laptop
set cpu_name = substring_index(trim(cpu_name)," ",2);
alter table laptop drop column `Cpu`;
Screen Resolution
strategy is to add four column name resolution_width , resolution_height ,is_touchscreen,is_ips_panel and drop ScreenResolution
alter table laptop
add column resolution_width integer after ScreenResolution,
add column resolution_height integer after resolution_width,
add column is_touchscreen integer after resolution_height,
add column is_ips_panel integer after is_touchscreen;
update laptop
set resolution_width = substring_index(substring_index(ScreenResolution," ",-1),"x",1),
resolution_height = substring_index(substring_index(ScreenResolution," ",-1),"x",-1),
is_touchscreen = ScreenResolution like '%Touch%',
is_ips_panel = ScreenResolution like '%IPS%';
alter table laptop drop column ScreenResolution;
Memory
strategy is to make three new column name memory_type , primary_storage, secondary storage also all the memoryis going to be in Gb
alter table laptop
add column memory_type varchar(255) after Memory,
add column primary_storage integer after memory_type,
add column secondary_storage integer after primary_storage;
update laptop
set memory_type =
case
when Memory like '%hybrid%' then 'Hybrid'
when Memory like '%SSD%' and Memory like '%HDD%' then 'Hybrid'
when Memory like '%Flash Storage%' and Memory like '%HDD%' then 'Hybrid'
when Memory like '%SSD%' then 'SSD'
when Memory like '%HDD%' then 'HDD'
when Memory like '%Flash Storage%' then 'Flash Storage'
else null
end;
update laptop set
primary_storage = regexp_substr(substring_index(Memory,'+',1),'[0-9]+');
update laptop set
secondary_storage =
case
when Memory like '%+%' then regexp_substr(substring_index(Memory,'+',-1),'[0-9]+')
else 0
end;
update laptop set
primary_storage =
case
when primary_storage <=3 then primary_storage*1024
else primary_storage
end,
secondary_storage =
case
when secondary_storage <=3 then secondary_storage*1024
else secondary_storage
end;
alter table laptop drop column Memory;
pd.read_sql_query("Select * from laptop",conn).head(3)
| index | Company | TypeName | Inches | is_ips_panel | is_touchscreen | resolution_width | resolution_height | cpu_brand | cpu_name | cpu_speed | Ram | memory_type | primary_storage | secondary_storage | gpu_brand | OpSys | Weight | Price |
---|
0 | 1 | Apple | Ultrabook | 13.3 | 1 | 0 | 2560 | 1600 | Intel | Core i5 | 2.3 | 8 | SSD | 128.0 | 0 | Intel | Mac | 1.37 | 71379 |
---|
1 | 2 | Apple | Ultrabook | 13.3 | 0 | 0 | 1440 | 900 | Intel | Core i5 | 1.8 | 8 | Flash Storage | 128.0 | 0 | Intel | Mac | 1.34 | 47896 |
---|
2 | 3 | HP | Notebook | 15.6 | 0 | 0 | 1920 | 1080 | Intel | Core i5 | 2.5 | 8 | SSD | 256.0 | 0 | Intel | N/A | 1.86 | 30636 |
---|