A Reinforcement Learning Approach to Service-Based User Admission in Multi-Tier 5G Wireless Networks

The massive connectivity in 5G wireless networks is expected to become a challenge to communication network service providers. Many services over the 5G network will be aligned to a particular radio access technology (RAT). As a result, admitting a service-based user to a particular RAT will depend on the most efficient RAT selection. This is because the 5G network will adopt multi-tier radio access networks ranging from high-power macro base stations to extremely low-power Bluetooth connectivity. Selection of a service-oriented RAT is critical because some wireless services have superior quality of service under certain RATs. Maintaining efficient RAT selection by network operators will improve power allocation efficiency and bandwidth allocation efficiency and reduce operating expenditure. The complexity of associating a RAT with a service-based user while considering network state such as data rate, power allocation and hand-off frequency has not been fully explored. In this paper we propose a reinforcement learning approach to user admission based on efficient RAT selection, considering wireless services in a cross-tier wireless radio access network domain. The proposed algorithm shows improved RAT selection efficiency in terms of transmit power, data rate and user hand-off while minimizing the computational complexity. We perform extensive simulations using Python dynamic libraries and compare the findings against random association.


INTRODUCTION
The fifth generation (5G) is expected to provide access to multi-tier wireless access networks, with many services dedicated to certain radio access technologies (RATs) for optimum performance. Determining the best RAT for a certain service is still a challenge, and solving this problem requires an intelligent algorithm capable of achieving optimum performance. Furthermore, no single RAT can satisfy the needs of all users of 5G services (Sandoval, Canovas-Carrasco, Garcia-Sanchez, & Garcia-Haro, 2019). The proposal of three main slices in 5G, namely ultra-reliable low latency communication (uRLLC), massive machine type communication (mMTC) and enhanced mobile broadband (eMBB) (Ojijo & Falowo, 2020; Sunday O. Oladejo & Falowo, 2019; Sunday Oladayo Oladejo & Falowo, 2020), clearly depicts the need for service-based RAT selection. The architectural design of 5G comprises multi-tier access ranging from a macro base station to low-power Bluetooth.
In this regard, a user service will require an association with a specific RAT for optimum performance. To reap the benefits of 5G network slicing and multi-tier design, a user can automatically be evaluated and connected to a specific base station (Xiang, Peng, Sun, & Yan, 2020). For instance, a broadband access user may always be connected to Wi-Fi for optimal data rate and efficient power consumption, and an internet of things (IoT) user may require a connection to a macro base station in order to support the million-devices-per-square-kilometer feature of the 5G mMTC slice. In this regard, a reinforcement learning (RL) model for selecting a specific radio access technology for a user service is formulated for optimal association. Since the performance demands for specific services are becoming stricter, enabling efficient connectivity in the 5G network reduces the cost of operation. The mathematical model of RL is an efficient method of intelligently allowing the radio access network (RAN) controller to associate a user with a specific radio access technology (Sandoval et al., 2019; Sun et al., 2018) based on the network characteristics and user demand. Furthermore, by allowing the RL agent to learn a specific policy π, the network user will be mapped to the optimal RAT based on a specific service request. Under this concept, we consider user association with a macro base station, micro base station, picocell, femtocell, Wi-Fi, Bluetooth, and device-to-device (D2D) connectivity.
The objective is to obtain a matching order for the RAT that offers the best data rate under specific network conditions. In terms of services, we consider the subcategory services under the three known slices, namely multimedia, voice over internet protocol, internet of things, mission critical, and large file transfer. In the mentioned subcategories, both static and mobile users are considered. To achieve this, we build a finite state space containing all the possible user states. Whenever a mobile service is associated with a particular access technology, a weight is obtained indicating how good the state is; this is subsequently transformed into a reward function in the RL environment. The remainder of the paper is organized as follows. In section II we briefly describe different machine learning techniques; in section III we provide a brief summary of the problem statement in our work. In section IV, the research objectives are outlined. Section V provides a comparison with existing work from the literature. In section VI we provide our system model. Section VII contains the simulation and results, and section VIII concludes the paper with a brief summary of our work.

A. Machine Learning
According to Yu & He (2019), machine learning can be grouped into three categories, namely supervised learning, unsupervised learning and reinforcement learning. Supervised learning (Zhang, Patras, & Haddadi, 2018) is a technique where the learning agent trains on labeled data to construct a model for mapping input data to output data. Once the training is complete, the model can be used to predict an output without relying on the training data. Examples of supervised learning models include decision trees, k-nearest neighbors, support vector machines, neural networks, Bayesian models, hidden Markov models and random forests.
Unsupervised learning (Eugenio, Cayamcela, & Lim, 2018) is a model where the input data are not labeled. The goal of the learning agent is to determine common patterns in the unlabeled data by clustering the data into different groupings. Examples of unsupervised learning models are self-organizing maps and k-means.
Reinforcement learning (Arulkumaran, Deisenroth, Brundage, & Bharath, 2017) provides a method of building a model that solves problems requiring multi-stage decision making. It relies on a Markov decision process to determine a policy for selecting an optimum action after visiting a state within an environment. Each action selection is rewarded. The goal of the learning agent is to maximize the reward obtained. The value of each action is stored as a Q-value. In each state, the action with the maximum Q-value is considered the optimum action. Examples of reinforcement learning models include Q-learning, deep Q-learning, SARSA, dynamic programming, temporal difference and Monte Carlo methods.

B. The Problem
Determining which RAT to associate with a service is still an open challenge (Kildal, Vosoogh, & MacI, 2016). The economic aspect of reducing operating expenditure is also an area of concern to many service providers of the 5G wireless network. Such costs arise from inefficient power consumption and bandwidth allocation due to inefficient RAT selection. One way of reducing the cost of power consumption is efficient service-based RAT selection (Xiang et al., 2020). Furthermore, the mathematical modeling of a wireless network is highly complex; a less complex scheme is therefore desirable. Our approach promises lower complexity while maintaining high accuracy.

C. Objectives
In this work we intend to achieve the following objectives:
i. Model a reduced complexity environment considering user services for efficient RAT selection.
ii. Model a reinforcement learning environment considering user service requirements.
iii. Simulate and test a service-based RAT selection model in a finite space reinforcement learning environment.

II. LITERATURE REVIEW
Radio access technology selection has been previously studied by researchers; the authors examine some of the existing works in comparison with the work in this paper. The work in (Sandoval et al., 2019) and (Sandoval, Canovas-Carrasco, Garcia-Sanchez, & Garcia-Haro, 2018) considers IoT-based RAT selection using RL. The experimental setting in this scenario is strictly based on static internet of things (IoT) devices, and mobile users were not investigated, although network conditions generally degrade considerably as nodes become mobile. Furthermore, IoT network resource allocation only belongs to the mMTC slice, limiting the scope for broader investigation into other slices. In this paper, we consider mobile users, a paradigm not investigated in the reviewed papers.
The work in (Passas, Miliotis, Makris, & Korakis, 2019) investigated distributed RAT selection considering multiple user applications. Similarly, the authors assumed a static user environment. While this approach produced some interesting results, many 5G mobile users are non-static and must be considered for conclusive results. Further, the Lagrangian modeling requires constraint relaxation to solve the problem, which is mathematically intensive compared to an RL model, where complex constraints can be part of the environment learned by the agent without the need to solve the actual objective function.
The authors in (Anany, Elmesalawy, & El Din, 2019) proposed a two-scenario RAT selection considering long term evolution (LTE) and a wireless local area network (WLAN) using a matching game algorithm. While this approach produced some interesting results, the scope was very limited and does not reflect the current multi-tier network adopted in 5G. The study in (Ndashimye, Sarkar, & Ray, 2016) investigated a network selection mechanism for efficient handover. A fuzzy logic inference technique was used to enable seamless handover for vehicle-to-infrastructure networks. The approach, however, is limited in that it does not consider multi-tier networks or non-static and pedestrian users, who in most cases form the majority of users in a multi-tier network. The authors in (Perveen, Patwary, & Aneiba, 2019) presented a user admission control and slice allocation strategy in which user characteristics such as data rate, bandwidth, priority and revenue were considered to maximize user utility subject to resource limits. In this study, the authors did not consider a multi-tier environment, which is needed to reflect a realistic 5G environment. The authors in (Jiang, Condoluci, & Mahmoodi, 2016) proposed an inter- and intra-slice admission and resource allocation scheme and solved it using a heuristic approach. The study was based on both slice and user priority. Slices and users with high priority were admitted subject to resource constraints. This technique, however, did not consider mobile users in multi-tier environments.

III. METHODOLOGY
In this research the authors consider a multi-tier 5G network consisting of the following: a macrocell, a microcell, a picocell, a femtocell, a Wi-Fi cell, Bluetooth connectivity and device-to-device (D2D) connectivity. A user $u$ from the set of users, under a cell $c$ from the set of cells with downlink power $P_c$ and maximum cell data rate $d_c$, is assumed to be connected to any of the cells in Fig. 1.

Fig. 1. 5G multi-tier RATs
A user $u$ is assumed to access an application $g$ from the set of available applications, requiring a minimum data rate in any cell $c$. To capture mobility, each user is assumed to have a mobility status $m_u \in \{0, 1\}$, where 0 implies a static user and 1 implies a mobile user. For simplicity, the study assumes that a mobile user is one in a car travelling at a constant velocity, while a static user is one who is immobile or walking. Each cell is considered to have a radius $w_c$. The handover (HO) rate is given by eq. (1), the cost incurred by a cell provider when a user is given access is given by eq. (2), and the price paid by any user accessing a cell $c$ with a given downlink data rate is given by eq. (3). The cost will eventually be mapped to the reward function in the RL. To achieve the goal of this study, the aim is to associate a user with a cell which offers maximum power, maximum data rate and a minimum number of handovers. In this regard, eqs. (4)-(6) denote the optimum conditions for user-to-cell association. Any problem that can be formulated as a Markov decision process (MDP) can be solved using reinforcement learning (Sutton & Barto, 2017). To select the most suitable RAT for a user $u$ accessing an application $g$, the mathematical framework required in RL maps an action $a \in A$ (selecting a RAT) to the existing user state $s \in S$ (network access conditions and user demands). Assuming a user has the ability to connect to any of the RATs at any time considering the mobility status $m_u$, the reward $R$ obtained for an action $a$ (RAT selection) in state $s$ is given by

𝐦𝐚𝐱
where a constant is selected to regulate the value of the reward function, the action value is evaluated after observing any state, and ∆ is a small value ensuring that the denominator does not become zero. The action value depends on the action taken.

B. State Space
The novel state space employed in this study comprises a three-dimensional tensor whose entries are the cell radius, user mobility status, cell transmit power and user data rate at a time instant $t$, arranged in rows and columns. At any time instant $t$, which is equivalent to a particular unit of time in an episode, the agent observes the state space as a tensor of rows and columns. The index used in the Q-learning algorithm is equivalent to the row index in the tensor. In each iteration the agent runs through all the elements in the current row.
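A minimal sketch of such a finite state table, assuming each row holds the four quantities named in the text (cell radius, mobility status, transmit power, data rate). The value ranges are taken from the simulation parameter list; the uniform sampling, the function name `build_state_space` and the tuple layout are illustrative assumptions, not the paper's construction.

```python
import random

# Ranges from the simulation parameter list; uniform sampling is an assumption.
RADIUS_M = (1.0, 5000.0)    # minimum and maximum cell radius in metres
POWER_DBM = (33.0, 58.5)    # minimum and maximum transmit power in dBm
RATE_MBPS = (15.0, 1000.0)  # minimum and maximum data rate in Mbps


def build_state_space(n_states, seed=0):
    """Return n_states rows of (radius_m, mobility, power_dbm, rate_mbps)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_states):
        rows.append((
            rng.uniform(*RADIUS_M),   # cell radius
            rng.randint(0, 1),        # mobility status: 0 static, 1 mobile
            rng.uniform(*POWER_DBM),  # cell transmit power
            rng.uniform(*RATE_MBPS),  # user data rate
        ))
    return rows


states = build_state_space(5)
print(len(states), len(states[0]))
```

Each row index can then serve as the state index when the agent looks up or updates a table of action values.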

C. Action space
The action space constitutes the set of all actions the agent can take during the learning process. To reach optimality, the agent must learn to take an optimum action after each state observation. The action space is given by a tuple in which each element denotes a RAT selection. Given a specific state, the agent takes an action; an optimum action is one in which a user is associated with a cell with maximum power, maximum data rate and the lowest handover rate.
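As an illustration, the action space can be encoded as a tuple of the seven RAT tiers considered in this paper, with each action identified by its index. The integer encoding and the helper `random_association` are assumptions made for this sketch; the helper mimics the random association baseline by ignoring the observed state.

```python
import random

# The seven RAT tiers named in the paper; the integer encoding is an assumption.
RATS = ("macrocell", "microcell", "picocell", "femtocell",
        "wifi", "bluetooth", "d2d")


def random_association(rng):
    """Baseline: pick a RAT index uniformly at random, ignoring the state."""
    return rng.randrange(len(RATS))


rng = random.Random(0)
choice = random_association(rng)
print(choice, RATS[choice])
```

A learned policy replaces the random draw with a lookup of the best action for the observed state.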

D. Q-Learning
To achieve the objective of determining the optimum actions, the study employs the Q-learning method to evaluate the value of each action considering individual states. The state-action pair with the highest Q-value becomes the optimal solution. Each iteration to evaluate an action ends with an update to the Q-table, where each action is mapped to a corresponding state. Algorithm 1 (Q-learning) employs the classic Bellman equation given by eq. (10) (Arulkumaran et al., 2017; Bega, Costa-Perez, Gramaglia, Sciancalepore, & Banchs, 2019; Bega et al., 2020; Bega, Gramaglia, et al., 2019; Sun et al., 2018):

$$Q(s, a) = (1 - \alpha)\,Q(s, a) + \alpha \left\{ r_{t+1} + \gamma \max_{a'} Q(s', a') \right\} \qquad (10)$$

where $\alpha$ is the learning rate, $\gamma$ is a discount factor and $r_{t+1}$ is the long-term reward observed following time $t$. The term $Q(s, a)$ on the right-hand side represents the previous Q-value, and $\max_{a'} Q(s', a')$ is the maximum Q-value in the Q-table for the next state $s'$. During the learning process, the Q-table is updated by eq. (10) until all episodes are completed. In general, the maximum Q-values per row are obtained and mapped to the corresponding states and actions during the policy retrieval in Algorithm 2. Once this is completed, the end result is a policy table containing only states and their optimum actions.
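The tabular update can be sketched in a few lines: the old Q-value is blended with the observed reward plus the discounted maximum Q-value of the next state. This is the standard Q-learning rule rather than a reproduction of Algorithm 1; the function name `q_update`, the list-of-lists Q-table and the default values of `alpha` and `gamma` are assumptions for illustration.

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q_table[s_next])  # max over actions a' of Q(s', a')
    q_table[s][a] = (1 - alpha) * q_table[s][a] + alpha * (reward + gamma * best_next)
    return q_table[s][a]


# Three states and seven actions (one per RAT), all initialised to zero.
q = [[0.0] * 7 for _ in range(3)]
q[1][2] = 1.0  # pretend the next state already has a known good action
new_value = q_update(q, s=0, a=4, reward=2.0, s_next=1, alpha=0.5, gamma=0.5)
print(new_value)  # 0.5 * 0.0 + 0.5 * (2.0 + 0.5 * 1.0) = 1.25
```

Only the entry for the visited state-action pair changes; all other entries of the table are left untouched.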

A. Policy Retrieval
Algorithm 2 describes how the policy is retrieved.
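A minimal sketch of policy retrieval, assuming the Q-table is stored as one row per state and one column per action: the policy keeps, for every state, the index of the action with the largest Q-value. The function name `retrieve_policy` and the dictionary output are illustrative choices and not the paper's Algorithm 2; ties are resolved in favour of the lowest action index.

```python
def retrieve_policy(q_table):
    """Map each state index to the action index with the highest Q-value."""
    policy = {}
    for state, row in enumerate(q_table):
        # max over action indices, compared by their Q-value in this row
        policy[state] = max(range(len(row)), key=row.__getitem__)
    return policy


q = [
    [0.0, 1.5, 0.2],  # state 0: action 1 is best
    [3.0, 2.0, 1.0],  # state 1: action 0 is best
]
print(retrieve_policy(q))  # {0: 1, 1: 0}
```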

Parameter                  Value
Maximum cell radius        5 km
Minimum cell radius        1 m
Maximum data rate          1 Gbps
Minimum data rate          15 Mbps
Maximum transmit power     58.5 dBm
Minimum transmit power     33 dBm

In our simulation the study considered a seven-tier 5G cell network, each tier with a specific maximum cell power ranging from a minimum of 33 dBm in a D2D architecture to 58.5 dBm in a macrocell. The study also considers a minimum cell data rate of 15 Mbps in a D2D network to a maximum of 1 gigabit per second (Gbps) possible in a 5G microcell. The radii were chosen to range from 4 m to a maximum of 10 km in a macrocell. Each user may have a mobility status of mobile or static. After 5000 episodes of learning, the results of the simulations were as follows. In Fig. 2 the study presents the rewards per episode, where the agent was observed to converge to the optimal solution after 2500 episodes.
The evaluated policy is represented in Fig. 3. The study paired each user state with the selected RAT. Each red spot in the graph represents a pairing, while the remaining blue areas represent no pairing. In summary, it can be observed that the agent avoided pairing D2D and Bluetooth cells with any user due to their low initialized data rates, while pairing a maximum of 4 user states with a microcell, which was considered to have the highest data rate. In Fig. 4 the study compares the power allocated to any paired user with the cell radius. The proposed algorithm continually increases the cell power as the cell radius increases, which is the standard practice in cellular networks. The random power allocation evaluated in the study has lower efficiency; this can be observed from small cells being allocated high power.

In conclusion, the study has presented an RL-based RAT selection scheme for a specific use case. The study shows that the proposed technique has improved efficiency in associating a user with a RAT considering the required cell power and data rates. The study presents a novel RL environment and reward function for state-action evaluation. The results were compared with random association, and it is observed that the proposed mechanism outperforms the random mechanism. The consideration of a finite state space is, however, a limitation of this study, as state spaces may become continuous, leading to inefficient memory use under Q-learning, a problem known as the curse of dimensionality. It is therefore recommended that a deep Q-learning approach be considered for continuous and large state spaces. In this regard, function approximators such as neural networks should be employed for better performance.